Grocery Item Detection using TensorFlow Object Detection API

Standing in a checkout queue at a retail shop is tedious. Scanning every product one by one and then generating a bill takes a long time. Why waste that time when there is a better solution?


In this age of artificial intelligence, I came up with a solution that can reduce the time of checkout and billing by 50%. What if all the products a customer bought could be placed together and scanned in less than a minute? Sounds interesting, right? Let’s do it.

In this article, we will:

  • Perform object detection on custom images using the TensorFlow Object Detection API.
  • Use Google Colab’s free GPU for training and Google Drive to keep everything synced.
  • Walk through detailed steps to tune, train, monitor, and use the model for inference with your local webcam.

I have created this Colab Notebook if you would like to start exploring. It has all the code, step by step, for single-class object detection. I suggest looking at it after reading this tutorial.

Let’s get started!

  1. Collecting Images and Labeling them.
  2. Environment Setup.
  3. Installing Requirements.
  4. Preprocessing Images and Labels.
  5. Downloading Tensorflow model.
  6. Generating TFRecords.
  7. Selecting a Pre-trained model.
  8. Configuring the Training Pipeline.
  9. Tensorboard.
  10. Training.

I will be using pictures of soft drinks. The dataset contains 800 pictures of MUG root beer in various positions, rotations, and backgrounds. I used the LabelImg tool for the annotations. You may use your own images or the dataset I am using here!

If you have already collected your own images, great! If not, you can collect images from Google or take pictures with your mobile phone, depending on your problem.

3 things to take care of while collecting your own images:

  1. At least 50 images for each class. The more, the better! Get even more if you are detecting only one class.
  2. Images with random objects in the background.
  3. Various background conditions; dark, light, in/outdoor, etc.

Save your images in a folder named images.

Once you have gathered your images, it’s time to label them. Many tools can help with labeling; LabelImg is perhaps the most popular and the easiest to use. Follow the instructions in its GitHub repo to download and install it on your local machine.

Using LabelImg is easy, just remember to:

  1. Create a new directory for the labels; I will name it annotations.
  2. In LabelImg, click Change Save Dir and select the annotations folder. This is where the labels/annotations will be saved.
  3. Click Open Dir and select the images folder.

Annotations using LabelImg

Each image will have one .xml file that has its labels. If there is more than one class or one label in an image, that .xml file will include them all.
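To make the structure of these files concrete, here is a minimal sketch that reads one annotation and counts the objects per class. It assumes the Pascal VOC layout that LabelImg writes by default; the sample file name, class name, and box values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A hand-written annotation in the Pascal VOC layout (illustrative values only).
SAMPLE_XML = """
<annotation>
  <filename>img_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>mug_root_beer</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>300</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>mug_root_beer</name>
    <bndbox><xmin>350</xmin><ymin>90</ymin><xmax>520</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>
"""

def read_objects(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        objects.append((
            obj.find("name").text,
            int(box.find("xmin").text),
            int(box.find("ymin").text),
            int(box.find("xmax").text),
            int(box.find("ymax").text),
        ))
    return objects

objects = read_objects(SAMPLE_XML)
class_counts = Counter(name for name, *_ in objects)
print(objects)
print(class_counts)
```

Running the same count over every file in the annotations folder is a quick way to check that each class has enough labeled examples.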

Set up your Google Colab notebook.

  1. Create a new notebook.
  2. Change the runtime type and select GPU as the hardware accelerator.

Upload your dataset and annotations.

You will have to zip the images and annotations folders and upload them to your notebook.

Structure of directories:


Google Colab comes with most of the packages we need pre-installed: Python, TensorFlow, pandas, and so on.

The following packages are also needed but are not pre-installed by default. Install them by running:

!apt-get update
!apt-get install -qq protobuf-compiler python-pil python-lxml python-tk
!pip install -qq Cython contextlib2 pillow lxml matplotlib
!pip install -qq pycocotools

Importing Libraries:

%tensorflow_version 1.x
from __future__ import division, print_function, absolute_import
import pandas as pd
import numpy as np
import csv
import re
import cv2
import os
import glob
import xml.etree.ElementTree as ET
import io
import tensorflow.compat.v1 as tf
from PIL import Image
from collections import namedtuple, OrderedDict
import shutil
import urllib.request
import tarfile
from google.colab import files

Here, we need version 1.15.0 of TensorFlow to run the pre-trained ssd_mobilenet_v2 model.

Splitting the images into training & testing:

Depending on how large your dataset is, you might want to split your data manually. If you have a lot of pictures, you might want to use something like this to split your data randomly.

# creating a directory to store the training and testing data
!mkdir data
# folders for the training and testing data.
!mkdir data/images data/train_labels data/test_labels
# combining the images and annotation in the training folder:
# moves the images to data folder
!mv MBeer/* data/images
# moves the annotations to data folder
!mv annotations/* data/train_labels
# Moves 168 randomly chosen labels (roughly 20% of the labels) to the testing dir: `test_labels`
!ls data/train_labels/* | sort -R | head -168 | xargs -I{} mv {} data/test_labels
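
The shell one-liner above picks a different random subset on every run. If a repeatable split is preferred, the same idea can be sketched in Python with a fixed seed. The function name, the seed, and the generated file names below are illustrative assumptions; the real list would come from the files in data/train_labels.

```python
import random

def split_train_test(names, test_fraction=0.2, seed=42):
    """Shuffle a copy of `names` with a fixed seed and split it in two."""
    shuffled = sorted(names)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical annotation names standing in for the real label files.
label_files = ["img_{:03d}.xml".format(i) for i in range(840)]
train_files, test_files = split_train_test(label_files)
print(len(train_files), len(test_files))
```

Each name in the test list could then be moved to data/test_labels with shutil.move.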

Now we need to create two CSV files from the .xml files, one for training and one for testing. Each contains every image’s file name, the label, the box position, and so on. An image gets more than one row if it has more than one class or label.

We need one pbtxt file that will contain the label map for each class. This file will tell the model what each object is by defining a mapping of class names to class ID numbers.
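
To show what such a file looks like, here is a small sketch that builds the label map text for two classes. The helper name and the second class are hypothetical. Ids start at 1, following the Object Detection API convention that id 0 is reserved for the background.

```python
def build_label_map(class_names):
    """Return label-map text with one `item` block per class, ids starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append("item {{\n  id: {0}\n  name: '{1}'\n}}".format(class_id, name))
    return "\n\n".join(items)

# 'cola' is a made-up second class, added only to show how ids increase.
label_map_text = build_label_map(["mug_root_beer", "cola"])
print(label_map_text)
```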

Make sure that all the images are in .jpg format.
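
A quick way to check this is to list every file whose extension is not .jpg. The sketch below uses only the standard library; the helper name is an assumption, and the demo runs on a temporary folder with empty placeholder files rather than on the real data/images directory.

```python
import tempfile
from pathlib import Path

def non_jpg_files(image_dir):
    """Return the names of files in `image_dir` whose extension is not .jpg (case-insensitive)."""
    return sorted(
        p.name for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() != ".jpg"
    )

# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.jpg", "b.JPG", "c.png", "d.jpeg"]:
        (Path(tmp) / name).touch()
    leftovers = non_jpg_files(tmp)

print(leftovers)
```

Any file reported here would need to be converted or renamed before the CSV step.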

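#adjusted from:
def xml_to_csv(path):
    """Collect every labeled object under `path` into a DataFrame."""
    classes_names = []
    xml_list = []

    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        size = root.find('size')
        for member in root.findall('object'):
            box = member.find('bndbox')
            classes_names.append(member.find('name').text)
            # one row per labeled object, in the same order as column_name below;
            # the tag names are the Pascal VOC ones that LabelImg writes
            value = (root.find('filename').text + '.jpg',
                     int(size.find('width').text),
                     int(size.find('height').text),
                     member.find('name').text,
                     int(box.find('xmin').text),
                     int(box.find('ymin').text),
                     int(box.find('xmax').text),
                     int(box.find('ymax').text))
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    # sorted so that class ids stay the same from run to run
    classes_names = sorted(set(classes_names))
    return xml_df, classes_names

# one CSV per split
for label_path in ['train_labels', 'test_labels']:
    image_path = os.path.join(os.getcwd(), label_path)
    xml_df, classes = xml_to_csv(label_path)
    xml_df.to_csv(f'{label_path}.csv', index=None)
    print(f'Successfully converted {label_path} xml to csv.')

# label map: one item per class, ids starting at 1
label_map_path = os.path.join("label_map.pbtxt")
pbtxt_content = ""

for i, class_name in enumerate(classes):
    pbtxt_content = (
        pbtxt_content
        + "item {{\n id: {0}\n name: '{1}'\n}}\n\n".format(i + 1, class_name)
    )
pbtxt_content = pbtxt_content.strip()
with open(label_map_path, "w") as f:
    f.write(pbtxt_content)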

Working directory at this point:

The TensorFlow models repository contains the Object Detection API we are interested in. We will get it from the official repo.

# downloads the models
!git clone --q

Next, we need to compile the protocol buffers:

# compiles the protocol buffers
!protoc object_detection/protos/*.proto --python_out=.
# exports PYTHONPATH environment var with research and slim paths
os.environ['PYTHONPATH'] += ':./:./slim/'
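
The `+=` above raises a KeyError when PYTHONPATH is not already set, which can happen outside Colab. A more defensive variant might look like the sketch below; the helper name is an assumption, and it also skips entries that are already present.

```python
import os

def append_to_pythonpath(*paths):
    """Append `paths` to PYTHONPATH, whether or not the variable already exists."""
    current = os.environ.get("PYTHONPATH", "")
    parts = [p for p in current.split(os.pathsep) if p]
    for path in paths:
        if path not in parts:
            parts.append(path)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)
    return os.environ["PYTHONPATH"]

print(append_to_pythonpath("./", "./slim/"))
```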

Finally, run a quick test to confirm that the model builder is working properly:

# testing the model builder
!python3 object_detection/builders/

If it gives you an “OK” after executing, then everything is going great!

TensorFlow accepts the data as TFRecords (data.record). A TFRecord is a binary file format that is fast to read and light on memory. It contains all the images and labels in one file.

In our case, we will have two TFRecords; one for testing and another for training. To make this work, we need to make sure that:

  • The CSV file names match: train_labels.csv and test_labels.csv (or change them in the code below).
  • The current directory is object_detection/models/research.
  • Your custom object text is set in the function class_text_to_int below, in the row_label comparison (this is the text that will appear on the detected object). Add more labels if you have more than one object.
  • The path to the data/ directory is the same as DATA_BASE_PATH below.

from object_detection.utils import dataset_util
%cd /content/gun_detection/models/
DATA_BASE_PATH = '/content/gun_detection/data/'
image_dir = DATA_BASE_PATH + 'images/'

def class_text_to_int(row_label):
    # put your class name between the quotes; add one branch per extra class
    if row_label == '':
        return 1

def split(df, group):
    # one entry per image, holding all of its label rows
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]

def create_tf_example(group, path):
    # read the raw JPEG bytes (tf.io.gfile.GFile is assumed for the file-opening call)
    with tf.io.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size
    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []
    for index, row in group.object.iterrows():
        # box corners are stored as fractions of the image size
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

# csv_name avoids shadowing the imported csv module
for csv_name in ['train_labels', 'test_labels']:
    # tf.io.TFRecordWriter is assumed for the writer call
    writer = tf.io.TFRecordWriter(DATA_BASE_PATH + csv_name + '.record')
    path = os.path.join(image_dir)
    examples = pd.read_csv(DATA_BASE_PATH + csv_name + '.csv')
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())
    writer.close()
    output_path = os.path.join(os.getcwd(), DATA_BASE_PATH + csv_name + '.record')
    print('Successfully created the TFRecords: {}'.format(DATA_BASE_PATH + csv_name + '.record'))
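
The two data-handling ideas in the code above, grouping label rows by image and storing box corners as fractions of the image size, can be tried without TensorFlow. The sketch below uses plain dictionaries with made-up values. As a simplification it takes width and height from the CSV columns, whereas the code above reads them from the image itself.

```python
from collections import defaultdict

# Rows in the same shape as train_labels.csv (illustrative values only).
rows = [
    {"filename": "img_001.jpg", "width": 640, "height": 480, "class": "mug_root_beer",
     "xmin": 100, "ymin": 120, "xmax": 300, "ymax": 400},
    {"filename": "img_001.jpg", "width": 640, "height": 480, "class": "mug_root_beer",
     "xmin": 350, "ymin": 90, "xmax": 520, "ymax": 380},
    {"filename": "img_002.jpg", "width": 800, "height": 600, "class": "mug_root_beer",
     "xmin": 200, "ymin": 150, "xmax": 600, "ymax": 450},
]

def group_by_filename(rows):
    """Map each file name to the list of its label rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["filename"]].append(row)
    return dict(grouped)

def normalized_boxes(image_rows):
    """Convert pixel boxes to (xmin, ymin, xmax, ymax) fractions of the image size."""
    return [
        (r["xmin"] / r["width"], r["ymin"] / r["height"],
         r["xmax"] / r["width"], r["ymax"] / r["height"])
        for r in image_rows
    ]

grouped = group_by_filename(rows)
for filename, image_rows in grouped.items():
    print(filename, normalized_boxes(image_rows))
```

Every normalized value should fall between 0 and 1; anything outside that range usually points to a wrong width or height.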

A pre-trained model is simply one that has already been trained on another dataset. It has seen thousands or millions of images and objects.
COCO (Common Objects in Context) is a dataset of 330,000 images containing 1.5 million objects across 80 classes, such as dogs, cats, cars, and bananas.

I will be using the ssd_mobilenet_v2_coco model for my project. You can use any pre-trained model you prefer.

Let’s start with selecting a pretrained model:

# Some models to train on
MODELS_CONFIG = {
    'ssd_mobilenet_v2': {
        'model_name': 'ssd_mobilenet_v2_coco_2018_03_29',
    },
    'faster_rcnn_inception_v2': {
        'model_name': 'faster_rcnn_inception_v2_coco_2018_01_28',
    },
}

# Select a model from `MODELS_CONFIG`.
# I chose ssd_mobilenet_v2 for this project; you could choose any.
selected_model = 'ssd_mobilenet_v2'

Download the selected Pre-Trained Model:

%cd /content/gun_detection/models/research
# Name of the object detection model to use.
MODEL = MODELS_CONFIG[selected_model]['model_name']

#selecting the model
MODEL_FILE = MODEL + '.tar.gz'

#creating the download link for the model selected
# (DOWNLOAD_BASE must hold the base URL that the archive is fetched from)

#checks if the model has already been downloaded, download it otherwise
if not (os.path.exists(MODEL_FILE)):
    urllib.request.urlretrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)

#unzipping the model and extracting its content
tar = tarfile.open(MODEL_FILE)
tar.extractall()
tar.close()

# creating an output folder to save the model while training
# (folder name assumed from the fine_tune_checkpoint path used in the config)
DEST_DIR = 'pretrained_model'
if (os.path.exists(DEST_DIR)):
    shutil.rmtree(DEST_DIR)
os.rename(MODEL, DEST_DIR)
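
The extract-and-rename step can be tried in isolation, without downloading anything. The sketch below is one possible way to wrap it in a function: the helper name, the stand-in archive, and the demo_model folder are all hypothetical, and the destination folder name is an assumption taken from the checkpoint path used later in the config.

```python
import os
import shutil
import tarfile
import tempfile

def extract_model(archive_path, model_name, dest_dir):
    """Extract `archive_path` next to itself and move the `model_name` folder to `dest_dir`."""
    work_dir = os.path.dirname(os.path.abspath(archive_path))
    with tarfile.open(archive_path) as tar:
        tar.extractall(work_dir)
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)  # drop leftovers from an earlier run
    os.rename(os.path.join(work_dir, model_name), dest_dir)
    return dest_dir

# Demo with a tiny stand-in archive, so nothing is downloaded.
with tempfile.TemporaryDirectory() as tmp:
    model_name = "demo_model"  # hypothetical model folder name
    src = os.path.join(tmp, "src", model_name)
    os.makedirs(src)
    with open(os.path.join(src, "model.ckpt.index"), "w") as f:
        f.write("placeholder")
    archive = os.path.join(tmp, model_name + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=model_name)
    dest = extract_model(archive, model_name, os.path.join(tmp, "pretrained_model"))
    extracted_files = sorted(os.listdir(dest))

print(extracted_files)
```

Removing an existing destination first keeps a rerun of the cell from failing on the rename.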

While training, the model is saved automatically every 600 seconds by default. The logs and graphs, such as the mAP, loss, and AR, are also saved continuously. Create a folder for all of them to be saved in during training:

  • Create a folder called training inside object_detection/models/research/

The TensorFlow Object Detection API comes with many sample config files. For each model, there is a config file that is ‘almost’ ready to be used.

Required edits to the config file:

  1. model {} > ssd {}: change num_classes to the number of classes you have.
  2. train_config {}: change fine_tune_checkpoint to the checkpoint file path.
  3. train_input_reader {}: set the path to the train_labels.record and the label map pbtxt file.
  4. eval_input_reader {}: set the path to the test_labels.record and the label map pbtxt file.
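
The four edits can also be applied to a sample config programmatically instead of by hand. The sketch below does this with regular expressions on a trimmed stand-in config. The helper name and the stand-in text are assumptions, and it relies on the train reader's input_path appearing before the eval reader's, as it does in the sample configs.

```python
import re

# A trimmed stand-in for a sample config (not a complete pipeline file).
SAMPLE_CONFIG = '''
model {
  ssd {
    num_classes: 90
  }
}
train_config: {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/mscoco_val.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt"
}
'''

def patch_config(text, num_classes, checkpoint, train_record, test_record, label_map):
    """Apply the four required edits to a pipeline config given as text."""
    text = re.sub(r'num_classes: \d+',
                  lambda m: 'num_classes: {}'.format(num_classes), text)
    text = re.sub(r'fine_tune_checkpoint: ".*?"',
                  lambda m: 'fine_tune_checkpoint: "{}"'.format(checkpoint), text)
    text = re.sub(r'label_map_path: ".*?"',
                  lambda m: 'label_map_path: "{}"'.format(label_map), text)
    # The first input_path is the training record, the second the testing record.
    records = iter([train_record, test_record])
    text = re.sub(r'input_path: ".*?"',
                  lambda m: 'input_path: "{}"'.format(next(records)), text)
    return text

patched = patch_config(
    SAMPLE_CONFIG,
    num_classes=1,
    checkpoint="/content/gun_detection/models/research/pretrained_model/model.ckpt",
    train_record="/content/gun_detection/data/train_labels.record",
    test_record="/content/gun_detection/data/test_labels.record",
    label_map="/content/gun_detection/data/label_map.pbtxt",
)
print(patched)
```
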
%%writefile {model_pipline}
model {
  ssd {
    num_classes: 1 # number of classes to be detected
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    # all images will be resized to the below W x H.
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        #use_dropout: false
        use_dropout: true # to counter overfitting. you can also try tweaking its probability below
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              # weight: 0.00004
              weight: 0.001 # higher regularization to counter overfitting
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            # weight: 0.00004
            weight: 0.001 # higher regularization to counter overfitting
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.95
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6

        # adjust this to the max number of objects per class expected in one image.
        max_detections_per_class: 1
        # max number of detections among all classes. I have 1 class only, so it is the same value.
        max_total_detections: 1
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 16 # training batch size
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.003
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  # the path to the pretrained model.
  fine_tune_checkpoint: "/content/gun_detection/models/research/pretrained_model/model.ckpt"
  fine_tune_checkpoint_type: "detection"
  # Note: The below line limits the training process to 60K steps. Because
  # decay_steps is much larger than that, it effectively bypasses the learning
  # rate schedule (the learning rate will never decay). Remove the below line
  # to train indefinitely.
  num_steps: 60000
  # data augmentation is done here, you can remove or add more.
  # It helps the model generalize, but training time increases greatly with more data augmentation.
  # Check this link to add more image augmentation:

  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    # path to the training TFRecord
    input_path: "/content/gun_detection/data/train_labels.record"
  }
  # path to the label map
  label_map_path: "/content/gun_detection/data/label_map.pbtxt"
}

eval_config: {
  # the number of images in your testing data (168 were split off earlier, minus one that was removed)
  num_examples: 167
  # the number of images to display in Tensorboard while training
  num_visualizations: 20
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    # path to the testing TFRecord
    input_path: "/content/gun_detection/data/test_labels.record"
  }
  # path to the label map
  label_map_path: "/content/gun_detection/data/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
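
To see why the learning rate barely moves with these settings, the schedule can be worked out by hand. The sketch below assumes the usual exponential-decay formula, lr = initial_learning_rate * decay_factor ** (step / decay_steps), with an optional staircase mode that floors the exponent. Whether the API applies the staircase by default is an assumption here, so both variants are shown.

```python
def exponential_decay(step, initial_lr=0.003, decay_steps=800720,
                      decay_factor=0.95, staircase=True):
    """Learning rate after `step` steps under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        exponent = step // decay_steps  # whole decay periods only
    return initial_lr * decay_factor ** exponent

# num_steps in the config is 60000, far below decay_steps.
lr_staircase = exponential_decay(60000)
lr_smooth = exponential_decay(60000, staircase=False)
print(lr_staircase, lr_smooth)
```

With 60,000 training steps and 800,720 decay steps, the staircase variant never leaves the initial value and the smooth variant drops by well under one percent.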

Here you can visualize everything that’s happening during training. You can monitor the loss, mAP, AR and many more.

Visualization at Tensorboard

To use Tensorboard on Colab, we need to use it through ngrok. Get it by running:

!unzip -o

Next, we specify where the log files are stored and we configure a link to view Tensorboard:

#the logs that are created while training 
LOG_DIR = "training/"
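# start Tensorboard in the background
# (the --host value is assumed to be 0.0.0.0, i.e. listen on all interfaces)
get_ipython().system_raw(
    'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'
    .format(LOG_DIR))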
get_ipython().system_raw('./ngrok http 6006 &')
#The link to tensorboard.
#works after the training starts.
!curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

When you run the code above, at the end of the output there will be a url where you can access Tensorboard.
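
The curl one-liner fails with an IndexError if no tunnel is up yet. The same lookup can be written as a small function that handles that case. The sketch below parses a made-up response in the shape the one-liner expects (a tunnels list whose entries carry a public_url); the helper name and sample URL are illustrative.

```python
import json

def public_urls(tunnels_json):
    """Return the public URL of every tunnel listed in the response text."""
    payload = json.loads(tunnels_json)
    return [tunnel["public_url"] for tunnel in payload.get("tunnels", [])]

# Made-up response, shaped like the one read by the one-liner above.
SAMPLE_RESPONSE = (
    '{"tunnels": [{"name": "command_line", '
    '"public_url": "https://example.ngrok.io", "proto": "https"}]}'
)

urls = public_urls(SAMPLE_RESPONSE)
print(urls[0] if urls else "no tunnel yet; is ngrok running?")
```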

10. Finally… It’s Training!

It’s the simplest step if you have done everything above correctly 😉. We just need to give it the following 3 lines of code, then sit back and watch your model’s performance on Tensorboard.

!python3 /content/object_detection/models/research/object_detection/ \
--model_dir={model_dir} \
--alsologtostderr \

After training completes successfully, you need to export and download the trained model.

Executing the following lines of code exports your model so that you can then download it.

#the location where the exported model will be saved in.
output_directory = '/content/object_detection/models/research/fine_tuned_model'
lst = os.listdir(model_dir)
lst = [l for l in lst if 'model.ckpt-' in l and '.meta' in l]
steps = np.array([int(re.findall(r'\d+', l)[0]) for l in lst])
last_model = lst[steps.argmax()].replace('.meta', '')
last_model_path = os.path.join(model_dir, last_model)
#exports the model specified and inference graph
!python /content/object_detection/models/research/object_detection/ \
--input_type=image_tensor \
--pipeline_config_path={model_pipline} \
--output_directory={output_directory} \
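
The checkpoint selection above picks the .meta file with the highest step number. The same logic can be checked on a plain list of file names, without numpy or a real training folder. The function name and the sample names below are illustrative assumptions.

```python
import re

def latest_checkpoint(file_names):
    """Return the 'model.ckpt-<step>' prefix with the highest step, or None if there is none."""
    best_step, best_prefix = -1, None
    for name in file_names:
        match = re.fullmatch(r"(model\.ckpt-(\d+))\.meta", name)
        if match and int(match.group(2)) > best_step:
            best_step, best_prefix = int(match.group(2)), match.group(1)
    return best_prefix

# Hypothetical contents of the training folder.
files = [
    "checkpoint", "graph.pbtxt",
    "model.ckpt-0.meta", "model.ckpt-5230.meta",
    "model.ckpt-60000.meta", "model.ckpt-60000.index",
]
print(latest_checkpoint(files))
```

Returning None when no checkpoint exists makes it easy to stop early instead of exporting from a missing path.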

Results & Conclusion:

Successfully detected soda bottle

In this tutorial, I have tried to cover all the steps required to detect an object with the TensorFlow Object Detection API for a single class. Here I am able to detect one grocery item, MUG root beer soda. I look forward to increasing the number of classes and detecting as many objects as possible at a single time. I hope you find this article helpful.
