

Engineering, Technology & Applied Science Research Vol. 12, No. 4, 2022, 8803-8808 8803 
 

www.etasr.com Bhalekar & Bedekar: The New Dataset MITWPU-1K for Object Recognition and Image Captioning Tasks 

 
The New Dataset MITWPU-1K for Object Recognition and Image Captioning Tasks
 

Madhuri Bhalekar 

School of Computer Engineering and Technology  
Dr. Vishwanath Karad MIT World Peace University 

Pune, India 

madhuri.bhalekar@mitwpu.edu.in 

Mangesh Bedekar 

School of Computer Engineering and Technology  
Dr. Vishwanath Karad MIT World Peace University 

Pune, India 

mangesh.bedekar@mitwpu.edu.in 

Received: 5 May 2022 | Revised: 19 May 2022 | Accepted: 20 May 2022 

 
Abstract-In the domain of image captioning, many datasets are publicly available. Using these datasets, models can be trained to automatically generate descriptions of the contents of an image. Researchers usually do not spend much time creating and preparing a new dataset for a specific application; instead, they simply use existing datasets. MS COCO, Flickr, and Pascal VOC are well-known datasets that are widely used for generating image captions. Most available image captioning datasets lack the textual information present in the image, which can play a vital role in generating more precise image descriptions. This paper presents the process of creating a new dataset that consists of images along with text and captions. Images of the campus of MIT World Peace University (MITWPU), India, and its nearby vicinity were taken for the new dataset, named MITWPU-1K. This dataset can be used for object detection and caption generation. The objective of this paper is to highlight the steps required for creating a new dataset, which necessitated a review of existing datasets. A sequential convolutional model for detecting objects in the new dataset is also presented. The process of creating a new image captioning dataset and the insights gained are described.

Keywords-convolutional model; dataset; image captioning; image labelling; object detection

I. INTRODUCTION  

Many image datasets (e.g. MS COCO, Flickr, Pascal VOC) are available and can be used in image captioning systems. We wanted to explore the background work and preprocessing required to create such a dataset. For this purpose, we started to create a new dataset, called MITWPU-1K, from images of the campus of our university, MIT World Peace University. Currently, the dataset consists of around 1500 images which are object-labeled and captioned manually. The labels are the classes of the objects present in each image. A Convolutional Neural Network (CNN) model is trained on the dataset to detect the objects present in an image. The validation applied during image selection is also presented. Currently, the MITWPU-1K dataset has 4500 image descriptions, i.e. three annotations per image on average. The dataset covers a total of 68 object classes and we are still working on its expansion. Before creating the new dataset, we studied some of the prominent existing datasets used in image captioning.

II. RELATED WORKS 

Many architectures have been proposed for generating captions for a given image. The reviewed research papers belong to the domains of machine learning and neural networks. Most of the datasets used in object recognition were developed with the tasks of object classification, detection, and labeling in mind. In the image captioning domain, the MS-COCO [1] dataset is considered a benchmark. It advances object recognition by placing it in the context of scene understanding, and its annotation pipeline consists of category labeling, instance spotting, and instance segmentation. The MS-COCO dataset has 91 common object types with 2,500,000 labeled instances in 328,000 images. The training set consists of 82,783 images, the validation set of 40,504 images, and the testing set of 81,434 images. The Pascal VOC [2] dataset consists of 11,000 images divided into 20 object classes, with 2,501 images in the training set, 2,510 images in the validation set, and 4,952 images in the testing set. ImageNet [3] contains 14,197,122 images and almost 21,000 object classes, and is built upon the hierarchical structure provided by WordNet. The CIFAR-10 [4] dataset contains 10 categories with 60,000 tiny images of size 32×32 and is used for object classification. The authors explain how they trained a two-layer convolutional Deep Belief Network (DBN) on a dataset of 1.6 million tiny images.

While analyzing these datasets, we came across the YOLO architecture [5] and the Caffe framework [6]. YOLO is used for object detection: classification is done using CNNs, along with localization using regression. Caffe, a convolutional architecture for fast feature embedding, is used as the framework for Region-Based CNNs (RCNNs) and has already been employed in several academic research projects. Along with these papers, we also found some recent review papers on deep learning methods [7, 8], which describe the metrics used for object detection along with the datasets used. A detailed review of many object detection survey papers is given in [7]. The authors observed that deep learning methods provide a prominent approach to object detection. They also highlight future work in visual object detection, such as multi-domain object detection, salient object detection, and unsupervised object detection using deep learning. In [8], the authors categorize the existing image captioning systems and the datasets commonly used for image captioning. They provide detailed statistics for the following datasets: MS COCO, Flickr30K, Flickr8K, Visual Genome, IAPR TC-12, Stock3M, and MIT-Adobe FiveK. Table I presents the details of some of the existing datasets most commonly used in image captioning.

Corresponding author: Madhuri Bhalekar

TABLE I. SUMMARY OF PROMINENT EXISTING DATASETS USED IN IMAGE CAPTIONING

Dataset          Images      Objects / classes  Training set  Validation set  Testing set
MS-COCO [1]      328,000     80 categories      82,783        40,504          81,434
Pascal VOC [2]   11,000      20                 2,501         2,510           4,952
ImageNet [3]     14,197,122  21,000 classes     1.2 million   150,000         -
CIFAR-10 [4]     60,000      10                 -             -               -

 
III. METHODOLOGY USED IN DATASET CREATION 

After analyzing the existing datasets, we started creating our new dataset, which consists of images of the MIT World Peace University campus and its nearby vicinity. The dataset provides labeled objects and captions for each image, so that it can be further used in object recognition and image captioning. The flow of the dataset creation is shown in Figure 1 and can be summarized as:

• Image collection and validation 

• Interface to provide image description/annotations 

• Labeling objects 

These steps are explained in the following sections. 

A. Image Collection 

For creating the new dataset, we collected diverse images. The main objective of collecting diverse images was to allow models to be trained properly on the dataset, avoiding overfitting or underfitting during training. We therefore collected images with different brightness levels, different foregrounds and backgrounds, multiple objects, etc. The images cover different activities and events carried out on the MITWPU campus. While building the dataset, we also included images of posters and banners of various campus events, which contain text. The main goal was to use the textual data in the image to provide more detailed image descriptions during the caption generation task. The images were collected with smartphones and a professional camera, so they differ in resolution, size, and orientation. After collecting the images, we found that not all of them could be included in the new dataset, so the collected images were screened as described below.

 
Fig. 1.  System flow of dataset creation. 

B. Validation 

The images were collected from different sources (a professional camera, smartphones, etc.). Many of them had low resolution or poor brightness, whereas some were redundant, i.e. the same view was captured multiple times. While preparing the dataset, we observed that the images should provide a good representation of the classes, which further helps the classification process. To achieve this, the unsuitable images were removed by manual filtering. After validation, the dataset currently contains 1500 images. Further, to speed up processing, the images were resized. Some sample images are shown in Figure 2.

We also included images containing text, like banners and posters, as shown in Figure 12, so that in the future an image captioning system can be developed that generates the description of an image along with text extraction.
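The screening in this work was performed manually. As an illustration only, the sketch below shows how the three rejection criteria named above (low resolution, poor brightness, redundancy) and the resizing step could be pre-screened automatically. The thresholds, the function names, and the representation of an image as rows of gray levels are assumptions made for this example and are not part of the MITWPU-1K pipeline. Exact-hash matching only catches identical repeats, so near-duplicate views would still need manual review.

```python
import hashlib

# Hypothetical thresholds; the paper does not state numeric criteria.
MIN_SIDE = 256          # reject images whose shorter side is below this
MIN_BRIGHTNESS = 40.0   # reject images whose mean gray level (0-255) is below this


def mean_brightness(gray_pixels):
    """Mean gray level of an image given as rows of 0-255 integers."""
    total = sum(sum(row) for row in gray_pixels)
    count = sum(len(row) for row in gray_pixels)
    return total / count


def screen_images(images, min_side=MIN_SIDE, min_brightness=MIN_BRIGHTNESS):
    """Return (kept_names, rejected) for a list of (name, gray_pixels) pairs.

    `rejected` maps an image name to the reason it was dropped.
    """
    kept, rejected, seen = [], {}, set()
    for name, pixels in images:
        height, width = len(pixels), len(pixels[0])
        # Hash the shape together with the pixel values to spot exact repeats.
        payload = f"{width}x{height}:".encode() + bytes(v for row in pixels for v in row)
        digest = hashlib.sha256(payload).hexdigest()
        if min(height, width) < min_side:
            rejected[name] = "low resolution"
        elif mean_brightness(pixels) < min_brightness:
            rejected[name] = "poor brightness"
        elif digest in seen:
            rejected[name] = "duplicate"
        else:
            seen.add(digest)
            kept.append(name)
    return kept, rejected


def resized_shape(width, height, max_side=512):
    """Target (width, height) keeping the aspect ratio, longer side <= max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A pre-screen of this kind would only shortlist candidates for removal; the final decision on whether an image represents its classes well remains a manual one.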

C. Interface to Provide Image Description  

Since the dataset is intended for use in image captioning systems, we added descriptions for each image. To accomplish this task, a user interface was created in Python, as shown in Figure 3. Using this interface, we manually provided captions for every image. The captions were stored in CSV and JSON files and can be further used to train image captioning models on our dataset (Figure 4). While writing the captions, we took care to provide different captions for the same image, considering all its visual elements. For this, we provided a minimum of three captions for each image. Finally, along with each new image, an image description file containing the image annotations was provided.

 
Fig. 2.  Sample images from the dataset.

 
Fig. 3.  Dataset creation module. 

 
Fig. 4.  Interface providing image description. 

D. Image Description Format 

The image identification number is used to keep track of the descriptions or annotations associated with an image. Each image file was recorded as {"file_name": "IMG_100.jpg", "id": 100}, and the annotations for each image were stored, in either CSV or JSON file format, in the format shown below:

• {"image_id": 100, "id": 1, "caption": "Description 1"}

• {"image_id": 100, "id": 2, "caption": "Description 2"}

• {"image_id": 100, "id": 3, "caption": "Description 3"}

This way, two folders were maintained in the new dataset: one containing all the images and another containing the manually assigned annotations for each image.
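The record format above can be produced with the Python standard library. The following is a minimal sketch, not the authors' annotation tool: the helper names and the top-level keys "images" and "annotations" are assumptions (the latter borrowed from the MS COCO caption files), since the paper specifies only the individual records.

```python
import csv
import io
import json


def build_records(image_id, file_name, captions, first_annotation_id=1):
    """Return (image_record, annotation_records) in the format described above."""
    image = {"file_name": file_name, "id": image_id}
    annotations = [
        {"image_id": image_id, "id": first_annotation_id + i, "caption": text}
        for i, text in enumerate(captions)
    ]
    return image, annotations


def to_json(image, annotations):
    """Serialize one image and its captions as a JSON document.

    The keys "images" and "annotations" are an assumed layout.
    """
    return json.dumps({"images": [image], "annotations": annotations}, indent=2)


def to_csv(annotations):
    """Serialize caption records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["image_id", "id", "caption"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(annotations)
    return buffer.getvalue()


image, annotations = build_records(
    100, "IMG_100.jpg", ["Description 1", "Description 2", "Description 3"]
)
print(to_json(image, annotations))
print(to_csv(annotations))
```

Keeping the image record and the caption records linked only through the image identifier means that captions can be added or corrected without touching the image folder.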

For creating the MS COCO dataset [1], large-scale crowdsourcing was involved in the annotation task, and the quality of the annotations was measured using precision and recall. We measured the quality of the annotations of our dataset by applying the same concept. The annotations were analyzed by a group of people including the authors of the paper. In some cases, we observed low precision and recall values, so we searched for patterns of false positives and false negatives and repeated the annotation correction task. For validation purposes, after assigning descriptions to all the images, we manually checked the descriptions for redundancy and verified them. Presently, the new dataset contains 1500 images with 4500 descriptions/annotations. Some sample images with their assigned captions are shown in Figure 5.
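For reference, precision and recall follow from the counts of true positives, false positives, and false negatives. The sketch below assumes that an annotator's set of labels for an image is compared against a reference set agreed on by the reviewers; the paper does not state exactly which items were compared, so the function and the example labels are illustrative only.

```python
def precision_recall(annotated, reference):
    """Precision and recall of an annotator's label set against a reference set."""
    annotated, reference = set(annotated), set(reference)
    true_pos = len(annotated & reference)
    false_pos = len(annotated - reference)   # labels given but absent from the reference
    false_neg = len(reference - annotated)   # reference labels that were missed
    precision = true_pos / (true_pos + false_pos) if annotated else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    return precision, recall


# Two of three given labels are correct, and two of four reference labels are found.
p, r = precision_recall(
    ["person", "chair", "car"], ["person", "chair", "table", "door"]
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

Low precision points to false positive patterns (labels that should not have been assigned), while low recall points to false negative patterns (objects that were overlooked).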

 
Fig. 5.  Sample images with assigned captions from the created dataset. 

E. Object Labelling 

After acquiring the images and their associated descriptions, the next task was object detection, but before that, we labeled the objects in the images as shown in Figure 6. When the same object appears multiple times in one image, only one label is considered. Before using the new dataset, we performed object detection on it. Identifying the objects in an image becomes complex when the image contains multiple objects. For this, we used the TensorFlow object detector [9] and followed the review process used during the creation of the ImageNet dataset [3] for identifying the objects present in an image (Figure 7). The process involves visually checking the objects assigned to the images in order to verify the classification accuracy on the new dataset.

 
Fig. 6.  Labelling the objects of an image. 

 
Fig. 7.  Review method of identifying objects present in the image. 

After thoroughly performing the review, a complete list of the classes in the new dataset was obtained. At present, our dataset contains 68 different object classes, such as person, chair, room, garden, plant, car, road, door, window, building, globe, printer, pen, notebook, bag, cupboard, book, bench, table, staircase, etc.
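One common way to turn such per-image labels into a training target is a multi-hot vector with one slot per class. The sketch below is an assumption about the encoding, not a description of the authors' code; it uses only the first eight class names listed above, whereas the full dataset has 68 classes.

```python
# Illustrative subset of the class list; the full dataset has 68 classes.
CLASSES = ["person", "chair", "room", "garden", "plant", "car", "road", "door"]
CLASS_INDEX = {name: i for i, name in enumerate(CLASSES)}


def multi_hot(labels, class_index=CLASS_INDEX):
    """Encode the objects of one image as a 0/1 vector, one slot per class.

    Repeated labels collapse into a single 1, mirroring the rule that
    multiple appearances of the same object yield only one label.
    """
    vector = [0] * len(class_index)
    for label in labels:
        vector[class_index[label]] = 1
    return vector


print(multi_hot(["person", "chair", "person", "door"]))
# [1, 1, 0, 0, 0, 0, 0, 1]
```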

IV. BUILDING AND TRAINING THE CLASSIFICATION MODEL  

Many pre-trained architectures are available for image classification, trained on very large image datasets such as MS COCO [1] and ImageNet [3]. Instead of using these pre-trained architectures, we built a simple convolutional classifier model which was trained on the new dataset. The main reason for using a different convolutional classifier was to avoid overfitting, because our dataset is currently limited in size. We created a sequential convolutional model, which gives the flexibility to keep adding layers to the model as required. The basic steps we followed while creating the classifier model trained on the new dataset are:

• Build the model 

• Compile the model  

• Train the model with the new dataset 

• Validate the model 

• Measure the model's performance in terms of accuracy 

The classifier model summary is shown in Figure 8. As already mentioned, in order to detect the 68 different object classes present in our dataset, the final dense layer has 68 units. The model is trained to identify more than one object in an image. We started with larger objects and then proceeded to smaller ones. Some samples of object detection are shown in Figure 9, where our trained classification model detected objects like monitor, mouse, and keyboard in the given image.
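To make the notion of a model summary concrete, the sketch below walks a small sequential stack of convolution, pooling, and dense layers and computes the output shape and parameter count of each layer. Only the 68-unit output layer is taken from the paper; the input size, the number of layers, the filter counts, and the hidden layer width are hypothetical and do not reproduce the configuration in Figure 8.

```python
NUM_CLASSES = 68  # from the paper; every other number below is hypothetical


def conv2d(shape, filters, kernel=3):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    h, w, c = shape
    params = (kernel * kernel * c + 1) * filters  # weights plus one bias per filter
    return (h - kernel + 1, w - kernel + 1, filters), params


def max_pool(shape, size=2):
    """Output shape of non-overlapping max pooling (no trainable parameters)."""
    h, w, c = shape
    return (h // size, w // size, c), 0


def dense(units_in, units_out):
    """Parameter count of a fully connected layer (weights plus biases)."""
    return (units_in + 1) * units_out


def summarize(input_shape=(64, 64, 3)):
    """Return (layer name, output shape, parameter count) rows for the stack."""
    rows, shape = [], input_shape
    for name, layer in [
        ("conv2d_1", lambda s: conv2d(s, 32)),
        ("max_pool_1", max_pool),
        ("conv2d_2", lambda s: conv2d(s, 64)),
        ("max_pool_2", max_pool),
    ]:
        shape, params = layer(shape)
        rows.append((name, shape, params))
    flat = shape[0] * shape[1] * shape[2]
    rows.append(("flatten", (flat,), 0))
    rows.append(("dense_hidden", (128,), dense(flat, 128)))
    rows.append(("dense_output", (NUM_CLASSES,), dense(128, NUM_CLASSES)))
    return rows


for name, shape, params in summarize():
    print(f"{name:<14}{str(shape):<16}{params:>10,}")
```

In this hypothetical stack almost all parameters sit in the first dense layer, which is one reason a small model of this kind can be preferable to a large pre-trained one when only about 1500 images are available.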

 
Fig. 8.  Model summary of the sequential classifier.  

 
Fig. 9.  Results of multiple detected objects. 

A. Results 

The new dataset was created for image captioning tasks, but it can also be used in object detection. To perform object detection, we labeled the images, and these labels were used during the training phase. Instead of using a pre-trained classifier model, we presented the sequential classifier model. To check and compare the accuracy of the presented sequential model, we also applied the review method of [3]. Object detection with the sequential classifier model provided 84% accuracy. A comparison of the accuracy of the review method and the sequential classifier model is plotted in Figure 10.

 
Fig. 10.  Object detection accuracy comparison. 

TABLE II. MITWPU-1K DATASET SUMMARY

Dataset      Images  Objects  Image descriptions/annotations
MITWPU-1K    1500    68       4500
 

Fig. 11.  Sample images of identified objects in the MITWPU-1K dataset. 

 
Fig. 12.  Sample images containing poster/banner. 

Our dataset contains 1500 images with 68 different object classes and 4500 image descriptions (Table II). These objects are commonly found on any college or university campus. Some samples of the identified objects in the created dataset are shown in Figure 11. The dataset also contains images with text in the form of a banner or poster, as shown in Figure 12. Using this dataset, we proposed a new deep learning model [10] for performing image captioning and text extraction. The obtained image captioning results, which include the textual information present in the image, were satisfactory, with an accuracy of up to 83%, comparable to state-of-the-art methods.

The new dataset MITWPU-1K can be used in a wide range 
of applications such as object detection, caption generation, 
text detection, etc. Regarding future work, the size of the 
dataset will be increased by adding some domain-specific 
images along with captions. 

V. CONCLUSIONS  

This paper presented the process followed while creating a new image dataset for image caption generation. The main purpose was to explore the processing behind creating a dataset that can be used in generating image captions. Before and during the creation of the new dataset, existing datasets were studied in detail. At present, the new dataset consists of 1500 images and is named MITWPU-1K, in line with the current naming convention for datasets. The dataset includes 4500 annotations, as a minimum of three annotations was assigned to each image. The dataset has a subset of images that include text, so along with image description, text extraction can be performed, which further extends its application in image captioning and similar tasks.

Using a sequential convolutional model, we performed object detection on the MITWPU-1K dataset. The dataset creation process can be extended to include images of a particular domain such as sports, night vision images, animals, automobiles, computers, etc. The datasets currently available for image captioning are mostly generic. The accuracy of image captioning systems can be improved by creating or using domain-specific image datasets.

ACKNOWLEDGMENT 

The authors would like to thank the authorities of Dr. 
Vishwanath Karad MIT World Peace University for their 
support. 

REFERENCES 

[1] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," in Computer Vision – ECCV 2014, 2014, pp. 740–755, https://doi.org/10.1007/978-3-319-10602-1_48.

[2] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes Challenge: A Retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015, https://doi.org/10.1007/s11263-014-0733-5.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, Jun. 2009, pp. 248–255, https://doi.org/10.1109/CVPR.2009.5206848.

[4] R. Doon, T. Kumar Rawat, and S. Gautam, "Cifar-10 Classification using Deep Convolutional Neural Network," in 2018 IEEE Punecon, Pune, India, Aug. 2018, https://doi.org/10.1109/PUNECON.2018.8745428.


 
[5] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 6517–6525, https://doi.org/10.1109/CVPR.2017.690.

[6] Y. Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," in MM ’14: Proceedings of the 22nd ACM international conference on Multimedia, New York, NY, USA, Aug. 2014, pp. 675–678, https://doi.org/10.1145/2647868.2654889.

[7] V. Sharma and R. N. Mir, "A comprehensive and systematic look up into deep learning based object detection techniques: A review," Computer Science Review, vol. 38, Nov. 2020, Art. no. 100301, https://doi.org/10.1016/j.cosrev.2020.100301.

[8] M. D. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A Comprehensive Survey of Deep Learning for Image Captioning," ACM Computing Surveys, vol. 51, no. 6, pp. 118:1-118:36, Oct. 2019, https://doi.org/10.1145/3295748.

[9] G. Tanner, "Creating your own object detector," Towards Data Science, Feb. 06, 2019. https://towardsdatascience.com/creating-your-own-object-detector-ad69dda69c85.

[10] M. Bhalekar and M. Bedekar, "D-CNN: A New model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals," Engineering, Technology & Applied Science Research, vol. 12, no. 2, pp. 8366–8373, Apr. 2022, https://doi.org/10.48084/etasr.4772.

[11] B. Ahmed, G. Ali, A. Hussain, A. Baseer, and J. Ahmed, "Analysis of Text Feature Extractors using Deep Learning on Fake News," Engineering, Technology & Applied Science Research, vol. 11, no. 2, pp. 7001–7005, Apr. 2021, https://doi.org/10.48084/etasr.4069.

[12] S. Nuanmeesri, "A Hybrid Deep Learning and Optimized Machine Learning Approach for Rose Leaf Disease Classification," Engineering, Technology & Applied Science Research, vol. 11, no. 5, pp. 7678–7683, Oct. 2021, https://doi.org/10.48084/etasr.4455.