Mathematical Problems of Computer Science 45, 138–142, 2016.

Image Visual Similarity Based on High Level Features

of Convolutional Neural Networks

Aghasi S. Poghosyan, Hakob G. Sarukhanyan

Institute for Informatics and Automation Problems of NAS RA

e-mail: agasy18@gmail.com, hakop@ipia.sci.am

Abstract

Nowadays, the task of similar content retrieval is one of the central topics of interest
in both academia and industry. Numerous techniques exist that deal well with structured
data as well as with unstructured data such as text. In this paper we present a technique
for the retrieval of similar image content. We embed images into an N-dimensional feature
space using convolutional neural networks and then perform a nearest neighbor search.
Finally, several distance metrics and their influence on the outcome are discussed. Since
we are more interested in the proportion of related content retrieved than in its ranking,
the evaluation of results is based on precision and recall. We have selected 6 major
categories from the ImageNet dataset to assess the performance.

Keywords: Image retrieval, Convolutional neural networks, Distance metrics.

1. Introduction

Content-Based Image Retrieval (CBIR) [1] has a broad range of applications, from professional
database searching to search engines on the Internet. In these applications one uses a sample
image as a query item, and the CBIR system responds with a list of relevant items.
The resulting list usually relies on several underlying components. First, a method is needed to derive
knowledge about the contents of the image, such as interest point-based detection methods
(SIFT [2], SURF [3]). Secondly, the extracted features and a distance metric are used to measure
the similarity between the query image and the images in the database. Finally, a ranking
method can be applied in order to present the query result.
We perform the steps described above using the following procedure. In order to embed
an image into the N-dimensional feature space we use the feature vector extracted from
the pool5/7x7_s1 layer of the GoogleNet [5] convolutional neural network (CNN). GoogleNet is
trained on the ImageNet [4] dataset and is highly effective in object classification and
localization tasks. The pool5/7x7_s1 layer is the last layer of the CNN before the classification
layer; in other words, it contains the high level features of the image. We then compare
the performance of seven distance metrics for assessing the similarity.
We do not use any additional ranking method for sorting the result: results are sorted
according to the similarity between the query image and the images in the
database.
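As a minimal illustration of this procedure, the following Python sketch shows the retrieval step under the assumption that every database image has already been mapped to a 1024-dimensional feature vector; the function and variable names are illustrative and not taken from the paper, and Euclidean distance is used here only as a placeholder for the metrics compared later.

import numpy as np

def retrieve(query_vec, db_vecs, k=100):
    """Return the indices of the k database vectors closest to the query.

    query_vec : (1024,) feature vector of the query image
    db_vecs   : (n, 1024) matrix of database feature vectors
    """
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)  # distance to every database item
    return np.argsort(dists)[:k]                         # nearest items first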





The rest of the paper is organized as follows. In the second section we discuss the GoogleNet
feature layer we used. Next, we give an overview of the seven distance metrics
mentioned above. Afterwards, we present our evaluation results based on precision/recall
metrics. Finally, we discuss the advantages and disadvantages of this method and possible
further improvements.

2. High Level Feature Extraction

GoogleNet is a CNN that aims to localize and classify objects in images. It is trained
on the ImageNet dataset and shows a 6.6% top-5 error rate on the image classification task.
In GoogleNet the last layer before the classification layer is a 1024-dimensional vector
called pool5/7x7_s1. This layer extracts the highest level features of the image and gives
the most descriptive information about the objects within it. Thus, the features extracted
by pool5/7x7_s1 cover the majority of the information contained within the image. We structure
our content by propagating each image through GoogleNet and storing the pool5/7x7_s1 layer
output in a feature vector database (FVD). From now on, by database we mean the FVD.
For query images we follow the same procedure, except for the saving step: we
propagate the query image through GoogleNet to embed the extracted feature vector into the
same 1024-dimensional space.
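The paper does not state which framework was used to run GoogleNet; the sketch below assumes the publicly released BVLC GoogLeNet Caffe model, in which the final pooling layer is named pool5/7x7_s1. Input preprocessing (mean subtraction, channel reordering) is omitted for brevity.

import numpy as np
import caffe

# Assumed model files from the BVLC GoogLeNet release.
net = caffe.Net('deploy.prototxt', 'bvlc_googlenet.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)

def extract_feature(image_path):
    """Propagate one image through GoogleNet and return its 1024-dimensional feature vector."""
    img = caffe.io.load_image(image_path)                 # HxWx3 image with values in [0, 1]
    img = caffe.io.resize_image(img, (224, 224))          # GoogleNet input resolution
    net.blobs['data'].data[...] = img.transpose(2, 0, 1)  # to CxHxW; batch dimension broadcasts
    net.forward()
    return net.blobs['pool5/7x7_s1'].data[0].flatten()    # feature vector stored in the FVD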

3. Distance Metrics

The next important step for our task is to choose a distance metric that maximizes the
retrieval performance. Below we briefly define each metric under consideration. For a pair
of N-dimensional real-valued vectors $u$ and $v$ the distances are defined as follows:

\[
\sum_i |u_i - v_i|, \qquad \text{(Manhattan)} \quad (1)
\]

\[
\|u - v\|_2, \qquad \text{(Euclidean)} \quad (2)
\]

\[
1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}, \qquad \text{(Cosine)} \quad (3)
\]

where $u \cdot v$ is the dot product of $u$ and $v$,

\[
\frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|}, \qquad \text{(Bray-Curtis)} \quad (4)
\]

\[
\sum_i \frac{|u_i - v_i|}{|u_i| + |v_i|}, \qquad \text{(Canberra)} \quad (5)
\]

where, if both $u_i$ and $v_i$ are 0 for a given $i$, the fraction $0/0 = 0$ is used in the calculation,

\[
\max_i |u_i - v_i|, \qquad \text{(Chebyshev)} \quad (6)
\]

\[
1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \, \|v - \bar{v}\|_2}, \qquad \text{(Correlation)} \quad (7)
\]

where $\bar{u}$ and $\bar{v}$ denote the means of the elements of $u$ and $v$, respectively.
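All of the metrics above are available under SciPy's standard names ('cityblock' is the Manhattan distance); the following sketch, which is not the authors' code, shows how the nearest neighbor search can be run with an interchangeable metric.

import numpy as np
from scipy.spatial.distance import cdist

METRICS = ['cityblock', 'euclidean', 'cosine', 'braycurtis',
           'canberra', 'chebyshev', 'correlation']

def nearest_neighbors(query_vec, db_vecs, metric='correlation', k=100):
    """Indices of the k database vectors closest to the query under the given metric."""
    dists = cdist(query_vec[np.newaxis, :], db_vecs, metric=metric)[0]
    return np.argsort(dists)[:k]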




We use a subset of the ImageNet [4] dataset to assess the retrieval performance. We extract
6 major categories with a total size of 5500 images. In Fig. 1 the frequency distribution per
category is depicted. In order to increase the precision of our measurements we limit each
category to 440 images. In this way we ensure that the recall metric does not depend
on the sample size of the category and provides more reliable results.

Fig. 1. Number of images per category in the ImageNet subset.
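A simple way to obtain such a balanced subset is to cap every category at the same number of images, as in the hypothetical helper below; the category lists themselves would come from the ImageNet annotations.

import random

def balance_subset(images_by_category, cap=440, seed=0):
    """Limit every category to at most `cap` images so that recall does not
    depend on category size.  `images_by_category` maps a category name to
    the list of its image paths."""
    rng = random.Random(seed)
    balanced = {}
    for category, paths in images_by_category.items():
        paths = list(paths)
        rng.shuffle(paths)                 # pick a random subset of the category
        balanced[category] = paths[:cap]
    return balanced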

4. Evaluation

Precision and recall are common measures for evaluating the performance of information retrieval
tasks. Precision is defined as the fraction of retrieved items that are relevant, and recall
as the fraction of all relevant items that are retrieved. Specifically,

\[
\text{precision} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|}, \qquad (8)
\]

\[
\text{recall} = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|}. \qquad (9)
\]
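Eqs. (8) and (9) translate directly into code; in the sketch below the retrieved set is the list of 100 nearest neighbors of a query and the relevant set contains all database images of the query's category (names are illustrative, not from the paper).

def precision_recall(retrieved, relevant):
    """Precision and recall for a single query, following Eqs. (8) and (9)."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)       # |relevant items ∩ retrieved items|
    return hits / len(retrieved), hits / len(relevant)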

In our experiment we fixed the number of retrieved items (neighbors) at 100. The per-category
average precision has been computed as

\[
P'_q = \frac{\sum_{k \in A_q} P(i_k)}{|A_q|}, \qquad q = 1, 2, \ldots, N, \qquad (10)
\]




where $N$ is the number of categories, $A_q$ is the $q$-th category and $P(i_k)$ is the precision for the $k$-th image.
The per-category average recall has been computed as

\[
R'_q = \frac{\sum_{k \in A_q} R(i_k)}{|A_q|}, \qquad q = 1, 2, \ldots, N, \qquad (11)
\]

where $R(i_k)$ is the recall for the $k$-th image.
The overall average precision and average recall are then given by

\[
P' = \frac{1}{N} \sum_{q=1}^{N} P'_q, \qquad (12)
\]

and

\[
R' = \frac{1}{N} \sum_{q=1}^{N} R'_q. \qquad (13)
\]
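Assuming the per-image precision and recall values have already been computed, Eqs. (10)-(13) reduce to simple averaging; the helper below is a sketch with illustrative names.

import numpy as np

def average_scores(per_image_precision, per_image_recall, categories):
    """Per-category averages (Eqs. (10), (11)) and overall averages (Eqs. (12), (13)).

    per_image_precision, per_image_recall : dicts mapping image id -> score
    categories : dict mapping category name -> list of image ids in that category
    """
    cat_p = {c: np.mean([per_image_precision[i] for i in ids]) for c, ids in categories.items()}
    cat_r = {c: np.mean([per_image_recall[i] for i in ids]) for c, ids in categories.items()}
    return cat_p, cat_r, np.mean(list(cat_p.values())), np.mean(list(cat_r.values()))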

In order to evaluate our method we constructed a similarity matrix for the image
database. For each image in the database we retrieved 100 neighbors and calculated
precision and recall for that query. We then evaluated the results per distance metric and
averaged them across image categories (Tables 1 and 2); a sketch of this evaluation loop is given
after the tables. We can see that on average the correlation metric performs better than the
other metrics. However, there are image categories where correlation shows lower results
than some of the other metrics.

Table 1. Comparison of distance metrics: average precision per category.

Category       Manhattan  Cosine  Euclidean  Bray-Curtis  Canberra  Chebyshev  Correlation
Construction   0.862      0.928   0.852      0.924        0.912     0.635      0.932
Vertebrate     0.603      0.781   0.561      0.782        0.813     0.440      0.800
Device         0.814      0.835   0.768      0.836        0.856     0.429      0.845
Food           0.974      0.949   0.963      0.945        0.948     0.717      0.956
Invertebrate   0.699      0.851   0.654      0.839        0.838     0.364      0.865
Tree           0.953      0.944   0.960      0.944        0.917     0.795      0.940

Mean           0.818      0.881   0.793      0.878        0.881     0.563      0.890

Table 2. Comparison of distance metrics: average recall per category.

Category       Manhattan  Cosine  Euclidean  Bray-Curtis  Canberra  Chebyshev  Correlation
Construction   0.196      0.211   0.193      0.210        0.207     0.144      0.212
Vertebrate     0.137      0.177   0.127      0.177        0.184     0.100      0.181
Device         0.185      0.189   0.174      0.190        0.194     0.097      0.192
Food           0.221      0.215   0.219      0.214        0.215     0.163      0.217
Invertebrate   0.159      0.193   0.148      0.190        0.190     0.082      0.196
Tree           0.216      0.214   0.218      0.214        0.208     0.180      0.213

Mean           0.185      0.200   0.180      0.199        0.200     0.128      0.202
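The sketch below outlines the evaluation loop described above for a single distance metric, assuming the feature vectors and category labels are held in NumPy arrays; it is an illustration, not the authors' implementation.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def per_image_scores(features, labels, metric='correlation', k=100):
    """Precision and recall of the k nearest neighbors of every database image.

    features : (n, 1024) array of feature vectors
    labels   : length-n NumPy array of category labels
    """
    dist = squareform(pdist(features, metric=metric))  # pairwise distance matrix
    np.fill_diagonal(dist, np.inf)                     # never retrieve the query image itself
    precisions, recalls = [], []
    for i in range(len(labels)):
        neighbors = np.argsort(dist[i])[:k]
        hits = np.sum(labels[neighbors] == labels[i])
        precisions.append(hits / k)
        recalls.append(hits / (np.sum(labels == labels[i]) - 1))
    return np.array(precisions), np.array(recalls)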

5. Conclusion

This paper investigated a CBIR technique based on high level CNN features and evaluated the
proposed method with different similarity measures. We have shown that using high level features
from GoogleNet in combination with the correlation distance metric leads to promising results.




References

[1] R. Datta, J. Li and J. Z. Wang, "Content-based image retrieval - approaches and trends
of the new age", The Pennsylvania State University, University Park, PA 16802, 2005.

[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International
Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[3] R. Funayama, H. Yanagihara, L. Van Gool, T. Tuytelaars and H. Bay, "Robust Interest
Point Detector and Descriptor", US Patent Office, no. 8165401, 2009.

[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A Large-Scale
Hierarchical Image Database", IEEE Computer Vision and Pattern Recognition, 2009.

[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke
and A. Rabinovich, "Going Deeper with Convolutions", arXiv, no. 1409.4842, 2014.

Submitted 15.09.2015, accepted 25.01.2016

Image Visual Similarity Based on High Level Features of Convolutional Neural Networks

A. Poghosyan, H. Sarukhanyan

Summary (translated from the Armenian)

The paper presents an analysis of a system for retrieving visually similar images based on
the high level features of convolutional neural networks. It is shown that the choice of the
distance function for GoogleNet's high level feature vectors has a large impact on the
accuracy of the retrieval system, and that the best results are obtained when correlation
is used as the distance function.

Visual Similarity of Images Based on High Level Features of Convolutional Neural Networks

A. Poghosyan, H. Sarukhanyan

Abstract (translated from the Russian)

This paper presents a study of a system for retrieving visually similar images based on
the high level features of convolutional neural networks. It is shown that the choice of
the distance function for GoogleNet's high level features has a large influence on the
accuracy of the retrieval system. Among the functions considered, the correlation function
stands out as producing the best results.

