Mathematical Problems of Computer Science 45, 138–142, 2016.

Image Visual Similarity Based on High Level Features of Convolutional Neural Networks

Aghasi S. Poghosyan, Hakob G. Sarukhanyan
Institute for Informatics and Automation Problems of NAS RA
e-mail: agasy18@gmail.com, hakop@ipia.sci.am

Abstract

Nowadays, the task of similar content retrieval is one of the central topics of interest in the academic and industrial worlds. Numerous techniques deal well with structured data, as well as with unstructured data such as text. In this paper we present a technique for the retrieval of similar image content. We embed images into an N-dimensional feature space using convolutional neural networks and then perform a nearest neighbor search. Finally, several distance metrics and their influence on the outcome are discussed. We are more interested in the proportion of related content than in any additional ranking; thus, the evaluation of results is based on precision and recall. We have selected 6 major categories from the ImageNet dataset to assess the performance.

Keywords: Image retrieval, Convolutional neural networks, Distance metrics.

1. Introduction

Content-Based Image Retrieval (CBIR) [1] has broad applications, from professional database search to Internet search engines. In these applications a sample image is used as a query item, and the CBIR system responds with a list of relevant items. Usually, the resulting list relies on several underlying factors. First, a method is used to derive knowledge about the contents of the image, such as interest-point-based detection methods (SIFT [2], SURF [3]). Secondly, the extracted features and a distance metric are used to measure the similarity between the query image and the images in the database. Finally, a ranking method can be applied to present the query result.

We perform the steps described above using the following procedure. In order to embed an image into the N-dimensional feature space, we use the feature vector extracted from the pool5/7x7_s1 layer of the GoogleNet [5] convolutional neural network (CNN). GoogleNet is trained on the ImageNet [4] dataset and is highly effective in object classification and localization tasks. The pool5/7x7_s1 layer is the last layer in the CNN on which the classification task is performed; in other words, it consists of high level features of the image. Later we use and compare the performance of seven distance measures to assess the similarity. We do not use any additional ranking method for sorting the result: results are sorted according to the similarity between the query image and the images in the database.

The rest of the paper is organized as follows. In the second section we discuss the feature layer of GoogleNet that we use. Then we give an overview of the seven distance metrics mentioned above. Afterwards, we show our evaluation results based on precision/recall metrics. At the end we discuss the advantages and disadvantages of this method and further improvements.

2. High Level Feature Extraction

GoogleNet is a CNN that aims to localize and classify objects in images. It is trained on the ImageNet dataset and achieves a 6.6% top-5 error rate on the image classification task. In GoogleNet, the last layer before the classification layer is a 1024-dimensional vector called pool5/7x7_s1.
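As an illustration, the following is a minimal sketch of how such a feature vector could be extracted. It assumes the Caffe Python bindings and the publicly released BVLC GoogleNet model; the model file names and image paths below are placeholders, and the per-channel means are the standard ImageNet values rather than values taken from this paper.

```python
# Minimal sketch: embed images into the 1024-dimensional pool5/7x7_s1 space.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt',            # GoogleNet architecture definition (placeholder path)
                'bvlc_googlenet.caffemodel',  # pre-trained ImageNet weights (placeholder path)
                caffe.TEST)
net.blobs['data'].reshape(1, 3, 224, 224)     # one image per forward pass

# Preprocessing: HxWxC RGB in [0,1] -> CxHxW BGR with the ImageNet mean subtracted.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))

def embed(image_path):
    """Propagate one image through GoogleNet and return its
    pool5/7x7_s1 output as a flat 1024-dimensional vector."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['pool5/7x7_s1'].data[0].flatten().copy()

# Database of feature vectors: one row per image in the collection (paths are placeholders).
fvd = np.vstack([embed(p) for p in ['img1.jpg', 'img2.jpg']])
```

A query image is embedded with the same function, so query and database vectors live in the same 1024-dimensional space.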
The pool5/7x7_s1 layer extracts the highest level features of the image and gives the most descriptive information about the objects within it. Thus, the features extracted by pool5/7x7_s1 cover the majority of the information contained within the image. We structure our content by propagating each image through GoogleNet and storing the pool5/7x7_s1 layer output in a feature vector database (FVD). From now on, by database we mean the FVD. For query images we follow the same procedure, except for the saving step. Specifically, we propagate the query image through GoogleNet to embed the extracted feature vector into the same 1024-dimensional space.

3. Distance Metrics

The next important step for our task is to choose a distance metric that will maximize the performance. Below we briefly discuss each metric under our consideration. For a pair of N-dimensional real-valued vectors u and v the distances are defined as follows:

d(u, v) = \sum_i |u_i - v_i|,   (Manhattan)   (1)

d(u, v) = \|u - v\|_2,   (Euclidean)   (2)

d(u, v) = 1 - \frac{u \cdot v}{\|u\|_2 \|v\|_2},   (Cosine)   (3)

where u \cdot v is the dot product of u and v.

d(u, v) = \frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|},   (Bray-Curtis)   (4)

d(u, v) = \sum_i \frac{|u_i - v_i|}{|u_i| + |v_i|},   (Canberra)   (5)

where, if u_i and v_i are both 0 for a given i, the fraction 0/0 = 0 is used in the calculation.

d(u, v) = \max_i |u_i - v_i|,   (Chebyshev)   (6)

d(u, v) = 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \|v - \bar{v}\|_2},   (Correlation)   (7)

where \bar{u} is the mean of the elements of u and \bar{v} is the mean of the elements of v.

We use a subset of the ImageNet [4] dataset to assess the retrieval performance. We extract 6 major categories with a total size of 5500 images. Fig. 1 depicts the frequency distribution per category. In order to increase the precision of our measurements, we limit each category to 440 images. In this way we ensure that the recall metric does not depend on the sample size of the category and provides more reliable results.

Fig. 1. Per-category image counts of the ImageNet subset (bar chart).

4. Evaluation

Precision and recall are common measures for evaluating the performance of information retrieval tasks. Precision is defined as the percentage of relevant items among the retrieved items, and recall as the percentage of relevant items that were retrieved out of all relevant items. Specifically,

precision = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{retrieved items}\}|}   (8)

recall = \frac{|\{\text{relevant items}\} \cap \{\text{retrieved items}\}|}{|\{\text{relevant items}\}|}   (9)

In our experiment we fixed the number of retrieved items (neighbors) at 100. The per-category average precision has been computed as

P'_q = \frac{\sum_{k \in A_q} P(i_k)}{|A_q|}, \quad q = 1, 2, \ldots, N,   (10)

where N is the number of categories, A_q is the qth category, and P(i_k) is the precision for the kth image. The per-category average recall has been computed as

R'_q = \frac{\sum_{k \in A_q} R(i_k)}{|A_q|}, \quad q = 1, 2, \ldots, N,   (11)

where R(i_k) is the recall for the kth image. The average precision and average recall are given by

P' = \frac{1}{N} \sum_{q=1}^{N} P'_q,   (12)

and

R' = \frac{1}{N} \sum_{q=1}^{N} R'_q.   (13)

In order to evaluate our method we constructed a similarity matrix for the image database. For each image within the database we retrieved 100 neighbors and calculated precision and recall for each of them. First we evaluated the results per distance metric and averaged them across image categories (Tables 1 and 2). We can see that on average the correlation metric performs better than the other metrics. However, there are image categories where correlation shows lower results than some of the other metrics.
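The following is a minimal sketch of this evaluation procedure. It assumes the feature vectors are stacked row-wise in a NumPy matrix fvd and that labels is a NumPy array where labels[i] holds the category of the i-th image (both names are hypothetical); the distance computations use scipy.spatial.distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_recall(fvd, labels, metric='correlation', k=100):
    """For every image, retrieve its k nearest neighbors under the given
    metric and compute precision (Eq. 8) and recall (Eq. 9)."""
    dists = cdist(fvd, fvd, metric=metric)   # full pairwise distance matrix
    np.fill_diagonal(dists, np.inf)          # never retrieve the query image itself
    precisions, recalls = [], []
    for i in range(len(fvd)):
        neighbors = np.argsort(dists[i])[:k]            # indices of the k nearest neighbors
        hits = np.sum(labels[neighbors] == labels[i])   # relevant items among the retrieved
        relevant = np.sum(labels == labels[i]) - 1      # all relevant items, excluding the query
        precisions.append(hits / float(k))
        recalls.append(hits / float(relevant))
    return np.array(precisions), np.array(recalls)

def averaged(values, labels):
    """Average per-image values inside each category, then average the
    category means, as in Eqs. (10)-(13)."""
    per_category = [values[labels == c].mean() for c in np.unique(labels)]
    return np.mean(per_category)
```

Sweeping metric over 'cityblock' (Manhattan), 'euclidean', 'cosine', 'braycurtis', 'canberra', 'chebyshev' and 'correlation' corresponds to the comparison reported in Tables 1 and 2.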
Table 1. Metrics comparison: average precision per category.

Category       Manhattan  Cosine  Euclidean  Bray-Curtis  Canberra  Chebyshev  Correlation
Construction   0.862      0.928   0.852      0.924        0.912     0.635      0.932
Vertebrate     0.603      0.781   0.561      0.782        0.813     0.440      0.800
Device         0.814      0.835   0.768      0.836        0.856     0.429      0.845
Food           0.974      0.949   0.963      0.945        0.948     0.717      0.956
Invertebrate   0.699      0.851   0.654      0.839        0.838     0.364      0.865
Tree           0.953      0.944   0.960      0.944        0.917     0.795      0.940
Mean           0.818      0.881   0.793      0.878        0.881     0.563      0.890

Table 2. Metrics comparison: average recall per category.

Category       Manhattan  Cosine  Euclidean  Bray-Curtis  Canberra  Chebyshev  Correlation
Construction   0.196      0.211   0.193      0.210        0.207     0.144      0.212
Vertebrate     0.137      0.177   0.127      0.177        0.184     0.100      0.181
Device         0.185      0.189   0.174      0.190        0.194     0.097      0.192
Food           0.221      0.215   0.219      0.214        0.215     0.163      0.217
Invertebrate   0.159      0.193   0.148      0.190        0.190     0.082      0.196
Tree           0.216      0.214   0.218      0.214        0.208     0.180      0.213
Mean           0.185      0.200   0.180      0.199        0.200     0.128      0.202

5. Conclusion

This paper investigated another technique for CBIR and evaluated the proposed method against different similarity measures. We have shown that using high level features from GoogleNet in combination with the correlation distance metric can lead to promising results.

References

[1] R. Datta, J. Li and J. Z. Wang, "Content-based image retrieval - approaches and trends of the new age", The Pennsylvania State University, University Park, PA 16802, 2005.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[3] R. Funayama, H. Yanagihara, L. Van Gool, T. Tuytelaars and H. Bay, "Robust Interest Point Detector and Descriptor", US Patent Office, no. 8165401, 2009.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database", IEEE Computer Vision and Pattern Recognition, 2009.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going Deeper with Convolutions", arXiv, no. 1409.4842, 2014.

Submitted 15.09.2015, accepted 25.01.2016

Image Visual Similarity Based on High Level Features of Convolutional Neural Networks

A. Poghosyan, H. Sarukhanyan

Summary (in Armenian)

The paper presents an analysis of a system for retrieving visually similar images, based on the high level features of convolutional neural networks. It is shown that, for the high level feature vectors of GoogleNet, the choice of the distance function has a great influence on the precision of the retrieval system, and the best results are obtained when correlation is used as the distance function.

Visual Similarity of Images Based on High-Level Features of Convolutional Neural Networks

A. Poghosyan, H. Sarukhanyan

Summary (in Russian)

This paper presents a study of a system for retrieving visually similar images based on the high-level features of convolutional neural networks. It is shown that the choice of the distance function for the high-level features of GoogleNet has a great influence on the precision of the retrieval system.
Of the distance functions considered, the correlation function stands out as giving the best results.