THE DEVELOPMENT OF INDOOR OBJECT RECOGNITION TOOL FOR PEOPLE WITH LOW VISION AND BLINDNESS

Rhio Sutoyo1; Andry Chowanda2

1,2Computer Science Department, School of Computer Science, Bina Nusantara University
Jln. K.H. Syahdan No 9, DKI Jakarta 11480, Indonesia
1rsutoyo@binus.edu; 2achowanda@binus.edu

Received: 8th February 2017 / Revised: 7th March 2017 / Accepted: 13th March 2017

Abstract - The purpose of this research was to develop methods and algorithms that could be applied as the underlying base for developing an object recognition tool. The methods implemented in this research were initial problem identification, method and algorithm testing and development, image database modeling, system development, and training and testing. As a result, the system can perform indoor object recognition with 93.46% accuracy. Even though the system achieves relatively high accuracy in recognizing objects, it is still limited to single-object detection and is not able to recognize objects in real time.

Keywords: object recognition, computer vision, tool, blindness, low vision

I. INTRODUCTION

According to the Indonesia Blind Union (PERTUNI - Persatuan Tunanetra Indonesia), in 2009 there were more than 1.5 million blind people and more than 1.2 million people with low vision (weak sight) in Indonesia, and the number is increasing over time. In addition, at the end of 2012, Indonesia was rated as having the second-highest rate of blindness in the world (Jakarta Globe, 2012). Across the globe, the number of visually impaired or blind people is predicted to increase by 20% within the next 50 years (Bujacz & Strumillo, 2006), mainly because of the aging demographic around the world. With a sight problem, daily activities such as walking, reading, recognizing objects and people, and visiting places can be quite troublesome.

There has been much research aimed at helping visually impaired people, especially in the field of computer vision. For example, computer vision could be used to create an "artificial eye" for blind or visually impaired people. An artificial eye would allow sightless people to move to destinations or places, read books (with the help of hearing or Braille), recognize objects and faces, and much more. According to a survey of previous works by Manduchi and Coughlan (2012), the most useful tools were Braille note takers, text magnifiers and screen readers, and document scanners with Optical Character Recognition (OCR). With the help of technology, a Braille note taker such as the BrailleNote Touch can help blind or visually impaired people take notes. Furthermore, text magnifiers and screen readers have become standard features in modern laptops and smartphones (e.g., the VoiceOver feature in Apple's iOS and the TalkBack feature in Google's Android). Meanwhile, OCR technology helps recognize characters in images; after the scanning and recognition processes have completed, the recognized text can be read back to the user. In short, Braille and OCR technology help blind and visually impaired people to read without using their sight.

Besides character recognition, the other problem for people with bad eyesight is object recognition. Bujacz and Strumillo (2006) utilized stereophonic sound to detect whether there was an object nearby.
The research used high-pitched sounds to represent objects that were close to the user, low-pitched sounds for objects that were far from the user, and no sound when no objects were near the user. The representation of the environment was generated by several scanning processes over 5°x5° sections of the environment view. The types of scan included a basic (front) scan, a wide (vertical) side-scan, and a horizontal side-scan. The tests were conducted with 10 volunteers participating in three types of trials: mobility, orientation, and a combination of both. The results indicated that the users were able to imagine the indoor structure based on the sound cast from the indoor objects. However, this research had weaknesses in recognizing objects and was difficult to apply in rooms with a complicated structure.

Then, Ran et al. (2004) developed a tool to help blind people walk and know the surrounding objects by using ultrasound sensors, wireless communication, and the Global Positioning System (GPS). This tool could be used both indoors and outdoors. Indoors, Ran et al. (2004) used ultrasound sensors that were attached to the house and also to the user. The tests were conducted by using 4 HE900M pilots, 2 HE900T beacons, and an RS485/RS232 converter. The pilots were placed in the four corners of the house; with this placement, the pilots provided a complete map of the house. Moreover, the beacons were attached to the user to receive the ultrasound signals from the pilots. The generated outputs were position (e.g., the distance to the refrigerator is 2 meters) and orientation (e.g., turn right 45 degrees and walk ahead). Thus, users could know the location of the object. Meanwhile, outdoors, they used GPS and wireless sensors. The tests were conducted by using a Trimble PROXRS, a 12-channel integrated GPS/Beacon/Satellite receiver with multi-path rejection technology. Users were able to use voice guidance to determine whether they were out of the lane or not. Even though these approaches were useful, they required an expensive setup and could only be used in rooms that had sensors installed. Moreover, more sensors were needed to achieve a more precise result or to cover a bigger room.

On the contrary, Anjum (2012) used a map of existing objects combined with a constructed dataset. This research helped blind and visually impaired people reach their destination by providing directions. In order to get the destination route, users needed to send an image of their current location to the trained system for localization. After the localization process had finished, the system generated the path to the destination. This research used the 128-dimensional Scale-Invariant Feature Transform (SIFT) method for classifying objects in the room. This method was invariant to rotation, luminosity, and scale. Thus, it could determine the location of the room. While the system showed promising results for recognizing space and navigation, it was limited to rooms that already had well-built maps.

In this research, computer vision techniques were utilized to achieve an affordable setup cost with precise results. According to Pinto et al. (2008), the fundamental problem for object recognition is the unlimited variation of the input (e.g., position, scale, pose, illumination, background, and others) in a 2D image captured by the retina or a camera, and the position changes of the image captured by the camera.
To this day, recognizing real-world objects is still a major problem. Nevertheless, several researchers have successfully addressed parts of the object recognition problem. For example, Gu et al. (2009) used region extraction to reduce the number of hypothesis combinations that have to be estimated over the bag of regions derived from a region tree. Regions were described by using a rich set of cues (i.e., shape, color, and texture) inside them. Furthermore, a discriminative max-margin framework was used to learn the region weights. Then, a Hough voting scheme was used to generate hypotheses of object locations, scales, and support. The 2D image was processed by using a region tree and a bag of regions. Thus, detection and image segmentation could be done. The evaluation was done by using the ETH Zurich (ETHZ) shape and Caltech 101 databases. They claimed that the algorithm was able to reduce the total number of combinations to be estimated.

In addition, Pinto et al. (2011) compared several feature representations such as the Pyramid Histogram of Gradients (PHOG), Pyramid Histogram of Visual Words (PHOW), Geometric Blur, Sparse Localized Features (SLF), Scale-Invariant Feature Transform (SIFT), raw pixels, and V1-like features for object recognition. The Caltech-101 image categorization task was used as a point of reference. Based on the conducted experiments, the V1-like representation had the highest performance among the others. Similarly, Farhadi et al. (2009) used attributes that were specific to an object to improve the accuracy of object recognition. Their research utilized common features to classify the object and compared them with the specific features, resulting in better accuracy in object recognition. Hence, it could not only recognize and name familiar objects, but also report unusual aspects of an object (i.e., "spotty cat" instead of "cat"), describe unfamiliar objects ("skinny and four-legged" instead of "unknown"), and learn about new objects even without visual examples. Unfortunately, their datasets were limited to only a few objects. Moreover, this research was not perfect, as there were multiple-class objects and efficiency issues in the process. These disadvantages could be answered by Hochbaum and Singh (2009). They adopted co-segmentation by using a Markov Random Field (MRF). With this technique, the segmentation of similar object regions could be done on similar (or the same) objects that appear in two different images. It increased the efficiency and accuracy of object recognition. Although the algorithm runs similarly on normal images, it saves processing time on larger images.

In addition, Li, Crandall, and Huttenlocher (2009) constructed datasets of images obtained from the internet, such as Facebook and Flickr, using geo-tagging. The datasets were used to detect place names by using Scale-Invariant Feature Transform-based bag-of-words features. This research showed good performance when it processed approximately 30 million existing images. Meanwhile, Li, Socher, and Fei-Fei (2009) combined classification, annotation, and segmentation, finding relationships between objects so that the most likely scene could be inferred. Unfortunately, these methods required complete datasets. Next, Carreira et al. (2012) used a ranking system over object segmentation results. Their research used figure-ground segments generated by bottom-up computational processes. The method was conducted in three steps.
First, a set of figure-ground segmentation hypotheses is produced for each image using the combinatorial CPMC segmentation algorithm. Second, each object category is scored to assess the likelihood that a segment hypothesis belongs to that class. Third, the segment hypotheses are sorted by their scores, and detection and segmentation decisions are made consecutively based on a weighted combination of responses collected at top-level segments. This showed more accurate object recognition results. Nevertheless, the research still produced errors in some cases, such as objects with unclear boundaries and objects from similar classes. Furthermore, Divvala et al. (2009) analyzed object detection methods based on existing variables, namely the changes in confusion matrices and accuracies with respect to size and occlusion, and an analysis of the sources and uses of context. Their research explored the importance of contextual reasoning for object recognition. The contextual reasoning reduced detection errors and also made the remaining errors more reasonable. That research is used here as a reference for object detection, segmentation, and scene recognition.

This research presents a model for categorization and recognition limited to indoor objects as the first step in developing a tool to help blind and low-vision people. The aim is to take advantage of computer vision techniques to build a tool that can help those people. With the help of this "artificial eye" tool, visually impaired people are able to recognize everyday objects more easily. With computer vision techniques, researchers can build tools that can recognize people (Chowanda et al., 2014, 2016) and their expressions (Sutoyo, Harefa, & Chowanda, 2016), objects (Farhadi, Endres, Hoiem, & Forsyth, 2009), places (Li, Crandall, & Huttenlocher, 2009), presentation processes (Sutoyo et al., 2015, 2017), and others.

II. METHODS

Figure 1 illustrates the overall system architecture. The camera captures the information from the world in 2D format. The captured image is preprocessed (e.g., image enhancement) before the researchers extract the features of the image. Then, the system classifies and recognizes the object in the processed image based on models trained by using existing datasets. Finally, the system provides sound feedback to the user using Text-to-Speech.

Figure 1 System Architecture

The first stage is image processing and enhancement. In this stage, images captured by the camera are prepared for the feature extraction stage. The images are converted into gray-scale images with the help of OpenCV with the command:

IAT_Rgb2Gray * R2G = new IAT_Rgb2Gray();

Then, they are processed by using the Canny edge detector with the command:

cvCanny(img1, img1, 10, 50, 3);

Lastly, those images are converted into HSV (Hue, Saturation, Value) mode with the help of OpenCV with the command:

cvCvtColor(img, img1, CV_RGB2HSV);

After the image-processing stage has been done, the processed images are sent to the feature extraction step.
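For reference, the three commands above can be combined into a single preprocessing routine. The following is a minimal sketch written against the modern OpenCV C++ API rather than the legacy C API used above. The Canny thresholds (10, 50) and aperture size (3) are taken from the cvCanny call in the text; the BGR input ordering, the cv::cvtColor grayscale step (standing in for the project-specific IAT_Rgb2Gray class), and the function name preprocessImage are assumptions made for illustration only.

    #include <opencv2/imgproc.hpp>

    // Sketch of the preprocessing stage described above: grayscale conversion,
    // Canny edge detection, and HSV conversion.
    void preprocessImage(const cv::Mat& bgr, cv::Mat& gray, cv::Mat& edges, cv::Mat& hsv)
    {
        cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);  // grayscale (assumes a BGR input image)
        cv::Canny(gray, edges, 10, 50, 3);            // thresholds and aperture size from the text
        cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);    // HSV color space
    }

A caller would typically load the image (for example with cv::imread) and pass the three output matrices on to the feature extraction stage described next.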
The second stage is feature extraction. In this step, image separation is done before the features are extracted. This process follows the work of Kim et al. (2010), where the image is separated into five regions or blocks. The idea of this approach is that the object of interest in a photo is mostly located in the center of the image (e.g., people's faces), whereas, for indoor-outdoor classification, the border blocks carry more information than the center block. An example of the image separation can be seen in Figure 2.

Figure 2 Image Division in Five Areas or Regions (Source: Kim et al., 2010)

An image feature is calculated for each block by setting the Region of Interest (ROI) to that particular block. The next process is edge feature extraction, which uses equation (1):

(1)

where Ei,n is the value for angle n in block i, and ε is a small value. The feature is calculated for eight quantised angles (edge orientations) in each block, which provides a 40-D feature vector. After edge feature extraction has been done, the next process is color feature extraction, which is implemented by using equation (2):

(2)

This process is similar to the edge feature extraction process that uses the Edge Orientation Histogram (EOH). The feature vectors obtained from the previous processes (i.e., edge feature extraction and color feature extraction) are initially stored in an Extensible Markup Language (XML) file. Then, these features are combined and stored in a Comma-Separated Values (CSV) file so that the data can be read easily for Support Vector Machine (SVM) classifier training. The process to create the Edge and Color Orientation Histogram (ECOH) can be seen in Figure 3. After all these processes have been done, the feature extraction phase is complete. The data obtained from this stage is used in the SVM image classification step.

Figure 3 Feature Extraction Process (Source: Kim et al., 2010)
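Because equations (1) and (2) are given only by reference to Kim et al. (2010), the following is a minimal sketch of how the 40-D edge feature could be computed: an 8-bin edge-orientation histogram is accumulated per block, and the five block histograms are concatenated. The block layout (four quadrants plus a centre block), the Sobel-based gradient estimation, the magnitude threshold, the per-block normalisation with a small epsilon, and the function names are assumptions for illustration; the exact definitions follow Kim et al. (2010).

    #include <opencv2/imgproc.hpp>
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Sketch: 8-bin edge-orientation histogram for one block of a grayscale image.
    static std::vector<float> edgeOrientationHistogram(const cv::Mat& grayBlock)
    {
        cv::Mat gx, gy;
        cv::Sobel(grayBlock, gx, CV_32F, 1, 0);   // horizontal gradient
        cv::Sobel(grayBlock, gy, CV_32F, 0, 1);   // vertical gradient

        std::vector<float> hist(8, 0.0f);
        for (int r = 0; r < grayBlock.rows; ++r) {
            for (int c = 0; c < grayBlock.cols; ++c) {
                float dx = gx.at<float>(r, c);
                float dy = gy.at<float>(r, c);
                float mag = std::sqrt(dx * dx + dy * dy);
                if (mag < 1e-3f) continue;                            // skip non-edge pixels
                float angle = std::atan2(dy, dx) + (float)CV_PI;      // map to [0, 2*pi)
                int bin = std::min(7, (int)(angle / (2.0f * (float)CV_PI) * 8.0f));
                hist[bin] += mag;
            }
        }
        float sum = 1e-3f;                                            // small epsilon, as in equation (1)
        for (float v : hist) sum += v;
        for (float& v : hist) v /= sum;                               // normalise per block
        return hist;
    }

    // Sketch: concatenate the five block histograms into a 40-D edge feature vector.
    std::vector<float> extractEdgeFeatures40D(const cv::Mat& gray)
    {
        const int w = gray.cols / 2;
        const int h = gray.rows / 2;
        const std::vector<cv::Rect> blocks = {
            cv::Rect(0, 0, w, h),                         // top-left block
            cv::Rect(w, 0, gray.cols - w, h),             // top-right block
            cv::Rect(0, h, w, gray.rows - h),             // bottom-left block
            cv::Rect(w, h, gray.cols - w, gray.rows - h), // bottom-right block
            cv::Rect(gray.cols / 4, gray.rows / 4, w, h)  // centre block (assumed layout)
        };
        std::vector<float> feature;
        for (const cv::Rect& roi : blocks) {
            const std::vector<float> blockHist = edgeOrientationHistogram(gray(roi)); // ROI per block
            feature.insert(feature.end(), blockHist.begin(), blockHist.end());
        }
        return feature;   // 5 blocks x 8 orientation bins = 40 dimensions
    }

A color feature for equation (2) could be built in the same way from the hue channel of the HSV image, and the concatenated edge and color vectors would then be written to the CSV file used for SVM training, as described above.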
The third stage is SVM classification. To train the SVM image classifier (i.e., data training), all features in the CSV file from the previous step are converted into a matrix. A label is placed at the end of each feature vector to determine the image category. Once the matrix has been prepared, the training data must be organized to train the SVM classifier. SVM image classification is implemented by using LIBSVM, an open-source library developed by Chih-Chung Chang and Chih-Jen Lin (Chang & Lin, 2011). Next, the classifier is stored in XML file format. The created XML file is used to predict the class of new images. When a new image must be classified, all features needed for the ECOH are calculated from the image, and the result is stored in a matrix. Next, the matrix is given to the SVM classifier to determine the image class. The result depends on the feature scores that the classifier gives as output. These results are used for the final step, which is object detection and recognition.

The final stage is object detection and recognition. This stage requires a sampling process. Sample creation can be done with the help of the OpenCV library, which generates a vector file (VEC) from a positive description file. The samples are created by using sub-images marked from the real images. The width and height of each sample are 24 x 24 pixels; however, that size can be changed through the input parameters. To run it, the samples must be placed in the bin folder of OpenCV, where the application is located in the file directory. If the process succeeds, it reports the number of samples that are identified successfully and unsuccessfully from the existing dataset. The classification in this research can identify several objects such as chairs, clocks, and other indoor objects. An example can be seen in Figure 4.

Figure 4 Successfully Detected Objects

The image in Figure 4 is taken from the image database on the website of the California Institute of Technology (CALTECH, 2017). The datasets used in this research can be seen in Table 1. The researchers model an image database for training purposes by combining three existing datasets. Then, by using the image database, the researchers train the SVM classifier for object recognition using the algorithm described above (SVM classification).

Table 1 Datasets

Name             Number of Images   Providers
House Dataset    1000 jpg           California Institute of Technology (Philip & Updike, 2001)
LHI 8 Image      1200 jpg           Yao, Yang, and Zhu (2007)
MIT Image        4289 jpg           MIT Media Lab (Oliva & Torralba, 2001)

Figure 5 shows an example of the identification results of the system prototype. The current system still uses 2D images and is not able to do real-time identification. The image is taken from the image database on the website of the California Institute of Technology. The monitor has a 78% level of confidence, the glass has an 83% level of confidence, and the watch has a 78% level of confidence.

Figure 5 Identification Results of the System Prototype

III. RESULTS AND DISCUSSIONS

The evaluation is done by using the image databases from the California Institute of Technology website and the MIT website. A total of 107 images are used in the testing, with a wide variety of objects from the provided datasets. The object recognition achieves a high detection rate of 93.46%. The accuracy or confidence level represents the number of images that are correctly detected by the system during the evaluation (i.e., the number of true positives divided by all images used in the evaluation). Hence, it is:

100/107 × 100% = 93.46%

The high detection rate of this system is affected by the combination of the three datasets used for the object recognition process. There are some limitations to the system. The system can only detect one object (single detection). Moreover, objects with the same shape or similar features can produce several estimations. Examples can be seen in Figure 6 with different confidence levels (i.e., bed: 52% confidence, table: 32% confidence). The current system cannot be used in real-time conditions and is still context-based. Hence, if an object has a shape and features that are not common for that object, the system will have difficulty recognizing it.

Figure 6 Objects with the Same Shape or Similar Features

Table 2 presents the results of the evaluation involving 107 images for object detection. The number of images represents the number of sample images for each object. Correct is the number of times that the object is successfully and properly identified. Max is the highest confidence percentage ever achieved by an object, while min is the lowest confidence percentage ever achieved by an object. The mean is the average confidence for the recognition of an object. Then, detect is the percentage of instances of an object that are successfully detected.

Table 2 The Result Summary of Image Detection

Object    Images   Correct   Incorrect   Max   Min   Mean   Detect
Table     15       15        0           89    30    75     100%
Chair     16       16        0           88    35    78     100%
Monitor   14       14        0           91    32    77     100%
PC        16       16        0           91    42    85     100%
Lamp      4        2         2           85    52    83     50%
Bottle    16       16        0           89    62    77     100%
Glass     14       12        2           96    41    64     85.7143%
Watch     5        3         2           70    20    58     60%
Bed       7        6         1           72    54    68     85.7143%

It can be seen from Table 2 that five objects manage to get 100% detection by the system. Moreover, six objects achieve a relatively high confidence level, which is more than 70%. Last, only three objects have a low confidence level (glass, watch, and bed).
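To make the columns of Table 2 concrete, the detect value for an object is its correct count divided by its number of test images. For example, for the glass category, 12 of 14 images are identified correctly, so 12/14 × 100% = 85.71%, while the overall accuracy is the sum of the correct counts over all objects divided by the total number of test images, 100/107 × 100% = 93.46%.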
IV. CONCLUSIONS

This research presents an initial development of object recognition as the first step in building a tool to help people with low vision and blindness. The results show that the tool can achieve a high detection rate (93.46%). This high accuracy is affected by the combination of the three datasets that are used for the object recognition process. Nevertheless, this approach makes the system run slowly. Hence, it is almost impossible to implement as a real-time system for now. In addition, the system is only able to detect one object (single detection). Objects with almost identical shapes and features will produce several estimation results; for example, the objects in Figure 6 have different confidence levels. Moreover, the current system is still limited to the context-based problem. Therefore, if an object has a shape and features that are not common for that object, the system will have difficulty recognizing it. For future works, developing the system towards scene recognition will be an interesting way to complete this project. Moreover, real-time prediction will be the focus of the next step of this research.

REFERENCES

Anjum, S. (2012). Place recognition for indoor blind navigation. Retrieved February 23rd, 2017 from http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mjs/ftp/thesis-program/2011/theses/qatar-anjum.pdf

Bujacz, M., & Strumillo, P. (2006). Stereophonic representation of virtual 3D scenes - a simulated mobility aid for the blind. New Trends in Audio and Video, 1, 157-162.

CALTECH. (2017). Caltech256: Image datasets. Retrieved February 22nd, 2017 from http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/

Carreira, J., Li, F., & Sminchisescu, C. (2012). Object recognition by sequential figure-ground ranking. International Journal of Computer Vision, 98(3), 243-262.

Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 27.

Chowanda, A., Blanchfield, P., Flintham, M., & Valstar, M. (2014). Erisa: Building emotionally realistic social game-agents companions. In International Conference on Intelligent Virtual Agents (pp. 134-143). Springer International Publishing.

Chowanda, A., Blanchfield, P., Flintham, M., & Valstar, M. (2016). Computational models of emotion, personality, and social relationships for interactions in games. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems (pp. 1343-1344). International Foundation for Autonomous Agents and Multiagent Systems.

Divvala, S. K., Hoiem, D., Hays, J. H., Efros, A. A., & Hebert, M. (2009). An empirical study of context in object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009 (pp. 1271-1278).

Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on 20-25 June 2009 (pp. 1778-1785).

Gu, C., Lim, J. J., Arbeláez, P., & Malik, J. (2009). Recognition using regions. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on 20-25 June 2009 (pp. 1030-1037).

Hochbaum, D. S., & Singh, V. (2009). An efficient algorithm for co-segmentation. In Computer Vision, 2009 IEEE 12th International Conference on 29 Sept.-2 Oct. 2009 (pp. 269-276).

Jakarta Globe. (2012). Indonesia has second-highest rate of blindness in world. Retrieved February 23rd, 2017 from http://jakartaglobe.id/archive/indonesia-has-second-highest-rate-of-blindness-in-world/
Kim, W., Park, J., & Kim, C. (2010). A novel method for efficient indoor-outdoor image classification. Journal of Signal Processing Systems, 61(3), 251-258.

Li, L. J., Socher, R., & Fei-Fei, L. (2009). Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on 20-25 June 2009 (pp. 2036-2043).

Li, Y., Crandall, D. J., & Huttenlocher, D. P. (2009). Landmark classification in large-scale image collections. In Computer Vision, 2009 IEEE 12th International Conference on 29 Sept.-2 Oct. 2009 (pp. 1957-1964).

Manduchi, R., & Coughlan, J. (2012). (Computer) vision without sight. Communications of the ACM, 55(1), 96-104.

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145-175.

PERTUNI. (2009). Resolusi Munas VII PERTUNI 2009. Retrieved November 23rd, 2016 from http://pertuni.idp-europe.org/Resolusi2009/

Philip, B., & Updike, P. (2001). California Institute of Technology SURF project for summer. Retrieved February 23rd, 2017 from http://www.vision.caltech.edu/html-files/archive.html

Pinto, N., Cox, D. D., & DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1), e27.

Pinto, N., Barhomi, Y., Cox, D. D., & DiCarlo, J. J. (2011). Comparing state-of-the-art visual features on invariant object recognition tasks. In 2011 IEEE Workshop on Applications of Computer Vision (WACV) (pp. 463-470).

Ran, L., Helal, S., & Moore, S. (2004). Drishti: An integrated indoor/outdoor blind navigation system and service. In Pervasive Computing and Communications, 2004. PerCom 2004. Proceedings of the Second IEEE Annual Conference on 17-17 March 2004 (pp. 23-30).

Sutoyo, R., Prayoga, B., Suryani, D., & Shodiq, M. (2015). The implementation of hand detection and recognition to help presentation processes. Procedia Computer Science, 59, 550-558.

Sutoyo, R., Harefa, J., & Chowanda, A. (2016). Unlock screen application design using face expression on Android smartphone. In MATEC Web of Conferences (Vol. 54). EDP Sciences.

Sutoyo, R., Lesmana, T. F., & Susanto, E. (2017). KINECTATION (Kinect for Presentation): Control presentation with interactive board and record presentation with live capture tools. Journal of Physics: Conference Series, 8(1), 1-6.

Yao, B., Yang, X., & Zhu, S. C. (2007). Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (pp. 169-183). Springer Berlin Heidelberg.