Acta Polytechnica CTU Proceedings 2:45–50, 2015
doi:10.14311/APP.2015.1.0045
© Czech Technical University in Prague, 2015
available online at http://ojs.cvut.cz/ojs/index.php/app

IMPACT ASSESSMENT OF IMAGE FEATURE EXTRACTORS ON THE PERFORMANCE OF SLAM SYSTEMS

Taihú Pire a,∗, Thomas Fischer a, Jan Faigl b

a University of Buenos Aires, Intendente Güiraldes 2160, Ciudad Autónoma de Buenos Aires, Argentina
b Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Technická 2, 166 27, Prague, Czech Republic
∗ corresponding author: tpire@dc.uba.ar

Abstract. This work evaluates the impact of image feature extractors on the performance of a visual SLAM method in terms of pose accuracy and computational requirements. In particular, the S-PTAM (Stereo Parallel Tracking and Mapping) method is considered as the visual SLAM framework, for which both the feature detector and the feature descriptor are parametrized. The evaluation was performed on a standard dataset with ground-truth information, using six feature detectors and four descriptors. The presented results indicate that the combination of the GFTT detector and the BRIEF descriptor provides the best trade-off between localization precision and computational requirements among the evaluated combinations of detectors and descriptors.

Keywords: image features, visual SLAM, stereo vision.

1. Introduction

During the last decade, the Simultaneous Localization and Mapping (SLAM) problem has been one of the main research interests in mobile robotics. In particular, the use of cameras as the main sensors has received special attention [1, 2, 3, 4] because of their benefits, such as low cost and passive sensing. In vision-based SLAM approaches, local image features are used to build a map and simultaneously estimate the robot pose from the environment landmarks represented as image features. In this way, the map is represented as a sparse point cloud, where each point results from triangulating salient points (image features) matched between a pair of stereo images.

Currently, there exist several local image feature extractors in the literature. A feature extractor is a combination of a salient point (called keypoint) detection procedure and the computation of a unique signature (called descriptor) for each detected point. The most commonly used detectors are SIFT [5], SURF [6], STAR [7], GFTT [8], FAST [9], and the relatively recently proposed ORB [10], while among the most used descriptors we can mention SIFT, SURF, ORB, BRIEF [11], and BRISK [12].

In visual SLAM systems, the feature extraction process has a huge impact on the accuracy of the whole system. On the one hand, the precision of the robot localization is heavily correlated with the sparsity of features in the images and with the ability to track them for a long period during robot navigation, even from different points of view. On the other hand, if the number of points in the map grows too quickly, it may slow down the whole system. To keep the response of the system under real-time constraints, images then have to be dropped, or the computational requirements of other parts of the system, such as the optimization routines, have to be reduced.

In this work, we evaluate the impact of different state-of-the-art feature extractors on the performance of a visual SLAM localization method. In particular, the evaluation is based on the stereo visual SLAM approach S-PTAM introduced in [4].
The presented results indicate that the combination of the GFTT detector and the BRIEF descriptor is the most reliable choice for our SLAM system among the evaluated combinations.

The rest of the paper is organized as follows. Section 2 presents an overview of the related work, while Section 3 summarizes the most used feature detectors and descriptors in the visual SLAM literature. Section 4 briefly describes the stereo visual SLAM system used for the evaluation. In Section 5, we present the evaluation of the feature extractors and the achieved results. Section 6 is dedicated to the conclusions and future work.

2. Related Work

Several evaluations of feature extractors can be found in the literature. Each of them is driven by the particular application or issue at hand that it aims to address. For example, in [13], the authors evaluate several feature extractors in the context of autonomous navigation in outdoor environments under seasonal changes. They concluded that the best performing method is the STAR–BRIEF detector–descriptor combination, which outperforms SIFT by more than thirty percentage points. In addition, they argued that the STAR–BRIEF extractor is also less computationally demanding than the other extractors, and thus it seems to be the most suitable feature detector–descriptor for navigational purposes.

On the other hand, the authors of [14] provide a performance comparison of feature extractors under illumination changes in outdoor scenes in the context of visual navigation. They concluded that the FAST–SURF configuration is optimal in their setup. Besides, they report that this combination provides an effective computational time per image, which is favorable for real-time vision-based navigation applications.

The work [15] compares contemporary point feature detector and descriptor pairs in order to determine the best combination for robot visual navigation. The authors concluded that the FAST–BRIEF combination is a good choice when processing speed is an important parameter of the system setup. They also argued that, under camera movement conditions, the additional computational cost needed for descriptors and detectors that are robust to in-plane rotation and large scaling seems to be unjustified. However, they did not test the methods in a real SLAM application.

In contrast to the aforementioned evaluations of detectors and descriptors, the work presented in this paper is set within the context of full 6DOF SLAM.

3. Local Image Features

An image feature extractor consists of detection and description phases. The feature detector serves to locate salient areas of the image, while the feature descriptor captures information about the local neighborhood of the detected area. Here, we provide a brief overview of the feature detector and descriptor algorithms considered in this evaluation study.

SIFT – Scale Invariant Feature Transform [5]. An established feature detector with a high precision and good robustness, which is known to be computationally demanding.

SURF – Speeded Up Robust Features [6] is similar to SIFT, but it is computationally less demanding due to approximations.

STAR – A modified version of the CenSurE (Center Surrounded Extrema) [7] detector, which is computationally less demanding at the expense of a lower precision.
BRIEF – Binary Robust Independent Elementary Features [11] is a descriptor that describes an image area using a number of intensity comparisons of random pixel pairs. The result is stored as a binary string, which reduces the computational complexity of the subsequent matching.

FAST – Features from Accelerated Segment Test [9] is a feature detector focused on lowering the computational cost.

BRISK – Binary Robust Invariant Scalable Keypoints [12] is a scale and rotation invariant version of BRIEF; unlike BRIEF, it uses a deterministic comparison pattern.

ORB – Oriented FAST and Rotated BRIEF [10] is another attempt to achieve a scale and rotation invariant BRIEF, as a computationally efficient alternative to SIFT and SURF. It uses the FAST detector to achieve low computational requirements.

GFTT – A detector focused on selecting features relevant to motion tracking by analyzing the amount of information they provide for that particular task [8].

The SIFT and SURF descriptors rely on their own detectors, which are also considered in the presented evaluation. For the BRIEF and BRISK binary descriptors, the considered detectors are GFTT, FAST, and STAR, which results in six additional detector–descriptor combinations in the presented evaluation.

4. Overview of S-PTAM

S-PTAM [4] is a stereo visual SLAM method for large-scale navigation based on the monocular Parallel Tracking and Mapping (PTAM) method introduced in [1]. The method consists of two processes working in parallel: 1) tracking the detected features; and 2) building a map of the features (mapping).

During robot navigation, the method works as follows. S-PTAM extracts features from the incoming stereo images to match them and construct a virtual map of the environment. The newly extracted feature descriptors are matched against the descriptors of the points stored in the map according to the estimated field of view. The matches are then used to refine the estimated camera pose using an iterative least squares minimization method, e.g., the Levenberg-Marquardt algorithm. The stereo matches between features that cannot be matched to the map are triangulated and inserted as new map points for the tracking of future frames. In parallel, a map refinement process is running; it is also based on Levenberg-Marquardt optimization and continuously performs Bundle Adjustment on the current local portion of the map.

In [4], S-PTAM uses the GFTT feature detector and the BRIEF descriptor extractor. In this work, we consider other detector–descriptor combinations to evaluate the impact of this choice on the performance of the localization and mapping processes.
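To make the tracking step just described concrete, the following is a minimal sketch of one tracking iteration; it is not the S-PTAM implementation. It reduces the map to parallel arrays of 3D points and binary descriptors, and substitutes OpenCV's iterative solvePnP (which internally runs Levenberg-Marquardt) for the method's own pose optimizer; all function and parameter names are illustrative assumptions.

```python
import numpy as np
import cv2

def track_frame(map_points, map_descriptors, frame_keypoints,
                frame_descriptors, K, rvec0, tvec0):
    """Refine the camera pose by matching frame features against the map."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(frame_descriptors, map_descriptors)
    # 3D map points and their 2D observations in the current frame.
    pts3d = np.float32([map_points[m.trainIdx] for m in matches])
    pts2d = np.float32([frame_keypoints[m.queryIdx].pt for m in matches])
    # Iterative least-squares pose refinement (Levenberg-Marquardt),
    # seeded with the predicted pose (rvec0, tvec0).
    ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None, rvec0, tvec0,
                                  useExtrinsicGuess=True,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    # Features without a map correspondence are candidates for stereo
    # triangulation into new map points (omitted in this sketch).
    matched = {m.queryIdx for m in matches}
    unmatched = [i for i in range(len(frame_keypoints)) if i not in matched]
    return rvec, tvec, unmatched
```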
5. Evaluation

The KITTI Vision Benchmark Suite [16] is used to evaluate S-PTAM for each considered detector–descriptor configuration. In particular, we present the results obtained for sequence 00, shown in Figure 1. The sequence records stereo camera frames captured by a moving car in an urban scenario along an almost 4 km long path. The particular parameters of the evaluated feature extractors have been selected in such a way that allows S-PTAM to run without ever losing localization. They were tuned starting from strongly restrictive values, which were then relaxed until the method completes the whole sequence. The parameters are listed in Table 1.

Detector/Descriptor   Parameter           Value
SIFT                  nOctaveLayers       1
                      L2NormThreshold     100
SURF                  hessianThreshold    1000
                      nOctaves            1
                      L2NormThreshold     0.2
STAR                  responseThreshold   20
BRIEF                 bytes               32
                      hammingThreshold    25
FAST                  threshold           60
BRISK                 hammingThreshold    100
ORB                   nfeatures           2000
                      nLevels             1
                      hammingThreshold    50
GFTT                  nfeatures           2000
                      minDistance         15.0

Table 1. Parameters used for the feature detectors and descriptors. Parameters that do not appear in the list use the default values of the OpenCV implementation.

In the case of the binary descriptors, the Hamming distance is used to compute the valid matches, while the L2 norm is used for the SURF and SIFT descriptors.
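As an illustration of how one configuration from Table 1 can be instantiated, the sketch below assumes OpenCV 3.x with the opencv-contrib xfeatures2d module (where the BRIEF extractor lives); the factory names changed across OpenCV versions, and the image file names are placeholders.

```python
import cv2

# GFTT detector limited to 2000 corners at least 15 px apart (Table 1).
detector = cv2.GFTTDetector_create(maxCorners=2000, minDistance=15.0)
# BRIEF descriptor producing 32-byte (256-bit) binary signatures (Table 1).
extractor = cv2.xfeatures2d.BriefDescriptorExtractor_create(32)
# Binary descriptors are compared with the Hamming distance;
# SIFT/SURF would use cv2.NORM_L2 instead.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def extract(image):
    """Detect salient points and compute their binary descriptors."""
    keypoints = detector.detect(image)
    return extractor.compute(image, keypoints)  # -> (keypoints, descriptors)

# Hypothetical stereo pair; file names are placeholders.
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)
kps_l, des_l = extract(left)
kps_r, des_r = extract(right)
# Keep only stereo matches below the Hamming threshold of 25 (Table 1).
matches = [m for m in matcher.match(des_l, des_r) if m.distance < 25]
```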
The evaluation has been performed on an Intel Core i7 processor with 4 cores running at 2.2 GHz. Although S-PTAM strongly exploits parallelism, the experiments were run in a sequential fashion, which allows us to simulate ideal conditions and abstract from the limitations of the available computational power. This ensures that no frames are dropped and that the iterative optimization routines always converge or reach a maximum threshold of iterations. Nevertheless, the tracking process performs pose optimization using an iterative algorithm; hence, the less time is spent on feature extraction, the more iterations the method can compute. Figure 2 shows a characterization of the total tracking time for each pair of frames, as achieved by the evaluated extractors.

Figure 1. Path tracked by every method run under different extractors, against the ground truth. The path is nearly 4 km long. The distances shown at the axes are in meters.

Figure 2. Total tracking time achieved by each configuration.

Moreover, the iterative least-squares optimization, which is utilized in the mapping and tracking processes, depends linearly on the number of tracked points (the density of the map). Thus, regarding the computational burden, the map should be as small as possible, while the map points should contain strong enough features to support robust tracking across the frames. Table 2 shows the final number of points contained in the map after finishing the trial for each particular combination of feature detector and descriptor. In Figure 3, we can see how the map size directly impacts the temporal performance of the tracking process. The detector–descriptor combinations that build the densest maps also take the longest time to compute.

Figure 3. Tracking time without taking into account feature extraction.

Differences in the map size between feature extractors sharing the same detector can have two reasons. The first reason is that new points are created from the stereo features only if these features are not matched to the map. The second reason is that the points marked as outliers during the refinement processes are discarded. In the first case, the growth can be caused by descriptors that are not robust enough to be matched to the map for a long time. In the second case, the descriptor matching may be too permissive, allowing bad matches that are later discarded as outliers.

Extractor        Final map size
GFTT / BRIEF          990 455
GFTT / BRISK        1 314 356
SIFT / SIFT         1 581 876
STAR / BRIEF        1 893 372
SURF / SURF         2 059 879
FAST / BRIEF        2 420 652
STAR / BRISK        2 447 418
FAST / BRISK        3 207 003
ORB / ORB           5 192 885

Table 2. The number of points contained in the map after completing the sequence for each evaluated extractor, in ascending order.

Since the goal of this work is to assess the impact of the feature extractor choice also on the accuracy of the SLAM method, the achieved performance is presented as two independent relative errors for each estimated pose: $\epsilon_t$ for the translation error and $\epsilon_\theta$ for the orientation error. Let $x_k$ be the estimated pose at frame $k$, which can be decomposed into the translation $t_k$ and the rotation $R_k$. Let $x^*_k$ be the reference pose, which can be decomposed in the same fashion. The aforementioned errors are computed as

$\epsilon_{t,k+1} = \left\| (t_k \ominus t_{k+1}) \ominus (t^*_k \ominus t^*_{k+1}) \right\|,$
$\epsilon_{\theta,k+1} = \mathrm{angle}\left( (R_k \ominus R_{k+1}) \ominus (R^*_k \ominus R^*_{k+1}) \right),$

where $\ominus$ is the inverse of the standard motion composition operator. For pure translations, we can rewrite $t_1 \ominus t_2 = t_2 - t_1$, and for pure rotations, $R_1 \ominus R_2 = R_1^T R_2$. $\|x\|$ stands for the Euclidean norm, and $\mathrm{angle}(R)$ extracts the magnitude of the rotation.
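The following is a direct numpy transcription of the relative error metrics above, assuming each pose is given as a 3×3 rotation matrix and a translation vector; it is an illustrative sketch, not the evaluation code used to produce the figures.

```python
import numpy as np

def rotation_angle(R):
    """angle(R): the rotation magnitude encoded by a 3x3 rotation matrix."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def relative_errors(R_k, t_k, R_k1, t_k1,
                    R_k_ref, t_k_ref, R_k1_ref, t_k1_ref):
    """Relative translation/orientation errors between frames k and k+1."""
    # Pure translations: t1 (-) t2 = t2 - t1 (estimated and reference deltas).
    d_est = t_k1 - t_k
    d_ref = t_k1_ref - t_k_ref
    eps_t = np.linalg.norm(d_ref - d_est)
    # Pure rotations: R1 (-) R2 = R1^T R2.
    dR_est = R_k.T @ R_k1
    dR_ref = R_k_ref.T @ R_k1_ref
    eps_theta = rotation_angle(dR_est.T @ dR_ref)
    return eps_t, eps_theta
```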
The computed errors are shown in Figure 4 and Figure 5, respectively. Although the angular deviation from the ground truth, shown in Figure 5, is similar for all methods, the same is not true for the translation error, as can be seen in Figure 4. The BRISK descriptor seems to be more reliable with the FAST detector, while the BRIEF descriptor performs better with the GFTT detector.

Figure 4. Relative translation error.

Figure 5. Relative orientation error.

For completeness, the absolute errors

$\epsilon'_{t,k} = \| t_k \ominus t^*_k \|,$
$\epsilon'_{\theta,k} = \mathrm{angle}\left( R_k \ominus R^*_k \right)$

are shown in Figure 6 and Figure 7.

Figure 6. Absolute translation error.

Figure 7. Absolute orientation error.

6. Conclusions

In this paper, we present an evaluation of the impact of different state-of-the-art image feature extractors on the performance of the SLAM method proposed in [4]. The KITTI Benchmark Suite dataset with ground truth is used to evaluate the achievable precision of the method for the different feature extractors. Based on the presented results, the main conclusion is that the GFTT detector is the most suitable choice for the best performance on the evaluated dataset. GFTT (combined with the BRIEF or BRISK descriptors) outperforms the other methods in terms of the required computational time and the map quality. Although the map density is far smaller, the computed translation error is similar to, and even slightly better than, the errors achieved by the other extractors. This insight can be interpreted as follows: the features most useful for navigation are extracted, while the descriptor also supports efficient matching, resulting in a more precise localization.

Recently, novel stereo feature extractors have been proposed, e.g., [17], which motivates us to consider them in S-PTAM. An evaluation of these novel extractors is a subject of our future work.

Acknowledgements

This work is a direct result of the bilateral cooperation program between the Czech Republic and Argentina, supported by the Argentinian project ARC/14/06 and by travel support of the Czech Ministry of Education under the project No. 7AMB15AR029. The work of J. Faigl is supported by the Czech Science Foundation (GAČR) under research project No. GJ15-09600Y.

References

[1] G. Klein, D. Murray. Parallel Tracking and Mapping for Small AR Workspaces. In ISMAR, pp. 1–10. IEEE Computer Society, Washington, DC, USA, 2007. doi:10.1109/ISMAR.2007.4538852.
[2] C. Mei, G. Sibley, M. Cummins, et al. RSLAM: A system for large-scale mapping in constant-time using stereo. International Journal of Computer Vision 94(2):198–214, 2011. doi:10.1007/s11263-010-0361-7.
[3] R. Mur-Artal, J. M. M. Montiel, J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. CoRR abs/1502.00956, 2015. doi:10.1109/TRO.2015.2463671.
[4] T. Pire, T. Fischer, J. Civera, et al. Stereo parallel tracking and mapping for robot localization. In IROS. 2015. (to appear).
[5] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110, 2004. doi:10.1023/B:VISI.0000029664.99615.94.
[6] H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded up robust features. In ECCV, vol. 3951 of Lecture Notes in Computer Science, pp. 404–417. Springer Berlin Heidelberg, 2006. doi:10.1007/11744023_32.
[7] M. Agrawal, K. Konolige, M. Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In ECCV, vol. 5305 of Lecture Notes in Computer Science, pp. 102–115. Springer Berlin Heidelberg, 2008. doi:10.1007/978-3-540-88693-8_8.
[8] J. Shi, C. Tomasi. Good features to track. In CVPR, pp. 593–600. 1994. doi:10.1109/CVPR.1994.323794.
[9] E. Rosten, T. Drummond. Machine learning for high-speed corner detection. In ECCV, vol. 3951 of Lecture Notes in Computer Science, pp. 430–443. Springer Berlin Heidelberg, 2006. doi:10.1007/11744023_34.
[10] E. Rublee, V. Rabaud, K. Konolige, G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, pp. 2564–2571. 2011. doi:10.1109/ICCV.2011.6126544.
[11] M. Calonder, V. Lepetit, C. Strecha, P. Fua. BRIEF: Binary robust independent elementary features. In ECCV, vol. 6314 of Lecture Notes in Computer Science, pp. 778–792. Springer Berlin Heidelberg, 2010. doi:10.1007/978-3-642-15561-1_56.
[12] S. Leutenegger, M. Chli, R. Siegwart. BRISK: Binary robust invariant scalable keypoints. In ICCV, pp. 2548–2555. 2011. doi:10.1109/ICCV.2011.6126542.
[13] T. Krajník, P. de Cristóforis, M. Nitche, et al. Image features and seasons revisited. In European Conference on Mobile Robotics (ECMR). 2015. (to appear).
[14] Dzulfahmi, N. Ohta. Performance evaluation of image feature detectors and descriptors for outdoor-scene visual navigation. In ACPR, pp. 872–876. 2013. doi:10.1109/ACPR.2013.159.
[15] A. Schmidt, M. Kraft, M. Fularz, Z. Domagala. Comparative assessment of point feature detectors in the context of robot navigation. Journal of Automation, Mobile Robotics and Intelligent Systems 7(1):11–20, 2013.
[16] A. Geiger, P. Lenz, C. Stiller, R. Urtasun. Vision meets robotics: The KITTI dataset. IJRR 32(11):1231–1237, 2013. doi:10.1177/0278364913491297.
[17] R. Arroyo, P. Alcantarilla, L. Bergasa, et al. Fast and effective visual place recognition using binary codes and disparity information. In IROS, pp. 3089–3094. 2014. doi:10.1109/IROS.2014.6942989.