Acta Polytechnica CTU Proceedings 2:45–50, 2015
doi:10.14311/APP.2015.1.0045
© Czech Technical University in Prague, 2015
available online at http://ojs.cvut.cz/ojs/index.php/app

IMPACT ASSESSMENT OF IMAGE FEATURE EXTRACTORS ON THE PERFORMANCE OF SLAM SYSTEMS

Taihú Pire a,∗, Thomas Fischer a, Jan Faigl b

a University of Buenos Aires, Intendente Güiraldes 2160, Ciudad Autónoma de Buenos Aires, Argentina
b Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Technická 2, 166 27, Prague, Czech Republic
∗ corresponding author: tpire@dc.uba.ar

Abstract. This work evaluates the impact of image feature extractors on the performance of a visual SLAM method in terms of pose accuracy and computational requirements. In particular, the S-PTAM (Stereo Parallel Tracking and Mapping) method is considered as the visual SLAM framework, for which both the feature detector and the feature descriptor are parametrized. The evaluation was performed on a standard dataset with ground-truth information, using six feature detectors and four descriptors. The presented results indicate that the combination of the GFTT detector and the BRIEF descriptor provides the best trade-off between localization precision and computational requirements among the evaluated combinations of detectors and descriptors.

Keywords: image features, visual SLAM, stereo vision.

1. Introduction

During the last decade, the Simultaneous Localization and Mapping (SLAM) problem has been one of the main research interests in mobile robotics. In particular, the use of cameras as the main sensors has received special attention [1, 2, 3, 4] because of their benefits, such as low cost and passive sensing. In vision-based SLAM approaches, local image features are used to build a map and simultaneously estimate the robot pose from the environment landmarks represented as image features. In this way, the map is represented as a sparse point cloud, where each point results from triangulating salient points (image features) matched between a pair of stereo images.

Currently, there exist several local image feature extractors in the literature. A feature extractor is a combination of a salient point (called keypoint) detection procedure and the computation of a unique signature (called descriptor) for each detected point. The most commonly used detectors are SIFT [5], SURF [6], STAR [7], GFTT [8], FAST [9], and the relatively recently proposed ORB [10], while among the most used descriptors we can mention SIFT, SURF, ORB, BRIEF [11], and BRISK [12].

In visual SLAM systems, the feature extraction process has a huge impact on the accuracy of the whole system. On the one hand, the precision of the robot localization is heavily correlated with the sparsity of features in the images and with the ability to track them for a long period during robot navigation, even from different points of view. On the other hand, if the number of points in the map grows too quickly, it may slow down the whole system. To keep the response of the system under real-time constraints, images then have to be dropped, or the computational requirements of other parts of the system, such as the optimization routines, have to be reduced.

In this work, we evaluate the impact of different state-of-the-art feature extractors on the performance of a visual SLAM localization method. In particular, the evaluation is based on the stereo visual SLAM approach S-PTAM introduced in [4].
The presented results indicate that the combination of the GFTT detector and the BRIEF descriptor is the most reliable choice for our SLAM system among the evaluated combinations.

The rest of the paper is organized as follows. Section 2 presents an overview of the related work, while Section 3 summarizes the most used feature detectors and descriptors in the visual SLAM literature. Section 4 briefly describes the stereo visual SLAM system used for the evaluation. In Section 5, we present the evaluation of the feature extractors and the achieved results. Section 6 is dedicated to the conclusions and future work.

2. Related Work

Several evaluations of feature extractors can be found in the literature. Each of them is driven by the particular application or issue at hand that it aims to address. For example, in [13], the authors evaluate several feature extractors in the context of autonomous navigation in outdoor environments under seasonal changes. They concluded that the best performing method is the STAR–BRIEF detector–descriptor combination, which outperforms SIFT by more than thirty percentage points. In addition, they argued that the STAR–BRIEF extractor is also less computationally demanding than the other extractors, and thus it seems to be the most suitable feature detector–descriptor for navigational purposes.

On the other hand, the authors of [14] provide a performance comparison of feature extractors under illumination changes in outdoor scenes in the context of visual navigation. They concluded that the FAST–SURF configuration is optimal in their setup. Besides, they report that this combination provides an effective computational time per image, which is favorable for real-time vision-based navigation applications.

The work [15] compares contemporary point feature detector and descriptor pairs in order to determine the best combination for robot visual navigation. The authors concluded that the FAST–BRIEF combination is a good choice when processing speed is an important parameter of the system setup. They also argued that, under camera movement conditions, the additional computational cost needed for descriptors and detectors that are robust to in-plane rotation and large scaling seems to be unjustified. However, they did not test the methods in a real SLAM application.

In contrast to the aforementioned evaluations of detectors and descriptors, the work presented in this paper is set within the context of full 6DOF SLAM.

3. Local Image Features

An image feature extractor consists of detection and description phases. The feature detector serves to locate salient areas of the image, while the feature descriptor captures information about the local neighborhood of the detected area. Here, we provide a brief overview of the feature detector and descriptor algorithms considered in this evaluation study.

SIFT – Scale Invariant Feature Transform [5]. An established feature detector with a high precision and good robustness, which is known to be computationally demanding.

SURF – Speeded Up Robust Features [6] is similar to SIFT, but it is computationally less demanding due to approximations.

STAR – A modified version of the CenSurE (Center Surrounded Extrema) [7] detector, which is computationally less demanding at the expense of a lower precision.
BRIEF – Binary Robust Independent Elementary Features [11] is a descriptor that describes an image area using a number of intensity comparisons of random pixel pairs. The result is stored as a binary string, which reduces the computational complexity of the subsequent matching.

FAST – Features from Accelerated Segment Test [9] is a feature detector focused on lowering the computational cost.

BRISK – Binary Robust Invariant Scalable Keypoints [12] is a scale and rotation invariant version of BRIEF; unlike BRIEF, it uses a deterministic comparison pattern.

ORB – Oriented FAST and Rotated BRIEF [10] is another attempt to achieve a scale and rotation invariant BRIEF, as a computationally efficient alternative to SIFT and SURF. It uses the FAST detector to achieve low computational requirements.

GFTT – A detector focused on selecting features relevant to motion tracking by analyzing the amount of information they provide for that particular task [8].

The SIFT and SURF descriptors rely on their own detectors, which are also considered in the presented evaluation. For the BRIEF and BRISK binary descriptors, the considered detectors are GFTT, FAST, and STAR, which results in six additional detector–descriptor combinations in the presented evaluation.

4. Overview of S-PTAM

S-PTAM [4] is a stereo visual SLAM method for large-scale navigation based on the monocular Parallel Tracking and Mapping (PTAM) method introduced in [1]. The method consists of two processes working in parallel: 1) tracking the detected features; and 2) building a map of the features (mapping).

During robot navigation, the method works as follows. S-PTAM extracts features from the incoming stereo images to match them and construct a virtual map of the environment. The newly extracted feature descriptors are matched against the descriptors of the points stored in the map according to the estimated field of view. The matches are then used to refine the estimated camera pose using an iterative least squares minimization method, e.g., the Levenberg-Marquardt algorithm. The stereo matches between features that cannot be matched to the map are triangulated and inserted as new map points for the tracking of future frames. In parallel, a map refinement process is running; it is also based on Levenberg-Marquardt optimization and continuously performs Bundle Adjustment on the current local portion of the map.

In [4], S-PTAM uses the GFTT feature detector and the BRIEF descriptor extractor. In this work, we consider other detector–descriptor combinations to evaluate the impact of this choice on the performance of the localization and mapping processes.
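To make the tracking step just described concrete, the following is a minimal sketch of one tracking iteration; it is not the S-PTAM implementation. It reduces the map to parallel arrays of 3D points and binary descriptors, and substitutes OpenCV's iterative solvePnP (which internally runs Levenberg-Marquardt) for the method's own pose optimizer; all function and parameter names are illustrative assumptions.

```python
import numpy as np
import cv2

def track_frame(map_points, map_descriptors, frame_keypoints,
                frame_descriptors, K, rvec0, tvec0):
    """Refine the camera pose by matching frame features against the map."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(frame_descriptors, map_descriptors)
    # 3D map points and their 2D observations in the current frame.
    pts3d = np.float32([map_points[m.trainIdx] for m in matches])
    pts2d = np.float32([frame_keypoints[m.queryIdx].pt for m in matches])
    # Iterative least-squares pose refinement (Levenberg-Marquardt),
    # seeded with the predicted pose (rvec0, tvec0).
    ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None, rvec0, tvec0,
                                  useExtrinsicGuess=True,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    # Features without a map correspondence are candidates for stereo
    # triangulation into new map points (omitted in this sketch).
    matched = {m.queryIdx for m in matches}
    unmatched = [i for i in range(len(frame_keypoints)) if i not in matched]
    return rvec, tvec, unmatched
```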
5. Evaluation

The KITTI Vision Benchmark Suite [16] is used to evaluate S-PTAM for each considered detector–descriptor configuration. In particular, we present the results obtained for sequence 00, shown in Figure 1. The sequence records stereo camera frames captured by a moving car in an urban scenario along an almost 4 km long path. The particular parameters of the evaluated feature extractors have been selected in such a way that allows S-PTAM to run without ever losing localization. They were tuned starting from strongly restrictive values, which were then relaxed until the method completes the whole sequence. The parameters are listed in Table 1.

Detector/Descriptor   Parameter           Value
SIFT                  nOctaveLayers       1
                      L2NormThreshold     100
SURF                  hessianThreshold    1000
                      nOctaves            1
                      L2NormThreshold     0.2
STAR                  responseThreshold   20
BRIEF                 bytes               32
                      hammingThreshold    25
FAST                  threshold           60
BRISK                 hammingThreshold    100
ORB                   nfeatures           2000
                      nLevels             1
                      hammingThreshold    50
GFTT                  nfeatures           2000
                      minDistance         15.0

Table 1. Parameters used for the feature detectors and descriptors. Parameters that do not appear in the list use the default values of the OpenCV implementation.

In the case of the binary descriptors, the Hamming distance is used to compute the valid matches, while the L2 norm is used for the SURF and SIFT descriptors.
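As an illustration of how one configuration from Table 1 can be instantiated, the sketch below assumes OpenCV 3.x with the opencv-contrib xfeatures2d module (where the BRIEF extractor lives); the factory names changed across OpenCV versions, and the image file names are placeholders.

```python
import cv2

# GFTT detector limited to 2000 corners at least 15 px apart (Table 1).
detector = cv2.GFTTDetector_create(maxCorners=2000, minDistance=15.0)
# BRIEF descriptor producing 32-byte (256-bit) binary signatures (Table 1).
extractor = cv2.xfeatures2d.BriefDescriptorExtractor_create(32)
# Binary descriptors are compared with the Hamming distance;
# SIFT/SURF would use cv2.NORM_L2 instead.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def extract(image):
    """Detect salient points and compute their binary descriptors."""
    keypoints = detector.detect(image)
    return extractor.compute(image, keypoints)  # -> (keypoints, descriptors)

# Hypothetical stereo pair; file names are placeholders.
left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)
kps_l, des_l = extract(left)
kps_r, des_r = extract(right)
# Keep only stereo matches below the Hamming threshold of 25 (Table 1).
matches = [m for m in matcher.match(des_l, des_r) if m.distance < 25]
```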
The evaluation has been performed on an Intel Core i7 processor with 4 cores running at 2.2 GHz. Although S-PTAM strongly exploits parallelism, the experiments were run in a sequential fashion, which allows us to simulate ideal conditions and abstract from the limitations of the available computational power. This ensures that no frames are dropped and that the iterative optimization routines always converge or reach a maximum threshold of iterations. Nevertheless, the tracking process performs pose optimization using an iterative algorithm; hence, the less time is spent on feature extraction, the more iterations the method can compute. Figure 2 shows a characterization of the total tracking time for each pair of frames, as achieved by the evaluated extractors.

Figure 1. Path tracked by every method run under different extractors, against the ground truth. The path is nearly 4 km long. The distances shown at the axes are in meters.

Figure 2. Total tracking time achieved by each configuration.

Moreover, the iterative least-squares optimization, which is utilized in the mapping and tracking processes, depends linearly on the number of tracked points (the density of the map). Thus, regarding the computational burden, the map should be as small as possible, while the map points should contain strong enough features to support robust tracking across the frames. Table 2 shows the final number of points contained in the map after finishing the trial for each particular combination of feature detector and descriptor. In Figure 3, we can see how the map size directly impacts the temporal performance of the tracking process. The detector–descriptor combinations that build the densest maps also take the longest time to compute.

Figure 3. Tracking time without taking into account feature extraction.

Differences in the map size between feature extractors sharing the same detector can have two reasons. The first reason is that new points are created from the stereo features only if these features are not matched to the map. The second reason is that the points marked as outliers during the refinement processes are discarded. In the first case, the growth can be caused by descriptors that are not robust enough to be matched to the map for a long time. In the second case, the descriptor matching may be too permissive, allowing bad matches that are later discarded as outliers.

Extractor        Final map size
GFTT / BRIEF          990 455
GFTT / BRISK        1 314 356
SIFT / SIFT         1 581 876
STAR / BRIEF        1 893 372
SURF / SURF         2 059 879
FAST / BRIEF        2 420 652
STAR / BRISK        2 447 418
FAST / BRISK        3 207 003
ORB / ORB           5 192 885

Table 2. The number of points contained in the map after completing the sequence for each evaluated extractor, in ascending order.

Since the goal of this work is to assess the impact of the feature extractor choice also on the accuracy of the SLAM method, the achieved performance is presented as two independent relative errors for each estimated pose: $\epsilon_t$ for the translation error and $\epsilon_\theta$ for the orientation error. Let $x_k$ be the estimated pose at frame $k$, which can be decomposed into the translation $t_k$ and the rotation $R_k$. Let $x^*_k$ be the reference pose, which can be decomposed in the same fashion. The aforementioned errors are computed as

$\epsilon_{t,k+1} = \left\| (t_k \ominus t_{k+1}) \ominus (t^*_k \ominus t^*_{k+1}) \right\|,$
$\epsilon_{\theta,k+1} = \mathrm{angle}\left( (R_k \ominus R_{k+1}) \ominus (R^*_k \ominus R^*_{k+1}) \right),$

where $\ominus$ is the inverse of the standard motion composition operator. For pure translations, we can rewrite $t_1 \ominus t_2 = t_2 - t_1$, and for pure rotations, $R_1 \ominus R_2 = R_1^T R_2$. $\|x\|$ stands for the Euclidean norm, and $\mathrm{angle}(R)$ extracts the magnitude of the rotation.
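The following is a direct numpy transcription of the relative error metrics above, assuming each pose is given as a 3×3 rotation matrix and a translation vector; it is an illustrative sketch, not the evaluation code used to produce the figures.

```python
import numpy as np

def rotation_angle(R):
    """angle(R): the rotation magnitude encoded by a 3x3 rotation matrix."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def relative_errors(R_k, t_k, R_k1, t_k1,
                    R_k_ref, t_k_ref, R_k1_ref, t_k1_ref):
    """Relative translation/orientation errors between frames k and k+1."""
    # Pure translations: t1 (-) t2 = t2 - t1 (estimated and reference deltas).
    d_est = t_k1 - t_k
    d_ref = t_k1_ref - t_k_ref
    eps_t = np.linalg.norm(d_ref - d_est)
    # Pure rotations: R1 (-) R2 = R1^T R2.
    dR_est = R_k.T @ R_k1
    dR_ref = R_k_ref.T @ R_k1_ref
    eps_theta = rotation_angle(dR_est.T @ dR_ref)
    return eps_t, eps_theta
```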
The computed errors are shown in Figure 4 and Figure 5, respectively. Although the angular deviation from the ground truth, shown in Figure 5, is similar for all methods, the same is not true for the translation error, as can be seen in Figure 4. The BRISK descriptor seems to be more reliable with the FAST detector, while the BRIEF descriptor performs better with the GFTT detector.

Figure 4. Relative translation error.

Figure 5. Relative orientation error.

For completeness, the absolute errors

$\epsilon'_{t,k} = \| t_k \ominus t^*_k \|,$
$\epsilon'_{\theta,k} = \mathrm{angle}\left( R_k \ominus R^*_k \right)$

are shown in Figure 6 and Figure 7.

Figure 6. Absolute translation error.

Figure 7. Absolute orientation error.

6. Conclusions

In this paper, we present an evaluation of the impact of different state-of-the-art image feature extractors on the performance of the SLAM method proposed in [4]. The KITTI Benchmark Suite dataset with ground truth is used to evaluate the achievable precision of the method for the different feature extractors. Based on the presented results, the main conclusion is that the GFTT detector is the most suitable choice for the best performance on the evaluated dataset. GFTT (combined with the BRIEF or BRISK descriptors) outperforms the other methods in terms of the required computational time and the map quality. Although the map density is far smaller, the computed translation error is similar to, and even slightly better than, the errors achieved by the other extractors. This insight can be interpreted as follows: the features most useful for navigation are extracted, while the descriptor also supports efficient matching, resulting in a more precise localization.

Recently, novel stereo feature extractors have been proposed, e.g., [17], which motivates us to consider them in S-PTAM. An evaluation of these novel extractors is a subject of our future work.

Acknowledgements

This work is a direct result of the bilateral cooperation program between the Czech Republic and Argentina, supported by the Argentinian project ARC/14/06 and by travel support of the Czech Ministry of Education under the project No. 7AMB15AR029. The work of J. Faigl is supported by the Czech Science Foundation (GAČR) under research project No. GJ15-09600Y.

References

[1] G. Klein, D. Murray. Parallel Tracking and Mapping for Small AR Workspaces. In ISMAR, pp. 1–10. IEEE Computer Society, Washington, DC, USA, 2007. doi:10.1109/ISMAR.2007.4538852.
[2] C. Mei, G. Sibley, M. Cummins, et al. RSLAM: A system for large-scale mapping in constant-time using stereo. International Journal of Computer Vision 94(2):198–214, 2011. doi:10.1007/s11263-010-0361-7.
[3] R. Mur-Artal, J. M. M. Montiel, J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. CoRR abs/1502.00956, 2015. doi:10.1109/TRO.2015.2463671.
[4] T. Pire, T. Fischer, J. Civera, et al. Stereo parallel tracking and mapping for robot localization. In IROS. 2015. (to appear).
[5] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110, 2004. doi:10.1023/B:VISI.0000029664.99615.94.
[6] H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded up robust features. In ECCV, vol. 3951 of Lecture Notes in Computer Science, pp. 404–417. Springer Berlin Heidelberg, 2006. doi:10.1007/11744023_32.
[7] M. Agrawal, K. Konolige, M. Blas. CenSurE: Center surround extremas for realtime feature detection and matching. In ECCV, vol. 5305 of Lecture Notes in Computer Science, pp. 102–115. Springer Berlin Heidelberg, 2008. doi:10.1007/978-3-540-88693-8_8.
[8] J. Shi, C. Tomasi. Good features to track. In CVPR, pp. 593–600. 1994. doi:10.1109/CVPR.1994.323794.
[9] E. Rosten, T. Drummond. Machine learning for high-speed corner detection. In ECCV, vol. 3951 of Lecture Notes in Computer Science, pp. 430–443. Springer Berlin Heidelberg, 2006. doi:10.1007/11744023_34.
[10] E. Rublee, V. Rabaud, K. Konolige, G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, pp. 2564–2571. 2011. doi:10.1109/ICCV.2011.6126544.
[11] M. Calonder, V. Lepetit, C. Strecha, P. Fua. BRIEF: Binary robust independent elementary features. In ECCV, vol. 6314 of Lecture Notes in Computer Science, pp. 778–792. Springer Berlin Heidelberg, 2010. doi:10.1007/978-3-642-15561-1_56.
[12] S. Leutenegger, M. Chli, R. Siegwart. BRISK: Binary robust invariant scalable keypoints. In ICCV, pp. 2548–2555. 2011. doi:10.1109/ICCV.2011.6126542.
[13] T. Krajník, P. de Cristóforis, M. Nitche, et al. Image features and seasons revisited. In European Conference on Mobile Robotics (ECMR). 2015. (to appear).
[14] Dzulfahmi, N. Ohta. Performance evaluation of image feature detectors and descriptors for outdoor-scene visual navigation. In ACPR, pp. 872–876. 2013. doi:10.1109/ACPR.2013.159.
[15] A. Schmidt, M. Kraft, M. Fularz, Z. Domagala. Comparative assessment of point feature detectors in the context of robot navigation. Journal of Automation, Mobile Robotics and Intelligent Systems 7(1):11–20, 2013.
[16] A. Geiger, P. Lenz, C. Stiller, R. Urtasun. Vision meets robotics: The KITTI dataset. IJRR 32(11):1231–1237, 2013. doi:10.1177/0278364913491297.
[17] R. Arroyo, P. Alcantarilla, L. Bergasa, et al. Fast and effective visual place recognition using binary codes and disparity information. In IROS, pp. 3089–3094. 2014. doi:10.1109/IROS.2014.6942989.