Acta Polytechnica CTU Proceedings 6:1–5, 2016
doi:10.14311/APP.2016.6.0001
© Czech Technical University in Prague, 2016, available online at http://ojs.cvut.cz/ojs/index.php/app

ON CONSTRUCTION OF A RELIABLE GROUND TRUTH FOR EVALUATION OF VISUAL SLAM ALGORITHMS

Jan Bayer∗, Petr Čížek, Jan Faigl
Czech Technical University, Faculty of Electrical Engineering, Technicka 2, Prague, Czech Republic
∗ corresponding author: bayerja1@fel.cvut.cz

Abstract. In this work, we address the problem of evaluating the localization accuracy of visual Simultaneous Localization and Mapping (SLAM) techniques. Quantitative evaluation of SLAM performance is usually done using the established metrics of the Relative pose error and the Absolute trajectory error, both of which require a precise and reliable ground truth. Such a ground truth is usually hard to obtain, since it requires an expensive external localization system. We propose to use the SLAM algorithm itself to construct a reliable ground truth by offline frame-by-frame processing of the captured dataset. The generated ground truth is suitable for the evaluation of different SLAM systems, as well as for tuning the parametrization of the on-line SLAM. The presented experimental results indicate the feasibility of the proposed approach.

Keywords: RGB-D SLAM, localization, ground truth construction, legged robot, rough terrain traversal, adaptive motion gait.

1. Introduction

Reliable localization is an essential prerequisite in many practical robotic scenarios; hence, accuracy is the most important parameter in the evaluation of newly developed localization techniques. There are numerous methods based on various principles that can localize a robot in its operational environment. However, benchmarking such localization methods requires a reliable ground truth together with established metrics (e.g., [1, 2]).

In this work, we address the problem of localization accuracy assessment in a visual Simultaneous Localization and Mapping (SLAM) task [3]. In principle, it is possible to use publicly available datasets, e.g., the Freiburg Uni dataset [1], the RGB-D SLAM dataset [2], or the KITTI dataset [4], to evaluate the accuracy of a SLAM method. However, a particular environment, setup, and sensory equipment might not exhibit properties similar to those of the standardized datasets. Therefore, we need to construct our own dataset and acquire a ground truth to assess the localization accuracy.

All the above-mentioned datasets are supplied with a centimeter-level precision ground truth provided by an external localization system such as a differential GPS [4], which is a standard GPS enhanced with ground reference stations at known geospatial positions. Other types of external localization systems used in publicly available datasets are a motion capture system [2] or a total station [5]. The former uses camera sensors to localize a robot marked with special patterns that are well-distinguishable in the environment. The latter measures the distance between the observed object and a static ground station using the time-of-flight principle; a 3-DOF position of the robot in a global reference frame is then obtained from the distance and the angles measured by the total station.

Figure 1. Experimental environment.
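The total-station measurement just described reduces to a spherical-to-Cartesian conversion. The following minimal Python sketch illustrates this; the function name and the angle conventions are our own illustrative assumptions, as real instruments differ in their frame definitions.

```python
import math

def total_station_position(distance, azimuth, elevation):
    """Hypothetical illustration of the total-station principle: a 3-D
    position follows from the measured range and the two measured angles
    (radians) by a spherical-to-Cartesian conversion in the station frame."""
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return x, y, z
```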
However, acquiring a reliable ground truth using any of the mentioned localization techniques can be a considerable problem, since all of the aforementioned systems are expensive. Thus, we propose to use the dataset itself for the ground truth construction by offline frame-by-frame processing of the dataset using the SLAM algorithm. In such a case, we are not interested in the on-line performance of the algorithm, but rather in the reliability and accuracy of the constructed trajectory estimate to be used as a ground truth. Such processing is assumed to provide the best trajectory estimate for the given dataset and localization technique. Moreover, by taking into account certain factors in the dataset construction, which are discussed further in this paper, it is possible to improve the reliability of the generated ground truth.

To verify our hypothesis, we report the results of such a ground truth construction based on the RGB-D SLAM method [6], compared to the ground truth provided by the WhyCon motion capture system [7]. We evaluated the proposed approach on a challenging dataset obtained with an RGB-D camera mounted on a hexapod walking robot crawling in rough terrain (see Fig. 1). Such a scenario requires full 6-DOF position estimation, while the camera is subjected to unpredictable motion; the resulting dataset is thus very different from the standardized datasets, which are obtained mainly with wheeled platforms.

The presented approach to ground truth construction is somewhat similar to the structure-from-motion task studied in the computer vision community. In [8], the authors use color images from the RGB-D Microsoft Kinect camera to recover its pose, but for scene and object recognition purposes. In our work, we focus solely on the localization accuracy rather than on the quality of the model (map) of the environment.

The paper is structured as follows. Section 2 explains our method of obtaining the ground truth for the evaluation and the used evaluation metrics. The setup of a practical experiment together with the evaluation results is described in Section 3. Concluding remarks and future work are given in Section 4.

2. Proposed solution

The main motivation of our work is to provide a reliable ground truth estimate whenever a precise external localization system is not available. For this reason, we propose to capture a dataset and process it with a SLAM system offline, frame by frame, with parameter settings that minimize the trajectory error. Afterward, the resulting trajectory estimate can be used as a ground truth against which online-generated trajectories are compared, since in deployment scenarios we are interested in the online performance of SLAM methods. Such an estimated ground truth can be used to compare different SLAM methods or to find the best parametrization for the online SLAM trajectory estimation.
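The overall offline procedure can be summarized by the following sketch. The `slam` interface is hypothetical and only stands in for whatever SLAM implementation is used (here, an RGB-D SLAM system); the parameter values mirror the frame-by-frame settings reported in Section 3.3.

```python
# Minimal sketch of the proposed offline ground-truth construction,
# assuming a hypothetical `slam` wrapper around the actual SLAM system.
def build_ground_truth(frames, slam):
    """Feed every recorded frame to the SLAM system without real-time
    constraints and return its trajectory estimate as the ground truth."""
    slam.configure(
        n_features=600,        # more features than an online run could afford
        n_p=16, n_g=4, n_r=1,  # long frame-comparison horizon
        optimize_every=1,      # run graph optimization after every frame
    )
    for rgb, depth, stamp in frames:
        slam.process_frame(rgb, depth, stamp)
    return slam.trajectory()   # one 6-DOF pose per frame
```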
2.1. Evaluation metrics

SLAM algorithms can be compared either by the quality of the generated map or by the accuracy of the estimated trajectory. As the evaluation metrics, we use the established metrics of the ATE (absolute trajectory error) and the RPE (relative pose error) presented in [2].

The ATE at frame $i$ is given as

$$F_i = Q_i^{-1}\, S\, P_i, \tag{1}$$

where $Q_i$ is the transformation matrix of the $i$-th ground-truth pose, $P_i$ is the transformation matrix of the $i$-th estimated pose, and $S$ is the rigid transformation that aligns the estimated trajectory with the ground truth. The ATE measures the absolute error of all estimated 6-DOF poses; hence, it can be split into ATEt and ATEφ, where ATEt is the Euclidean distance between an estimated position and the corresponding ground-truth position, and ATEφ is the absolute error of the estimated orientation.

The RPE measures the drift of the estimated trajectory (the error in the shape of the trajectory estimate) and is given by

$$E_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right), \tag{2}$$

where $\Delta$ is a fixed frame interval. We used $\Delta = 1$ for the evaluation, so the RPE measures the frame-to-frame drift of the visual odometry. The RPE can likewise be split into RPEt (the relative translational error) and RPEφ (the relative rotational error).
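A position-only sketch of the translational components of these metrics follows. It assumes time-synchronized (N, 3) position arrays; the full metrics of [2] operate on SE(3) transforms and also yield the rotational components ATEφ and RPEφ. The alignment $S$ is obtained here with the closed-form Horn/Kabsch method.

```python
import numpy as np

def align_horn(gt, est):
    # Closed-form least-squares rigid alignment S = (R, t) that maps the
    # estimated positions onto the ground truth (Horn/Kabsch method).
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, mu_g - R @ mu_e

def ate_t(gt, est):
    # gt, est: (N, 3) time-synchronized positions; returns per-frame ATEt.
    R, t = align_horn(gt, est)
    return np.linalg.norm(gt - (est @ R.T + t), axis=1)

def rpe_t(gt, est, delta=1):
    # Per-step drift over `delta` frames (position-only simplification).
    d_gt = gt[delta:] - gt[:-delta]
    d_est = est[delta:] - est[:-delta]
    return np.linalg.norm(d_gt - d_est, axis=1)
```

An aggregate of the per-frame errors, such as their mean, can then be reported per trajectory, as in Table 1 below.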
3. Evaluation

To test our hypothesis and demonstrate its feasibility, we designed a practical experiment in which we evaluated the localization accuracy provided by the offline frame-by-frame processing of a visual SLAM dataset captured by a hexapod robot traversing rough terrain, with a ground truth provided by an external localization system.

In particular, we focused on the evaluation with a structured-light camera and the RGB-D SLAM algorithm based on [6]. We chose the structured-light sensor because it is inexpensive and provides the SLAM method with metric information, unlike stereo and monocular SLAM methods, which can estimate the trajectory only up to scale. For the establishment of a reliable ground truth, we used the WhyCon motion capture system [7], which tracks circular markers attached to the robot using two Logitech HD Pro Webcam C920 cameras. The external localization provided a centimeter-level precision ground truth. We used the Asus Xtion Pro RGB-D camera for the visual SLAM, which acquires a color image together with depth measurements. The experimental dataset, captured using the RGB-D camera mounted on the hexapod walking platform, contains sequences of RGB and depth images at the full Asus Xtion Pro resolution of 640 × 480 px at 10 frames per second. The whole dataset for the evaluation was captured using the ROS framework [9].

3.1. RGB-D SLAM

The considered RGB-D SLAM system is based on [6] and operates as follows. First, positions of salient image features are extracted from the RGB image and supplemented by their depth provided by the RGB-D sensor. Then, a rigid transformation is found using the RANSAC [10] algorithm between the currently processed frame and a subset of already mapped frames. This subset consists of the $n_p$ directly preceding frames, $n_g$ graph-neighboring frames, and $n_r$ random frames from the whole map, which serve to detect large loop closures. The estimated pose is then added to the map (pose graph) and refined using the g2o graph optimization algorithm [11]. The optimization is especially beneficial in the case of large loop closures, when the whole trajectory estimate is improved.
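The frame-to-frame estimation step can be illustrated by the following minimal sketch. It assumes (N, 3) arrays of matched 3-D feature positions and, for simplicity, thresholds inliers by their 3-D distance, whereas the evaluated system thresholds the reprojection error in pixels; the iteration count and the 85% early-exit mirror the settings of Section 3.3, and the closed-form Kabsch fit is the same alignment used in the metrics sketch above.

```python
import numpy as np

def kabsch(src, dst):
    # Closed-form least-squares rigid transform with R @ src_i + t ≈ dst_i.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def ransac_rigid(src, dst, iters=1000, inlier_dist=0.03, stop_ratio=0.85):
    # src, dst: (N, 3) matched 3-D feature positions from two frames.
    rng = np.random.default_rng(0)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(src), size=3, replace=False)  # minimal sample
        R, t = kabsch(src[sample], dst[sample])
        inliers = np.linalg.norm(src @ R.T + t - dst, axis=1) < inlier_dist
        if inliers.sum() > best.sum():
            best = inliers
            if best.mean() >= stop_ratio:  # early exit on enough inliers
                break
    return kabsch(src[best], dst[best])    # refit on the largest inlier set
```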
3.2. Experimental setup

The experimental environment (see Fig. 1) used for capturing the dataset consists of a plane, low stairs, a ramp, and a square area filled with wooden blocks of different heights. The dataset has been captured by a hexapod walking robot equipped with an adaptive motion gait [12], which enables the robot to traverse difficult obstacles. The robot was guided by an operator along a squared path approximately 9 m long.

The experimental environment has been designed to pose some of the basic problems a SLAM method can face. These include a high rotation speed at the corners of the trajectory, which affects the precision of the rotation estimation. The stairs and the wooden-block parts of the trajectory induce rotations about all the geometric axes, and the descending ramp forces the SLAM method to deal with a limited number of image features. Moreover, the locomotion of the hexapod robot induces unpredictable and abrupt camera motions along the whole trajectory.

3.3. Results

Altogether, 5 trajectories were captured and analyzed to test our hypothesis. The ground truth for the first three trajectories, however, does not contain information about the heading of the robot; thus, only the translational components of the ATE and RPE can be verified.

The considered RGB-D SLAM parametrization for the frame-by-frame dataset processing has been the SURF [13] feature extractor detecting 600 features and keeping the 400 best matches, with a very long comparison horizon of $n_p = 16$, $n_g = 4$, and $n_r = 1$. The RANSAC stopping criteria have been set to 85% of inliers and 1000 iterations, considering a point an inlier if it lies within a range of 3 pixels. The optimization precision has been set to the best values, optimizing each individual frame. Such processing took more than 4 hours per trajectory on an Intel Core i5 machine.

To show the feasibility of our approach, we processed the dataset online as well. The parametrization for the online processing has been the SURF feature extractor detecting 400 features and keeping 250 of them, $n_p = 3$, $n_g = 3$, and $n_r = 1$, with 500 RANSAC iterations and optimization after every 10 frames. Such processing achieves an online performance of 5.2 frames per second on average on the same evaluation hardware.

The results are summarized in Table 1. The resulting trajectories for trials No. 3, No. 4, and No. 5 are depicted in Figure 2.

           Frame-by-frame processing               Online processing
 Trial   ATEt   ATEφ    RPEt   RPEφ   End dist.   ATEt   ATEφ    RPEt   RPEφ   End dist.
         [cm]   [rad]   [cm]   [rad]  [cm]        [cm]   [rad]   [cm]   [rad]  [cm]
 No. 1    4.00    –     0.92    –      7.11        9.60    –     1.13    –      3.50
 No. 2    4.32    –     0.54    –     12.54        8.13    –     0.56    –      8.93
 No. 3    3.68    –     0.40    –      5.21       22.02    –     0.87    –     33.15
 No. 4    9.97   0.08   1.21   0.01    9.13       12.50   0.09   1.31   0.01    7.12
 No. 5   11.44   0.06   0.80   0.01   15.96       15.64   0.06   0.94   0.02   27.55

Table 1. Trajectory estimation results.

Figure 2. Trajectory estimation results: (a) trial No. 3, (b) trial No. 4, (c) trial No. 5.

3.4. Discussion

Table 1 indicates that the average ATEt with the frame-by-frame processing is about 1% of the trajectory length, and the average RPEt is under 1 cm. Notice that the ATEt and RPEt values are always higher for the online processing. On the other hand, the differences in the ATEφ and RPEφ are negligible, although slight imperfections in the orientation estimation are the most influential factor in the accuracy of the localization [14]. The results also indicate that the frame-by-frame processing provides the best result obtainable by the given SLAM method. Thus, it is possible to use the estimated ground truth for the evaluation of the online versions of the same algorithm to find the best parametrization deployable online onboard the mobile robot.

Note that the resulting localization accuracy is also greatly influenced by loop closures arising when the robot revisits a previously mapped location. If there is no loop closure, the trajectory estimate is given by the visual odometry only, whose error grows without bound. If a loop closure is detected, the optimization process corrects the pose estimates along the whole trajectory. The phenomenon is illustrated in Figure 2c, where the loop did not close and the trajectory estimate thus exhibits a higher error in the final part of the trajectory, and in Figure 3, which presents the results of the frame-by-frame processing of the same trajectory with loop closing enabled and disabled. The results for the closed-loop trajectory are ATEt = 4.19 cm, RPEt = 0.94 cm, and an end distance of 7.33 cm, while for the open-loop trajectory they are ATEt = 8.72 cm, RPEt = 0.96 cm, and an end distance of 13.73 cm.

Figure 3. Comparison of closed-loop and open-loop frame-by-frame processed trajectories.

Although the drift of the visual odometry represented by the RPEt is similar in both cases, the resulting end distance and absolute trajectory error are higher in the open-loop scenario. Thus, we recommend designing the experimental trajectory to exploit loop closures to obtain a better ground truth estimate. Note that a prerequisite of our method is the absence of systematic errors, which could significantly decrease the precision of the ground truth estimate. To avoid systematic errors, we recommend using a special shape of the robot's path, e.g., a square.
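To make the loop-closure effect discussed above concrete, the following self-contained toy example (our own illustration, not the actual g2o/SE(3) formulation) shows how a single loop-closure constraint redistributes the accumulated odometry drift over the whole trajectory via least squares.

```python
import numpy as np

# Toy pose graph: a robot walks a square and returns to the start.
# Odometry measures each side with noise; one loop-closure constraint
# ("last pose equals first pose") lets least squares spread the
# accumulated drift over the whole trajectory, the same effect the
# pose-graph optimization has, shown here with positions only.
true_steps = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], float)
rng = np.random.default_rng(1)
odom = true_steps + rng.normal(0.0, 0.05, true_steps.shape)  # noisy odometry

n = len(odom) + 1                      # poses x_0 .. x_4
A, b = [], []
for i, z in enumerate(odom):           # odometry constraints x_{i+1} - x_i = z
    r = np.zeros(n); r[i + 1], r[i] = 1.0, -1.0
    A.append(r); b.append(z)
r = np.zeros(n); r[n - 1], r[0] = 1.0, -1.0
A.append(r); b.append(np.zeros(2))     # loop closure: x_4 - x_0 = 0
r = np.zeros(n); r[0] = 1.0
A.append(r); b.append(np.zeros(2))     # gauge fix: x_0 = 0

x, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
print("open-loop end error :", np.linalg.norm(odom.sum(axis=0)))
print("optimized end error :", np.linalg.norm(x[-1] - x[0]))
```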
4. Conclusion

In this paper, we have studied the problem of localization accuracy assessment in a SLAM task when a reliable ground truth provided by an external localization system is not available. We have proposed to use frame-by-frame processing of the captured dataset to obtain a sufficiently reliable ground truth for the quantitative evaluation of online SLAM systems. The estimated ground truth is suitable for the comparison of different SLAM methods or for determining the best parametrization of the online SLAM. Note that the important aspects of the dataset creation are the loop closing, which can significantly improve the robot localization, and the design of the shape of the path so that systematic errors are avoided. We tested the presented method with a hexapod crawling robot and an RGB-D SLAM system, but we believe the proposed method is applicable to different SLAM methods and different platforms as well. The presented results support the feasibility of the proposed approach, which can be utilized whenever expensive and hard-to-set-up external localization systems are not available or suitable.

Acknowledgements

The presented work was supported by the Czech Science Foundation (GAČR) under research project No. 15-09600Y. The support of grant No. SGS16/235/OHK3/3T/13 to Petr Čížek is also gratefully acknowledged.

References

[1] R. Kümmerle, B. Steder, C. Dornhege, et al. On measuring the accuracy of SLAM algorithms. Autonomous Robots 27(4):387–407, 2009. doi:10.1007/s10514-009-9155-6.
[2] J. Sturm, N. Engelhard, F. Endres, et al. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE International Conference on Intelligent Robots and Systems, pp. 573–580. 2012. doi:10.1109/IROS.2012.6385773.
[3] S. Thrun, W. Burgard, D. Fox. Probabilistic robotics. MIT Press, 2005. doi:10.1162/artl.2008.14.2.227.
[4] M. Menze, A. Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition, pp. 3061–3070. 2015. doi:10.1109/CVPR.2015.7298925.
[5] N. Sünderhauf, K. Konolige, S. Lacroix, P. Protzel. Visual odometry using sparse bundle adjustment on an autonomous outdoor vehicle, pp. 157–163. Springer Berlin Heidelberg, 2006. doi:10.1007/3-540-30292-1_20.
[6] F. Endres, J. Hess, N. Engelhard, et al. An evaluation of the RGB-D SLAM system. In IEEE International Conference on Robotics and Automation, pp. 1691–1696. 2012. doi:10.1109/ICRA.2012.6225199.
[7] T. Krajník, M. Nitsche, J. Faigl, et al. A practical multirobot localization system. Journal of Intelligent & Robotic Systems 76(3-4), 2014. doi:10.1007/s10846-014-0041-x.
[8] S. Bao. Semantic structure from motion. In International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2025–2032. 2011. doi:10.1109/CVPR.2011.5995462.
[9] M. Quigley, K. Conley, B. P. Gerkey, et al. ROS: an open-source Robot Operating System. In IEEE International Conference on Robotics and Automation (ICRA) – Workshop on Open Source Robotics. 2009.
[10] M. A. Fischler, R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6):381–395, 1981.
[11] R. Kümmerle, G. Grisetti, H. Strasdat, et al. g2o: A general framework for graph optimization. In IEEE International Conference on Robotics and Automation, pp. 3607–3613. 2011. doi:10.1109/ICRA.2011.5979949.
[12] J. Mrva, J. Faigl. Tactile sensing with servo drives feedback only for blind hexapod walking robot. In 10th International Workshop on Robot Motion and Control (RoMoCo), pp. 240–245. 2015. doi:10.1109/RoMoCo.2015.7219742.
[13] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3):346–359, 2008. doi:10.1016/j.cviu.2007.09.014.
[14] P. Čížek, J. Faigl. On localization and mapping with RGB-D sensor and hexapod walking robot in rough terrains. In IEEE International Conference on Systems, Man, and Cybernetics, pp. 2273–2278. 2016.