Acta Polytechnica CTU Proceedings 6:28–33, 2016, doi:10.14311/APP.2016.6.0028
© Czech Technical University in Prague, 2016, available online at http://ojs.cvut.cz/ojs/index.php/app

MAPPING THE SURROUNDINGS AS A REQUIREMENT FOR AUTONOMOUS DRIVING

Martin Steininger∗, Christoph Stephan, Christian Böhm, Fabian Sauer, Roland Zink
Faculty of Electrical, Media and Computer Engineering, Deggendorf Institute of Technology, 94469 Deggendorf, Germany
∗ corresponding author: martin.steininger@stud.th-deg.de

Abstract. Motivated by the hype around driverless cars and the challenges of sensor integration and data processing, this paper presents a model for using an Xbox One Microsoft Kinect stereo camera as a sensor for mapping the surroundings. Today, the recognition of the car's environment is mostly done by a mix of sensors such as LiDAR, RADAR and cameras. For the outdoor delivery challenge Robotour 2016, with model cars at a scale of 1:5, our goal is to solve the task with one camera only. To this end, a three-stage approach was developed. The test results show that our approach can detect and locate objects at a range of up to eight meters in order to incorporate them as barriers in the navigation process.

Keywords: Microsoft Kinect, Mapping, 3D Mapping, Robotour, Autonomous Driving.

1. Introduction

In addition to the extensive launch of full electric vehicles (FEV), autonomous driving is deemed to be a key research and development trend in the automotive industry. The entire industry is working on the topic and has already developed first solutions for automatic braking and lane keeping (e.g. Audi's "Active lane assist" or BMW's "Active Assist") or even autopilots for driverless cars such as Tesla Motors' "Autopilot" or Google's "Driverless Car" [1]. The navigation and self-driving algorithms in use run on maps [2] and require very accurate digital (geo-)data of the environment [3]. The spatial coverage of the surroundings as a basis for autonomous navigation of the vehicle is therefore common to all applications.

Consequently, autonomous driving – regardless of the respective task, such as automatic braking, adaptive cruise control, lane keeping or fully autonomous driving – requires detection and imaging of the space the vehicle will pass through. Photo cameras, LiDAR, RADAR or other sensors continuously detect and measure the surroundings of the car, creating a digital spatial image of the area (a map). Combined with positioning services such as GPS or Galileo, the location of the vehicle can be determined and set in relation to surrounding objects such as other vehicles or obstacles. However, the result of the permanent mapping process is not an ordinary cartographic map but digital images that are stored temporarily as server frames for the navigation. The navigation of the car finally relies on mathematical functions (algorithms) to traverse the digital and the real space along the shortest (spatial) or fastest (temporal) route.

This paper addresses the issue of space detection, which is considered an essential basis of autonomous driving and robot navigation (see for example [4]), and describes the use of a Microsoft Xbox One Kinect stereo camera for autonomous driving in the outdoor delivery challenge Robotour 2016 (robotika.cz). We present an approach for mapping the surroundings comprising the three steps (1) 3D capturing, (2) projection and filtering and (3) object detection.
The hardware and the approach are implemented on a four-wheel-drive off-road buggy at a scale of 1:5. The main hardware consists of a Congatec TS180 COM Express module. For the main processing, it communicates with the Xbox One Kinect and a Raspberry Pi and displays the crucial information on a Krämer V-800 touchscreen. The Raspberry Pi reads the sensor values from the GPS receiver and the Bosch BNO055 orientation sensor and transmits them to the Congatec module. It receives the steering and speed values from the Congatec module and forwards them to the Freescale KL25Z microcontroller, which in turn controls the RC car model via PWM signals.

2. Approach

The development path of the approach to map the surroundings for an autonomously driving vehicle should be well-defined to avoid inconsistencies. The first step is answering the question of which hardware will be used to detect objects. With the goal of using only a Kinect camera as the sensor, the approach relies on processing the depth and standard colour images. These data represent a 2D and a 3D model of the environment, respectively. The approach is characterised by three development steps:
• 3D capturing,
• projection,
• object recognition.

Figure 1. Spatial mapping of the current surroundings: the Kinect camera feeds 3D capturing, projection and filtering, and obstacle detection.

The approach is designed to simplify complex processes, describe them properly and visualise all of the main development steps in a manner that is easy to understand. According to the development steps, the approach is divided into three processes, see Figure 1. First, the Kinect camera films the space in front of the vehicle in order to capture potential obstacles. The second step is the projection of the depth images onto a 2D ground plane. Finally, the obstacles or objects are recognised in the third step. The three main steps are explained below.

3. Related work

In 2002 Sebastian Thrun (see [5]) showed that the problem of terrain mapping is to generate a spatial model of a robot's or vehicle's environment using sensors and cameras. Several subproblems arise:
• measurement noise,
• the robot must choose its way during the mapping process,
• the environment changes over time,
• possibly the hardest problem, the so-called data association problem: associating sensor measurements with the same physical object in the real world.

"Mapping is largely considered the most difficult perceptual problem in robotics." [5, p. 6]. In 1989 Herbert et al. already presented the so-called locus algorithm to build terrain maps based on a single active range sensor, by "computing the intersection of the surface observed by the sensor with the vertical line passing through (x,y)" [6, p. 998]. Fankhauser et al. [4, p. 2] introduced an elevation mapping approach from a "robot-centric perspective", with relative distances to the robot and a permanent update of the entire elevation map using information about the motion of the robot.

Kleiner and Dornhege presented their autonomous driving vehicle for Urban Search and Rescue (USAR) with two different methods: first, the rapid mapping of a large-scale environment by wheeled robots, and second, the mapping of rough terrain by tracked robots [7]. Hornung et al. [6] developed an open-source library to represent the environment in memory using octrees.
The cells of the discretised space can have one of the following states: unknown, free or occupied. To map the sensor data into the octree, they use a probabilistic sensor model. Thrun et al. [8] describe Stanley, the winner of the 2005 DARPA Grand Challenge. The robot was built by Stanford University researchers and uses laser range finders as well as RADAR sensors for sensing the environment. The evaluation of the sensor data is supported by artificial intelligence and machine learning algorithms.

Figure 2. Overview of the Kinect camera.

4. 3D capturing

Whereas a 3D sensor is usually a rather expensive component, the Kinect is a cheap alternative considering the functions it provides. The camera does most of the calculations on its own SoC, thus reducing the processing effort for the main unit. The first version of the Microsoft Kinect camera was developed by PrimeSense, an Israeli 3D sensing company, which was bought by Apple in 2013. After that, Microsoft created its own system based on the existing PrimeSense technology [9].

4.1. Overview

The Kinect camera has an array of sensors, which can be divided into three categories: a depth sensor (green), a colour sensor (red) and an audio sensor (blue), as seen in Figure 2. The audio sensor is not used for the 3D reconstruction. The colour sensor consists of a CMOS camera with an infrared light filter. The main use of the colour sensor is the track detection. The colour image can be mapped onto the depth information, although the resolution of the colour sensor is higher than that of the depth sensor. That is the reason why there are multiple colour pixels per depth pixel. If, conversely, there were more depth pixels than colour pixels, the colour pixels would need to be split into multiple smaller pixels carrying the same information.

The depth sensor consists of a CMOS camera and an infrared LED blaster. The CMOS camera is filtered so that it only sees the near-infrared light emitted by the infrared LED blaster. All the information is processed in a specially designed SoC (Figure 3). This allows the SoC to compute a depth map containing the depth for every pixel and to map the colour pixels onto it. The result of the processing is a data stream containing an audio, a video and a depth stream.

Figure 3. 3D reconstruction of the Kinect [11, 12].

4.2. 3D reconstruction

There are multiple ways of reconstructing depth information from an image. The classic way needs two calibrated cameras at a defined distance. This allows the distance to an object to be calculated with trigonometric formulas. The difficulty lies in matching the objects, resulting in the so-called correspondence problem, which "refers to the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos." [10]. The Kinect overcomes this problem by using a method called "Time of Flight" (ToF). In this method the travel time of infrared light, which is emitted by an LED, is measured for each pixel, as seen in Figure 3. "A strobed infrared light illuminates the scene, the light is reflected by obstacles, and the time of flight for each pixel is registered by the infrared camera. Internally, wave modulation and phase detection is used to estimate the distance to obstacles (indirect ToF)" [13].
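To make the indirect ToF principle quoted above more concrete, the following minimal sketch (Python; not part of the original implementation) converts a measured phase shift of the modulated infrared signal into a distance. The modulation frequency and the phase value are illustrative assumptions only; the Kinect's SoC combines several modulation frequencies internally.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def distance_from_phase(phase_shift_rad: float, mod_freq_hz: float) -> float:
    """Indirect ToF: the reflected signal arrives phase-shifted by
    delta_phi = 2*pi*f_mod*(2*d/c), hence d = c*delta_phi / (4*pi*f_mod).
    The result is unambiguous only for distances below c / (2*f_mod)."""
    return C * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

# Illustrative example with an assumed 16 MHz modulation frequency:
# the unambiguous range of ~9.4 m is of the same order as the ~8 m
# usable range mentioned later in the text.
f_mod = 16e6
print(distance_from_phase(math.pi / 2, f_mod))  # ~2.3 m
print(C / (2 * f_mod))                          # ambiguity limit, ~9.4 m
```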
A practical problem is that the infrared light can be washed out by other near-infrared light sources such as sunlight, in which case the Kinect is not able to detect the infrared light and loses the depth information.

5. Projection and filtering

The Kinect camera is mounted on the chassis at an arbitrary but fixed angle, tilted relative to the car's chassis. The exact angle between the camera and the ground is not known. It could be measured, but to be independent of that setting, the ground plane coefficients returned by the Kinect are used. The coordinate system of the returned 3D point cloud (referred to as the camera coordinate system in the following, indicated with a camera subscript) does not match the coordinate system of the car (referred to as the car coordinate system in the following, indicated with a car subscript). Both coordinate systems are shown in Figure 4.

The body frame which can be requested from the camera contains a four-dimensional vector representing the floor coefficients, $\vec{p}_{\mathrm{coeff}} = (A, B, C, D)$.

Figure 4. The camera coordinate system (green) and the car coordinate system (blue) [11, 12].

These coefficients are combined in (1) to obtain the plane equation for further mathematical operations:

$0 = Ax + By + Cz + D$ (1)

In a first step, the distance of every point to the ground is calculated. A point is considered to be part of the ground if a certain limit is not exceeded; accordingly, points whose distance to the ground is smaller than that limit are removed from the point collection. The remaining points are processed in the following steps.

Each frame of the Kinect has a resolution of 512 by 424 pixels and can thereby yield a large number of 3D pixels. A pixel is returned if it lies within the specified range of the camera. The nearest distance the camera returns a pixel for is about half a meter; the maximum distance is about eight meters. So there is quite a large number of points waiting to be processed, and the rather small computing system with limited processing power on the car must handle this in real time. For the presented project, a real-time reaction is defined as a latency of a few hundred milliseconds. To reach this goal, one issue is making the algorithm faster. This could be realised by:
• removing as many points as possible to save energy and processing effort when converting the points from one coordinate system to another,
• adjusting how the data is processed step by step,
• using a programming language with little processing overhead.

To reduce the further processing effort, the points are filtered next. A parameter named MinDistance is introduced in the GUI (graphical user interface); points must have a greater distance from the camera than this value, which prevents parts of the car from being recognised as obstacles. This is needed due to the current setup of the car. There may be points above the height of the car, so that the car can drive beneath them; these points are removed if they lie above a maximum height over the ground plane, defined by a parameter named MaxHeight. Points far away are also not of interest at the moment and are filtered with an allowed maximum distance named MaxDistance. A sketch of this filtering step under assumed parameter values is given below.
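The ground removal and the MinDistance/MaxHeight/MaxDistance filtering described above can be sketched as follows (Python/NumPy; this is not the project's code, and all numeric thresholds are placeholder assumptions). The point cloud is given as an array of camera-space coordinates and the floor plane as the coefficients $(A, B, C, D)$ from (1).

```python
import numpy as np

def filter_points(points, plane, ground_eps=0.05,
                  min_distance=0.6, max_height=0.5, max_distance=8.0):
    """Filter an (N, 3) array of camera-space points.

    plane       -- floor coefficients (A, B, C, D) with 0 = Ax + By + Cz + D
    ground_eps  -- points closer than this to the floor count as ground (assumed 5 cm)
    min_distance, max_height, max_distance -- MinDistance, MaxHeight and
                   MaxDistance from the text (the values here are placeholders)
    """
    a, b, c, d = plane
    normal = np.array([a, b, c], dtype=float)
    norm = np.linalg.norm(normal)

    # Signed height of every point above the ground plane
    # (assuming the plane normal points upwards, away from the ground).
    height = (points @ normal + d) / norm

    # Euclidean distance of every point to the camera (origin of the camera frame).
    dist = np.linalg.norm(points, axis=1)

    keep = np.abs(height) > ground_eps   # drop ground points
    keep &= height <= max_height         # drop points the car can drive beneath
    keep &= dist >= min_distance         # drop points belonging to the car itself
    keep &= dist <= max_distance         # drop far-away points
    return points[keep]
```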
To simplify the explanation of the coordinate transformation, some helper functions are introduced. The first three equations deal with projections: projecting a vector onto another vector $\vec{v}_{onto}$ (2), projecting a vector onto a plane $E: 0 = (\vec{x} - \vec{r}) \cdot \vec{n}$ (3), and projecting a point with position vector $\vec{x}$ onto a plane (4):

$f_{v2v}(\vec{v}_{orig}, \vec{v}_{onto}):\ v_{proj} = \dfrac{\vec{v}_{orig} \cdot \vec{v}_{onto}}{|\vec{v}_{onto}|}$ (2)

$f_{v2p}(\vec{v}_{orig}):\ \vec{v}_{proj} = \vec{v}_{orig} - f_{v2v}(\vec{v}_{orig}, \vec{n})\,\vec{n}$ (3)

$f_{p2p}(\vec{x}_{orig}):\ \vec{x}_{proj} = \vec{x}_{orig} - f_{v2v}(\vec{x}_{orig} - \vec{r}, \vec{n})\,\vec{n}$ (4)

Next, the normalisation is defined in (5):

$f_{n}(\vec{v}_{orig}):\ \vec{v}_{norm} = \dfrac{\vec{v}_{orig}}{|\vec{v}_{orig}|}$ (5)

Since the source coordinate system of the Kinect is three-dimensional and the destination system is only two-dimensional, it is necessary to project every detected 3D pixel onto the ground plane. Within the three-dimensional coordinate space of the Kinect we define a two-dimensional coordinate space as the new coordinate system. The new coordinate system can be seen as a plane in the original space; therefore an origin and two vectors for the two axes of that coordinate system are needed.

The two-dimensional coordinate system for the obstacle detection is embedded in the three-dimensional coordinate system of the Kinect camera. The origin is the projection of the camera position onto the ground plane, $C_{camera} = f_{p2p}(O)$, with $O$ being the origin of the camera coordinate system. The axes are defined as $\vec{e}_x = f_n(f_{v2p}((0, 0, -1)))$ and $\vec{e}_y = f_n(f_{v2p}((-1, 0, 0)))$. The orientation of the axes is shown in Figure 4. Now, (6) can be used to project all points into the new car coordinate system:

$P_{car} = \begin{pmatrix} f_{v2v}(P_{camera} - C_{camera}, \vec{e}_x) \\ f_{v2v}(P_{camera} - C_{camera}, \vec{e}_y) \end{pmatrix}$ (6)

Next, the obstacle detection will use these projected points and create a barrier along the nearest edge of the recognised obstacles. A code sketch of these helper functions is given below.
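A direct transcription of (2)–(6) into code might look like the following sketch (Python/NumPy; not the project's implementation). The function names mirror the notation above; unlike the text, the plane point $\vec{r}$ and unit normal $\vec{n}$ are passed explicitly, and a small helper derives them from the floor coefficients of (1).

```python
import numpy as np

def plane_from_coeffs(a, b, c, d):
    """Unit normal n and a point r on the floor plane 0 = Ax + By + Cz + D."""
    n = np.array([a, b, c], dtype=float)
    norm = np.linalg.norm(n)
    n /= norm
    r = -d / norm * n          # closest point of the plane to the origin
    return r, n

def f_v2v(v_orig, v_onto):
    """(2): scalar projection of v_orig onto v_onto."""
    return np.dot(v_orig, v_onto) / np.linalg.norm(v_onto)

def f_v2p(v_orig, n):
    """(3): project a vector onto the plane with unit normal n."""
    return v_orig - f_v2v(v_orig, n) * n

def f_p2p(x_orig, r, n):
    """(4): project a point onto the plane through r with unit normal n."""
    return x_orig - f_v2v(x_orig - r, n) * n

def f_n(v_orig):
    """(5): normalisation."""
    return v_orig / np.linalg.norm(v_orig)

def to_car_frame(points, r, n):
    """(6): map (N, 3) camera-space points to 2D car-frame coordinates."""
    c_camera = f_p2p(np.zeros(3), r, n)               # camera projected onto the ground
    e_x = f_n(f_v2p(np.array([0.0, 0.0, -1.0]), n))   # first car axis in camera space
    e_y = f_n(f_v2p(np.array([-1.0, 0.0, 0.0]), n))   # second car axis in camera space
    rel = points - c_camera
    # e_x and e_y are unit vectors, so the dot products equal f_v2v.
    return np.stack([rel @ e_x, rel @ e_y], axis=1)
```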
6. Obstacle detection

When implementing algorithms for autonomous vehicles, it is mandatory to detect and tag obstacles in front of the craft. As described, a depth image is provided by the Kinect camera, from which each single depth pixel is projected onto the ground plane with the correct distance and position relative to the camera's point of view during the first stage of image processing.

Figure 5. Logarithmic distance scaling for image size reduction purposes.

As a first step, the depth points contributed by the Kinect camera and their projection onto the ground plane are transformed into a 2D image. The next three steps described in this section are:
• distance scaling,
• the Gaussian function,
• and the actual object and obstacle detection.

6.1. Distance scaling

The project team faced problems concerning computing power: the large number of pixels coming from a depth frame provided by the Kinect camera exceeded the available and limited hardware. To reduce the computing effort for processing the Gaussian function (see subsection 6.2), the project team decided to scale the projected image's height logarithmically. As shown in Figure 5, this scaling of distances provides more detailed image information close to the camera's origin than at a larger distance from this specific point.

The distance obtained from the Kinect camera is a floating-point number, while the index needed to reference a pixel within the image is an integer. If the distance were simply rounded to the next integer value, there would be only one pixel per meter. Because of this, a parameter is introduced which represents the resolution of the image's distances. The scaled value is divided by this parameter and then rounded, which yields the integer index into the image. For example, 0.5 m in reality is mapped to a value of 0.585 after the logarithmic scaling, while 8 m in the real world, which is the Kinect camera's maximum recognition distance, corresponds to a scaled value of 3.171.

Figure 6. Gaussian distribution in 2D.

As can be seen in Figure 7, the opening angle of the camera is segmented with a specific resolution, which can be set as a parameter. Every segment is mapped to a single rectangular column of the projected image. The advantage of this approach is that near pixels are richer in detail than more distant pixels.

6.2. Gaussian function

The next step is to smooth the depth image in order to reduce single error pixels or noise and to smooth pixel clouds for further, more precise computations. Although there are several options for smoothing images, the chosen solution is the Gaussian filter

$G(x, y) = \dfrac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$ (7)

Equation (7) shows the Gaussian function for two dimensions, where x is the distance from the origin on the horizontal axis, y is the distance from the origin on the vertical axis, and σ is the standard deviation of the Gaussian distribution. The distribution is shown in Figure 6 for the two-dimensional case. This formula produces a surface of concentric circles with a Gaussian distribution around the centre point. The new value of each pixel is set to a weighted average of that pixel's neighbourhood: the original pixel receives the highest Gaussian weight, and neighbouring pixels receive smaller weights as their distance to the original pixel increases. This results in a blur effect [14, 15].

Figure 7 shows three identical depth frames which are already projected onto the ground plane (shown in a top view) and transformed into a grey image. On the left-hand side, the frame is displayed without any filtering or blurring; there are many false pixels, noise and distortion. A Gaussian filter is already used in the middle frame, nevertheless quite a lot of noise and other false pixels can still be seen. The approach works in practice but needs a little fine tuning, so a smoother configuration of the Gaussian settings was used. The third frame on the right-hand side displays very little distortion, noise and false pixels, but provides clearly identifiable pixel clouds. These pixel clouds mostly represent objects or other obstacles within the range of the Kinect camera.

Figure 7. Three identical frames: no Gaussian filtering in frame 1, minimal Gaussian filtering in frame 2, and the Gaussian filtering settings best suited to the project's obstacle detection in frame 3.

6.3. Detect obstacles

The last step before actually being able to detect obstacles is removing unnecessary pixels which are still in the image after applying the Gaussian filter. In the scope of the project, a parameter was added to set a tolerance for remaining false or error pixels: threshold. This is necessary because, after the Gaussian filtering, the image no longer contains only black and white pixels (values 0 and 255) but also values in between. The final step is to find objects in the distance-scaled and Gaussian-filtered image; a sketch of the scaling, smoothing and thresholding steps is given below.
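The rasterisation, smoothing and thresholding steps can be sketched as follows (Python/NumPy/SciPy; this is not the project's code). The paper does not state the exact scaling function; log2(1 + d) is used here as an assumption because it reproduces the quoted values (0.5 m → 0.585, 8 m → ≈3.17). The resolution, kernel size, σ and threshold values are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d

def distance_to_row_index(distance_m, resolution=0.05):
    """Logarithmic scaling: more image rows per metre close to the camera.
    log2(1 + d) matches the example values in the text (assumption);
    'resolution' plays the role of the distance-resolution parameter."""
    scaled = np.log2(1.0 + distance_m)      # 0.5 m -> 0.585, 8 m -> 3.17
    return int(round(scaled / resolution))  # integer row index into the image

def gaussian_kernel(size=7, sigma=1.5):
    """Discrete 2D kernel built from equation (7), normalised to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()

def smooth_and_threshold(image, sigma=1.5, threshold=40):
    """Blur the projected grey image and suppress weak (noise) pixels."""
    blurred = convolve2d(image, gaussian_kernel(7, sigma), mode="same")
    return np.where(blurred >= threshold, 255, 0).astype(np.uint8)
```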
After preparation of the given depth image with Gaussian filtering and logarithmic distance scaling, the final step for finding and detecting objects in range is to find pixel clouds with a defined minimal distance (parameter in the algorithm: minDistanceBetweenObstacles, in meters) to other clouds and therefore to other obstacles. When a pixel cloud (i.e. an obstacle) is detected with no other obstacle within minDistanceBetweenObstacles, it is marked as a detected obstacle. A sketch of this grouping step is given below.
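The paper does not name the grouping algorithm, so the following sketch (Python/SciPy; all details beyond the minDistanceBetweenObstacles parameter are assumptions) simply labels connected pixel clouds in the thresholded image and reports one obstacle per cloud that has no closer, already accepted neighbour. For simplicity, a uniform metres-per-pixel factor is assumed, although the real image is logarithmically scaled.

```python
import numpy as np
from scipy import ndimage

def detect_obstacles(binary_image, min_distance_between_obstacles=0.5,
                     metres_per_pixel=0.05):
    """Group white pixels of the thresholded image into detected obstacles."""
    labels, count = ndimage.label(binary_image)  # connected pixel clouds
    centroids = ndimage.center_of_mass(binary_image, labels,
                                       range(1, count + 1))

    obstacles = []
    for centroid in map(np.array, centroids):
        # Only keep this cloud if no already accepted obstacle lies within
        # minDistanceBetweenObstacles of it.
        too_close = any(
            np.linalg.norm(centroid - other) * metres_per_pixel
            < min_distance_between_obstacles
            for other in obstacles
        )
        if not too_close:
            obstacles.append(centroid)  # marked as a detected obstacle
    return obstacles
```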
7. Conclusions

Autonomous driving requires continuous mapping of the surroundings. The presented approach and hardware setup show that this mapping can be done with a Kinect camera in the case of model cars at a scale of up to 1:5. With the 3D capturing of the surroundings, the projection and filtering, and the obstacle detection, it is possible to obtain a temporary map of the car's environment as a basis for the navigation process. The navigation itself relies on the A* algorithm and uses the projected map, including the detected obstacles, as the digital, weighted spatial information.

In a next step, we are planning to test the whole configuration and the navigation process under different conditions, e.g. different surfaces or different weather conditions. The aim is to obtain feedback on the use and the performance of the stereo camera and the navigation algorithm. Beyond the self-driving car, the presented approach can be implemented in other vehicles, such as unmanned aerial vehicles (UAV) and autonomous underwater vehicles (AUV), or even in robots for different services.

References

[1] A. M. Kessler. Elon Musk says self-driving Tesla cars will be in the U.S. by summer. The New York Times, 2015.
[2] A. Fisher. Google's road map to global domination. The New York Times, 2013.
[3] A. Geiger, P. Lenz, R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361, 2012. doi:10.1109/CVPR.2012.6248074.
[4] P. Fankhauser, M. Bloesch, C. Gehring, et al. Robot-centric elevation mapping with uncertainty estimates. In International Conference on Climbing and Walking Robots (CLAWAR), EPFL-CONF-198746, 2014. doi:10.1142/9789814623353_0051.
[5] S. Thrun, et al. Robotic mapping: A survey. In Exploring Artificial Intelligence in the New Millennium, vol. 1, pp. 1–35, 2002.
[6] M. Herbert, C. Caillas, E. Krotkov, et al. Terrain mapping for a roving planetary explorer. In IEEE International Conference on Robotics and Automation, pp. 997–1002, 1989. doi:10.1109/ROBOT.1989.100111.
[7] A. Kleiner, C. Dornhege. Real-time localization and elevation mapping within urban search and rescue scenarios. Journal of Field Robotics 24(8-9):723–745, 2007. doi:10.1002/rob.20208.
[8] S. Thrun, M. Montemerlo, H. Dahlkamp, et al. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics 23(9):661–692, 2006. doi:10.1002/rob.20147.
[9] PrimeSense, 2016. Page Version ID: 710008223.
[10] Correspondence problem, 2015. Page Version ID: 661209488.
[11] S. DESIGN. RC-XD (Call of Duty: Black Ops). 3D Warehouse.
[12] t. plus. Kinect from Microsoft. 3D Warehouse.
[13] P. Fankhauser, M. Bloesch, D. Rodriguez, et al. Kinect v2 for mobile robot navigation: Evaluation and modeling. In International Conference on Advanced Robotics (ICAR), pp. 388–394, 2015. doi:10.1109/ICAR.2015.7251485.
[14] M. S. Nixon, A. S. Aguado. Feature Extraction and Image Processing. Newnes, Oxford; Boston, 1st edn., 2002. OCLC: ocm49045100.
[15] L. G. Shapiro, G. C. Stockman. Computer Vision. Prentice Hall, Upper Saddle River, NJ, 2001.