Human identification and tracking using ultra-wideband-vision data fusion in unstructured environments

ACTA IMEKO, ISSN: 2221-870X, December 2021, Volume 10, Number 4, 124-131

Alessandro Luchetti1, Andrea Carollo1, Luca Santoro1, Matteo Nardello1, Davide Brunelli1, Paolo Bosetti1, Mariolino De Cecco1
1 Department of Industrial Engineering, University of Trento, Sommarive, 9 - 38123 Trento, Italy

Section: RESEARCH PAPER
Keywords: Human-robot interaction; following system; ultra-wideband technology; 3D vision system; sensor fusion
Citation: Alessandro Luchetti, Andrea Carollo, Luca Santoro, Matteo Nardello, Davide Brunelli, Paolo Bosetti, Mariolino De Cecco, Human identification and tracking using ultra-wideband-vision data fusion in unstructured environments, Acta IMEKO, vol. 10, no. 4, article 21, December 2021, identifier: IMEKO-ACTA-10 (2021)-04-21
Section Editor: Roberto Montanini, Università di Messina and Alfredo Cigada, Politecnico di Milano, Italy
Received July 26, 2021; In final form December 6, 2021; Published December 2021
Copyright: This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Corresponding author: Alessandro Luchetti, e-mail: alessandro.luchetti@unitn.it

ABSTRACT
Cooperation between automated guided vehicles (AGVs) and human operators in changing, unstructured environments such as logistics warehouses is increasingly in demand. The challenge addressed in this article concerns two crucial functions of autonomy: operator identification and tracking. These tasks are necessary to enable an AGV to follow the selected operator along his path. This paper presents an innovative, accurate, robust, autonomous, and low-cost real-time operator tracking system that leverages the inherent complementarity of the uncertainty regions (2D ellipses) of ultra-wideband (UWB) transceivers and cameras. The test campaign shows that the UWB system has higher uncertainty in the angular direction, whereas for the vision system the uncertainty is predominant along the radial coordinate. Given the nature of the data, sensor fusion improves the accuracy and quality of the final tracking.

1. INTRODUCTION
Cooperation between mobile robots and people plays a significant role in the modern economy, and its demand is increasing worldwide, aided by the growing adoption of autonomous and smart mobile robots. In industrial environments, in particular, co-bots can augment human resources while reducing physical and mental load and increasing operational safety and productivity. In this context, the human-following function, called "follow-me", is crucial. It consists of identifying and following the assigned operator even in unstructured environments.
There are different approaches to achieve such a task. The most commonly used technologies for tracking include vision [1], [2], time-of-flight (TOF) cameras [3], LiDAR [4], light-emitting devices (LED) [5] and UWB transceivers [6]-[8]. Each of these technologies has advantages and disadvantages. LiDAR, while faster and more accurate than a TOF camera, is much more expensive and is not used unless strictly necessary. The LED detection method has few applications in the literature because of its low robustness: the robot frequently fails to detect the light-emitting device. The best candidates for low-cost tracking operations remain 3D vision and UWB systems. The main contributions in the literature use them independently to solve the "follow-me" task in unstructured environments, without overcoming their respective disadvantages. For example, traditional vision systems have a limited field of view (FoV) compared to UWB systems and are affected by lighting conditions, especially outdoors. In contrast, UWB can be used both indoors and outdoors, but suffers from higher uncertainty, especially for measurements of less than one meter and in the presence of obstruction by people between the transceivers [9], [10].
UWB systems, however, can measure distances up to 80 m, whereas TOF cameras return mostly noise beyond 10 m, although they are more precise than UWB at shorter range. Furthermore, the shapes of the uncertainty regions of the two systems are different but complementary. With this work, we overcome the disadvantages of these technologies by combining them to improve the robustness and reduce the uncertainty of the measurement result.
The only works in the literature that apply sensor fusion between UWB and vision systems operate in structured environments with fixed UWB transceivers [11], [12] or with both fixed UWB and vision systems [13]. Their solutions do not adapt to changes, are more expensive because of the number of devices required, and must be calibrated for each environment.
This work is organized as follows. An overview of the involved measurement systems is provided in the next section. The human identification algorithms used to define which operator to follow are presented in Section 3; Section 4 describes the uncertainty models of the UWB and vision systems. The operator localization results obtained through the sensor fusion approach are discussed in Section 5, followed by conclusions in Section 6.

2. MEASUREMENT SYSTEMS
For our application, we used Decawave's UWB development board DWM1001 [14]. It has the advantages of low cost, low power consumption, and strong penetration. The transceivers were programmed with a two-way ranging (TWR) architecture [15] that allows them to operate in an unstructured environment, with two anchors on the robot, one master and one generic, and a tag carried by the tracked operator (Figure 2). The tag first communicates with the generic anchor to calculate the distance d1 between the tag and the generic anchor, and then with the master anchor to obtain the distance d2 between the tag and the master anchor. The generic anchor then sends the estimated distance d1 to the master, and finally the master sends the two distances d1 and d2 to the computer, within a total time of 9 ms. The distance d3 between the two anchors is fixed and known. The position of the tag obtained with the UWB system alone is ambiguous, because the two circumferences of radii d1 and d2 can intersect at two different points.
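As a minimal, purely illustrative sketch (the values below are not from the test campaign), the ambiguity can be reproduced by intersecting the two range circles; this anticipates the closed-form solution derived in Section 4.1:

```python
import numpy as np

def tag_candidates(d1, d2, d3):
    """Candidate tag positions from the two tag-anchor ranges.

    Anchors are assumed on the x-axis at (+d3/2, 0) (generic) and (-d3/2, 0)
    (master); z is the depth in front of the robot. Returns the two
    intersection points of the circles of radii d1 and d2.
    """
    x = (d2**2 - d1**2) / (2 * d3)       # lateral coordinate, common to both solutions
    z_sq = d1**2 - (x - d3 / 2) ** 2     # from (x - d3/2)^2 + z^2 = d1^2
    if z_sq < 0:
        raise ValueError("inconsistent ranges: the circles do not intersect")
    z = np.sqrt(z_sq)
    return (x, +z), (x, -z)              # ambiguity: in front of or behind the anchor line

# Illustrative values in meters; the camera later tells which solution is the valid one.
front, back = tag_candidates(d1=3.2, d2=3.5, d3=0.6)
```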
To solve this problem, the robot is equipped with a camera, which determines the uniqueness of the measurement (i.e., whether the operator is in front of the robot or not). The selected camera is the Intel RealSense™ Depth Camera D455 [16]. The device includes an RGB camera, two infrared (IR) cameras, a laser projector, and an inertial measurement unit (IMU). The vision system can extract both a 3D point cloud of the scene and a traditional RGB image. The RGB image is used to apply artificial intelligence (AI) algorithms that perform human skeleton detection, while the point cloud allows localizing the key points of the skeleton in 3D. Figure 1 shows the designed system, in particular the UWB and vision systems on board the mobile robot for operator identification and tracking.

3. HUMAN IDENTIFICATION
Human detection and human pose estimation are performed with a neural network provided by Intel in the OpenVINO™ toolkit, called human-pose-estimation-0001. The toolkit enables convolutional neural network (CNN)-based deep learning inference and contains a version of the OpenCV libraries optimized for Intel hardware [17]. The network is based on the OpenPose [18] approach with a tuned MobileNet v1 [19] as feature extractor and produces two different outputs. The first is composed of 18 probability maps, called heat-maps, that provide all the key-points in the image: ears, eyes, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles (Figure 3a). The second consists of 19 × 2 layers, called part affinity fields, which indicate how to match the key-points belonging to a single person (Figure 3b). Heat-maps give the probability of each pixel being at the position of a key-point. By thresholding the heat-maps, the pixels with a probability higher than 50 % can be selected and, for each cluster of pixels, the average point and the covariance matrix of the corresponding key-point location can be evaluated (Figure 3a). From these covariance matrices, the average uncertainty of each pixel was calculated as three pixels.
The image resolution was set to 848 × 480 (C × R); the INT8 version of human-pose-estimation-0001, running on the CPU, extracts up to 18 key-points per person in each frame. This resolution was chosen as a good trade-off between image resolution, which is linked to the uncertainty of the human positioning, and the execution speed of the network. The network is available in three different precisions, i.e. INT8, FP16, and FP32. In our tests on the CPU there is no appreciable difference in output quality, but the INT8 version is slightly faster than the others.
Only the operator entitled to use the robot must be tracked. To achieve this, we introduced face identification. Identifying the operator requires two steps: finding all the faces within the RGB frame and then detecting the operator among them. For these steps, we used two CNNs available in the OpenVINO™ toolkit: face-detection-retail-0005 for face detection, and face-reidentification-retail-0095 to extract the features of each face and compare them with the operator features stored in a database. The matching is performed through the cosine similarity, equation (1), defined as the cosine of the angle θ between two feature vectors A and B in R^n, where A is the feature vector acquired in real time and B is the one stored as ground truth. The higher the cosine, i.e. the smaller the angle θ, the higher the probability that the feature vectors are similar.
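A minimal sketch of this matching step (illustrative only: feature extraction is assumed to have already been performed by the networks above, and the 0.6 threshold is a placeholder, not a value from the paper):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between feature vectors a and b, as in equation (1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_operator(query: np.ndarray, gallery: dict, threshold: float = 0.6):
    """Return the identity (key of gallery) whose stored feature vector is most
    similar to the query, provided the similarity exceeds the chosen threshold."""
    best_id, best_sim = None, -1.0
    for identity, feature in gallery.items():
        sim = cosine_similarity(query, feature)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)
```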
Figure 1. Real system with mobile robot and operator.
Figure 2. System configuration (top view).

$$\mathrm{Similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert\,\lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} . \quad (1)$$

Once identified, the operator turns on the spot and the software saves the body features for different poses using another neural network, person-reidentification-retail-0031. As previously stated, the correspondence is made with the cosine similarity between the feature vectors of people's bodies in the frames. Table 1 reports the inference time of each network, measured on an Intel i7-7700HQ CPU with 16 GB of RAM.

4. HUMAN LOCALIZATION: 2D UNCERTAINTY MODELING
The UWB and vision systems provide the same information about the operator's position, both referred to the robot base, but their uncertainty regions are complementary, as explained in this section.

4.1. UWB system
To maintain consistency with the information received from the vision system, the reference frame of the UWB system is fixed on the camera, and the anchors are placed at its sides at equal distance d3/2 (Figure 2). In particular, the X coordinate is the lateral axis, the Z coordinate the depth, and the Y coordinate the vertical axis. For the localization of the operator in space, only the X and Z coordinates are of interest, since everything is projected onto the ground plane. The equations of the circumferences from the UWB devices become:

$$\left(x - \frac{d_3}{2}\right)^2 + z^2 = d_1^2 , \quad (2)$$

$$\left(x + \frac{d_3}{2}\right)^2 + z^2 = d_2^2 . \quad (3)$$

From equations (2) and (3), we obtain the closed-form solution for the tag position:

$$x = \frac{d_2^2 - d_1^2}{2 d_3} , \quad (4)$$

$$z = \pm\frac{1}{2}\sqrt{-\frac{(d_1^2 - d_2^2)^2}{d_3^2} - d_3^2 + 2\,(d_1^2 + d_2^2)} . \quad (5)$$

As discussed in Section 2, we keep only the positive value of the Z coordinate, because the vision system guarantees that the operator is in front of the robot. The covariance matrix of the coordinates of the tag position obtained with the UWB system is:

$$C_{\mathrm{UWB}} = J_{\mathrm{dist}} \, C_{\mathrm{dist}} \, J_{\mathrm{dist}}^{T} , \quad (6)$$

where C_dist is the covariance matrix of the measured distances d1, d2, and d3 (Figure 2), with standard deviations σ1, σ2, and σ3:

$$C_{\mathrm{dist}} = \begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{pmatrix} . \quad (7)$$

The UWB devices used to test the overall system were calibrated in different indoor and outdoor scenarios to find the corresponding systematic offset at different distances and the related uncertainties. The master and generic anchors were calibrated independently because, although the device type is the same, their internal crystals, and thus their responses, are not identical. In particular, we carried out measurements from 1 m to 20 m every 0.5 m, in random order, both indoors and outdoors, collecting data at 52 Hz for three minutes. Between different distances, we waited three minutes with all the devices turned off. Three tests were performed at each distance: one with a free line of sight (LOS) between the devices (Figure 4a) and two with disturbing elements: in one, 3-4 people simultaneously walked randomly back and forth between the devices throughout the test (Figure 4b); in the other, large static metal objects were placed in random positions between the devices (Figure 4c). This was done to understand how much these scenarios affect the measurements. Figure 5a shows an example of the box-plot results for an indoor test at 7 m in the different scenarios.
In particular, in all the tests, the standard deviation of the measurements made with walking people is significantly higher than with free LOS or metal objects. The data acquired with walking people are overestimated, possibly because, when the LOS is occluded, reflected radio waves are received as if they had travelled the direct path.
From the tests, the standard deviations of d1 and d2 were calculated as 0.05 m outdoors with free LOS; this value was set in our application. The standard deviation of d3 was instead set to 0.01 m to take into account human error during fixing.

Table 1. Overall times.
Network                              | Image resolution | Net precision | Inference time (ms)
face-detection-retail-0005           | 848 × 480        | INT8          | 9.42
face-reidentification-retail-0095    | 848 × 480        | FP16          | 5.38
person-reidentification-retail-0031  | 96 × 48          | FP16          | 6.50
human-pose-estimation-0001           | 848 × 480        | INT8          | 115

Figure 3. Heat-maps with covariances of key-points (a) and affinity map (b) of an 848 × 480 image from Camera D455 with the operator at two meters.

The offset between the distances measured with the UWB devices and the real distances was modeled from the outdoor tests in free LOS (Figure 5b). All the real distances, taken as ground truth, were measured with a Fervi ML80 laser meter [20], able to measure distances from 0.05 m to 80.00 m with an accuracy of 0.02 m. The generic anchor and the tag were powered by a Varta power bank type 57962, while a USB cable powered the master anchor from the PC; the same cable was used for serial communication and to save the distances estimated by the two anchors.
As stated above, the outdoor free-LOS condition was chosen as the reference for the offset and standard deviation values. To take other scenarios into account, the Channel Impulse Response (CIR) [21] provided by the UWB modules is checked at every measurement: in case of human obstruction this parameter decreases, and the distance information is not updated until the CIR returns to a reasonable value.
The Jacobian matrix J_dist with respect to the distances is:

$$J_{\mathrm{dist}} = \begin{pmatrix} -\dfrac{d_1}{d_3} & \dfrac{d_2}{d_3} & \dfrac{a_2}{2 d_3^2} \\[2mm] \dfrac{4 d_1 - \dfrac{4 d_1 a_2}{d_3^2}}{a_1} & \dfrac{4 d_2 + \dfrac{4 d_2 a_2}{d_3^2}}{a_1} & -\dfrac{2 d_3 - \dfrac{2 a_2^2}{d_3^3}}{a_1} \end{pmatrix} , \quad (8)$$

with:

$$a_1 = 4 \sqrt{2 d_1^2 + 2 d_2^2 - d_3^2 - \frac{a_2^2}{d_3^2}} , \qquad a_2 = d_1^2 - d_2^2 .$$

Equation (6) becomes:

$$C_{\mathrm{UWB}} = \begin{pmatrix} \dfrac{\sigma_3^2 a_6^2}{4 d_3^4} + \dfrac{d_1^2 \sigma_1^2}{d_3^2} + \dfrac{d_2^2 \sigma_2^2}{d_3^2} & a_1 \\[2mm] a_1 & \dfrac{\sigma_3^2 a_3^2}{16 a_2} + \dfrac{\sigma_1^2 a_5^2}{16 a_2} + \dfrac{\sigma_2^2 a_4^2}{16 a_2} \end{pmatrix} , \quad (9)$$

with:

$$a_1 = \frac{d_2 \sigma_2^2 a_4}{4 d_3 \sqrt{a_2}} - \frac{d_1 \sigma_1^2 a_5}{4 d_3 \sqrt{a_2}} - \frac{\sigma_3^2 a_6 a_3}{8 d_3^2 \sqrt{a_2}} , \qquad a_2 = 2 d_1^2 + 2 d_2^2 - d_3^2 - \frac{a_6^2}{d_3^2} , \qquad a_3 = 2 d_3 - \frac{2 a_6^2}{d_3^3} ,$$

$$a_4 = 4 d_2 + \frac{4 d_2 a_6}{d_3^2} , \qquad a_5 = 4 d_1 - \frac{4 d_1 a_6}{d_3^2} , \qquad a_6 = d_1^2 - d_2^2 .$$

Figure 4. Indoor scenarios: free LOS (a), people walking (b), metal objects (c); outdoor scenarios (d).
Figure 5. (a) Box-plot of data at 7 m indoor in different scenarios; (b) offset model of master and generic anchors outdoor in free line of sight (LOS).

Since the elements of equation (9) have the measure d3 in the denominator, the more distant the anchors are from each other, the more accurate the position measurement. Note also that the off-diagonal terms are zero if and only if the measured distances d1 and d2 are equal (assuming also σ1 = σ2).
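As a minimal numerical sketch of the propagation in equation (6) (values are illustrative; the Jacobian is approximated by finite differences instead of the analytic expression (8)):

```python
import numpy as np

def tag_position(d):
    """Closed-form tag position (x, z) from equations (4)-(5); d = (d1, d2, d3)."""
    d1, d2, d3 = d
    x = (d2**2 - d1**2) / (2 * d3)
    z = 0.5 * np.sqrt(-((d1**2 - d2**2) ** 2) / d3**2 - d3**2 + 2 * (d1**2 + d2**2))
    return np.array([x, z])

def uwb_covariance(d, sigma, eps=1e-6):
    """Propagate the range uncertainties to (x, z): C_UWB = J C_dist J^T, equation (6).
    The Jacobian is approximated here by central finite differences."""
    d = np.asarray(d, dtype=float)
    J = np.zeros((2, 3))
    for i in range(3):
        dp, dm = d.copy(), d.copy()
        dp[i] += eps
        dm[i] -= eps
        J[:, i] = (tag_position(dp) - tag_position(dm)) / (2 * eps)
    C_dist = np.diag(np.square(sigma))   # equation (7)
    return J @ C_dist @ J.T

# Illustrative values: d1 = d2 = 3 m, d3 = 0.6 m, sigma1 = sigma2 = 0.05 m, sigma3 = 0.01 m.
C_uwb = uwb_covariance([3.0, 3.0, 0.6], [0.05, 0.05, 0.01])
# With d1 = d2 the off-diagonal terms vanish (numerically), as noted above.
```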
In this case, equation (9) becomes:

$$C_{\mathrm{UWB}} = \begin{pmatrix} \dfrac{2 d_1^2 \sigma_1^2}{d_3^2} & 0 \\[2mm] 0 & \dfrac{2 d_1^2 \sigma_1^2}{4 d_1^2 - d_3^2} + \dfrac{d_3^2 \sigma_3^2}{4\,(4 d_1^2 - d_3^2)} \end{pmatrix} . \quad (10)$$

Under these conditions, Figure 6 shows the standard deviation of each eigenvalue in the two principal directions X and Z as a function of the Z distance of the tag from the anchors, i.e. the square roots of CUWB(1,1) and CUWB(2,2), respectively. As can be seen from the eigenvalues in Figure 6 and the covariance shapes in Figure 7, at the beginning (Z = 0 m) the covariance is a tight ellipse stretched in the radial direction (CUWB(1,1) < CUWB(2,2), tag A in Figure 7); when the distances d1 and d2 become orthogonal to each other, it is a perfect circle (CUWB(1,1) = CUWB(2,2), tag B in Figure 7); and at larger Z distances it is stretched in the angular direction (CUWB(1,1) > CUWB(2,2), tag C in Figure 7). If d1 and d2 are different (tag D in Figure 7), the quadratic approximation of the probability ellipse is rotated.

Figure 6. Standard deviation (std) of each eigenvalue in the two principal directions X and Z for the UWB system from 0 m to 10 m, with the off-diagonal terms of the covariance matrix CUWB equal to 0.
Figure 7. UWB system uncertainty ellipses for different tag positions.

4.2. Vision system
The model used to study the variance of the depth coordinate Z of the selected camera is a pinhole model. The two IR sensors of the camera evaluate the depth through a disparity matching algorithm, while the RGB sensor provides the 2D image; all of them are co-planar and aligned. The common reference frame of the two IR sensors is set at their focal length and is obtained by translating the reference frame of the right sensor onto that of the left sensor by an amount equal to the baseline b (Figure 8). The charge-coupled device (CCD) resolution of each sensor is 848 pixels in columns (c) and 480 pixels in rows (r), for a total of 848 × 480 (C × R) pixels.

Figure 8. Stereo pinhole model.

The expression of the depth coordinate z is:

$$z = \frac{f_c \, b}{\lvert c_{Q_1} - c_{Q_2} \rvert} = \frac{f_c \, b}{d} , \quad (11)$$

where fc is the focal length in the X direction, b the baseline, and d the disparity (Figure 8). From equation (11) it is possible to evaluate the error in the depth estimation with respect to the disparity d, assuming all the other parameters known and constant:

$$\sigma_z = \left|\frac{\partial z}{\partial d}\right| \sigma_d = \frac{b \, f_c}{d^2} \, \sigma_d . \quad (12)$$

For better understanding, equation (12) can be rewritten as a function of the Z distance by substituting the disparity d with the formulation obtained from equation (11):

$$\sigma_z = \frac{z^2}{b \, f_c} \, \sigma_d , \quad (13)$$

where the focal lengths in pixels are:

$$f_c = \frac{0.5\, C}{\tan(HFOV/2)} , \quad (14)$$

$$f_r = \frac{0.5\, R}{\tan(VFOV/2)} , \quad (15)$$

HFOV and VFOV being the horizontal and vertical fields of view, 87° and 58° respectively in our case. By varying the resolution, the focal length changes and so does the distance uncertainty. The disparity standard deviation σd of the Camera D455 is declared by the manufacturer to be 0.08 pixel. Since many interfering effects increase this value, we calibrated the camera to check whether the model of equation (13) fits the real data and, if so, to obtain the actual standard deviation σd. We made 400 consecutive depth acquisitions of a chessboard at distances from 1 m to 10 m, in steps of one meter taken in random order (Figure 9a).
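A minimal sketch of the theoretical model of equations (13)-(14); the 95 mm baseline is taken from the manufacturer's datasheet for the D455 and is not stated in the text, while the experimentally found σd of 0.4 pixel is discussed just below:

```python
import numpy as np

def focal_length_px(n_pixels: int, fov_deg: float) -> float:
    """Focal length in pixels from the sensor resolution and field of view, equation (14)."""
    return 0.5 * n_pixels / np.tan(np.radians(fov_deg) / 2)

def depth_std(z, baseline: float, fc: float, sigma_d: float):
    """Depth standard deviation as a function of the distance z, equation (13)."""
    return (np.asarray(z, dtype=float) ** 2 / (baseline * fc)) * sigma_d

fc = focal_length_px(848, 87.0)        # ~447 px for C = 848 and HFOV = 87 deg
z = np.linspace(1.0, 10.0, 10)         # distances of the calibration campaign

# baseline = 0.095 m: datasheet value assumed here, not given in the paper.
sigma_theory = depth_std(z, baseline=0.095, fc=fc, sigma_d=0.08)  # declared disparity std
sigma_found  = depth_std(z, baseline=0.095, fc=fc, sigma_d=0.4)   # experimentally found value
```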
Figure 9b shows the theoretical and the experimentally obtained σz as a function of the Z distance for the image resolution C of 848 pixels. As can be seen in Figure 9b, the experimentally found model lies above the theoretical one. The new experimental value of σd, found by applying the least absolute residual (LAR) robust fitting method, is 0.4 pixel.

Figure 9. (a) Camera calibration test with chessboard; (b) standard deviation results on the Z distance with the stereo camera D455 for resolution C of 848 pixels.

To evaluate the coordinates in the X and Y dimensions, we use the Brown-Conrady distortion calibration model [22] with the intrinsic constant coefficients (k1, k2, k3, p1, p2) of the specific selected camera:

$$u_x = \hat{x} \, f + 2\, p_1\, \hat{x}\hat{y} + p_2\,(r + 2 \hat{x}^2) , \quad (16)$$

$$u_y = \hat{y} \, f + 2\, p_2\, \hat{x}\hat{y} + p_1\,(r + 2 \hat{y}^2) , \quad (17)$$

with:

$$\hat{x} = \frac{c - C/2}{f_c} = \frac{x_p}{f_c} , \qquad \hat{y} = \frac{r - R/2}{f_r} = \frac{y_p}{f_r} , \qquad r = \hat{x}^2 + \hat{y}^2 , \qquad f = 1 + k_1 r + k_2 r^2 + k_3 r^3 .$$

The new expressions of the X and Y positions become:

$$x = u_x \, z , \quad (18)$$

$$y = u_y \, z . \quad (19)$$

As for the UWB system, only the X and Z coordinates are of interest for the localization of the operator in space. The covariance matrix of the operator coordinates obtained with the camera system is:

$$C_{\mathrm{cam}} = J_{\mathrm{cam}} \, C_{\sigma} \, J_{\mathrm{cam}}^{T} , \quad (20)$$

where:

$$C_{\sigma} = \begin{pmatrix} \sigma_c^2 & 0 & 0 \\ 0 & \sigma_r^2 & 0 \\ 0 & 0 & \sigma_z^2 \end{pmatrix} , \quad (21)$$

$$J_{\mathrm{cam}}^{T} = \begin{pmatrix} z \left( \dfrac{a_1}{f_c} + \dfrac{6 p_2 x_p}{f_c^2} + \dfrac{x_p \left( \dfrac{2 k_1 x_p}{f_c^2} + \dfrac{4 k_2 x_p a_2}{f_c^2} + \dfrac{6 k_3 x_p a_2^2}{f_c^2} \right)}{f_c} + \dfrac{2 p_1 y_p}{f_c f_r} \right) & 0 \\[4mm] z \left( \dfrac{2 p_2 y_p}{f_r^2} + \dfrac{x_p \left( \dfrac{2 k_1 y_p}{f_r^2} + \dfrac{4 k_2 y_p a_2}{f_r^2} + \dfrac{6 k_3 y_p a_2^2}{f_r^2} \right)}{f_c} + \dfrac{2 p_1 x_p}{f_c f_r} \right) & 0 \\[4mm] p_2 \left( \dfrac{3 x_p^2}{f_c^2} + a_3 \right) + \dfrac{x_p a_1}{f_c} + \dfrac{2 p_1 x_p y_p}{f_c f_r} & 1 \end{pmatrix} , \quad (22)$$

with:

$$a_1 = k_1 a_2 + k_2 a_2^2 + k_3 a_2^3 + 1 , \qquad a_2 = \frac{x_p^2}{f_c^2} + a_3 , \qquad a_3 = \frac{y_p^2}{f_r^2} .$$

Ccam depends on the focal length f, on the baseline b, and on the standard deviations of the disparity σd and of the pixels σc, σr. Once the depth frame and the RGB frame are aligned, it is possible to estimate the distance to any pixel and thus project the operator's skeleton in space. Sometimes key-points can have an incorrect distance value (Z coordinate), especially when the pose is very close to the vision system; this is due to distortion caused by the camera lenses and imperfect image realignment. To overcome this problem, we extract an average position of the key-points with the median: a key-point projected too far from or too near the others has no impact on the final operator position.
The off-diagonal terms of equation (20) are zero if and only if xp = 0 and yp = 0. In this case equation (20) becomes:

$$C_{\mathrm{cam}} = \begin{pmatrix} \dfrac{z^2 \sigma_c^2}{f_c^2} & 0 \\[2mm] 0 & \sigma_z^2 \end{pmatrix} . \quad (23)$$

Figure 10 shows the standard deviation of each eigenvalue in the two principal directions X and Z of equation (23), with the covariance shapes in Figure 11, as previously done for the UWB system. As can be seen in Figure 10, beyond 30 cm the uncertainty of the radial coordinate (Z direction) is always greater than that of the angular one (X direction), i.e. Ccam(1,1) < Ccam(2,2): the ellipse is stretched in the radial direction.

5. SENSOR FUSION
Sensor fusion was performed between the position estimated by the UWB system and the position estimated by the vision system, to reduce the uncertainty and improve the estimate of the operator's position in space. Figure 12 shows two examples, for two different tag positions, where the uncertainty of the position estimated with the UWB system is more significant along the angular direction (X coordinate), with the reference system centred on the origin of the camera. Notice that, in the case of the vision system, the uncertainty is instead predominant along the radial direction (Z coordinate). By fusing the information with Bayes' theorem [23], it is possible to reduce the uncertainty in both directions. The fused information shown in Figure 12 can be summarized by the following expressions:

$$C_{\mathrm{fused}} = \left[ C_{\mathrm{UWB}}^{-1} + C_{\mathrm{cam}}^{-1} \right]^{-1} , \quad (24)$$

$$P_{\mathrm{fused}} = C_{\mathrm{fused}} \left[ C_{\mathrm{UWB}}^{-1} P_{\mathrm{UWB}} + C_{\mathrm{cam}}^{-1} P_{\mathrm{cam}} \right] , \quad (25)$$

where:
- PUWB and Pcam are the estimated mean positions of the UWB system and camera system, respectively;
- CUWB and Ccam are the covariance matrices of the estimated positions of the UWB system and camera system, respectively.
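A minimal sketch of equations (24)-(25), with illustrative numbers shaped like the ellipses described above (UWB wider in X, camera wider in Z):

```python
import numpy as np

def fuse(p_uwb, C_uwb, p_cam, C_cam):
    """Covariance-weighted fusion of the two position estimates, equations (24)-(25)."""
    C_uwb_inv = np.linalg.inv(C_uwb)
    C_cam_inv = np.linalg.inv(C_cam)
    C_fused = np.linalg.inv(C_uwb_inv + C_cam_inv)                 # equation (24)
    p_fused = C_fused @ (C_uwb_inv @ p_uwb + C_cam_inv @ p_cam)    # equation (25)
    return p_fused, C_fused

# Illustrative 2D (X, Z) example: UWB uncertain in the angular (X) direction,
# camera uncertain in the radial (Z) direction.
p_uwb = np.array([0.05, 4.02])
C_uwb = np.diag([0.20**2, 0.05**2])
p_cam = np.array([0.01, 3.90])
C_cam = np.diag([0.03**2, 0.15**2])

p_fused, C_fused = fuse(p_uwb, C_uwb, p_cam, C_cam)
# The fused ellipse is smaller than either input along both X and Z.
```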
Figure 10. Standard deviation (std) of each eigenvalue in the two principal directions X and Z for the camera system from 0 m to 10 m, with the off-diagonal terms of the covariance matrix Ccam equal to 0.
Figure 11. Camera system uncertainty ellipses for different tag positions.
Figure 12. (a) Two examples of uncertainty ellipses (95 %, with k = 2.4478 [24]) in two different tag positions (x = 0 m, z = 4 m; x = 2 m, z = 6 m) from the camera and UWB systems; (b) zoom of the results.

6. CONCLUSIONS
In this work, an innovative and robust method to identify and track an operator from a mobile robot in real time was developed. The operator is identified through convolutional neural network tools and tracked with a localization method that fuses the information from low-cost sensors, namely UWB transceivers and a depth camera, in order to locate the operator's position with lower uncertainty. The test campaign shows that the UWB system has a higher uncertainty in the angular direction, contrary to the camera, where the uncertainty is higher in the radial direction. Specifically, CUWB is affected by the distances measured between the tag and the two anchors (d1, d2) and between the anchors themselves (d3), with their corresponding uncertainties. Ccam instead depends on the focal length f, on the baseline b, and on the standard deviations of the disparity σd and of the pixels σc, σr. Our solution makes the final system robust and more precise thanks to the complementarity of the covariance matrices of the UWB system and of the vision system.
To obtain a better result in the real application, the UWB transceivers were calibrated through a test campaign that provided the behaviour of the systematic offset of the measurements and their standard deviations. The offsets are specific to each UWB device and increase with the Z distance. The standard deviation of the distances between the devices was calculated as 0.05 m in the outdoor free line-of-sight scenario. Another test campaign, this time on the vision system, was carried out to evaluate the theoretical model of the standard deviation σz as a function of the Z distance and the value of the camera disparity standard deviation σd. The σd calculated for our Camera D455 is 0.4 pixel. Moreover, the average uncertainty of each pixel, σc and σr, was calculated as three pixels through the heat-map analysis of the convolutional neural network used.

ACKNOWLEDGEMENT
This work was supported by the Italian Ministry for University and Research (MUR) under the program "Dipartimenti di Eccellenza (2018-2022)".

REFERENCES
[1] M. Gupta, S. Kumar, L. Behera, V. K. Subramanian, A novel vision-based tracking algorithm for a human-following mobile robot, IEEE Transactions on Systems, Man, and Cybernetics: Systems 7 (2016), pp. 1415–1427. DOI: 10.1109/TSMC.2016.2616343
[2] S.-O. Lee, M. Hwang-Bo, B.-J. You, S.-R. Oh, Y.-J. Cho, Vision based mobile robot control for target tracking, IFAC Proceedings Volumes 34(4) (2001), pp. 75–80. DOI: 10.1016/S1474-6670(17)34276-3
[3] G. Xing, S. Tian, H. Sun, W. Liu, H. Liu, People-following system design for mobile robots using Kinect sensor, 25th Chinese Control and Decision Conference (CCDC), IEEE, Guiyang, China, 25-27 May 2013, pp. 3190–3194. DOI: 10.1109/CCDC.2013.6561495
[4] S. A. Ahmed, A. V. Topalov, N. G. Shakev, V. L. Popov, Model-free detection and following of moving objects by an omnidirectional mobile robot using 2D range data, IFAC-PapersOnLine 51(22) (2018), pp. 226–231. DOI: 10.1016/j.ifacol.2018.11.546
[5] Y. Nagumo, A. Ohya, Human following behavior of an autonomous mobile robot using light-emitting device, Proceedings 10th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN 2001), IEEE, Bordeaux-Paris, France, 18-21 Sept. 2001, pp. 225–230. DOI: 10.1109/ROMAN.2001.981906
[6] T. Feng, Y. Yu, L. Wu, Y. Bai, Z. Xiao, Z. Lu, A human-tracking robot using ultra wideband technology, IEEE Access 6 (2018), pp. 42541–42550. DOI: 10.1109/ACCESS.2018.2859754
[7] T. G. Kim, D.-J. Seo, K.-S. Joo, A following system for a specific object using a UWB system, 18th International Conference on Control, Automation and Systems (ICCAS), IEEE, PyeongChang, Korea (South), 17-20 Oct. 2018, pp. 958–960.
[8] D.-J. Seo, T. G. Kim, S. W. Noh, H. H. Seo, Object following method for a differential type mobile robot based on ultra wide band distance sensor system, 17th International Conference on Control, Automation and Systems (ICCAS), IEEE, Jeju, Korea (South), 18-21 Oct. 2017, pp. 736–738. DOI: 10.23919/ICCAS.2017.8204325
[9] L. Santoro, D. Brunelli, D. Fontanelli, On-line optimal ranging sensor deployment for robotic exploration, IEEE Sensors Journal (2021). DOI: 10.1109/JSEN.2021.3120889
[10] L. Santoro, M. Nardello, D. Brunelli, D. Fontanelli, Scale up to infinity: the UWB Indoor Global Positioning System, 2021 IEEE International Symposium on Robotic and Sensors Environments (ROSE), 2021, pp. 1–8. DOI: 10.1109/ROSE52750.2021.9611770
[11] G. Ding, H. Lu, J. Bai, X. Qin, Development of a high precision UWB/vision-based AGV and control system, 5th International Conference on Control and Robotics Engineering (ICCRE), Osaka, Japan, 24-26 April 2020, pp. 99–103. DOI: 10.1109/ICCRE49379.2020.9096456
[12] H. Xu, L. Wang, Y. Zhang, K. Qiu, S. Shen, Decentralized visual-inertial-UWB fusion for relative state estimation of aerial swarm, IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May-31 Aug. 2020, pp. 8776–8782. DOI: 10.1109/ICRA40945.2020.9196944
[13] F. Liu, J. Zhang, J. Wang, H. Han, D. Yang, An UWB/vision fusion scheme for determining pedestrians' indoor location, Sensors 20(4) (2020), p. 1139. DOI: 10.3390/s20041139
[14] Decawave DWM1001. Online [Accessed 06 December 2021] https://www.decawave.com/product/dwm1001-development-board/
[15] M. Kwak, J. Chong, A new double two-way ranging algorithm for ranging system, 2nd IEEE International Conference on Network Infrastructure and Digital Content, IEEE, Beijing, China, 24-26 Sept. 2010, pp. 470–473. DOI: 10.1109/ICNIDC.2010.5657814
[16] Intel RealSense Camera D455. Online [Accessed December 2021] https://www.intelrealsense.com/depth-camera-d455/
[17] OpenCV. Online [Accessed December 2021] https://opencv.org/
[18] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, Y. A. Sheikh, OpenPose: realtime multi-person 2D pose estimation using part affinity fields, IEEE Transactions on Pattern Analysis and Machine Intelligence 43(1) (2019), pp. 172–186. DOI: 10.1109/TPAMI.2019.2929257
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications. Online [Accessed 06 December 2021] https://arxiv.org/abs/1704.04861
[20] Fervi, laser distance meter (misuratore di distanza laser). Online [Accessed 06 December 2021] https://www.fervi.com/ita/strumenti-di-misura/misuratori-analogici-e-digitali/misuratore-di-distanze/misuratore-di-distanza-laser-pr-8240.htm
[21] C. Jiang, S. Chen, Y. Chen, D. Liu, Y. Bo, An UWB channel impulse response de-noising method for NLOS/LOS classification boosting, IEEE Communications Letters 24(11) (2020), pp. 2513–2517. DOI: 10.1109/LCOMM.2020.3009659
[22] C. B. Duane, Close-range camera calibration, Photogrammetric Engineering 37(8) (1971), pp. 855–866.
[23] W. Elmenreich, An introduction to sensor fusion, Vienna University of Technology, Austria, 502 (2002), pp. 1–28.
[24] R. C. Smith, P. Cheeseman, On the representation and estimation of spatial uncertainty, The International Journal of Robotics Research 5(4) (1986), pp. 56–68. DOI: 10.1177/027836498600500404