Omnidirectional camera pose estimation and projective texture mapping for photorealistic 3D virtual reality experiences

ACTA IMEKO, ISSN: 2221-870X, June 2022, Volume 11, Number 2, 1 - 8

Alessandro Luchetti1, Matteo Zanetti1, Denis Kalkofen2, Mariolino De Cecco1
1 Department of Industrial Engineering, University of Trento, Sommarive 9, 38123 Trento, Italy
2 Institute of Computer Graphics and Vision, Graz University of Technology, Rechbauerstraße 12, 8010 Graz, Austria

Section: RESEARCH PAPER
Keywords: Omnidirectional cameras; mesh reconstruction; camera pose estimation; optimization; enhanced comprehension
Citation: Alessandro Luchetti, Matteo Zanetti, Denis Kalkofen, Mariolino De Cecco, Omnidirectional camera pose estimation and projective texture mapping for photorealistic 3D virtual reality experiences, Acta IMEKO, vol. 11, no. 2, article 24, June 2022, identifier: IMEKO-ACTA-11 (2022)-02-24
Section Editor: Alfredo Cigada, Politecnico di Milano, Italy
Received May 26, 2021; In final form March 21, 2022; Published June 2022
Copyright: This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was developed inside the European project MiReBooks: Mixed Reality Handbooks for Mining Education, a project funded by EIT Raw Materials.
Corresponding author: Alessandro Luchetti, e-mail: alessandro.luchetti@unitn.it

ABSTRACT
Modern virtual reality applications require the environment to be experienced at a high level of fidelity, as if it were real. Applications that deal with real scenarios must capture both the three-dimensional (3D) structure of the environment and its visual detail so that users can have a good immersive experience. The purpose of this paper is to illustrate a method to obtain a mesh with a high-quality texture by combining a raw 3D mesh model of the environment with 360° images. The main outcome is a mesh with a high level of photorealistic detail, which enables both good depth perception, thanks to the mesh model, and high visualization quality, thanks to the 2D resolution of modern omnidirectional cameras. The fundamental step towards this goal is the correct alignment between the 360° camera and the 3D mesh model. For this reason, we propose a method that consists of two steps: 1) estimate the pose of the 360° camera within the 3D environment; 2) project the high-quality 360° image onto the mesh. After describing the method, we outline its validation in two virtual reality scenarios, a mine and a city environment, which allows us to compare the achieved results with the ground truth.

1. INTRODUCTION
Media acquired by 360° cameras (also known as omnidirectional, spherical, or panoramic cameras) is becoming increasingly important to many applications. Compared to conventional cameras, images taken by 360° cameras offer a larger field of view, which is why they are traditionally useful for applications that derive their state from information about the environment. Examples include robot localization, navigation, and visual servoing [1]. However, omnidirectional cameras have recently also become an essential tool for content creation in Virtual Reality (VR) applications because spherical photographs and videos can provide a high level of realism. For example, applications for real estate agents already make use of omnidirectional images and video data within a VR head-mounted display to improve the realism of virtual customer inspections, and research domains span widely from 360° tourism [2] to education in 360° classrooms [3].
VR applications using omnidirectional media allow their users to change the view within the boundaries of a 360° image that has been captured at a specific Point of Interest (POI). Thus, VR users are commonly restricted to head rotations only, while translations require transitioning to a 360° image that has been captured at a different POI [4]. As a result, motion parallax is missing in VR applications that use omnidirectional data. Furthermore, view transitions are limited to where omnidirectional images or videos exist. These shortcomings limit the benefit of omnidirectional media in VR. For example, the missing 3D information restricts the usage of advanced exploration techniques [5], [6] and the missing motion parallax can cause visual discomfort [7].
To overcome these limitations, we propose combining photorealistic omnidirectional image data with its corresponding 3D representation. Since 3D reconstructions commonly suffer from poor color representations, we apply projective texture mapping of omnidirectional images. Our approach supports photorealistic image fidelity at the POIs and motion parallax at nearby viewpoints. To enable projective texture mapping of 360° image data, we present an approach for omnidirectional camera pose estimation that automatically finds the position and orientation of the 360° camera relative to the 3D representation of the environment.
To put our work in context, we first outline related work in Section 2, before we describe our approaches to omnidirectional camera pose estimation and projective texture mapping in Section 3. We evaluate our system in Section 4 and discuss possible directions for future work in Section 5.

2. RELATED WORK
Camera pose detection has always been a key problem in computer vision. For example, Makadia et al. [8] proposed a method to estimate large rotations directly from images defined on the sphere, without correspondences, with potential impact on 3D shape alignment. Unfortunately, this approach is robust only to small translations of the camera [9]. Another work [10] addresses the problem of camera pose recovery from spherical panoramas using pairwise essential matrices. In that case, recovering the exact position of each panorama was an important step to ensure the consistency of the visual information in a database of geo-referenced images. There, pose recovery uses a two-stage algorithm that estimates rotations first and translations afterwards, and it gives poor results if the starting camera pose is very far from the correct one. Our method overcomes the above-mentioned problems because it also works for large variations in translation as well as in rotation. Levin and Szeliski [11] also present a method to compute camera pose from a sequence of spherical images, using an essential matrix for the initial pairwise geometry. Differently from our work and from [10], they additionally use a rough estimate of the camera path as a system input to calculate the camera positions.
An example of generating a texture map for a 3D model from high-quality 2D images is given in [12], for the specific application of presenting shoes in e-commerce. It consists of a texture mapping technique that comprises several phases: mesh partitioning, mesh parameterization and packing, texture transferring, and texture correction and optimization. In the texture transferring step, each mesh is allocated to a front image, all meshes that use the same front image are put in a group, and finally the pixels of the front image corresponding to the 3D mesh are extracted. In contrast, our method uses only a spherical image to recreate the high-resolution 3D model, by projecting each pixel of the image from the camera pose found beforehand. The results are obtained faster and are good as long as the user's field of view rotates without large displacements with respect to the camera pose.
A similar approach, but for another application related to surveying tasks in architectural, archaeological, and cultural landscape conservation, is provided by Abmayr et al. [13]. They developed a laser scanner, which offers high-accuracy measurements of object surfaces, combined with a panoramic color camera, to achieve precise and accurate monitoring of the actual environment by means of colored point clouds. The camera rotates on the same tripod as the laser scanner. There are many similarities with the method described in the present article. The main differences reside in our use of a single 360° camera instead of a rotating unit, and in the use of an automatic pose estimation method instead of mounting the camera on the same tripod as the laser scanner during the acquisition process. Our method is faster, and the 3D model reconstruction can be more complete because the camera does not need to be at a fixed distance from the scanner during the scanning process. This aspect becomes more important when a high-resolution model has to be reconstructed from different cameras at unknown positions.
Finally, an interesting study was provided by Teo et al. [14], where, in the context of remote collaboration, helpers shared 360° live videos or 3D virtual reconstructions of their surroundings from different places to work together with local workers. The results showed that participants preferred having both the 360° and the 3D modes, as this provides variation in controls and features from different perspectives. Our work combines a 360° live video and a 3D virtual reconstruction so that their advantages are merged without the need to switch between them.

3. METHOD
This section explains the localization algorithm used to estimate the camera pose (i.e., its position and orientation in the environment) and the method used to project the texture onto a 3D representation of the environment.

3.1. Camera pose estimation
A good alignment between the virtual environment and the captured image is fundamental for the final texture projection that will be covered in the next section. This step is necessary because, even when an operator places the camera at a predefined position and orientation, human errors can be made during this operation, so a method to find an accurate camera pose is required. Moreover, at large distances even small angular or position errors can compromise the final result. The large-scale automatic camera pose identification algorithm was implemented in Matlab 2019b, using the ZMQ messaging library [15] for communication between Matlab and Unity 3D.
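To illustrate this kind of optimizer-renderer link, the following Python sketch shows one possible request/reply exchange over ZeroMQ. The endpoint address, the message layout (a JSON pose in, raw pixels out), and the use of Python on both ends are assumptions of the sketch; the actual implementation links Matlab and Unity 3D.

```python
# Minimal sketch of an optimizer <-> renderer exchange over ZeroMQ (REQ/REP).
# The endpoint, message layout, and image encoding are illustrative assumptions;
# the pipeline described in the paper links Matlab and Unity 3D instead.
import json
import numpy as np
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")   # hypothetical Unity-side endpoint

def render_equirectangular(pose, height=128, width=256):
    """Request the equirectangular image rendered from `pose`.

    `pose` holds the six parameters adjusted by the optimizer described
    below: a translation (x, y, z) in metres and Euler angles
    (alpha, beta, gamma) in degrees.
    """
    socket.send_string(json.dumps(pose))
    reply = socket.recv()                # raw grayscale bytes (assumed format)
    return np.frombuffer(reply, dtype=np.uint8).reshape(height, width)

# One candidate pose evaluated during the optimization loop.
candidate = {"x": 1.0, "y": 0.5, "z": -2.0, "alpha": 5.0, "beta": -10.0, "gamma": 0.0}
image = render_equirectangular(candidate)
```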
A Particle Swarm Optimization (PSO) algorithm was used. The procedure for the camera pose estimation is shown in Figure 1. Starting from the reconstructed 3D model, which has a low-quality texture but carries the depth information of the environment, and given as input a high-quality photorealistic equirectangular image taken by an omnidirectional camera, the localization algorithm finds the pose at which a simulated camera produces a 360° image as similar as possible to the input one.

Figure 1. Schematic diagram of the camera pose detection algorithm.

In particular:
i. A new camera pose is set at each iteration of the PSO algorithm.
ii. The equirectangular image corresponding to the camera pose set at the previous step is acquired.
iii. The algorithm checks the similarity between the new image and the input one that has to be used as the new texture for the 3D mesh; the parameters to be optimized are the translation and the Euler angles to be applied to the 3D model so as to generate an equirectangular image that matches the input one. The cost function for comparing the two equirectangular images uses the following quantities:
• The structural similarity (SSIM) index of the equirectangular images.
• The mean-squared error (MSE) between the two equirectangular images.
• The SSIM of the approximation coefficients (SSIM_A) of the level-1 wavelet decomposition.
• The SSIM of the horizontal detail coefficients (SSIM_H) of the level-1 wavelet decomposition.
• The SSIM of the vertical detail coefficients (SSIM_V) of the level-1 wavelet decomposition.
• The SSIM of the diagonal detail coefficients (SSIM_D) of the level-1 wavelet decomposition.
The final cost function C, obtained by adding the above-mentioned quantities, is

$C = \mathrm{SSIM} + \mathrm{MSE} + \mathrm{SSIM}_A + \mathrm{SSIM}_H + \mathrm{SSIM}_V + \mathrm{SSIM}_D$ .  (1)

The MSE represents the cumulative squared error between two images x(m, n) and y(m, n):

$\mathrm{MSE}(x,y) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ x(m,n) - y(m,n) \right]^2$ ,  (2)

where M and N are the number of rows and columns of x and y.
The SSIM is used for measuring the similarity between two images x and y [16]. The SSIM quality assessment index is based on the computation of three terms, namely the luminance term l, the contrast term c, and the structural term s. The overall index is a multiplicative combination of the three terms:

$\mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha} \, [c(x,y)]^{\beta} \, [s(x,y)]^{\gamma}$ ,  (3)

where:

$l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$ ,  (4)

$c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$ ,  (5)

$s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$ .  (6)

Here μ_x, μ_y, σ_x, σ_y, and σ_xy are the local means, standard deviations, and cross-covariance of the images x and y. C_1, C_2, and C_3 are constants that avoid instability in image regions where the local mean or standard deviation is close to zero. Choosing α = β = γ = 1 and $C_3 = C_2/2$, the index simplifies to

$\mathrm{SSIM}(x,y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$ .  (7)

iv. The PSO optimization runs until convergence, giving as output the best camera pose (translation and Euler angles), i.e., the one that makes the two images as similar as possible. An illustrative implementation of the cost function in Equation (1) is sketched below.
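As a concrete reference for Equation (1), the following Python sketch evaluates the cost for two grayscale equirectangular images. The paper's implementation is in Matlab; the use of scikit-image and PyWavelets, the Haar wavelet basis, and images normalized to [0, 1] are assumptions of this sketch, since the text does not specify them.

```python
# Illustrative Python version of the cost in Equation (1); the actual pipeline
# is implemented in Matlab. The Haar basis and the [0, 1] grayscale range are
# assumptions, as the text does not state which wavelet or scaling is used.
import numpy as np
import pywt
from skimage.metrics import structural_similarity as ssim

def cost(x, y):
    """Compare two equirectangular images x, y (2D float arrays in [0, 1])."""
    mse = np.mean((x - y) ** 2)              # Equation (2)
    s = ssim(x, y, data_range=1.0)           # Equation (7)

    # Level-1 2D wavelet decomposition: approximation coefficients plus
    # horizontal, vertical, and diagonal detail coefficients.
    xA, (xH, xV, xD) = pywt.dwt2(x, "haar")
    yA, (yH, yV, yD) = pywt.dwt2(y, "haar")

    def band_ssim(a, b):
        rng = max(a.max() - a.min(), b.max() - b.min(), 1e-6)
        return ssim(a, b, data_range=rng)

    return (s + mse
            + band_ssim(xA, yA) + band_ssim(xH, yH)    # SSIM_A, SSIM_H
            + band_ssim(xV, yV) + band_ssim(xD, yD))   # SSIM_V, SSIM_D
```

In the optimization loop, this score would be evaluated on the 256 × 128 downsampled images described in Section 4, once for every candidate pose proposed by the PSO.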
3.2. Texture projection
This section describes the method used to apply the high-quality texture mapping. Essentially, the high-quality 360° image is merged with the 3D mesh. First, the 3D Cartesian coordinates and the color of each pixel of the 360° image were obtained by projecting the equirectangular image onto the surface of a sphere of unit radius. Given an equirectangular image with N rows and M columns, each pixel with 2D coordinates (n, m) was transformed into spherical coordinates by computing the corresponding azimuth a and elevation e, with the radius R set to 1. The equations used for the conversion are

$a = -\left(\frac{m}{M} - 0.5\right) \cdot 2\pi$ ,  (8)
$e = -\left(\frac{n}{N} - 0.5\right) \cdot \pi$ ,  (9)
$R = 1$ .  (10)

Finally, the 3D Cartesian coordinates are obtained, which can be visualized in Matlab as a 3D point cloud. The mapping from spherical to 3D Cartesian coordinates is

$x = R \cdot \cos(e) \cdot \cos(a)$ ,  (11)
$y = R \cdot \cos(e) \cdot \sin(a)$ ,  (12)
$z = R \cdot \sin(e)$ .  (13)

This "spherical" point cloud was imported into Unity and placed at the position and orientation found in the pose estimation step of the previous section. The Raycasting technique was then used: through the Ray class, it is possible to emit, or "cast", rays in a 3D environment and handle the resulting collisions. The rays used in Raycasting are invisible lines that have the center of the image sphere as their origin and are oriented in the direction of each pixel. The important point is that the rays cast into the scene can return information about the GameObjects they hit. A Mesh Collider is attached to the environment's mesh GameObject in Unity so that hits with the rays can be registered. When a ray intersects, or "hits", a GameObject, the event is referred to as a RaycastHit. The hit provides details about the GameObject and where it was hit, including a reference to the GameObject's Transform, the length of the ray at the hit, and the point in the world where the hit happened. Once the collision of each pixel ray is detected, the hit position is saved together with the pixel's color.
Lastly, the new point cloud was used to reconstruct a high-quality photorealistic texture, using the Screened Poisson Surface Reconstruction algorithm [17] implemented in Meshlab [18]. This algorithm is particularly useful when the model to reconstruct is very large and has very fine details to be preserved. The reconstruction of the 3D model was done with the Reconstruction Depth parameter (i.e., the maximum depth of the octree used for the reconstruction) set to 13. The Meshlab default for this parameter is 8; we increased it because, in general, the higher this value, the longer the reconstruction takes and the more detail is preserved [17]. We did not increase it further because beyond this value there is no appreciable change in the final result. The Minimum Number of Samples was set to 1.5 and the Interpolation Weight to 4, the default values of Meshlab. Since the Poisson algorithm tends to "close" the reconstructed mesh, the triangles whose area was above a certain threshold were deleted to preserve the original form of the reconstructed environment. The projection step is summarized in the sketch below.
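The following Python sketch summarizes the projection step: Equations (8)-(13) turn each equirectangular pixel into a ray from the estimated camera pose, and each ray is intersected with the mesh to obtain a colored 3D point. A brute-force Möller-Trumbore triangle test stands in here for Unity's raycasting, so this is a slow reference illustration under that assumption, not the authors' Unity implementation.

```python
# Sketch of the projection in Section 3.2: each equirectangular pixel becomes
# a ray through Equations (8)-(13) and is intersected with the mesh triangles.
# The brute-force Moller-Trumbore test below stands in for Unity's raycasting.
import numpy as np

def pixel_rays(N, M):
    """Unit ray directions for an N x M equirectangular image (Equations 8-13)."""
    n, m = np.meshgrid(np.arange(N), np.arange(M), indexing="ij")
    a = -(m / M - 0.5) * 2.0 * np.pi          # azimuth, Equation (8)
    e = -(n / N - 0.5) * np.pi                # elevation, Equation (9)
    d = np.stack([np.cos(e) * np.cos(a),      # Equations (11)-(13) with R = 1
                  np.cos(e) * np.sin(a),
                  np.sin(e)], axis=-1)
    return d.reshape(-1, 3)

def raycast(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore ray/triangle test; returns the hit distance or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1.dot(p)
    if abs(det) < eps:
        return None
    t_vec = origin - v0
    u = t_vec.dot(p) / det
    q = np.cross(t_vec, e1)
    v = direction.dot(q) / det
    t = e2.dot(q) / det
    if u < 0 or v < 0 or u + v > 1 or t <= 0:
        return None
    return t

def project_texture(camera_pos, camera_rot, image, triangles):
    """Colour the mesh hit points with the corresponding image pixels.

    camera_rot is a 3x3 rotation matrix for the pose found in Section 3.1,
    image has shape (N, M, 3), and triangles has shape (T, 3, 3).
    """
    N, M = image.shape[:2]
    colored_points = []
    for ray_dir, color in zip(pixel_rays(N, M) @ camera_rot.T,
                              image.reshape(-1, image.shape[-1])):
        hits = [t for tri in triangles
                if (t := raycast(camera_pos, ray_dir, *tri)) is not None]
        if hits:                               # keep the closest surface hit
            colored_points.append((camera_pos + min(hits) * ray_dir, color))
    return colored_points
```

In the actual pipeline, Unity's mesh collider and raycasts perform this intersection far more efficiently; the sketch only makes the geometry of the projection explicit.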
4. EVALUATION
For the validation of the camera pose localization algorithm and of the high-quality texture mapping projection, Wavefront 3D Object files (OBJ extension) of two high-quality virtual outdoor environments, one for a mine and one for a city, were imported into the Unity 3D platform. An original script was also written to simulate a 360° camera. The 360° capture technique is based on Google's Omni-directional Stereo (ODS) technology using Cubemap rendering [19]. After the Cubemap is generated, it can be converted into an equirectangular map, which is the projection format used by 360° video players.
After placing the simulated camera at a specific pose inside the scene of a given scenario, a high-quality equirectangular image was acquired (Figure 2). These are the input images whose pose has to be detected by the developed algorithm. To simulate the acquisition of the environment with a 3D scanner, a point cloud of each analyzed environment was extracted from the high-quality 3D models using the CloudCompare software [20]. These point clouds were downsampled to simulate a 3D model with less detail than the input model, and new reconstructions were performed in MeshLab [18] to obtain new low-quality 3D models (Figure 3). New scenes were then recreated in Unity with the downsampled 3D models.

Figure 2. High-quality equirectangular images whose poses must be identified, for the mine (a) and city (b) environments.
Figure 3. The downsampled 3D models used by the localization algorithm for the mine (a) and city (b) environments.

Figure 4 shows the schematic diagram of our camera pose detection algorithm of Figure 1 applied to the specific example of the mine environment. The input omnidirectional image has a resolution of 4096 × 2048 pixels. However, to reduce the computation time, the images are downsampled to 256 × 128 pixels before comparison, for both of the analyzed environments. The bounding box of the mine scenario measures 113 m × 169 m × 37 m along the x, y, and z axes, respectively, while the city environment measures 440 m × 100 m × 435 m.
The same analysis was carried out for both environments using the same approach and shifting the camera pose by the same values. Table 1 shows the position and orientation for 10 random trials. The initial starting position was set to the origin (0, 0, 0) with null rotations for each trial. The search limits were set to ±20 m for translations and ±80° for rotations. By default, Unity applies the following rotation order: extrinsic rotation around the z-axis (γ), then around the x-axis (α), and finally around the y-axis (β). The average time spent by the PSO algorithm is around 20 minutes. The tests were run on a PC with an Intel i7-9700KF processor and 64 GB of RAM.
For each of the 10 trials of Table 1, the PSO algorithm was run five times with a changing number of generations (200, 250, 300, 350, 400) and the number of particles fixed at 100, and five times with a changing number of particles (60, 70, 80, 90, 100) and the number of generations fixed at 400. The number of generations and particles was varied to force the algorithm to increase its variability.
To compute the error in pose detection, we separated the translation and the rotation parts. The translation error is the Euclidean distance between the camera position found by the PSO algorithm and the ground truth. As for the rotations, the rotation found by the optimization process and the ground-truth rotation were first decomposed into axis-angle notation. Consequently, the rotation error has two terms: the error in the orientation of the rotation axis with respect to the ground truth and the error in the amount of rotation around that axis. Figure 5 shows the cost function score plotted against each of the error components explained above, while Figure 6 shows the three possible pairwise combinations of the error components plotted against the final optimization score.
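To make this error decomposition concrete, the sketch below compares an estimated pose with a ground-truth pose using SciPy. The evaluation in the paper is done in Matlab; mapping Unity's extrinsic z-x-y rotation order onto SciPy's "zxy" Euler convention, and the example estimate itself, are assumptions of this sketch.

```python
# Illustrative computation of the pose errors described above; the "zxy"
# extrinsic Euler convention used to mirror Unity's rotation order is an
# assumption of this sketch, as is the example estimate.
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_errors(t_est, euler_est, t_gt, euler_gt):
    """Translation error (m), axis-orientation error (deg), rotation-amount error (deg).

    euler_* are (gamma, alpha, beta) in degrees: extrinsic rotations about
    z, then x, then y, following the Unity order stated in the text.
    """
    translation_error = np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))

    def axis_angle(euler):
        # Decompose the rotation into axis-angle form.
        rotvec = R.from_euler("zxy", euler, degrees=True).as_rotvec()
        angle = np.linalg.norm(rotvec)
        axis = rotvec / angle if angle > 0 else np.array([0.0, 0.0, 1.0])
        return axis, np.degrees(angle)

    axis_est, angle_est = axis_angle(euler_est)
    axis_gt, angle_gt = axis_angle(euler_gt)

    axis_error = np.degrees(np.arccos(np.clip(np.dot(axis_est, axis_gt), -1.0, 1.0)))
    angle_error = abs(angle_est - angle_gt)
    return translation_error, axis_error, angle_error

# Example with trial 1 of Table 1 as ground truth and a hypothetical estimate.
print(pose_errors([-3.8, 10.1, 15.2], [17.5, 9.8, 15.3],
                  [-4.0, 10.0, 15.0], [18.0, 10.0, 15.0]))
```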
As can be noticed, a higher cost function score at the end of the optimization does not always mean that an incorrect pose was found. This is probably due to the mesh reconstruction process: after this process, some portions of the environment can be less accurate than the real model. For this reason, the final score values are not absolute and are not easily comparable across different camera poses, which creates the need to quantify the accuracy of the camera localization within a scene.

Table 1. Camera poses chosen for the 10 trials (ground truth).

Trial   x (m)    y (m)    z (m)    α (°)    β (°)    γ (°)
1       -4.00    10.00    15.00    10.00    15.00    18.00
2        5.00    -2.00     5.00    10.00   -60.00     1.00
3       -8.00     5.00    -6.00    30.00    45.00    15.00
4        2.00    -7.00    15.00   -10.00   -45.00   -20.00
5       10.00    10.00    10.00    20.00   -15.00     5.00
6        0.00    15.00     8.00    25.00   -15.00     5.00
7       -5.00     2.00    -5.00   -10.00    60.00    -1.00
8       -1.00    -2.00    -3.00    -4.00    -5.00    -6.00
9      -15.00    10.00    10.00    40.00    70.00    40.00
10     -19.00    19.00   -19.00     2.00    80.00    -5.00

Figure 4. Example of the camera pose detection algorithm flow for the mine environment.
Figure 5. 2D plots of the cost function score vs the errors in translation (a), axis orientation (b), and rotation angle (c).

Despite the uncertainty in how the final cost function score relates to the accuracy of the pose found by the algorithm, Figure 5 and Figure 6 show that, for the mine environment, a score below 1.6 means that, for the trials performed, the error in translation is below 0.7 m, the difference in the amount of rotation is below 1°, and the difference in the rotation axis orientation is below 2°. For the city environment, the same error levels correspond to a cost function score of 2. The score is higher because the city is a scenario with much more detail than the mine. Many of these details are lost in the initial downsampling, so the reconstructed mesh is much less detailed, as can be seen in Figure 3b. The final score, which measures the similarity between the input high-quality equirectangular image and the one obtained from this low-quality model, therefore turns out to be higher. However, the errors, especially those related to rotations (Figure 5b and Figure 5c), are lower for the city environment even at high cost function scores because the environment is more varied.
Because the cost function threshold depends on the level of detail of the reconstructed 3D model, further analysis is needed to investigate possible acceptance criteria and multidimensional models capable of correlating the different terms of the cost function with the uncertainty in translation and rotation. For example, Figure 7 shows that the MSE could be a possible discriminant factor for accuracy: in this case, the accurate solutions are all centered around a value of 0.005 for both examined environments.

Figure 6. 3D plots of the cost function score and the errors in translation, rotation angle, and axis orientation.
Figure 7. MSE score vs translation error.

Once the camera pose was found for each environment, this information is used to place the 360° image, projected onto the surface of a unit-radius sphere, at the correct position and orientation (Figure 8a). After that, using the Raycasting technique, the 3D mesh (Figure 8b) is hit by the 360° image pixels (Figure 8c).

Figure 8. The pixels of the 360° image of the mine environment are projected onto a sphere surface (a), which is placed at the correct camera pose found by our algorithm inside the raw 3D mesh (b). The pixels are then projected onto the raw mesh using the ray casting technique, obtaining a new dense point cloud (c).
Figure 9. Final results after the 3D reconstruction for the mine (a) and the city (b) environments.
5. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an approach for combining photorealistic image data with 3D environment representations, using a high-quality 360° image and a low-quality 3D model of the environment. At the core of our system, we developed an approach for automatic large-scale 360° camera pose estimation within a 3D environment and a method for projective texture mapping of spherical images. Contrary to the previous work outlined in the related work section, the camera pose estimator developed in this paper works for significant differences in both rotation and displacement, and it works without the need to start from a known point of view. The positions and orientations of the camera were estimated with a translation error below 0.7 m, and with errors below 1° and 2° for the amount of rotation and for the orientation of the rotation axis, respectively. These results were obtained for both environments analyzed at full size and with search limits of ±20 m for translations and ±80° for rotations, using an MSE of 0.005 as a possible discriminant factor for accuracy.
While this work was validated using a 360° camera simulated in virtual scenes, we plan to test its capability on real scenes as well. In such situations, the light conditions could be very different between the model and the equirectangular image, which is why the luminance has to be considered carefully. Furthermore, the approach presented here is valid as long as the user's view rotates without large displacements from the camera's initial position, because not all mesh areas are covered after the pixel projection. To overcome this problem, the same method presented in this paper can be applied with more than one camera; however, for the final reconstruction of the texture there is currently no discriminating parameter that allows us to choose which pixels to use from one camera or another. Such a choice would be useful when the field of view of one camera covers some mesh areas better than another, and it can be implemented in future work. Finally, in the camera pose optimization process, a further study can be carried out to find a correlation between the different terms of the cost function and the uncertainty in translation and rotation, by investigating other possible acceptance criteria through a multidimensional analysis.

REFERENCES
[1] R. Benosman, S. Kang, O. Faugeras, Panoramic Vision, Springer Verlag, 2000, ISBN 978-0387951119.
[2] J. Hakulinen, T. Keskinen, V. Mäkelä, S. Saarinen, M. Turunen, Omnidirectional video in museums - authentic, immersive and entertaining, International Conference on Advances in Computer Entertainment, Springer, 2017, pp. 567-587. DOI: 10.1007/978-3-319-76270-8_39
[3] D. Kalkofen, S. Mori, T. Ladinig, L. Daling, A. Abdelrazeq, M. Ebner, M. Ortega, S. Feiel, S. Gabl, T. Shepel, J. Tibbett, T. H. Laine, M. Hitch, C. Drebenstedt, P. Moser, Tools for teaching mining students in virtual reality based on 360° video experiences, IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Atlanta, GA, USA, 2020, pp. 455-459. DOI: 10.1109/VRW50115.2020.00096
[4] A. MacQuarrie, A. Steed, The effect of transition type in multi-view 360° media, IEEE Transactions on Visualization and Computer Graphics 24(4) (2018), pp. 1564-1573. DOI: 10.1109/TVCG.2018.2793561
[5] M. Tatzgern, R. Grasset, D. Kalkofen, D. Schmalstieg, Transitional augmented reality navigation for live captured scenes, IEEE Virtual Reality (VR), 2014, pp. 21-26. DOI: 10.1109/VR.2014.6802045
[6] M. Tatzgern, R. Grasset, E. Veas, D. Kalkofen, H. Seichter, D. Schmalstieg, Exploring real world points of interest: Design and evaluation of object-centric exploration techniques for augmented reality, Pervasive and Mobile Computing 18 (2015), pp. 55-70. DOI: 10.1016/j.pmcj.2014.08.010
[7] J. Thatte, B. Girod, Towards perceptual evaluation of six degrees of freedom virtual reality rendering from stacked omnistereo representation, Electronic Imaging, 2018. DOI: 10.2352/ISSN.2470-1173.2018.05.PMII-352
[8] A. Makadia, K. Daniilidis, Rotation recovery from spherical images without correspondences, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7) (2006), pp. 1170-1175. DOI: 10.1109/TPAMI.2006.150
[9] A. Makadia, K. Daniilidis, Direct 3D-rotation estimation from spherical images via a generalized shift theorem, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, Madison, WI, USA, 2003, pp. II-217. DOI: 10.1109/CVPR.2003.1211473
[10] R. Laganiere, F. Kangni, Orientation and pose estimation of panoramic imagery, Machine Graphics & Vision 19(3) (2010), pp. 339-363.
[11] A. Levin, R. Szeliski, Visual odometry and map correlation, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, Washington, DC, USA, 2004. DOI: 10.1109/CVPR.2004.1315088
[12] J.-Y. Lai, T.-C. Wu, W. Phothong, D. W. Wang, C.-Y. Liao, J.-Y. Lee, A high-resolution texture mapping technique for 3D textured model, Applied Sciences 8(11) (2018), p. 2228. DOI: 10.3390/app8112228
[13] T. Abmayr, F. Härtl, M. Mettenleiter, I. Heinz, A. Hildebrand, B. Neumann, C. Fröhlich, Realistic 3D reconstruction - combining laserscan data with RGB color information, Proceedings of the ISPRS International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 35 (2004), pp. 198-203.
[14] T. Teo, L. Lawrence, G. A. Lee, M. Billinghurst, M. Adcock, Mixed reality remote collaboration combining 360 video and 3D reconstruction, Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1-14. DOI: 10.1145/3290605.3300431
[15] P. Hintjens, ZeroMQ: Messaging for Many Applications, O'Reilly Media, Inc., 2013, ISBN 9781449334062.
[16] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13(4) (2004), pp. 600-612. DOI: 10.1109/TIP.2003.819861
[17] M. Kazhdan, H. Hoppe, Screened Poisson surface reconstruction, ACM Transactions on Graphics (ToG) 32(3) (2013), pp. 1-13. DOI: 10.1145/2487228.2487237
[18] P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, G. Ranzuglia, MeshLab: an open-source mesh processing tool, Eurographics Italian Chapter Conference, Salerno, 2008, pp. 129-136. DOI: 10.2312/LocalChapterEvents/ItalChap/ItalianChapConf2008/129-136
[19] Google Inc., Rendering omni-directional stereo content. Online [Accessed 21 March 2022] https://developers.google.com/vr/jump/rendering-ods-content.pdf
[20] D. Girardeau-Montaut, CloudCompare, 2016. Online [Accessed 21 March 2022] https://www.danielgm.net/cc