Acta Polytechnica Vol. 50 No. 4/2010

Projective 3D Reconstruction of Uncalibrated Endoscopic Images

P. Faltin, A. Behrens

Abstract

The most common medical diagnostic method for urinary bladder cancer is cystoscopy. This inspection of the bladder is performed with a rigid endoscope, which is usually guided close to the bladder wall. This causes a very limited field of view; the difficulty of navigation is aggravated by the usage of angled endoscopes. These factors cause difficulties in orientation and visual control. To overcome this problem, this paper presents a method for extracting 3D information from uncalibrated endoscopic image sequences and for reconstructing the scene content. The method uses the SURF algorithm to extract features from the images and relates the images by advanced matching. To stabilize the matching, the epipolar geometry is extracted for each image pair using a modified RANSAC algorithm. Afterwards these matched point pairs are used to generate point triplets over three images and to describe the trifocal geometry. The 3D scene points are determined by applying triangulation to the matched image points. These points are then used to generate a projective 3D reconstruction of the scene, and provide the first step towards further metric reconstructions.

Keywords: 3D reconstruction, uncalibrated camera, epipolar geometry, trifocal geometry, bladder, cystoscopy, endoscopy.

1 Introduction

With about 68,810 new cases in 2008 in the United States [1], bladder cancer is a common disease of the urinary system. Tumors are usually inspected and treated by endoscopic interventions. A urological intervention of the bladder and urethra is also called cystoscopy. The cystoscope is inserted into the bladder through the urethra, which allows an inspection of the bladder wall. The inspection is usually performed close to the bladder wall, which is why the field of view is very limited. A possible way to improve the difficult orientation is, e.g., to use an image mosaicking algorithm [2] to provide a panoramic overview image, or to generate a 3D model of the bladder. This paper presents a method for extracting 3D information from an uncalibrated endoscopic image sequence, which is then used for a projective 3D reconstruction of the bladder. In further steps, this information can be used for auto-calibration of the camera, which leads to the desired metric reconstruction.

The paper is organized as follows: In section 2 the image preprocessing, the mathematical background and the reconstruction algorithms are described. In section 3 the evaluated image sequences and the results are presented. Finally, section 4 summarizes the results and gives prospects for future work.

2 Reconstruction

2.1 Imaging

The image sequences are acquired by a rigid video endoscope system, in this case an Olympus EVIS EXERA II platform. A CCD camera is attached to the ocular of the cystoscope and delivers the data to a workstation. To illuminate the organ, a light source is coupled into the cystoscope. To increase the field of view, endoscopes usually have fish-eye optics. A typical setup is shown in fig. 1. The RealTimeFrame software framework [3] is used for real-time processing of the endoscopic data. This software allows very rapid prototyping of algorithms.

Fig. 1: A schematic view of a rigid cystoscope

2.2 Distortion correction

The cystoscope optics produce a strongly radially distorted image, which has to be corrected to extract valid 3D information. To compensate for this distortion, the method of Hartley and Kang [4] is applied to each image in a preprocessing step. The radial distortion is modeled by the function

\vec{x}_d = \vec{z} + \lambda(r) \cdot (\vec{x}_u - \vec{z})   (1)

with distorted point x_d, center of distortion z, a function λ(r) depending on the radius, and corrected point x_u. λ(r) is not based on a fixed model function but is determined dynamically, resulting in a very precise distortion correction. An example using the implementation from [8] is shown in fig. 2.

Fig. 2: Image distorted (left) by an endoscope, and image after correction (right)
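To make the correction step concrete, the following sketch applies eq. 1 in reverse to undistort detected feature points. It is only a minimal illustration, not the Hartley-Kang method itself: the center of distortion z and a tabulated λ(r) are assumed to be given by a prior calibration step, and λ is assumed here to be parameterized over the distorted radius, so that the inversion reduces to a single division. All names and the placeholder λ(r) are hypothetical.

```python
# A minimal sketch of inverting eq. (1) to undistort detected image
# points. Assumption: lambda(r) is tabulated over the *distorted*
# radius r_d = |x_d - z|, so x_u = z + (x_d - z) / lambda(r_d).
import numpy as np

def undistort_points(pts_d, z, r_samples, lam_samples):
    """pts_d: (N, 2) distorted points; z: (2,) center of distortion;
    r_samples/lam_samples: sampled lambda(r) from a calibration step."""
    offsets = pts_d - z                           # x_d - z
    r_d = np.linalg.norm(offsets, axis=1)         # distorted radii
    lam = np.interp(r_d, r_samples, lam_samples)  # lambda(r) lookup
    return z + offsets / lam[:, None]             # invert eq. (1)

# hypothetical usage with a placeholder barrel-distortion profile
z = np.array([320.0, 240.0])
r = np.linspace(0.0, 400.0, 64)
lam = 1.0 / (1.0 + 1e-6 * r**2)                   # placeholder lambda(r)
pts = np.array([[100.0, 80.0], [400.0, 300.0]])
print(undistort_points(pts, z, r, lam))
```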
2.3 Feature detection

Feature detection is accomplished by the SURF algorithm [5], which extracts and describes distinctive points in each image independently of their scale, position and rotation. To detect points of interest, a Hessian matrix containing the approximated second-order partial derivatives of a Gaussian, realized by the box filters shown in fig. 3, is used. The extracted features are described by an analysis of their surrounding area via Haar wavelets. The results are stored in the descriptor vectors of the features. A simple comparison of two features can be made by summing the absolute differences of these vectors.

Fig. 3: Box filters in Hessian matrix

2.4 2-View correspondences

A simple method for generating correspondences over two images is brute-force matching. During this process, each feature f1 from the first image is compared with every feature f2 from the second image, and the f2 which minimizes the difference between the feature vectors is chosen. A correspondence is identified if the difference is less than a given threshold. This process usually results in a high number of wrong correspondences. Thus, advanced matching is applied.

In addition to forward matching, where the best f2 for each f1 is chosen, backward matching is applied by choosing the best f1 for each f2. A correspondence is identified only if these two matchings agree. To improve the robustness, a slight restriction of the features' scale and orientation, to a factor of two and to 45° respectively, is also applied. This assumption is valid since the position of the endoscope does not change much between two sequential images in a real bladder inspection. The last check is whether the detected best correspondence is reliable: its distance d_best is compared with the distance d_2ndbest of the second-best candidate, and the correspondence is rejected if the ratio d_best/d_2ndbest exceeds a threshold τ.
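The sketch below illustrates this advanced matching under stated assumptions: the SURF descriptors, scales and orientations of both images are given as NumPy arrays, the sum of absolute differences serves as the feature distance, and the threshold τ is a hypothetical value, since the paper does not specify its setting.

```python
# A minimal sketch of the advanced matching of section 2.4: mutual
# (forward-backward) nearest neighbors over SURF descriptors, gated by
# a factor-two scale change, a 45-degree orientation change, and the
# ratio test d_best/d_2ndbest. Inputs are assumed to come from a SURF
# detector; tau = 0.8 is a hypothetical choice.
import numpy as np

def advanced_match(d1, d2, s1, s2, a1, a2, tau=0.8):
    # pairwise L1 distances (sum of absolute descriptor differences)
    dist = np.abs(d1[:, None, :] - d2[None, :, :]).sum(axis=2)
    fwd = dist.argmin(axis=1)                    # best f2 for each f1
    bwd = dist.argmin(axis=0)                    # best f1 for each f2
    matches = []
    for i, j in enumerate(fwd):
        if bwd[j] != i:                          # mutual consistency
            continue
        if not (0.5 <= s1[i] / s2[j] <= 2.0):    # scale within factor 2
            continue
        dang = abs((a1[i] - a2[j] + 180.0) % 360.0 - 180.0)
        if dang > 45.0:                          # orientation within 45 deg
            continue
        row = np.sort(dist[i])                   # best and second-best
        if row[0] > tau * row[1]:                # reject unreliable match
            continue
        matches.append((i, j))
    return matches
```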
2.5 Epipolar geometry

Epipolar geometry describes the setup of two cameras looking at the same scene from different points of view. While in this section only the basic fundamentals are described, more details can be found in [14]. An exemplary camera setup showing the camera centers C and C' is given in fig. 4. The 3D point X is projected onto the image planes, resulting in the points x and x'. The points e and e' are called epipoles, and they represent the projected camera centers on the image planes. The position, orientation and properties of the two cameras are described by the fundamental matrix F. It has seven degrees of freedom and rank 2. Only the intrinsic geometry is described, which is why the fundamental matrix is independent of the scene content. A central equation for understanding epipolar geometry is the epipolar relation

\vec{x}'^T \cdot F \cdot \vec{x} = 0   (2)

which connects an image point on one image plane with its corresponding point on the other image plane.

Epipolar lines for each image point can be calculated using the fundamental matrix. Such a line passes through the position of the corresponding point on the other image plane, and all these lines intersect in the epipoles. They can be calculated by

\vec{l}' = F \cdot \vec{x}  or  \vec{l} = F^T \cdot \vec{x}'.   (3)

A geometric interpretation of eq. 3 is visualized in fig. 5.

Fig. 4: Epipolar geometry

Fig. 5: Epipolar line

The different frames of the sequence are interpreted as different views. This is correct if there is a camera movement between the frames and if the scene is stationary.

The RANSAC algorithm [6] is applied to estimate the fundamental matrix. This algorithm draws a random sample of correspondences and calculates a fundamental matrix from it. Using backprojection, each corresponding point pair is then classified as an inlier or an outlier: to be classified as an inlier, its reprojection error has to be smaller than a threshold. For an acceptable computation time, the Sampson approximation [9] is used to determine the error. This process is repeated iteratively for other random samples. Finally, the inliers of the fundamental matrix which yields the largest number of inliers are chosen to calculate the final fundamental matrix, and all outliers are eliminated. The RANSAC algorithm uses the 7-point algorithm to calculate the fundamental matrices. The final matrix is estimated using the 8-point algorithm [10], which can handle eight or more points and provides a least-squares solution. Both algorithms use a system of equations constructed from eq. 2.

2.6 2-View camera matrices

To perform a reconstruction, at least two camera matrices are required. If the first camera is chosen as P = [I | 0], the second camera matrix is defined by

P' = [[\vec{e}']_{\times} \cdot F + \vec{e}' \cdot \vec{v}^T \,|\, \lambda \cdot \vec{e}'].   (4)

The fundamental matrix F and the epipole e' are known, but the scalar λ and the vector v are unknown. Correspondingly, there are four degrees of freedom in the choice of the second camera. Therefore a scene reconstruction based on eq. 4 is subject to a projective transformation compared to the original scene, as shown in [11]. Fig. 6 shows an example. Without any camera calibration or additional scene information, no metric reconstruction from two views is feasible.

Fig. 6: Reconstruction with projective transformation
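A minimal sketch of this two-view setup is given below. It delegates the robust estimation of F to OpenCV's RANSAC-based findFundamentalMat rather than to the paper's modified RANSAC variant, and it chooses the free parameters of eq. 4 as v = 0 and λ = 1, which yields the common canonical camera pair P = [I | 0], P' = [[e']×F | e'].

```python
# A minimal sketch: RANSAC-based estimation of F (section 2.5, here
# via OpenCV instead of the paper's own modified RANSAC) followed by
# the canonical camera pair of eq. (4) with v = 0 and lambda = 1.
# pts1/pts2 are assumed (N, 2) float arrays of matched, undistorted
# points.
import numpy as np
import cv2

def skew(a):
    """[a]_x of eq. (11): the skew-symmetric cross-product matrix."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def camera_pair_from_matches(pts1, pts2):
    # RANSAC inlier selection and least-squares refinement of F
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                            ransacReprojThreshold=1.0,
                                            confidence=0.99)
    # the epipole e' is the left null vector of F: F^T e' = 0
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # P  = [I | 0]
    P2 = np.hstack([skew(e2) @ F, e2[:, None]])     # eq. (4), v=0, lam=1
    return P1, P2, F, inlier_mask
```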
2.7 3-View correspondences

It seems to be a straightforward process to connect two matched image pairs sharing one common image into an image triplet, using the two-view correspondences. In practice, however, the SURF algorithm cannot always detect identical features in all three images, because of view changes and noisy image data. Additionally, not all correspondences can be identified. This results in situations where not every feature can be retrieved in all images, as visualized in fig. 7. As can easily be seen, only a small number of correspondences, indicated by surrounding circles, share the same corresponding point in image i+1. To increase the number of correspondences over three images, an additional matching process from the first to the third view is performed. This step may introduce new incorrect matches, which have to be considered. Thus, an advanced RANSAC algorithm is used to join the set of tracked correspondences and the set of directly matched correspondences, and to detect outliers. The set of tracked correspondences contains a high proportion of valid ones. To benefit from this fact, the RANSAC algorithm fills its samples with a higher probability from the tracked set than from the directly matched set of correspondences. But even if the tracked correspondences are verified by epipolar matching, they should not be trusted unconditionally, because they can still be wrong, as fig. 8 shows. The points x'_1 and x'_2 are both located on the epipolar line which corresponds to the point x_1. This implies that the epipolar relation is fulfilled, and a correspondence between x_1 and x'_1 or between x_1 and x'_2 appears to be correct in the second view. Only in the third view is it possible to identify the wrong correspondence between x_1 and x''_2.

Fig. 7: Three image correspondences

Fig. 8: Wrong correspondence detected in three views

2.8 Trifocal geometry

The mathematical description of trifocal geometry uses tensor notation. A good introduction to this topic can be found in [7] and [14]. In this paper, tensors are written in calligraphic letters and the Einstein summation convention is used.

Trifocal geometry describes the setup of three different views of the same scene. Like epipolar geometry, trifocal geometry is intrinsic and thus independent of the scene content. A sample configuration is shown in fig. 8, with the camera centers C, C' and C''. The 3D point X_1 is projected to the image points x_1, x'_1 and x''_1. The epipoles e' and e'' represent the projected camera center of the first camera on the image planes of the second and third views. The first camera is defined by P = [I | 0], whereby the second camera is P' = [A | e'] and the third camera is P'' = [B | e'']. The properties and the relations of these cameras are described by the trifocal tensor \mathcal{T}. This is a 3×3×3 third-order tensor with 18 degrees of freedom. The reduction from 27 parameters to 18 degrees of freedom is caused by the internal constraint

\mathcal{T}_i^{jk} = \mathcal{P}'^{j}_{i} \mathcal{P}''^{k}_{4} - \mathcal{P}'^{j}_{4} \mathcal{P}''^{k}_{i}   (5)

with the camera matrices P' and P'' written in tensor notation as \mathcal{P}' and \mathcal{P}''. By analogy with the epipolar relation \vec{x}'^T \cdot F \cdot \vec{x} = 0 of two-view geometry, trifocal geometry yields

\mathcal{X}^i \mathcal{X}'^j \mathcal{X}''^k \mathcal{E}_{jpr} \mathcal{E}_{kqs} \mathcal{T}_i^{pq} = 0_{rs}.   (6)

The tensor \mathcal{E} in eq. 6 – called the Levi-Civita symbol – represents the constant third-order 3×3×3 tensor

\mathcal{E}_{rst} = \begin{cases} 0 & \text{if } r, s, t \text{ are not distinct} \\ +1 & \text{if } (r, s, t) \text{ is an even permutation} \\ -1 & \text{if } (r, s, t) \text{ is an odd permutation} \end{cases}   (7)

with r, s, t ∈ {1, 2, 3}. The trifocal tensor can also be written in matrix notation using the three 3×3 matrices

(T_i)_{jk} = \mathcal{T}_i^{jk} with i, j, k ∈ {1, 2, 3}.   (8)

This notation can be used to extract the fundamental matrices between two different views from the trifocal tensor using the equations

F_{21} = [\vec{e}']_{\times} \cdot [T_1, T_2, T_3] \cdot \vec{e}''   (9)

and

F_{31} = [\vec{e}'']_{\times} \cdot [T_1^T, T_2^T, T_3^T] \cdot \vec{e}'.   (10)

Here the notation \vec{a}^T \cdot [M_1, M_2, M_3] \cdot \vec{b} represents the vector (\vec{a}^T M_1 \vec{b}, \vec{a}^T M_2 \vec{b}, \vec{a}^T M_3 \vec{b}), and [\vec{x}]_{\times} denotes the skew-symmetric matrix, which for a vector a is given by

[\vec{a}]_{\times} = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix}.   (11)

The trifocal tensor is calculated by the normalized linear algorithm [14] including algebraic minimization. The basic idea is to solve a system of equations generated from eq. 6 and to enforce the internal constraints given by eq. 5.
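As an illustration of eqs. 9–11, the following sketch recovers the two fundamental matrices from the matrix slices T_1, T_2, T_3 of an estimated trifocal tensor. The epipoles e' and e'' are assumed to have been extracted from the tensor beforehand (e.g., from the null vectors of the slices, as described in [14]); the column-wise reading of the bracket notation, with the i-th column given by [e']× T_i e'', follows the vector definition given above.

```python
# A minimal sketch of eqs. (9)-(11): recovering F21 and F31 from the
# slices T[i] = T_i of a (3, 3, 3) trifocal tensor. Reading the bracket
# notation column-wise, the i-th column of F21 is [e']_x T_i e'' and
# the i-th column of F31 is [e'']_x T_i^T e'. Epipoles e2 (= e') and
# e3 (= e'') are assumed to be extracted from the tensor beforehand.
import numpy as np

def skew(a):
    """[a]_x of eq. (11)."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def fundamental_from_trifocal(T, e2, e3):
    """T: (3, 3, 3) array of tensor slices; e2, e3: epipoles e', e''."""
    F21 = skew(e2) @ np.stack([T[i] @ e3 for i in range(3)], axis=1)
    F31 = skew(e3) @ np.stack([T[i].T @ e2 for i in range(3)], axis=1)
    return F21, F31
```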
2.9 Triangulation

After calculating the camera matrices, the 3D points can be reconstructed using triangulation. The concept is that the projection lines through the camera centers C and C' and the image points x and x' intersect in the 3D point X, as shown in fig. 4. To calculate X, the equation

\begin{pmatrix} \bar{x} \cdot \vec{p}^{3T} - \vec{p}^{1T} \\ \bar{y} \cdot \vec{p}^{3T} - \vec{p}^{2T} \\ \bar{x}' \cdot \vec{p}'^{3T} - \vec{p}'^{1T} \\ \bar{y}' \cdot \vec{p}'^{3T} - \vec{p}'^{2T} \end{pmatrix} \cdot \vec{X} = \vec{0}   (12)

is solved, where \vec{p}^{iT} and \vec{p}'^{iT} are the i-th row vectors of P and P'. Since subpixel positions can only be determined by interpolation, and additional distortion is induced by the camera system, the detected image points are noisy. As a result, the two projection lines generally do not meet in space but form two skew lines. To overcome this problem, the image points x and x' are adjusted to points \bar{x} and \bar{x}' which satisfy the epipolar relation exactly, while the sum of squared Euclidean distances d(x, \bar{x})^2 + d(x', \bar{x}')^2 is minimized.
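The homogeneous system of eq. 12 can be solved directly with a singular value decomposition, taking X as the right null vector of the stacked coefficient matrix; under noise this yields a least-squares solution. The sketch below assumes the image points have already been corrected to \bar{x} and \bar{x}' as described above.

```python
# A minimal sketch of the homogeneous triangulation of eq. (12): the
# four equations are stacked into a matrix A, and X is taken as the
# right singular vector of A with the smallest singular value. The
# epipolar correction of the input points is assumed to be done.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) camera matrices; x1, x2: (2,) corrected points."""
    A = np.vstack([x1[0] * P1[2] - P1[0],    # x  * p^3T  - p^1T
                   x1[1] * P1[2] - P1[1],    # y  * p^3T  - p^2T
                   x2[0] * P2[2] - P2[0],    # x' * p'^3T - p'^1T
                   x2[1] * P2[2] - P2[1]])   # y' * p'^3T - p'^2T
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                               # right null vector of A
    return X[:3] / X[3]                      # dehomogenize
```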
3 Results

The four different endoscopic video sequences from fig. 9 are used to analyze the different steps of the algorithm.

Fig. 9: Test sequences: a) Pisa, b) Rome, c) Train Station, d) Wooden House

Boxplots are used to compare the data statistically over all frames of the sequences. The lower and upper box borders indicate the values below which 25 % and 75 % of all measured values lie, the line inside the box is the median, and the whiskers indicate the 5 % and 95 % percentiles.

Fig. 10: Correspondences for two views

Fig. 10 analyzes the matching process for three views, as described in section 2.7. The number of correspondences is shown on the left side of the boxplots; on the right side, the related reprojection error is shown. The first row shows the results from the two-view process using the advanced matching from section 2.5, compared to the three-view process. Analyzing the "Wooden House" sequence, it is observable that the number of tracked correspondences (1–3), about 60, is significantly lower than the number of directly identified correspondences (1–2), which is about 125. The advanced matching between the first and third views (1–3) performs only slightly worse than the matching from the first view to the second view (1–2). This can be explained by the larger temporal distance between the images, which leads to a higher variation of the image data. In the fourth row, the tracked and the newly matched correspondences are joined using the advanced RANSAC algorithm (1–3), which leads to a higher number of about 150 correspondences, compared to the matching between the first and second views. Finally, only correspondences present in all three views (1–2–3), called triplets, are selected, which reduces the total number to about 90. The mean error remains consistently low at about 0.3 pixels. Only the tracked correspondences (1–3) show a slightly higher error and variation before the RANSAC algorithm is applied. This is caused by sporadic wrong correspondences, as described in section 2.7.

Finally, the reprojection error of the estimated trifocal tensor is analyzed. For this step, two fundamental matrices are calculated from the trifocal tensor by eq. 9 (1–2) and eq. 10 (1–3). Fig. 11 shows nearly the same subpixel error for all sequences, of about 0.3 pixels (1–2) and about 0.6 pixels (1–3), respectively. Compared to epipolar geometry, the temporal distance has a stronger impact here. Since no RANSAC algorithm has yet been applied for estimating the trifocal tensor, outliers have a direct impact on the error value.

Fig. 11: Trifocal error

An exemplary reconstruction of the "Pisa" sequence is shown in fig. 12. In the left image all detected features are shown on the image plane, and in the right image a 3D reconstruction from these points is shown. Corresponding points have the same color. The reconstruction is compressed in the x-direction, caused by the projective transformation described in section 2.6.

Fig. 12: Image from sequence "Pisa" and related reconstruction

4 Summary and prospects

This paper presents a method for reconstructing 3D scene content from uncalibrated endoscopic sequences based on SURF features. The different steps yield robust results by using the RANSAC algorithm in adapted forms. It has been shown that the epipolar geometry and the trifocal geometry can be extracted with high precision, whereby a subpixel-precise reconstruction is possible. An important application of trifocal geometry is the extraction of consistent camera matrices for the whole sequence by a linear method, as in [12] or [13]. Subsequently, these cameras can be used for auto-calibration [14], which allows a metric reconstruction.

Acknowledgement

We would like to thank Prof. Dr.-Ing. Til Aach, Institute of Imaging and Computer Vision, RWTH Aachen University, Germany, for supervising this project. Additionally, we would like to thank Olympus Winter & Ibe GmbH for providing a video endoscope system.

References

[1] American Cancer Society: Cancer Facts & Figures 2008. American Cancer Society, 2008.

[2] Behrens, A., Bommes, M., Stehle, T., Gross, S., Leonhardt, S., Aach, T.: A Multi-Threaded Mosaicking Algorithm for Fast Image Composition of Fluorescence Bladder Images. Medical Imaging 2010: Visualization, Image-Guided Procedures, and Modeling, Vol. 7625, San Diego, CA, USA, 2010.

[3] Gross, S., Behrens, A., Stehle, T.: Rapid Development of Video Processing Algorithms with RealTimeFrame. Conference Book Biomedica, Liege, Belgium, 2009, pp. 217–220.

[4] Hartley, R., Kang, S. B.: Parameter-Free Radial Distortion Correction with Center of Distortion Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, Vol. 29, No. 8, pp. 1309–1321.

[5] Bay, H., Ess, A., Tuytelaars, T., Gool, L. V.: Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008, Vol. 110, No. 3, pp. 346–359.

[6] Fischler, M. A., Bolles, R. C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981, Vol. 24, No. 6, pp. 381–395.

[7] Triggs, W.: The Geometry of Projective Reconstruction I: Matching Constraints and the Joint Image. Proc. International Conference on Computer Vision, Boston, MA, USA, 1995, pp. 338–343.

[8] Stehle, T., Truhn, D., Aach, T., Trautwein, C., Tischendorf, J.: Camera Calibration for Fish-Eye Lenses in Endoscopy with an Application to 3D Reconstruction. Proceedings IEEE International Symposium on Biomedical Imaging, Washington, D.C., USA, 2007.

[9] Sampson, P. D.: Fitting Conic Sections to 'Very Scattered' Data: An Iterative Refinement of the Bookstein Algorithm. Computer Graphics and Image Processing, 1982, Vol. 18, No. 1, pp. 97–108.

[10] Hartley, R.: In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, Vol. 19, No. 6, pp. 580–593.

[11] Pollefeys, M.: Self-Calibration and Metric 3D Reconstruction from Uncalibrated Image Sequences. PhD Thesis, Leuven, Belgium, 1999.

[12] Triggs, B.: Linear Projective Reconstruction from Matching Tensors. Image and Vision Computing, 1997, Vol. 15, Issue 8, pp. 617–625.

[13] Avidan, S., Shashua, A.: Threading Fundamental Matrices.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, Vol. 23, No. 1, pp. 73–77.

[14] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. 2nd ed. Cambridge University Press, 2003.

About the authors

Peter FALTIN was born in 1983 in Cologne, Germany. He studied Computer Engineering at RWTH Aachen University and received his Dipl.-Ing. in 2010. Since then he has held a PhD position at the Institute of Imaging & Computer Vision at RWTH Aachen University. His research focuses on medical image processing, video processing, signal processing and computer vision.

Alexander BEHRENS was born in Bückeburg, Germany in 1980. He received a Dipl.-Ing. degree in electrical engineering from the Leibniz University of Hannover, Hannover, Germany, in 2006. After receiving the degree, he worked as a Research Scientist at the university's Institut für Informationsverarbeitung. Since 2007, he has been a Ph.D. candidate at the Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany. His research interests are in medical image processing, signal processing, pattern recognition, and computer vision.

Peter Faltin
Alexander Behrens
E-mail: Peter.Faltin@lfb.rwth-aachen.de
Alexander.Behrens@lfb.rwth-aachen.de

Institute of Imaging & Computer Vision
RWTH Aachen University
D-52056 Aachen, Germany