Acta Polytechnica Vol. 50 No. 4/2010

Projective 3D Reconstruction of Uncalibrated Endoscopic Images

P. Faltin, A. Behrens

Abstract

The most common medical diagnostic method for urinary bladder cancer is cystoscopy. This inspection of the bladder is performed with a rigid endoscope, which is usually guided close to the bladder wall. This causes a very limited field of view; the difficulty of navigation is aggravated by the usage of angled endoscopes. These factors cause difficulties in orientation and visual control. To overcome this problem, this paper presents a method for extracting 3D information from uncalibrated endoscopic image sequences and for reconstructing the scene content. The method uses the SURF algorithm to extract features from the images and relates the images by advanced matching. To stabilize the matching, the epipolar geometry is extracted for each image pair using a modified RANSAC algorithm. Afterwards these matched point pairs are used to generate point triplets over three images and to describe the trifocal geometry. The 3D scene points are determined by applying triangulation to the matched image points. These points are then used to generate a projective 3D reconstruction of the scene, and provide the first step towards further metric reconstructions.

Keywords: 3D reconstruction, uncalibrated camera, epipolar geometry, trifocal geometry, bladder, cystoscopy, endoscopy.

1 Introduction

With about 68,810 new cases in 2008 in the United States [1], bladder cancer is a common disease of the urinary system. Tumors are usually inspected and treated by endoscopic interventions. A urological intervention of the bladder and urethra is also called cystoscopy. The cystoscope is inserted into the bladder through the urethra, which allows an inspection of the bladder wall. The inspection is usually performed close to the bladder wall, which is why the field of view is very limited. A possible way to improve the difficult orientation is, e.g., to use an image mosaicking algorithm [2] to provide a panoramic overview image, or to generate a 3D model of the bladder. This paper presents a method for extracting 3D information from an uncalibrated endoscopic image sequence, which is then used for a projective 3D reconstruction of the bladder. In further steps, this information can be used for auto-calibration of the camera, which leads to the desired metric reconstruction.

The paper is organized as follows: In section 2 the image preprocessing, the mathematical background and the reconstruction algorithms are described. In section 3 the evaluated image sequences and the results are presented. Finally, section 4 summarizes the results and gives prospects for future work.

2 Reconstruction

2.1 Imaging

The image sequences are acquired by a rigid video endoscope system, in this case an Olympus EVIS EXERA II platform. A CCD camera is attached to the ocular of the cystoscope and delivers the data to a workstation. To illuminate the organ, a light source is coupled into the cystoscope. To increase the field of view, endoscopes usually have fish-eye optics. A typical setup is shown in fig. 1. The RealTimeFrame software framework [3] is used for real-time processing of the endoscopic data. This software allows very rapid prototyping of algorithms.

Fig. 1: A schematic view of a rigid cystoscope

2.2 Distortion correction

The cystoscope optics produce a strongly radially distorted image, which has to be corrected to extract valid 3D information. To compensate for this distortion, the method of Hartley and Kang [4] is applied to each image in a preprocessing step. The radial distortion is modeled by the function

\vec{x}_d = \vec{z} + \lambda(r) \cdot (\vec{x}_u - \vec{z})   (1)

with distorted point x_d, center of distortion z, a function λ(r) depending on the radius, and corrected point x_u. λ(r) is not based on a fixed model function but is determined dynamically, resulting in a very precise distortion correction. An example using the implementation from [8] is shown in fig. 2.

Fig. 2: Image distorted (left) by an endoscope, and image after correction (right)
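To make the correction step concrete, the following sketch applies eq. 1 in reverse to undistort detected feature points. It is only a minimal illustration, not the Hartley-Kang method itself: the center of distortion z and a tabulated λ(r) are assumed to be given by a prior calibration step, and λ is assumed here to be parameterized over the distorted radius, so that the inversion reduces to a single division. All names and the placeholder λ(r) are hypothetical.

```python
# A minimal sketch of inverting eq. (1) to undistort detected image
# points. Assumption: lambda(r) is tabulated over the *distorted*
# radius r_d = |x_d - z|, so x_u = z + (x_d - z) / lambda(r_d).
import numpy as np

def undistort_points(pts_d, z, r_samples, lam_samples):
    """pts_d: (N, 2) distorted points; z: (2,) center of distortion;
    r_samples/lam_samples: sampled lambda(r) from a calibration step."""
    offsets = pts_d - z                           # x_d - z
    r_d = np.linalg.norm(offsets, axis=1)         # distorted radii
    lam = np.interp(r_d, r_samples, lam_samples)  # lambda(r) lookup
    return z + offsets / lam[:, None]             # invert eq. (1)

# hypothetical usage with a placeholder barrel-distortion profile
z = np.array([320.0, 240.0])
r = np.linspace(0.0, 400.0, 64)
lam = 1.0 / (1.0 + 1e-6 * r**2)                   # placeholder lambda(r)
pts = np.array([[100.0, 80.0], [400.0, 300.0]])
print(undistort_points(pts, z, r, lam))
```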
2.3 Feature detection

Feature detection is accomplished by the SURF algorithm [5], which extracts and describes distinctive points in each image independently of their scale, position and rotation. To detect points of interest, a Hessian matrix containing the approximated second-order partial derivatives of a Gaussian, realized by the box filters shown in fig. 3, is used. The extracted features are described by an analysis of their surrounding area via Haar wavelets. The results are stored in the descriptor vectors of the features. A simple comparison of two features can be made by summing the absolute differences of these vectors.

Fig. 3: Box filters in Hessian matrix

2.4 2-View correspondences

A simple method for generating correspondences over two images is brute-force matching. During this process, each feature f1 from the first image is compared with every feature f2 from the second image, and the f2 which minimizes the difference between the feature vectors is chosen. A correspondence is identified if the difference is less than a given threshold. This process usually results in a high number of wrong correspondences. Thus, advanced matching is applied.

In addition to forward matching, where the best f2 for each f1 is chosen, backward matching is applied by choosing the best f1 for each f2. A correspondence is identified only if these two matchings agree. To improve the robustness, a slight restriction of the features' scale and orientation, to a factor of two and to 45° respectively, is also applied. This assumption is valid since the position of the endoscope does not change much between two sequential images in a real bladder inspection. The last check is whether the detected best correspondence is reliable: its distance d_best is compared with the distance d_2ndbest of the second-best candidate, and the correspondence is rejected if the ratio d_best/d_2ndbest exceeds a threshold τ.
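The sketch below illustrates this advanced matching under stated assumptions: the SURF descriptors, scales and orientations of both images are given as NumPy arrays, the sum of absolute differences serves as the feature distance, and the threshold τ is a hypothetical value, since the paper does not specify its setting.

```python
# A minimal sketch of the advanced matching of section 2.4: mutual
# (forward-backward) nearest neighbors over SURF descriptors, gated by
# a factor-two scale change, a 45-degree orientation change, and the
# ratio test d_best/d_2ndbest. Inputs are assumed to come from a SURF
# detector; tau = 0.8 is a hypothetical choice.
import numpy as np

def advanced_match(d1, d2, s1, s2, a1, a2, tau=0.8):
    # pairwise L1 distances (sum of absolute descriptor differences)
    dist = np.abs(d1[:, None, :] - d2[None, :, :]).sum(axis=2)
    fwd = dist.argmin(axis=1)                    # best f2 for each f1
    bwd = dist.argmin(axis=0)                    # best f1 for each f2
    matches = []
    for i, j in enumerate(fwd):
        if bwd[j] != i:                          # mutual consistency
            continue
        if not (0.5 <= s1[i] / s2[j] <= 2.0):    # scale within factor 2
            continue
        dang = abs((a1[i] - a2[j] + 180.0) % 360.0 - 180.0)
        if dang > 45.0:                          # orientation within 45 deg
            continue
        row = np.sort(dist[i])                   # best and second-best
        if row[0] > tau * row[1]:                # reject unreliable match
            continue
        matches.append((i, j))
    return matches
```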
2.5 Epipolar geometry

Epipolar geometry describes the setup of two cameras looking at the same scene from different points of view. While in this section only the basic fundamentals are described, more details can be found in [14]. An exemplary camera setup showing the camera centers C and C' is given in fig. 4. The 3D point X is projected onto the image planes, resulting in the points x and x'. The points e and e' are called epipoles, and they represent the projected camera centers on the image planes. The position, orientation and properties of the two cameras are described by the fundamental matrix F. It has seven degrees of freedom and rank 2. Only the intrinsic geometry is described, which is why the fundamental matrix is independent of the scene content. A central equation for understanding epipolar geometry is the epipolar relation

\vec{x}'^T \cdot F \cdot \vec{x} = 0   (2)

which connects an image point on one image plane with its corresponding point on the other image plane.

Epipolar lines for each image point can be calculated using the fundamental matrix. Such a line passes through the position of the corresponding point on the other image plane, and all these lines intersect in the epipoles. They can be calculated by

\vec{l}' = F \cdot \vec{x}  or  \vec{l} = F^T \cdot \vec{x}'.   (3)

A geometric interpretation of eq. 3 is visualized in fig. 5.

Fig. 4: Epipolar geometry

Fig. 5: Epipolar line

The different frames of the sequence are interpreted as different views. This is correct if there is a camera movement between the frames and if the scene is stationary.

The RANSAC algorithm [6] is applied to estimate the fundamental matrix. This algorithm draws a random sample of correspondences and calculates a fundamental matrix from it. Using backprojection, each corresponding point pair is then classified as an inlier or an outlier: to be classified as an inlier, its reprojection error has to be smaller than a threshold. For an acceptable computation time, the Sampson approximation [9] is used to determine the error. This process is repeated iteratively for other random samples. Finally, the inliers of the fundamental matrix which yields the largest number of inliers are chosen to calculate the final fundamental matrix, and all outliers are eliminated. The RANSAC algorithm uses the 7-point algorithm to calculate the fundamental matrices. The final matrix is estimated using the 8-point algorithm [10], which can handle eight or more points and provides a least-squares solution. Both algorithms use a system of equations constructed from eq. 2.

2.6 2-View camera matrices

To perform a reconstruction, at least two camera matrices are required. If the first camera is chosen as P = [I | 0], the second camera matrix is defined by

P' = [[\vec{e}']_{\times} \cdot F + \vec{e}' \cdot \vec{v}^T \,|\, \lambda \cdot \vec{e}'].   (4)

The fundamental matrix F and the epipole e' are known, but the scalar λ and the vector v are unknown. Correspondingly, there are four degrees of freedom in the choice of the second camera. Therefore a scene reconstruction based on eq. 4 is subject to a projective transformation compared to the original scene, as shown in [11]. Fig. 6 shows an example. Without any camera calibration or additional scene information, no metric reconstruction from two views is feasible.

Fig. 6: Reconstruction with projective transformation
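A minimal sketch of this two-view setup is given below. It delegates the robust estimation of F to OpenCV's RANSAC-based findFundamentalMat rather than to the paper's modified RANSAC variant, and it chooses the free parameters of eq. 4 as v = 0 and λ = 1, which yields the common canonical camera pair P = [I | 0], P' = [[e']×F | e'].

```python
# A minimal sketch: RANSAC-based estimation of F (section 2.5, here
# via OpenCV instead of the paper's own modified RANSAC) followed by
# the canonical camera pair of eq. (4) with v = 0 and lambda = 1.
# pts1/pts2 are assumed (N, 2) float arrays of matched, undistorted
# points.
import numpy as np
import cv2

def skew(a):
    """[a]_x of eq. (11): the skew-symmetric cross-product matrix."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def camera_pair_from_matches(pts1, pts2):
    # RANSAC inlier selection and least-squares refinement of F
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                            ransacReprojThreshold=1.0,
                                            confidence=0.99)
    # the epipole e' is the left null vector of F: F^T e' = 0
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])   # P  = [I | 0]
    P2 = np.hstack([skew(e2) @ F, e2[:, None]])     # eq. (4), v=0, lam=1
    return P1, P2, F, inlier_mask
```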
2.7 3-View correspondences

It seems to be a straightforward process to connect two matched image pairs sharing one common image into an image triplet, using the two-view correspondences. In practice, however, the SURF algorithm cannot always detect identical features in all three images, because of view changes and noisy image data. Additionally, not all correspondences can be identified. This results in situations where not every feature can be retrieved in all images, as visualized in fig. 7. As can easily be seen, only a small number of correspondences, indicated by surrounding circles, share the same corresponding point in image i+1. To increase the number of correspondences over three images, an additional matching process from the first to the third view is performed. This step may introduce new incorrect matches, which have to be considered. Thus, an advanced RANSAC algorithm is used to join the set of tracked correspondences and the set of directly matched correspondences, and to detect outliers. The set of tracked correspondences contains a high proportion of valid ones. To benefit from this fact, the RANSAC algorithm fills its samples with a higher probability from the tracked set than from the directly matched set of correspondences. But even if the tracked correspondences are verified by epipolar matching, they should not be trusted unconditionally, because they can still be wrong, as fig. 8 shows. The points x'_1 and x'_2 are both located on the epipolar line which corresponds to the point x_1. This implies that the epipolar relation is fulfilled, and a correspondence between x_1 and x'_1 or between x_1 and x'_2 appears to be correct in the second view. Only in the third view is it possible to identify the wrong correspondence between x_1 and x''_2.

Fig. 7: Three image correspondences

Fig. 8: Wrong correspondence detected in three views

2.8 Trifocal geometry

The mathematical description of trifocal geometry uses tensor notation. A good introduction to this topic can be found in [7] and [14]. In this paper, tensors are written in calligraphic letters and the Einstein summation convention is used.

Trifocal geometry describes the setup of three different views of the same scene. Like epipolar geometry, trifocal geometry is intrinsic and thus independent of the scene content. A sample configuration is shown in fig. 8, with the camera centers C, C' and C''. The 3D point X_1 is projected to the image points x_1, x'_1 and x''_1. The epipoles e' and e'' represent the projected camera center of the first camera on the image planes of the second and third views. The first camera is defined by P = [I | 0], whereby the second camera is P' = [A | e'] and the third camera is P'' = [B | e'']. The properties and the relations of these cameras are described by the trifocal tensor \mathcal{T}. This is a 3×3×3 third-order tensor with 18 degrees of freedom. The reduction from 27 parameters to 18 degrees of freedom is caused by the internal constraint

\mathcal{T}_i^{jk} = \mathcal{P}'^{j}_{i} \mathcal{P}''^{k}_{4} - \mathcal{P}'^{j}_{4} \mathcal{P}''^{k}_{i}   (5)

with the camera matrices P' and P'' written in tensor notation as \mathcal{P}' and \mathcal{P}''. By analogy with the epipolar relation \vec{x}'^T \cdot F \cdot \vec{x} = 0 of two-view geometry, trifocal geometry yields

\mathcal{X}^i \mathcal{X}'^j \mathcal{X}''^k \mathcal{E}_{jpr} \mathcal{E}_{kqs} \mathcal{T}_i^{pq} = 0_{rs}.   (6)

The tensor \mathcal{E} in eq. 6 – called the Levi-Civita symbol – represents the constant third-order 3×3×3 tensor

\mathcal{E}_{rst} = \begin{cases} 0 & \text{if } r, s, t \text{ are not distinct} \\ +1 & \text{if } (r, s, t) \text{ is an even permutation} \\ -1 & \text{if } (r, s, t) \text{ is an odd permutation} \end{cases}   (7)

with r, s, t ∈ {1, 2, 3}. The trifocal tensor can also be written in matrix notation using the three 3×3 matrices

(T_i)_{jk} = \mathcal{T}_i^{jk} with i, j, k ∈ {1, 2, 3}.   (8)

This notation can be used to extract the fundamental matrices between two different views from the trifocal tensor using the equations

F_{21} = [\vec{e}']_{\times} \cdot [T_1, T_2, T_3] \cdot \vec{e}''   (9)

and

F_{31} = [\vec{e}'']_{\times} \cdot [T_1^T, T_2^T, T_3^T] \cdot \vec{e}'.   (10)

Here the notation \vec{a}^T \cdot [M_1, M_2, M_3] \cdot \vec{b} represents the vector (\vec{a}^T M_1 \vec{b}, \vec{a}^T M_2 \vec{b}, \vec{a}^T M_3 \vec{b}), and [\vec{x}]_{\times} denotes the skew-symmetric matrix, which for a vector a is given by

[\vec{a}]_{\times} = \begin{pmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{pmatrix}.   (11)

The trifocal tensor is calculated by the normalized linear algorithm [14] including algebraic minimization. The basic idea is to solve a system of equations generated from eq. 6 and to enforce the internal constraints given by eq. 5.
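As an illustration of eqs. 9–11, the following sketch recovers the two fundamental matrices from the matrix slices T_1, T_2, T_3 of an estimated trifocal tensor. The epipoles e' and e'' are assumed to have been extracted from the tensor beforehand (e.g., from the null vectors of the slices, as described in [14]); the column-wise reading of the bracket notation, with the i-th column given by [e']× T_i e'', follows the vector definition given above.

```python
# A minimal sketch of eqs. (9)-(11): recovering F21 and F31 from the
# slices T[i] = T_i of a (3, 3, 3) trifocal tensor. Reading the bracket
# notation column-wise, the i-th column of F21 is [e']_x T_i e'' and
# the i-th column of F31 is [e'']_x T_i^T e'. Epipoles e2 (= e') and
# e3 (= e'') are assumed to be extracted from the tensor beforehand.
import numpy as np

def skew(a):
    """[a]_x of eq. (11)."""
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def fundamental_from_trifocal(T, e2, e3):
    """T: (3, 3, 3) array of tensor slices; e2, e3: epipoles e', e''."""
    F21 = skew(e2) @ np.stack([T[i] @ e3 for i in range(3)], axis=1)
    F31 = skew(e3) @ np.stack([T[i].T @ e2 for i in range(3)], axis=1)
    return F21, F31
```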
2.9 Triangulation

After calculating the camera matrices, the 3D points can be reconstructed using triangulation. The concept is that the projection lines through the camera centers C and C' and the image points x and x' intersect in the 3D point X, as shown in fig. 4. To calculate X, the equation

\begin{pmatrix} \bar{x} \cdot \vec{p}^{3T} - \vec{p}^{1T} \\ \bar{y} \cdot \vec{p}^{3T} - \vec{p}^{2T} \\ \bar{x}' \cdot \vec{p}'^{3T} - \vec{p}'^{1T} \\ \bar{y}' \cdot \vec{p}'^{3T} - \vec{p}'^{2T} \end{pmatrix} \cdot \vec{X} = \vec{0}   (12)

is solved, where \vec{p}^{iT} and \vec{p}'^{iT} are the i-th row vectors of P and P'. Since subpixel positions can only be determined by interpolation, and additional distortion is induced by the camera system, the detected image points are noisy. As a result, the two projection lines generally do not meet in space but form two skew lines. To overcome this problem, the image points x and x' are adjusted to points \bar{x} and \bar{x}' which satisfy the epipolar relation exactly, while the sum of squared Euclidean distances d(x, \bar{x})^2 + d(x', \bar{x}')^2 is minimized.
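The homogeneous system of eq. 12 can be solved directly with a singular value decomposition, taking X as the right null vector of the stacked coefficient matrix; under noise this yields a least-squares solution. The sketch below assumes the image points have already been corrected to \bar{x} and \bar{x}' as described above.

```python
# A minimal sketch of the homogeneous triangulation of eq. (12): the
# four equations are stacked into a matrix A, and X is taken as the
# right singular vector of A with the smallest singular value. The
# epipolar correction of the input points is assumed to be done.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: (3, 4) camera matrices; x1, x2: (2,) corrected points."""
    A = np.vstack([x1[0] * P1[2] - P1[0],    # x  * p^3T  - p^1T
                   x1[1] * P1[2] - P1[1],    # y  * p^3T  - p^2T
                   x2[0] * P2[2] - P2[0],    # x' * p'^3T - p'^1T
                   x2[1] * P2[2] - P2[1]])   # y' * p'^3T - p'^2T
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                               # right null vector of A
    return X[:3] / X[3]                      # dehomogenize
```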
3 Results

The four different endoscopic video sequences from fig. 9 are used to analyze the different steps of the algorithm.

Fig. 9: Test sequences: a) Pisa, b) Rome, c) Train Station, d) Wooden House

Boxplots are used to compare the data statistically over all frames of the sequences. The lower and upper box borders indicate the values below which 25 % and 75 % of all measured values lie, the line inside the box is the median, and the whiskers indicate the 5 % and 95 % percentiles.

Fig. 10: Correspondences for two views

Fig. 10 analyzes the matching process for three views, as described in section 2.7. The number of correspondences is shown on the left side of the boxplots; on the right side, the related reprojection error is shown. The first row shows the results from the two-view process using the advanced matching from section 2.5, compared to the three-view process. Analyzing the "Wooden House" sequence, it is observable that the number of tracked correspondences (1–3), about 60, is significantly lower than the number of directly identified correspondences (1–2), which is about 125. The advanced matching between the first and third views (1–3) performs only slightly worse than the matching from the first view to the second view (1–2). This can be explained by the larger temporal distance between the images, which leads to a higher variation of the image data. In the fourth row, the tracked and the newly matched correspondences are joined using the advanced RANSAC algorithm (1–3), which leads to a higher number of about 150 correspondences, compared to the matching between the first and second views. Finally, only correspondences present in all three views (1–2–3), called triplets, are selected, which reduces the total number to about 90. The mean error remains consistently low at about 0.3 pixels. Only the tracked correspondences (1–3) show a slightly higher error and variation before the RANSAC algorithm is applied. This is caused by sporadic wrong correspondences, as described in section 2.7.

Finally, the reprojection error of the estimated trifocal tensor is analyzed. For this step, two fundamental matrices are calculated from the trifocal tensor by eq. 9 (1–2) and eq. 10 (1–3). Fig. 11 shows nearly the same subpixel error for all sequences, of about 0.3 pixels (1–2) and about 0.6 pixels (1–3), respectively. Compared to epipolar geometry, the temporal distance has a stronger impact here. Since no RANSAC algorithm has yet been applied for estimating the trifocal tensor, outliers have a direct impact on the error value.

Fig. 11: Trifocal error

An exemplary reconstruction of the "Pisa" sequence is shown in fig. 12. In the left image all detected features are shown on the image plane, and in the right image a 3D reconstruction from these points is shown. Corresponding points have the same color. The reconstruction is compressed in the x-direction, caused by the projective transformation described in section 2.6.

Fig. 12: Image from sequence "Pisa" and related reconstruction

4 Summary and prospects

This paper presents a method for reconstructing 3D scene content from uncalibrated endoscopic sequences based on SURF features. The different steps yield robust results by using the RANSAC algorithm in adapted forms. It has been shown that the epipolar geometry and the trifocal geometry can be extracted with high precision, whereby a subpixel-precise reconstruction is possible. An important application of trifocal geometry is the extraction of consistent camera matrices for the whole sequence by a linear method, as in [12] or [13]. Subsequently, these cameras can be used for auto-calibration [14], which allows a metric reconstruction.

Acknowledgement

We would like to thank Prof. Dr.-Ing. Til Aach, Institute of Imaging and Computer Vision, RWTH Aachen University, Germany, for supervising this project. Additionally, we would like to thank Olympus Winter & Ibe GmbH for providing a video endoscope system.

References

[1] American Cancer Society: Cancer Facts & Figures 2008. American Cancer Society, 2008.

[2] Behrens, A., Bommes, M., Stehle, T., Gross, S., Leonhardt, S., Aach, T.: A Multi-Threaded Mosaicking Algorithm for Fast Image Composition of Fluorescence Bladder Images. Medical Imaging 2010: Visualization, Image-Guided Procedures, and Modeling, Vol. 7625, San Diego, CA, USA, 2010.

[3] Gross, S., Behrens, A., Stehle, T.: Rapid Development of Video Processing Algorithms with RealTimeFrame. Conference Book Biomedica, Liege, Belgium, 2009, pp. 217–220.

[4] Hartley, R., Kang, S. B.: Parameter-Free Radial Distortion Correction with Center of Distortion Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, Vol. 29, No. 8, pp. 1309–1321.

[5] Bay, H., Ess, A., Tuytelaars, T., Gool, L. V.: Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 2008, Vol. 110, No. 3, pp. 346–359.

[6] Fischler, M. A., Bolles, R. C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981, Vol. 24, No. 6, pp. 381–395.

[7] Triggs, W.: The Geometry of Projective Reconstruction I: Matching Constraints and the Joint Image. Proc. International Conference on Computer Vision, Boston, MA, USA, 1995, pp. 338–343.

[8] Stehle, T., Truhn, D., Aach, T., Trautwein, C., Tischendorf, J.: Camera Calibration for Fish-Eye Lenses in Endoscopy with an Application to 3D Reconstruction. Proceedings IEEE International Symposium on Biomedical Imaging, Washington, D.C., USA, 2007.

[9] Sampson, P. D.: Fitting Conic Sections to 'Very Scattered' Data: An Iterative Refinement of the Bookstein Algorithm. Computer Graphics and Image Processing, 1982, Vol. 18, No. 1, pp. 97–108.

[10] Hartley, R.: In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, Vol. 19, No. 6, pp. 580–593.

[11] Pollefeys, M.: Self-Calibration and Metric 3D Reconstruction from Uncalibrated Image Sequences. PhD Thesis, Leuven, Belgium, 1999.

[12] Triggs, B.: Linear Projective Reconstruction from Matching Tensors. Image and Vision Computing, 1997, Vol. 15, Issue 8, pp. 617–625.

[13] Avidan, S., Shashua, A.: Threading Fundamental Matrices.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, Vol. 23, No. 1, pp. 73–77.

[14] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. 2nd ed. Cambridge University Press, 2003.

About the authors

Peter FALTIN was born in 1983 in Cologne, Germany. He studied Computer Engineering at RWTH Aachen University and received his Dipl.-Ing. in 2010. Since then he has held a PhD position at the Institute of Imaging & Computer Vision at RWTH Aachen University. His research focuses on medical image processing, video processing, signal processing and computer vision.

Alexander BEHRENS was born in Bückeburg, Germany in 1980. He received a Dipl.-Ing. degree in electrical engineering from the Leibniz University of Hannover, Hannover, Germany, in 2006. After receiving the degree, he worked as a Research Scientist at the university's Institut für Informationsverarbeitung. Since 2007, he has been a Ph.D. candidate at the Institute of Imaging and Computer Vision, RWTH Aachen University, Aachen, Germany. His research interests are in medical image processing, signal processing, pattern recognition, and computer vision.

Peter Faltin
Alexander Behrens
E-mail: Peter.Faltin@lfb.rwth-aachen.de
Alexander.Behrens@lfb.rwth-aachen.de

Institute of Imaging & Computer Vision
RWTH Aachen University
D-52056 Aachen, Germany