FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 33, No. 3, September 2020, pp. 379-394
https://doi.org/10.2298/FUEE2003379G
© 2020 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND

COMPARATIVE EVALUATION OF KEYPOINT DETECTORS FOR 3D DIGITAL AVATAR RECONSTRUCTION

Dušan Gajić, Gorana Gojić, Dinu Dragan, Veljko Petrović
University of Novi Sad, Faculty of Technical Sciences, Novi Sad, Serbia

Abstract. Three-dimensional personalized human avatars have been successfully utilized in shopping, entertainment, education, and health applications. However, it is still a challenging task to obtain a complete and highly detailed avatar automatically. One approach is to apply general-purpose, photogrammetry-based algorithms to a series of overlapping images of the person. We argue that the quality of avatar reconstruction can be increased by modifying parts of the photogrammetry-based pipeline to be tailored specifically to the human body shape. In this context, we perform an extensive, standalone evaluation of eleven algorithms for keypoint detection, which is the first phase of the photogrammetry-based reconstruction pipeline. We include the well-established, patented Scale-invariant feature transform (SIFT) and Speeded-up robust features (SURF) detection algorithms as a baseline, since they are widely incorporated into photogrammetry-based software. All experiments are conducted on a dataset of 378 images of the human body captured in a controlled, multi-view stereo setup. Our findings are that binary detectors significantly outperform the commonly used SIFT-like detectors in the avatar reconstruction task, both in detection speed and in the number of detected keypoints.

Key words: Detector, Photogrammetry-based reconstruction, 3D human avatar, Structure from Motion, Multi-view Stereo

1. INTRODUCTION

An avatar is a digital self-representation of a participant in a computer-generated virtual world [1] and can be represented in either two (2D) or three (3D) dimensions. The significance of 3D avatars is constantly growing due to the expansion of virtual worlds in which participants identify themselves through their avatars. Recently, avatars have been successfully employed in many applications, including entertainment [2], shopping [3], education [4], health [5], and the military [6]. For some applications, the avatar must be a highly personalized 3D representation of a person, e.g., avatars used for meeting events or virtual try-on applications [3], [7]. Since producing high-quality 3D avatars manually is a labor-intensive task, many techniques for automatic generation have been proposed. One of them is digital photogrammetry, which is the subject of the research described in this paper.

Received October 13, 2019; received in revised form January 12, 2020
Corresponding author: Dušan Gajić
University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21102 Novi Sad, Serbia
E-mail: dulegajic@gmail.com

To obtain a 3D avatar through digital photogrammetry software, a series of overlapping images showing a person from different viewpoints is first acquired. A typical photogrammetry-based pipeline consists of three phases [8]: Structure from Motion (SFM), Multi-view Stereo (MVS), and mesh creation. As input, the SFM phase receives a series of overlapping 2D images and outputs a 3D sparse point cloud. This phase relies on a triangulation process to recover 3D points from multiple 2D projections of the same 3D point present in two or more images. To identify 2D points on images that represent the same 3D point, an algorithm for point detection, also known in the literature as a detector, is applied to all input images. This locates keypoints, i.e., image patches that correspond to the 3D points which will make up the sparse point cloud. Depending on its type, a detector finds keypoints corresponding to structures known as edges, blobs, or corners. Using a description-generating algorithm (a descriptor), the detected keypoints are then matched across images to find tracks of keypoints that represent the same 3D point. For more information about the SFM, MVS, and mesh creation phases of the photogrammetry-based pipeline, we refer the reader to [8].
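To make the roles of the detector and descriptor in this front end concrete, the following is a minimal sketch of keypoint detection and matching for a single image pair, assuming OpenCV 3.x built with the opencv_contrib modules (the library family we use in Section 3); the file names are placeholders, not part of our pipeline.

// A minimal sketch of the SFM front end described above: detect keypoints in
// two overlapping views, describe them, and match the descriptors. Assumes
// OpenCV 3.x built with the opencv_contrib modules; file names are placeholders.
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

int main() {
    cv::Mat view1 = cv::imread("view_01.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat view2 = cv::imread("view_02.jpg", cv::IMREAD_GRAYSCALE);

    // The detector finds keypoints (corners, blobs, or edges) in each image
    // independently; here SIFT serves as both detector and descriptor.
    cv::Ptr<cv::Feature2D> sift = cv::xfeatures2d::SIFT::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(view1, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(view2, cv::noArray(), kp2, desc2);

    // Descriptor matching links keypoints that are likely 2D projections of
    // the same 3D point; chained across all views, such links form the tracks
    // that SFM triangulates into the sparse point cloud.
    cv::BFMatcher matcher(cv::NORM_L2, true);  // cross-check filters matches
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);
    return 0;
}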
Recently, a large number of detection algorithms have been proposed. Although minor discrepancies exist among studies evaluating detection algorithms, scale-invariant feature transform (SIFT) based algorithms are still considered to be state-of-the-art for general-purpose use. However, it has been shown that even cutting-edge commercial software solutions that use SIFT or SIFT-like algorithms in the keypoint detection phase, such as AgiSoft PhotoScan [9], have difficulties when reconstructing human avatars. Those difficulties are often caused by an insufficient number of detected keypoints in particular problem areas (e.g., the back), and ultimately result in an incomplete avatar model. According to [10], the optimal choice of detector might depend on properties of the input data. This means that SIFT and SURF might not perform best in the specific case of 3D avatar reconstruction. Additionally, the price of software used for avatar reconstruction could be reduced if a patent-free detection algorithm were to be used. Recently, patent-free detectors have been implemented in some of the leading open-source photogrammetry-based solutions, such as Meshroom [11] and OpenMVG [12]. Many of the detectors tested in our study were proposed considerably later than the SIFT and SURF algorithms, so it is expected that more of them will be implemented in photogrammetry software in the future to compensate for the shortcomings of SIFT and SURF. All detectors included in this study, including those already incorporated into photogrammetry software, can be used for human avatar reconstruction. Still, it remains a question whether detectors not yet implemented in available photogrammetry software could yield comparable or better results than those already implemented. From this viewpoint, our study can be seen as a first step to guide the implementation of human-oriented photogrammetry software. To this end, we have conducted an extensive, standalone detector evaluation study on a human-based image dataset captured in controlled conditions. The results of such a study can lead to less expensive, more widely available photogrammetry software, if it shows that free-to-use detection algorithms can replace SIFT without sacrificing quality. We evaluate eleven detectors, both binary and floating-point, in terms of the number of detected keypoints, the detection speed, and the detector's efficiency in finding keypoints in the region of an image representing a person. Our overall findings are that binary detectors significantly outperform the floating-point detectors tested in this study, including SIFT and SURF, for the task of 3D human body reconstruction.
The rest of the paper is organized as follows. In Section 2, we present a brief overview of work in the field of detector evaluation. Section 3 discusses in detail the experimental framework for the evaluation, as well as the dataset used in the experiments. We present and discuss the obtained results in Section 4. The final section offers the main conclusions, as well as possible directions for future work.

2. RELATED WORK

In this section, we give a brief overview of the work related to human body reconstruction. We start by presenting associated studies in the detector evaluation field and report established state-of-the-art results. Next, we discuss different techniques for 3D reconstruction of a clothed human body representing an avatar, with a particular interest in the model's level of detail.

2.1. Detector Evaluation

Detector evaluation has been a widely addressed topic in computer vision. An extensive standalone detector evaluation for a use case similar to ours has been presented in [13]. Ten detectors, well established at the time the paper was written, were evaluated on a dataset captured in a multi-view stereo setup showing complex, non-planar scenes such as buildings, fruits, etc. The authors evaluate detectors using three metrics: the recall rate introduced in [14], the keypoint location, and the average number of detected keypoints. To calculate the first two metrics, they use ground-truth data in the form of known camera positions and a 3D dense point cloud of a scene captured by a laser scanner. Since we do not have a precomputed 3D model of a person that can be used for recall and location calculation, we adopt the average number of keypoints as a metric in our work. The experiments conducted in [13] show that the FAST (Features from accelerated segment test) detector performed unreliably despite a large average number of detected keypoints. Although not as extensive, one of the most influential works in the field of detector evaluation is the early work of Mikolajczyk and Schmid [15]. To evaluate the performance of the detectors under an extensive set of image transformations, the authors used ground-truth homographies between image pairs to match detected keypoints. This solution for keypoint verification is commonly used in experiments performed on images showing planar scenes, which does not apply in our case, since the scenes used for our detector evaluation are non-planar. Along with standalone detector evaluation, recent studies evaluate detectors jointly with description algorithms through a feature matching task [10], [16]–[18]. Joint detector-descriptor evaluation has been appealing due to the nature of the keypoint matching problem. Keypoint matching between two images is a two-step problem: (1) keypoints are detected in both images, and (2) each keypoint is described by a descriptor algorithm of choice. Keypoint pairs are then matched by the similarity of their descriptor outputs. However, introducing descriptors into detector evaluation adds complexity to the evaluation task, since the final performance cannot be attributed solely to the detection or the description algorithm, but rather to their combination.
In [16], the SIFT, SURF, MSER (Maximally stable color regions), FAST, and ORB (Oriented FAST and rotated BRIEF) detectors are evaluated in terms of fast matching on a dataset with different geometric and photometric transformations, including rotation, scale change, viewpoint change, image blur, JPEG compression, and change in illumination. In [17], more detectors are added to the evaluation, including CenSurE, AGAST (Adaptive and generic accelerated segment test), and BRISK (Binary robust invariant scalable keypoints), over an extensive image transformation dataset composed of multiple well-known feature evaluation datasets. Although commonly employed metrics for joint detector evaluations include the repeatability score, precision and recall values, the number of keypoint correspondences, and the keypoint detection time, in our experiment we adopt just the keypoint detection time, since the other metrics are descriptor dependent. There is majority consensus among the proposed evaluation methods that FAST is one of the top-performing detectors in terms of the number of detected keypoints and detection speed. Considering detection speed, FAST is followed by other binary detectors such as ORB and AGAST [17]. Although FAST exhibits superior performance when it comes to the number of detected keypoints, it is stated in [13] that it is unreliable compared to scale-space keypoint detectors such as the Difference of Gaussians (DoG), today incorporated as part of the SIFT detector. According to the ranking proposed in [19], the best performing detector-descriptor combinations were FAST+SIFT and FAST+BRISK. In [20], a novel method for detector evaluation is introduced, based on the reconstruction of a 3D dense point cloud. Although the authors compare only the SIFT and AKAZE detectors, the method can be applied to other detection algorithms to additionally verify previously produced numerical results. As future work, we intend to incorporate a similar approach into our evaluation framework.

2.2. Human Body Reconstruction

To reconstruct a 3D body model of a clothed human, affordable image-based techniques are used as an alternative to more expensive laser scanning and structured light techniques [21]. Image-based reconstruction requires one [22]–[25] or more [26]–[29] temporally [26], [30] or spatially [27] connected images captured by RGB [22]–[25] or RGB-D [26], [27], [30] sensors. Early work in this field was directed toward general-purpose multi-view stereo algorithms. In a multi-view stereo setup, multiple sensors simultaneously capture images of the subject from different viewpoints, with a certain redundancy between the views. Although these techniques are not primarily designed for human body reconstruction, it has been demonstrated that highly detailed models can be obtained using this method [28], [29], [31], [32]. By design, multi-view stereo algorithms are sensitive to complex occlusions between the views, as well as to sparse or repeated textures [28], [33], [34]. These conditions are ubiquitous in human body reconstruction: texture issues are often caused by clothes, and occlusions by nontrivial body shapes and poses. As a result, the output body model may be missing some body parts [33], [34]. In [35], the authors minimize model incompleteness by increasing the redundancy between views in a dense multi-view stereo setup.
However, using tens to thousands of sensors in a setup significantly limits the proposed method's applicability due to the high setup price and increased reconstruction time. A different approach to the model incompleteness problem, based on the compressive sensing (CS) technique, is presented in [36]. Compressive sensing has already been used to refine depth maps that are generated in later steps of the human avatar 3D reconstruction pipeline. This technique could be used to reduce the number of sensors in the setup, with the limitation that it can be applied only in cases where the exact sensor positions are known during the image acquisition process, which is not the assumption in this paper. Still, it could be possible to apply CS to fill in missing parts of the final model reconstructed by the photogrammetry-based pipeline. It remains to be tested whether CS could recover whole body parts or just minor patches on the model. Another effort to reduce the setup price was made in [27], where more affordable, low-resolution RGB-D sensors are used instead of RGB sensors. Although RGB-D sensors improve the reconstruction results through better depth estimation, body models reconstructed in these setups lack detail due to the low sensor resolution. Another approach to reducing the setup price is to use a sparser sensor setup. This is the approach we utilize in our work. Since the redundancy between views in a sparse setup is low, models generated using such setups are more likely to be incomplete. To overcome this problem, algorithms for reconstruction from sparse setups usually do not rely solely on the input images. In [24], [26], a coarse human body template is used as a basis to overcome model incompleteness issues. The template is further modified according to the input images to obtain a personalized model of a clothed person. The main disadvantage of using a template in the reconstruction process is the inability to generate models with a high level of detail. Lately, human body reconstruction from a single image has been a topic of great interest in computer vision. The most successful single-image approaches are those based on convolutional neural networks (CNNs) [22]–[25]. There are two common CNN-based approaches to reconstructing a body model: (1) estimating human body template parameters [37], [38] or (2) directly outputting voxel occupancy in the form of a voxel grid [22], [23]. The latter approach is of more interest to this work, since it is more suitable for the reconstruction of a clothed body model. Recently, not only input color images but also segmentation masks and body landmarks have been used to successfully produce clothed body models. However, although voxel grid-based CNN reconstruction methods produce promising results both in terms of completeness and level of detail, this approach is currently limited by computational resources to a voxel grid of approximately 128×128×128 voxels, which in turn limits the achievable level of model detail. It is of particular interest to our work that the 3D model of a clothed human body is highly detailed and complete. Thus, we choose a multi-view setup with RGB cameras to capture images of a clothed subject, since currently no other method can produce models with a comparably high level of detail.
As a basis for our research, we use a general-purpose multi-view stereo reconstruction algorithm to obtain the clothed body model. In contrast to other work, we make an effort toward adapting general-purpose photogrammetry-based reconstruction algorithms to the human body reconstruction domain. To achieve that, we perform an extensive study of detector algorithms, which are used in the first step of the pipeline, to choose the best-performing detectors on a human-based dataset. In this way, we tackle the problem of improving the reconstruction detail level and completeness through improvement of the algorithm, instead of through more expensive sensor setup densification.

3. EXPERIMENTAL FRAMEWORK

In this section, we explain the sensor setup used to capture the human-body dataset used for the experiments and give a detailed description of the conducted experiments. To refer to the person who has been photographed, we use the term the subject.

3.1. Camera Setup

To capture the image data, we use a multi-view stereo setup with 54 calibrated, high-resolution RGB cameras, conceptually similar to the one described in [39] and shown in Fig. 2. During the image acquisition process, the subject stands in the center of the setup with legs slightly apart and arms positioned at an approximately 30-degree angle away from the body (the so-called A-pose) [39]. Fig. 1 gives an idea of the body areas visible in images captured by different cameras in the setup. Due to privacy concerns, we display subject silhouettes instead of color images. Almost all parts of the subject's body are visible in the captured images. The subject's soles are the exception, since they are not visible during the image acquisition process.

Fig. 1 Body coverage schema (anonymized real data)

3.2. Dataset

The dataset we conduct experiments on consists of seven image sets that we will refer to as scans. Each scan is captured with a setup similar to [39] and consists of images displaying different body parts, as illustrated in Fig. 1. To capture the scans, we use two different camera types (see Table 1). Due to the relatively sparse camera setup, the redundancy between images of a single scan is low. Images are captured from different viewpoints without precisely known camera positions. Some of the images may suffer from illumination effects. Other geometric or photometric transformations frequently tested in general-purpose evaluations, such as rotation, blur, or JPEG compression, are omitted from the dataset, since the presence of those transformations would indicate an error in the scan acquisition process.

Table 1 Scan specifications

Scan Identifier     Camera Manufacturer     Resolution
1, 2, 3, 4, 5, 6    Canon                   3456×5184
7                   Raspberry Pi            2464×3280

Fig. 2 Conceptual camera setup shown from the top (a) and the side (b). Acquisition cameras are represented as columns of dark spots. This image is taken from [36]; in our work, we use an acquisition setup similar to the one presented in the image.

3.3. Software and Hardware

All experiments are conducted on a personal computer running Windows 10 64-bit, powered by an Intel i5-6600 CPU at 3.3 GHz, with 32 GB of RAM and an Nvidia GeForce 1050 Ti graphics card. The evaluation pipeline is implemented as a single-threaded application in the C++ programming language and compiled with the Visual C++ 2015 compiler using speed optimizations (the /O2 compiler flag).
To make the experiments easily reproducible, all detector implementations used are part of the publicly available OpenCV 3.2 library [40]. We compile the library from source together with the opencv_contrib package to include support for the patented SIFT and SURF algorithms.

3.4. Detector Evaluation

We include both floating-point and binary detection algorithms in our study. Among the floating-point algorithms, we include the currently most popular SIFT [41] and SURF [42] detection algorithms, as well as STAR [43], Maximal self-dissimilarities (MSD) [44], Maximally stable color regions (MSER) [45], [46], and Good features to track (GFTT) [47]. We also evaluate a number of binary detection algorithms: Oriented FAST and rotated BRIEF (ORB) [48], Features from accelerated segment test (FAST) [49], Binary robust invariant scalable keypoints (BRISK) [50], Adaptive and generic accelerated segment test (AGAST) [51], and Accelerated KAZE (AKAZE) [52]. We include as many detectors as possible, limiting ourselves to implementations available as part of the OpenCV library. When instantiating a detector object, we use the default parameters for all detectors except ORB and GFTT, for which the maximum number of keypoints is set to 300000 instead of the much smaller default values of 500 and 1000, respectively. We experimentally choose 300000 as the upper limit for the number of detected keypoints, since none of our test images exceeds this limit under any detector. We evaluate the detectors on the scans from Table 1. Experiments are conducted both on the original scans and on scans with the background removed (so-called masked scans). To remove the background, we apply a mask image to each image of the scan. A mask is a new, binary image corresponding to the original color image, with white pixels representing the subject's body and black pixels representing the background. Each mask image is manually labeled to precisely follow the subject's outline. After the mask is applied, the image shows just the subject, while the background is made entirely white. Since our study is also directed toward setup cost reduction, we are additionally interested in detector performance on lower-resolution images, as low-resolution cameras are cheaper. Motivated by this, we test all detection algorithms on images with applied scale factors of 1, 2, 4, and 8. To downscale the original images, we use bilinear interpolation.
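The following sketch illustrates this evaluation protocol for a single image and a single detector: background masking, downscaling with bilinear interpolation, detector instantiation with the raised keypoint limits, and the indirect speed measurement used by the performance metrics defined in Section 3.4.1 below. It assumes OpenCV 3.2 with opencv_contrib; the file names and variable names are illustrative placeholders rather than our exact evaluation code.

// A minimal sketch of the evaluation protocol for one image and one detector.
// Assumes OpenCV 3.2 with opencv_contrib; file names are placeholders.
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/features2d.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat original = cv::imread("scan1_cam01.jpg");
    cv::Mat mask = cv::imread("scan1_cam01_mask.png", cv::IMREAD_GRAYSCALE);

    // Masked variant: background pixels (mask == 0) are painted white so that
    // only the subject remains visible.
    cv::Mat masked = original.clone();
    masked.setTo(cv::Scalar(255, 255, 255), mask == 0);

    // Downscaling by a scale factor of 2 using bilinear interpolation.
    cv::Mat downscaled;
    cv::resize(original, downscaled, cv::Size(), 0.5, 0.5, cv::INTER_LINEAR);

    // Detectors use default parameters, except ORB and GFTT, whose keypoint
    // limits are raised to 300000; FAST is shown as the detector under test.
    cv::Ptr<cv::Feature2D> orb = cv::ORB::create(300000);
    cv::Ptr<cv::Feature2D> gftt = cv::GFTTDetector::create(300000);
    cv::Ptr<cv::Feature2D> detector = cv::FastFeatureDetector::create();

    // Detection speed is measured indirectly as keypoints detected per second.
    std::vector<cv::KeyPoint> kpOriginal, kpMasked;
    int64 t0 = cv::getTickCount();
    detector->detect(original, kpOriginal);
    double seconds = (cv::getTickCount() - t0) / cv::getTickFrequency();
    std::cout << "keypoints/s: " << kpOriginal.size() / seconds << "\n";

    // Semantic precision: keypoints detected on the masked image divided by
    // the total number detected on the original image (subject + background).
    detector->detect(masked, kpMasked);
    std::cout << "semantic precision: "
              << static_cast<double>(kpMasked.size()) / kpOriginal.size() << "\n";
    return 0;
}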
3.4.1. Performance metrics

We use three metrics for the measurement of detector performance:

• The average number of detected keypoints, calculated for both masked and original images. This metric is important for detector evaluation, since an insufficient number of detected keypoints in the image segment showing the subject will almost certainly result in a sparse point cloud with too few points and, consequently, an incomplete avatar reconstruction. Certain areas of the human body, such as the back or legs, can be particularly challenging to reconstruct due to a lack of edges or textures, which are detected as features by some detection algorithms.

• The average number of keypoints per second = number of detected keypoints / time to detect keypoints. A large number of keypoints is necessary to reconstruct a complete and detailed avatar of a human, making keypoint detection time a significant factor in the 3D reconstruction process. Choosing a detector with a long execution time might limit the applicability of avatar reconstruction to non-realtime applications. Thus, similar to [19], we include detector execution time measurement in our study. We improve on the approach from [19] by not limiting the maximum number of keypoints detected by an algorithm. Since the number of keypoints detected by different detectors on the same image can vary, we do not measure absolute execution time as introduced in [19]. Instead, we measure a detector's execution time indirectly, as the number of keypoints detected per second.

• Semantic precision = number of keypoints detected on the image segment showing the subject / total number of keypoints detected on both the segment representing the subject and the segment representing the background. To get the first value, we apply the selected detection algorithm to the masked image. For the second value, the detection algorithm is applied to the original image, and the total number of detected features is counted. This measure is used as an indicator of detection algorithm expressiveness: a higher ratio indicates a better ability of the detector to distinguish between the subject and the background, and possibly a reduced number of bad matches and less noise later in the reconstruction process.

4. RESULTS AND DISCUSSION

This section offers our findings from the detector evaluation on the human-based dataset.

4.1. Detector Evaluation

Here we present the results of the standalone detector evaluation for each of the three aforementioned metrics.

4.1.1. The average number of detected keypoints

As mentioned earlier, the number of detected keypoints can significantly impact the later stages of the reconstruction process, since a low number of detected keypoints will undoubtedly lead to a low-quality avatar reconstruction. In Table 2, we show our findings on the average number of keypoints detected on the proposed dataset of seven scans, for different scaling factors applied to both original and masked images. In all tested scenarios, binary detectors significantly outperform SIFT and SURF, as shown in Table 3. In general, masking does not have a significant impact on the number of detected keypoints. We observe that the average number of keypoints detected on masked images can vary by up to 10% compared to the average number of keypoints detected by the same detectors on the original images. The change is positive in all cases except for the SURF and MSD detectors, even though masking discards all keypoints detected on the background; our estimate is that by eliminating the background from the input image, we emphasize the contours of the subject, which leads to an increased number of keypoints detected by the majority of detection algorithms. In the case of the SURF and MSD algorithms, the number of keypoints rejected by the mask is slightly larger than the number of newly detected keypoints on the masked images, which leads to a reduced number of keypoints detected on masked images.
Table 2 Detection algorithms ranked according to the average number of detected keypoints. The first row contains the detectors that on average detect the largest number of keypoints; the last row contains those that on average detect the smallest number. The table provides detector rankings on images with masks applied (columns Masked) and without (columns Original)

      Scale 1            Scale 2            Scale 4            Scale 8
Rank  Original  Masked   Original  Masked   Original  Masked   Original  Masked
1     ORB       ORB      ORB       ORB      AGAST     AGAST    AGAST     AGAST
2     FAST      FAST     AGAST     AGAST    FAST      FAST     ORB       ORB
3     AGAST     AGAST    FAST      FAST     ORB       ORB      FAST      FAST
4     GFTT      GFTT     GFTT      GFTT     GFTT      GFTT     GFTT      GFTT
5     BRISK     BRISK    BRISK     BRISK    SURF      SURF     SURF      SURF
6     SIFT      SIFT     SURF      SURF     BRISK     BRISK    BRISK     BRISK
7     SURF      SURF     SIFT      SIFT     SIFT      SIFT     SIFT      SIFT
8     AKAZE     AKAZE    AKAZE     AKAZE    AKAZE     AKAZE    AKAZE     MSD
9     MSD       MSD      MSD       MSD      MSD       MSD      MSD       AKAZE
10    MSER      MSER     MSER      MSER     MSER      MSER     MSER      MSER
11    STAR      STAR     STAR      STAR     STAR      STAR     STAR      STAR

Table 3 Average number of detected keypoints

Rank  Detector  Scale 1  Scale 2  Scale 4  Scale 8
1     ORB       122591   46600    10392    2823
2     FAST      115551   39980    10570    2644
3     AGAST     114616   42207    11880    3013
4     SIFT      48546    9981     1872     608
5     SURF      37910    11416    3564     1079

4.1.2. The average number of detected keypoints per second

Since we do not limit the number of detected keypoints, it would be unfair to rank the detectors directly according to execution time. Instead, we use the ratio of the number of detected keypoints to the time spent on keypoint detection, as shown in Table 4. The binary detectors FAST, AGAST, and ORB show the best overall detection speed. On both the original and masked images, FAST detects 3.5 and 4.5 times more keypoints per second than AGAST and ORB, respectively. The keypoints-per-second ratio between these three detectors persists across different scales, which is not the case when comparing FAST with the state-of-the-art SIFT and SURF detectors, where the ratio variations are not negligible. For original images and different values of the scale factor, FAST detects up to 540 times more keypoints per second than SIFT, and 137 times more than SURF. For masked images, these ratios are slightly larger. Compared to the other detectors, the MSD and MSER algorithms are highly inefficient, detecting less than a single keypoint per second on original images, and one to two keypoints per second on masked images.

Fig. 3 Average number of keypoints detected per time unit (second)

Table 4 Average number of keypoints detected per second

          Scale 1            Scale 2            Scale 4            Scale 8
Detector  Original  Masked   Original  Masked   Original  Masked   Original  Masked
AGAST     740       739      1046      1047     1224      1232     1172      1196
AKAZE     4         4        5         5        6         6        8         8
BRISK     78        77       87        87       88        88       86        88
FAST      2959      2956     3902      3878     3900      3904     3465      3516
GFTT      140       100      169       150      181       165      161       155
MSD       1         1        1         1        1         1        2         2
MSER      1         1        1         1        1         2        2         3
ORB       680       686      894       885      787       778      663       681
SIFT      11        11       9         9        7         6        9         8
STAR      3         3        8         8        8         7        17        15
SURF      3         3        8         8        8         15       17        15

Fig. 4 The ratio of detected keypoints on original scans and scans with applied masks

4.1.3. Semantic precision

Not all keypoints are equally important in the process of human reconstruction, since we would like to reconstruct the avatar of the subject with as little background noise as possible. This makes keypoints detected on the subject more important than keypoints detected on the background.
In Fig. 4, we show the ratio of the average number of keypoints detected on masked images to those detected on the original images. SURF and MSD are more likely to detect keypoints on the background for all tested values of the scale factor. The other algorithms exhibit moderate to high robustness to background keypoints, since the computed ratio indicates that the number of keypoints detected on the subject is at least equal to, or even larger than, the total number of keypoints detected on the original image.

5. CONCLUSION

In this paper, we presented an extensive evaluation of algorithms for keypoint detection in the context of 3D avatar reconstruction from an image sequence. Although similarly exhaustive evaluations of detector performance exist, we are not aware of any other study performed in the context of photogrammetry-based human body reconstruction. First, we created a human body image dataset by capturing images of seven different persons in a multi-view stereo setup under controlled lighting conditions. The dataset is used to evaluate eleven algorithms for keypoint detection, including the well-established and patented SIFT and SURF algorithms as a baseline. Our findings are mainly in agreement with the previously conducted work in [13], [17]. Binary detectors show performance superior to floating-point detectors in terms of detection speed and the number of detected keypoints. Among the binary detectors, FAST is the most efficient in terms of detection speed, detecting a considerably larger number of keypoints per second compared to the SIFT and SURF detectors, followed by ORB and AGAST. ORB, AGAST, and FAST are the top-performing detectors considering the number of detected keypoints; their performance additionally increases when they are applied to masked images. In our use case, FAST does not produce the largest number of keypoints, but it comes close to the top-performing ORB detector, with approximately 2% fewer keypoints detected. We also found that SURF and MSD, in comparison with the other detectors, discover a significant number of keypoints in the background area, meaning that the usage of these detectors in the pipeline could lead to noisy reconstructions. In future work, detectors learned by machine learning techniques will be included in the evaluation. Although advanced handcrafted detection algorithms still exhibit performance at least comparable to learned ones, machine learning is a rapidly developing area, and it can be expected that learned detectors will outperform handcrafted ones soon. Another direction for future work includes improvement of the evaluation framework. The most reliable way to estimate actual detector performance would be to produce a 3D reconstruction based on the detected keypoints. Current photogrammetry-based software commonly includes just the SIFT and SURF detection algorithms in the pipeline. More work toward the adoption of other detectors into the pipeline will be done to additionally verify the presented numerical results.

Acknowledgements. The research reported in this paper is partially supported by the Ministry of Education, Science, and Technological Development of the Republic of Serbia, project numbers TR32044 (2011-2020), ON174026 (2011-2020), and III44006 (2011-2020).

REFERENCES

[1] J. N. Bailenson, N. Yee, J. Blascovich, and R. E. Guadagno, "Transformed social interaction in mediated interpersonal communication", Mediated Interpersonal Communication, 2008, pp. 77–99.
[2] H. Lin and H. Wang, "Avatar creation in virtual worlds: Behaviors and motivations", Comput. Human Behav., vol. 34, pp. 213–218, May 2014.
[3] F. Cordier, W. Lee, H. Seo, and N. Magnenat-Thalmann, "From 2D Photos of Yourself to Virtual Try-on Dress on the Web", in People and Computers XV – Interaction without Frontiers, London: Springer London, 2011, pp. 31–46.
[4] C. Zizza, A. Starr, D. Hudson, S. S. Nuguri, P. Calyam, and Z. He, "Towards a social virtual reality learning environment in high fidelity", in Proceedings of the 15th IEEE Annual Consumer Communications & Networking Conference (CCNC), 2018, pp. 1–4.
[5] D. Dragan, Z. Anišić, S. Mihić, and V. Puhalac, "3D Avatar Platforms: Tomorrow's Gateways for Digitized Persons into Virtual Worlds", Springer, Cham, 2018, pp. 141–155.
[6] I. Hudson and J. Hurter, "Avatar types matter: Review of avatar literature for performance purposes", in Lecture Notes in Computer Science, 2016, vol. 9740, pp. 14–21.
[7] M. Yuan, I. R. Khan, F. Farbiz, S. Yao, A. Niswar, and M.-H. Foo, "A Mixed Reality Virtual Clothes Try-On System", IEEE Trans. Multimed., vol. 15, no. 8, pp. 1958–1968, Dec. 2013.
[8] T. Luhmann, S. Robson, S. Kyle, and J. Boehm, Close Range Photogrammetry and 3D Imaging, 2013.
[9] AgiSoft, "AgiSoft PhotoScan Professional (Version 1.2.6) (Software)", 2016. [Online]. Available: https://www.agisoft.com/downloads/installer/.
[10] J. Heinly, E. Dunn, and J. M. Frahm, "Comparative evaluation of binary features", in Lecture Notes in Computer Science, 2012, vol. 7573 LNCS, no. PART 2, pp. 759–773.
[11] AliceVision, "Meshroom: A 3D reconstruction software", 2018.
[12] P. Moulon, P. Monasse, R. Perrot, and R. Marlet, "OpenMVG: Open multiple view geometry", in Proceedings of the International Workshop on Reproducible Research in Pattern Recognition, 2016, pp. 60–74.
[13] H. Aanæs, A. L. Dahl, and K. S. Pedersen, "Interesting interest points: A comparative study of interest point performance on a unique data set", Int. J. Comput. Vis., vol. 97, no. 1, pp. 18–35, Mar. 2012.
[14] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, 2005.
[15] K. Mikolajczyk and C. Schmid, "Scale & affine invariant interest point detectors", Int. J. Comput. Vis., vol. 60, no. 1, pp. 63–86, Oct. 2004.
[16] O. Miksik and K. Mikolajczyk, "Evaluation of local detectors and descriptors for fast feature matching", in Proceedings of the 21st International Conference on Pattern Recognition (ICPR), 2012, pp. 2681–2684.
[17] A. Canclini, M. Cesana, A. Redondi, M. Tagliasacchi, J. Ascenso, and R. Cilla, "Evaluation of low-complexity visual feature detectors and descriptors", in Proceedings of the 18th International Conference on Digital Signal Processing (DSP), 2013, pp. 1–7.
[18] Ş. Işık, "A comparative evaluation of well-known feature detectors and descriptors", Int. J. Appl. Math. Electron. Comput., vol. 3, no. 1, p. 1, Dec. 2014.
[19] D. Mukherjee, Q. M. Jonathan Wu, and G. Wang, "A comparative experimental study of image feature detectors and descriptors", Mach. Vis. Appl., vol. 26, no. 4, pp. 443–466, May 2015.
[20] K. Yamada and A. Kimura, "A performance evaluation of keypoints detection methods SIFT and AKAZE for 3D reconstruction", in Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), 2018, pp. 1–4.
[21] B. Allen, B. Curless, and Z. Popović, "The space of human body shapes", ACM Trans. Graph., vol. 22, no. 3, p. 587, 2003.
[22] A. S. Jackson, C. Manafas, and G. Tzimiropoulos, "3D Human Body Reconstruction from a Single Image via Volumetric Regression", Sep. 2018.
[23] G. Varol et al., "BodyNet: Volumetric inference of 3D human body shapes", in Lecture Notes in Computer Science, 2018, vol. 11211 LNCS, pp. 20–38.
[24] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu, "DeepHuman: 3D Human Reconstruction from a Single Image", Mar. 2019.
[25] A. Venkat, S. S. Jinka, and A. Sharma, "Deep Textured 3D Reconstruction of Human Bodies", Sep. 2018.
[26] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan, "Scanning 3D full human bodies using Kinects", IEEE Trans. Vis. Comput. Graph., vol. 18, no. 4, pp. 643–650, Apr. 2012.
[27] Z. Liu et al., "3D real human reconstruction via multiple low-cost depth cameras", Signal Processing, vol. 112, pp. 162–179, Jul. 2015.
[28] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction", in Proceedings of the 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009, pp. 1542–1546.
[29] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis", IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1362–1376, Aug. 2010.
[30] A. Weiss, D. Hirshberg, and M. J. Black, "Home 3D body scans from noisy image and range data", in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 1951–1958.
[31] J. L. Schönberger and J.-M. Frahm, "Structure-from-Motion Revisited", in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[32] J. L. Schönberger, E. Zheng, J. M. Frahm, and M. Pollefeys, "Pixelwise view selection for unstructured multi-view stereo", in Lecture Notes in Computer Science, 2016, vol. 9907 LNCS, pp. 501–518.
[33] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, and A. B. Dahl, "Large-Scale Data for Multiple-View Stereopsis", Int. J. Comput. Vis., vol. 120, no. 2, pp. 153–168, Nov. 2016.
[34] M. Goesele, B. Curless, and S. M. Seitz, "Multi-View Stereo Revisited", in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2, pp. 2402–2409.
[35] S. R. Fanello et al., "UltraStereo: Efficient learning-based matching for active stereo systems", in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6535–6544.
[36] I. Stančić, M. Brajović, I. Orović, and J. Musić, "Compressive sensing for reconstruction of 3D point clouds in smart systems", in Proceedings of the 24th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2016, pp. 1–5.
[37] V. Tan, I. Budvytis, and R. Cipolla, "Indirect deep structured learning for 3D human body shape and pose prediction", in Proceedings of the British Machine Vision Conference 2017, 2017.
[38] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, "End-to-End Recovery of Human Shape and Pose", in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 7122–7131.
[39] D. Gajić, S. Mihić, D. Dragan, V. Petrović, and Z. Anišić, "Simulation of photogrammetry-based 3D data acquisition", Int. J. Simul. Model., vol. 18, no. 1, 2019.
[40] G. Bradski, "The OpenCV Library", Dr. Dobb's J. Softw. Tools, 2000.
[41] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[42] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features", in Lecture Notes in Computer Science, 2006, vol. 3951 LNCS, pp. 404–417.
[43] M. Agrawal, K. Konolige, and M. R. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching", in Lecture Notes in Computer Science, 2008, vol. 5305 LNCS, no. PART 4, pp. 102–115.
[44] F. Tombari and L. Di Stefano, "Interest points via maximal self-dissimilarities", in Lecture Notes in Computer Science, 2015, vol. 9004, pp. 586–600.
[45] P. E. Forssén, "Maximally stable colour regions for recognition and matching", in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[46] D. Nistér and H. Stewénius, "Linear time maximally stable extremal regions", in Lecture Notes in Computer Science, 2008, vol. 5303 LNCS, no. PART 2, pp. 183–196.
[47] J. Shi and C. Tomasi, "Good features to track", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-94), 1994, pp. 593–600.
[48] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF", in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2564–2571.
[49] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking", in Proceedings of the IEEE International Conference on Computer Vision, 2005, vol. II, pp. 1508–1515.
[50] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary Robust invariant scalable keypoints", in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2548–2555.
[51] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger, "Adaptive and generic corner detection based on the accelerated segment test", in Lecture Notes in Computer Science, 2010, vol. 6312 LNCS, no. PART 2, pp. 183–196.
[52] P. Alcantarilla, J. Nuevo, and A. Bartoli, "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces", in Proceedings of the British Machine Vision Conference 2013, 2014, pp. 13.1–13.11.