AP07_4-5.vp 1 Introduction Face detection is one of the most challenging tasks in ob- ject recognition, because of the high variance among human faces, including facial expression. Furthermore, the typical challenges of object detection, such as variability in scale, ori- entation and pose as well as occlusion and lighting conditions have to be treated by a face detector. In recent years, several methods have been developed for face detection that deal with some but not all sources of variance. Holistic approaches such as Eigenfaces [12] (sometimes called image-based [4] or appearance-based [14] methods in context of detection and recognition)or the boosted cascade of simple features [13] perform object detec- tion by classifying image regions through a sliding window [15]. While they can be made invariant to scale and lighting, the major drawback of these techniques is that they cannot deal with different rotation or pose. In addition, facial expres- sion and occlusion must be treated as intra-class variance, decreasing the performance of the classifier/detector. On the other hand, feature-based approaches handle pose, expres- sion and even partial occlusion very well. Rotation invariance is only limited by the properties of the features used to build the detector. This paper presents a new approach to the detection of facial features. The local features used for detection are in- variant to scale, rotation and change in illumination, and are robust to changing viewpoints. We show that the features chosen are well suited to detect several meaningful parts of the human face like pupils, nostrils and the corner of the mouth. The paper is organized as follows: In section two, the feature extraction process is described. Feature space reduc- tion and feature selection are explained in section three. Section four presents the classifier, and the results are shown in section five. In section six, the results are discussed. 2 Feature extraction This section explains feature extraction from an image and its description in feature space, The features have to be invariant to affine transformations and illumination. While there have been numerous proposals for appropriate feature points, they have usually been chosen on the basis of human observation of face characteristics. In contrast to such knowl- edge-based methods, several methods have been developed to detect structures that are generally easy to locate, can be computed with high reliability and satisfy the demand of scale, rotation and illumination invariance [6, 10]. The local area around these so-called interest points is then extracted and modeled. A 3D scale-space representation of an image is usually taken to detect local features and their corresponding scales (see Fig. 1). Given any image I(x), its scale-space representa- tion L(x, �), is defined by L g I( , ) ( , ) ( )x x x� �� � (1) where g( , )x � denotes the Gaussian kernel function given by g( , ) expx x � � � � �� � � � � � � 1 2 2 2 2 2 . (2) Our approach uses the scale adapted Harris Corner de- tector to detect interest points. This detector is based on the second moment matrix [7]: �( , , ) ( ) ( , ) ( , ) ( , ) x x x x � � � � � � � i D D i D D x y D x y D g L L L L L L � �2 2 D D 2 ( , )x � � � � � � (3) This matrix describes the gradient distribution in a local neighborhood of a point. Its eigenvalues �1, �2 represent the two principal signal changes. Thus it is possible to extract points with a significant signal change in both orthogonal directions indicating edges or junctions, for example. Since it © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 3 Acta Polytechnica Vol. 47 No. 4–5/2007 Detection of Facial Features in Scale-Space P. Hosten, M. Asbach This paper presents a new approach to the detection of facial features. A scale adapted Harris Corner detector is used to find interest points in scale-space. These points are described by the SIFT descriptor. Thus invariance with respect to image scale, rotation and illumination is obtained. Applying a Karhunen-Loeve transform reduces the dimensionality of the feature space. In the training process these features are clustered by the k-means algorithm, followed by a cluster analysis to find the most distinctive clusters, which represent facial features in feature space. Finally, a classifier based on the nearest neighbor approach is used to decide whether the features obtained from the interest points are facial features or not. Keywords: clustering methods, face recognition, feature extraction, interest points, Karhunen-Loeve transforms, object detection, pattern classification. Fig. 1: The image I(x) is embedded into a continuous family L(x, �) of gradually smoother versions of it. Hereby the original image corresponds to the scale � � 0 [11]. Interest points are detected in this scale-space. is easier to compute the trace and the determinant of the sec- ond moment matrix than its eigenvalues, the Harris detector uses the following measure to determine the location of inter- est points [3]: cornerness tracei D � � � � � � � � � � � � � 1 2 1 2 2 2 ( ) det( ( , , )) (� x �( , , )).x � �i D (4) This measure is not suitable for detecting the maximum over scales in a scale-space representation. Thus the normal- ized Laplacian-of-Gaussian is used for automatic scale selec- tion [5]. LoG s L Ln n xx n yy n( , ) ( , ) ( , )x x x� � �� � 2 2 2 (5) The region around interest points is described using local descriptors. Recently, several descriptors have been devel- oped [8]. This paper uses the SIFT descriptor based on the gradient distribution in the detected region around the inter- est point [6]. The size of the region being described depends on the detection scale �. Thus a scale invariant description is obtained. The resulting descriptor maps each interest point into a 128-dimensional vector m. In addition, a dominant ori- entation is calculated for each interest point and descriptors are calculated on the basis of this angle to gain rotation invariance. 3 Feature reduction and selection In this section, dimensionality reduction and feature selection are explained. The Karhunen-Loeve transform is applied to reduce the dimensionality. Then the clustering process and the cluster analysis is described. 3.1 Karhunen-Loeve transform It is a well-known problem in the machine learning do- main that the number of required training samples increases exponentially with the dimensionality of the feature space. This is called the “curse of dimensionality” [1]. Since the SIFT descriptor forms a 128 dimensional vector, the Karhunen- -Loeve transform is applied to reduce dimensionality. Positive side effects of this are a reduction in computing time for the classification and the fact that the elements of the feature vec- tor m become uncorrelated. Hence the covariance matrix � of N training feature vectors m is used to find a linear subspace for better representation. � � � � � � � � � � � 1 1 1 1 1 N N i i N i i T i N m m m( )( ) Therefore the covariance matrix � has to be orthogon- alized by an eigenvector transform: � �� T N � � � � � � � � � � � � � � 1 2 0 0 0 0 0 0 � � � � � � � . (6) The columns of transform matrix � are the eigenvectors of �. These eigenvectors are orthonormal and represent the basis vectors. The use of only a subset of these eigenvectors as transformation matrix �T leads to reduced dimensionality of the feature space [2]: f m� � �� ( )� �T (7) 3.2 Clustering In contrast to knowledge-based methods, the exact facial features are not defined beforehand. For clustering the train- ing feature vectors f in the reduced feature space, we need, however, to estimate their number. A single cluster represents a distribution of a number of feature vectors f that belongs to the same facial feature. The number of clusters must therefore be big enough to offer at least one cluster centroid per facial feature. Choosing too many clusters, however, leads to over- training, i.e. a single facial feature will be represented by a multitude of cluster centroids that are adapted to the training set instead of generalizing a given feature. Hence an evalua- tion with different numbers of clusters was conducted (see section 4) to find the optimum value ex post based on the clas- sifier performance. A detailed description of the clustering process is given in the following paragraphs. First of all, the training feature vectors f are split up into two subsets. The first subset contains all feature vectors de- scribing the facial features (class a � 1). In the second subset all feature vectors describing the rest of the image are aggre- gated (class a � 0): � � � �F a F aface non face� � � � � �f f1 0, (8) Each subset is clustered with the k-means algorithm. Thereby the number of clusters in each subset is kept propor- tional to the respective number of features f. In other words the average number of features represented by each cluster is chosen to be identical for both subsets. 3.3 Cluster analysis A cluster analysis is applied to Fface to find the most char- acteristic clusters of facial features. The cluster precision introduced in [9] is a measure of the representativeness of a cluster: p F F Fja ja ja ja � �� � # # #0 1 (9) 4 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 47 No. 4–5/2007 Fig. 2: Cluster j represents the feature of class a � 0, 1. The cluster precision is pj a � 0 � 0.67 and pj a � 1 � 0.33. This results in the probability pja that a feature of class a is represented by cluster j. In order to determine which feature is represented by which cluster, a vector quantization is applied to all training feature vectors f. Thus the cluster precision for each cluster is obtained. Finally only clusters that represent facial features ( a � 1) and whose cluster precision is higher than 0.9 are kept to form the classifier. In this way the most distinctive clusters representing the facial features in feature space are determined. In the following, these clusters will be denoted as target clusters. 4 Classifier The distances of the feature vectors f to the target clusters determined in section 3.3, are a measure of the performance of the classification process. As shown in Fig. 3, the training features assigned to a face (a � 1) are closer to the target clus- ter than the other features (a � 0). A threshold can therefore be calculated from the training features for a given false-posi- tive rate. In this context, false-positive denotes all features wrongly classified as facial features, whereas true-positive de- notes all features correctly classified as facial features. In this way a classifier is obtained. Up this point, all evalu- ations have been based on the training set only. This is necessary for a cross evaluation study that requires that tests are only performed on data that has not been part of the training process. This dataset is split up into 46 subsets con- taining approx. 49 images each. During cross evaluation, a training corpus is formed of a random selection of 45 subsets, leaving a single subset as a test set. The results of such evalua- tion runs are averaged for the final results presented in section 5. A single run consists of the following steps: 1. All SIFT descriptors of the test set are projected into the feature space found in 3.1. 2. For all such features the respective nearest neighbors of the target clusters from 3.3 is calculated. 3. All features whose distance to their target cluster is smaller than a given threshold, are classified as facial features. 4. From the annotation of the dataset, true positives and false positives are discriminated. 5. By varying the threshold, classifier sensitivity is adjusted. A ROC curve over true positives vs. false positives results. This procedure is repeated for a number of cluster densi- ties (see Fig. 5) to find the ideal number of clusters. 5 Results In this section the results of the previously introduced in- terest point detector and classifier are presented. Our dataset consists of 2254 images showing one to three people in cluttered backgrounds. As described in section 4 the dataset is randomly split into a training set containing 2205 images and a test set containing 49 images. Fig. 4 depicts the distribution of the interest points over the face area. About 150,000 facial interest points have been extracted from all images of the whole dataset. They are mapped into a normalized coordinate system to achieve com- parability. As can be seen, eyes, nostrils and corners of the mouth are extracted very reliably. Thus these interest points are suitable for facial analysis. In order to determine the best setup of the classifier, the influence of the number of clusters on its performance has been analyzed. The result is shown in Fig. 5. Obviously, the effect of overfitting arises for a large number of clusters and a low cluster density, respectively. Fig. 6 shows the distances of the target clusters to the fea- tures f extracted from the test set. Just as has been observed from the training set, the test features assigned to a face (a � 1) are closer to the target cluster than the other features (a � 0). Since only a small part of the image area is covered by human faces, the number of interest points describing facial features is 20 times lower than the number of inter- est points found in the remaining image. That is why only really low false-positive rates (0.1 % – 0.2 %) are acceptable. Setting a threshold for 0.2 % false positives results in a detec- tion rate of about 20 %, which seems acceptable given the fact that this corresponds to about 10 correctly detected facial fea- © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 5 Acta Polytechnica Vol. 47 No. 4–5/2007 Fig. 3: Distances of the training feature vectors f to the target cluster. Distribution of interest in a normalized face Fig. 4: This figure shows the distribution of interest points detected in a face by the scale adapted Harris corner de- tector. The interest points are mapped into the normal- ized coordinate system. ture points per image, whereas approximately 2 features are wrongly classified. This aspect is exemplified in the following. Fig. 7 shows all interest points detected by the scale- -adapted Harris Corner detector. These points, indicated by circles, are especially located in corner like structures. The result of the classifier can be seen in Fig. 8. It shows only the interest points that have been classified as facial features. These features represent eyes, eye brows and corner of the mouth. 6 Conclusion This paper has introduced a new approach to the detec- tion of facial features. Our approach builds on local image de- scriptions that are invariant towards affine image transforma- tions and illumination. We have shown that the scale adapted Harris detector yields feature points that are suitable for the detection of specific facial features like eyes, nostrils and cor- ners of the mouth, and that these features can be appro- priately described by the SIFT descriptor. Our new method is able to detect on average 10 features per face, with a false- -positive rate of 0.2 %, which corresponds to approximately 2 wrongly classified features per image. Acknowledgements The research described in this paper was supervised by Prof. J.-R. Ohm, Institute of Communications Engineering at RWTH Aachen University. References [1] Bellman, R.: Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961. [2] Fukunaga, K.: Introduction in statistical pattern recognition. 1972. [3] Harris, C., Stephens: A Combined Corner and Edge De- tector. Alvey Vision Conference, 1988. [4] Hjelmas, E., Low, B. K.: Face Detection: A Survey. Com- puter Vision and Image Understanding,Vol. 83 (2001), No. 3, p. 236–274. [5] Lindeberg, T.: Edge Detection and Ridge Detection with Automatic Scale Selection. International Journal of Com- puter Vision, 1998. [6] Lowe, D. G.: Distinctive Image Features from Scale-In- variant Keypoints. International Journal of Computer Vi- sion, 2004. 6 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 47 No. 4–5/2007 Fig. 5: This figure shows the influence of different cluster densi- ties on the performance of the classifier. Thus the cluster density is a measure for the number of training features represented by a cluster. The smaller the value, the more clusters are computed. Fig. 6: Distances of the test feature vectors f to the target cluster Fig. 7: This image of the test set shows all interest points detected by the scale-adapted Harris corner detector. The diameter of the circle indicates the appropriate scale on which the interest point has been detected. Fig. 8: This image shows only the interest points that have been classified as facial features [7] Mikolajczy, K., Schmid, C.: Scale & Affine Invariant In- terest Point Detectors. International Journal of Computer Vision, 2004. [8] Mikolajczy, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005. [9] Mikolajczyk, K., Leibe, B., Schiele, B.: Local Features for Object Class Recognition. Technical report, Multimodal In- teractive Systems TU Darmstadt, Germany, 2005. [10] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Van Gool, L., Kadir, T.: A Comparison of Affine Region Detectors. International Journal of Computer Vision, 2005. [11] Ter Haar Romeny, B., Florack, L., Koenderink, J., Viergever, M. (editors): Scale-Space Theory in Com- puter Vision, First International Conference, Scale-Space’97, Utrecht, The Netherlands, July 2–4, 1997, Proceedings. Springer, 1997. [12] Turk, M., Pentland, A.: Eigenfaces for Recognition. Jour- nal of Cognitive Neuroscience, Vol. 3 (1991)No. 1, p. 71–86. [13] Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. Conference on Com- puter Vision and Pattern Recognition 2001, 2001. [14] Yang, M.-H., Kriegman, D. J., Ahuja, N.: Detecting Faces in Images: A Survey. Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 24 (2002), No. 1, p. 34–58. [15] Zhang, J., Yan, Y., Lades, M.: Face Recognition: Eigen- face, Elastic Matching and Neural Nets. Proceedings of the IEEE, Vol. 85 (1997), No. 9. Peter Hosten e-mail: hosten@ient.rwth-aachen.de Ing. Mark Asbach e-mail: asbach@ient.rwth-aachen.de Institute of Communications Engineering RWTH Aachen University 52056 Aachen, Germany © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 7 Acta Polytechnica Vol. 47 No. 4–5/2007