Earth Sci. Res. J. Vol. 11, No. 2 (December 2007): 131-138

CLUSTERING ON DISSIMILARITY REPRESENTATIONS FOR DETECTING MISLABELLED SEISMIC SIGNALS AT NEVADO DEL RUIZ VOLCANO

Mauricio Orozco-Alzate and César Germán Castellanos-Domínguez
Universidad Nacional de Colombia Sede Manizales, Grupo de Control y Procesamiento Digital de Señales, Campus La Nubia, km 7 vía al Magdalena, Manizales, Colombia.
Corresponding author: Mauricio Orozco-Alzate, email: morozcoa@unal.edu.co

ABSTRACT

Classification of seismic signals at Colombian volcanoes has been carried out manually by visual inspection. In order to reduce the workload of the seismic analysts and to make classification reliable and objective, the use of supervised learning algorithms has been explored, particularly classifiers built in dissimilarity spaces. Nonetheless, the performance of such learning methods depends on the availability of a representative training set that has been well classified a priori. To detect mislabeled events, the use of clustering techniques on the dissimilarity representations is proposed. Our experiments, performed on re-analyzed seismic signals, show a significant improvement in recognition accuracy with respect to the original data sets.

Key words: Clustering, dissimilarity, mislabeling, seismic signals.

Manuscript received September 9, 2007. Accepted for publication November 30, 2007.

INTRODUCTION

In many applications of pattern recognition, it is extremely difficult or expensive, or even impossible, to reliably label a training sample with its true category (Jain et al., 2000). In the classification of seismic-volcanic signals in particular, night and rotating shift work schedules, tedious evaluations, and changes of personnel make recognition by visual inspection susceptible to human error. Besides, analysts often disagree about the interpretation of doubtful signals. In order to reduce the workload of the seismic analyst and the risks associated with subjective judgments, a number of supervised classification methods have been used (Scarpetta et al., 2005; Langer et al., 2006; Orozco et al., 2006a). Those supervised classification techniques assume that a well-labeled data set is available. However, for the same reasons cited above, it is highly likely that training sets include mislabeled events.

In Langer et al. (2006), an automatic classification of seismic events at Soufrière Hills volcano was carried out. In addition, a careful manual revision of the original a-priori classification was performed by an expert not involved in the previous labeling of the data set. It was found that a considerable number of the events had been erroneously attributed to other classes. As a result, a remarkable improvement in classification accuracy was obtained when the revised data set was used.

The Nevado del Ruiz Volcano is monitored by the Volcanological and Seismological Observatory at Manizales (VSOM).
Because of the considerable amount of data, the task of labelling the recorded seismic signals is distributed among several analysts (e.g. one trainee per volcanic station). A second or third opinion is requested only in case of serious doubt. As a result, classifications performed by different experts are not available, and an analysis of concordance for such a-priori labels has not been conducted.

In this study, a revision of the originally labelled Nevado del Ruiz volcano (Ruiz) data set is conducted. In contrast to the approach followed by Langer et al. (2006), the revision was automated by using clustering techniques. Since no single clustering algorithm is appropriate for every data set (Jain et al., 2000), several clustering algorithms were applied. Experiments were therefore conducted with the most popular clustering approaches, which belong to two basic strategies: hierarchical and partitioning methods. In addition, the Ruiz data set was arranged into two separate problems: the Ruiz-VT,LP (two classes) and the Ruiz-all (three classes) data sets. The revised data sets were used with our previous dissimilarity-based classification approach (Orozco et al., 2006a; Orozco et al., 2006b), and the results were compared against the performances obtained with the original data sets.

DISSIMILARITY REPRESENTATION AND CLASSIFIER

Differences in spectral content allow a visual discrimination of different types of volcanic earthquakes. Therefore, spectra of seismic records are commonly used for classification and monitoring of seismic activity (Zobin, 2003). In addition, recent studies have claimed that the dissimilarity-based classification approach is a feasible and sometimes advantageous alternative to the feature-based method (Duin et al., 1998; Pękalska et al., 2001; Pękalska and Duin, 2002; Paclík and Duin, 2003b; Pękalska and Duin, 2005).
Accordingly, a dissimilarity representation for the Ruiz data set can be derived as follows: (i) the power spectral density (PSD) of each record is estimated via the Yule-Walker autoregressive method, after the DC bias has been removed; (ii) a dissimilarity measure between normalized spectra is calculated as the area of the non-overlapping parts of the two spectra (L1-norm of their difference); see Fig. 1.

Figure 1. Dissimilarity measure as the difference between normalized spectra.

A dissimilarity matrix D(T,T) is constructed from those pairwise measures. Each entry $d_{ij}$ of D corresponds to the dissimilarity between a pair of seismic records from the training set T. Then, a proper classifier can be defined on such a dissimilarity representation, either by using the entire training set T or a representation set R ⊆ T.

Linear Normal Density Based Classifier

A number of studies have shown that normal density based classifiers perform well in dissimilarity spaces (Pękalska et al., 2001; Pękalska and Duin, 2002; Paclík and Duin, 2003b; Paclík and Duin, 2003a; Pękalska et al., 2004; Orozco et al., 2006a). In particular, in our previous study with the Nevado del Ruiz volcano data set (Orozco et al., 2006b), the linear normal density based classifier (BayesNL) outperformed the nearest neighbor rule (1-NN) and the quadratic normal density based classifier (BayesNQ). For a two-class problem, the BayesNL classifier is given by

$$ f(D(x,R)) = \left[ D(x,R) - \tfrac{1}{2}\left(\mathbf{m}_{(1)} + \mathbf{m}_{(2)}\right) \right]^{T} C^{-1} \left(\mathbf{m}_{(1)} - \mathbf{m}_{(2)}\right) + \log\frac{P_{(1)}}{P_{(2)}} \qquad (1) $$

where $C$ is the sample covariance matrix, $\mathbf{m}_{(1)}$ and $\mathbf{m}_{(2)}$ are the class mean vectors, and $P_{(1)}$ and $P_{(2)}$ are the class prior probabilities. If $C$ is singular, a regularized version must be used.
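As an illustration of the dissimilarity representation on which this classifier operates, the construction of D(T,T) described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the AR order (4), the frequency grid (64 points), the unit-sum normalization and all function names are hypothetical choices. The text specifies only the Yule-Walker method, the removal of the DC bias and the L1 distance between normalized spectra.

```python
import cmath
import math


def yule_walker_psd(signal, order=4, n_freqs=64):
    """AR power spectrum from the Yule-Walker equations (Levinson-Durbin).

    Assumes a non-constant signal; `order` and `n_freqs` are illustrative.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC bias before estimating
    # Biased autocorrelation estimates r[0..order]
    r = [sum(x[t] * x[t + k] for t in range(n - k)) / n
         for k in range(order + 1)]
    # Levinson-Durbin recursion for x[t] + a[1]x[t-1] + ... + a[p]x[t-p] = e[t]
    a = [1.0]
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        err *= 1.0 - k * k
    # Evaluate err / |A(e^{jw})|^2 on a grid of frequencies in [0, pi)
    psd = []
    for f in range(n_freqs):
        w = math.pi * f / n_freqs
        denom = sum(a[i] * cmath.exp(-1j * w * i) for i in range(order + 1))
        psd.append(err / abs(denom) ** 2)
    return psd


def normalize(psd):
    """Scale a spectrum to unit sum (assumed normalization)."""
    total = sum(psd)
    return [p / total for p in psd]


def l1_dissimilarity(p, q):
    """Area of the non-overlapping parts of two normalized spectra."""
    return sum(abs(a - b) for a, b in zip(p, q))


def dissimilarity_matrix(records, order=4, n_freqs=64):
    """Pairwise dissimilarities D(T,T) for a list of records."""
    spectra = [normalize(yule_walker_psd(r, order, n_freqs)) for r in records]
    return [[l1_dissimilarity(p, q) for q in spectra] for p in spectra]
```

Each row of the resulting matrix is the vector D(x,R) for one record when R = T. With unit-sum normalization every entry lies between 0 and 2, so records with disjoint spectral content approach the upper bound.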
The following regularization is typically used, with λ equal to 0.01 or less (Pękalska et al., 2006):

$$ C_{\mathrm{reg}}^{\lambda} = (1-\lambda)\,C + \lambda\,\operatorname{diag}(C). \qquad (2) $$

CLUSTERING TECHNIQUES

Unsupervised classification refers to situations where the objective is to construct decision boundaries based on unlabeled training data (Jain et al., 2000). Hierarchical and partitioning methods are the two basic strategies for finding clusters. In this study, the following clustering techniques are used: single linkage (SL), average linkage (AL), complete linkage (CL), k-means and k-centres. SL, AL and CL are hierarchical, whereas the latter two are partitioning methods. A brief description of these approaches is given below.

Hierarchical clustering

The most popular hierarchical techniques for clustering are the agglomerative methods. At the beginning, each object is considered as a single cluster; then, the two closest clusters are merged iteratively until a specified number of clusters is reached (Pękalska and Duin, 2005). Let $C_k$ and $C_l$ be two clusters of cardinalities $n_k$ and $n_l$, respectively, and let $\rho_{kl}$ be a dissimilarity measure between them. Three basic criteria for the agglomerative methods are summarized in Table 1.

Partitioning clustering

Partitioning methods group the objects into k clusters, usually by using representatives or by assuming a specific geometrical structure. Objects are assigned to the clusters, new representatives are estimated and the process is repeated until a stable solution is reached. Two typical partitioning methods are k-means and k-centres; see Table 2 for a brief description. A detailed one can be found in Pękalska and Duin (2005).

EXPERIMENTAL RESULTS

Volcano-Tectonic (VT) earthquakes, Long-Period (LP) earthquakes and Icequakes (IC) are the classes of seismic signals considered in this study. They are contained in the Ruiz-all data set.
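The agglomerative procedure described under Hierarchical clustering can be sketched as follows, operating directly on a precomputed dissimilarity matrix. It is an illustrative, unoptimized sketch rather than the implementation used in the study: the function name and the criterion keywords are assumptions, and ties are broken by cluster index. The three criteria correspond to SL, CL and AL.

```python
def agglomerative(D, n_clusters, criterion="average"):
    """Merge the two closest clusters until `n_clusters` remain.

    D is a symmetric dissimilarity matrix given as a list of lists.
    Returns one integer cluster label per object.
    """
    def rho(ck, cl):
        # Dissimilarity between two clusters under the chosen criterion
        pairs = [D[i][j] for i in ck for j in cl]
        if criterion == "single":      # SL: closest pair of members
            return min(pairs)
        if criterion == "complete":    # CL: farthest pair of members
            return max(pairs)
        if criterion == "average":     # AL: mean over all pairs
            return sum(pairs) / len(pairs)
        raise ValueError("unknown criterion: %r" % criterion)

    # Start with every object in its own cluster
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest rho
        _, a, b = min(
            (rho(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(D)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels
```

Because the procedure needs only pairwise dissimilarities, it can be applied to D(T,T) without an explicit feature representation. The exhaustive search over cluster pairs keeps the sketch short but makes it suitable for small examples only.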
The Ruiz-VT,LP data set includes only the first two classes. Signals were digitized at a sampling frequency of 100.16 Hz by using a 12-bit analog-to-digital converter. Recording stations are located near the Olleta crater and the glacier at the Nevado del Ruiz volcanic complex.

Table 1. Hierarchical clustering methods.

| Method | $\rho_{kl}$ | Emphasis/comment |
|---|---|---|
| SL | $\min_{p_i \in C_k} \min_{p_j \in C_l} d(p_i, p_j)$ | Connectedness. Resulting clusters are elongated and chain-like. |
| CL | $\max_{p_i \in C_k} \max_{p_j \in C_l} d(p_i, p_j)$ | Compactness. It performs well when the objects form naturally distinct clouds. |
| AL | $\frac{1}{n_k n_l} \sum_{p_i \in C_k} \sum_{p_j \in C_l} d(p_i, p_j)$ | Connectedness and compactness. It performs well for naturally distinct clouds and elongated clusters. |

In order to explore the level of agreement/disagreement between the labels given by the experts and the ones produced by the clustering algorithms, the number of mismatches over the entire data sets is considered. The number of mismatches averaged over 10 runs is reported in Table 3. Hierarchical methods yield the same number of mismatches in every run; therefore, their standard deviations are zero.

The SL hierarchical criterion presents a considerably high rate of disagreement for both the Ruiz-VT,LP and the Ruiz-all problems; similarly, the mismatches of the AL results for the Ruiz-all problem reach 45%. In fact, even though the number of clusters is fixed, SL and AL find second and third clusters containing only a few objects. As a result, valid data subsets, i.e. randomly generated subsets that include enough objects per class, are not always guaranteed. Consequently, AL for the Ruiz-all problem and SL in both cases are not considered in the subsequent classification experiments.

Orozco et al. (2006b) observed an asymptotic behaviour for training set sizes greater than 60 examples per class.
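Because cluster indices are arbitrary, counting the mismatches discussed above requires some correspondence between clusters and expert classes. The text does not state how this correspondence was made; the sketch below assumes one plausible convention, namely the one-to-one assignment of clusters to classes that minimizes the number of mismatches. The function and variable names are hypothetical.

```python
from itertools import permutations


def count_mismatches(expert_labels, cluster_labels):
    """Smallest number of disagreements over all one-to-one assignments
    of cluster indices to expert classes (assumed matching convention)."""
    classes = sorted(set(expert_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than expert classes")
    best = len(expert_labels)
    # Try every way of assigning distinct classes to the clusters
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        mismatches = sum(
            1 for e, c in zip(expert_labels, cluster_labels)
            if mapping[c] != e
        )
        best = min(best, mismatches)
    return best


# Toy example: the third event falls in the cluster dominated by LP events
expert = ["VT", "VT", "VT", "LP", "LP", "IC"]
found = [1, 1, 0, 0, 0, 2]
print(count_mismatches(expert, found))  # prints 1
```

Exhaustive search over assignments is adequate for the two- and three-class problems considered here; dividing the count by the number of events gives the rate of disagreement.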
In addition, the BayesNL provided the best overall performance, outperforming the 1-NN and the BayesNQ. Accordingly, the experiments were conducted with the BayesNL using training sets of a fixed size of 60 objects per class. Clustering was performed on the entire data sets. Then, training and test sets were randomly extracted for each run. The results are shown in Table 4. For comparison, the results obtained with the original data are also presented. It is clear that the performances for the re-labelled data sets are much better than those for the original data.

Table 2. Clustering methods.

| Method | Description |
|---|---|
| k-means | Representatives are estimated by cluster mean vectors. The dissimilarity is the Euclidean distance of an object to the cluster means. |
| k-centres | Centre objects are chosen such that the maximum of the distances over all objects to the nearest centre is minimized. Results depend on random initialization. |

Table 3. Averaged number of mismatches (standard deviation in parentheses) between the class labels assigned by the VSOM staff and the labels assigned by the clustering method over the entire data sets.

| Clustering method | Ruiz-VT,LP | Ruiz-all |
|---|---|---|
| SL | 482 | 1108 |
| CL | 367 | 495 |
| AL | 164 | 861 |
| k-centres | 158.2 (25.2049) | 507.5 (52.2797) |
| k-means | 135.6 (0.5164) | 506.6 (0.5164) |
| Total (events in data set) | 1063 | 1891 |

Table 4. Classification error (in %, averaged over 25 runs) with its standard deviation (in %, in parentheses) for the RNLC (the regularized BayesNL classifier) applied to the revised data sets. Combinations excluded from the classification experiments are marked with "-".

| Clustering method | Ruiz-VT,LP | Ruiz-all |
|---|---|---|
| SL | - | - |
| CL | 4.9646 (1.1176) | 3.77 (0.76) |
| AL | 2.3494 (0.4045) | - |
| k-means | 2.7524 (0.7405) | 5.6722 (0.6109) |
| k-centres | 2.8810 (0.7441) | 4.6792 (0.8413) |
| Original data | 13.0075 (1.0354) | 20.02 (0.81) |

DISCUSSION AND CONCLUSION

A revision of the originally labelled seismic events recorded by the VSOM staff provides a significant improvement in the performance of supervised dissimilarity-based classifiers, such as that observed for the BayesNL classifier. The use of events labelled by clustering confirmed that labelling errors are frequent and recurrent. Clustering uses a notion of proximity that is judged numerically. In contrast, labels assigned by experts follow the visual resemblance between the event and a canonical waveform that analysts have learnt by reference or experience. Such a method is highly subjective and assumes that differences are easily detected by visual inspection, which in many cases is not true.

Since the final (stand-alone) rule was calculated from all the data, clustering methods were applied to the entire data sets instead of to the training sets only. AL and CL offer the smallest errors for the Ruiz-VT,LP and Ruiz-all problems, respectively. Even though the best clustering is hierarchical in both cases, the differences are not large enough to claim that hierarchical methods should be preferred over partitioning ones. Nonetheless, a general conclusion can be drawn from our study: the use of a clustering method to confirm labels assigned by experts is highly beneficial for constructing reliable and accurate supervised classifiers of seismic events.

ACKNOWLEDGEMENTS

We thank the VSOM staff for providing the raw data set.

REFERENCES

Duin, R. P. W., de Ridder, D., and Tax, D. M. J. (1998). Featureless pattern classification. Kybernetika, 34, no. 4, 399–404.
Jain, A. K., Duin, R. P. W., and Mao, J. (2000). Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Machine Intell., 22, no. 1, 4–37.
Langer, H., Falsaperla, S., Powell, T., and Thompson, G. (2006). Automatic classification and a-posteriori analysis of seismic event identification at Soufrière Hills volcano, Montserrat.
Journal of Volcanology and Geothermal Research, 153, 1–10.
Orozco, M., García, M. E., Duin, R. P. W., and Castellanos, C. G. (2006a). Dissimilarity-based classification of seismic volcanic signals at Nevado del Ruiz volcano. 2nd Latin-American Congress of Seismology, Bogotá, Colombia, August, CD-ROM.
Orozco, M., García, M. E., Duin, R. P. W., and Castellanos, C. G. (2006b). Dissimilarity-based classification of seismic volcanic signals at Nevado del Ruiz volcano. Earth Sciences Research Journal, 10, no. 2, 57–65.
Paclík, P. and Duin, R. P. W. (2003a). Classifying spectral data using relational representation. Proceedings of the Spectral Imaging Workshop, Graz, Austria, April, 31–34.
Paclík, P. and Duin, R. P. W. (2003b). Dissimilarity-based classification of spectra: computational issues. Real Time Imaging, 9, no. 4, 237–244.
Pękalska, E. and Duin, R. P. W. (2002). Dissimilarity representations allow for building good classifiers. Pattern Recognition Lett., 23, no. 8, 943–956.
Pękalska, E. and Duin, R. P. W. (2005). The Dissimilarity Representation for Pattern Recognition: Foundations and Applications. World Scientific, Singapore, 636pp.
Pękalska, E., Duin, R. P. W., Günter, S., and Bunke, H. (2004). On not making dissimilarities Euclidean. Proceedings of Structural and Statistical Pattern Recognition, Lisbon, Portugal, August, 1143–1151.
Pękalska, E., Duin, R. P. W., and Paclík, P. (2006). Prototype selection for dissimilarity-based classifiers. Pattern Recognition, 39, no. 2, 189–208.
Pękalska, E., Paclík, P., and Duin, R. P. W. (2001). A generalized kernel approach to dissimilarity based classification. J. Mach. Learn. Res., 2, no. 2, 175–211.
Scarpetta, S., Giudicepietro, F., Ezin, E. C., Petrosino, S., Pezzo, E. D., Martini, M., and Marinaro, M. (2005). Automatic classification of seismic signals at Mt. Vesuvius volcano, Italy, using neural networks. Bulletin of the Seismological Society of America, 95, no. 1, 185–196.
Zobin, V. (2003). Introduction to Volcanic Seismology. Elsevier, Amsterdam, The Netherlands, 302pp.