CHEMICAL ENGINEERING TRANSACTIONS, VOL. 43, 2015
A publication of The Italian Association of Chemical Engineering
Online at www.aidic.it/cet
Chief Editors: Sauro Pierucci, Jiří J. Klemeš
Copyright © 2015, AIDIC Servizi S.r.l., ISBN 978-88-95608-34-1; ISSN 2283-9216

Ensemble Clustering for Fault Diagnosis in Industrial Plants

Sameer Al-Dahidi(a), Francesco Di Maio(a,*), Piero Baraldi(a) and Enrico Zio(a,b)
(a) Energy Department, Politecnico di Milano, Milan, Italy
(b) Chair on Systems Science and the Energetic Challenge, Fondation EDF, CentraleSupélec, Paris, France
(*) francesco.dimaio@polimi.it

In this paper, we propose an unsupervised ensemble clustering approach for fault diagnosis in industrial plants. The basic idea is to combine multiple base clusterings of operational transients of industrial equipment when the number of clusters in the final ensemble clustering (P*) is unknown. In practice, a Cluster-based Similarity Partitioning Algorithm (CSPA) is employed to quantify the co-association matrix that describes the similarity among the different base clusterings; then, a spectral clustering technique embedding an unsupervised K-Means algorithm is used to find the optimum number of clusters of P*, based on the calculation of the Silhouette validity index. The identified clusters allow distinguishing different operational behaviors of the equipment. The proposed approach is verified on an artificial case study representative of the signal trend behavior of industrial equipment during shut-down operations. The obtained results have been compared with those achieved by a state-of-the-art approach, known as Cluster-based Similarity Partitioning and Serial Graph Partitioning and Fill-reducing Matrix Ordering Algorithms (CSPA-METIS): the results show that the novel approach identifies the final ensemble clustering with a lower misclassification rate than the CSPA-METIS approach.

1. Introduction

In industries such as chemical, oil and gas, and nuclear, equipment is subjected to several causes of performance degradation (e.g., presence of manufacturing defects, wear and tear) that lead systems to work in anomalous conditions (Baraldi et al. 2013a). Capturing the different operational conditions of the equipment, detecting the onset of abnormal conditions and classifying them into different types can aid the decision maker in selecting a proper maintenance intervention and, hence, increase equipment availability and system safety, while reducing overall corrective maintenance costs (Piccinini and Demichela, 2008; Al-Dahidi et al. 2014; Demichela and Camuncoli, 2014). Fault diagnosis aims at partitioning the collected data representative of the different operational conditions of the equipment into dissimilar groups (whose number may be "a priori" unknown), such that data belonging to the same group are more similar to each other than to those belonging to the other groups. Once the groups are identified, one can distinguish, among them, anomalous behaviors of the equipment (Baraldi et al. 2013a). In this paper, we consider the practical case in which the number of groups is a priori unknown and formulate the problem as an unsupervised classification problem aimed at partitioning the data into homogeneous clusters, so that data belonging to the same cluster are very similar to each other and dissimilar from those of the other clusters (Salvador, 2002). Several clustering algorithms have been proposed and practically used to solve unsupervised classification problems, for example K-Means (Liao and Bolt, 2002), Self-Organizing Maps (SOM) (Al-Dahidi, 2014), Fuzzy C-Means (FCM) (Di Maio et al. 2011; Di Maio et al. 2012; Baraldi et al. 2013a), Classification Trees (Baraldi et al. 2012) and spectral clustering (Von Luxburg, 2007; Baraldi et al. 2013a; Baraldi et al. 2013b; Baraldi et al. 2014).
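As a concrete reminder of how one such base clusterer operates, K-Means can be sketched in a few lines. This is a minimal illustration of Lloyd's algorithm on toy 1-D data, not the implementation used in the paper:

```python
import random

def kmeans(data, k, n_iter=50, seed=0):
    """Lloyd's algorithm on a list of numeric tuples.

    Returns one cluster label (0..k-1) per datum.
    """
    rng = random.Random(seed)
    centers = [tuple(c) for c in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: each datum goes to the nearest centre (squared Euclidean).
        for i, x in enumerate(data):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(col) / len(col) for col in zip(*members))
    return labels

# Two well-separated 1-D groups end up in two different clusters.
labels = kmeans([(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)], k=2)
```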
Please cite this article as: Al-Dahidi S., Di Maio F., Baraldi P., Zio E., 2015, Ensemble clustering for fault diagnosis in industrial plants, Chemical Engineering Transactions, 43, 1225-1230, DOI: 10.3303/CET1543205

However, there is no unique clustering algorithm capable of correctly identifying the underlying structure of any kind of dataset. Even the application of different clustering algorithms to the same set of data, or of the same algorithm with different parameter settings, leads to different clustering results (Fred and Jain, 2005; Fern and Lin, 2008; Vega-Pons and Ruiz-Shulcloper, 2011). Without any prior knowledge on the underlying structure of the dataset, one can only say that the different results obtained are equally plausible (Vega-Pons and Ruiz-Shulcloper, 2011). To cope with this uncertainty in the clustering model, ensemble approaches have been proposed (Strehl and Ghosh, 2002). A typical ensemble clustering scheme is shown in Figure 1. For a given dataset X, ensemble clustering usually entails the collection of the results from multiple base clusterings. The base clusterings composing the ensemble can differ because of the different clustering algorithms used, or because of the different data features extracted from X upon which a (same) clustering algorithm is applied. The outcomes of the individual base clusterings are then aggregated into the final ensemble clustering P* by a given aggregation method (Strehl and Ghosh, 2002; Topchy et al. 2005; Chen, 2007; Vega-Pons and Ruiz-Shulcloper, 2011).

Figure 1: Scheme of the ensemble clustering approach

Several methods have been used to obtain the final ensemble clustering.
For example, graph and hypergraph partitioning algorithms, such as the Cluster-based Similarity Partitioning Algorithm (CSPA), construct a graph from the similarities among the base clusterings and cluster it using a graph-based clustering algorithm, such as the Serial Graph Partitioning and Fill-reducing Matrix Ordering Algorithm (METIS) (Karypis and Kumar, 1995; Strehl and Ghosh, 2002), for a predetermined number of clusters M in the final ensemble clustering (Topchy et al. 2005). In this paper, the CSPA and METIS algorithms have been taken as reference for comparison, because CSPA-METIS is the simplest and often best performing method for ensemble aggregation (Strehl and Ghosh, 2002). However, for most industrial applications, the "a priori" knowledge of the number of clusters M to be found in the final clustering is rarely available (Chakaravathy and Ghosh, 1996; Strehl and Ghosh, 2002). For this reason, the objective of the present work is to develop a novel unsupervised ensemble clustering approach capable of identifying the final ensemble clustering without any previous knowledge of M. To this aim, we propose to replace the METIS algorithm with spectral clustering (Von Luxburg, 2007; Baraldi et al. 2013a) and the Silhouette validity index (Rousseeuw, 1987), to automatically determine M. In practice, the base clustering results are summarized in a co-association matrix by pairwise similarity computation. Then, a spectral clustering technique (Von Luxburg, 2007; Baraldi et al. 2013a), embedding the unsupervised K-Means algorithm, is fed with the similarity matrix values to mine the clusters formed by the most similar data.
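The pairwise-similarity (co-association) computation described above can be sketched as follows. This is an illustrative implementation of the idea, not the authors' code, and the toy labels matrix at the end is hypothetical:

```python
def co_association(Y):
    """Co-association matrix from a labels matrix Y (CSPA idea).

    Y[i][j] is the cluster label of datum i in base clustering j
    (an N x H labels matrix). Returns the N x N similarity matrix S
    whose entry S[i][k] is the fraction of the H base clusterings
    that put data i and k in the same cluster, i.e. the entry-wise
    average of the per-clustering binary adjacency matrices.
    """
    N, H = len(Y), len(Y[0])
    S = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for k in range(N):
            agree = sum(1 for j in range(H) if Y[i][j] == Y[k][j])
            S[i][k] = agree / H
    return S

# Toy labels matrix: data 0 and 1 are always co-clustered, datum 2 never is.
S = co_association([[1, 1], [1, 1], [2, 2]])
```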
The optimum number of clusters C* is selected among several candidates C_candidate, based on the morphology of the obtained final ensemble clusters, evaluated by the Silhouette validity index, which measures the similarity of the data belonging to the same cluster and their dissimilarity from those in the other clusters (a large Silhouette value indicates that the clusters of the final ensemble clustering are well separated and compact (Rousseeuw, 1987)). The proposed approach is tested on an artificial case study representative of the signal trend behavior of industrial equipment during shut-down transients. The obtained results have been compared with those achieved by the CSPA-METIS approach. The remainder of this paper is organized as follows. In Section 2, the novel unsupervised ensemble clustering approach is proposed. The artificial case study representative of the signal trend behavior of industrial equipment during shut-down transients is introduced in Section 3. The results obtained with the application of the novel approach to the artificial case, and the comparison with CSPA-METIS, are discussed in Section 4. Finally, Section 5 concludes the paper with some considerations.

2. The novel unsupervised ensemble clustering approach

In this Section, the novel unsupervised ensemble clustering approach is proposed to overcome the need for "a priori" knowledge of the number of clusters M in the final ensemble clustering. The flowchart of the method is sketched in Figure 2. The algorithm comprises two phases: a procedure (i.e., CSPA) for establishing a similarity matrix S, and a novel ensemble procedure for revealing the "hidden" structure P* of S by adopting spectral clustering and the Silhouette validity index.
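The core of the novel ensemble procedure, embedding the data in the leading eigenvectors of a normalized Laplacian of S, can be sketched as follows. The paper only names the matrix L_rs; the random-walk form L = I - D^{-1}S is assumed here purely for illustration:

```python
import numpy as np

def spectral_embedding(S, c):
    """Eigen-decomposition of a normalized Laplacian built from S.

    Returns the eigenvalues in ascending order and the eigenvectors of
    the c smallest ones, i.e. the N x c reduced matrix that the
    clustering step is fed with. The random-walk form L = I - D^{-1} S
    is an assumption: the paper only names the matrix L_rs.
    """
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)                        # degrees
    L = np.eye(len(S)) - S / d[:, None]      # I - D^{-1} S
    eigvals, eigvecs = np.linalg.eig(L)      # L is not symmetric in general
    order = np.argsort(eigvals.real)         # ascending eigenvalues
    return eigvals.real[order], eigvecs[:, order[:c]].real

# Block-diagonal similarity: two obvious clusters of two data each.
S = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
vals, U = spectral_embedding(S, c=2)
```

For this block-diagonal S, the two smallest eigenvalues are (numerically) zero and the corresponding embedding rows coincide within each block, so even a simple K-Means on U recovers the two groups.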
Figure 2: Flowchart of the proposed approach

We consider N data belonging to the dataset X that are clustered into H base clusterings. For each j-th base clustering, j = 1, ..., H, each datum is labeled by an integer number ranging in [1, C_opt^j], where C_opt^j is the number of clusters of the j-th base clustering. The problem of clustering the N data is thus transformed into the problem of aggregating the base clustering outcomes, stored in a labels matrix Y of size N×H. The algorithm entails five main steps, where the first two constitute the CSPA procedure and the last three the novel ensemble procedure:

Step 1: Adjacency matrix computation. For each j-th base clustering, if two data belong to the same cluster they are considered similar (similarity μ = 1); otherwise, they are dissimilar (similarity μ = 0). An adjacency binary similarity matrix A is built by aggregating the similarities of the H base clusterings (Strehl and Ghosh, 2002).

Step 2: Similarity matrix computation. From the adjacency binary similarity matrix A, the overall similarity matrix S is computed as the entry-wise average over the H base clusterings, i.e., S = (1/H) A A^T (Strehl and Ghosh, 2002). In this way, each entry of the similarity matrix takes a value in [0, 1], proportional to how often a pair of data is grouped together.

Step 3: Spectral clustering. Spectral clustering transforms the similarity matrix S into a normalized Laplacian matrix L_rs. Then, for the obtained L_rs, the eigenvectors u_1, u_2, ..., u_C_candidate, ..., u_N corresponding to the eigenvalues λ_1, λ_2, ..., λ_C_candidate, ..., λ_N, sorted in ascending order, are calculated and stored in a matrix U of size N×N, where C_candidate ∈ [C_min, C_max], and C_min and C_max are the minimum and maximum possible numbers of clusters in the final ensemble clustering, respectively.

Step 4: Clustering algorithm.
For each candidate number of clusters C_candidate, the eigenvectors u_1, u_2, ..., u_C_candidate associated with the C_candidate smallest eigenvalues of the Laplacian matrix L_rs are taken. The reduced matrix of U, of size N×C_candidate, is then fed to a clustering algorithm to find the final ensemble clustering P*_C_candidate. In this work, we resort to the K-Means algorithm, one of the most popular clustering methods (Su and Chou, 2001; Fern and Lin, 2008).

Step 5: Final ensemble clustering selection. For each C_candidate, the obtained ensemble clustering P*_C_candidate is evaluated by computing its Silhouette validity index SV_C_candidate (Rousseeuw, 1987). The most appropriate ensemble clustering P*_C* is the one for which the Silhouette reaches its maximum, i.e., for which the clusters are well separated and compact (Rousseeuw, 1987).

3. Artificial case study

An artificial case study has been designed to generate N = 149 data representative of the signal trend behavior of industrial equipment, e.g., a rotating machine, during shut-down operations. Each datum is described by H = 3 features representative of the equipment condition (e.g., vibration signals) and of the environmental and operational conditions that can influence the equipment behavior (e.g., vacuum and temperature signals).
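A data-generation scheme of this kind can be sketched as below. The paper's actual distribution functions (Functions 1 to 10 in Figure 3) are not specified in the text, so the Gaussian parameters, class names and sample sizes in this snippet are illustrative assumptions only:

```python
import random

rng = random.Random(42)

# Class-conditional (mean, std) pairs for the H = 3 features. These
# Gaussians are our own assumption, standing in for the paper's
# Functions 1-10. Note that feature 1 is identically distributed across
# classes, mirroring the paper's remark that clustering feature j = 1
# alone reveals no abnormal condition.
CLASS_SPECS = {
    "NC1": [(0.0, 0.1), (0.0, 0.1), (0.0, 0.1)],
    "AC1": [(0.0, 0.1), (3.0, 0.1), (0.0, 0.1)],
    "AC2": [(0.0, 0.1), (0.0, 0.1), (3.0, 0.1)],
}

def sample_datum(cls):
    """Draw one feature vector from the class's three distributions."""
    return [rng.gauss(mu, sd) for mu, sd in CLASS_SPECS[cls]]

# A small stand-in for the paper's 149 x 3 dataset matrix X.
X = [sample_datum(c) for c in ["NC1"] * 5 + ["AC1"] * 5 + ["AC2"] * 5]
```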
These data are stored in a matrix X of size 149×3. The objective is to reveal the "hidden" structure P* of the dataset X by identifying groups of data with similar functional behaviors, representative of different operational conditions of the equipment. Without loss of generality, it is assumed that the operational conditions of the industrial equipment are M = 7: 1) three classes of normal condition (NC1, NC2, NC3), 2) three classes of abnormal condition (AC1, AC2, AC3), and 3) one class of outliers (i.e., unknown behaviors). The dataset X is pictorially shown in Figure 3. The values of each j-th feature X_j, j = 1, 2, 3, for the different classes of data have been created by randomly sampling their realizations from different univariate distribution functions (Functions 1 to 10 in Figure 3) whose combination characterizes the class.

Figure 3: The seven operational conditions of the artificial case study

As shown in Figure 3, clustering each j-th feature independently may reveal only some of the "hidden" operational conditions of the industrial equipment, whereas only a final ensemble clustering can reveal all M = 7 clusters. For example, clustering feature j = 1 cannot reveal any abnormal operational condition. The objective is thus to aggregate these base clusterings into a final ensemble clustering P* capable of identifying the "true" grouping of the shut-down transients of the industrial equipment. To mine the clusters shown in Figure 3, the j-th base clustering outcomes are obtained by a K-Means unsupervised learning algorithm (Vlachos et al. 2003). To identify the correct number of clusters C_opt^j for each base clustering, the Davies-Bouldin (DB) validity criterion has been used (Davies and Bouldin, 1979): the minimum DB value is reached for the number of clusters that gives optimal separation and compactness. Table 1 reports the optimum number of clusters C_opt^j obtained for each base clustering.
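The Davies-Bouldin criterion used to pick C_opt^j can be sketched as follows. This is a generic 1-D implementation of the index of Davies and Bouldin (1979), run on toy data rather than on the paper's features:

```python
def davies_bouldin(data, labels):
    """Davies-Bouldin index for a 1-D clustering; lower is better.

    DB = average over clusters i of max_{j != i} (s_i + s_j) / d_ij,
    where s_i is the mean distance of cluster i's members to its
    centroid and d_ij the distance between the two centroids.
    """
    clusters = sorted(set(labels))
    cents = {c: sum(x for x, l in zip(data, labels) if l == c) /
                sum(1 for l in labels if l == c) for c in clusters}
    scatter = {c: sum(abs(x - cents[c]) for x, l in zip(data, labels) if l == c) /
                  sum(1 for l in labels if l == c) for c in clusters}
    db = 0.0
    for i in clusters:
        db += max((scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
                  for j in clusters if j != i)
    return db / len(clusters)

# A well-separated grouping scores lower (better) than a mixed-up one.
good = davies_bouldin([0.0, 0.1, 5.0, 5.1], [0, 0, 1, 1])
bad = davies_bouldin([0.0, 0.1, 5.0, 5.1], [0, 1, 0, 1])
```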
To validate the use of the DB criterion for selecting C_opt^j, we use the information on the real classes to which the data belong to calculate the misclassification rate reported in Table 1 (it is worth noticing that in real industrial applications the real class is unknown).

Table 1: Optimum numbers of clusters and misclassification rates of the base clusterings for the three features

Feature   C_opt^j   Misclassification rate
j = 1     2         7.4%
j = 2     2         4.6%
j = 3     4         6.8%

The obtained base clustering labels for each feature have then been stored in a labels matrix Y of size 149×3. The application of the ensemble clustering approach aims at finding the final ensemble clustering of the data. In the following Section, our novel approach is applied and compared with the CSPA-METIS approach.

4. Application of the novel approach to the artificial case study

In this Section, the application of the novel ensemble clustering approach is described according to the steps presented in Section 2 and, then, the obtained results are compared to those achieved by the CSPA-METIS approach. Given the similarity matrix S, we calculate L_rs, its eigenvectors u_1, u_2, ..., u_149 and the corresponding eigenvalues λ_1, λ_2, ..., λ_149. The obtained eigenvectors are stored in the matrix U of size 149×149.
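Steps 4-5, i.e. running K-Means on the reduced eigenvector matrix for each candidate number of clusters and keeping the candidate with maximum Silhouette, can be sketched as follows. The snippet uses self-contained toy versions of K-Means and of the Silhouette of Rousseeuw (1987); `rows` stands in for the rows of U and is not the paper's data:

```python
import random

def kmeans(rows, k, iters=30, seed=0):
    """Minimal Lloyd's algorithm; returns one label per row."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        for i, x in enumerate(rows):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
        for c in range(k):
            m = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if m:
                centers[c] = [sum(col) / len(col) for col in zip(*m)]
    return labels

def silhouette(rows, labels):
    """Mean Silhouette value: (b - a) / max(a, b) averaged over the data."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    vals = []
    for i, x in enumerate(rows):
        same = [dist(x, rows[j]) for j in range(len(rows))
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same) if same else 0.0
        b = min(sum(dist(x, rows[j]) for j in range(len(rows)) if labels[j] == c) /
                sum(1 for l in labels if l == c)
                for c in set(labels) if c != labels[i])
        vals.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(vals) / len(vals)

# Steps 4-5: cluster for each candidate C and keep the C maximising Silhouette.
rows = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
scores = {c: silhouette(rows, kmeans(rows, c)) for c in range(2, 5)}
c_star = max(scores, key=scores.get)
```

On these two tight, well-separated toy groups, the Silhouette is maximized at two clusters, so the selection returns C* = 2.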
The number of clusters of the final ensemble clustering is selected according to the values of the Silhouette for the different numbers of clusters C_candidate spanning the interval [2, 16], where the lower bound (2) is the minimum number of base clusters (see Table 1) and the upper bound (16) is the largest possible number of cluster combinations of the three base clusterings (i.e., 2×2×4): the optimum number of clusters C* in the final ensemble clustering is the value at which the Silhouette is maximized, i.e., C* = 6 (star in Figure 4, left). Figure 4 (right) shows the results of the aggregation obtained by the novel and the reference approaches. The Figure shows the N = 149 data in chronological order from top to bottom, with the associated true clustering labels represented in different shades of color; it is worth mentioning that there is no correspondence between similar colors of the true clustering and of the aggregation results. The application of the CSPA-METIS approach clearly distinguishes only three clusters, i.e., NC1, NC2 and NC3, whereas the remaining data are not correctly clustered. Comparing the obtained clustering results with the true clustering, the misclassification rate equals 34.9% (52 out of 149 data incorrectly classified), which is not satisfactory. On the other hand, the novel approach clearly recognizes six out of the seven operational conditions, with a misclassification rate of 3.4% (5 out of 149 data incorrectly classified), much lower than that of the CSPA-METIS approach.

Figure 4: Silhouette values vs. number of clusters (left) and the final ensemble clusterings obtained by the novel and the reference approaches vs.
the true clustering (right)

As a last remark, it is worth mentioning that, with the proposed approach, the outliers (three transients, class 7) have not been grouped together in a separate cluster: this depends on the capability of the base clustering algorithms to recognize the outliers (Topchy et al. 2005). For example, the optimum number of clusters for feature j = 1 is C_opt^1 = 2 (see Table 1), whereas it should be equal to 3 (see Figure 3).

5. Conclusions

In this work, a novel unsupervised ensemble clustering approach has been proposed to construct a final ensemble clustering P* from H individual base clustering outcomes. The method is based on spectral clustering embedding an unsupervised K-Means algorithm: a co-association matrix obtained by pairwise similarity computation summarizes the similarity among the data, and the clusters are formed by the most similar data. The optimum number of clusters is selected among several candidates based on the Silhouette validity index, which quantifies the morphology of the obtained clusters, accounting for the similarity of the data belonging to the same cluster and, at the same time, for their dissimilarity from those in the other clusters: a large Silhouette value indicates that the clusters of the final ensemble clustering are well separated and compact. The proposed approach has been successfully tested on an artificial case study properly designed to reproduce the signal trend behavior of industrial equipment during shut-down transients. The obtained results have been compared to those achieved by the CSPA-METIS approach of the literature. The continuation of this work will consider the application of the method to real datasets collected during past operation of industrial equipment such as, for example, a nuclear power plant turbine.
Acknowledgements

The participation of Sameer Al-Dahidi and Piero Baraldi in this research is supported by the European Union Project INNovation through Human Factors in risk analysis and management (INNHF, www.innhf.eu), funded by the 7th Framework Programme FP7-PEOPLE-2011 Initial Training Network: Marie-Curie Action. The participation of Enrico Zio in this research is partially supported by the China NSFC under grant number 71231001.

References

Al-Dahidi S., 2014, The Use of Self Organizing Maps for Diagnosing Faults in Motor Bearings, Safety and Reliability: Methodology and Applications - Proceedings of the European Safety and Reliability Conference, ESREL 2014, Wroclaw, Poland, September 2014, 895-902.
Al-Dahidi S., Baraldi P., Di Maio F., Zio E., 2014, A novel fault detection system taking into account uncertainties in the reconstructed signals, Annals of Nuclear Energy, 73, 131-144.
Baraldi P., Di Maio F., Zio E., Sauco S., Droguett E., Magno C., 2012, Sensitivity Analysis of the Scale Deposition on Equipment of Oil Wells Plants, Chemical Engineering Transactions, 26, 327-332.
Baraldi P., Di Maio F., Zio E., 2013a, Unsupervised Clustering for Fault Diagnosis in Nuclear Power Plant Components, International Journal of Computational Intelligence Systems, 6(4), 764-777.
Baraldi P., Di Maio F., Rigamonti M., Zio E., Seraoui R., 2013b, Transients Analysis of a Nuclear Power Plant Component for Fault Diagnosis, Chemical Engineering Transactions, 33, 895-900.
Baraldi P., Di Maio F., Rigamonti M., Zio E., Seraoui R., 2014, Unsupervised clustering of vibration signals for identifying anomalous conditions in a nuclear turbine, Journal of Intelligent and Fuzzy Systems, accepted.
Chakaravathy S.V., Ghosh J., 1996, Scale based clustering using a radial basis function network, IEEE Transactions on Neural Networks, 2(5), 1250-1261.
Chen K., 2007, Trends in Neural Computation, Springer.
Davies D.L., Bouldin D.W., 1979, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 224-227.
Demichela M., Camuncoli G., 2014, Risk based decision making. Discussion on two methodological milestones, Journal of Loss Prevention in the Process Industries, 28(1), 101-108.
Di Maio F., Secchi P., Vantini S., Zio E., 2011, Fuzzy C-Means Clustering of Signal Functional Principal Components for Post-Processing Dynamic Scenarios of a Nuclear Power Plant Digital Instrumentation and Control System, IEEE Transactions on Reliability, 60(2), 415-425.
Di Maio F., Hu J., Tse P., Pecht M., Tsui K., Zio E., 2012, Ensemble-approaches for clustering health status of oil sand pumps, Expert Systems with Applications, 39(5), 4847-4859.
Dimitriadou E., Weingessel A., Hornik K., 2001, Voting-merging: an ensemble method for clustering, Proc. 2001 Int. Conf. Artificial Neural Networks (ICANN'01), 217-224.
Fern X.Z., Lin W., 2008, Cluster ensemble selection, Statistical Analysis and Data Mining, 1(3), 128-141.
Fred A.L., Jain A.K., 2005, Combining multiple clusterings using evidence accumulation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835-850.
Karypis G., Kumar V., 1995, METIS - Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0, Technical report.
Liao T., Bolt B., 2002, Understanding and projecting the battle state, 23rd Army Science Conference, Orlando, FL.
Piccinini N., Demichela M., 2008, Risk based decision-making in plant design, Canadian Journal of Chemical Engineering, 86(3), 316-322.
Rousseeuw P., 1987, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, 20, 53-65.
Salvador A., 2002, Faults diagnosis in industrial processes with a hybrid diagnostic system, MICAI 2002: Advances in Artificial Intelligence, 536-545, Springer Berlin Heidelberg.
Strehl A., Ghosh J., 2002, Cluster ensembles - a knowledge reuse framework for combining partitionings, AAAI/IAAI, 93-99.
Su M.C., Chou C.H., 2001, A modified version of the K-means algorithm with a distance based on cluster symmetry, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 674-680.
Topchy A., Jain A.K., Punch W., 2005, Clustering ensembles: Models of consensus and weak partitions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1866-1881.
Vega-Pons S., Ruiz-Shulcloper J., 2011, A survey of clustering ensemble algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 25(3), 337-372.
Von Luxburg U., 2007, A Tutorial on Spectral Clustering, Statistics and Computing, 17(4), 395-416.