CHEMICAL ENGINEERING TRANSACTIONS, VOL. 61, 2017
A publication of The Italian Association of Chemical Engineering
Online at www.aidic.it/cet
Guest Editors: Petar S Varbanov, Rongxin Su, Hon Loong Lam, Xia Liu, Jiří J Klemeš
Copyright © 2017, AIDIC Servizi S.r.l.
ISBN 978-88-95608-51-8; ISSN 2283-9216

A Novel Training Sample Selection Approach for Near-Infrared Spectroscopy Model and Its Industrial Application

Kaixun He^a,*, Yiran Li^b, Kai Wang^c
^a College of Electrical Engineering and Automation, Shandong University of Science and Technology, Qingdao 266590, China
^b School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
^c East China University of Science and Technology, Shanghai 200237, China
kaixunhe@sdust.edu.cn

Near-infrared (NIR) spectroscopy has been widely applied for the real-time measurement of quality variables, which plays an important role in process control, monitoring and optimization. Since the prediction accuracy of an NIR model depends strongly on the structure of the training samples, it is important to optimize the training sample selection process. Therefore, in the present work, a cross-validation-based approach combined with the k-means++ algorithm is developed for this optimization. Based on the results, an efficient adaptive multi-model approach can be developed. During online application, the optimal sub-model is selected according to the similarity distance between the query sample and the sub-models, so that high-performance predictions can be achieved. The usefulness and superiority of the proposed method are demonstrated by comparison with other modelling algorithms in a real-world gasoline blending process in China.

1. Introduction

In the process industries, key product quality variables must be measured accurately and promptly in order to produce high-quality products (Bakirov et al., 2017).
However, traditional laboratory analyses are expensive and time-consuming, and they introduce a significant time delay into the optimal control system. In recent years, near-infrared (NIR) spectroscopy has been widely employed as an online process analytical tool (PAT) to address these issues. Its dominant advantage is the ability to provide estimates much more rapidly with little or no sample preparation (Mei et al., 2016). With an NIR-based analytical tool, difficult-to-measure key properties are estimated from the NIR spectra using statistical or machine learning techniques based on Beer's law. He et al. (2015) reported this application in an online gasoline blending process, where a dual updating strategy was adopted to improve the accuracy of the NIR model: a local weighted strategy was used within sampling intervals, and a recursive method was applied when new reference samples became available. For such applications, the key step is to establish the NIR quantitative calibration model; online optimal control can then be carried out based on the estimated properties.

Owing to their ability to deal with collinearity and high dimensionality, principal component regression (PCR) and partial least-squares regression (PLS) have long been widely adopted (Quiñones et al., 2014). For properties that have a nonlinear relationship with NIR absorbance, nonlinear PLS, artificial neural networks (ANNs), support vector regression (SVR) and Gaussian process regression (GPR) are used (Balabin and Lomakina, 2011). All of these methods are static, global strategies, and they have been adopted as useful methods for online prediction over recent decades. Nevertheless, static models cannot always function well because of changes in process raw materials, process fouling, etc. (Kadlec et al., 2011).
To cope with this issue, various adaptive strategies have been proposed, such as recursive or incremental algorithms, the moving window strategy, just-in-time learning (JITL) and local weighted regression (LW). A typical representative of the incremental methods is recursive partial least squares (RPLS), which expands the training dataset by adding every newly available sample. When sampling is continuous, it can update the original model and capture new variations of the process. However, since the reference properties have to be analysed offline, the sampling interval is generally long and non-uniform. Therefore, the model cannot be updated in time and does not deliver satisfactory predictions in real-world applications. The moving window approach discards old data as new samples are acquired; hence, it has similar issues to the recursive algorithms.

Please cite this article as: He K., Li Y., Wang K., 2017, A novel training sample selection approach for near-infrared spectroscopy model and its industrial application, Chemical Engineering Transactions, 61, 1429-1434, DOI: 10.3303/CET1761236

Recently, JITL and LW strategies have gained popularity because of their ability to deal with nonlinearity as well as abrupt changes. Because the updating process does not depend on the reference information of the new sample, both can adjust the calibration model in time to capture the current state of the process. However, the number of local training samples is difficult to determine. Including more samples increases the online computational load, while including fewer samples degrades performance (Ge and Song, 2010). Additionally, training samples are selected based only on similarity distance; in this manner, the dependent variable information and process knowledge are not taken into consideration.
Hence, with an unsuitable similarity criterion, it is possible to construct a training dataset that leads to a larger prediction error. All of these features limit the application of these methods in real-world industry.

As discussed above, for processes characterized by large sampling intervals, strong nonlinearity and multiple operating conditions, it is not possible to achieve adequate prediction accuracy using just one model for a wide or the entire range. The LW strategy is a suitable approach in practice, but to obtain a high-performance model the training samples should be carefully selected. Motivated by these issues, this paper develops a novel supervised training sample selection method and establishes a local weighted partial least-squares (LWPLS) NIR model for online prediction. The proposed method includes two steps:
(1) Offline process: the main task of this procedure is to divide the original dataset into several sub-datasets using the k-means++ algorithm and then to optimize each sub-dataset based on cross-validation.
(2) Online process: this procedure selects the optimal sub-dataset for each query sample and builds the corresponding local model using the local weighted strategy.
Based on these two steps, the most relevant model can be determined for each new sample and good prediction performance can be expected. To verify the effectiveness of the novel strategy, an industrial case study is provided.

2. Theory and algorithm

Historical NIR data are sampled from multiple operating processes, which are characterized by inherent nonlinearity and shifting dynamics. Hence, it is difficult to select an appropriate training dataset and construct an accurate NIR model for a specific process. In practice, this procedure depends mainly on the process knowledge of experienced engineers and is time-consuming.
Additionally, for online application, nonlinear and adaptive forms of the NIR model should be adopted to cope with nonlinearity and time variance. This motivated us to explore a new sample selection and model updating approach named k-means and cross-validation based local weighted PLS (KmCv-LWPLS). The details of the proposed method are presented in the following.

2.1 Dataset partition strategy

The procedure of the proposed method is summarized below.
Step 1. Determine the number of partitions $k$. In this work, the parameter $k$ is determined by trial and error.
Step 2. Pre-process the training dataset $\{X_{train}, Y_{train}\}$ and extract feature components by PCA to obtain the low-dimensional input variables $X_{pca}$.
Step 3. Carry out the k-means++ algorithm to obtain the initial clustering labels $tag_j$ and clustering centres $u_j$.
Step 4. According to the acquired label vector $tag_j$, divide the original dataset $\{X_{train}, Y_{train}\}$ as

$$\{X_{train}, Y_{train}\} = \{(X_{sub}, Y_{sub})_1, \ldots, (X_{sub}, Y_{sub})_j, \ldots, (X_{sub}, Y_{sub})_k\} \qquad (1)$$

where $j = 1, 2, \ldots, k$, and then establish a PLS model for each sub-dataset.
Step 5. Detect the boundary points of each sub-dataset.
Step 6. Hold out all the boundary points and establish a PLS model $subM_j$ for each new sub-dataset $subD_j$.
Step 7. Calculate the square error $SE_b$ for all the boundary points using each $subM_j$,

$$SE_b = (y_b - \hat{y}_b)^2 \qquad (2)$$

where $y_b$ is the laboratory analysis value of the boundary sample $x_b$ and $\hat{y}_b$ denotes its predicted value.
Step 8. Allocate each boundary point to the $subD_j$ that gives the minimum prediction error.

To detect the boundary points of each sub-dataset (Step 5), a leave-one-out cross-validation method is adopted. The basic idea of the proposed strategy is to evaluate the contribution of each training sample and sort the samples accordingly. In this way, the points that deteriorate the performance of the NIR model can be detected.
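The reallocation in Steps 6-8 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: ordinary least squares stands in for PLS to keep the sketch dependency-free, the boundary points are assumed to be already known (passed as a boolean mask), and the function names `fit_linear`, `predict_linear` and `reallocate_boundary` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept (a stand-in for PLS in this sketch)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def reallocate_boundary(X, y, labels, boundary):
    """Reassign each boundary point to the sub-dataset whose model
    gives the smallest squared error SE_b (Steps 6-8, Eq. 2)."""
    labels = np.asarray(labels)
    boundary = np.asarray(boundary, dtype=bool)
    clusters = np.unique(labels)
    # Step 6: one model per sub-dataset, boundary points held out
    models = []
    for c in clusters:
        keep = (labels == c) & ~boundary
        models.append(fit_linear(X[keep], y[keep]))
    # Step 7: squared error of every boundary point under every sub-model
    Xb, yb = X[boundary], y[boundary]
    se = np.column_stack([(yb - predict_linear(m, Xb)) ** 2 for m in models])
    # Step 8: allocate each boundary point to the sub-dataset with minimum error
    new_labels = labels.copy()
    new_labels[boundary] = clusters[np.argmin(se, axis=1)]
    return new_labels
```

Because the allocation uses the reference value $y_b$ rather than the input distance alone, a sample that k-means++ placed in one cluster can move to another cluster whose input-output relationship it follows more closely.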
The details of this method for each sub-dataset $(X_{sub}, Y_{sub})_j$ are presented as follows:
1: For each $si$, hold out $(x, y)_{si}$ and build a PLS model using the remaining samples, where $si \le N_j$ and $N_j$ is the number of samples in sub-dataset $(X_{sub}, Y_{sub})_j$.
2: Calculate the prediction value $\hat{y}_{si}$ using the established PLS model.
3: Carry out the leave-one-out cross-validation algorithm to obtain the prediction values $\hat{y}_{sj}$ of the remaining samples (where $sj = 1, 2, \ldots, N_j$ and $sj \neq si$).
4: Calculate the root-mean-square error (RMSE),

$$RMSE^{j}_{si} = \sqrt{\frac{(y_{si} - \hat{y}_{si})^2 + \sum_{sj=1,\, sj \neq si}^{N_j} (y_{sj} - \hat{y}_{sj})^2}{N_j}} \qquad (3)$$

5: Sort $(x, y)_{si}$ in ascending order of $RMSE^{j}_{si}$.
6: Hold out the first $sk$ ($sk = 1, 2, \ldots$) samples and establish a PLS model using the remaining data.
7: Calculate $RMSE^{j}_{sk}$ using the method described in Steps 2-4.
8: Increase the value of $sk$ and repeat Steps 6-7 until $RMSE^{j}_{sk}$ reaches its minimum value.
9: The samples $\{(x, y)_1, (x, y)_2, \ldots, (x, y)_{sk}\}_j$ are denoted as boundary points and the new sub-dataset is denoted as $subD_j$.

2.2 Model selection and updating strategy

Following the procedure in Section 2.1, the optimal partition of the original dataset can be obtained. Then, for each query sample, the optimal sub-dataset should be selected and the corresponding sub-model established. In this study, the Euclidean distance $d_{q,k}$ between the query sample $x_q$ and $\bar{x}^{k}_{sub}$ is calculated to detect the optimal sub-model. Here,

$$d_{q,k} = \sqrt{(x_q - \bar{x}^{k}_{sub})(x_q - \bar{x}^{k}_{sub})^{T}} \qquad (4)$$

and $\bar{x}^{k}_{sub}$ denotes the mean value of sub-dataset $k$. The sub-dataset with the minimum $d_{q,k}$ is adopted. Once the optimal sub-dataset is obtained, a prediction model can be established using PCR, PLS, etc. In Section 2.1, PLS is adopted to build the NIR model because of its simplicity.
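The sub-dataset selection of Eq. (4) amounts to a nearest-mean lookup. The following minimal Python sketch illustrates it; the function name `select_sub_dataset` and the representation of the sub-datasets as a list of arrays (one row per spectrum) are assumptions made for illustration only.

```python
import numpy as np

def select_sub_dataset(x_q, sub_datasets):
    """Return the index of the sub-dataset whose mean spectrum is closest
    to the query sample x_q in Euclidean distance (Eq. 4), together with
    the distances to all sub-dataset means."""
    x_q = np.asarray(x_q, dtype=float)
    # Mean spectrum of each sub-dataset
    centres = np.array([np.asarray(X_sub, dtype=float).mean(axis=0)
                        for X_sub in sub_datasets])
    # d_{q,k} for every sub-dataset k
    distances = np.sqrt(((centres - x_q) ** 2).sum(axis=1))
    return int(np.argmin(distances)), distances
```

In use, the returned index picks the training data from which the local model for the query sample is then built.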
However, as mentioned before, PLS is a static, global and linear method; it may lead to inaccurate estimates in some local regions, and its robustness is often jeopardized by process variations. Adaptive modelling and updating methods are therefore necessary. Considering the uneven and low-frequency sampling of reference data as well as the large time delay of laboratory analysis, the traditional bias updating and recursive algorithms are not applicable in practice. Both LW and JITL methods are widely applied to address such issues. The LW method weights each training sample $x_i$ according to the similarity distance between $x_i$ and $x_q$, whereas JITL selects new samples from the historical dataset to build a new model for every query sample $x_q$. With JITL, the predictive accuracy may drop because both the training samples and their number change. Compared with JITL, LW does not change the structure of the original training dataset, thereby improving the stability of the NIR model during updating. Hence, in this research, the local weighted method is adopted. The weight $\omega_i$, which evaluates the similarity between $x_q$ and $x_i$, is calculated as follows:

$$\omega_i = \exp\left(-\frac{\sqrt{\sum_{d=1}^{m} (x_{i,d} - x_{q,d})^2}}{\alpha \cdot \mathrm{std}\left(\sqrt{\sum_{d=1}^{m} (x_{i,d} - x_{q,d})^2}\right)}\right) \qquad (5)$$

Here $d$ indexes the $m$ input variables and $\alpha$ is a localization parameter. When $\alpha$ is small, the similarity decreases steeply; otherwise, it changes gradually (Kim et al., 2013).

3. Case study

In this section, a case study is provided to validate the practicability of the proposed method. The dataset was obtained from a real-world gasoline blending process. Four modelling approaches, namely PLS, LWPLS, k-means based local weighted PLS (Km-LWPLS) and JITL, are applied to NIR model development for comparison. The root mean square error (RMSE) and the coefficient of determination (R2) are defined as follows for quantitative comparison of the different algorithms.
The model which gives the lowest RMSE and the highest R2 is considered best.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (6)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (7)$$

where $y_i$ and $\hat{y}_i$ are the reference and predicted values of sample $i$, $\bar{y}$ is the mean of the reference values and $n$ is the number of samples.

The four algorithms used for comparison are as follows:
(1) PLS: A global PLS model is established. Its structure remains unchanged during the whole process.
(2) LWPLS: For every query sample $x_q$, different weights are assigned to the training samples based on $\omega_i$, and then a new local PLS model is trained to predict the output.
(3) Km-LWPLS: In this approach, the original training dataset is divided into $k$ clusters by the k-means++ algorithm, which yields $k$ sub-models. For each query sample $x_q$, the optimal sub-model is selected and LWPLS is carried out to give the predicted value.
(4) JITL: The JITL method establishes a local model for each query sample based on similarity distance. In this study, the similarity $\omega_i$ is used for the sake of simplicity. Local training samples are selected according to $\omega_i \ge \gamma$, where $\gamma$ is the similarity threshold. The PLS algorithm is then used to build the local model.

3.1 Gasoline Blending Process

Gasoline blending is a crucial unit operation in the gasoline industry. It is the final step before the gasoline product is delivered (He et al., 2016). In this study, an NIR model is adopted to predict the research octane number (RON), which is the key property of gasoline. A total of 312 samples were collected from daily process records together with the corresponding laboratory analyses. The spectral range was restricted to 1,100 nm to 1,300 nm, so each NIR sample consists of 201 wavelength variables. Reference values of RON were measured using standard ASTM testing methodologies. In addition, for proprietary reasons, the property values (RON) were normalized between -1 and 1. The original samples were divided randomly into training and testing datasets: 175 samples were used for training and the remaining 137 samples were used as the testing dataset.
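The similarity weighting that LWPLS and JITL share can be sketched as follows. This is a hedged illustration only: it assumes Eq. (5) has the common locally weighted form in which the Euclidean distance is scaled by the localization parameter times the standard deviation of the distances, it uses the population standard deviation (the source does not say which variant is meant), and it assumes that JITL keeps the samples whose similarity is at least the threshold. The function names `similarity_weights` and `jitl_select` and the parameter names `alpha` and `gamma` are hypothetical.

```python
import numpy as np

def similarity_weights(X, x_q, alpha):
    """Similarity of every training sample to the query x_q, assuming
    w_i = exp(-d_i / (alpha * std(d))) with Euclidean distance d_i."""
    d = np.sqrt(((np.asarray(X, dtype=float) - x_q) ** 2).sum(axis=1))
    scale = alpha * d.std()  # population standard deviation (assumption)
    if scale == 0.0:
        # All samples are equally distant: give them equal weight
        return np.ones_like(d)
    return np.exp(-d / scale)

def jitl_select(X, x_q, alpha, gamma):
    """Indices of the training samples whose similarity to x_q is at
    least the threshold gamma (assumed JITL selection rule)."""
    w = similarity_weights(X, x_q, alpha)
    return np.flatnonzero(w >= gamma)
```

In LWPLS the weights enter the local PLS fit for every training sample, whereas in JITL they only decide which samples are kept, so the size of the JITL training set varies from query to query.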
Then, the training dataset was segregated into 3 clusters using the k-means++ algorithm. The parameters of all the methods are tabulated in Table 1, and the comparison results of all the methods are listed in Table 2.

Table 1: Optimal parameters of each algorithm
Method        k    α       γ
PLS           -    -       -
LWPLS         -    0.01    -
Km-LWPLS      3    0.01    -
JITL          -    -       0.001
KmCv-LWPLS    3    0.01    -

According to the RMSE and R2, PLS gives a considerably higher error than the other methods. This illustrates that PLS cannot capture the changes of the process well. As a result, in an industrial application, the PLS model needs to be updated frequently, which is very time-consuming. Therefore, it is not suitable for online prediction. The proposed KmCv-LWPLS has the lowest RMSE and the highest R2, which indicates that this method is able to estimate the future data (testing data) effectively. The Km-LWPLS approach performs better than JITL, and the results of the JITL algorithm are better than those of LWPLS. These results clearly show that a global, static and linear model cannot function well when the process is nonlinear and time-varying, while updating and multi-model strategies such as LWPLS, JITL, Km-LWPLS and KmCv-LWPLS can improve prediction accuracy.

Although the training samples of LWPLS are the same as those of PLS, LWPLS gives better results. This indicates that the local weighted strategy enables PLS to account for nonlinearity as well as time-varying behaviour. However, the high-level samples in the training dataset strongly influence the performance and lead to large prediction errors. Hence, the performance of this method is worse than that of JITL and KmCv-LWPLS. The conventional JITL method selects local training samples based on the Euclidean distance and establishes a local model for each query sample. As a result, for each new sample, the high-level points are not included in the local model. Hence, the JITL approach performs better than LWPLS.
However, as shown in Figure 1, the prediction error is large for some abrupt changes. One reason for this phenomenon seems to be that the information of the objective variable y and process knowledge are not taken into consideration when selecting the training samples. As a result, for these query samples, the JITL approach cannot obtain an optimal training dataset, which leads to poor performance. In addition, the number of similar samples changes each time, which makes the prediction accuracy unstable. For example, the modelling information is insufficient with fewer samples, while high-level points may be included and jeopardize model performance if more samples are chosen.

Figure 1: Error results of JITL and KmCv-LWPLS (x-axis: sample number; y-axis: prediction error)

Figure 2: Error curve of Km-LWPLS and KmCv-LWPLS (x-axis: sample number; y-axis: prediction error)

In contrast to JITL, Km-LWPLS and KmCv-LWPLS build local models offline based on the historical data. For online application, the optimal local model is selected based on the similarity criterion. This strategy keeps the structure of each sub-model stable and eliminates the influence of high-level data. Hence, as shown in Table 2, both Km-LWPLS and KmCv-LWPLS are more effective than the JITL, LWPLS and PLS methods. According to the results, the proposed KmCv-LWPLS improves the RMSE and R2 in comparison with the Km-LWPLS approach. The only difference between Km-LWPLS and KmCv-LWPLS is the dataset partition strategy, which clearly shows the importance of optimizing the clustering process when a multi-model is established. Besides, as illustrated in Figure 2, the improved strategy handles abrupt changes more effectively and gives a smaller error. As analysed previously, the multi-model strategy is suitable for nonlinear, multi-operation processes.
Additionally, it is important to optimize the dataset partition process in combination with process knowledge and dependent variable information. In particular, the cross-validation strategy proposed in this paper is effective. Besides, for model updating, new sampling points can improve the performance; however, this is difficult to implement because of the large sampling interval and low sampling frequency. Therefore, the local weighted approach is more practical than an updating strategy based on new samples. Since the proposed method takes full advantage of both LW and cross-validation, it provides the best results.

Table 2: Model performance of each algorithm
Method        RMSE      R2
PLS           0.3252    0.9255
LWPLS         0.2336    0.9616
JITL          0.2221    0.9652
Km-LWPLS      0.1986    0.9722
KmCv-LWPLS    0.1955    0.9731

4. Conclusions

This paper expounds the importance of modelling algorithms for NIR systems and points out that traditional modelling strategies are insufficient to establish an effective NIR model, especially for industrial processes characterized by high nonlinearity and large sampling intervals. In the current work, we propose a cross-validation-based multi-model modelling strategy to handle this issue. Through application to a real industrial dataset, it is demonstrated that the proposed KmCv-LWPLS algorithm outperforms the global, local weighted and JITL methods in terms of RMSE and R2. To reduce the computational complexity, a further modification of the proposed method could be considered.

Acknowledgments

We would like to acknowledge financial support for this work from Shandong University of Science and Technology and from the Shandong Provincial Natural Science Foundation, China (ZR2017BF026, ZR2017PF002).

References

Bakirov R., Gabrys B., Fay D., 2017, Multiple adaptive mechanisms for data-driven soft sensors, Computers & Chemical Engineering, 96, 42-54
Balabin R.M., Lomakina E.I., 2011, Support vector machine regression (SVR/LS-SVM)--an alternative to neural networks (ANN) for analytical chemistry? Comparison of nonlinear methods on near infrared (NIR) spectroscopy data, Analyst, 136, 1703-1712
Ge Z., Song Z., 2010, A comparative study of just-in-time-learning based methods for online soft sensor modeling, Chemometrics and Intelligent Laboratory Systems, 104, 306-317
He K., Qian F., Cheng H., Du W., 2015, A novel adaptive algorithm with near-infrared spectroscopy and its application in online gasoline blending processes, Chemometrics and Intelligent Laboratory Systems, 140, 117-125
He K., Qian F., Cheng H., Du W., 2016, Improved integrated optimization method of gasoline blend planning and real-time blend recipes, Industrial & Engineering Chemistry Research, 55, 4632-4645
Kadlec P., Grbić R., Gabrys B., 2011, Review of adaptation mechanisms for data-driven soft sensors, Computers & Chemical Engineering, 35, 1-24
Kim S., Kano M., Hasebe S., Takinami A., Seki T., 2013, Long-term industrial applications of inferential control based on just-in-time soft-sensors: economical impact and challenges, Industrial & Engineering Chemistry Research, 52, 12346-12356
Mei Q.-P., Li T.-F., Yao L.-Z., Huang D., Yang Y.-L., 2016, Study of an adaptable calibration model of near-infrared spectra based on KF-PLS, Chemometrics and Intelligent Laboratory Systems, 157, 152-161
Quiñones L., Velazquez C., Obregon L., 2014, A novel multiple linear multivariate NIR calibration model-based strategy for in-line monitoring of continuous mixing, AIChE Journal, 60, 3123-3132