Microsoft Word - 01_R.doc


HUNGARIAN JOURNAL 
OF INDUSTRIAL CHEMISTRY 

VESZPRÉM 
Vol. 35. pp. 75-84 (2007) 

WHAT HAPPENS TO PROCESS DATA IN CHEMICAL INDUSTRY?  
FROM SOURCE TO APPLICATIONS – AN OVERVIEW  

B. BALASKO, J. ABONYI 

University of Pannonia, Dept. of Process Engineering, H-8201, Veszprem, P.O.Box 158, HUNGARY 
 

It is globally accepted that information is a very powerful asset that can provide significant benefits and a competitive 
advantage to any organization, like production technologies in the chemical industry, which was driven by market forces, 
customer needs and perceptions, resulting in more and more complex multi-product manufacturing technologies. These 
technologies, due to their highly automated level, provide mountains of process data, which is applied only in daily operation 
and control, but it definitely can give access to the underlying structure of any system. To enhance this automation level 
while keep operation safe and efficient, one needs more information, i.e. knowledge about the process, which can be 
extracted from process data, and more tools, which can extract effectively this knowledge. To meet the growing expectations 
for future chemical engineering tasks, like multi-scale modelling, simulation and control or process and product design, 
advanced data analysis techniques can lead a way to solution. This paper briefly overviews some of the commercial products 
on market and the applicable data analysis techniques which guide process data from source to its application: from 
technology to expert knowledge with the help of knowledge discovery in databases (KDD) process. Numerous citations and 
their evaluation are given to show that data mining in chemical engineering can efficiently solve many data analysis related 
problems. 

Keywords: process data, data analysis, data mining, chemical engineering, review 

Introduction 

Chemical engineering is said to be a profession of 
applied natural science, but besides applying the 
common practice for design, maintenance and control of 
industrial processes, it always faces challenges to 
continuously improve these techniques, thus improve the 
efficiency, effectiveness and reliability of all the chemical 
engineering activities. Charpentier defines the future 
main objectives of chemical engineering in four areas: 
(1) total multi-scale control of the process to increase 
selectivity and productivity; (2) equipment design of 
based on scientific principles and new operation modes 
and methods of production: process intensification; (3) 
product design; (4) an implementation of the multi-scale 
and multidisciplinary computational chemical engineering 
modelling and simulation to real-life situations [1]. It is 
clear that in every area, process data plays an essential 
role to fulfil these high expectations, hence it needs to 
be well structured and reliable. 

The boom of the information systems in the past 
decades had its effect in every field of life, which is 
especially true for chemical industry, where a high level 
of automation and integration takes place. The high 
automation level provides the opportunity to collect more 
information (more variables) from the process and due 
to the integration of these components of the technology, 
the collected information in chemical industry can be 

larger than ever before. Additionally, due to the large 
developments in data storage capacities, the sampling 
frequency of the collected data has increased significantly 
as well. On the other hand, the availability of these 
modern data acquisition systems has increased as well: 
compared to a system 20-25 years ago, modern data 
acquisition systems cost 20 times less while running on 
higher performance level [2]. 

To serve this horizontal and vertical increase in data 
amount – it doubles every year – an exceptional 
hardware and software development takes place for a 
huge amount of application fields, and from being 
under-informed in the past we turned into over-informed: 
information mountains have arisen, but only ten percent 
of the enormous amount of collected and stored data is 
analyzed for further aims [3]. This means that there is a 
clear need for tools and applications that are able to 
handle all the relevant tasks regarding data produced in 
a process. 

This paper aims to review the available solutions in 
the areas of data acquisition and data analysis for the 
above mentioned problems, highlighting the importance 
of process data analysis in chemical engineering. The 
first half of the paper presents recent solutions to data 
acquisition in industrial environments while the second 
half provides the various ways how these data can be 
analyzed to achieve process-related knowledge and meet 
the continuous development requirements in chemical 
industry. 


76

Data acquisition and retrieval 

The two main weaknesses of data acquisition systems 
are not handling heterogeneity and data inaccessibility:  
1. Data from different sources and in different format 

cannot be handled in one environment, e.g. a priori 
knowledge, empirical or phenomenological knowledge 
cannot be incorporated into sampled data. Lots of 
research has been done on these problems: data 
compression and data integrity, the next section 
deals with several solutions to these problems.  

2. A mid-size chemical plant has about few thousand 
measured variables sampled from seconds to hours, 
a hundred manipulated variables to control a few 
critical product quality related variables, which 
results in terabytes of data every year. It would mean 
inefficiently large data storage capacity if one wants 
to analyze not only prompt but historical data.  

 
In this section solutions to these problems and already 

available commercial products on market are presented. 

Integrated information storage and query 

To solve the problem of heterogeneous data integrity 
several approaches have been developed. Complexity of 
integrating the information with their various describing 
models is not easy to handle, hence solution methods 
are different. Two main solution groups can be identified: 
where the integrality problem is solved at the query 
level or at the construction level of the integrated 
information system.  

Collins et al. developed an XML based environment 
[4], while Wehr suggests an object-oriented global 
federated layer above information sources [5]. In [6], 
Bergamaschi et al. presents an object-oriented language 
as well with an underlying description logic, which was 
introduced for information extraction from both 
structured and semi-structured data sources based on 
tool-supported techniques. Paton et al. developed a 
framework for the comparison of systems, which can 
exploit knowledge based techniques to assist with 
information integration [7].  

Another approach to handle the heterogeneity of 
information sources is the application of data warehouses 
(DWs) to construct an environment filled by consistent, 
pre-processed data [8]. The main advantage of a DW is 
that it can be easily adapted to a DCS and other process 
information sources of a process while it works 
independently. Table 1 shows a comparison of a DCS 
related database and a data warehouse [9]. 

Table 1: Main differences of a DCS related database 
and a data warehouse [9] 

 DCS related database Data warehouse

Function 
Day-to-day data 
storage for operation 
and control 

Decision 
supporting 

Data Actual Historical 
Usage Iterative Ad-hoc 
Unit of work General transactions Complex queries 

User Operator Plant manager, engineer, 
Design Application-oriented Subject-oriented 
Accessed 
records Decimal order Million order 

Size 100 MB-GB 100 GB-TB 
Degree Transactional time Inquiry time 
Region Unit, product line Product  
 

Obviously, beside database integration among 
particular parts of the whole process, there is a need for 
information integration in the level of the whole 
enterprise as well for the purpose of optimal operation 
and planning. This task cannot be fully automated, there 
is a need for permanently improved methods and 
approaches for creation, storage and dissemination of 
experience, know-how and judgment embedded in the 
organization [10]. 

Appropriate time-series representation for 
data compression 

Data compression is rather a contribution of the signal 
and image processing society where lossless information 
transmission is a key feature within limited time or 
bandwidth, in chemical engineering society data 
compression has beside storage capacity rationalization 
another important issue: retrieve the data in a manner 
that renders it easily interpretable for the execution of 
later engineering tasks. In this manner, data compression 
problem is turned into trend representation problem. Lin 
et al. gave a classification of process trend representation 
methods in [11], which can be seen in Fig. 1. Many of 
these representation techniques refer to segmentation of 
time series, which means finding time intervals where a 
trajectory of a state variable is homogeneous [12], 
representing data by its segments and storing only the 
segments instead of raw data.  
 

77

 
 Figure 1: Hierarchy of various time series representations for data mining [11]. 
 

Products on the market 

The modern distributed control systems (DCSs), which 
are widely implemented in modern, automated 
technologies have the direct access to the field instrument 
signals and measurements, while have data storage 
functions as well. Today several software products in 
the market provide the capability of integration of 
historical process data of DCS’s: e.g. Intellution  
I-historian [13], Siemens SIMATIC [14], the PlantWeb 
system of Fisher-Rosemount [15], Wonderware 
Factory- Suite 2000 MMI software package [16] or the 
Uniformance PHD modul (Process History Database) 
from Honeywell [17], which structural components are 
shown in Fig. 2. These elements are typical in modern 
data collection systems.  
 

Figure 2: Structure of the data flow in Honeywell 

Uniformance PHD software. 
 
There are two main operations:  

● Data collection: Data originates from real-time 
system and is collected by a real-time data 
interface (RDI). Tag parameters for all the 
variables are stored in a reference database. A tag 
contains all important information about a process 
variable (name, type, unit, etc.). RDI sends data 
to PHD server which places the collected data for 
a tag in the raw data queue and applies data 
processing, such as smoothing, compression, and 
so on, to move raw data queue entries to the data 
queue of the tag. Data queue of the tag then 
holds processed data that is ready for insertion 

into the active logical archive files using the 
continuous store thread. 

● Data retrieval: An application program makes a 
call to the PHD application programming interface 
(API) indicating the desired tag and time range 
for data. The PHD system checks the data queues 
to see if the data is still held in the queues, 
otherwise PHD accesses the data from the 
connected archive files. 

 
Data flow goes as follows: First, the tag names of 

the relevant process variables are selected from all the 
possible tags in the plant. Process data belonging to the 
selected tags are accessed in PHD by the Uniformance 
Desktop application program (by Honeywell). While the 
Uniformance runs as an MS Excel add-in, the results of 
data queries are saved in Excel files. 

Concluding, modern data acquisition systems need 
to be capable to handle diverse types of data in a way 
that data is applicable for further analysis. Rationally 
constructed data warehouses are needed for these 
purposes. Some of the above mentioned commercial 
historical data handling products assist DW maintenance 
interfaces as well, but in most cases there is no 
integrated software solution. Moreover, to get valuable 
knowledge that guides process development, appropriate 
information storage is not sufficient, process data 
analysis indispensable. The next section deals with this 
topic where a widely-applied procedure is presented. 

Information extraction from process data 

Knowledge Discovery in Databases (KDD)  

Integration of heterogeneous data sources is highly 
related to knowledge discovery and data mining [18, 19], 
All in all this is one of its main purposes: store data in 
such a logically constructed way that some deeper 
information and knowledge can be extracted through 
data analysis. Knowledge discovery in databases (KDD) 
is a well known iterative process in the literature, which 
involves several steps that interactively take the user 
along the path from data source to knowledge [20].  

 
78

 
Figure 3: Knowledge Discovery in Databases process (left) and the data-driven process development scheme (right). 

 
Fig. 3 shows the KDD process and its connection to 

the process development scheme: KDD can be considered 
as the analysis step of the process development process. 
This connection was published by many researchers 
who used the elements of KDD for solving several 
engineering tasks, like system identification, process 
monitoring and fault diagnosis, time-series analysis. In 
the following, we go through the steps of KDD 
highlighting the presence of “data mining in chemical 
engineering” (Note, that although data mining is a 
particular step of KDD, it is often associated to it as an 
independent technique). 
1. Data selection. Developing and understanding of the 

application domain and the relevant prior knowledge, 
and identifying the goal of the KDD process.  

2. Data pre-processing. This step deals with data 
filtering and data reconciliation. In process data 
warehouses and integrated KDD environments it is 
made preliminary during collection of relevant data. 

3. Data transformation. Finding useful features to 
represent the data depending the goal of the task. 
Dimensionality reduction or transformation methods 
are applied to reduce the effective number of 
variables under consideration or to find invariant 
representation of data.  

 
Data selection, pre-processing and transformation 

activities are often referred to as the data preparation 
step. It corresponds to the feature selection step of the 
pattern recognition process, which means to select a 
subset of original features that is good enough regarding 
its ability to describe the training data set and to predict 
for future cases. A wealth of approaches have been used 
to solve the feature selection problem, such as principal 
component analysis [21], Walsh analysis [22], neural 
networks [23], kernels [24], rough set theory [25, 26], 
neuro-fuzzy scheme [27], fuzzy clustering [28], self-
organizing maps [29], hill climbing [30], branch and 
bound algorithms [31], and stochastic algorithms like 
simulated annealing and genetic algorithms (GAs) [32-33]. 

Process data have several undesirable attributes 
which need to be handled before any analysis can take 
place: time-dependent, multi-scale, noisy, variant and 
incomplete. All these problems need to be solved in the 
data preparation steps, hence it takes the largest part, 
approx. 60 % of the efforts in the whole KDD process.  
For industrial data reconciliation, OSIsoft and Invensys 
have developed packages such as Sigmafine and 
DATACON [34, 35].  
4. Data mining. It is an information processing 

method, the extraction of interesting (non-trivial, 
implicit, previously unknown and potentially useful) 
information or patterns from data (corresponds to 
feature extraction in pattern recognition).  
a) The goals of data mining are achieved by various 
methods: 
● Clustering. Cluster is a group of objects that are 

more similar to one another than to members of 
other clusters. The term “similarity” should be 
understood as mathematical similarity, measured 
in some well-defined sense. In metric spaces, 
similarity is often defined by means of a distance 
norm, which can be measured among the data 
vectors themselves, or from a data vector to some 
prototypical object of the cluster. The prototypes 
are usually not known beforehand, and are sought 
by the clustering algorithms simultaneously with 
the partitioning of the data. The prototypes may 
be vectors of the same dimension as the data 
objects, but they can also be defined as “higher-
level” geometrical objects, such as linear or 
nonlinear subspaces or functions. Data objects 
belong to a cluster by their membership value, 
which is zero or one for hard clustering and 
between zero and one for fuzzy clustering 
techniques. Note, that in the case of fuzzy 
clustering the sum of the membership values 
equals one, i.e. a data object is more or less part 
of every cluster. On Fig. 4, clustering of data of a 
dynamic crystallizer cascade model (reconstructed 
in a 4-dimensional state space) projected by PCA 
is shown to analyze the cyclic operation [36]. 
 

79

 
Figure 4: Fuzzy clustering of crystallizer cascade model 
data. Data points are denoted by dots, cluster prototypes 

by stars, cluster membership value levels by lines 
(darker means lower).  

 
Clustering is widely used for feature selection 

[28], feature extraction method, which is applied 
in operating regime detection [36, 37], fault 
detection [38, 39] or system identification, like 
model order selection [40-42], state space 
reconstruction [43]. 

● Segmentation. Time series segmentation means 
finding time intervals where a trajectory of a state 
variable is homogeneous. In order to formalize 
this goal, a cost function with the internal 
homogeneity of individual segments is defined. 
This cost function can be any arbitrary function, 
usually it is defined by distances between the 
actual values of the time-series and the values 
given by a simple function (constant, linear or a 
polynomial function of a higher but limited 
degree) fitted to the data of each segment. Hence, 
the segmentation algorithms simultaneously 
determine the parameters of the describing models 
and the borders of the segments by minimizing 
the sum of the costs of the individual segments. 

The linear, steady-state or transient segments 
can be indicative for normal, transient or 
abnormal operation, hence segmentation based 
feature extraction is a widely known technique 
for fault diagnosis, anomaly detection and process 
monitoring or decision support [44-47].  

Fig. 5 shows a second-order segmentation of 
1-D polymerization data during a process transition. 
Second-order means, segment borders are captured 
where the first or second derivative of a trend 
changes sign, thus at extrema and inflexion points. 

● Classification. Map the data into labelled subsets, 
i.e. classes, which are characterized by their 
specific attribute called the class attribute. The 
goal is to induce a model that can be used to 
discriminate new data into classes according to 
class attributes. The induction is based on a 
labelled training set. The objective of the 
 

0 50 100 150 200 250 300 350 400 450 500
-0.02

-0.015

-0.01

-0.005

0

0.005

0.01

0.015

0.02

Data Points

Y

C
D

A
D

A
D

A
B

C
D

A

 
Figure 5: Second-order segmentation of filtered process 
transition data of polypropylene plant projected into 1-D 
by PCA.  Segment boundaries are noted as vertical lines 

at extrema or inflexion points. 
 

classification is to first analyze the training data 
and develop an accurate description or a model 
for each class using the attributes available in the 
data. Such class descriptions are then used to 
classify future independent test data or to develop 
a better description for each class. Many methods 
have been studied for classification, including 
decision tree induction, support vector machines, 
neural networks, and Bayesian networks [20]. In 
chemical engineering problems, classification is 
used in fault detection, anomaly detection problems 
[27, 45, 47-50]. On Fig. 6, a typical classification 
example is shown, where a decision tree was 
applied for the problem of the classification of 
operating regions related to the runaway of a 
chemical reactor. In [51] a new approach has 
been proposed, which is allows the transparent 
and interpretable representation of the boundaries 
of the operating regions. 

 
285 290 295 300 305 310 315 320
1.05

1.1

1.15

1.2

1.25

1.3

1.35

1.4

1.45

1.5

TW,in

p
G

,in

 
Figure 6: Classification example for the classification 
of operating conditions regarding the runnaway of a 

chemical reactor. The decision tree representation of the 
related classifier is shown in Fig. 8 


80

 
Figure 7: Cubic spline interpolation of a semi-

mechanistic model for online Melt Index prediction in a 
polyethylene process. 

 
● Regression. The purpose of regression problems 

is to give prediction for process or so called 
dependent variables based on the existing data 
(independent variable), in other words,  regression 
learns a function which maps a data item to a 
real-valued prediction variable and the discovers 
functional relationships between variables [52-53]. 
Uses of regression include curve fitting, prediction 
(forecasting), modelling of causal relationships and 
testing scientific hypotheses about relationships 
between variables. Applied mainly in system 
identification problems, e.g. [54]. In [55], cubic 
spline interpolation-regression is applied to estimate 
variable derivatives for a semi mechanistic neural 
network model (Fig. 7). 

b) Representation, i.e. output of data mining, of 
patterns of interest can be in form of several 
techniques as well: 
● Regression models. Model interpretation of a 

system’s behavior is possible by several techniques 
for numerous tasks. The extracted model structure 
can be various: from linear autoregressive models 
[53] to artificial neural networks [27, 45, 48], 
semi mechanistic models [37], self-organizing 
maps [29, 50], etc. On Fig. 8 component planes 
of a SOM model for Melt Index prediction of a 
polypropylene polymer grade are presented for 8 
independent variables of the technology [56].  

● Association rules. General form of association 
rules is an ‘IF X … THEN Y …’ (noted as 
X�Y) implication. The two parts of a rule are 
the antecedent (X) and the consequent (Y). The 
association rules are constructed from frequent 
item sets [57]. The occurrences of an item (or 
item sets) in a data set are called support, which 
value could be seen as a probability value: how 
many percent of the transactions is the specific 
item (are the items of an item set together). An 
item is called frequent item if its support is 
higher than a given (user defined) threshold, 
namely the minimal support. The support of a 

U-matrix

 
0.0289

3.65

7.28
Melt index

 
n 
-0.962

1.52

4.01
H2 conc(R201)

 
n 
-0.742

1.92

4.58
H2 conc(R202)

 
n 
-0.783

1.91

4.61

H2 inlet(R201)

 
n 
-0.962

1.62

4.21
H2 inlet(R202)

 
n 
-1.06

1.46

3.98
CAT inlet

 
n 
-2.49

-0.244

2
Temp.(R200)

 
n 
-1.45

-0.296

0.856

Temp.(R201)

 
n 
-4.88

0.129

5.13
Temp.(R202)

 
n 
-4.88

0.0809

5.04
Slurry Dens.(R201)

 
n 
-2.5

-0.619

1.27
Slurry Dens.(R202)

 
n 
-2.61

-0.701

1.21

 
Figure 8: Self-organizing map representation based 

regression of process variables for Melt Index 
prediction in a polypropylene plant. 

 
rule is equal to the support of the item sets 
contained in the rule. While support says only 
the probability of joint occurrence of X and Y, 
the confidence (conditional probability) of an 
X�Y rule serves information about relationships 
between the X and Y.  

Association rules are applied in the field of 
decision support, process monitoring, process 
control [58]. 

● Decision trees. Common representation for 
classification problems [47, 49]. The goal of tree 
induction method is to get an input attribute 
partitioning which warrants the accurate 
separation of the samples. A decision tree has 
two types of nodes (internal and terminal) and 
branches between the nodes. The possible 
outputs for an internal node (cut) are represented 
by the branches. The terminal nodes of the tree 
are called leaves where the class labels are 
represented. The paths from the root to the leaves 
(sequences of decisions, or cuts) represent the 
classification rules. Therefore, as data partition 
representation, it represents the data as a hyper-
rectangle. The most of the decision tree induction 
algorithms (e.g. ID3, C4.5) are based on the 
divide and conquer strategy. In every iteration 
steps the cut which serves topically the highest 
information gain (greedy algorithms) is realized. 

In Fig. 9. a decision tree is presented for 
reactor runaway detection of a fixed bed tube 
reactor [59]. There are two class attributes: class 
attribute 1 and 2 refers to reactor conditions 
where reactor runaway takes (1) and takes not 
place (2). Decision variables are: cooling water 
inlet temperature (TW,in), reactor mixture inlet 
temperature (TW,in), inlet pressure of reactor 
mixture (pG,in) and mass feed flow of reactants 
(BG,incA

G,in and BG,incB
G,in). 

5. Interpretation of mined patterns, i.e. discovered 
knowledge about the system or process. The 
interpretation depends on the chosen data mining 
representation.  

 
81

 
Figure 9: Example of decision tree representation of a 

two-class problem for classification of reactor 
conditions regarding if there is (1) or is not reactor 
runaway (2). TW,in: cooling water inlet temperature; 
TG,in: reactor mixture inlet temperature; pG,in: inlet 

pressure of reactor mixture; BG,incA
G,in and BG,incB

G,in: 
mass feed flow of reactants. 

 
For visualization of the mined patterns, Exploratory 

Data Analysis (EDA) has been developed. Although it 
is often stated as an independent analysis technique, it 
can be considered as a special application of the KDD 
process, where the knowledge is presented by the 
information embedded into several types of visualization 
tools. It focuses on a variety of mostly graphical 
techniques to maximize insight into a data set. 

The seminal work in EDA is written by Tukey [60]. 
Over the years it has benefited from other noteworthy 
publications such as Data Analysis and Regression by 
Mosteller and Tukey [61], and the book of Velleman 
and Hoaglin [62]. 

Data preprocessing step in EDA refers to several 
projection methods in order to be able to visualize high 
dimensional data as well: techniques of principal 
component analysis (PCA) [63], Sammon-mapping [64], 
Projection to latent structure (PLS) [65], Multidimensional 
Scaling (MDS) [66] or Self-Organizing Map (SOM) [67] 
are applied. Data mining methods also use these 
techniques, but in EDA, projection is used for 
visualization purpose hence in most cases into two or 
three dimensions. 

The graphical techniques of EDA have a wide 
spectrum including plots of raw data (histograms, 
probability plots, block plots), basic statistics (median, 
quantile plot, quantile-quantile plot, box plot) or 

advanced multidimensional plots (scatterplot matrices, 
radar plots, bubble charts, coded maps, etc.). In Fig. 10, 
Fig. 11 and Fig. 12, some examples are presented. 

 
Figure 10: Example of a process variable (reactor 

temperature) and its cumulated distribution function 
(q0.25, q0.50, q0.75 refer to quantiles) plotted by MATLAB 

 
Figure 11: Box-plot of variable on Fig. 5 plotted by 

MATLAB, i.e. 5-number-summary from Tukey: 
minimum, maximum, median (q0.50),  

1st and 3rd quartile(q0.25 and q0.75) 
 

Mg

SrMn

Ba

T i

Zr Zn

V

 
Figure 12: Star plot of a South African clinker  

(code number 159SA17). The standard on the right side 
can be used as comparison [68] 

 
The most common software for EDA is MS Excel 

with free add-ins, but there are several products on the 
market as well: IBM’s DB2 Intelligent Miner (which is 
no longer supported), Mathworks’s MATLAB Statistics 
Toolbox [69] and the open-source WEKA developed by 
Waikato University [70]. 

Note, that most EDA techniques are only a guide to 
the expert to understand the underlying structure in the 
data in a visual form. Hence their main application is 
process monitoring [71, 72], but these tools are already 
used for system identification [73], ensuring consistent 
production [74] and product design as well [75]. 


82

Conclusions 

Chemical industry is a highly automated industry, which 
produces a huge amount of production related data in 
every minute, which obviously has the potential to mine 
useful information and knowledge about the whole 
process. This paper reviewed how process data is stored 
and what types of scientific approaches are developed to 
guide this knowledge discovery.  

The brief description of KDD and EDA techniques 
is presented, emphasizing their high correlation to 
chemical engineering tasks. From all the results in these 
scientific areas, one can conclude, that process data 
analysis has high contribution to the solution of 
problems that chemical engineers will face in the near 
future: optimal multi-scale control, process and product 
intensification, modeling and simulation of complex 
systems.  

KDD gives users tools to shift through vast data 
stores to learn and recognize patterns, make classifications, 
verify hypotheses, and detect anomalies. These findings 
can highlight previously undetected correlations, influence 
strategic decision-making, and identify new hypotheses 
that warrant further investigation.  

As it can be seen from the numerous citations, 
solutions based on the KDD process were proven to be 
extremely useful in solving chemical engineering tasks 
as well and showed that instead of simple queries of 
data, potential profit – through knowledge – can be 
mined by data analysis. The mined and discovered 
knowledge about the system or process is fed back to 
the beginning of the process to help continuous 
development (see Fig. 3).  

 
REFERENCES 

1. CHARPENTIER J. C.: Four main objectives for the 
future of chemical and process engineering mainly 
concerned by the science and technologies of new 
materials production, Chemical Engineering Journal 
107, 3-17, 2005  

2. AUSTERLITZ H.: Data acquisition techniques using 
PCs, Second edition, 2003 

3. FAYYAD U., SIMOUDIS E.: Data mining and 
knowledge discovery. Tutorial Notes at PADD ’97 
– 1st Int. Conf. Prac. App. KDD & Data Mining, 
London. 

4. COLLINS S. R., NAVATHE S., MARK L.: XML shema 
mappings for heterogeneous database access. 
Information and Software Technology, 44, 251-257, 
2002. 

5. WEHR H.: Integrating heterogeneous data sources 
into federated information systems. Proceedings of 
the 4th European GCSE Young Researchers Work- 
shop, pages 1-11, October 2002. IESE-Report No. 
053.02/E by Fraunhofer IESE. 

6. BERGAMASCHI S., CASTANO S., VINCINI M., 
BENEVENTANO D.: Semantic integration of 

heterogeneous information sources. Data and 
Knowledge Engineering, 36, 215-249, 2001. 

7. PATON N. W., GOBLE C. A., BECHHOFER S.: 
Knowledge based information integration systems. 
Information and Software Technology, 42, 299-312, 
2000. 

8. INMON W. H.: Building the Data Warehouse. John 
Wiley and Sons Inc., 3rd edition, 2002. 

9. PACH F. P., BALASKO B., NEMETH S., ARVA P., 
ABONYI J.: Black-Box and First-Principle Model 
Based Optimization of Operating Technologies. In 
Proceedings of 5th MATHMOD, Vienna, 2006.  

10. ZAHAYA D. GRIFFIN A., FREDERICKS E.: Sources, 
uses, and forms of data in the new product 
development process. Industrial Marketing 
Management, 33, 657-666, 2004. 

11. LIN J., KEOGH E., LONARDI S., CHIU B.: A Symbolic 
Representation of Time Series, with Implications 
for Streaming Algorithms. In proceedings of the 8th 
ACM SIGMOD Workshop on Research Issues in 
Data Mining and Knowledge Discovery. 2003. 

12. KEOGH E., CHU S., HART D., PAZZANI M.: An 
Online Algorithm for Segmenting Time Series, 
IEEE International Conference on Data Mining, 
2001. 

13. CAPOCACCIA G.: Intellution production is the heart 
of manufacturing ebusiness, I-historian. Distributed 
Control Systems 7th Meeting, Miskolc, Hungary, 
2001. 

14. SEIDL U., SIMATIC Pcs 7: Efficient integration for 
tomorrow’s DCS applications. Distributed Control 
Systems 5th Meeting, Miskolc, Hungary, 1999. 

15. FÜLE S.: Integration of distributed - and enterprise 
control systems. Distributed Control Systems 5th 
Meeting, Miskolc, Hungary, 1999. 

16. AJTONYI I., BALLAGI, A.: Integration of DCS in the 
complex producing system with wonderware 
factorysuite 2000 mmi software package. Distributed 
Control Systems 7th Meeting, Miskolc, Hungary, 
2001. 

17. GRINER, S.: New Rules of data management, 
InTech Magazin, Februar 2004. 

18. GIANNADAKIS N., ROWE A., GHANEM M., GUO Y.-
K.: Infogrid: providing information integration for 
knowledge discovery. Information Sciences, 155, 
199-226, 2003. 

19. SCOTNEY B., MCCLEAN S.: Efficient knowledge 
discovery through the integration of heterogeneous 
data. Information and Software Technology, 41, 
569-578, 1999. 

20. FAYYAD U., PIATESTKU-SHAPIO G., SMYTH P.: 
Knowledge discovery and data mining: Towards a 
unifying framework, Advances in Knowledge 
Discovery and Data Mining, AAAI/MIT Press, 
1994. 

21. MALHI A., GAO R. X.: PCA-based feature selection 
scheme for machine defect classification. IEEE 
Transactions on Instrumentation and Measurement, 
53(6), 1517-1525, 2004. 


83

22. SALCEDO-SANZ S., CAMPS-VALLS G., PEREZ-CRUZ 
F., SEPULVEDA-SANCHIS J., BOUSONO-CALZON C.: 
Enhancing genetic feature selection through 
restricted search and Walsh analysis. IEEE 
Transactions on Systems, Man, and Cybernetics-
Part C: Applications and Reviews, 34(4), 398-406, 
2004. 

23. VERIKAS A., BACAUSKIENE M.: Feature selection 
with neural networks. Pattern Recognition Letters, 
23(11), 1323-1335, 2002. 

24. SHIMA K., TODORIKI M., SUZUKI A.: SVM-based 
feature selection of latent semantic features. Pattern 
Recognition Letters, 25(9), 1051-1057, 2004. 

25. JENSEN R., SHEN Q.: Semantics-preserving 
dimensionality reduction: rough and fuzzy-rough-
based approaches. IEEE Transactions on 
Knowledge and Data Engineering, 16(12), 1457-
1471, 2004. 

26. SWINIARSKI R. W., SKOWRON A.: Rough set methods 
in feature selection and recognition. Pattern 
Recognition Letters, 24(6), 833-849, 2003. 

27. CHAKRABORTY D., PAL N. R.: A neuro-fuzzy scheme 
for simultaneous feature selection and fuzzy rule-
based classification. IEEE Transactions on Neural 
Networks , 15(1), 110-123, 2004. 

28. MARCELLON F.: Feature selection based on a 
modified fuzzy C-means algorithm with supervision. 
Information Sciences, 151(5), 201-226, 2003. 

29. YE H. L., LIU H. C.: A SOM-based method for 
feature selection. Proceedings of the 9th 
International Conference on Neural Information 
Processing, IEEE, 1295−1299, 2002. 

30. FARMER M. E., BAPNA S., JAIN A. K.: Large scale 
feature selection using modified random mutation 
hill climbing. Proceedings of the 17th International 
Conference on Pattern Recognition. IEEE, 287-290, 
2004. 

31. SOMOL P., PUDIL P., KITTLER J.: Fast branch and 
bound algorithms for optimal feature selection. 
IEEE Transactions on Pattern Analysis and 
Machine Intelligence, 26(7), 900-912, 2004. 

32. BHANU B., LIN Y.: Genetic algorithm based feature 
selection for target detection in SAR images. Image 
and Vision Computing, 21(7), 591-608, 2003. 

33. OH I. S., LEE J. S., MOON B. R.: Hybrid genetic 
algorithms for feature selection. IEEE Transactions 
on Pattern Analysis and Machine Intelligence, 
26(11), 1424-1437, 2004. 

34. OSIsoft Inc., SigmafineTM, online access, URL: 
http://techsupport.osisoft.com/Products/Layered+ 
Products/Sigmafine/Sigmafine+Overview.htm 

35. Simsci Esscor, DATACON, online access, URL: 
http://www.simsci-esscor.com/us/eng/products/ 
productlist/datacon/DATACON.htm 

36. FEIL B., BALASKO B., ABONYI J.: Visualization of 
fuzzy clusters by fuzzy Sammon mapping 
projection – application to the analysis of phase 
space trajectories. Soft Computing, 11, 478-488, 
2007. 

37. ABONYI J., NEMETH S., VINCZE CS., ARVA P.: 
Process analysis and product quality estimation by 
Self-Organizing Maps with an application to 
polyethylene production, Computers in Industry, 
52(3), 221-234, 2003, 

38. ZOGG D., SHAFAI E., GEERING H. P.: Fault diagnosis 
for heat pumps with parameter identification and 
clustering. Control Engineering Practice, 14, 1435-
1444, 2006. 

39. PICIARELLI C., FORESTI G. L.: On-line trajectory 
clustering for anomalous events detection. Pattern 
Recognition Letters, 27, 1835-1842, 2006. 

40. GARCIA C., BERNI C., NERI DE OLIVEIRA C. E.: 
Hardware/firmware implementation of a soft sensor 
using an improved version of a fuzzy identification 
algorithm. ISA Transactions, 47, 157-170 , 2008. 

41. KILIC K., UNCU O., BURHAN TÜRKSEN I.: 
Comparison of different strategies of utilizing fuzzy 
clustering in structure identification. Information 
Sciences, 177, 5153-5162, 2007. 

42. AZEEM M. F., AHMAD N., HANMANDLU M.: Fuzzy 
modeling of fluidized catalytic cracking unit. 
Applied Soft Computing, 7, 298-324, 2007. 

43. LAN L. W., SHEU J. B., HUANG Y. S.: Investigation 
of temporal freeway traffic patterns in reconstructed 
state spaces. Transportation Research, Part C, 16, 
116-136, 2008. 

44. VENKATASUBRAMANIAN V.: A Syntactic Pattern-
recognition Approach for Process Monitoring and 
Fault Diagnosis. Engineering Applications of 
Artificial Intelligence, 8(1), 35-51, 1995. 

45. WONG J. C., MCDONALD K. A., PALAZOGLU A.: 
Classification of process trends based on fuzzified 
symbolic representation and hidden Markov 
models. Journal of Process Control, 8(5-6), 395-
408, 1998. 

46. SUNDARRAMAN A., SRINIVASAN R.: Monitoring 
transitions in chemical plants using enhanced trend 
analysis. Computers and Chemical Engineering, 27, 
1455-1472, 2003. 

47. CHARBONNIER S., GARCIA-BELTAN C., CADET C., 
GENTIL S.: Trends extraction and analysis for 
complex system monitoring and decision support. 
Engineering Applications of Artificial Intelligence, 
18, 21-36, 2005. 

48. ZHOU Y., HAHN J., MANNAN M. S.: Fault detection 
and classification in chemical processes based on 
neural networks with feature extraction. ISA 
Transactions, 42, 651-664, 2003. 

49. ZHOU Y., HAHN J., MANNAN M. S.: Process 
monitoring based on classification tree and 
discriminant analysis. Reliability Engineering and 
System Safety, 91, 546-555, 2006. 

50. YAN X., CHEN D., CHEN Y., HU S.: SOM integrated 
with CCA for the feature map and classification of 
complex chemical patterns. Computers and Chemistry, 
25, 597-605, 2001. 

51. VARGA T., ABONYI J., SZEIFERT F.: Applying decision 
trees to investigate the operating regimes of a 


84

production process, Acta Agraria Kaposváriensis, 
(in press), 2008 

52. POLLOCK D. S. G.: Classical Regression Analysis. 
Handbook of Time Series Analysis, Signal 
Processing, and Dynamics, 201-225, 1999. 

53. RUSLING J. F., KUMOSINSKI T. F.: Analyzing Data 
with Regression Analysis. Nonlinear Computer 
Modeling of Chemical and Biochemical Data, 7-31, 
1996 

54. DAYAL B. S., MACGREGOR J. F.: Multi-output 
process identification, Journal of Process Control, 
7(4), 269-282, 1997. 

55. FEIL B., ABONYI J., PACH P. F., NEMETH S., ARVA 
P., NEMETH G., NAGY G.: Semi-mechanistic Models 
for State-Estimation - Soft Sensor for Polymer Melt 
Index Prediction. Lecture Notes in Computer 
Science, 3070 (2004) 1111-1117. 

56. BALASKO B., NEMETH S., NAGY G., ABONYI J.: 
Application of integrated process and control 
system model for simulation and improvement of an 
operating technology. In Proceedings of European 
Congress of Chemical Engineering (ECCE-6), 
Copenhagen, 2007. 

57. AGRAWAL R., IMIELINSKI T., SWAMI A.: Mining 
association rules between sets of items in large 
databases, Proceedings of the ACM SIGMOD 
International Conference on Management of Data, 
207-216, 2003. 

58. LIN C. T., LEE C. S. G.: Neural-network-based fuzzy 
logic control and decision system, IEEE Transactions 
on Computers, 40(12), 1320-1336, 1991. 

59. VARGA T., SZEIFERT F., RÉTI J., ABONYI J.: Analysis 
of the Runaway in an Industrial Heterocatalytic 
Reactor, Computer Aided Chemical Engineering, 
24, 751-756, 2006. 

60. TUKEY J.: Exploratory Data Analysis. Addison-
Wesley, 1977.  

61. MOSTELLER F., TUKEY J.: Data Analysis and 
Regression. Addison-Wesley, 1977. 

62. VELLEMAN P., HOAGLIN D.: The ABC’s of EDA: 
Applications, Basics, and Computing of 
Exploratory Data Analysis. Duxbury, 1981. 

63. SMITH L. I.: A tutorial on Principal Component 
Analysis. 2002. 

64. SAMMON J. W.: A Non-Linear Mapping for Data 
Structure Analysis, IEEE Trans. on Computers, C-
18(5), 1969. 

65. ZHAO S. J., XU Y. M., ZHANG J.: A Novel Nonlinear 
Projection to Latent Structures Algorithm, Advances 
in Neural Networks - ISNN Chapter 11, Springer 
Berlin / Heidelberg, 2004. 

66. COX M. F., COX M. A. A.: Multidimensional 
Scaling, Chapman and Hall, 2001. 

67. VESANTO J., HIMBERG J., ALHONIEMI E., 
PARHANKANGAS J.: Self-organizing map in 
MATLAB: the SOM toolbox, Proceedings of the 
Matlab DSP Conference, Espoo, Finland, 35-40, 
1999. 

68. TAMAS F. D., ABONYI J.: Trace Elements in clinker I. 
– A graphical representation, Cement and Concrete 
Research, 32(8), 1319-1323, 2002 

69. The Mathworks Inc., Statistics ToolboxTM, URL: 
http://www.mathworks.com/products/statistics/ 

70. WITTEN I. H., FRANK E.: Data Mining: Practical 
machine learning tools and techniques, 2nd Edition, 
Morgan Kaufmann, San Francisco, 2005. 

71. WANG D., ROMAGNOLI J. A.: Robust multi-scale 
principal components analysis with applications to 
process monitoring. Journal of Process Control, 
15(8), 869-882, 2005. 

72. URAIKUL V., CHAN C. W., TONTIWACH-WUTHIKUL 
P.: Artificial intelligence for monitoring and 
supervisory control of process systems. Engineering 
Applications of Artificial Intelligence, 20, 115-131, 
2007. 

73. MACGREGOR J. F., KOURTI T.: Statistical process 
control of multivariate processes. Control Eng. 
Practice, 3(3), 403-414, 1995. 

74. MARTIN E. B., MORRIS A. J., PAPAZOGLOU M. C., 
KIPARISSIDES C.: Batch process monitoring for 
consistent production. Computers and Chemical En- 
gineering, 20, 599-605, 1996. 

75. LAKSHMINARAYANAN S., FUJII H., GROSMAN B., 
DASSAU E., LEWIN D. R.: New product design via 
analysis of historical databases. Computers and 
Chemical Engineering, 24:671-676, 2000.