CET Volume 86


                                                                      DOI: 10.3303/CET2186153 

 
Paper Received: 26 August 2020; Revised: 6 February 2021; Accepted: 3 May 2021 
Please cite this article as: Ferranti F., Vingerhoets G., Manenti F., Vallerio M., 2021, Data-driven Digital Twin of a Chemical Productionsite for 
Production and Utilities Planning, Chemical Engineering Transactions, 86, 913-918  DOI:10.3303/CET2186153 

CHEMICAL ENGINEERING TRANSACTIONS 

VOL. 86, 2021 

A publication of 

The Italian Association 
of Chemical Engineering 
Online at www.cetjournal.it 

Guest Editors: Sauro Pierucci, Jiří Jaromír Klemeš
Copyright © 2021, AIDIC Servizi S.r.l. 
ISBN 978-88-95608-84-6; ISSN 2283-9216 

Data-Driven Digital Twin of a Chemical Production site For 
Production and Utilities Planning 

Filippo Ferrantia, Geoff Vingerhoetsb, Flavio Manentia, Mattia Valleriob,* 
a
Dipartimento di Ingegneria Chimica e Materiali, Politecnico di Milano 

b
Data Analytics, BASF Antwerpen  

mattia.vallerio@basf.com 

The BASF Antwerp site is one of the most advanced and integrated chemical production sites of the BASF 
group and worldwide. The chemical plants on site are heavily interconnected, i.e., the product of one plant is 
the raw material used in the next one and some of the plants use the steam produced from neighboring ones. 
This high degree of interconnection makes it quite difficult to assess the feasibility let alone the economic 
optimality of production and maintenance plans for the entire site. This research explores the use of a data-
driven digital twin to simulate and assess these plans. Two value chains, i.e., a collection of plants converting 
raw material to valuable end products have been selected as a proof-of-concept for the entire site. Each plant 
has been modeled by utilizing simple or multiple regression. Each regression model correlates the final 
product of a plant with the needed raw materials or utilities (e.g., steam or electricity). All regression models 
have been found using the software JMP. Additionally, all relevant tanks used to stock raw materials, 
intermediates and final products have been modelled. This allowed for the visualization and troubleshooting of 
a particular component excess or shortage. The resulting system of 76 variables has been solved in MATLAB 
in a multi-period fashion, where a period represents a day. The simulation has been first performed on the 
training data set, and then on a validation period to verify the models’ performance. 

1. Introduction

In the last decades, the chemical industry has faced many challenges. The increase in competitiveness and 
the research for sustainability pushed many companies to optimize production and reduce emissions. One 
way to reach this goal is sharing facilities and creating interconnected processes. This aggregation also 
increases the complexity of the site and a greater effort is required to schedule and plan the production. This 
paper focuses on building a digital twin for two value chains located at the BASF site in Antwerp. A chemical 
value chain can be defined as a group of chemical plants, where the finished product of one serve as raw 
material for the next. From an industrial point of view, the main objective of this work is the creation of a tool to 
assess the effects of planning decisions on the whole site and therefore increasing operations’ transparency. 
This will enable the plant and site management to make better decisions in different scenarios. From an 
academic perspective the purpose is to verify if it is possible to perform this task using only simple regression 
models. The aim is to simulate only stationary operations on a long-term period, excluding dynamic and non-
operating periods. Several methods exist to create a model or a digital twin of an existing system. The main 
differentiation worth making is between physics-based and data-driven approaches (Wright and Davidson, 
2020). A physics-based approach is achieved through the derivation of models based on all the physic-
chemical phenomena occurring in the analyzed system, while a data-driven one aims at deriving the 
governing phenomena by studying the correlation observed on the available historic measurements. The first 
approaches are normally characterized by high complexity and computational effort while data-driven methods 
usually lump many parameters making the models simpler and limiting the amount of necessary information. 
The main drawbacks of these latter approaches are mainly that: (i) results might be difficult to interpret and (ii) 
their validity is limited to the input data range. The use of a data-driven digital twin for planning and production 
simulation in the chemical sector is not new.  

913


A similar concept applied to batch plants can be found in Fumero et al. (2016), where the consumption of raw 
material is determined through fixed conversion. In our case, the resulting models render a variable 
consumption according to the, e.g., different production loads or outside temperature. A data-driven approach 
has also been investigated by Muteky and MacGregor (2008), where models were made for the purchasing of 
coals for coke making in the steel industry. Finally, in Li et al. (2016) a data-driven model approach is used for 
part of the units of a highly integrated refinery-petrochemical complex to reduce the computational effort. 
Similarly, this work aims at finding models correlating the raw material and energy flows entering a plant with 
the product and energy flows leaving it. The collection of all resulting models will give a complete digital twin of 
the two chemical value chains considered as use case. This article is structured as follows: section 2 
discusses the relevant state of the art, section 3 introduces the use case, section 4 presents the results and 
finally section 5 discusses the conclusions. 

2. State of the art

For this work only single and multiple regression models have been considered. The main reason is to 
evaluate the accuracy that could be achieved with the simplest model. The resulting model structure can be 
described as follows: = ( , ) +  (1) 
where y is a p-by-1 vector of measurements of the dependent variable, f is any function of X and β; X is a p-
by-n matrix of predictors, with one row for each measurement, and one column for each predictor; β is an n-
by-1 vector of unknown parameters that have to be estimated; s is a p-by-1 vector of identically distributed 
independent random disturbances. To obtain these equations the least-squares method has been used in 
JMP. In its most simple form, it consists in the estimation of the parameters through the minimization of: 

with respect to the parameters (Elster et al., 2015). The quality of the obtained models has been assessed 
based on two parameters: the root mean square error (RMSE) and . Where, the RMSE represents the 
average error between the model prediction and the actual values, while  is the coefficient of determination 
which indicates the proportion of the variance in the dependent variable y that is predictable from the 
independent variable(s) xi. Additionally, a detailed analysis of the residuals was also performed. Residuals are 
defined as the difference between the real values and the predicted ones. These should be normally 
distributed with zero mean. For more details on how these two terms are calculated the reader is referred to 
Steel R. and Torrie J. (1960) or any other book on statistics or regression. 

3. Use case introduction

Due to confidentiality reasons, the names of plant and chemicals have been anonymized and all data and 
results have been normalized. The normalization has been realized by dividing all the hourly data by the 
highest value of the parameter. The results shown here are the summation of the normalized results into a 
daily value. The case study selected for this work is based on two production value chains of the BASF site in 
Antwerp. Figure 1 shows a graphical representation of the studied system. In particular: 

• in yellow all plants related to the production value chain of Product 1;
• In green all plants related to the production value chain of Product 2;
• Plant A and C are a cluster processing the raw materials arriving on the site to obtain Product 1 and 2.

The two value chains are made of 10 plants indicated with a letter from A to L. The raw materials entering the 
value chain are called “Reagent X”, with a number between1 and 6, while the final products and the 
intermediates are called "Product X", where X is a number between 1 and 23. Tank farms are present 
between plants to collect and store the chemicals on site. The tanks have also been numbered from 1 to 24. 
The last part of the digital twin consists of the utilities net that serves every plant. As it is possible to notice, all 
plants are interlinked between them, with many products from a plant serving as reagents for the next one. 

( − ( ))  (2) 

914


Figure 1: High-level representation of the value chains 

3.1 Data 

The data considered for this work has been divided into two groups: the training data goes from 1/11/2017 to 
19/12/2019, while the validation data goes from 20/12/2017 to 12/04/2020. The collected observations consist 
of hourly averages. The focus of this work is to predict the consumption of raw materials and utilities while 
plants are in stationary operating conditions. Therefore, only data that match this description has been 
retained for analysis and validation, the remaining part was excluded in subsequent steps by removing: (i) the 
data related to plant turnarounds and shut-downs, (ii) data related to dynamic and transition periods, i.e., load 
changes, production ramp up and down and (iii) uni-variate and multivariate outliers. 
The data related to the dynamic periods was identified by using a stability factor µ, defined as follows: μ = ( ( , ), ( , − 1),…, ( , − 12)) (3)
where std(M(i,j)) represents the standard deviation of material i at time j. It is possible to exclude the non-
stationary modes in a systematic way by selecting a limit, based on the sensor accuracy, for the stability 
coefficient. The intervals showing a value lower than the defined limit are kept and the rest is excluded. 

3.2 Data-driven modelling 

All models have been built with the statistical software package JMP. For the sake of brevity, only one model 
will be discussed in detail (see Eq (4)). The equation describes the consumption of Reagent 1 in Plant A as a 
function of Product 3, 4 and 5 (see Figure 1). Figure 2a shows the data used for the model, in black, and the 
excluded one corresponding to failures and a major turnaround in red. 1 = −0.0965	 + 	1.235 ∗ 	P3	 +	−0.096875 ∗ P4 + 0.11027 ∗ P5 (4) 

Figure 2: (a) Consumption of reagent 1 in the training period, (b) Correlation between Reagent 1 and the 
finished goods presented in the equation 4, (c) Distribution of the errors of the model 

Figure 2b visualizes the actual values (black points) vs the predicted ones (red line) obtained through Eq (4). 
The resulting statistical parameters for this regression are: 

= 0.813       = 6.24	% 

915


Moreover, the residuals are normally distributed, as shown in Figure 2c, where the green line represents a 
normal distribution with the same mean and standard deviation of the sample of errors. The same approach 
has been used for all models.      

3.3 Tank farm 

The site is coping with enormous quantities of chemicals and tanks are needed to store materials before 
selling or using them in other plants. For that reason, every raw material and finished good relies on a certain 
number of tanks. The level in each tank is calculated as follows: 	( , ) = − + ( , − 1) (5) 
where ( , ) and ( , − 1) represent the tank level of Tank i at the current and previous period, ∑  
and ∑  are the sum of all materials entering and exiting the tank, respectively. 
3.4 Simulation platform 

A multi-period simulation has been set up in MATLAB by dividing the entire time interval on a daily basis. The 
core of the simulation is the solution of the nonlinear algebraic system consisting of all the models and the 
material balances of the supply chain. In Table 1 the number of variables per simulation is shown. The 
simulation has been repeated for every day of the training and validation data-set. 

Table 1: Number of variables and number of simulations per data-set 

Training data-set Validation data-set 

Variables number 76 76 

Simulation days 760 115 

The system of algebraic nonlinear equations is solved by minimizing the sum of squares of the components. 
The system is considered solved when the sum of squares approximates the value of zero. The method used 
by the solver is the Trust-Region-Dogleg Algorithm (see Conn et al., 2000, Nocedal and Wright, 2006 and 
Powell, 1968). Being the algebraic system nonlinear, it is necessary to provide an initial guess to the solver. 
This is randomly selected within the variables’ defined boundaries. 

4. Results

 Two different simulations have been done to analyze the quality of the models and the whole simulation 
environment: (i) first the training data set and the results have been compared to the historical training data; 
(ii) then on the validation data-set. The results evaluation has been performed based on the following five 
criteria: (i) residuals normally distributed, (ii) residuals variance as small as possible, (iii) residuals mean as 
close as possible to zero, (iv) RMSE as small as possible and (v) no prolonged deviations from historical data. 
Training and validation results should respect all the criteria expressed before. If all the criteria are meet, the 
parameters of the results should be compared between them to see if there are significant differences. If that 
is the case, it will be necessary to understand the reasons behind it. The possible causes are: (i) an error in 
the sensor observations, or a (ii) a change in the studied process. 

4.1 Raw materials 

Looking at the two graphs in Figure 3a and 3b, it is immediately evident that the model is quite accurate. The 
distance between predicted and real values remains almost constant for the validation data set as well, 
confirming the validity of the model. This is also confirmed by the statistical reported in Table 2. These 
parameters are expressed as a percent of the average of the historical data for that same period. Their 
similarity shows that it is possible to use this model also in future predictions with similar results.  

Table 2: Reagent 1 errors for training and validation data-sets expressed in % of average value 

Training data-set Validation data-set 

RMSE 5.72 7.09 

Residuals mean -0.05 -1.71 

Residuals std 5.73 6.92 

916


Figure 3: Comparison between simulation and training data (a) and validation data (b)for reagent 1 

4.2 Utilities 

For the utilities, let us consider the 16 bar steam consumption in plant B. The main reaction in this process is 
endothermic, while the side reactions are exothermic. In addition, the selectivity towards the main reaction it is 
inversely proportional to the catalyst age. Therefore, the steam is mainly consumed in the first part of the 
catalyst life while it is produced in the second part. Figure 4a shows a decreasing trend until May 2019, when 
the catalyst was changed. After that there is a steep increase, followed by another decreasing trend. Despite 
the complexity, the model is able to predict the whole period (see Figure 4a,b). The model equation is: = 1.445 − 0.0000152 ∗ − 0.8415 ∗ 24 − 0.655 ∗ 11 − 0.2634 ∗ 12 + 0.003506 ∗  (6) 
where P24, P11 and P12 are the flows of Product 24, 11 and 12 entering plant B and OT is the outside 
temperature. SB is the 16 bar steam consumed, while CLT represents the catalyst lifetime, defined as the 
cumulative sum of Product 1 up till that day. The steam consumption peaks in Figure 4a are due unexpected 
stops of the plant, during which an excess of steam is introduced to guarantee safety. Figure 4b shows a 
constant difference between the real steam consumption and the simulation results for the validation period. 
The model based on data from the training period is not able to predict the data in the validation period. This 
bias mainly derives from a catalyst change and from a lower production load in the first months of 2020. An 
overview of the statistical parameters per period is reported here below in Table 3: 

Table 3: Statistical parameters for the 16 bar steam model of plant B for the training and validation data sets 

Training data-set Validation data-set 

RMSE 32.41 66.51 

Residuals mean -2.03 62.30 

Residuals std 32.38 23.38 

The parameters are expressed as percentage of the mean value for the respective period. The deviation 
between the model and the actual values in the validation period appears significant. The best approach to 
correct this would be to retrain the model by including the validation period. 

Figure 4: Comparison between simulation and training (a) and validation data (b) for steam used in plant B. 

4.3 Tanks 

Figure 5 shows the volume of resources stored in any given period throughout the simulation for three tanks. 
Tank 5 shows the accumulation of Product 2. The stored amount is continuously increasing except for the 
period when plant A undergoes a shutdown. In the shutdown interval the level of the tank decreases. The 
volume in Tank 8, which is storing Product 8, is instead mainly constant for the entire period. Note that only 
simulation results are showed since not all tanks are present on site. 

917


Figure 5: Level of the tanks number 5, 8 and 11 in normalized unit 

5. Conclusions

This work introduces a data-driven digital twin of two production value chains. The goal of this work was dual: 
(i) from an industrial point of view, the main objective was to create a simulation tool for the production 
planning of a complex interconnected chemical site and (ii) from an academic perspective it was to verify the 
possibility to use a combination of simple regression models to achieve the industrial goal. The results showed 
in section 4 support the latter assumption. The created digital twin can be used to simulate and do a scenario-
based analysis of the production planning for a complex chemical production site. The main challenge related 
to its implementation arises from the monitoring of the digital-twin accuracy and its maintenance. While the 
monitoring step can be automated by implementing statistical process control, the retraining steps will require 
human activity to assess the overall quality of the model. It is worth mentioning that this approach focuses on 
long term production planning, i.e., a month ahead or more, where the actual status of the production plant 
does not play a significant role. A short-term planning tool, on the other hand should incorporate this 
information, e.g., fouling level, mechanical wear, raw material quality. The next steps will introduce a multi-
objective optimization framework in order to optimally plan production and energy consumption under different 
product and energy market conditions (Vallerio, M. et al., 2015 & Nimmegeers, P. et al., 2019).  

References 

Conn, A., Gould, N., and Toint, P.L., 2000, Trust-region methods. mps-siam series on optimization siam and 
mps. Society for Industrial and Applied Mathematics: Philadelphia, PA, USA. 

Elster, C., Klauenberg, K., Walzel, M., W ̈ubbeler, G.,Harris, P., Cox, M., Matthews, C., Smith, I., Wright, L., 
Allard, A., et  al., 2015, A guide to Bayesian inference for regression problems. deliverable of EMRP 
Project NEW04 “Novel Mathematical and Statistical Approaches to Uncertainty Evaluation,”. 

Fumero, Y., Moreno, M.S., Corsano, G., and Montagna, J.M., 2016, A multiproduct batch plant design model 
incorporating production planning and scheduling decisions under a multiperiod scenario. Applied 
Mathematical Modelling, 40(5), 3498 – 3515. 

Li, J., Xiao, X., Boukouvala, F., Floudas, C.A., Zhao, B.,Du, G., Su, X., and Liu, H., 2016, Data-driven 
mathematical modeling and global optimization framework for entire petrochemical planning operations. 
AIChEJournal, 62(9), 3020–3040. 

Muteki, K. and MacGregor, J., 2008, Optimal purchasing of raw materials: A data-driven approach. 
AIChEJournal, 54, 1554 – 1559. 

Nimmegeers, P., Vallerio, M., Telen, D., Van Impe, J. and Logist, F. (2019), Interactive Multi‐objective 
Dynamic Optimization of Bioreactors under Parametric Uncertainty. Chemie Ingenieur Technik, 91: 349-
362. https://doi.org/10.1002/cite.201800082 

Nocedal, J. and Wright, S., 2006, Numerical optimization. Springer Science & Business Media. 
Powell, M.J., 1968, A fortran subroutine for solving systems of nonlinear algebraic equations.  Technical re-

port, Atomic Energy Research Establishment, Harwell, England (United Kingdom). 
Steel, R. G. D. and Torrie, J. H., 1960, Principles and Procedures of Statistics with Special Reference to the 

Biological Sciences.  
Vallerio, M., Vercammen, D., Van Impe, J., Logist, F., 2015. Interactive NBI and (E) NNC methods for the 

progressive exploration of the criteria space in multi-objective optimization and optimal control, Computers 
& Chemical Engineering, 

Wright, L. and Davidson, S., 2020, How to tell the difference between a model and a digital twin. Adv. Model. 
and Simul. in Eng. Sci., 7(13). 

918