CHEMICAL ENGINEERING TRANSACTIONS 

VOL. 96, 2022 

A publication of 

The Italian Association 
of Chemical Engineering 
Online at www.cetjournal.it 

Guest Editors: David Bogle, Flavio Manenti, Piero Salatino 
Copyright © 2022, AIDIC Servizi S.r.l. 
ISBN 978-88-95608-95-2; ISSN 2283-9216 

Machine Learning Based Framework for Biorefinery 
Environmental Assessment 

Nancy Prioux*, Rachid Ouaret, Jean-Pierre Belaud
Laboratoire de Génie Chimique, Université de Toulouse, CNRS, INPT, UPS, Toulouse, France 
nancy.prioux@ensiacet.fr 

The transformation of actual processes into sustainable processes is a major study subject over recent years, 
particularly through the circular economy. However, the environmental assessments require a huge quantity of 
data and many of these data are heterogeneous. Environmental evaluation tools would clearly benefit from Data 
Science approaches in the Big Data context. This paper focuses on developing a framework for decision-making 
in Process System Engineering by coupling Machine Learning techniques and environmental assessment. Five-
steps framework have been deployed in a framework and tested on the comparison of biomass pretreatment 
processes for glucose production. Some scientific articles have been selected thanks to specific keywords in 
Science Direct and Web of Science. The data architecture and in particular the data analysis allows us to bring 
data to higher quality such as a material balance check. The approach gives access to a process-impact matrix 
which is analyzed through Dimensional Reduction methods in order to highlight similar impacts and/or 
processes. 

1. Introduction
According to the French Agency for Environment and Energy Management (ADEME), the Circular Economy 
(CE) takes into account three areas of action: (1) consumption through consumer demand and behavior, (2) 
supply and economic actors for whom industrial ecology is an accepted and promising path from the initial 
design of a territorial area and (3) waste management (Belaud et al., 2019a). These three areas describe as the 
entire life cycle of a process or a product. Life cycle thinking is used in sustainable models to improve 
environmental performance while maximizing economic and social benefits. In recent years, several global 
methods have emerged to design biorefineries in CE models (Grimaud et al., 2017). One of the challenges in 
biorefineries is to design processes that are as sustainable as possible. The supply chain includes several 
operational steps, from biomass selection to waste disposal, and goes through various processing steps. Each 
step in this chain can be described with its material flows (inputs and outputs), operating parameters, energy 
and economic data. All these data are required so that the environmental assessment could be carried out 
considering the data diversity and its heterogeneous sources. However, without proper data science tools, it 
can be difficult to valorize the collected data and better decision-making. Environmental evaluation tools such 
as Life Cycle Assessment (LCA) would clearly benefit from Data Science approaches in the Big Data context 
(Belaud et al., 2019b).  
The present paper examines the use of Machine Learning (ML) approaches to LCA for the evaluation of 
biorefinery processes. The main topics are concentrated on the decision-oriented problems of sustainability and 
eco-design. It structures around the valorization and representativeness of data with the help of data 
Dimensional Reduction (DR) (Cox and Cox, 2001) and clustering. Despite the importance of ML, a review of the 
literature revealed that sustainability and environmental assessment are largely not yet part of the popular 
lexicon of Data Science in action. This study addresses this gap. This paper focuses on developing a framework 
for decision-making in Process System Engineering by coupling Machine Learning techniques and 
environmental assessment. It also aims to identify the driving factors of the bio-process that have a major impact 
on the outputs of LCA.  

Paper Received: 18 January 2022; Revised: 5 August 2022; Accepted: 29 July 2022
Please cite this article as: Prioux N., Ouaret R., Belaud J.-P., 2022, Machine Learning Based Framework for Biorefinery Environmental 
Assessment, Chemical Engineering Transactions, 96, 517-522  DOI:10.3303/CET2296087

517

  DOI: 10.3303/CET2296087


2. General approach
2.1 Materials and methods

Big data can be used at various levels of sustainability management. One of the challenges in sustainability 
management is designing the “best” process in the three areas of sustainability – environmental, economic and 
social (Santoyo-Castelazo and Azapagic, 2014).  A supply chain includes several operational stages, from input 
choice to waste disposal, and it passes through various transformation stages and upstream/downstream 
processes. Indicators of sustainability impacts can describe each stage. The main goal of this approach is to 
analyze the different systems and provide support for group-based decision-making. The presented approach 
was adapted to any industry by making use of concepts from industry 4.0 and sustainability management. In 
particular, we retained the Big Data pillar from Industry 4.0 and sustainability assessment from sustainability 
management. Figure 1 illustrates one path of digital transformation based on the integration of big data into an 
industry.  

Figure 1: Industry 4.0 and sustainability for industries. 

Each supply chain is described by several categories of data that are heterogeneous and can influence the 
other categories. For example, the type of input can influence the type of transformation processes. All data are 
importance and influence the social, economic and environmental indicators. Obtaining such data is time-
consuming and requires expensive experiments. Alternatively, data can be obtained from scientific publications 
and other sources, with automatic or manual use of data analysis. It is indeed possible to use these data to 
obtain foreground data for sustainability analysis, whereas background data are usually available from the LCA 
database. 

2.2 Detailed Framework 

Our approach’s goal is the development of a methodological framework centered on intensive data and 
knowledge extractionfor an economically viable and ecologically responsible design of industrial processes or 
systems. The framework is divided into five major steps: (i) goal and scope (ii) data architecture (iii) sustainability 
assessment (iv) results visualization and analysis and (v) decision. 
In the first step, the goal and scope of the study must be clearly defined. The life cycle thinking being the 
foundation of the approach, it is recommended to follow the LCA ISO norm (ISO 14040:2006, 2006). A “cradle-
to-grave” approach is preferred or a “cradle-to-gate” if the logistics of a value chain are difficult to obtain. After 
the goal and the scope – the system boundaries - are well defined, the functional unit, the study function, the 
supply chain, technologies, and transformation processes should also be described. 
The data architecture step consists of the treatment of the processing of data from scientific papers or private 
databases. It is directly inspired by the construction of big data architecture and consists of five sub-steps: data 
collection and extraction, data enrichment and storage, data processing, (raw) data analysis, and (raw) data 
visualization. 

518


This step can be automatic, semi-automatic, or manual and it uses data technic e.g., machine-learning methods 
for the (raw) data analysis. These substeps are detailed in Belaud et al. (2019). The last two sub-steps – analysis 
and visualization – can take benefit from ML methods like we describe in the fourth step. 
The third step consists on sustainability assessment, which is divided into two parts. First, life cycle inventory 
lists and quantifies every input and output required for the sustainability assessment. There are two categories, 
the foreground data which is the process data from the previous step and the background data are available 
into specific LCA database. Then, sustainability assessment involves choosing the impact methods, the 
indicators, and the dispersion methods in accordance with each area of sustainability management. A [Process-
Impacts] Matrix is the result of this step but this matrix is difficult to analyze.  
The fourth step, results visualization and analysis, summarizes the analysis of the impact-process matrix. The 
step provides the methods derived from ML to help this analysis. Based on statistical literature, a combination 
of traditional DR and unsupervised clustering techniques was chosen to extract information from the impacts. 
More precisely, this hybrid approach is based on the Multi-Dimensional Scaling (MDS) using the Canberra 
distance and k-means (Lance and Williams, 1967). The objective is to search for “hidden” structures in 
multidimensional data and to help interpret the area of clustered midpoints in the assessment matrix. The 
advantage of this approach is that data-based methods require very little knowledge of processes to perform. 
Figure 2 summarizes the treatment of the [Process-Impacts] matrix. 
 

Figure 2: Treatment of the [Process-Impacts] Matrix (Step 4) 

For the final step, the researcher can use the previous step – the analysis and the visualization of clusters - as 
a basis for his decision. This decision can be made by the researcher himself or by a group composed of different 
engineers/researchers from different fields 

3. Case study 
The approach is tested on the comparison of biomass pretreatment processes for glucose production. Only the 
environmental area is considered. 

3.1 Goal and scope step 

The goal of the study is to help a researcher select a process for glucose production. The boundaries range 
from biomass to the enzymatic hydrolysis step i.e., a “cradle-to-gate” approach. Biomass is considered as a 
waste – the impacts of agricultural phases are attributed to the end product. If the biomass is considered as a 
co-product, the impacts of the production of the final product will be split between it and the biomass. The 
biomass transport phase impact is minor – the biorefinery is close to the field. The functional unit is “1 g of 
glucose” and all results are expressed based on this unit.  
 

519


3.2 Data architecture step 

Thanks to specific keywords in Science Direct and Web of Science, twenty articles have been selected. Relevant 
data from these articles are extracted semi-automatically using an ontology (Lousteau-Cazalet et al., 2016). 
This represents more than 23.000 data (numeric or text). Each scientific article is entered in the ontology with 
its meta-information (source type, reputation, citation data). The ontology structures the process data and 
ensures an export in CSV files supplying internal software. This software developed on Microsoft Excel conducts 
a first “cleaning” of the data by simulating the processes to calculate and check the mass balance. After this 
sub-step, we remove the data of three articles because they contained inconsistencies or many missing data 
points that are not amenable to be verified by the simulation. For this paper, no raw data analysis and simulation 
are carried out.  

3.3 Environmental assessment  

For the life cycle inventory, the process data comes from the end of the previous step – the cleaned process 
data - and the background data from EcoInvent v2.2. Then an attributional LCA method is applied: the ReCiPe 
2016 method. The calculation comes from SimaPro® a LCA software. The environmental assessment evaluates 
17 "midpoint" impacts. The result of this step is a [Process-Impacts] matrix of dimensions 17x17.  

3.4 Results visualization and analysis  

The previous matrix is then analyzed by MDS. The interpretation of an MDS result is simple the closer the 
objects are in the scatter plot, the more similar they are. That is the projected points are arranged in such a way 
that the grouped ones (small geometrical dissimilarity between them) will reflect original closeness in the data. 
After that, a clustering algorithm has been applied to the MDS projection to highlight the most similar objects 
(Impacts - Process) as illustrated in Figure 2. There are two types of k-means clustering possible: one based 
on impacts distance matrix and one based on the process distance matrix. The two-dimensional of MDS results 
of projected impacts (17 impacts) is shown in Figure 3. The abbreviation of the impacts is introduced to facilitate 
the visualization (Table 1). It presents 4 sub-figures of the first 4 dimensions with the most significant 
combinations. For example, in the first figure (top left), we have represented the projection of the 17 impacts on 
the first two dimensions, which represent a total variance of 45%. The percentage of explained variance for the 
first four components is 70%. The visualization of the four dimensions shows the same three groups and we 
can clearly distinguish three clusters using k-means: 

- Group 1: Almost all impacts that concern chemical pollution of soil and water are found in this group. 
The exemption is the marine ecotoxicity found in group 2. 

- Group 3: This group mainly includes impacts related to land use and land transformation. 
- Group 2: This group forms three sub-clusters with superposed points (from the 2-D perspective). This 

suggests that these points are highly similar based on the Canberra distance. Here, we find a group 
quite heterogeneous where impacts not presented in groups 1 and 3 are found. The marine ecotoxicity 
expected rather in group 1 is found in this group. 

Table 1: Abbreviations of impacts 

Impacts Abbreviations Impacts Abbreviations 
Climate change Human Health CCHH Terrestrial ecotoxicity TecoX 
Ozone depletion OD Freshwater ecotoxicity FrEco 
Human toxicity HT Marine ecotoxicity MaEco 
Photochemical oxidant formation Pohto_ChOx Agricultural land 

occupation 
AgLOcc 

Particulate matter formation PM Urban land occupation UrbLOcc 
Ionising radiation IR Natural land 

transformation 
NLTran 

Climate change Ecosystems CCE Metal depletion MeDe 
Terrestrial acidification TA Fossil depletion FossDe 
Freshwater eutrophication FrEu   
 
 
520


The two-dimensional of MDS results of process projected reveal results quite similar to those obtained by using 
MDS on the impacts matrix. In this case, the percentage of explained variance for the first four components is 
98%, which is an excellent representation in lower-dimensional spaces. Three distinct groups of processes can 
be identified. Very tight and separate clusters appear in the process data, which may suggest that each cluster 
is a domain or subdomain that needs to be analyzed individually. A group referring to processes whose pre-
treatments are purely mechanical. Going back to the impacts, we find that these two pre-treatments have a very 
significant impact on the depletion of fossils compared to the others.  
 

Figure 3: Scatter plot of MDS projections (two dimensions) and k-means clustering based on impacts distance 
matrix. Percentage of explained information for the first four components is 70%. 

3.5 Decision 

The hybrid ML techniques, here MDS and k-means clustering allows the researcher, engineer or decision maker 
to make rapid clusters on processes without loss of information. Following the goal of the study, the choice could 
be made on a process of a specific group. The specialists can also make a loop of the framework for a specific 
group to detail the result.  

4. Conclusions 
This paper proposes a generic and practical approach for the researcher or research and development engineer 
in the use of Machine Learning in the field of sustainability within the Big Data context. The improvement of the 
traditional LCA method by coupling the tools of (big) data science and artificial intelligence algorithms allows a 
different discussion of environmental impacts. Indeed, on the one hand, data science tools allow extracting and 
collecting data directly from scientific articles. On the other hand, the MDS can simplify the discussion of 
environmental impacts.  

521


Composed of five steps, the approach is shown as a decision help in a pre-study. It is designed to save time 
and money by including no experiments and using public scientific data as a database. After structuring and 
processing process data from scientific literature, the LCA step give an environmental impact-process matrix 
which can be analyzed by MDS method. In the case study, the comparison of biomass pretreatment processes 
for glucose production, this MDS clustering methods highlight major findings: (i) a group includes impacts related 
to land use, and land transformation is detected, and (ii), a cluster of all impacts related to chemical pollution of 
soils and water. 
Several limitations have been identified:  

- The data from the scientific literature are by nature data from a series of batch experiments in the 
laboratory. The life cycle analysis (LCA) is therefore performed for a low level of technology readiness 
level (TRL) or maturity (TRL 1/2). 

- The approach does not integrate the change of scale required to implement a semi-industrial pilot, 
especially if the process becomes semi-continuous  

- The abundance and the quality of the data are not sufficient for these new technological processes. 
The most ambitious perspective is the automation of the database enrichment phase. A further research 
objective will include the comparison of several ML clustering tools. Other points for progress are to reconsider 
the functional unit, the global environmental assessment strategy by integrating the upstream agricultural phase 
(consequential LCA, system allocation and system extension policy) and considering the global supply chain 
according to a dynamic analysis, spatial, or even temporal. 

Acknowledgments 

This work has been sponsored by the French government research program “Investissements d’Avenir” 
through the Research National Agency (ANR-18-EURE-0021). 

References 

Belaud, J.-P., Adoue, C., Vialle, C., Chorro, A., Sablayrolles, C., 2019a. A circular economy and industrial 
ecology toolbox for developing an eco-industrial park: perspectives from French policy. Clean Technologies 
and Environmental Policy 21, 967–985. https://doi.org/10.1007/s10098-019-01677-1 

Belaud, J.-P., Prioux, N., Vialle, C., Sablayrolles, C., 2019b. Big data for agri-food 4.0: Application to 
sustainability management for by-products supply chain. Computers in Industry 111, 41–50. 
https://doi.org/10.1016/j.compind.2019.06.006 

Cox, T.F., Cox, M.A.A., 2001. Multidimensional scaling. Chapman & Hall/CRC, Boca Raton. 
Grimaud, G., Perry, N., Laratte, B., 2017. Decision Support Methodology for Designing Sustainable Recycling 

Process Based on ETV Standards. Procedia Manufacturing 7, 72–78. 
https://doi.org/10.1016/j.promfg.2016.12.020 

ISO 14040:2006, 2006. Environmental management - Life cycle Assessment - Principles and Framework. 
International Organization for Standardization, Geneva, Switzerland. 

Lance, G.N., Williams, W.T., 1967. Mixed-data classificatory programs i - agglomerative systems. Australian 
Computer Journal 1, 15–20. 

Lousteau-Cazalet, C., Barakat, A., Belaud, J.-P., Buche, P., Busset, G., Charnomordic, B., Dervaux, S., 
Destercke, S., Dibie, J., Sablayrolles, C., Vialle, C., 2016. A decision support system for eco-efficient 
biorefinery process comparison using a semantic approach. Computers and Electronics in Agriculture 127, 
351–367. https://doi.org/10.1016/j.compag.2016.06.020 

Santoyo-Castelazo, E., Azapagic, A., 2014. Sustainability assessment of energy systems: integrating 
environmental, economic and social aspects. Journal of Cleaner Production 80, 119–138. 
https://doi.org/10.1016/j.jclepro.2014.05.061 

 
522


	Machine Learning Based Framework for Biorefinery Environmental Assessment