Acta Polytechnica CTU Proceedings 38:90–96, 2022
https://doi.org/10.14311/APP.2022.38.0090
© 2022 The Author(s). Licensed under a CC-BY 4.0 licence
Published by the Czech Technical University in Prague

ENERGY PERFORMANCE ESTIMATION FOR LARGE BUILDING PORTFOLIOS WITH MACHINE LEARNING-BASED TECHNIQUES

Frédéric Montet (a, ∗), Alessandro Pongelli (b), Jonathan Rial (a), Stefanie Schwab (c), Jean Hennebert (a), Thomas Jusselme (b)

(a) University of Applied Science of Western Switzerland (HEIA-FR, HES-SO), iCoSys Institute, Bd. de Pérolles 80, 1705 Fribourg, Switzerland
(b) University of Applied Science of Western Switzerland (HEIA-FR, HES-SO), ENERGY Institute, Bd. de Pérolles 80, 1705 Fribourg, Switzerland
(c) University of Applied Science of Western Switzerland (HEIA-FR, HES-SO), TRANSFORM Institute, Bd. de Pérolles 80, 1705 Fribourg, Switzerland
∗ corresponding author: frederic.montet@hefr.ch

Abstract. Building operation is responsible for 28 % of the world’s carbon emissions. In this context, establishing priorities in refurbishment strategies at the scale of a city or a group of buildings is important. Such procedures are usually led by experts in energy performance and are therefore rarely carried out, because they are long and costly. This research aims at estimating building energy performance to pave the way towards finding near-optimal refurbishment strategies. After identifying easily accessible building characteristics, the method applies machine learning models to scan a building portfolio based on a low level of detail. The results show good potential to identify low-performing buildings with simple machine learning methods. They also open the door to further improvements through the inclusion of supplementary building features at the input of the predictive system.
This work includes (a) the integration of a knowledge database built from the Swiss CECB energy performance certificates, referencing more than 70 000 buildings, (b) the preparation of a training data set through the selection of relevant physical characteristics of buildings (input) and the corresponding energy consumption labels (output), (c) the development of predictive models used in a supervised way, and (d) their evaluation on an independent test set.

Keywords: Refurbishment strategies, machine learning, energy performance certificates.

1. Introduction

In a world where climate change imposes major adjustments in order to slow down the rise in environmental temperature, one important area to improve is the planning of building renovations. Thanks to changing construction techniques and equipment, building renovations have become a major focus of governments' energy strategies.

Switzerland, for example, is implementing a strategy called “Energy Strategy 2050” which aims, among other things, to reduce the energy consumption of buildings through incentives [1]. In the coming years, this may cause a rush to renovate old buildings and a need to prioritise renovations because of the costs and timeframes imposed. For building stock managers, it means having to prioritise and rank the interventions that can lead to energy improvements without involving high costs.

In Switzerland, the Cantonal Energy Certificate for Buildings (CECB) [2] allows the assessment of the current state of a building and the planning of a possible renovation. It attributes an energy label to the efficiency of the building envelope, which describes the quality of the thermal envelope including roof, wall, floor and window insulation, and also takes into account thermal bridges and the shape of the building.
A second label is given to the overall energy efficiency, including heat demand, electricity demand, own production of electricity as well as the building’s equipment for heat and domestic hot water. The labels are divided into 7 classes: from A, the best class, to G, the worst class, compared to a reference. This certificate is compulsory in cantons such as Geneva, Vaud, Fribourg, Neuchâtel, Nidwald, Zug and Zürich in case of sale and/or renovation of the building. This has led to the creation of a database with more than 70 000 certificates describing the physical characteristics and energy performance of each building.

However, this certificate obliges experts to travel to the building site in order to collect the various data needed for an in-depth analysis, which increases the cost and time required to analyse a large building stock. In recent years it has also become possible, through portals such as the Registre fédéral des bâtiments et des logements (RegBL) [3], cantonal geoportals [4, 5], Google Street View [6] and Google Maps [7], to find precise data or to collect it remotely. Therefore, in the following sections we try to answer the main question: “How can the available online data be used to quickly classify a building stock energy performance?”.

Figure 1. Map of Switzerland showing the concentration of building energy performance certificates (CECB) per canton.

In this paper, we investigate the use of Machine Learning (ML) approaches as a solution to automatically identify the least efficient buildings according to the CECB methodology. The proposed method takes as input of the ML systems easy-to-find building characteristics that do not need the intervention of an energy expert.
The output of the system is a prediction of the CECB energy label of the building. If functional, such approaches could provide a quick and easy way to rank buildings by renovation priority. The CECB association gave us access to its data under an agreement for the protection and sharing of sensitive data. We used these data to train and test our ML systems.

The paper is organized as follows. Section 2 presents the methods used to prepare the data and to select and optimize the best performing models. Section 3 presents the results according to the metrics that we propose to evaluate model performance. Finally, Sections 4 and 5 present the discussion, conclusions and future work.

2. Methods

This section introduces the exploratory and preprocessing phases performed on the data. It then details the selection of the machine learning models, their optimization and the process used to assess the quality of the results.

2.1. Data Exploration

As indicated in Section 1, the dataset contains more than 70 000 certificates. To understand the nature of the data, statistics were computed as a preliminary analysis. Knowing that the certificate is compulsory in some cantons, we verified that this is reflected in the number of certificates per canton in the database. Figure 1 shows how the dataset is divided among the cantons and makes it possible to identify the cantons that have a compulsory certificate.

A second analysis was carried out to verify data consistency. Figure 2 shows the average heat transfer coefficient for each year of construction. This value decreases over time, which is consistent with the increase in building insulation from 1970 onwards. This conclusion is similar to the findings of the Energy and Renovation (eREN) project [8].

Figure 2. Evolution of the average heat transfer coefficient in relation to the year of construction of buildings.
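The consistency check described above amounts to grouping certificates by construction year and averaging the heat transfer coefficient. A minimal sketch follows, assuming each certificate is a dictionary with hypothetical keys `year` and `u_value`; the actual CECB schema is not given in the paper, and the sample records are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_u_value_by_year(certificates):
    """Average heat transfer coefficient (U-value) per construction year."""
    by_year = defaultdict(list)
    for cert in certificates:
        by_year[cert["year"]].append(cert["u_value"])
    # Sort by year so the trend can be read or plotted directly.
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Made-up records for illustration only; not taken from the CECB data.
sample = [
    {"year": 1960, "u_value": 1.2},
    {"year": 1960, "u_value": 1.0},
    {"year": 1985, "u_value": 0.6},
    {"year": 2010, "u_value": 0.2},
]
averages = mean_u_value_by_year(sample)
for year, value in averages.items():
    print(year, round(value, 2))
```

A decreasing sequence of averages is what one would expect if the data are consistent with improving insulation over time.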
To get a better understanding of the composition of the dataset, we then checked which building categories are present. They follow the categories of RegBL [3]. Table 1 shows that the two most represented categories are single-family buildings and multi-family buildings. The distribution of the classes before the cleaning of the dataset is given in Table 2, where the most represented class is D and the least represented is A.

| Categories CECB | Qty. | Percentage |
|---|---|---|
| Single-family building | 40 590 | 55.10 % |
| Multi-family building | 27 780 | 37.71 % |
| Administration | 1 932 | 2.62 % |
| Mixed | 1 802 | 2.45 % |
| School | 1 374 | 1.87 % |
| Retail | 79 | 0.11 % |
| Hotel | 77 | 0.10 % |
| Restaurant | 29 | 0.04 % |

Table 1. Representation of the various building categories in the CECB dataset.

| Label | Original data | After cleaning |
|---|---|---|
| A | 1620 | 277 |
| B | 10695 | 4594 |
| C | 16036 | 10056 |
| D | 17271 | 10718 |
| E | 11615 | 7355 |
| F | 6950 | 4374 |
| G | 9474 | 4986 |

Table 2. Number of certificates per label before and after cleaning the dataset.

2.2. Data Preprocessing

The data received was in the form of several tables. These were merged to combine the data in a single structure. Subsequently, the data was cleaned. After manual checks, outliers were identified and removed when samples were more than two standard deviations from the mean, which removed 5 % of the data. In addition, columns with more than 90 % of missing values and samples (i.e. certificates) with more than 80 % of missing values were removed. The result of this step is presented in Table 2 as the number of certificates after cleaning.

For all following machine learning (ML) tasks, a balanced dataset is preferred¹. To achieve this, sub-sampling can be used. Since class A is clearly under-represented with its 277 samples, sub-sampling the data would induce a high loss of samples.
For this reason, classes A and B are merged before sub-sampling for all further ML tasks. Moreover, as buildings in classes A and B are the best performers, they are not targeted by refurbishment plans.

¹ A dataset is said to be balanced when the number of samples per class is equal.

2.3. Model Performance Exploration

The modelling method includes 3 steps.
(1.) Pre-processing routines were carried out. These include an ordinal encoding of the data as well as a normalization by computing the Z-score on all variables.
(2.) Then, eighteen different models from the Scikit-learn library that are appropriate for a classification task were selected and trained with a k-fold cross-validation with k = 3 [9].
(3.) All models are evaluated on their F1-scores and accuracy metrics to select the two most promising ones.

2.4. Optimization and Model Selection

On the two best models from the previous section, optimization and training are performed to allow for the final model selection. Since the optimization task is memory-expensive, the search for optimal parameters was done on a reduced number of samples per class, selected at random.

The optimization consists of a cross-validated randomized search with k = 3 over a selection of parameters (see Section 3.2). Model training is carried out as in step 2 of Section 2.3. Finally, the overall best model is selected given its per-class metrics (accuracy and F1) and its confusion matrix.

2.5. Special Cases Identification

As a last step of the method, a more in-depth analysis of special cases is performed to identify the reasons behind a high prediction error. A distance between classes is computed with a method taken from ordinal regression problems [10].

To compute this distance, let $y = A$ and $\hat{y} = G$ be a sample's class and its estimation. Their encoded versions are $y' = (1, 0, 0, 0, 0, 0, 0)$ and $\hat{y}' = (1, 1, 1, 1, 1, 1, 1)$. The distance $d$ between the two classes can then be calculated as

$$d = \sum_{i=1}^{n} \left| y'_i - \hat{y}'_i \right|,$$

where $n$ is the length of the encoded vectors. For the example above, $d = 6$.
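As a worked illustration of this distance, the sketch below encodes each label cumulatively, as in the example above, and sums the absolute differences. The names `CLASSES`, `encode` and `ordinal_distance` are hypothetical and introduced here for illustration only; the paper does not give an implementation.

```python
CLASSES = ["A", "B", "C", "D", "E", "F", "G"]


def encode(label, classes=CLASSES):
    """Cumulative ordinal encoding: ones up to and including the label's rank."""
    rank = classes.index(label)
    return [1 if i <= rank else 0 for i in range(len(classes))]


def ordinal_distance(y_true, y_pred, classes=CLASSES):
    """Sum of absolute differences between the two encoded labels."""
    true_vec = encode(y_true, classes)
    pred_vec = encode(y_pred, classes)
    return sum(abs(a - b) for a, b in zip(true_vec, pred_vec))


print(encode("A"))                 # encoded version of class A
print(ordinal_distance("A", "G"))  # distance for the example in the text
```

With this encoding, the distance reduces to the number of class steps separating the two labels, so adjacent classes are at distance 1 and the two extremes A and G are at distance 6.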
Once obtained, all samples were sorted in descending order of their distance d.

3. Results

3.1. Model Performance Exploration

The eighteen algorithms were trained on a dataset resampled to make it balanced. The performances reported in Table 3 allow for the comparison of all algorithms in order to identify the two best candidates for the prediction of energy performance certificates.

The DummyClassifier performs a classification in a random manner with 0.17 accuracy, i.e. about one chance in six, as expected for six balanced classes. This result sets a baseline above which an algorithm can be said to learn the characteristics of the data. Since most algorithms reached an F1-score and an accuracy between 0.3 and 0.5, the problem at hand can be characterized as difficult.

| Model | Accuracy | F1-score |
|---|---|---|
| HistGradientBoostingClassifier | 0.50 | 0.49 |
| RandomForestClassifier | 0.50 | 0.48 |
| ExtraTreesClassifier | 0.49 | 0.48 |
| BaggingClassifier | 0.49 | 0.48 |
| GradientBoostingClassifier | 0.49 | 0.47 |
| MLPClassifier | 0.47 | 0.46 |
| SVC | 0.46 | 0.45 |
| NuSVC | 0.46 | 0.45 |
| LogisticRegression | 0.45 | 0.43 |
| DecisionTreeClassifier | 0.43 | 0.43 |
| AdaBoostClassifier | 0.43 | 0.42 |
| LinearDiscriminantAnalysis | 0.43 | 0.41 |
| ExtraTreeClassifier | 0.39 | 0.38 |
| KNeighborsClassifier | 0.39 | 0.38 |
| LinearSVC | 0.41 | 0.37 |
| SGDClassifier | 0.40 | 0.35 |
| RidgeClassifier | 0.39 | 0.34 |
| NearestCentroid | 0.35 | 0.34 |
| Perceptron | 0.33 | 0.32 |
| PassiveAggressiveClassifier | 0.32 | 0.31 |
| BernoulliNB | 0.32 | 0.31 |
| GaussianNB | 0.30 | 0.30 |
| DummyClassifier | 0.17 | 0.05 |

Table 3. Model performances exploration. Models are ordered by F1-score and then by accuracy.

| Model | Before | After | Gain |
|---|---|---|---|
| Catboost | 0.59 | - | - |
| HistGradientBoosting | 0.49 | 0.58 | 18 % |
| RandomForest | 0.48 | 0.57 | 19 % |

Table 4. Comparison of the best classifiers with their F1-score averaged across classes.

The top 5 algorithms are recent classification models that perform similarly, with F1-scores around 0.48. The MLPClassifier² performed moderately well.
The best candidates selected for the next step of the method are the HistGradientBoostingClassifier and the RandomForestClassifier.

² Stands for Multi-layer Perceptron, i.e. a neural network, here with the default parameters from the Scikit-learn library.

3.2. Optimization and Model Selection

The fine-tuning procedure was performed on the HistGradientBoostingClassifier and RandomForestClassifier models. In addition, one state-of-the-art gradient boosted tree implementation from the Catboost library was added, without fine-tuning [11].

As Table 4 shows, all models reach performances above or close to 0.5 without optimization, which already allows for an educated guess. When the parameters from Listing 1 are used, substantial gains of the order of ∼20 % are obtained, which illustrates the value of the randomized search.

    best_params_hist_gradient_boosting = {
        'l2_regularization': 0.00081,
        'learning_rate': 0.11889,
        'max_bins': 60,
        'max_leaf_nodes': 11,
        'min_samples_leaf': 87
    }

    best_params_random_forest = {
        'n_estimators': 800,
        'min_samples_split': 5,
        'min_samples_leaf': 1,
        'max_features': 'sqrt',
        'max_depth': 80,
        'bootstrap': False
    }

Listing 1. Best parameters.

| | Unbalanced | Balanced |
|---|---|---|
| AB | 0.76 | 0.82 |
| C | 0.69 | 0.62 |
| D | 0.60 | 0.54 |
| E | 0.45 | 0.44 |
| F | 0.31 | 0.47 |
| G | 0.68 | 0.67 |
| Accuracy | 0.60 | 0.59 |
| F1 avg. | 0.59 | 0.59 |

Table 5. Global building efficiency F1-scores.

With a final F1-score of 0.59 averaged across classes, the Catboost-based model is the best performing of the selection. Both algorithms from the Scikit-learn library also showed a notably high performance gain, which brings their scores close to that of Catboost.

3.3. Best Model Performances

The Catboost model, which uses gradient boosting on decision trees, is the final model selected. This section introduces it in three phases.
First, by explaining an additional data processing task. Second, by presenting the model performances in depth. Finally, by analyzing the errors produced by the model.

3.3.1. Model Performance

The final model performances are summarized in Table 5. On average, the model reaches an accuracy of ∼0.6. This is about 3.5 times the baseline of 0.17 given by the DummyClassifier in Section 3.1, i.e. an improvement of roughly 250 %.

On unbalanced data, scores vary widely between classes, which makes the model quality hard to evaluate with confidence. Since the unbalanced dataset generally has more samples, a higher score would be expected, but this is not the case for all classes. A possible explanation is that sub-sampling selects more representative samples.

Figure 3. Confusion matrices for global efficiency: (a) unbalanced dataset, (b) balanced dataset.

On balanced data, the comparison between classes gives an insight into the difficulty of evaluating each class. From AB to E, scores decrease from 0.82 to 0.44, and F remains low at 0.47. This shows that the lower the efficiency of a building is, the harder it becomes to predict its class. Both ends of the class spectrum, AB and G, have higher scores. This behavior is probably due to the absence of strict intervals, i.e. a certificate made on a highly inefficient building would belong to a class beyond G but, since no such class exists, it is classed as G.

More details about the uncertainty of the predictions are given by the confusion matrices in Figure 3, which present the precision of the models visually. A highlighted diagonal, showing the correct predictions, is predominant in both the unbalanced and the balanced plots. On each side of the diagonal are the predictions made with an error of one class above or below the true label.

3.3.2. Prediction Reliability

To evaluate how the false predictions are spread around the diagonal, the plot in Figure 4 shows the cumulative density of the distances between predicted and true values. The accuracy of ∼60 % is visible where the distance is 0. When an error of one letter is tolerated, the cumulative share rises to ∼90 %, and it reaches ∼100 % for two letters. This highlights that wrong predictions are generally not far from their target.

Figure 4. Cumulative distribution of the distances between true and predicted global efficiency given two different metrics.

3.4. Online variables

To assess whether the necessary input variables are available online to speed up the classification process with ML-based techniques, a wide search was carried out on the various portals listed in Section 1. Figure 5 presents the most discriminant variables in the Catboost algorithm. For brevity, only the first ten are presented.

Among the most important variables for the algorithm, the year or era of construction of the building is easily found on the RegBL site through the interactive building map. A second variable that can easily be found is the building width, for example with the measurement tool of Google Maps. A third and a fourth, which can be found online but are subject to restrictions, are the year of construction of the energy agent and the energy source. They are available in the RegBL database, but access to this type of data is only granted after authorization and verification of credentials. All other variables are not accessible, as they are currently not present on any platform.

4. Discussion

The applicability of the method depends on
(1.) the accuracy of the model that is used to predict performance classes, and
(2.) the easy online access to the inputs of the predictive model, which describe the physical properties of the buildings to be classified.
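The reliability measure of Section 3.3.2, which underlies criterion (1.), can be sketched as the share of predictions lying within k classes of the true label. The function name `cumulative_share_within` and the label lists below are hypothetical and made up for illustration; they are not results from the paper.

```python
from collections import Counter

CLASSES = ["AB", "C", "D", "E", "F", "G"]


def cumulative_share_within(y_true, y_pred, classes=CLASSES):
    """Share of predictions whose class distance is <= k, for k = 0, 1, 2, ..."""
    rank = {label: i for i, label in enumerate(classes)}
    # The rank difference equals the encoded distance d of Section 2.5.
    counts = Counter(abs(rank[t] - rank[p]) for t, p in zip(y_true, y_pred))
    total = len(y_true)
    shares, running = [], 0
    for k in range(len(classes)):
        running += counts.get(k, 0)
        shares.append(running / total)
    return shares


# Made-up labels for illustration only.
y_true = ["AB", "C", "D", "E", "F", "G", "C", "D", "E", "F"]
y_pred = ["AB", "C", "D", "D", "G", "G", "E", "D", "E", "C"]
shares = cumulative_share_within(y_true, y_pred)
print(shares)
```

The first entry of the returned list is the plain accuracy, and the second is the share of buildings classified at most one letter away from their true class.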
Figure 5. Feature importance of the first ten variables used by Catboost.

Regarding (1.), the Catboost model is promising as it has close to 60 % accuracy. Moreover, the prediction is at most one class away from the true label for about 90 % of the buildings. This seems highly acceptable if the method dramatically decreases the time spent assessing building performance.

To understand how to further improve the prediction, it is necessary to analyse the process step by step, starting with the collection of data for the generation of the CECB certificate.

The initial dataset is the result of certificates produced by many CECB experts. Several points in the creation of a CECB certificate are decided through visual inspection and thus depend on the knowledge of the individual expert. For example, in order to get a U-value for walls, the expert can enter a proposal for the composition of the wall based on visual inspection, and the program calculates a U-value from what the expert entered. This means that each expert, according to their knowledge, can enter or omit data, but in the end the program still manages to generate a complete certificate. This, for example, explains the missing data for some certificates. In order to have a more complete dataset, it would be important to reduce to a minimum the possibility of omitting data when generating a certificate.

Regarding the type of certificate, a simpler type with fewer parameters was in place before 2012; a more detailed certificate was adopted afterwards. This certainly causes a lack of data in some of the certificates used. Filling the gaps and converting less detailed certificates into more detailed ones by entering the missing data could be a way to improve the base set.
During the data cleaning, arbitrary choices were made: values lying more than two standard deviations from the mean were eliminated, as were columns with 90 % of the data missing and certificates with 80 % of the data missing. These choices can clearly decrease the accuracy by removing variables and, above all, samples from the dataset. A more thorough cleaning analysis should be done, so as to be sure that all inconsistent values are removed and that all consistent values are kept. Similarly, variables considered important could be kept, as well as more certificates, in order to increase the population.

Finally, when it comes to prediction, changing the settings of the algorithm could lead to further improvement of the results. A better tuning can lead to a more precise prediction.

Moving on to the second point (2.), Section 3.4 identified that only two variables are easily accessible. In the remainder of this section we analyse each variable in Figure 5 and explore possible ways of obtaining representative values for the building under analysis.

In order of importance, the first variable is the heat transfer coefficient of walls, which cannot be found online. One solution would be to send a person on site to carry out an analysis and give a value to this variable. A second possibility that should be explored is to use artificial intelligence to reconstruct this value, for example from an image of the wall of the building and the year of construction, so that key features can be recognized and a value assigned based on similar buildings.

The second variable in the list is the energy source. As already discussed in Section 3.4, it can be found on the RegBL website with restrictions. However, this data may be in the possession of the owners or managers of the building and could therefore be entered easily.
The year of construction of the heating, as mentioned in Section 3.4, can be found on the RegBL website with restrictions. It should also be checked whether the owner or manager of the building has this information.

The envelope coefficient is not found directly but could be calculated from other variables. Online, it is possible to measure the building perimeter through Google Maps. On the RegBL website, it is possible to find the number of floors in the building and the ground area. Using, for example, an average value for the height of a floor, the various parameters can be combined. Multiplying the perimeter by the number of floors and the average floor height, and then adding the floor area twice, gives an approximation of the envelope surface. Dividing the envelope surface by the number of floors multiplied by the floor surface gives an approximate value of the envelope coefficient. This is clearly an approximation, and both its usefulness and the errors it introduces should be checked.

The utilisation rate of the heat generator for hot water cannot be found online, so there are two possibilities to obtain this data. The first is that the manager or owner of the building knows this value. The second is to send an expert on site to make an assessment.

The energy reference surface can also be estimated from other variables. The RegBL website gives the floor area of the building and the number of floors; multiplying these two variables gives a rough estimate. Clearly, the reliability of predicting the correct energy class using this approximation must be checked.

The climate station is not available online but, knowing the address of the building, it is possible to indicate which weather station is to be used for the calculation.
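The two geometric approximations described above, for the envelope coefficient and the energy reference surface, can be written as a short sketch. The function names are hypothetical, and the default floor height of 3.0 m is an assumption made here for illustration; the text only mentions "an average value for the height of a floor".

```python
def estimate_envelope_surface(perimeter_m, n_floors, floor_area_m2, floor_height_m=3.0):
    """Facade area (perimeter x floors x floor height) plus the floor area counted twice."""
    facade_area = perimeter_m * n_floors * floor_height_m
    return facade_area + 2 * floor_area_m2


def estimate_energy_reference_surface(n_floors, floor_area_m2):
    """Rough estimate: number of floors times the floor area."""
    return n_floors * floor_area_m2


def estimate_envelope_coefficient(perimeter_m, n_floors, floor_area_m2, floor_height_m=3.0):
    """Envelope surface divided by (number of floors x floor area)."""
    envelope = estimate_envelope_surface(perimeter_m, n_floors, floor_area_m2, floor_height_m)
    return envelope / estimate_energy_reference_surface(n_floors, floor_area_m2)


# Made-up building: 40 m perimeter, 3 floors, 100 m2 footprint.
envelope = estimate_envelope_surface(40.0, 3, 100.0)
reference_surface = estimate_energy_reference_surface(3, 100.0)
coefficient = estimate_envelope_coefficient(40.0, 3, 100.0)
print(envelope, reference_surface, round(coefficient, 2))
```

Keeping the floor height as a parameter makes it easy to test how sensitive the estimated coefficient is to that assumption before feeding it to a classifier.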
The linear thermal bridge of the building base lWF, like the heat transfer coefficient of walls, is not available online, and the same solutions can be proposed.

It must be said that all missing variables could be estimated or found easily by an expert on site. Having only 10 variables to find would simplify and speed up the work to be done on site.

5. Conclusion

In this paper we have presented preliminary results on the reliability of using a classification algorithm to analyse a building. Ranking a building in the correct class with 60 % accuracy is a promising result for future developments. It should be noted that there is a 90 % probability that the prediction is at most one class away from the true one, which adds value to the work done.

In addition, the work carried out to check the online presence of the most important variables for prediction has shown that it is still premature to expect exact values online. Nevertheless, it may be possible to recreate some of them by developing further techniques.

Some recommendations for future work are necessary. For example, special attention must be paid when merging the different data available during the preparation phase, as errors introduced there could propagate to the other phases of the work. Special attention must also be paid during the cleaning phase to ensure that the maximum amount of data remains available to carry out the work.

Acknowledgements

The authors would like to express their gratitude to Ms Karine Wesselmann and the CECB association for providing the data for this work. They would also like to thank Professor Mylène Devaux and the iTEC institute for their collaboration. Financial support is gratefully acknowledged from the HEIA-FR Smart Living Lab research program.

References

[1] Swiss Federal Office of Energy. Stratégie énergétique 2050, 2020. [2021-11-05]. https://www.bfe.admin.ch/bfe/fr/home/politik/energiestrategie-2050.html
[2] Association GEAK-CECB-CECE. Le Certificat énergétique cantonal des bâtiments (CECB). [2021-11-05]. https://www.cecb.ch/
[3] Office fédéral de la statistique. Registre fédéral des bâtiments et des logements (RegBL). [2021-11-05]. https://www.housing-stat.ch/fr/index.html
[4] Canton de Fribourg. Portail cartographique du canton de Fribourg. [2021-11-05]. https://map.geo.fr.ch/
[5] Canton de Neuchâtel. Portail cartographique du canton de Neuchâtel. [2021-11-05]. https://sitn.ne.ch/
[6] Google LLC. Google Street View. [2021-11-05]. https://www.google.com/intl/en_ch/streetview/
[7] Google LLC. Google Maps. [2021-11-05]. https://www.google.ch/maps/
[8] S. Schwab, L. Rinquet, M. Devaux, et al. Rénovation énergétique. Tech. rep., Fribourg, Suisse, 2018.
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830, 2011.
[10] J. Cheng, Z. Wang, G. Pollastri. A neural network approach to ordinal regression. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1279–1284. 2008. https://doi.org/10.1109/IJCNN.2008.4633963
[11] L. Prokhorenkova, G. Gusev, A. Vorobev, et al. CatBoost: unbiased boosting with categorical features, 2019. arXiv:1706.09516. http://arxiv.org/abs/1706.09516