Home page


Data science offers promising prospects for improving the energy efficiency of buildings. Thanks to the availability of smart meters and sensor networks, along with increasingly accessible algorithms for data processing and analysis, statistical models may be trained to predict the energy use of HVAC systems or the indoor conditions. These trained models and their predictions then support various inferences: assessing the real impact of energy conservation measures; identifying physical properties of the envelope in order to provide incentive for retrofitting; minimizing energy consumption through model predictive control; detecting and diagnosing HVAC faults; etc.

The availability of measurements and computational power has made data mining methods increasingly popular. The field of data analysis applied to building energy performance assessment, however, still faces two main challenges. Ironically, the first challenge is the abundance of data. Smart meters and building management systems deliver large amounts of information, which can hide the few readings that are most relevant to energy conservation. Automated monitoring and fault detection algorithms only do what they are told, and will hardly replace human intervention when it comes to understanding readings. The second challenge is the difficulty of data science. Without a principled methodology, it is very easy to draw erroneous conclusions by incorrectly assuming that a model is properly trained. Without a background in statistics, building energy practitioners often lack the tools to ensure their inferences are correct.

Content of the book

The topic of this book is statistical modelling and inference applied to building energy performance assessment. It has two target audiences: building energy researchers and practitioners who need a gentle introduction to statistical modelling, and statisticians who may be interested in applications to energy performance.

The first part of the book covers the motivation and theoretical background.

  • Chap. 1 is an overview of the possibilities of data analysis applied to building energy performance assessment, and of the main categories and challenges of data analysis methods.
  • Chap. 2 briefly describes building physics and how it can be formulated as statistical models.
  • Chap. 3 describes the main steps of a Bayesian workflow for statistical modelling and inference, which aims to ensure that models are well defined and properly trained for a given application.
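To give a flavour of what "formulating building physics as a statistical model" can mean, here is a minimal sketch in Python. It is not taken from the book's notebooks. It assumes a hypothetical steady-state heat balance in which heating power is proportional to the indoor-outdoor temperature difference, with a heat loss coefficient H (W/K) and Gaussian measurement noise. The data are simulated, the function names are illustrative, and the fit is an ordinary least-squares estimate rather than the Bayesian workflow of Chap. 3.

```python
import random


def simulate_heating_data(h_true, noise_sd, n, seed=0):
    """Simulate n steady-state observations of heating power (W).

    Assumed model (illustrative only): power = h_true * delta_t + noise.
    """
    rng = random.Random(seed)
    # Indoor-outdoor temperature difference (K), drawn uniformly
    delta_t = [rng.uniform(5.0, 25.0) for _ in range(n)]
    # Heating power (W) with additive Gaussian measurement noise
    power = [h_true * dt + rng.gauss(0.0, noise_sd) for dt in delta_t]
    return delta_t, power


def fit_heat_loss_coefficient(delta_t, power):
    """Least-squares estimate of H in power = H * delta_t + error (no intercept)."""
    sxx = sum(dt * dt for dt in delta_t)
    sxy = sum(dt * p for dt, p in zip(delta_t, power))
    h_hat = sxy / sxx
    # Residual variance, with one estimated parameter
    residuals = [p - h_hat * dt for dt, p in zip(delta_t, power)]
    sigma2 = sum(r * r for r in residuals) / (len(power) - 1)
    # Standard error of the slope for a regression through the origin
    std_err = (sigma2 / sxx) ** 0.5
    return h_hat, std_err


# Simulated data: the "true" coefficient of 150 W/K is an arbitrary assumption
delta_t, power = simulate_heating_data(h_true=150.0, noise_sd=200.0, n=200)
h_hat, h_se = fit_heat_loss_coefficient(delta_t, power)
print(f"Estimated H = {h_hat:.1f} W/K (standard error {h_se:.1f} W/K)")
```

The estimated coefficient keeps a physical meaning (a heat loss rate per kelvin), and its standard error gives a first indication of how much the data actually constrain it.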

The rest of the book presents applications as a series of R and Python notebooks grouped into chapters, each focusing on a type of model. Each notebook is self-contained, written in either R or Python, and states whether non-standard libraries or other software must be installed.

  • Regression and mixture models
  • Time series analysis
  • State-space models for
  • Gaussian Process models

The aim of the book is to explain the basics of Bayesian data analysis to building energy practitioners who do not necessarily have an extensive background in statistics.

The book does not cover:

  • Data acquisition and pre-processing. Although a crucial step of data analysis, they are not explained in detail for each problem.
  • Big data. The typical size of our data files is a few MB, up to a few GB for the largest sets. This is far from what computer scientists consider “big data”.
  • Machine learning. Even when the target of a particular problem is prediction rather than inference of physical properties, our statistical models will always have some degree of physical interpretability.
  • Classification. Most of the methods shown are variations of regression problems, as our models will almost always have quantitative responses. There are however strong links with classification problems, especially when it comes to identifying occupant presence and behaviour.

Programming languages

This book is written and maintained with bookdown.

I don’t believe there is an obvious winner in the war of “which language is better for data science”. An analyst may use more than one for a variety of reasons.

  • Part II tutorials are written in R
  • Part III tutorials (time series) are written in R and Python
  • Part IV tutorials (state space models) are written in Python

The first language used in this book is R. Not because of its performance, but because it offers a comfortable environment for data analysis: the elegance of the tidyverse is unmatched, and RStudio is simply my favourite scientific IDE by a long shot.

Still, some tutorials are written in Python because I have more experience with it, and because it is the default language for most researchers in building energy performance. Python is always a safe choice for performance and versatility, and is currently very strong in machine learning (scikit-learn, TensorFlow, Keras, PyTorch, Pyro…).


I am Simon Rouchier, lecturer at the Université Savoie Mont Blanc, Chambéry, France.

This website is the beta version of a future book. It may still contain typos and inconsistencies: I welcome all feedback by email or through the book’s GitHub repo.

Creative Commons Licence
This work is made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence.