# Chapter 2 Building energy statistical modelling

This book applies statistical modelling and inference to building energy performance assessment. In the previous chapter, this approach was presented as a compromise between the physical interpretability of parameter estimates and the flexibility of predictive models.

The first step in this approach, before taking data into consideration, is the probabilistic modelling of the energy balance of buildings. Rather than using Building Energy Simulation (BES) software, our approach is generally to start from simple models and gradually increase complexity if required. This means writing each equation individually, and expressing its uncertainty within the probabilistic framework.

## 2.1 Building physics in a nutshell

BES decomposes a building into separate thermal zones, each of which is assumed to have a uniform air temperature. A single-family house may be split into one or two zones, while larger residential and office buildings have more, usually distinguished by their orientation and usage. The temporal evolution of temperature, humidity and other variables of comfort or indoor air quality is calculated in each zone as a consequence of the influence of the weather, exchanges between zones, HVAC settings and occupancy. BES software can reach quite a high level of detail when modelling the phenomena that influence these indoor variables: long-wave radiative heat exchange between all walls; position of the sunspot; influence of furniture on indoor humidity; CFD modelling of air flow…

Since we will give our building energy models a statistical formulation, they should stay as simple as possible while remaining close to the main phenomena that govern the heat and energy balance of an observed building. Fig. 2.1 gives a schematic view of heat gains and losses in a heated space, which we will summarise by three main equations. The figure assumes the example of a heated building in which gas is the main fuel used by the heating appliances. Similar reasoning applies in other conditions, such as an air-conditioned building in summer.

The first equation of our simplified building energy modelling framework is the temporal evolution of the indoor temperature in a thermal zone. It is shown on the right side of Fig. 2.1 by the imbalance between all heat gains (\(\Phi_\mathit{in}\), red lines) and all heat losses (\(\Phi_\mathit{out}\), blue lines).

\[\begin{equation} C \frac{\partial T}{\partial t} = \Phi_\mathit{in} - \Phi_\mathit{out} \tag{2.1} \end{equation}\]

Where \(C\) is an effective heat capacity of the thermal zone.

The second equation is the breakdown of \(\Phi_\mathit{in}\), which includes all heat gains of the room other than transmission and ventilation exchanges. This part is shown by red lines in Fig. 2.1 and depends on the specificities of each building. A common list of heat inputs is the following. \[\begin{equation} \Phi_\mathit{in} = \Phi_h + \Phi_\mathit{sol} + \Phi_\mathit{int} \tag{2.2} \end{equation}\]

- \(\Phi_h\): the energy consumption dedicated to space heating. This value is usually not directly measured, but is only a part of a meter reading (e.g. gas) which includes production and distribution losses, and often also covers the production of domestic hot water (DHW).
- \(\Phi_\mathit{sol}\): the solar heat gains. They are the outcome of direct and diffuse solar radiation, which are measured outside. The fraction of the total outdoor solar irradiance which is converted into indoor heat input depends on the orientation of the room relative to the position of the sun, the shadings, the type of glazing, etc.
- \(\Phi_\mathit{int}\): the sum of other internal heat gains, due to the presence of inhabitants, water heat gains, electrical appliances, lighting, etc. This variable is difficult to measure but tends to show a daily or weekly pattern, which makes it predictable to some extent.

The third main equation is the breakdown of \(\Phi_\mathit{out}\), which includes all heat exchange between the zone under consideration and its surroundings: \[\begin{equation} \Phi_\mathit{out} = H_\mathit{tr}^e \left(T-T_e\right) + H_\mathit{tr}^s \left(T-T_s\right) + \sum_j H_\mathit{tr}^\mathit{adj,j} \left(T-T_\mathit{adj,j}\right) + \Phi_\mathit{inf} + \Phi_v \tag{2.3} \end{equation}\]

- The first term \(H_\mathit{tr}^e \left(T-T_e\right)\) denotes heat loss by direct transmission from the heated room at temperature \(T\) to the outside at temperature \(T_e\). The \(H_\mathit{tr}^e\) coefficient includes the heat transmissivity of opaque walls, glazing and thermal bridges.
- The second term \(H_\mathit{tr}^s \left(T-T_s\right)\) denotes heat loss towards the ground at temperature \(T_s\).
- The third term encompasses heat exchange with all adjacent rooms. This mainly concerns unheated spaces, which may have a significant temperature difference with the thermal zone under consideration.
- \(\Phi_\mathit{inf}\) and \(\Phi_v\) respectively denote heat loss from air infiltration and from mechanical ventilation.

These terms are written here in the direction of heat *leaving* the room, hence the name of the variable \(\Phi_\mathit{out}\). This is of course just a notation: if the outdoor air temperature or an adjacent room temperature is higher than the zone’s temperature \(T\), some of these terms may very well switch signs, implying that heat is entering the zone.
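As a minimal sketch of how Eqs. (2.1) to (2.3) fit together, the Python snippet below integrates the zone temperature with an explicit Euler scheme. It is an illustration under simplifying assumptions, not the modelling approach developed later in the book: adjacent rooms, infiltration and ventilation are left out of \(\Phi_\mathit{out}\), the ground temperature is constant, and all names and numerical values (`simulate_zone`, `C`, `H_e`, `H_s`, the 10-minute time step) are hypothetical.

```python
import numpy as np

def simulate_zone(T0, T_e, T_s, phi_h, phi_sol, phi_int, C, H_e, H_s, dt):
    """Explicit Euler integration of Eq. (2.1) for a single thermal zone.

    Assumption: adjacent rooms, infiltration and ventilation are ignored,
    so Eq. (2.3) is truncated to its outdoor and ground transmission terms.
    Temperatures in degC, powers in W, C in J/K, H_e and H_s in W/K, dt in s.
    """
    n = len(T_e)
    T = np.empty(n)
    T[0] = T0
    for k in range(n - 1):
        phi_in = phi_h[k] + phi_sol[k] + phi_int[k]           # Eq. (2.2)
        phi_out = H_e * (T[k] - T_e[k]) + H_s * (T[k] - T_s)  # Eq. (2.3), truncated
        T[k + 1] = T[k] + dt / C * (phi_in - phi_out)         # Eq. (2.1)
    return T

# Hypothetical example: ten days at a 10-minute time step, constant inputs
n_steps = 10 * 24 * 6
T_sim = simulate_zone(
    T0=15.0,
    T_e=np.full(n_steps, 5.0),
    T_s=10.0,
    phi_h=np.full(n_steps, 2000.0),
    phi_sol=np.zeros(n_steps),
    phi_int=np.full(n_steps, 250.0),
    C=1e7, H_e=150.0, H_s=50.0, dt=600.0,
)

# Steady state of the truncated balance: gains equal losses
T_steady = (2000.0 + 250.0 + 150.0 * 5.0 + 50.0 * 10.0) / (150.0 + 50.0)
```

With constant inputs, the simulated temperature settles at the value where gains and losses balance, which is 17.5 °C for these hypothetical numbers.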

Equations (2.1) to (2.3) will be the basis for all our modelling of heat transfer. Fig. 2.1 also illustrates some challenges of statistical inference for building energy performance assessment.

A first challenge is the difficulty of directly observing the terms of Eq. (2.2) that influence the indoor heat balance. On the one hand, some hypotheses are required to formulate the solar heat gains \(\Phi_\mathit{sol}\) from outdoor measurements of solar irradiance. On the other hand, the internal heat gains \(\Phi_\mathit{int}\) of occupied buildings are the sum of influences that are hard to measure. Even the heating power \(\Phi_h\) is usually not directly available. Still in the example of a building equipped with a hydronic heating system fueled by gas, a typical situation is a single meter for all gas consumption. \[\begin{equation} e_\mathit{gas}(t) = e_\mathit{dhw}(t) + e_\mathit{sh}(t) + e_\mathit{loss}(t) \tag{2.4} \end{equation}\] Here \(e_\mathit{loss}\) denotes the production and distribution losses. The shares of consumption intended for domestic hot water production \(e_\mathit{dhw}\) and for space heating \(e_\mathit{sh}\) either need to be metered separately, or to be disaggregated from the single meter reading \(e_\mathit{gas}\). Then, the energy consumption for space heating \(e_\mathit{sh}\) translates to the heating power \(\Phi_h\) (see Eq. (2.2)) through assumptions regarding the heating system. If the target of a study is the performance assessment of the building envelope from measurements of heating power and temperatures, the strategy will differ depending on whether \(\Phi_h\) is directly measured or only a general meter for \(e_\mathit{gas}\) is available.

The proper formulation of Eq. (2.2) therefore requires assumptions to translate outdoor solar irradiance measurements into \(\Phi_\mathit{sol}\), and assumptions to translate energy meter readings into \(\Phi_h\).
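The arithmetic behind Eq. (2.4) can be sketched as follows. This is only one possible set of assumptions, chosen for illustration: the DHW share is supposed to be known, the losses are taken as a fixed fraction of the meter reading, and the remaining energy is supposed to be delivered to the zone as a constant power over the metering period. The function name `heating_power_kw`, the parameter `loss_fraction` and the numbers are all hypothetical.

```python
import numpy as np

def heating_power_kw(e_gas_kwh, e_dhw_kwh, loss_fraction, dt_hours):
    """Mean heating power from a single gas meter, rearranging Eq. (2.4).

    Assumptions (hypothetical): e_loss is a fixed fraction of e_gas, and
    e_sh is entirely released as heat in the zone during the period.
    """
    e_gas = np.asarray(e_gas_kwh, dtype=float)
    e_dhw = np.asarray(e_dhw_kwh, dtype=float)
    e_loss = loss_fraction * e_gas
    e_sh = e_gas - e_dhw - e_loss   # Eq. (2.4) solved for e_sh
    return e_sh / dt_hours          # kWh per hour, i.e. kW

# Three daily meter readings (kWh) with an assumed DHW use of 6 kWh per day
daily_gas = [60.0, 45.0, 30.0]
daily_dhw = [6.0, 6.0, 6.0]
phi_h_kw = heating_power_kw(daily_gas, daily_dhw, loss_fraction=0.1, dt_hours=24.0)
```

Changing `loss_fraction` or the DHW estimate directly shifts the resulting \(\Phi_h\), which is why these assumptions matter for any later estimate of envelope performance.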

The second challenge we mention here is the prediction of electricity consumption, and possibly its impact on the indoor heat balance. Supposing that hourly or daily measurements of electricity consumption are available, one may be interested in either identifying repetitive patterns which make this consumption predictable for purposes of energy distribution management, or estimating the fraction of this energy use that contributes to the internal heat gains \(\Phi_\mathit{int}\). The former question is closely related to the detection and data-driven modelling of occupancy, which is a machine learning (ML) problem that may be based on a variety of sensors and methods. The latter question is even more challenging, as it requires an estimate not only of the use of each appliance, but also of the fraction of its consumption that is released as heat.

The third challenge displayed in Fig. 2.1 is the decomposition of the heat loss of the envelope. The first three terms on the right side of Eq. (2.3) may be aggregated in order to define two global indicators of the thermal performance of the envelope: the total Heat Transfer Coefficient (HTC) and the transmission coefficient \(H_\mathit{tr}\). \[\begin{align} \Phi_\mathit{out} & = \underbrace{H_\mathit{tr} \left(T-T_e\right) + \Phi_\mathit{inf}}_{\mathrm{HTC} \left(T-T_e\right)} + \Phi_v \\ \mathrm{HTC} & = H_\mathit{tr} + \Phi_\mathit{inf}/\left(T-T_e\right) \tag{2.5} \end{align}\] The \(H_\mathit{tr}\) coefficient describes all heat transmission through the envelope, and the HTC also includes the effect of air infiltration. Controlled mechanical ventilation is not included in these coefficients, but may alternatively be considered as part of the heat gains in Eq. (2.2). One of the essential questions of this book will be the characterisation of HTC and \(H_\mathit{tr}\), using short-term or long-term measurements that may be recorded without disturbing the normal operation of the building.
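As a worked illustration, consider averages over a period long enough for the storage term \(C \, \partial T / \partial t\) of Eq. (2.1) to be negligible. This is an assumption made here for the sake of the example, not a general result. Combining Eqs. (2.1), (2.2) and (2.5) then gives

\[ \Phi_h + \Phi_\mathit{sol} + \Phi_\mathit{int} \approx \mathrm{HTC} \left(T-T_e\right) + \Phi_v \quad \Rightarrow \quad \mathrm{HTC} \approx \frac{\Phi_h + \Phi_\mathit{sol} + \Phi_\mathit{int} - \Phi_v}{T-T_e} \]

With hypothetical average values \(\Phi_h = 2000\) W, \(\Phi_\mathit{sol} = 300\) W, \(\Phi_\mathit{int} = 200\) W, \(\Phi_v = 0\) W and \(T-T_e = 12.5\) K, this yields \(\mathrm{HTC} \approx 2500 / 12.5 = 200\) W/K. Any error in the assumed gains propagates directly to the estimate, which illustrates why the terms of Eq. (2.2) need careful formulation.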

## 2.2 Measurement and modelling boundaries

The first step in setting up a probability model is the choice of its boundaries, i.e. which of the measured variables is the *dependent* variable, and which are the *explanatory* variables.

- The dependent variable \(y\), or model output, is a variable that we wish for a fitted model to be able to predict (this book does not cover situations with several dependent variables in a single model).
- The explanatory variables, or independent variables, are the model inputs by which we try to explain the evolutions of the dependent variable. Explanatory variables are denoted \(x\) in most regression models, or \(u\) in more complex hierarchical models where \(x\) may denote a latent variable instead.
- Some models have *latent* variables, which are unobserved and affect the dependent variable.

The IPMVP defines measurement boundaries as “notional boundaries drawn around equipment, systems or facilities to segregate those which are relevant to saving determination from those which are not. All Energy Consumption and Demand of equipment or systems within the boundary must be measured or estimated. […] Any energy effects occurring beyond the selected measurement boundary are called interactive effects. The magnitude of any interactive effects needs to be estimated or evaluated to determine savings associated with the ECMs.”

The same definition of boundaries works for simulations: a model must be defined so that its inputs and outputs are the measured independent and dependent variables, and all energy effects occurring within these boundaries are either fixed, or part of the list of parameters \(\theta\) that will be estimated by calibration.

Typical modelling boundaries of BES resemble Fig. 2.2. The time-varying inputs provided by the user are weather files and occupancy profiles. Most of the time, the latter come from standard scenarios rather than measurement. Occupancy is understood by BES as a finite set of actions and influences: presence, temperature set-points, use of appliances. The model returns predictions of energy use, usually with a higher level of disaggregation (consumption of each system) than is easily available by measurement.

Other models can have the indoor temperature as output (Fig. 2.3): for instance, heat transfer simulation models used for assessing the performance of the envelope, or for tuning model predictive control strategies.

If the target of a study is to characterise some parameter \(\theta\), or evaluate the evolution of a latent variable, rather than train a predictive model, then the same dataset \(\mathcal{D}\) can be mapped into input and output variables in different ways. Regardless of this choice, the principles of the above definition of measurement boundaries should be applied to modelling: any effects that are believed to influence the dependent variable should be either measured (explanatory variables), given assumed values (interactive effects), or estimated by fitting.

## 2.3 Categories of statistical models

Once the modelling boundaries are set, the next step is choosing the model structure itself, i.e. the equations that relate dependent variables to independent variables and possible latent variables. The next few sections will describe the different categories of models that will be implemented later in the applications, also summarized by Fig. 2.4. Before getting to these descriptions, a summary of some notations and vocabulary they have in common might be helpful.

A **regression** model directly relates the dependent variable with one or several explanatory variables \(x\) and its parameters \(\theta\). Regression concerns dependent variables with continuous values, as opposed to classification which concerns categorical or discrete dependent variables.
\[\begin{equation}
y \sim f\left(x, \theta\right)
\tag{2.6}
\end{equation}\]
In this form, regression models assume that all elements of \(y\) are independent of each other: each measurement is unaffected by previous values. These models are therefore to be used with low frequency or aggregated data for long-term predictions or summaries.
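A minimal sketch of Eq. (2.6) is a linear model with Gaussian noise, \(y_n \sim \mathcal{N}\left(\theta_0 + \theta_1 x_n, \sigma\right)\), fitted here by ordinary least squares on synthetic data. The choice of a linear \(f\), the interpretation of \(x\) as a daily mean outdoor temperature and \(y\) as a daily energy use, and all numerical values are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily data (hypothetical): outdoor temperature and energy use
n = 200
x = rng.uniform(-5.0, 15.0, size=n)            # explanatory variable
theta_true = np.array([100.0, -4.0])           # intercept and slope
y = theta_true[0] + theta_true[1] * x + rng.normal(0.0, 5.0, size=n)

# Ordinary least squares: each y_n is treated as independent of the others
X = np.column_stack([np.ones(n), x])           # design matrix [1, x]
theta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Residual standard deviation, with two estimated parameters
residuals = y - X @ theta_hat
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))
```

The rows of `X` can be shuffled without changing `theta_hat`, which is another way of seeing that this model ignores the order of the measurements.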

Some problems can be modelled **hierarchically**, with observable outcomes modeled conditionally on certain parameters \(\theta\), which themselves are given a probabilistic specification in terms of further parameters \(\phi\), known as hyperparameters.
\[\begin{equation}
\theta \sim f\left(\phi\right)
\tag{2.7}
\end{equation}\]
The hyperparameter \(\phi\) can then be given a *hyperprior* distribution \(p\left(\phi\right)\). As written by Gelman et al. (2013), “simple nonhierarchical models are usually inappropriate for hierarchical data: with few parameters, they generally cannot fit large datasets accurately, whereas with many parameters, they tend to overfit such data […]. In contrast, hierarchical models can have enough parameters to fit the data well, while using a population distribution to structure some dependence into the parameters, thereby avoiding problems of overfitting.” Hierarchical thinking gives flexibility to simple model structures, and will be useful to explain data from a group of buildings, or from a single building monitored over several operating conditions.
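The effect of a population distribution can be sketched on a small synthetic group of buildings. In the snippet below, each building \(j\) has its own parameter \(\theta_j\) drawn from a normal population distribution, following Eq. (2.7). For simplicity, the standard deviations `tau` and `sigma` are treated as known and the population mean is replaced by the average of the group means. A full hierarchical analysis would instead give these quantities a hyperprior and estimate them jointly. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

J, n_obs = 8, 5                      # buildings, observations per building
mu, tau, sigma = 200.0, 30.0, 40.0   # population mean/sd, observation sd

theta = rng.normal(mu, tau, size=J)                      # Eq. (2.7)
y = rng.normal(theta[:, None], sigma, size=(J, n_obs))   # y_jn ~ N(theta_j, sigma)

# No pooling: each building is estimated from its own data only
ybar = y.mean(axis=1)

# Partial pooling: precision-weighted average of the building mean and the
# population mean (conditional posterior mean of theta_j for known tau, sigma)
mu_hat = ybar.mean()                 # plug-in estimate, a simplification
w = (n_obs / sigma**2) / (n_obs / sigma**2 + 1.0 / tau**2)
theta_hat = w * ybar + (1.0 - w) * mu_hat
```

Each partially pooled estimate lies between the building's own mean and the group mean. The fewer or noisier the observations of a building, the smaller `w` becomes and the more its estimate is pulled towards the group.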

Data often come at a frequency high enough that consecutive measurements of the outcome variable cannot be considered independent from each other. **Time-series** models offer many ways to express this dependency (Shumway and Stoffer 2000).
\[\begin{equation}
y_t \sim f\left(y_{t-1}, y_{t-2},..., x, \theta\right)
\tag{2.8}
\end{equation}\]
This type of model is called autoregressive because of the similarity of this formulation with the regression model of Eq. (2.6): the dependent variable at time \(t\) is a regression function of its previous values. The simplest autoregressive models lack explanatory variables \(x\), and only formulate the dependent variable as a regression function of its previous values. They can be used to identify trends and repetitive cycles in a single variable and predict its future values.
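The simplest case mentioned above, an autoregressive model of order one without explanatory variables, can be sketched as follows. The series is simulated with a known coefficient, which is then recovered by regressing \(y_t\) on \(y_{t-1}\). The names `rho_true` and `rho_hat` and the numerical values are hypothetical, and least squares is only one of several possible estimation methods.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate y_t = rho * y_{t-1} + noise, a special case of Eq. (2.8)
n, rho_true, sigma = 2000, 0.8, 1.0
y = np.zeros(n)
for t in range(1, n):
    y[t] = rho_true * y[t - 1] + rng.normal(0.0, sigma)

# Least-squares regression of y_t on its previous value (no intercept)
rho_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

# One-step-ahead prediction from the last observed value
y_next = rho_hat * y[-1]
```

Unlike the regression model of Eq. (2.6), the order of the data matters here: shuffling `y` would destroy the dependency that `rho_hat` measures.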

The last criterion we consider for classifying models into categories is the presence of **latent variables**. A latent variable model is a hierarchical model which relates the observed dependent variable to a set of unobservable latent variables. For each outcome \(y_n\) there is a latent variable \(z_n\) in \(\left\{1,...,K\right\}\) with a categorical distribution parameterized by some parameter \(\lambda\).
\[\begin{align}
y_n \sim f(z_n) \\
z_n \sim \mathrm{categorical}(\lambda)
\tag{2.9}
\end{align}\]
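Eq. (2.9) can be read as a two-step sampling procedure, sketched below with \(K = 2\) categories. The Gaussian form of \(f\), the interpretation of the categories as two operating regimes, and all numerical values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 5000
lam = np.array([0.7, 0.3])       # categorical probabilities, sum to 1
means = np.array([20.0, 60.0])   # hypothetical mean outcome in each category
sds = np.array([5.0, 10.0])

# Step 1: draw the latent category of each observation
# (indexed 0..K-1 here, instead of 1..K in the text)
z = rng.choice(len(lam), size=n, p=lam)

# Step 2: draw each outcome conditionally on its latent category
y = rng.normal(means[z], sds[z])

# In practice only y is observed; z has to be inferred
share_first = np.mean(z == 0)
```

When fitting such a model, only `y` is available, and the category of each observation is inferred along with `lam`, `means` and `sds`.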
Finite mixture models can be parameterized as latent variable models, although the description we will make of them does not explicitly display latent variables. The models which will play the largest role in the next few chapters of this book are **time series models with latent variables**, or state-space models (SSM).
\[\begin{align}
y_t \sim f\left(z_t, \theta\right) \\
z_t \sim f\left(z_{t-1}, \theta\right)
\tag{2.10}
\end{align}\]

An SSM is a type of Dynamic Bayesian Network (DBN) (Murphy 2002) where an underlying hidden state \(z_t\) generates the observations \(y_t\). The state evolves in time as a function of observable inputs. State-space models expand the classical time-series modelling approaches by allowing more complex assumptions on the evolution of the system and its uncertainty. A Hidden Markov Model (HMM) is a type of DBN whose hidden state takes discrete values. We will use the term state-space model for models whose hidden states are continuous, although some authors call them Kalman filter models (KFM). DBNs with both categorical and continuous hidden states are called switching dynamic systems or switching Kalman filter models, and are the highest complexity we will consider in this book.
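A minimal sketch of a continuous state-space model is given below: a scalar linear Gaussian SSM driven by an observable input \(u_t\), followed by a standard Kalman filter that estimates the hidden state from noisy observations. The linear form, the coefficients `a` and `b`, the noise variances `q` and `r` and the sinusoidal input are all hypothetical, and the parameters are taken as known here, whereas in practice they would be estimated.

```python
import numpy as np

rng = np.random.default_rng(3)

n, a, b = 500, 0.95, 0.05      # state transition and input coefficients
q, r = 0.1**2, 0.5**2          # process and observation noise variances
u = 10.0 + 5.0 * np.sin(2.0 * np.pi * np.arange(n) / 24.0)  # observable input

# Simulate the hidden state z_t and the observations y_t, as in Eq. (2.10)
z = np.empty(n)
y = np.empty(n)
z_prev = 10.0
for t in range(n):
    z[t] = a * z_prev + b * u[t] + rng.normal(0.0, np.sqrt(q))
    y[t] = z[t] + rng.normal(0.0, np.sqrt(r))
    z_prev = z[t]

# Kalman filter: alternate prediction and update of the state mean m, variance P
m, P = 10.0, 1.0               # prior mean and variance of the initial state
z_filt = np.empty(n)
for t in range(n):
    m_pred = a * m + b * u[t]              # predicted state mean
    P_pred = a * a * P + q                 # predicted state variance
    K = P_pred / (P_pred + r)              # Kalman gain
    m = m_pred + K * (y[t] - m_pred)       # update with the new observation
    P = (1.0 - K) * P_pred
    z_filt[t] = m

rmse_obs = np.sqrt(np.mean((y - z) ** 2))        # error of raw observations
rmse_filt = np.sqrt(np.mean((z_filt - z) ** 2))  # error of filtered estimate
```

Because the filter combines each observation with the prediction from the state equation, the filtered estimate is expected to track the hidden state more closely than the raw observations do.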

### References

Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. *Bayesian Data Analysis*. CRC press.

Murphy, Kevin Patrick. 2002. “Dynamic Bayesian Networks: Representation, Inference and Learning.”

Shumway, Robert H, and David S Stoffer. 2000. *Time Series Analysis and Its Applications*. Vol. 3. Springer.