Chapter 3 A Bayesian data analysis workflow

3.1 Bayesian inference summarised

3.1.1 Motivation for a Bayesian approach

Bayesian statistics are mentioned in the Annex B of the ASHRAE Guideline 14, after it has been observed that standard approaches make it difficult to estimate the savings uncertainty when complex models are required:

“Savings uncertainty can only be determined exactly when energy use is a linear function of some independent variable(s). For more complicated models of energy use, such as changepoint models, and for data with serially autocorrelated errors, approximate formulas must be used. These approximations provide reasonable accuracy when compared with simulated data, but in general it is difficult to determine their accuracy in any given situation. One alternative method for determining savings uncertainty to any desired degree of accuracy is to use a Bayesian approach.”

Several advantages and drawbacks of Bayesian approaches are described by (Carstens, Xia, and Yadavalli (2018)). Advantages include:

  • Because Bayesian models are probabilistic, uncertainty is automatically and exactly quantified. Confidence intervals can be interpreted in the way most people understand them: degrees of belief about the value of the parameter.
  • Bayesian models are more universal and flexible than standard methods. Models are also modular and can be designed to suit the problem. For example, it is no different to create terms for serial correlation, or heteroscedasticity (non-constant variance) than it is to specify an ordinary linear model.
  • The Bayesian approach allows for the incorporation of prior information where appropriate.
  • When the savings need to be calculated for “normalised conditions”, for example, a “typical meteorological year”, rather than the conditions during the post-retrofit monitoring period, it is not possible to quantify uncertainty using current methods. However, (Shonder and Im (2012)) have shown that it can be naturally and easily quantified using the Bayesian approach.

The first two points above are the most relevant to a data analyst: any arbitrary model structure can be defined to explain the data, and the exact same set of formulas can then be used to obtain the savings uncertainty after the models have been fitted.

3.1.2 Theory of Bayesian inference and prediction

A Bayesian model is defined by two components:

  • An observational model \(p\left(y|\theta\right)\), or likelihood function, which describes the relationship between the data \(y\) and the model parameters \(\theta\).
  • A prior model \(p(\theta)\) which encodes eventual assumptions regarding model parameters, independently of the observed data. Specifying prior densities is not mandatory.

The target of Bayesian inference is the estimation of the posterior density \(p\left(\theta|y\right)\), i.e. the probability distribution of the parameters conditioned on the observed data. As a consequence of Bayes’ rule, the posterior is proportional to the product of the two previous densities:

\[\begin{equation} p(\theta|y) \propto p(y|\theta) p(\theta) \end{equation}\]

This formula can be interpreted as follows: the posterior density is a compromise between assumptions and evidence brought by data. The prior can be “strong” or “weak”, to reflect for a more or less confident prior knowledge. The posterior will stray away from the prior as more data is introduced.

Example of estimating a set point temperature after assuming a Normal prior distribution centred around 20°C. The dashed line is the point estimate which would have been obtained if only the data had been considered. The posterior distribution can be seen as a “refinement” of the prior, given the evidence of the data.

Figure 3.1: Example of estimating a set point temperature after assuming a Normal prior distribution centred around 20°C. The dashed line is the point estimate which would have been obtained if only the data had been considered. The posterior distribution can be seen as a “refinement” of the prior, given the evidence of the data.

In many applications, one is not only interested in estimating parameter values, but also the predictions \(\tilde{y}\) of the observable during a new period. The distribution of \(\tilde{y}\) conditioned on the observed data \(y\) is called the posterior predictive distribution: \[\begin{equation} p\left(\tilde{y}|y\right) = \int p\left(\tilde{y}|\theta\right) p\left(\theta|y\right) \mathrm{d}\theta \end{equation}\] The posterior predictive distribution is an average of the model predictions over the posterior distribution of \(\theta\). This formula is equivalent to the concept of using a trained model for prediction.

Apart from the possibility to define prior distributions, the main specificity of Bayesian analysis is the fact that all variables are encoded as probability densities. The two main results, the parameter posterior \(p(\theta|y)\) and the posterior prediction \(p\left(\tilde{y}|y\right)\), are not only point estimates but complete distributions which include a full description of their uncertainty.

3.1.3 Bayesian formulation of common models

If the practitioner wishes to use a regression model to explain the relationship between the parameters and the data, doing so in a Bayesian framework is very similar to the usual (frequentist) framework. As an example, a Bayesian model for linear regression with three parameters \((\theta_0,\theta_1,\theta_2)\) and two explanatory variables \((X_1,X_2)\) may read: \[\begin{align} p(y|\theta,X) & = N\left(\theta_0, \theta_1 X_1, \theta_2 X_2, \sigma\right) \\ p(\theta_i) & = \Gamma(\alpha_i, \beta_i) \end{align}\]

This means that \(y\) follows a Normal distribution whose expectation is a linear function of \(\theta\) and \(X\), with standard deviation \(\sigma\) (the measurement error). The second equation is the prior model: in this example, each parameter is assigned a Gamma prior distribution parameterised by a shape \(\alpha\) and a scale \(\beta\). Other model structures can be formulated similarly: change-point models, polynomials, models with categorical variables… Bayesian modelling however allows for much more flexibility:

  • Other distributions than the Normal distribution can be used in the observational model;
  • Hierarchical modelling is possible: parameters can be assigned a prior distribution with parameters which have their own (hyper)prior distribution;
  • Heteroscedasticity can be encoded by assuming a relationship between the error term and explanatory variables, etc.

More complex models with latent variables have separate expressions for the respective conditional probabilities of the observations \(y\), latent variables \(z\) and parameters \(\theta\). In this case, there is a joint likelihood function \(p(y,z|\theta)\) and a marginal likelihood function \(p(y|\theta)\) so that: \[\begin{equation} p(y|\theta) = \int p(y,z|\theta) \mathrm{d}z \end{equation}\]

Other applications, such as the IPMVP option D, rely on the use of calibrated building energy simulation (BES) models. These models are described by a much larger number of parameters and equations that the simple regression models typically used for other IPMVP options. In this context, it is not feasible to fully describe BES models in the form of a simple likelihood function \(p(y|\theta)\). In order to apply Bayesian uncertainty analysis to a BES model, it is possible to first approximate it with a Gaussian process (GP) model emulator. This process is denoted Bayesian calibration and was based on the seminal work of (Kennedy and O’Hagan (2001)). As opposed to the manual adjustment of building energy model parameters, Bayesian calibration explicitly quantifies uncertainties in calibration parameters, discrepancies between model predictions and observed values, as well as observation errors (Chong and Menberg (2018)).

3.1.4 Computation: Markov Chain Monte Carlo

Except in a few convenient situations, the posterior distribution is not analytically tractable. In practice, rather than finding an exact solution for it, it is estimated by approximate methods. The most popular option for approximate posterior inference are Markov Chain Monte Carlo (MCMC) sampling methods.

An MCMC algorithm returns a series of simulation draws \(\left(\theta^{(1)},...,\theta^{(S)}\right)\) which approximate the posterior distribution, provided that the sample size is large enough. \[\begin{equation} \theta^{(s)} \sim p(\theta | y) \end{equation}\] where each draw \(\theta^{(s)}\) contains a value for each of the parameters of the model.

MCMC methods, Gaussian processes, and other methods for posterior approximation, are implemented in several software platforms. Two free and well-documented examples are: Stan, a platform for statistical modelling which interfaces with most data analysis languages (R, Python, Julia, Matlab); pyMC3, a Python library for probabilistic programming.

Convergence diagnostics are an essential step after running an MCMC algorithm for a finite number of simulation draws. It is advised to have several chains run in parallel, in order to compare their distributions. The similarity of estimates between chains is assessed by the \(\hat{R}\) convergence diagnostic (Vehtari et al. (2021)). The second diagnostic is the effective sample size (ESS), an estimate of the number of independent samples from the posterior distribution.

Once trained, a model may be validated with the same metrics as in a standard M&V approach: \(R^2\), NDB, CV-RMSE, F-statistic, etc. Since the posterior distribution is described by a finite (yet large) set of values, computing the statistics of any function \(h\) of the parameters (predictions, savings…) becomes straightforward. Because the series \(\left(\theta^{(1)},...,\theta^{(S)}\right)\) approximates the posterior distribution of \(\theta\), then the series \(\left(h(\theta^{(1)}),...,h(\theta^{(S)})\right)\) approximates the posterior distribution of \(h\).

3.2 Workflow for one model

3.2.1 Overview

As was mentioned in Sec. 1.2, inverse problems are all but trivial. It is possible that the available data is simply insufficient to bring useful inferences, but that we still try to train an unsuitable model with it. Statistical analysts need the right tools to guide model selection and training, and to warn them when there is a risk of biased inferences and predictions.

This chapter is an attempt to summarize the essential points of a Bayesian workflow from a building energy perspective. Frequentist inference is also mentioned, but as a particular case of Bayesian inference.

There is a very rich literature on the proper workflow for statistical inference, including the most cited reference in this book (Gelman et al. (2013)) and extensive online tutorials. Gelman divides the process of Bayesian data analysis into three steps:

  • Setting up a full probability model;
  • Conditioning on observed data (learning);
  • Evaluating the fit of the model and the implications of the resulting posterior (checking and validation).
A workflow for the proper specification and training of one model. Most of the workflow is similar for frequentist and Bayesian inference.

Figure 3.2: A workflow for the proper specification and training of one model. Most of the workflow is similar for frequentist and Bayesian inference.

This process concerns the training of a single model. An analyst however rarely attempts to analyze data with a single model. A simple model will provide biased inferences and predictions if its structure is too simple to represent the real system. In a complex model with many degrees of freedom, the unicity of the solution to the inverse problem is not guaranteed, leading to non-robust inferences and overfitting. All statistical learning lectures therefore come with guidelines for model checking, assessment and selection: this will be explained in Sec. 3.3.

3.2.2 Model specification and prior checking

(to be completed)

3.2.3 Identifiability analysis

(to be completed)

3.2.4 Convergence diagnostics

(to be completed)

3.2.5 Residuals analysis

(to be completed)

3.2.6 Posterior predictive checking

(to be completed)

3.2.7 Inference diagnostics and practical identifiabiltiy

(to be completed)

3.3 Model assessment and selection

The “single-model training and validation” workflow shown on Fig. 3.2 is embedded in a larger process for the selection of the appropriate model complexity and structure. Two alternatives a shown on Fig. 3.3 and 3.4. The first one is the most common and intuitive: fitting models of gradually increasing complexity, keeping the ones that pass our validation checks, and comparing them in terms of predictive performance criteria. Inferences from simpler models may serve as starting values for more complex ones. This forward stepwise selection is suited to find a simple a robust model for prediction. The second alternative, on Fig. 3.4, is backward selection: starting from a high number of predictors and fitting models of decreasing complexity.

A workflow of gradually increasing model complexity

Figure 3.3: A workflow of gradually increasing model complexity

A workflow of gradually decreasing model complexity

Figure 3.4: A workflow of gradually decreasing model complexity

(to be completed)

References

Carstens, Herman, Xiaohua Xia, and Sarma Yadavalli. 2018. “Bayesian Energy Measurement and Verification Analysis.” Energies 11 (2): 380.

Chong, Adrian, and Kathrin Menberg. 2018. “Guidelines for the Bayesian Calibration of Building Energy Models.” Energy and Buildings 174: 527–47.

Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. CRC press.

Kennedy, Marc C, and Anthony O’Hagan. 2001. “Bayesian Calibration of Computer Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (3): 425–64.

Shonder, John A, and Piljae Im. 2012. “Bayesian Analysis of Savings from Retrofit Projects.” ASHRAE Transactions 118: 367.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved R for Assessing Convergence of Mcmc.” Bayesian Analysis 1 (1): 1–28.