Chapter 8 Hidden Markov models
8.1 Principles
In a hidden Markov model (HMM), the sequence of \(T\) output variables \(\left\{y_t\right\}\) is generated by a parallel sequence of latent categorical state variables \(z_t \in \left\{ 1,\cdots , K\right\}\). These hidden states form a Markov chain, so that \(z_t\) is conditionally independent of all earlier variables given \(z_{t-1}\).
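Formally, the Markov property of the states and the conditional independence of the outputs read: \[ p\left(z_t \,|\, z_{0:t-1}, y_{0:t-1}\right) = p\left(z_t \,|\, z_{t-1}\right), \qquad p\left(y_t \,|\, z_{0:t}, y_{0:t-1}\right) = p\left(y_t \,|\, z_t\right) \]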
The primary use of HMMs in building energy simulation is for identifying and modelling occupancy (Candanedo, Feldheim, and Deramaix (2017)). The hidden variable \(z_t\) may denote a binary state of occupancy or a finite number of occupants; the observed output \(y_t\) may denote any measurement impacted by occupancy: environmental sensors, smart meters, cameras, infrared sensors… There is a vast literature on occupancy detection and estimation from different types of data and methods. The reader is referred to the review of Chen et al. (Chen, Jiang, and Xie (2018)) for a more comprehensive overview.
An HMM is defined by the following elements (a short numerical sketch follows this list):
- A sequence of hidden states \(\left\{z_t\right\} \in \mathbb{N}^T\), each of which can take a finite number of values: \(z_t \in \left\{1,\cdots,K\right\}\)
- An observed variable \(\left\{y_t\right\} \in \mathbb{R}^T\)
- An initial probability \(\pi_0\), which is the probability distribution of the state at time \(t=0\). \(\pi_0\) is a \(K\)-simplex, i.e. a \(K\)-vector whose components sum to 1.
- A \(K\times K\) one-step transition probability matrix \(\mathbf{A}(t) = \left(a_{ij}(t)\right)\) so that \[\begin{equation} a_{ij}(t)=p\left(z_t=j |z_{t-1}=i\right) \tag{8.1} \end{equation}\] is the probability at time \(t\) for the hidden variable to switch from state \(i\) to state \(j\). Like \(\pi_0\), each row of \(\mathbf{A}(t)\) must sum to 1.
- Emission probabilities \(b_{i}(y_t) = p(y_t | z_t=i)\), i.e. the probability distribution of the output variable given each possible state.
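As a minimal numerical sketch (the values below are arbitrary and only meant to make the notation concrete), a homogeneous two-state HMM with Normal emissions could be specified as:

import numpy as np

K = 2                           # two hidden states, e.g. "unoccupied" / "occupied"
pi0 = np.array([0.5, 0.5])      # initial state probabilities: a K-simplex
A = np.array([[0.95, 0.05],     # a_ij = p(z_t = j | z_{t-1} = i):
              [0.10, 0.90]])    # each row sums to 1
mu = np.array([0.2, 0.6])       # emission means, one per state
sig = np.array([0.1, 0.1])      # emission standard deviations, one per state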
The transition probabilities \(\left(a_{ij}(t)\right)\) are shown here as functions of time because they can be formulated as parametric expressions of external observed variables, such as the time of the day or weather variables. The Markov chain is then called inhomogeneous. Inhomogeneous transition probability matrices can capture occupancy properties at different time instances and thus encode occupancy dynamics: there can be a higher probability of people entering the building at a certain time of the day, or if the outdoor temperature is high, etc.
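As an illustration (a hypothetical parametrization, not the one used in the tutorial below, which is homogeneous), the switching probabilities of a two-state chain can be made logistic functions of the hour of the day; the coefficients beta and gamma would normally be estimated along with the other parameters.

import numpy as np

def transition_matrix(hour, beta=(-3.0, 0.25), gamma=(2.0, -0.3)):
    # Hypothetical inhomogeneous transition matrix A(t) for a 2-state chain:
    # the probability of switching states is a logistic function of the hour.
    p12 = 1 / (1 + np.exp(-(beta[0] + beta[1] * hour)))    # p(z_t=2 | z_{t-1}=1)
    p21 = 1 / (1 + np.exp(-(gamma[0] + gamma[1] * hour)))  # p(z_t=1 | z_{t-1}=2)
    return np.array([[1 - p12, p12],
                     [p21, 1 - p21]])   # each row sums to 1

print(transition_matrix(hour=8))   # morning: switching to state 2 is more likely
print(transition_matrix(hour=0))   # night: switching back to state 1 is more likely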
Training an HMM means finding the set of parameters \(\theta\) which best explain the observations. This can be done in at least two ways: the first option is to compute the likelihood function with the forward algorithm, explained below, and to perform its direct numerical maximization. The second option is the Baum-Welch algorithm, a special case of the expectation-maximization (EM) algorithm: it alternates between a forward-backward algorithm for the expectation step and an updating phase for the maximization step.
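For reference, the likelihood maximized by the first option is the marginal probability of the observations, obtained by summing the joint density over all possible state sequences: \[ p(y_{0:T}|\theta) = \sum_{z_{0:T}} \pi_{0,z_0} \, b_{z_0}(y_0) \prod_{t=1}^{T} a_{z_{t-1} z_t}(t) \, b_{z_t}(y_t) \] A naive evaluation of this sum involves \(K^{T+1}\) terms; the forward algorithm computes it recursively in \(\mathcal{O}(T K^2)\) operations.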
A trained HMM can then be used to predict states and future values of the outcome variable. The estimation of the most likely state sequence given the observations is called decoding: \[\begin{equation} z^*_{0:T} = \mathrm{arg max}_z \; p(z_{0:T} | y_{0:T}) \tag{8.2} \end{equation}\] and can be done by the Viterbi algorithm described below.
8.1.1 The forward algorithm
The forward algorithm computes the likelihood function \(p(y_{1:T}|\theta)\) of a hidden Markov model, given the values of its parameters \(\theta\). These parameters come from the parametric expressions of \(\pi_0\), \(a_{ij}(t)\) and \(b_{i}\).
The algorithm works by calculating the components of a \([T \times K]\) matrix \(\alpha\) of forward variables defined by: \[\begin{equation} \alpha_{tj} = p(y_{0:t},z_t=j) \tag{8.3} \end{equation}\] The forward variable \(\alpha_{tj}\) is the joint probability of the observations up to time \(t\), and of the current hidden variable \(z_t\) being in state \(j\). The \(\alpha\) matrix is filled row by row, and the total likelihood of the data is the sum of its last row.
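For reference, the initialization and the row-by-row recursion are: \[ \alpha_{0j} = \pi_{0,j} \, b_j(y_0), \qquad \alpha_{tj} = b_j(y_t) \sum_{i=1}^{K} \alpha_{t-1,i} \, a_{ij}(t), \qquad p(y_{0:T}) = \sum_{j=1}^{K} \alpha_{Tj} \] In practice, as in the tutorial below, the recursion is carried out on \(\log\alpha\) with the log-sum-exp trick to avoid numerical underflow.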
8.1.2 The Viterbi algorithm
Suppose a trained HMM with known parameters: the initial probability \(\pi_0\), the transition matrix \(\mathbf{A}(t)\) and the emission probabilities \(\mathrm{b}(y)\). The Viterbi algorithm finds the best sequence of hidden states, so that their probability conditioned on the data \(y_{0:T}\) is at a maximum: \[\begin{equation} z^*_{0:T} = \mathrm{arg max}_z \; p(z_{0:T} | y_{0:T}) \tag{8.4} \end{equation}\] This process is called global decoding, and is one of the possible inferences that can be drawn from an HMM.
The Viterbi algorithm looks similar to the forward algorithm, with the sum replaced by a maximum and an additional backtracking phase. It produces two \([T \times K]\) matrices \(\delta\) and \(\psi\) and fills them row by row in a forward phase. Then, a backtracking phase computes \(z^*_{0:T}\) from the \(\psi\) matrix. The algorithm is described in more detail below.
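Written with log probabilities, as in the tutorial code below, the forward phase fills \(\delta\) and \(\psi\) as: \[ \delta_{0j} = \log \pi_{0,j} + \log b_j(y_0), \qquad \delta_{tj} = \max_i \left[\delta_{t-1,i} + \log a_{ij}(t)\right] + \log b_j(y_t), \qquad \psi_{tj} = \mathrm{arg max}_i \left[\delta_{t-1,i} + \log a_{ij}(t)\right] \] The backtracking phase then sets \(z^*_T = \mathrm{arg max}_j \, \delta_{Tj}\) and \(z^*_t = \psi_{t+1,\, z^*_{t+1}}\) for \(t = T-1, \dots, 0\).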
8.2 Tutorial (Python)
Hidden Markov models are also covered in the Stan user’s guide.
This tutorial uses the same data as the ARMAX example from chapter 7: Building 1298 from the ASHRAE Kaggle competition. This building’s meter data is available in the book’s repository.
We will focus again on meter 0 (electricity), but the reader is free to use the same code for any of the other three meters of this data.
This tutorial implements a simple homogeneous HMM written in Python and doesn’t use Stan for once. The point is to show how to code the forward algorithm and the Viterbi algorithm from scratch.
Let us start by importing some libraries, reading the data file and selecting which variable will be the dependent variable of our model.
# importing the three main libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing parts of scipy
from scipy.special import logsumexp
from scipy.optimize import minimize
from scipy.stats import norm

# Reading the data file
df = pd.read_csv('data/building_1298.csv')
df.set_index(pd.to_datetime(df['datetime']), inplace=True, drop=True)
df.fillna(method='ffill', inplace=True)
The forward algorithm described above is written here. At any of the \(N\) time steps, the Markov chain can take one of \(K\) states. We assume that the emission probabilities are Normal with means \(\mu_i\) and standard deviations \(\sigma_i\): \[\begin{equation} b_i(y_t) = p(y_t|z_t=i) = N(\mu_i, \sigma_i) \tag{8.5} \end{equation}\]
The parameters of the forward algorithm are the transition matrix \(a\), and the parameters of the emission probabilities \(\mu\) and \(\sigma\):
def forward(y, a, mu, sig):
    """ Calculates the likelihood from parameters a, mu and sig
    Arguments:
        y: dependent variable [N]
        a: transition matrix [KxK]
        mu: emission means [K]
        sig: emission standard deviations [K]
    Returns:
        The total log-likelihood
    """
    N = len(y)
    logalpha = np.zeros((N, K))   # log of the forward variable defined above
    # Initialisation
    pi0 = 1/K * np.ones(K)        # initial probabilities. Supposed known here.
    logalpha[0] = np.log(pi0) + norm.logpdf(y[0], loc=mu, scale=sig)
    # Recursion
    for t in range(1, N):
        for j in range(K):
            logalpha[t, j] = logsumexp(logalpha[t-1, :] + np.log(a[:, j])
                                       + norm.logpdf(y[t], loc=mu[j], scale=sig[j]))
    # Termination
    return logsumexp(logalpha[-1])
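Note that the function relies on the number of states K being defined in the global scope. As a quick sanity check (the parameter values below are arbitrary, not estimated from the data), it can be called on a short synthetic series and should return a finite log-likelihood:

K = 2
y_test = np.random.rand(10)      # short synthetic series
a_test = np.array([[0.9, 0.1],   # arbitrary transition matrix, rows sum to 1
                   [0.2, 0.8]])
print(forward(y_test, a_test, np.array([0.2, 0.6]), np.array([0.1, 0.1])))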
The next block uses this forward algorithm to train the HMM. Just like in the ARMAX example, we only use a one-month subset of the whole training dataset.
Training is done by the scipy.optimize.minimize() function. Before using it, we need to define an objective function to minimize. This function will take a single array \(x\) as argument and return the value we aim to minimize, which is the negative log-likelihood from the forward algorithm.
# Training subset
training_start = '2016-01-01'
training_end = '2016-01-31'
df_train = df.drop(df.index[(df.index < pd.to_datetime(training_start)) |
                            (df.index > pd.to_datetime(training_end))])

# choosing meter 0 as dependent variable
df_train['y'] = df_train['m0']
# removing some outliers
#df['y'][df['m0'] < 300] = df['m0'].mean()
# normalizing y between 0 and 1
df_train['y'] = (df_train['y'] - df_train['y'].min()) / (df_train['y'].max() - df_train['y'].min())

def objective(x):
    # Reshaping the parameter vector x into the three variables of the forward algorithm
    a1 = np.reshape(x[:K*(K-1)], (K, K-1))    # Matrix a without the right column
    a2 = (1 - a1.sum(axis=1))[:, np.newaxis]  # Right column of matrix a
    a = np.concatenate([a1, a2], axis=1)
    mu = x[K*(K-1):K*(K-1)+K]
    sig = x[K*(K-1)+K:]
    # Returns the minus log likelihood
    return -forward(df_train['y'].values, a, mu, sig)

# Initial parameter values to be passed to scipy.optimize.minimize()
K = 2                       # number of possible states
a_init = np.array([[0.9], [0.1]])
mu_init = [0.2, 0.6]        # emission means
sig_init = [0.1, 0.1]       # emission standard deviations

# Parameters are assembled into a single array x with given bounds
x0 = np.concatenate([a_init.flatten(), mu_init, sig_init])
bounds = (K*(K-1)*[(0, 1)] + 2*K*[(0, None)])

# Training
res = minimize(objective, x0, bounds=bounds)

# Variables are recovered from the fitted x array
a1 = np.reshape(res.x[:K*(K-1)], (K, K-1))
a2 = (1 - a1.sum(axis=1))[:, np.newaxis]
a = np.concatenate([a1, a2], axis=1)
mu = res.x[K*(K-1):K*(K-1)+K]
sig = res.x[K*(K-1)+K:]
Now that the parameters of the HMM have been estimated, we can decode it, i.e. estimate the most likely state sequence. This is done by the Viterbi algorithm below.
y = df_train['y'].values
N = len(y)

z = np.zeros(N)               # hidden state to be determined
best_logp = np.zeros((N, K))  # delta in the description above
back_ptr = np.zeros((N, K))   # psi in the description above

# Initialisation
best_logp[0] = norm.logpdf(y[0], loc=mu, scale=sig)

# Recursion
for t in range(1, N):
    for k in range(K):
        logp = best_logp[t - 1] + np.log(a[:, k]) + norm.logpdf(y[t], loc=mu[k], scale=sig[k])
        best_logp[t, k] = np.max(logp)
        back_ptr[t, k] = np.argmax(logp)

# Backtracking
z[-1] = np.argmax(best_logp[-1])
for t in range(1, N):
    z[-1 - t] = back_ptr[-1 - t + 1, int(z[-1 - t + 1])]
We can finish by inferring the predicted output from the estimated states and parameters of the HMM:
y_star_mean = np.zeros(N)
y_star_std = np.zeros(N)
for k in range(K):
    y_star_mean[z == k] = mu[k]
    y_star_std[z == k] = sig[k]

y_star = np.random.normal(loc=y_star_mean, scale=y_star_std)
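One possible way to visualize the result (a sketch; the styling choices are arbitrary) is to overlay the normalized observations with the output simulated from the decoded states:

plt.figure(figsize=(10, 4))
plt.plot(df_train.index, y, label='observed (normalized)')
plt.plot(df_train.index, y_star, label='simulated from decoded states', alpha=0.7)
plt.legend()
plt.show()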