Chapter 8 Hidden Markov models

8.1 Principles

In a hidden Markov model (HMM), the sequence of \(T\) output variables \(\left\{y_t\right\}\) is generated by a parallel sequence of latent categorical state variables \(z_t \in \left\{ 1,\cdots , K\right\}\). These hidden states form a Markov chain, so that \(z_t\) is conditionally independent of all earlier states and observations given \(z_{t-1}\).


Figure 8.1: Hidden Markov model. Grey nodes denote the observed output sequence; white nodes denote the hidden states. Hidden states are conditioned on each other, while each observation is only conditioned on the current state

The primary use of HMMs in building energy simulation is for identifying and modelling occupancy (Candanedo, Feldheim, and Deramaix 2017). The hidden variable \(z_t\) may denote a binary state of occupancy or a finite number of occupants; the observed output \(y_t\) may denote any measurement impacted by occupancy: environmental sensors, smart meters, cameras, infrared sensors… There is a vast literature on occupancy detection and estimation from different types of data and methods. The reader is referred to the review of Chen et al. (Chen, Jiang, and Xie 2018) for a more comprehensive insight.

A HMM is defined by:

  • A sequence of hidden states \(\left\{z_t\right\} \in \mathbb{N}^T\), each of which can take a finite number of values: \(z_t \in \left\{1,\dots,K\right\}\)
  • An observed variable \(\left\{y_t\right\} \in \mathbb{R}^T\)
  • An initial probability vector \(\pi_0\), which gives the distribution of the state at time \(t=0\). \(\pi_0\) is a \(K\)-simplex, i.e. a \(K\)-vector whose components sum to 1.
  • A \(K\times K\) one-step transition probability matrix \(\mathbf{A}(t) = \left(a_{ij}(t)\right)\) so that \[\begin{equation} a_{ij}(t)=p\left(z_t=j |z_{t-1}=i\right) \tag{8.1} \end{equation}\] is the probability at time \(t\) for the hidden variable to switch from state \(i\) to state \(j\). Like \(\pi_0\), each row of \(\mathbf{A}(t)\) must sum to 1.
  • Emission probabilities \(b_{i}(y_t) = p(y_t | z_t=i)\), i.e. the probability distribution of the output variable given each possible state.

The transition probabilities \(\left(a_{ij}(t)\right)\) are shown here as functions of time because they can be formulated as parametric expressions of external observed variables, such as the time of day or weather variables. The Markov chain is then called inhomogeneous. Inhomogeneous transition probability matrices can capture occupancy properties at different times and thus encode occupancy dynamics: there can be a higher probability of people entering the building at a certain time of the day, or if the outdoor temperature is high, etc.
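For instance, here is a minimal sketch of an inhomogeneous two-state chain, where the probability of switching to the occupied state depends on the hour of the day through a logistic link. The covariate and all coefficient values are illustrative assumptions, not a model used later in this chapter.

import numpy as np

def transition_matrix(hour, beta=(-3.0, 4.0)):
    """Illustrative 2x2 transition matrix for a vacant/occupied chain, where the
    switching probabilities depend on the hour of the day through a logistic link.
    All coefficient values are made up."""
    work = 1.0 if 8 <= hour <= 18 else 0.0                      # working-hours covariate
    p01 = 1 / (1 + np.exp(-(beta[0] + beta[1] * work)))         # vacant -> occupied
    p10 = 1 / (1 + np.exp(-(beta[0] + beta[1] * (1 - work))))   # occupied -> vacant
    return np.array([[1 - p01, p01],
                     [p10, 1 - p10]])                           # each row sums to 1

print(transition_matrix(10))  # working hours: switching to "occupied" is more likely
print(transition_matrix(3))   # night: switching back to "vacant" is more likely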

Training an HMM means finding the set of parameters \(\theta\) which best explain the observations. This can be done in at least two ways: the first option is to compute the likelihood function with the forward algorithm, explained below, and to perform its direct numerical maximization. The second option is the Baum-Welch algorithm, a special case of the expectation-maximization (EM) algorithm: it alternates between a forward-backward pass for the expectation step and a parameter update for the maximization step.
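With the first option, training amounts to a maximum likelihood estimation, \[\hat{\theta} = \mathrm{arg max}_\theta \; \log p(y_{0:T}|\theta)\] where the log-likelihood is evaluated by the forward algorithm at each iteration of a numerical optimizer; this is the approach taken in the tutorial of section 8.2.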

A trained HMM can then be used to predict states and future values of the outcome variable. The estimation of the most likely state sequence given the observations is called decoding: \[\begin{equation} z^*_{0:T} = \mathrm{arg max}_z \; p(z_{0:T} | y_{0:T}) \tag{8.2} \end{equation}\] It can be done by the Viterbi algorithm described below.

8.1.1 The forward algorithm

The forward algorithm computes the likelihood function \(p(y_{1:T}|\theta)\) of a hidden Markov model, given the values of its parameters \(\theta\). These parameters come from the parametric expressions of \(\pi_0\), \(a_{ij}(t)\) and \(b_{i}\).

The algorithm works by calculating the components of a \([T \times K]\) matrix \(\alpha\) of forward variables defined by: \[\begin{equation} \alpha_{ij} = p(y_{0:i},z_i=j) \tag{8.3} \end{equation}\] The forward variable \(\alpha_{ij}\) is the joint probability of observations up to time \(i\), and of the current hidden variable \(z_i\) being in state \(j\). The \(\alpha\) matrix is filled row by row, and the total likelihood of the data is the sum of its last row.
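With \(t\) indexing rows (time) and \(j\) indexing columns (states), the standard recursion reads \[\alpha_{0,j} = \pi_{0,j}\, b_j(y_0), \qquad \alpha_{t,j} = b_j(y_t) \sum_{i=1}^{K} \alpha_{t-1,i}\, a_{ij}(t), \qquad p(y_{0:T}) = \sum_{j=1}^{K} \alpha_{T,j}\] In practice, the computation is carried out on the logarithms of these quantities to avoid numerical underflow, as in the tutorial below.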


Figure 8.2: Forward algorithm for computing the likelihood of the data \(y_{1:T}\) given the parameters of a HMM

8.1.2 The Viterbi algorithm

Suppose a trained HMM with known initial probability \(\pi_0\), transition matrix \(\mathbf{A}(t)\) and emission probabilities \(\mathrm{b}(y)\). The Viterbi algorithm finds the best sequence of hidden states, i.e. the sequence whose probability conditioned on the data \(y_{0:T}\) is maximal: \[\begin{equation} z^*_{0:T} = \mathrm{arg max}_z \; p(z_{0:T} | y_{0:T}) \tag{8.4} \end{equation}\] This process is called global decoding and is one of the possible inferences from HMMs.

The Viterbi algorithm is structured like the forward algorithm, with sums replaced by maximizations and an additional backtracking phase. It produces two \([T \times K]\) matrices \(\delta\) and \(\psi\), filled row by row in a forward phase; a backtracking phase then recovers \(z^*_{0:T}\) from the \(\psi\) matrix, as detailed in figure 8.3.
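In equations, the forward phase computes, for each time step and each state, \[\delta_{0,j} = \pi_{0,j}\, b_j(y_0), \qquad \delta_{t,j} = \max_{i}\left(\delta_{t-1,i}\, a_{ij}(t)\right) b_j(y_t), \qquad \psi_{t,j} = \mathrm{arg max}_{i}\left(\delta_{t-1,i}\, a_{ij}(t)\right)\] and the backtracking phase recovers the optimal sequence from the end: \(z^*_T = \mathrm{arg max}_j \, \delta_{T,j}\), then \(z^*_t = \psi_{t+1,\,z^*_{t+1}}\) for \(t = T-1,\dots,0\). The tutorial below works with the logarithms of \(\delta\).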


Figure 8.3: The Viterbi algorithm finds the best sequence of hidden states \(z_{0:T}\) given observations \(y_{0:T}\)

8.2 Tutorial (Python)

Hidden Markov models are also covered in the Stan user’s guide.

This tutorial uses the same data as the ARMAX example from chapter 7: Building 1298 from the ASHRAE Kaggle competition. This building's meter data are available in the book's repository.


We will focus again on meter 0 (electricity), but the reader is free to use the same code for any of the other three meters of this data.

This tutorial implements a simple homogeneous HMM directly in Python and, for once, does not use Stan. The point is to show how to code the forward algorithm and the Viterbi algorithm from scratch.

Let us start by importing some libraries, reading the data file and selecting which variable will be the dependent variable of our model.


# importing the three main libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing parts of scipy
from scipy.special import logsumexp
from scipy.optimize import minimize
from scipy.stats import norm

# Reading the data file
df = pd.read_csv('data/building_1298.csv')
df.set_index(pd.to_datetime(df['datetime']), inplace=True, drop=True)
df.ffill(inplace=True)  # forward-fill missing values

The forward algorithm described above is written here. At any of the \(N\) time steps, the Markov chain can take one of \(K\) states. We assume that the emission probabilities are Normal with means \(\mu_i\) and standard deviations \(\sigma_i\): \[\begin{equation} b_i(y_t) = p(y_t|z_t=i) = N(\mu_i, \sigma_i) \tag{8.5} \end{equation}\]

The parameters of the forward algorithm are the transition matrix \(a\), and the parameters of the emission probabilities \(\mu\) and \(\sigma\):

def forward(y, a, mu, sig):
    """ Calculates the likelihood from parameters a, mu and sig
    Arguments:
        y: dependent variable [N]
        a: transition matrix [KxK]
        mu: emission means [K]
        sig: emission standard deviations [K]
    Returns:
        The total log-likelihood
    Note: the number of states K is read from the enclosing scope.
    """
    y = np.asarray(y)  # allows positional indexing even if y is a pandas Series
    N = len(y)
    logalpha = np.zeros((N,K)) # log of the forward variable defined above
    # Initialisation
    pi0 = 1/K * np.ones(K)  # initial probabilities. Supposed known here.
    logalpha[0] = np.log(pi0) + norm.logpdf(y[0], loc=mu, scale=sig)
    # Recursion
    for t in range(1, N):
        for j in range(K):
            logalpha[t,j] = logsumexp(logalpha[t-1,:] + np.log(a[:,j]) + norm.logpdf(y[t],
                            loc=mu[j], scale=sig[j]) )
    # Termination
    return logsumexp(logalpha[-1])
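As a quick sanity check, the function can be called on a short synthetic series. The values below are arbitrary; note that forward reads the number of states K from the enclosing scope.

K = 2
y_check = np.array([0.1, 0.15, 0.7, 0.65, 0.2])
a_check = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
# prints the total log-likelihood of the synthetic series under these parameters
print(forward(y_check, a_check, mu=np.array([0.2, 0.6]), sig=np.array([0.1, 0.1])))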

The next block uses this forward algorithm to train the HMM. Just like in the ARMAX example, we only use a one-month subset of the whole training dataset.

Training is done by the scipy.optimize.minimize() function. Before using it, we need to define an objective function to minimize. This function takes a single array \(x\) as argument and returns the value we aim to minimize: the negative log-likelihood computed by the forward algorithm.


# Training subset
training_start = '2016-01-01'
training_end = '2016-01-31'
df_train = df.drop(df.index[(df.index < pd.to_datetime(training_start)) |(df.index > pd.to_datetime(training_end))])

# choosing meter 0 as dependent variable
df_train['y'] = df_train['m0']
# removing some outliers
#df['y'][df['m0'] < 300] = df['m0'].mean()
# normalizing y between 0 and 1
df_train['y'] = (df_train['y'] - df_train['y'].min()) / ( df_train['y'].max() - df_train['y'].min() )

def objective(x):
    # Reshaping the parameter vector x into the three variables of the forward algorithm
    a1 = np.reshape(x[:K*(K-1)], (K,K-1))   # Matrix a without the right column
    a2 = (1-a1.sum(axis=1))[:,np.newaxis]   # Right column of matrix a
    a = np.concatenate([a1, a2],axis=1)
    mu = x[K*(K-1):K*(K-1)+K]
    sig = x[K*(K-1)+K:]
    # Returns the minus log likelihood
    return -forward(df_train['y'], a, mu, sig)

# Initial parameter values to be passed to scipy.minimize()
K = 2                      # number of possible states
a_init = np.array([[0.9],[0.1]])
mu_init = [0.2, 0.6]       # emission means
sig_init = [0.1, 0.1]      # emission standard deviations

# Parameters are assembled into a single array x with given bounds
x0 = np.concatenate( [a_init.flatten(), mu_init, sig_init] )
bounds = (K*(K-1)*[(0,1)] + 2*K*[(0, None)])

# Training
res = minimize(objective, x0, bounds=bounds)

# Variables are recovered from the fitted x array
a1 = np.reshape(res.x[:K*(K-1)], (K,K-1))
a2 = (1-a1.sum(axis=1))[:,np.newaxis]
a = np.concatenate([a1, a2],axis=1)
mu = res.x[K*(K-1):K*(K-1)+K]
sig = res.x[K*(K-1)+K:]
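As a convenience check, we can print the estimated parameters, along with the stationary distribution of the fitted transition matrix (its left eigenvector associated with eigenvalue 1), which indicates the long-run share of time spent in each state. This assumes the fitted matrix is a valid stochastic matrix.

print("Transition matrix:\n", a)
print("Emission means:", mu)
print("Emission standard deviations:", sig)

# Stationary distribution: left eigenvector of a for eigenvalue 1, normalized to sum to 1
eigval, eigvec = np.linalg.eig(a.T)
stationary = np.real(eigvec[:, np.argmax(np.real(eigval))])
stationary /= stationary.sum()
print("Long-run share of time in each state:", stationary)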

Now that the parameters of the HMM have been estimated, we can decode it, i.e. estimate the most likely state sequence. This is done by the Viterbi algorithm below.

y = df_train['y'].to_numpy()  # as an array, for positional indexing
N = len(y)

z = np.zeros(N, dtype=int)  # hidden state to be determined
best_logp = np.zeros((N, K))  # delta in the description above
back_ptr = np.zeros((N, K), dtype=int)  # psi in the description above

# Initialisation (the uniform initial probability pi0 only adds a constant and is left out)
best_logp[0] = norm.logpdf(y[0], loc=mu, scale=sig)

# Recursion
for t in range(1, N):
    for k in range(K):
        logp = best_logp[t - 1] + np.log(a[:, k]) + norm.logpdf(y[t], loc=mu[k], scale=sig[k])
        best_logp[t, k] = np.max(logp)
        back_ptr[t, k] = np.argmax(logp)
        
# Backtracking
z[-1] = np.argmax(best_logp[-1])
for t in range(1, N):
    z[-1 - t] = back_ptr[-1 - t + 1, int(z[-1 - t + 1])]
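Before moving to prediction, the decoded sequence can be aligned with the timestamps to see how much of the month is spent in each state; this is a simple diagnostic, not required by the rest of the tutorial.

# Decoded states as a time series, and the share of time steps spent in each state
states = pd.Series(z, index=df_train.index, name='state')
print(states.value_counts(normalize=True))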

We can finish by inferring the predicted output from the estimated states and parameters of the HMM:

y_star_mean = np.zeros(N)
y_star_std = np.zeros(N)
for k in range(K):
    y_star_mean[z == k] = mu[k]
    y_star_std[z == k] = sig[k]

y_star = np.random.normal(loc=y_star_mean, scale=y_star_std)
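A simple way to visualize the result is to plot the normalized measurement against the HMM output, for instance with matplotlib (a minimal sketch; the exact layout is a free choice):

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df_train.index, y, label='normalized meter 0')
ax.plot(df_train.index, y_star, label='HMM output', alpha=0.7)
ax.set_xlabel('time')
ax.legend()
plt.show()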

References

Candanedo, Luis M, Véronique Feldheim, and Dominique Deramaix. 2017. “A Methodology Based on Hidden Markov Models for Occupancy Detection and a Case Study in a Low Energy Residential Building.” Energy and Buildings 148: 327–41.
Chen, Zhenghua, Chaoyang Jiang, and Lihua Xie. 2018. “Building Occupancy Estimation and Detection: A Review.” Energy and Buildings 169: 260–70.