Thursday, June 23, 2016

GSoC 2016 #2

Markov switching autoregression

If you studied statistics and remember the basics of time series analysis, you should be familiar with the autoregressive model, usually denoted AR(p).
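In the mean-adjusted notation of [2], which I follow here, it reads:

$$\phi(L)\,(y_t - \nu) = e_t \tag{1}$$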
Here y is the AR process, e is a white noise term, and nu is the mean of the process. Phi is a lag polynomial of order p:
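$$\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p \tag{2}$$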
L is the lag operator which, applied to a time series element, gives the previous element. So (1) can actually be rewritten in the following explicit form:
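$$y_t = \nu + \phi_1\,(y_{t-1} - \nu) + \phi_2\,(y_{t-2} - \nu) + \dots + \phi_p\,(y_{t-p} - \nu) + e_t \tag{3}$$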
Since the process definition (1) is essentially a linear equation between the process's lagged values and the error term, it can be put into state space form, as shown in [1], chapter 3.3.
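As a sketch of one common arrangement (the exact representation in [1] may be organized differently), the state vector collects the p latest mean-adjusted values:

$$\beta_t = \begin{pmatrix} y_t - \nu \\ y_{t-1} - \nu \\ \vdots \\ y_{t-p+1} - \nu \end{pmatrix}, \qquad y_t = \nu + \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix} \beta_t,$$

$$\beta_t = \begin{pmatrix} \phi_1 & \cdots & \phi_{p-1} & \phi_p \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix} \beta_{t-1} + \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} e_t.$$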
Now let's extend equation (1) by adding an underlying discrete Markov process St that switches the regime:
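$$y_t - \nu_{S_t} = \phi_{1,S_t}\,(y_{t-1} - \nu_{S_{t-1}}) + \dots + \phi_{p,S_t}\,(y_{t-p} - \nu_{S_{t-p}}) + e_t, \qquad e_t \sim N(0, \sigma^2_{S_t}) \tag{4}$$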
Notice that the mean, the error variance, and the lag polynomial now depend on the value of the switching regime. This is the so-called Markov switching autoregressive model (MS AR). Where can it be used in practice? Let's look at the example from [2], chapter 4.4, which I also used for testing my code:
This is a series of U.S. real GDP. Looking at the data, two regimes are noticeable: expansion and recession. Using maximum likelihood estimation, we can fit this data with a two-regime switching-mean AR model to describe the dynamics of real GDP quantitatively. The authors use an AR(4) model, and so do we. The next picture displays the smoothed (that is, conditional on the whole dataset) probabilities of being in the recession regime:

The probability peaks correspond closely to the recession periods, which shows that Markov switching AR provides a powerful tool for analyzing the underlying structure of a time series.

Implementation

Markov switching autoregression is implemented in the ms_ar.py file in my PR to Statsmodels. This file contains the MarkovAutoregression class, which extends RegimeSwitchingMLEModel and "translates" equation (4) into the state space "language".
It was quite entertaining to express the ideas explained in chapter 3.3 of [1] in Python code. One thing I had to be very careful about is that for an AR(p) model with k regimes, the state space representation has to carry k^(p+1) regimes, because the switching means in (4) occur with p+1 different regime indices (St, St-1, ..., St-p). Thus, every state space regime represents a combination of p+1 lagged AR regimes.
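A tiny illustration of that combinatorial blow-up (purely illustrative, not the actual MarkovAutoregression code):

# Enumerate the state space regimes of an AR(p) model with k Markov switching
# regimes: each state space regime is a tuple (S_t, S_{t-1}, ..., S_{t-p}),
# so there are k ** (p + 1) of them.
from itertools import product

k, p = 2, 4
state_space_regimes = list(product(range(k), repeat=p + 1))
print(len(state_space_regimes))  # 2 ** 5 = 32
print(state_space_regimes[:3])   # (0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 1, 0)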
Such a big number of regimes leads to longer computation times, which caused some problems. For example, Kim filtering of the real GDP model above took 25 seconds, which is unacceptable when we are doing a lot of BFGS iterations to find the likelihood maximum. Luckily, I found a way to optimize the Kim filter, which was quite straightforward, in fact. If you remember the previous blog post, a Kim filter iteration consists of a heavy-weight Kalman filter step, where a Kalman filtering iteration is applied a lot of (k^(2(p+1)) for MS AR!) times, followed by summing the results with weights equal to the joint probabilities of being in the current and previous regime. The thing is that with a sparse regime transition matrix, which is exactly what the MS AR model produces, these joint probabilities are rarely non-zero, and we don't need to run the Kalman filter for the zero ones! This change decreased the Kim filter run time dramatically, to 2-3 seconds on my machine (which is not very powerful, by the way).
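A rough sketch of the idea (illustrative only, not the actual KimFilter internals):

import numpy as np

K = 32                                     # number of state space regimes
transition = np.zeros((K, K))              # regime transition matrix, mostly zeros for MS AR
transition[0, 0] = transition[0, 1] = 0.5  # toy example of one sparse row

def kalman_filter_step(i, j):
    ...                                    # expensive filtering for the regime pair (i, j)

for i in range(K):
    for j in range(K):
        # The joint probability Pr[S_{t-1} = i, S_t = j] is zero whenever the
        # transition probability is zero, so the whole pair can be skipped.
        if transition[i, j] == 0:
            continue
        kalman_filter_step(i, j)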

EM algorithm

The MarkovAutoregression class also implements the EM algorithm. The Markov switching autoregressive model defined by (4) is, with some approximations, a regression with switching parameters and lagged observations as regressors. Such a model, as shown in chapter 4.3.5 of [2], has a simple closed-form solution for the EM iteration. The EM algorithm is a great device for reaching very fast convergence. For example, in the comments to my PR I copied a debug output with the following numbers:
#0 Loglike - -1941.85536159
#1 Loglike - -177.181731435
Here #0 is the likelihood at random starting parameters, and #1 is the likelihood after one iteration of the EM algorithm. A very significant improvement, isn't it?
MarkovAutoregression has two public methods to run the EM algorithm: fit_em and fit_em_with_random_starts. The first simply performs a number of EM iterations from given starting parameters, while the second generates a set of random starting parameters, applies the EM algorithm to each of them, and finally chooses the result with the best likelihood.
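A hypothetical usage sketch (only the class and method names come from the PR; the import path, constructor arguments, and return values are my assumptions):

from ms_ar import MarkovAutoregression   # module location within the PR

model = MarkovAutoregression(gdp_growth, order=4, k_regimes=2)  # gdp_growth - a 1-d series of data
em_params = model.fit_em_with_random_starts()  # 50 random starts, 10 EM iterations each
result = model.fit(start_params=em_params)     # polish the EM estimate with BFGS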

Testing

Right now there are two test files for the MarkovAutoregression class, each based on one model: test_ms_ar_hamilton1989.py and test_ms_ar_garcia_perron1996.py. Besides formal functional tests, which check that filtering, smoothing, and maximum likelihood estimation give correct values against this and this Gauss code sample, these files test the EM algorithm in its typical usage scenario: when the user knows nothing about the correct parameters but wants to estimate something close to the global likelihood maximum. This task is handled by the already mentioned fit_em_with_random_starts method, which by default runs 50 sessions of the EM algorithm from random starts, each session consisting of 10 iterations.

What's next?

I hope that the hardest part of the project, that is, developing the Kim filter and Markov switching autoregression, is behind me. Two more models remain: the dynamic factor model and the time-varying parameters model, both with regime switching. There will also be a lot of refactoring of already written code, so some articles are going to be all about coding.

Literature

[1] - "Time Series Analysis by State Space Methods", Second Edition, by J. Durbin and S.J. Koopman.
[2] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.

Sunday, June 5, 2016

GSoC 2016 #1

Kim filter



If you look at the code in my Pull Request, you can see the KimFilter class (the kim_filter.py file), which is the main part of my project. So what exactly does it do?
There is such a thing as a linear state space representation.
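Following roughly the notation of [1] (the exact placement of the intercept and noise terms is my guess), it can be written as:

$$y_t = H_t \beta_t + A_t z_t + e_t$$
$$\beta_t = \nu + F_t \beta_{t-1} + G_t v_t$$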

It describes some observed vector process yt over time. Ht, At, Ft, Gt are matrices, zt is a vector of exogenous data, and beta_t is an unobserved process vector which generates yt. et and vt are white noise vectors, and nu is the intercept term.
This representation is usually used when we have some real-life process yt and want to understand its internal structure, that is, to find a law describing it.
The exact values of the matrices in the model definition depend on an unknown vector of parameters in a model-specific way. In the most general case it is just a vector of all the matrices' elements written in a row (although I have never witnessed anything like that in practice). To estimate the unknown parameters we use the principle of maximum likelihood, widely used in statistics and machine learning: the best parameter estimates are those that maximize the probability density of the observed data. This density is exactly what the well-known Kalman filter computes, along with filtered values of the unobserved process. Then we apply some numerical maximization to the Kalman filter output as a function of the parameters and get the solution.
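As a minimal, self-contained illustration of the MLE principle (a toy AR(1) with its exact conditional likelihood, rather than the Kalman filter likelihood):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate y_t - nu = phi * (y_{t-1} - nu) + e_t with nu = 1, phi = 0.5, sigma = 0.3.
nu_true, phi_true, sigma_true = 1.0, 0.5, 0.3
y = np.empty(500)
y[0] = nu_true
for t in range(1, len(y)):
    y[t] = nu_true + phi_true * (y[t - 1] - nu_true) + sigma_true * rng.standard_normal()

def neg_loglike(params):
    nu, phi, log_sigma = params
    resid = (y[1:] - nu) - phi * (y[:-1] - nu)
    return -norm.logpdf(resid, scale=np.exp(log_sigma)).sum()

result = minimize(neg_loglike, x0=[0.0, 0.0, 0.0], method="BFGS")
print(result.x)  # should be close to (1.0, 0.5, log(0.3))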
Let's modify the linear state space representation a little:
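$$y_t = H_{S_t} \beta_t + A_{S_t} z_t + e_t$$
$$\beta_t = \nu_{S_t} + F_{S_t} \beta_{t-1} + G_{S_t} v_t$$

(Here I also let the intercept switch; whether it does depends on the particular model.)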

You can see that instead of time indices on the representation matrices we now have St, a discrete Markov process with P states and a P x P transition matrix, which represents the changing regime of the process. When the process St is known a priori, we have a usual linear model. But when it is unknown and random, the model becomes much more complex. To estimate it, we use the Kim filter, described in [1], which I recently implemented in the previously mentioned PR.
It iterates over the process and during every iteration runs through three steps:
  1. Kalman filter step (_kalman_filter_step method). During this phase we run the Kalman filter for every pair of previous and current regimes, and thus get a battery of P x P filtered states and their covariances, each conditional on one such regime combination. If you look at the code, you can see that I delegate these calculations to P KalmanFilter instances, which is very handy.
  2. Hamilton filter step (_hamilton_filter_step method). During this phase we calculate the likelihood of the current observation, conditional on the previous ones, and add it to the overall likelihood. We also calculate some helpful weighting probabilities.
  3. Approximation step (_approximation_step method). Here we sum up the filtered states and covariances obtained during the first step, with weights obtained during the second step, to get filtered states conditional only on the current regime value (see the numerical sketch below).
For the detailed mathematical description you can refer to [1] or look around my code.
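To make the last two steps concrete, here is a tiny numerical sketch for a single time point (the variable names, shapes, and numbers are mine, not the actual KimFilter internals):

import numpy as np

P = 2                                          # number of regimes
transition = np.array([[0.9, 0.1],             # transition[i, j] = Pr[S_t=j | S_{t-1}=i]
                       [0.2, 0.8]])
prev_prob = np.array([0.7, 0.3])               # Pr[S_{t-1}=i | data up to t-1]

# Outputs of the Kalman filter step (dummy numbers): filtered state means and
# the conditional densities f(y_t | S_{t-1}=i, S_t=j, previous data).
state_ij = np.array([[1.0, 1.2],
                     [0.8, 1.1]])
density_ij = np.array([[0.30, 0.05],
                       [0.10, 0.25]])

# Hamilton filter step: joint regime probabilities and likelihood contribution.
pred_joint = prev_prob[:, None] * transition   # Pr[S_{t-1}=i, S_t=j | previous data]
joint = pred_joint * density_ij
loglike_t = np.log(joint.sum())                # added to the overall log-likelihood
filt_joint = joint / joint.sum()               # Pr[S_{t-1}=i, S_t=j | data up to t]
filt_prob = filt_joint.sum(axis=0)             # Pr[S_t=j | data up to t]

# Approximation step: collapse the P x P filtered states into P states,
# weighting by Pr[S_{t-1}=i | S_t=j, data up to t].
weights = filt_joint / filt_prob
collapsed_state = (weights * state_ij).sum(axis=0)
print(loglike_t, filt_prob, collapsed_state)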

Maximum likelihood estimation

We can now estimate the likelihood of the parameters. The next goal is to optimize the parameters according to the likelihood. This responsibility is handled by the RegimeSwitchingMLEModel class, located in the rs_mlemodel.py file and derived from the MLEModel class. It was very handy to extend MLEModel and substitute the Kalman filter with a Kim filter instance in the corresponding property. The two filters have very similar interfaces, and MLEModel, being mostly about optimization, delegates all model-specific actions to the filter instance.
All the noticeable changes were about parameter transformation, since we have another group of parameters representing the transition matrix. RegimeSwitchingMLEModel also has a feature discussed in the next section.
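For illustration, here is one generic way to map unconstrained optimizer parameters into a valid transition matrix (only a sketch of the idea, not necessarily the transform RegimeSwitchingMLEModel actually uses):

import numpy as np

def untransform_transition(unconstrained):
    # Softmax over each row, so every row is non-negative and sums to one.
    shifted = unconstrained - unconstrained.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

print(untransform_transition(np.array([[2.0, 0.0], [0.0, 2.0]])))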

Starting parameters


Just passing the likelihood function to an optimizer doesn't guarantee convergence to the global maximum. The optimizer can get stuck in a local maximum, so the goal is to start from somewhere close to the best likelihood value. My proposal suggested two approaches to this problem:

  1. Just pretend there is no problem and let the user provide the starting parameters, hoping they enter something close to the global maximum. This scenario is also implemented, because it is used for testing against Kim's code samples, where the author passes some a priori known starting values to the optimizer.
  2. First fit a one-regime analog of the switching model to the provided data (to use this feature, pass fit_nonswitching_first=True to the fit method of your RegimeSwitchingMLEModel subclass). Then initialize the switching model with the transition matrix set to identity (as if there were no switching) and with every regime's parameters equal to those obtained by the non-switching fit. This approach seems intuitive, because we are choosing the model and complicating it gradually, guided by the likelihood improvement.

Tests

To test the model I used this Gauss code from Kim's site, implementing Lam's generalized Hamilton model. There is a basic test of the Kim filter, test_kim.py, which simply checks the likelihood and filtered states obtained from my code against the Gauss code output. I am lucky that it passes! There is also a file test_rs_mlemodel.py, which tests MLE against the same code. It passes as well! Almost the same, but commented and more readable, code can be seen in the notebook presenting the regime_switching module usage.
Looking at the output of the notebook, you can see that the fit-non-switching-first start works quite well: the likelihood doesn't differ much from the best one, and neither do the obtained parameters.

What's next?

I hope to start implementing models soon, to make my module more ready to use. The first model in the row will be the Markov switching autoregressive model, and it will probably be the hardest one. There is going to be some fun, like implementing the EM algorithm for starting parameter estimation; I can't wait to start! Feel free to leave comments if you don't understand something or have any thoughts about my project. Thanks!

Literature

[1] "State-space Models With Regime Switching" by Chang-Jin Kim and Charles R. Nelson.