Markov Chains in Credit Modeling

Expected Visits

We consider again the rating migration matrix presented previously. As mentioned, the probability of ultimate default is not very interesting in this case. We also calculated the probability of default over different periods of time. Here, we look at another approach to summarizing the default risk of a given rating: the expected (average) amount of time to default for each rating category.

Recall that the expected value of a non-negative integer-valued random variable $N$ is given by: \begin{eqnarray} E\left[N\right] = \sum_{i=0}^\infty i \mathbb{P}\left( N = i \right) \end{eqnarray} For this discussion, we assume the Markov chain starts at time $0$, that is, the process is $X_0, X_1, X_2, \ldots$. Now consider the random variable $\tau_D$ defined as the first time the Markov chain hits state $D$, that is, $\tau_D = \min \left\{n \in \left\{0,1,2,\ldots\right\}: X_n=D\right\}$. This is called the first hitting time. Note that we previously used the variable $\tau$ for the first exit time: the notation $\tau$ is used for a random variable which is a stopping time (which both the first exit time and the first hitting time are).

We now investigate a way to derive the expected hitting times. Let $P$ denote the transition probability matrix. For any state $s\neq D$, the expected first hitting time can be computed by conditioning on the first transition: it is the sum, over each possible next state, of the transition probability times one plus the expected hitting time from that state: \begin{eqnarray} E\left[\tau_D\right|\left.X_0 = s\right] & = & \sum_{s'} P_{s,s'} E\left[\tau_D\right|\left.X_0=s,X_1=s'\right]\\ & = & \sum_{s'} P_{s,s'} E\left[\tau_D\right|\left.X_1=s'\right]\\ & = & \sum_{s'} P_{s,s'} \left(1+E\left[\tau_D\right|\left.X_0=s'\right]\right)\\ \end{eqnarray} where we have used the Markov property (which extends to expectations by a straightforward argument) and homogeneity, respectively, for the last two equations. Note that for $s=D$ this equation would yield $E\left[\tau_D\right|\left.X_0=D\right] = 1 + E\left[\tau_D\right|\left.X_0=D\right]$, which cannot be satisfied. For $s=D$, we instead have: \begin{eqnarray} E\left[\tau_D\right|\left.X_0=D\right] = 0\tag{2}\label{default} \end{eqnarray} If we let $x$ be the vector such that $x_i = E\left[\tau_D\right|\left.X_0=s_i\right]$, then the above equation can be written as: \begin{eqnarray} x = P(e+x) \end{eqnarray} where, as usual, $e$ is the vector of all $1$'s. Since $Pe = e$, this is equivalent to: \begin{eqnarray} (I - P) x = e \end{eqnarray} Note that the equation corresponding to $D$ in this system of equations needs to be replaced by equation (\ref{default}).
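As a concrete illustration, the linear system above can be solved numerically. Below is a minimal sketch in Python/NumPy using a hypothetical $3$-state chain (states $G$, $B$, $D$, with $D$ absorbing); the transition probabilities are made up for illustration and are not from the rating migration example:

```python
import numpy as np

# Hypothetical 3-state chain: G (good), B (bad), D (default, absorbing).
P = np.array([
    [0.90, 0.08, 0.02],   # from G
    [0.10, 0.70, 0.20],   # from B
    [0.00, 0.00, 1.00],   # from D: absorbing
])

n = P.shape[0]
d = 2  # index of the absorbing state D

# Build (I - P) x = e, then overwrite the row for D with x_D = 0,
# as required by equation (2).
A = np.eye(n) - P
b = np.ones(n)
A[d, :] = 0.0
A[d, d] = 1.0
b[d] = 0.0

x = np.linalg.solve(A, b)  # expected hitting times of D from each state
print(x)  # x[d] == 0; the other entries are the expected times to default
```

Solving the full system with the $D$ row overwritten (rather than deleting the row and column) keeps the indexing of states intact.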

Applying the above formulas to the transition matrix given in the rating migration example, we get the following expected number of years to default:

Rating Expected years to default
NR 226
AAA 246
AA 241
A 236
BBB 228
BB 209
B 172
CCC 84
D 0

It is surprising that the expected time to default from a rating of CCC is $84$ years, given that the probability of going directly to default from CCC is about $37.6\%$ and the probability of staying in CCC is about $38.2\%$. The probability of defaulting each period, given that there is no other change in rating, is: \begin{eqnarray} \frac{37.6\%}{1-38.2\%} \approx 60.8\% \end{eqnarray} The expected time to default, given that there is no other change in rating, is then the mean of the geometric distribution with success probability $p \approx 60.8\%$: \begin{eqnarray} \frac{1}{p} = \frac{1}{60.8\%} \approx 1.64 \end{eqnarray} How, then, do we get an expected time of $84$ years once upgrades are included? For example, there is a $14.3\%$ probability of the rating changing to NR, which alone adds an additional $14.3\% \times 226 \approx 32.3$ years. For this reason, we also calculate the expected number of years given that there is no change to NR:

Rating Expected years to default (no transition to NR)
AAA 167
AA 157
A 147
BBB 130
BB 100
B 57
CCC 18
D 0
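As a sanity check on the arithmetic above, here is a quick sketch using the CCC probabilities quoted in the text (recall that the mean of a geometric distribution with per-period success probability $p$ is $1/p$):

```python
# CCC transition probabilities quoted in the text.
p_default = 0.376   # probability CCC -> D in one period
p_stay    = 0.382   # probability of remaining in CCC

# Default probability per period, conditional on no other rating change.
p = p_default / (1.0 - p_stay)
print(round(p, 3))   # ~0.608

# Expected number of periods until default under that conditioning:
# the mean of a geometric distribution with success probability p.
print(round(1.0 / p, 2))   # ~1.64
```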

Intermediate Dynamics and Quasi-stationary Distributions

According to the model, ratings all end up in default. However, as we have seen by both probabilities to default by a particular time (2nd lecture) and expected times until default (this lecture), this can take a long time, say, more than $100$ years. As the famous economist Keynes said "In the long run, we are all dead". The only known absorbing state for life is death. Many people would agree that what happens before the absorbing state is more interesting. Notice that in the chart of the probability of default given the initial rating, in the intermediate term, before the probability of default approaches 1, the probability of default seems to be going up linearly. We will explain why this is approximately true in this section.

The explanation involves what are called quasi-stationary distributions. Let $P$ be a transition matrix with an absorbing state $D$. The distribution $\pi$ is quasi-stationary if, for the Markov chain $X_0, X_1, X_2, \ldots$ with initial distribution $\pi$ and transition matrix $P$, we have, for every $t \geq 0$: \begin{eqnarray} \mathbb{P}\left( X_t = s \right|\left. \tau_D > t \right) = \pi_s \end{eqnarray} where, as in the last section, $\tau_D$ is the hitting time of state $D$. Let's use the formula for multi-step transition probabilities to write this in terms of $\pi$ and $P$: \begin{eqnarray} \mathbb{P}\left( X_t = s \right|\left. \tau_D > t \right) = \frac{\left(\pi P^t\right)_s}{1 - \left(\pi P^t\right)_D} = \pi_s \end{eqnarray}

Using a variant of the Perron-Frobenius theorem, it can be shown that, if the chain resulting from deleting the absorbing state is irreducible, then there is a unique quasi-stationary distribution. Note that when the absorbing state is deleted, the rows of the matrix no longer sum to $1$, so the matrix is not stochastic; it is referred to as a substochastic matrix. The definitions of irreducibility, aperiodicity, etc. apply equally well to substochastic matrices because they are based on the graph of the chain, and the graph depends only on which elements of the matrix are strictly positive, not on their particular values. Furthermore, if the chain resulting from deleting the absorbing state is aperiodic, the conditional distribution will converge to the quasi-stationary distribution, given that the process is not yet in the absorbing state. Finally, there is one more interesting phenomenon: if the process is started in the quasi-stationary distribution, the hitting time of the absorbing state is geometrically distributed, that is: \begin{eqnarray} \mathbb{P}\left(\tau_D = t\right) = \left(\pi P^t\right)_D - \left(\pi P^{t-1}\right)_D = \left( 1 - \lambda \right) \lambda^{t-1} \end{eqnarray} The parameter $\lambda$ of the geometric distribution is the largest magnitude eigenvalue of the substochastic matrix obtained by deleting the absorbing state, and the quasi-stationary distribution is the corresponding left eigenvector (normalized to sum to $1$).
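A sketch of computing a quasi-stationary distribution numerically: delete the absorbing state, then take the left Perron eigenvector of the remaining substochastic block. The $3$-state matrix below is hypothetical, not the rating migration matrix:

```python
import numpy as np

# Hypothetical chain with absorbing state D (last index).
P = np.array([
    [0.90, 0.08, 0.02],
    [0.10, 0.70, 0.20],
    [0.00, 0.00, 1.00],
])

# Delete the absorbing state: the remaining block is substochastic.
S = P[:2, :2]

# Left eigenvectors of S are right eigenvectors of S^T.
eigvals, eigvecs = np.linalg.eig(S.T)
k = np.argmax(eigvals.real)      # Perron (largest magnitude) eigenvalue
lam = eigvals.real[k]            # per-step survival probability
pi = np.abs(eigvecs[:, k].real)  # Perron vector has uniform sign
pi = pi / pi.sum()               # normalize to a distribution

print(lam, pi)  # pi satisfies pi S = lam * pi
```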

In the case of the rating migration example we've been exploring, the chain resulting from deleting the absorbing state is irreducible: every state communicates with every other since there are both upgrades and downgrades. Furthermore, it is aperiodic because the process has a strictly positive probability of staying in the same state, for every state. Hence, the conditional distribution will converge to the quasi-stationary distribution, which is given by:

Rating Quasi-stationary probability
NR 96.85%
AAA 0.02%
AA 0.24%
A 1.02%
BBB 1.05%
BB 0.56%
B 0.23%
CCC 0.03%

Notice that there is a high probability of being non-rated. This is because the probability of transitioning to non-rated from non-rated is very high, $99.35\%$, compared with the other states, which are more like $90\%$. After $100$ steps starting from AAA, here is the conditional distribution of the rating, given that the process has not yet defaulted:

Rating Conditional probability after 100 steps
NR 96.86%
AAA 0.02%
AA 0.24%
A 1.02%
BBB 1.04%
BB 0.56%
B 0.23%
CCC 0.03%

Furthermore, the parameter of the geometric distribution is given by $\lambda = 99.56\%$. This means that, once the process is in the quasi-stationary distribution, the probability of defaulting in each subsequent period is: \begin{eqnarray} \mathbb{P}\left(\tau_D = t\right) = 0.0044 \times 0.9956^{t-1} \end{eqnarray} so the cumulative probability of default grows roughly linearly, since, with $\epsilon = 1 - \lambda$: \begin{eqnarray*} 1 - \left(1 - \epsilon\right)^t = t \epsilon - \frac{t(t-1)}{2} \epsilon^2 + \ldots \end{eqnarray*} The second term on the right hand side is $\frac{t-1}{2}\epsilon$ times the first. For example, over the first $10$ steps this ratio is at most $4.5 \times 0.0044 \approx 2\%$, so the linear approximation is quite accurate.
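The near-linearity is easy to check numerically. Here is a sketch comparing the exact cumulative default probability $1-\lambda^t$ with the linear approximation $t\epsilon$, using $\lambda = 0.9956$ from the text:

```python
lam = 0.9956        # per-period survival probability from the text
eps = 1.0 - lam     # 0.0044

# Compare the exact cumulative default probability with t * eps.
for t in range(1, 11):
    exact = 1.0 - lam ** t
    linear = t * eps
    rel_err = abs(linear - exact) / exact
    print(t, round(exact, 5), round(linear, 5), round(rel_err, 4))
```

The relative error after $10$ steps is about $2\%$, matching the $\frac{t-1}{2}\epsilon$ estimate above.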


Rating Migration and Continuous Time Markov Chains

Note that credit ratings can change at more or less any time, not just on yearly boundaries as assumed by the transition matrix given in the rating migration example. The paper that presented this matrix argues that a continuous time model is a better fit for rating migrations. A model which is Markovian, continuous in time, and discrete in space is called a continuous time Markov chain. We now explore the implications of being a Markov chain in continuous time.

In continuous time, the Markov property is: \begin{eqnarray} \mathbb{P}\left(X_{t_{i+1}} = s_{i+1}\right|\left.X_{t_1}=s_1, X_{t_2}=s_2, \ldots, X_{t_i} = s_i\right) = \mathbb{P}\left(X_{t_{i+1}} = s_{i+1}\right|\left.X_{t_i} = s_i\right) = P_{s_i,s_{i+1}}\left(t_i,t_{i+1}\right) \end{eqnarray} where we assume $t_1 \lt t_2 \lt \ldots \lt t_i \lt t_{i+1}$. Since there is no minimum unit of time, the transition probabilities depend upon both the starting time $t_i$ and the ending time $t_{i+1}$, and we use the notation: \begin{eqnarray} P_{s_i,s_{i+1}}\left(t_i,t_{i+1}\right) = \mathbb{P}\left(X_{t_{i+1}} = s_{i+1}\right|\left.X_{t_i} = s_i\right) \end{eqnarray} Note that we are discussing the non-homogeneous case; a non-homogeneous discrete time transition probability matrix $P_{s_i,s_{i+1}}\left(t_i\right)$ would also depend on time, but would always correspond to a time period of duration $1$ unit of time. For continuous time Markov chains, the equivalent of the multistep transition probability formula is given by: \begin{eqnarray} \lefteqn{\mathbb{P}\left(X_{t_3}=s_3\right|\left.X_{t_1}=s_1\right) = \sum_{s_2}\mathbb{P}\left(X_{t_3}=s_3, X_{t_2}=s_2\right|\left.X_{t_1}=s_1\right)}\\ & = & \sum_{s_2}\mathbb{P}\left(X_{t_3}=s_3\right|\left.X_{t_1}=s_1,X_{t_2}=s_2\right)\mathbb{P}\left(X_{t_2}=s_2\right|\left.X_{t_1}=s_1\right)\\ & = & \sum_{s_2}\mathbb{P}\left(X_{t_3}=s_3\right|\left.X_{t_2}=s_2\right)\mathbb{P}\left(X_{t_2}=s_2\right|\left.X_{t_1}=s_1\right)\\ \end{eqnarray} or, putting this in terms of the transition probability matrices: \begin{eqnarray} P\left(t_1,t_3\right) = P\left(t_1,t_2\right)P\left(t_2,t_3\right)\tag{3}\label{product} \end{eqnarray} Note that if the Markov chain is homogeneous then the transition probabilities are defined as $P(t) = P(t_1,t_1+t)$ and correspond with the probability of transition within a time period of duration $t$.
In this case, equation (\ref{product}) becomes: \begin{eqnarray} P\left(t_1 + t_2\right) = P\left(t_1\right)P\left(t_2\right) \end{eqnarray} This property is called the semigroup property.

In order to proceed further, we make an additional, commonly made assumption. In particular, we assume that: \begin{eqnarray} \lim_{t\rightarrow 0^+} P(t) = I\tag{4}\label{standard} \end{eqnarray} Note that this means that we can choose a time interval small enough that the probability of a transition within that interval is arbitrarily small. With this assumption, we can give some examples of differences between continuous time Markov chains and their discrete time counterparts.

Positive Self Transitions and Aperiodicity

From assumption (\ref{standard}), we have that for any $\epsilon>0$ and sufficiently small $\delta$, it must be that $P_{s,s}(\delta) > 1 - \epsilon$. Hence, \begin{eqnarray} P_{s,s}(n\delta) = \left(P^n(\delta)\right)_{s,s} \geq \left(P_{s,s}(\delta)\right)^n \end{eqnarray} (since one way to be in state $s$ at time $n\delta$ is to remain in $s$ at every intermediate multiple of $\delta$), so that $P_{s,s}(n\delta) > 0$ for any $n$. Note that this does not happen for discrete time Markov chains, as in the $2$-cycle example and the GCD $2$ example. In fact, it further turns out that $P_{s,s'}(t)$ is either identically $0$ or positive for all $t$. Hence, all continuous time Markov chains are aperiodic.

The Generator

Under assumption (\ref{standard}), $P(t)$ is differentiable with respect to $t$. Define the generator of the Markov chain as: \begin{eqnarray} Q = \left(\frac{dP}{dt}\right)_{t=0} = P'(0) \end{eqnarray} Hence: \begin{eqnarray} \lefteqn{P'(t) = \lim_{h\rightarrow 0} \frac{P(t+h) - P(t)}{h} = \lim_{h\rightarrow 0} \frac{P(h)P(t) - P(0)P(t)}{h}}\\ & = & \lim_{h\rightarrow 0} \frac{P(h) - P(0)}{h} P(t) = P'(0) P(t) = Q P(t) \end{eqnarray} Thus $P(t)$ is a solution to the constant coefficient linear differential equation $P'(t) = Q P(t)$. Knowing the matrix $Q$ allows one to calculate the transition probabilities $P(t)$ for any time $t$.

The solution to the differential equation $P'(t) = Q P(t)$ is given by the matrix exponential, written $\exp\left(Q t\right)$. One can define the matrix exponential by applying the Taylor series of an exponential to the matrix $Qt$: \begin{eqnarray} \exp(Q t) = \sum_{n=0}^\infty \frac{\left(Qt\right)^n}{n!} \end{eqnarray} Note that since $\left\|\left(Qt\right)^n\right\| \leq \left\|Qt\right\|^n$ grows at most geometrically in $n$ while $n!$ grows much faster, the above series can be shown to converge for every $Q$ and every $t$. Furthermore, it is not difficult to see that it solves $P'(t) = Q P(t)$: \begin{eqnarray} \lefteqn{\frac{d}{dt} \exp(Q t) = \frac{d}{dt}\sum_{n=0}^\infty \frac{\left(Qt\right)^n}{n!}}\\ & = & \sum_{n=0}^\infty\frac{d}{dt} \frac{\left(Qt\right)^n}{n!} = \sum_{n=0}^\infty\frac{d}{dt} \frac{Q^n t^n}{n!}\\ & = & \sum_{n=0}^\infty \frac{Q^n n t^{n-1}}{n!}\\ & = & \sum_{n=1}^\infty \frac{Q^n t^{n-1}}{\left(n-1\right)!} = Q \sum_{n=1}^\infty \frac{Q^{n-1} t^{n-1}}{\left(n-1\right)!}\\ & = & Q \sum_{n=0}^\infty \frac{Q^n t^n}{n!} = Q \sum_{n=0}^\infty \frac{\left(Q t\right)^n}{n!}\\ & = & Q \exp( Q t ) \end{eqnarray} though note that we would need to justify the interchange of derivative and sum to prove this rigorously.
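Here is a sketch of the truncated series in NumPy, applied to a small hypothetical generator; it also checks the semigroup property $P(t_1+t_2)=P(t_1)P(t_2)$ and that the rows of $P(t)$ sum to $1$:

```python
import numpy as np

def expm_series(Q, t, terms=60):
    """Approximate exp(Q t) by truncating its Taylor series."""
    n = Q.shape[0]
    result = np.eye(n)
    term = np.eye(n)           # holds (Qt)^k / k!
    for k in range(1, terms):
        term = term @ (Q * t) / k
        result = result + term
    return result

# Hypothetical 2-state generator: non-negative off-diagonals, zero row sums.
Q = np.array([[-0.3,  0.3],
              [ 0.1, -0.1]])

P1, P2, P3 = (expm_series(Q, t) for t in (1.0, 2.0, 3.0))

print(np.allclose(P1 @ P2, P3))           # semigroup: P(1)P(2) = P(3)
print(np.allclose(P1.sum(axis=1), 1.0))   # rows of P(t) are stochastic
```

The truncated series is fine for illustration; production code would use a dedicated routine such as `scipy.linalg.expm`, which handles scaling and squaring for numerical stability.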

We note several properties of continuous time Markov chains:

  1. The off-diagonal elements of the generator $Q$ are non-negative: for $s \neq s'$, \begin{eqnarray} Q_{s,s'} & = & \lim_{h\rightarrow 0} \frac{P_{s,s'}(h) - P_{s,s'}(0)}{h}\\ & = & \lim_{h\rightarrow 0} \frac{P_{s,s'}(h) - I_{s,s'}}{h}\\ & = & \lim_{h\rightarrow 0} \frac{P_{s,s'}(h)}{h} \geq 0 \end{eqnarray} since $I_{s,s'} = 0$ when $s \neq s'$.
  2. The generator matrix $Q$ has row sums equal to $0$: \begin{eqnarray} \sum_{s'} Q_{s,s'} & = & \sum_{s'} \lim_{h\rightarrow 0} \frac{P_{s,s'}(h) - P_{s,s'}(0)}{h}\\ & = & \sum_{s'} \lim_{h\rightarrow 0} \frac{P_{s,s'}(h) - I_{s,s'}}{h}\\ & = & \lim_{h\rightarrow 0} \sum_{s'} \frac{P_{s,s'}(h) - I_{s,s'}}{h}\\ & = & \lim_{h\rightarrow 0} \frac{\sum_{s'} P_{s,s'}(h) - \sum_{s'} I_{s,s'}}{h}\\ & = & \lim_{h\rightarrow 0} \frac{1 - 1}{h} = 0\\ \end{eqnarray} where the interchange of limit and sum in the third equality is justified because the state space is finite.
  3. Every matrix, $Q$, with the last two properties (non-negative off-diagonal elements and $0$ row sums) is the generator of a continuous time Markov chain.
  4. The determinant of every transition matrix, $P(t)$, of a continuous time Markov chain, for any time $t$, is positive. This follows from the identity $\det P(t) = \det \exp\left(Qt\right) = \exp\left(t \operatorname{tr} Q\right) > 0$. For example, the following matrix is a transition matrix of a discrete time Markov chain (its determinant is $-\frac{1}{3}$) but not of a continuous time one: \begin{eqnarray} \left(\begin{array}{cc} \frac{1}{3} & \frac{2}{3}\\ \frac{2}{3} & \frac{1}{3}\\ \end{array}\right) \end{eqnarray} The identity also implies that the transition matrix of a continuous time Markov chain is always non-singular. Hence, the following is another example of a transition matrix which is valid for a discrete-time but not a continuous-time Markov chain: \begin{eqnarray} \left(\begin{array}{cc} \frac{1}{3} & \frac{2}{3}\\ \frac{1}{3} & \frac{2}{3}\\ \end{array}\right) \end{eqnarray}
  5. There is no straightforward characterization of the transition matrices of continuous time Markov chains. There are several necessary conditions, such as aperiodicity and positive determinant, as we have noted. Similarly, we can say that they are matrix exponentials of matrices which have non-negative off-diagonal elements and $0$ row sums. However, there is no straightforward way to determine whether a matrix is of this form. Notice, however, that the number of free parameters of $Q$ is the same as the number of free parameters of a transition matrix. Hence, even though not every transition matrix corresponds to a continuous time Markov chain, they have the same dimensionality.
  6. Given a generator $Q$, a continuous time Markov chain can be simulated from a starting state $s$ as follows:
    • Iterate the following:
      1. Determine a time $T$ when the state leaves the current state $s$ by drawing a continuous sample from an exponential distribution with parameter $q_s = -Q_{s,s}$, that is, with the following density: \begin{eqnarray} f_{q_s}(t) = q_s \exp\left(-q_s t\right) \end{eqnarray}
      2. The next state will be $s'$ (which can't be $s$) with probability $\frac{Q_{s,s'}}{q_s}$; set the current state to $s'$ and repeat
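The simulation procedure above can be sketched as follows; the $3$-state generator is hypothetical, with the last state absorbing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generator: non-negative off-diagonals, zero row sums.
Q = np.array([
    [-0.5,  0.4,  0.1],
    [ 0.3, -0.7,  0.4],
    [ 0.0,  0.0,  0.0],   # state 2 is absorbing
])

def simulate(Q, s, t_max):
    """Simulate one path of the chain from state s up to time t_max."""
    t, path = 0.0, [(0.0, s)]
    while True:
        q_s = -Q[s, s]
        if q_s == 0.0:        # absorbing state: no further jumps
            return path
        t += rng.exponential(1.0 / q_s)   # holding time ~ Exp(q_s)
        if t > t_max:
            return path
        # Jump probabilities: Q[s, s'] / q_s for s' != s.
        probs = np.maximum(Q[s], 0.0) / q_s
        s = rng.choice(len(probs), p=probs)
        path.append((t, s))

path = simulate(Q, 0, 50.0)
print(path[-1])   # final (time, state); state 2 is typically reached
```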

Estimation of Non-homogeneous Continuous Time Markov Chains and the Aalen-Johansen Estimator

In the paper by Lando and Skodeberg, from which the prior examples of rating migration matrices were taken, it is claimed that a non-homogeneous continuous time Markov chain is a better model of rating migration than a homogeneous discrete time Markov chain.

Very little of the academic literature deals with non-homogeneous continuous time Markov chains, but we discuss them briefly. Let $P(t,t')$ denote the transition probabilities between time $t$ and time $t'$, that is: \begin{eqnarray} P_{s,s'}(t,t') = \mathbb{P}\left( X_{t'} = s' \right|\left. X_t = s\right) \end{eqnarray} Note that the analogue of the semigroup property for non-homogeneous chains is: \begin{eqnarray} P(t,t')P(t',t'') = P(t,t'') \end{eqnarray}

For a non-homogeneous continuous time Markov chain, the generator matrix is time dependent. In particular, under certain regularity conditions, there are matrices $Q(t)$ such that: \begin{eqnarray} \frac{\partial P(t, t')}{\partial t} = -Q(t) P(t,t') \end{eqnarray} Given the generator matrices as a function of time, $Q(t)$, one can recover the transition matrices as a function of start and end time, $P(t,t')$.

Lando and Skodeberg use a non-parametric estimator of the parameters of the non-homogeneous continuous time Markov chain. Estimators like the maximum likelihood estimators discussed in the last lecture are parametric estimators that have a finite number of parameters. However, processes like the non-homogeneous continuous time Markov chains are parameterized by an infinite number of parameters since there is a generator matrix $Q(t)$ for each time $t$. Maximum likelihood estimators, while they can perform well when there are a finite number of parameters, don't perform well for an infinite number of parameters. To estimate the parameters of processes like a non-homogeneous continuous time Markov chain, we need a non-parametric estimator.

The estimator that Lando and Skodeberg use is called the Aalen-Johansen estimator. Suppose that rating changes occur at times $T_1, T_2, \ldots$. Let $Y_s\left(T_i\right)$ denote the number of firms in state $s$ just prior to time $T_i$. Let $\Delta N_{s,s'}\left(T_i\right)$ denote the number of firms which change rating from $s$ to $s'$ at time $T_i$. The Aalen-Johansen estimator for the probability transition matrix from time $t_1$ to time $t_2$ is given by: \begin{eqnarray} \hat{P}\left(t_1,t_2\right) = \prod_{\left\{i:t_1 \lt T_i \leq t_2\right\}} \left( I + \Delta A\left(T_i\right)\right) \end{eqnarray} where, for $s \neq s'$: \begin{eqnarray} \Delta A_{s,s'}\left(T_i\right) = \frac{\Delta N_{s,s'}\left(T_i\right)}{Y_s\left(T_i\right)} \end{eqnarray} and the diagonal entries are $\Delta A_{s,s}\left(T_i\right) = -\sum_{s'\neq s}\Delta A_{s,s'}\left(T_i\right)$, so that each factor $I + \Delta A\left(T_i\right)$ is a stochastic matrix. Note that this estimator requires multiple paths of the process: a single path would not be sufficient to estimate all the transitions at a given time.
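Here is a sketch of the product formula on a tiny synthetic data set; the event times, at-risk counts, and transition counts below are made up for illustration:

```python
import numpy as np

n_states = 3  # states 0, 1, 2 (state 2 absorbing, playing the role of D)

# Synthetic event data: (time, firms at risk per state, transition counts).
events = [
    (0.5, np.array([100, 50, 0]), {(0, 1): 4, (1, 2): 1}),
    (1.2, np.array([ 96, 53, 1]), {(1, 0): 2, (0, 2): 1}),
    (1.8, np.array([ 97, 51, 2]), {(0, 1): 3}),
]

def aalen_johansen(events, t1, t2, n_states):
    """Product of (I + dA(T_i)) over event times T_i in (t1, t2]."""
    P_hat = np.eye(n_states)
    for time, Y, dN in events:         # events sorted by time
        if not (t1 < time <= t2):
            continue
        dA = np.zeros((n_states, n_states))
        for (s, s2), count in dN.items():
            dA[s, s2] = count / Y[s]
        # Diagonal chosen so each row of (I + dA) sums to 1.
        np.fill_diagonal(dA, -dA.sum(axis=1))
        P_hat = P_hat @ (np.eye(n_states) + dA)
    return P_hat

P_hat = aalen_johansen(events, 0.0, 2.0, n_states)
print(P_hat)  # entry [s, s'] estimates P(X_{t2} = s' | X_{t1} = s)
```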

The previous transition matrix that we presented was calculated using the maximum likelihood estimator for discrete-time Markov chains from the previous lecture. The table below shows the difference between the estimates of the expected time to default (in years) based on that approach and on the Aalen-Johansen estimator presented by Lando and Skodeberg for the year $1997$. Lando and Skodeberg don't say much about it, but one interesting outcome is that the expected times to default are much longer under the Aalen-Johansen estimator.

Rating Discrete MLE Aalen-Johansen
AAA 167 284
AA 157 261
A 147 238
BBB 130 214
BB 100 192
B 57 148
CCC 18 63
D 0 0

Likelihood Ratio Tests of Some Effects

Lando and Skodeberg go on to investigate another model which attempts to explain the non-homogeneity of the transition probabilities of rating migrations in terms of covariates, that is, additional variables not considered in the model. Note that this is a divergence from the Markov assumption but often justified in practice. The author has found similar results in mortgage modeling in industry using very similar models.

Let $Z(t)$ denote a covariate of interest which is known at time $t$ and is assumed to be related to the probability of a ratings transition, e.g. the length of time spent in the current rating. The particular model that Lando and Skodeberg propose is the following: \begin{eqnarray} Q_{s,s'}\left(t, Z(t)\right) = \alpha_{s,s'}(t) \exp\left(\beta_{s,s'} Z(t)\right)\tag{5}\label{exp} \end{eqnarray} where $\alpha_{s,s'}(t)$ is an unknown non-negative function and $\beta_{s,s'}$ is an unknown real parameter. Lando and Skodeberg use a method from a family of techniques called partial likelihood, in which one ignores some of the data in a way that allows one to ignore some of the parameters. In the case of model (\ref{exp}), the parameters are the function $\alpha_{s,s'}(t)$ and the matrix $\beta$. The function $\alpha_{s,s'}(t)$ is nonparametric, so traditional maximum likelihood wouldn't work for this model. However, partial likelihood can provide a provably consistent estimator. We do not discuss further details of partial likelihood here.

Using the partial likelihood approach, Lando and Skodeberg use likelihood ratio tests to test the significance of the $\beta$ parameters, in particular, whether $\beta_{i,i+1} = 0$ and whether $\beta_{i,i-1} = 0$, where we assume the ratings are in order of default risk. They look at two important covariates: the previous transition and the duration in the current rating. In order to ensure sufficient data, they only look at transitions to a neighboring rating. Their conclusions are as follows:

    Previous transition:
      If the previous transition was a downgrade, there is a significant increase in the probability of a subsequent downgrade from most ratings. If the previous transition was an upgrade, there is no significant increase in the probability of a subsequent upgrade from most ratings.
    Duration in previous rating:
      The probability of downgrade decreases as time in the current rating increases, in a statistically significant way, for all ratings. The probability of upgrade also decreases as time in the current rating increases, significantly for most ratings.


References

  1. The main reference on credit rating migrations for this lecture is:

    Lando, David and Skodeberg, Torben M. Analyzing rating transitions and rating drift with continuous observations. Journal of Banking and Finance 26 (2002), 423-444.

  2. A reference for continuous time Markov chains is:

    Kijima, Masaaki. Markov Processes for Stochastic Modeling, chapter 4. Chapman and Hall, 1997.