Numerically estimating parameters for a GLM using Iteratively Reweighted Least Squares and LU decomposition
Learning GLM theory and writing code in C++ using Eigen
3 January 2025
One of the most fundamental models in statistics is the linear model. For an \(m \times n\) matrix \(\mathbf X\) representing \(m\) observations of \(n\) features, our linear model takes the following form
\[ \begin{align*} \hat{Y} &= \mu(X) \\ &= \mathbf X \mathbf \beta \end{align*} \]
One of the assumptions for the basic linear model is \(Y_i | X_i \sim \mathcal{N} (\mu(X_i), \sigma)\) where \(\mu (X_i) = X_i \mathbf \beta\). This means that, given our predictor, the response variable is normally distributed, which implies that our error terms are normal. However, if we’re trying to predict the outcome of binary data then this model could not possibly work: for a given \(X_i\) our error can only be \(1 - X_i\mathbf \beta\) when \(Y_i\) is 1 and \(-X_i \mathbf \beta\) when \(Y_i\) is 0, which is clearly not normal as it can only take on two values.
Generalised Linear Models generalise the basic linear regression model for situations where the response variable is not normally distributed. GLMs make the assumption that the response variable comes from a distribution in the exponential family which contains many commonly used distributions such as the normal distribution, the beta distribution, the binomial distribution and the Poisson distribution.
A distribution from the exponential family has the following form
\[ f(x) = \operatorname{exp}\left\{ \frac{x\theta - b(\theta)}{a(\phi)} + h(x, \phi) \right\} \]
And we call \(\theta\) the location parameter and \(\phi\) the scale parameter.
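For example, the normal distribution with known variance \(\sigma^2\) fits this form:
\[ f(x) = \operatorname{exp}\left\{ \frac{x\mu - \mu^2/2}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right\} \]
so \(\theta = \mu\), \(b(\theta) = \theta^2/2\), \(a(\phi) = \sigma^2\) and \(h(x, \phi) = -\frac{x^2}{2\phi} - \frac{1}{2}\log(2\pi\phi)\).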
While learning about GLMs, one thing that I noticed about the exponential family is that there are so many different but equivalent parameterisations for it. The textbooks, lecture slides, YouTube videos and exercise sheets that I referenced all used different formulations that defined the same class of distributions.
With GLMs we try to predict a function of the response variable as a linear function. This means that our models are now of the following form
\[ \begin{align*} g(\hat Y) &= \mathbf X \mathbf \beta \\ \hat Y &= g^{-1}(\mathbf X \mathbf \beta) \end{align*} \]
We call \(g\) the “link function”. It turns out that when our response variable \(Y_i\) follows a distribution in the exponential family, choosing \(g\) so that \(g(\mu) = \theta\) from our parameterisation gives us nice statistical properties, and we call this choice of link function the “canonical link”.
So what we effectively do is reparameterise our distribution as a member of the exponential family and try to learn the exponential family parameter \(\theta\). Our link function then allows us to recover the parameter for the original form of our distribution from the value of \(\theta\) that we’ve learned.
I derived the exponential family form for the Binomial and Bernoulli distributions as practice and the logit function fell out, which blew my mind since it explains why we use its inverse, the logistic sigmoid, for logistic regression: it’s just a special case of the GLM where our response variable is binomially distributed.
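To sketch the Bernoulli case: for \(y \in \{0, 1\}\) with mean \(p\),
\[ \begin{align*} f(y) &= p^y(1-p)^{1-y} \\ &= \operatorname{exp}\left\{ y\log\frac{p}{1-p} + \log(1-p) \right\} \end{align*} \]
so the natural parameter is \(\theta = \log\frac{p}{1-p} = \operatorname{logit}(p)\), with \(b(\theta) = \log(1 + e^{\theta})\) and \(a(\phi) = 1\).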
Now that we have our model, how do we fit it? One way is by finding the Maximum Likelihood Estimate for our parameter \(\theta\). Our exponential family is in a pretty nice form to find the log likelihood and it looks like this
\[ \ell(\theta | y) = \frac{y\theta - b(\theta)}{a(\phi)} + h(y, \phi) \]
Assuming our observations are independent, we would like to maximise the product of the likelihoods which is equivalent to maximising the sum of the log-likelihoods. Our derivative with respect to \(\theta\) looks like this
\[ \frac{d\ell}{d\theta} = \frac{y - b'(\theta)}{a(\phi)} \]
At the optimal value, where this derivative is 0, we get
\[ \begin{align*} \frac{y - b'(\theta)}{a(\phi)} &= 0 \\ y - b'(\theta) &= 0 \end{align*} \]
And after taking expectations conditional on \(x\) we get
\[ \begin{align*} y - b'(\theta) &= 0 \\ \mathbb E \left [ \left. y - b'(\theta) \right | x \right ] &= 0 \\ \mathbb E \left [ \left. y \right | x \right ] &= \mathbb E \left [ \left. b'(\theta) \right | x \right ] \\ \mu(x) &= b'(\theta) \end{align*} \]
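As a sanity check against the Bernoulli form above: with \(b(\theta) = \log(1 + e^{\theta})\),
\[ \mu = b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \frac{1}{1 + e^{-\theta}} = p \]
which is indeed the mean. Differentiating again gives \(b''(\theta) = p(1-p)\), the Bernoulli variance, which is exactly what the variance function in the code further down computes.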
Since the canonical link satisfies \(g(\mu) = \theta\), and we have just shown that \(\mu = b'(\theta)\), the canonical link must take the form \(g(\mu) = {b'}^{-1}(\mu)\). Combining \(\mu = b'(\theta)\) with our model \(\mu = g^{-1}(\mathbf X \mathbf \beta)\), we also have
\[ \begin{align*} \mathbf \theta &= {b'}^{-1}(\mu) \\ \mathbf \theta &= {b'}^{-1}(g^{-1}(\mathbf X \mathbf \beta)) \end{align*} \]
Using the result from earlier, when \(g\) is the canonical link this becomes
\[ \begin{align*} \mathbf \theta &= {b'}^{-1}({b'}(\mathbf X \mathbf \beta)) \\ \mathbf \theta &= \mathbf X \mathbf \beta \end{align*} \]
So when we maximise the log likelihood, our linear model directly learns the \(\theta\) parameter.
For a set of observations, our log-likelihood looks like this when we use the canonical link
\[ \ell(\mathbf Y) \propto \sum_{i} Y_i X_i \beta - b(X_i \beta) \]
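Plugging in the Bernoulli quantities from earlier (\(b(\theta) = \log(1 + e^{\theta})\), \(a(\phi) = 1\)) gives the familiar logistic regression log-likelihood
\[ \ell(\mathbf Y) \propto \sum_{i} Y_i X_i \beta - \log\left(1 + e^{X_i \beta}\right) \]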
Differentiating our likelihood function with respect to \(\beta\) we get
\[ \begin{align*} \frac{d\ell}{d\beta} &\propto \sum_{i=1}^m Y_iX_i - X_i b'(X_i\beta) \\ &\propto \sum_{i=1}^m Y_iX_i - X_i\mu_i \\ &\propto \sum_{i=1}^m Y_iX_i - \mu_i X_i \\ &\propto \sum_{i=1}^m (Y_i - \mu_i) X_i \\ &\propto (\mathbf Y - \mathbf \mu)^{\top}\mathbf X \\ &\propto (\mathbf Y - g^{-1}(\mathbf X \beta))^{\top}\mathbf X \\ &\propto \mathbf Y^{\top} \mathbf X - g^{-1}(\mathbf X \beta)^{\top} \mathbf X \end{align*} \]
If the link function is the identity, which is the canonical link for the normal distribution, then setting this to zero gives
\[ \begin{align*} \beta^{\top} \mathbf X^{\top} \mathbf X &= \mathbf Y^{\top} \mathbf X \\ \beta^{\top} &= \mathbf Y^{\top} \mathbf X (\mathbf X^{\top} \mathbf X)^{-1} \\ \beta &=(\mathbf X^{\top} \mathbf X)^{-1} \mathbf X^{\top} \mathbf Y \\ \end{align*} \]
Which I recognise as the solution to the usual least squares regression.
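As an aside, this closed form is straightforward to compute with Eigen; here is a minimal sketch (the function name is just for illustration), solving the normal equations rather than forming the inverse explicitly:
#include <Eigen/Dense>
using namespace Eigen;
// Ordinary least squares: beta = (X^T X)^{-1} X^T Y,
// computed by solving the normal equations X^T X beta = X^T Y.
VectorXd least_squares(const MatrixXd &X, const VectorXd &Y) {
  return (X.transpose() * X).ldlt().solve(X.transpose() * Y);
}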
When we use a different link function then after some algebra we find that we need to solve \(\mathbf X^{\top} W (\mathbf Y - \mathbf \mu) = 0\), where \(W\) is a diagonal matrix and \(W_{i, i}\) holds the variance of a response with mean \(\mu_i\). We have a few different ways to solve this: one is the standard Newton-Raphson procedure for numerically solving equations, but another interesting technique is called Iteratively Reweighted Least Squares.
IRLS starts by using Fisher’s scoring algorithm, a variation of Newton-Raphson, which tells us that if we have a guess for our solution, \(\beta\), then we can find a better guess \(\beta'\) by calculating the following
\[ \begin{align*} \beta' &= (\mathbf X^{\top}W\mathbf X)^{-1}\mathbf X^{\top} W (\mathbf Y - \mathbf \mu + \mathbf X \beta) \\ &= (\mathbf X^{\top}W\mathbf X)^{-1}\mathbf X^{\top} W (\mathbf Y - g^{-1}(\mathbf X \beta) + \mathbf X \beta) \end{align*} \]
Notice that this can be rearranged slightly so that it has the form of a weighted least squares problem, which has a closed form solution.
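Concretely, if we define the working response
\[ \mathbf Z = \mathbf X \beta + W^{-1}(\mathbf Y - \mathbf \mu) \]
then the update can be written as
\[ \beta' = (\mathbf X^{\top}W\mathbf X)^{-1}\mathbf X^{\top} W \mathbf Z \]
which is exactly the solution of a weighted least squares regression of \(\mathbf Z\) on \(\mathbf X\) with weights \(W\); this \(\mathbf Z\) is the quantity the code below calls Z.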
Taking advantage of this, the IRLS algorithm effectively solves for \(\beta\) using the scoring algorithm but uses the weighted least-squares form of the updates to get our new guess for \(\beta\).
Now that I’ve learned the theory behind GLMs and how to fit them, I can code up a model in C++ using Eigen. My code looks like this
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

using namespace Eigen;

const int ITERATIONS = 10;

class Bernoulli {
public:
  // Inverse of the canonical (logit) link: the logistic sigmoid.
  VectorXd canonical_link_inverse(const VectorXd &x) {
    return (1.0 / (1.0 + (-x.array()).exp())).matrix();
  }
  // Variance of a Bernoulli variable with mean x: x * (1 - x).
  VectorXd variance(const VectorXd &x) {
    return (x.array() * (1.0 - x.array())).matrix();
  }
};

template <typename Distribution> class GLM {
  Distribution distribution;

public:
  VectorXd fit(const MatrixXd &X, const VectorXd &Y) {
    // Arbitrary initial guess
    VectorXd beta(4);
    beta << 0, 0, 0, 0;

    for (int i = 0; i < ITERATIONS; ++i) {
      VectorXd g_mu = X * beta;
      VectorXd mean = distribution.canonical_link_inverse(g_mu);

      VectorXd w_values = distribution.variance(mean);
      MatrixXd W = w_values.asDiagonal();

      // Working response for the weighted least squares step.
      VectorXd Z = X * beta + W.inverse() * (Y - mean);

      // Solve X^T W X beta = X^T W Z instead of forming the inverse.
      MatrixXd XTWX = X.transpose() * W * X;
      VectorXd XTWZ = X.transpose() * W * Z;
      beta = XTWX.ldlt().solve(XTWZ);
    }
    return beta;
  }
};

int main() {
  GLM<Bernoulli> glm;

  MatrixXd X(4, 4);
  X << 1, 3, 9, 12,
       5, 8, 2, 3,
       2, 1, 10, 10,
       6, 9, 3, 4;
  X = X.transpose().eval();

  VectorXd Y(4);
  Y << 0, 1, 0, 1;

  VectorXd weights = glm.fit(X, Y);
  std::cout << "Weights: " << weights << std::endl;
  return 0;
}
In my code I don’t actually calculate \(\beta' = (\mathbf X^{\top}W\mathbf X)^{-1}\mathbf X^{\top} W (\mathbf Y - g^{-1}(\mathbf X \beta) + \mathbf X \beta)\). Instead, I rewrite the equality as \( \mathbf X^{\top}W\mathbf X\beta' = \mathbf X^{\top} W (\mathbf Y - g^{-1}(\mathbf X \beta) + \mathbf X \beta) \) and call an Eigen routine to directly solve for \(\beta'\).
I did this because it’s faster to solve for \(\beta\) directly via a matrix factorisation (here Eigen’s ldlt(), an LDLT Cholesky decomposition, which applies because \(\mathbf X^{\top}W\mathbf X\) is symmetric; an LU decomposition works just as well for a general matrix) rather than both invert the matrix and perform the matrix multiplication with the inverse. In an earlier version of this project I was solving a similar problem in Python and I wrote the LU decomposition and backward-forward substitution manually.
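To illustrate the difference, here is a minimal sketch (the function names are just for illustration) of the two approaches for a system \(A\beta = b\) with \(A = \mathbf X^{\top}W\mathbf X\) and \(b = \mathbf X^{\top}W\mathbf Z\):
#include <Eigen/Dense>
using namespace Eigen;
// Explicit inverse: factorises A and then multiplies by the inverse.
VectorXd solve_via_inverse(const MatrixXd &A, const VectorXd &b) {
  return A.inverse() * b;
}
// Decomposition-based solve: one LU factorisation with partial pivoting,
// followed by forward and backward substitution.
VectorXd solve_via_lu(const MatrixXd &A, const VectorXd &b) {
  return A.partialPivLu().solve(b);
}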
I had to learn a lot to get an understanding of GLMs, but I found it really cool that logistic regression is just a special case of something much more general.
References
[1]: Rigollet, Philippe. Statistics for Applications, Lecture 10. Massachusetts Institute of Technology: MIT OpenCourseWare.
[2]: Montgomery, Douglas C., Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to linear regression analysis. John Wiley & Sons, 2021.