Lehmer’s algorithm

20 July 2023

Computing the Greatest Common Divisor of multiple precision integers

Here, I implement greatest common divisor algorithms in C++ and Fortran.

I slowly build up to implementing Lehmer’s algorithm for calculating the greatest common divisor on multiple precision integers and provide my own intuition for why it works.

In the future, I would like to write parallelised code for computing the Greatest Common Divisor, utilising the algorithm of Chor and Goldreich [4], and potentially write a GPU-accelerated program in a similar manner.


I haven’t had much to do recently so I decided to see how fast I can write a program to calculate the greatest common divisor of two integers.

I’ll be testing several different algorithms so it would be nice to have a set of test cases I can use to benchmark them - I can generate them with this script.

#!/usr/bin/env python3

import math
import random

random.seed(0)
with open("test-cases", "w") as file:
  for i in range(3, 21):
    number_of_bits = 2**i
    for test_case in range(1000):
      minimum = 2**(number_of_bits - 1)
      maximum = 2**number_of_bits - 1
      left = random.randint(minimum, maximum)
      right = random.randint(minimum, maximum)
      file.write(f"{left} {right} {math.gcd(left, right)}\n")

I’m generating test cases that are grouped by the number of bits the inputs contain. I start with test cases where both numbers contain \(2^3 = 8\) bits and go upwards in powers of two until I get to \(2^{20}\) bit integers. \(2^{n - 1}\) is the smallest \(n\)-bit integer and \(2^n - 1\) is the largest integer that can be made using \(n\) bits. The random.seed(0) call at the top keeps things deterministic and reproducible.
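
As a quick sanity check of those bounds, Python’s bit_length confirms them for \(n = 8\):

n = 8
minimum = 2**(n - 1)    # 128, the smallest 8-bit integer
maximum = 2**n - 1      # 255, the largest 8-bit integer
assert minimum.bit_length() == 8
assert maximum.bit_length() == 8
assert (minimum - 1).bit_length() == 7   # 127 fits in 7 bits
assert (maximum + 1).bit_length() == 9   # 256 needs 9 bits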

I have access to a decent number of CPUs, so it would be nice to split the work across them.

#!/usr/bin/env python3
#SBATCH -t 1-0
#SBATCH --mem=128G
#SBATCH -c 128

N_EXAMPLES = 1000

import math
import random
import itertools
import multiprocessing

random.seed(0)
def generate_example(number_of_bits):
  minimum = 2**(number_of_bits - 1)
  maximum = 2**number_of_bits - 1
  left = random.randint(minimum, maximum)
  right = random.randint(minimum, maximum)
  return (left, right)

def create_testcase(example):
  left, right = example
  return left, right, math.gcd(left, right)

with open("test-cases", "w") as file:
  examples = itertools.chain.from_iterable([
      [generate_example(number_of_bits) for i in range(N_EXAMPLES)]
      for number_of_bits in [2**i for i in range(3, 21)]
  ])
  with multiprocessing.Pool(128) as pool:
    testcases = pool.imap(create_testcase, examples)
    for left, right, answer in testcases:
      file.write(f"{left} {right} {answer}\n")

Here, I create the example inputs in advance. I then split the task of calculating the greatest common divisor for each of them across the different CPUs.

For large numbers, calculating the GCD will take a while so I used the default chunk size of 1. There are situations where it makes sense to increase the chunk size but I don’t think there is much need here as my program will spend much more time calculating GCDs than it will spend on synchronisation.
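
For contrast, this is roughly what a larger chunk size looks like on a toy workload - the square function, the pool of 4 workers and the chunk size of 64 here are purely illustrative and not part of my benchmark setup:

import multiprocessing

def square(x):
  return x * x

if __name__ == "__main__":
  with multiprocessing.Pool(4) as pool:
    # A larger chunksize hands each worker a batch of inputs at once,
    # reducing synchronisation overhead at the cost of coarser load balancing.
    results = list(pool.imap(square, range(1000), chunksize=64))
  print(results[:5])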

The SBATCH directives at the top are arguments to the sbatch command for the Slurm workload manager, which I’m using to submit jobs to the machine.

The naive algorithm

Ok, so you need to start somewhere.

We know that the greatest common divisor has to be less than or equal to the smaller of the two numbers, so we can try every number up to the smaller of the two and check whether it divides both of them. Simple!

integer function naive_gcd(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller
    integer :: divisor
    logical :: divides_left, divides_right

    naive_gcd = 1
    smaller = min(left, right)

    do divisor = 1, smaller
        divides_left = mod(left, divisor) .eq. 0
        divides_right = mod(right, divisor) .eq. 0
        if (divides_left .and. divides_right) then
            naive_gcd = divisor
        end if
    end do
end function

Awesome! Now, how fast is it?

If the integers have \(n\) bits then we’re searching up to \(2^n\) integers, so this is quite bad. For 8 bits we need to search 255 numbers and for 16 bits this becomes 65535 numbers. Nowadays, integers are normally 32 bits and we would have to search through over 4 billion numbers. This is pretty awful.

We can do a tiny bit better here. We know that any divisor of some number \(n\) is either \(n\) itself or less than or equal to \(\frac{n}{2}\). Why?

This is because if there is a number, \(d\), that divides \(n\) and is not \(n\) itself then it must multiply with some number to make \(n\). Since \(d\) is not \(n\), this number cannot be \(1\) so it must be larger than or equal to 2.

Let’s call this other number \(k\) so \(n = kd\) and \(k = \frac{n}{d}\). Since \(k \geq 2\) or equivalently \(2 \leq k\), \(2 \leq \frac{n}{d}\). Multiplying both sides by \(d\), we get \(2d \leq n\) and \(d \leq \frac{n}{2}\). Sweet! Let’s code this up!

integer function naive_gcd_half(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: divisor
    logical :: smaller_divides_larger
    logical :: divides_left, divides_right

    naive_gcd_half = 1
    smaller = min(left, right)
    larger = max(left, right)

    smaller_divides_larger = mod(larger, smaller) .eq. 0
    if (smaller_divides_larger) then
        naive_gcd_half = smaller
        return
    end if

    do divisor = 1, smaller / 2
        divides_left = mod(left, divisor) .eq. 0
        divides_right = mod(right, divisor) .eq. 0
        if (divides_left .and. divides_right) then
            naive_gcd_half = divisor
        end if
    end do
end function

This code has an extra check: we need to make sure that the smaller number isn’t itself the greatest common divisor before we skip it and only search the numbers up to half of it.

This is an improvement: we’re now searching \(\frac{2^n}{2} = 2^{n - 1}\) numbers, which still isn’t great but is definitely better.

We can make another improvement here - if we know that 2 isn’t a common divisor, then we automatically know that 4, 6, 8, 10 and any other multiple of two cannot be common divisors either, so we can skip all of them! We can repeat this for every number we rule out. This is the same marking trick as the sieve of Eratosthenes: in the worst case we still spend \(O(2^n \log 2^n)\) operations crossing off bad candidates, but those are cheap array writes rather than divisions, so we end up performing far fewer expensive arithmetic operations and the search is much faster.

integer function eratosthenes(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: divisor
    integer :: i
    logical, dimension(:), allocatable :: isnondivisor
    allocate(isnondivisor(left), source = .false.)

    eratosthenes = 0
    if (mod(right, left) .eq. 0) then
        eratosthenes = left
        return
    end if
    do divisor = 1, left / 2
        if (isnondivisor(divisor)) then
            ! already ruled out, skip to the next candidate
            cycle
        end if
        if ((mod(left, divisor) .eq. 0) .and. (mod(right, divisor) .eq. 0)) then
            eratosthenes = divisor
        else
            do i = 1, left / divisor
                isnondivisor(i * divisor) = .true.
            end do
        end if
    end do
end function

Euclid’s original algorithm

Now we can start to get a bit smart about things and take a different approach. If we know some number, \(d\), divides both \(a\) and \(b\), what can we say about \(b - a\)? Well, \(d\) divides \(a\) so there’s some \(k_1\) such that \(a = k_1 d\). Similarly, \(d\) divides \(b\) so there’s some number, \(k_2\), such that \(b = k_2 d\). Using these, \(b - a = k_2 d - k_1 d = d (k_2 - k_1)\). So, \(d\) divides \(b - a\).

Why is this useful? This tells us that every number that divides both \(a\) and \(b\) will also divide \(b - a\), so the greatest common divisor of \(a\) and \(b\) is the same as the greatest common divisor of \(a\) and \(b - a\). This is helpful because we can now search for divisors of \(b - a\), which is smaller than the larger of the two numbers. This lends itself nicely to a recursive algorithm. I will write this iteratively because I can’t be certain that the Fortran compiler will perform tail call optimisation.

integer function naive_euclid(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next

    smaller = min(left, right)
    larger = max(left, right)
    do while ((smaller .ne. 0) .and. (smaller .ne. 1))
        next = larger - smaller
        larger = max(smaller, next)
        smaller = min(smaller, next)
    end do
    naive_euclid = larger
end function

This was the original algorithm presented in Euclid’s Elements:

Let A and C be the two given positive integers; it is required to find their greatest common divisor. If C divides A, then C is a common divisor of C and A, since it also divides itself. And it clearly is in fact the greatest, since no greater number than C will divide C. But if C does not divide A, then continually subtract the lesser of the numbers A, C from the greater, until some number is left that divides the previous one. This will eventually happen, for if unity is left, it will divide the previous number.

At each iteration, I replace the larger of the two numbers, \(b\), with \(b - a\), where \(a\) is the smaller, because the new pair shares the same common divisors. This is conceptually tidier, but in the worst case we still perform \(O(a) = O(2^n)\) arithmetic operations, which is no better than what we had before.

The Euclidean algorithm

We can improve on this by noticing that repeatedly subtracting the same number from another is just computing the remainder after division by that number, so the repeated subtractions can be collapsed into a single mod. This adjustment is quick to make: we can just replace next = larger - smaller with next = mod(larger, smaller). After making this adjustment, the code for this algorithm looks like this.

integer function euclidean(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next

    smaller = min(left, right)
    larger = max(left, right)
    do while ((smaller .ne. 0) .and. (smaller .ne. 1))
        next = mod(larger, smaller)
        larger = max(smaller, next)
        smaller = min(smaller, next)
    end do
    euclidean = larger
end function

One can show that with this adjustment we always require fewer than \(\log_\varphi 3a\) steps, where \(\varphi\) is the golden ratio and \(a\) is the smaller of the two inputs, and on average we perform \(O(\log a)\) steps [1].

This is significant because \(O(\log a) = O(\log 2^n) = O(n)\), so we now have a greatest common divisor algorithm whose expected number of steps is linear in the number of bits of our integers. If the computational cost of each arithmetic operation is proportional to the number of bits involved, the overall complexity scales as \(O(n^2)\). This is neat, but we can go much faster.
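
Consecutive Fibonacci numbers are the classic worst case for the Euclidean algorithm, so a quick way to see this behaviour is to count the division steps they require and compare against the \(\log_\varphi 3a\) bound quoted above (a small check of my own, separate from the benchmark suite):

import math

def euclid_steps(a, b):
  # Count the remainder operations the Euclidean algorithm performs.
  steps = 0
  while b:
    a, b = b, a % b
    steps += 1
  return steps

phi = (1 + math.sqrt(5)) / 2
a, b = 1, 1
for _ in range(40):              # walk up the Fibonacci sequence
  a, b = b, a + b
steps = euclid_steps(b, a)       # consecutive Fibonacci numbers
bound = math.log(3 * a, phi)     # log_phi(3a), with a the smaller input
print(steps, "steps; bound is about", round(bound, 1))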

A binary algorithm

There’s an algorithm that’s better optimised for binary arithmetic. It exploits the fact that for binary numbers, division by two is just a bit-shift operation. The algorithm is based on three facts: \(\gcd(2a, 2b) = 2\gcd(a, b)\); \(\gcd(2a, b) = \gcd(a, b)\) when \(b\) is odd; and \(\gcd(a, b) = \gcd(a, b - a)\).

We begin by dividing both \(a\) and \(b\) by two while they are both even, maintaining a counter, \(k\), of how many times two divides both of them. After that, if only one of the numbers is even, we can keep halving that number while leaving the counter unmodified. Once both numbers are odd, we replace the larger, \(b\), with \(b - a\) and repeat. This is my implementation of this algorithm.

integer function binary(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next
    integer :: k

    k = 0
    smaller = min(left, right)
    larger = max(left, right)

    do while ((mod(smaller, 2) .eq. 0) .and. (mod(larger, 2) .eq. 0))
        smaller = smaller / 2
        larger = larger / 2
        k = k + 1
    end do


    do while (smaller .ne. 0)
        do while (mod(larger, 2) .eq. 0)
            larger = larger / 2
        end do
        do while (mod(smaller, 2) .eq. 0)
            smaller = smaller / 2
        end do

       next = larger - smaller
       larger = max(smaller, next)
       smaller = min(smaller, next)
    end do
    binary = ishft(larger, k)
end function

This algorithm appears in the Chinese text Chiu Chang Suan Shu, or Nine Chapters on the Mathematical Art, written around 1 AD. The following excerpt describes the above algorithm.

If halving is possible, take half.

Otherwise write down the denominator and the numerator, and subtract the smaller from the greater.

Repeat until both numbers are equal.

Simplify with this common value.

This algorithm also has a complexity of \(O(n^2)\), but it is significantly faster in practice because the divisions are replaced by subtractions and bit shifts.

Multiple precision arithmetic

You may have noticed that the test cases I generated include numbers containing up to \(2^{20}\) bits, but the Fortran code I’ve written uses default integers, which only hold numbers of fewer than 32 bits. \(32\) is much less than \(2^{20}\), so this poses a problem: I simply cannot fit the test cases into these integers.

It’s now time to explore multiple precision arithmetic, which will allow me to store and operate on arbitrarily large numbers.

I had lots of ideas here for building my own multi-precision arithmetic library but, unfortunately, sense and reason got the better of me and I decided to use a pre-existing multi-precision arithmetic library. Which, of course, is definitely the correct thing to do albeit a bit less fun.

It’s also around here that I decide to switch to C++ - I love Fortran but I didn’t really want to deal with trying to link libraries with gfortran and I kept stumbling into dead links when searching for Fortran multi-precision libraries.

In C++, the binary GCD algorithm and the Euclidean algorithm look like this using the GNU multi-precision arithmetic library.

void binary(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    int k = 0;
    mpz_inits(smaller, larger, nullptr);
    mpz_set(smaller, left);
    mpz_set(larger, right);
    if (mpz_cmp(smaller, larger) > 0) {
        mpz_swap(smaller, larger);
    }
    while (mpz_divisible_ui_p(smaller, 2) && mpz_divisible_ui_p(larger, 2)) {
        ++k;
        mpz_tdiv_q_ui(smaller, smaller, 2);
        mpz_tdiv_q_ui(larger, larger, 2);
    }
    while (mpz_cmp_ui(smaller, 0)) {
        while (mpz_divisible_ui_p(larger, 2)) {
            mpz_tdiv_q_ui(larger, larger, 2);
        }
        while (mpz_divisible_ui_p(smaller, 2)) {
            mpz_tdiv_q_ui(smaller, smaller, 2);
        }
        mpz_sub(larger, larger, smaller);
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
    }
    mpz_mul_2exp(larger, larger, k);
    mpz_set(result, larger);
    mpz_clears(smaller, larger, nullptr);
}

void euclidean(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    mpz_inits(smaller, larger, nullptr);
    if (mpz_cmp(left, right) < 0) {
        mpz_set(smaller, left);
        mpz_set(larger, right);
    }
    else {
        mpz_set(smaller, right);
        mpz_set(larger, left);
    }

    while (mpz_cmp_ui(smaller, 0)) {
        mpz_mod(larger, larger, smaller);
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
    }
    mpz_set(result, larger);
    mpz_clears(smaller, larger, nullptr);
}

Lehmer’s algorithm

Lehmer’s algorithm is a GCD algorithm for arbitrary precision arithmetic. This is the algorithm used by the Python standard library’s gcd function. CPython represents an integer as an array of digits, each holding 30 bits of the value in a 32-bit word (on typical 64-bit builds), together with a size field that records both the number of digits and the sign.
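
You can check the digit size your own CPython build uses via sys.int_info:

import sys

# On a typical 64-bit CPython build this reports something like
# sys.int_info(bits_per_digit=30, sizeof_digit=4, ...)
print(sys.int_info)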

The idea with Lehmer’s algorithm is to avoid multi-precision operations and make use of single-precision operations where possible. We can do this after first noticing that the quotients that appear are usually small: the quotient of two random numbers is almost always less than 1000, which we can observe experimentally with the following program.

import Control.Arrow
import System.Random
import Data.List

seed = mkStdGen 0

generateNumbers :: StdGen -> [Int]
generateNumbers = take 1000 . unfoldr (Just . randomR (1, 2^20)) -- start at 1 to avoid zero denominators

main = putStrLn (show percentage ++ "% of quotients were below 1000")
    where
    (numerators, denominators) = generateNumbers *** generateNumbers $ split seed
    quotients = zipWith div numerators denominators
    percentage = 100 * count (<=1000) quotients `div` length quotients
    count = (length .) . filter

The above reports that 99% of quotients were below 1000. We use this fact by approximating the quotient of two large multi-precision numbers as the quotient of two numbers that fit in single-precision words. We actually take two approximations - one an overestimate and the other an underestimate. Because these quotients are small, there are only a few possibilities for the true quotient lying between the two bounds; in fact, about 68% of quotients are 1, 2 or 3.

We obtain these approximations by looking at the leading digits of each number. In the case of GNU MP these digits are machine-word limbs, but I will use base 10 for the purposes of the example.

If I’d like to find the greatest common divisor of 27182818 and 10000000 using the Euclidean algorithm, I’d begin by finding the remainder after dividing 27182818 by 10000000. I can do this by under- and over-approximating each of the two numbers: 27182818 is approximated from below by 27180000 and from above by 27190000, and 10000000 is approximated by 10000000 and 10010000. We can thus produce an under-approximation and an over-approximation of the true quotient: \(\frac{27180000}{10010000}\) and \(\frac{27190000}{10000000}\). These are equal to \(\frac{2718}{1001}\) and \(\frac{2719}{1000}\), which can be computed using single-precision division, and both integer quotients come out to 2. So the true quotient, being bounded by these approximations, must also be 2.
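
The same arithmetic, spelled out with plain integer division:

larger, smaller = 27182818, 10000000
true_quotient = larger // smaller        # 2

# Keep the leading four decimal digits of each number, as above.
larger_under, larger_over = 2718 * 10**4, 2719 * 10**4
smaller_under, smaller_over = 1000 * 10**4, 1001 * 10**4

q_under = larger_under // smaller_over   # same as 2718 // 1001, i.e. 2
q_over = larger_over // smaller_under    # same as 2719 // 1000, i.e. 2
assert q_under == true_quotient == q_over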

Lehmer’s algorithm works by running the Euclidean algorithm on these single-precision approximations until the two quotients stop agreeing. Up to that point every quotient is known to be exact, so we can combine the accumulated quotients and apply them to the multi-precision numbers in one go, without ever performing a multi-precision division.
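
To make that concrete, here is a toy base-10 sketch of a single Lehmer round - this is my own illustration in Python, not the GMP-based code that follows. It runs the Euclidean algorithm on under- and over-estimates built from the leading digits, and once the two candidate quotients disagree (or an estimate hits zero) it applies the quotients accumulated so far to the full numbers:

import math

def lehmer_round(larger, smaller, digits=4):
  # Work with the leading `digits` decimal digits of the larger number.
  shift = 10 ** max(len(str(larger)) - digits, 0)
  l_u, s_u = larger // shift, smaller // shift   # under-estimates
  l_o, s_o = l_u + 1, s_u + 1                    # over-estimates

  # Invariant: current larger = a_l*smaller + b_l*larger and
  #            current smaller = a_s*smaller + b_s*larger,
  # in terms of the original arguments.
  a_l, b_l, a_s, b_s = 0, 1, 1, 0
  while s_u > 0 and s_o > 0 and l_u // s_o == l_o // s_u:
    q = l_u // s_o                               # provably the exact quotient
    l_u, l_o, s_u, s_o = s_u, s_o, l_u - q * s_o, l_o - q * s_u
    a_l, b_l, a_s, b_s = a_s, b_s, a_l - q * a_s, b_l - q * b_s

  if (a_l, b_l, a_s, b_s) == (0, 1, 1, 0):
    # No single-precision step succeeded: do one real division instead.
    return smaller, larger % smaller
  # Apply every accumulated (exact) quotient in one go.
  return a_l * smaller + b_l * larger, a_s * smaller + b_s * larger

a, b = 2**607 - 1, 2**606 + 12345
while b:
  a, b = lehmer_round(a, b)
assert a == math.gcd(2**607 - 1, 2**606 + 12345)
print(a)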

With the GNU MP library, a number is stored as an array of machine-word digits called “limbs”, which have the type mp_limb_t (typically 64 bits wide); I access them using mpz_getlimbn. I came up with the following code.

void lehmer(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    mpz_t smaller_prime, larger_prime, tmp;
    mpz_inits(smaller, larger, smaller_prime, larger_prime, tmp, nullptr);

    /* under- and over-estimates of each number*/
    mp_limb_t smaller_u, smaller_o, larger_u, larger_o;
    /* Corresponding quotients and remainders */
    mp_limb_t q_u, r_u, q_o, r_o;

    mpz_set(smaller, left);
    mpz_set(larger, right);


    if (mpz_cmp(smaller, larger) > 0) {
        mpz_swap(smaller, larger);
    }
    /* Begin as with the Euclidean algorithm */
    while (mpz_cmp_ui(smaller, 0)) {
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
        if (!mpz_cmp_ui(smaller, 0)) {
            break;
        }
        /* Read the limb of each number at the position of larger's most
         * significant limb (this is zero for the smaller number if it has
         * fewer limbs) */
        smaller_u = mpz_getlimbn(smaller, mpz_size(larger) - 1);
        larger_u = mpz_getlimbn(larger, mpz_size(larger) - 1);

        smaller_o = smaller_u + 1;
        larger_o = larger_u + 1;

        /*
         * The larger number is much larger than the smaller number
         *  We are forced to compute a multi-precision division
         *  This happens extremely rarely
         * */
        if (!smaller_u) {
            mpz_mod(larger, larger, smaller);
            continue;
        }

        // After some steps of the Euclidean algorithm
        //  the two numbers become a linear combination
        //  of the original values
        // Use a, b to denote the coefficients on the original
        //  smaller and larger number respectively
        // The two coefficients in each pair have opposite signs,
        //  so they are stored as signed integers
        long
          a_l = 0,
          b_l = 1,
          a_s = 1,
          b_s = 0;
        while (1) {
            /*
             * If either approximation of the smaller number has reached
             *  zero we cannot form both quotients, so take the exit
             *  path below
             * */
            if (smaller_u == 0 || smaller_o == 0) {
                q_u = 0;
                q_o = 1;
            } else {
                q_u = larger_u / smaller_o;
                q_o = larger_o / smaller_u;
            }

            if (q_u != q_o) {
                if (a_l == 0 && b_l == 1 && a_s == 1 && b_s == 0) {
                    /* No single-precision step succeeded */
                    mpz_mod(larger, larger, smaller);
                    break;
                }
                /*
                 * Apply the accumulated coefficients to the multi-precision
                 *  numbers; the coefficients carry signs, hence the signed
                 *  multiplications
                 * */
                mpz_mul_si(smaller_prime, smaller, a_s);
                mpz_mul_si(tmp, larger, b_s);
                mpz_add(smaller_prime, smaller_prime, tmp);

                mpz_mul_si(larger_prime, smaller, a_l);
                mpz_mul_si(tmp, larger, b_l);
                mpz_add(larger_prime, larger_prime, tmp);

                mpz_set(smaller, smaller_prime);
                mpz_set(larger, larger_prime);
                break;
            }
            // Perform an iteration of Lehmer's algorithm
            r_u = larger_u - q_u * smaller_o;
            r_o = larger_o - q_o * smaller_u;

            // This remainder is less than both of these numbers
            // The new larger number is the original smaller number
            //  and inherits its under- and over-estimates
            // The new smaller number is the remainder
            larger_u = smaller_u;
            larger_o = smaller_o;

            smaller_u = r_u;
            smaller_o = r_o;

            // We have 
            // remainder = larger - q * smaller
            // larger' = smaller
            // smaller' = remainder
            // if larger = a_l * smaller_0 + b_l * larger_0
            //    smaller = a_s * smaller_0 + b_s * larger_0
            // then
            //    smaller' = (a_l - q * a_s) * smaller_0 + (b_l - q * b_s) * larger_0
            //    larger' = a_s * smaller_0 + b_s * larger_0
            // thus
            //
            // a_s' = a_l - q * a_s
            // b_s' = b_l - q * b_s
            //
            // a_l' = a_s
            // b_l' = b_s

            long
              _a_l = a_s,
              _b_l = b_s;
            // q_u and q_o are equal here, so either is the exact quotient
            a_s = a_l - (long) q_u * a_s;
            b_s = b_l - (long) q_u * b_s;
            a_l = _a_l;
            b_l = _b_l;
        }
    }
    mpz_set(result, larger);
    mpz_clears(smaller, larger, smaller_prime, larger_prime, tmp, nullptr);
}

Writing this was quite fun and seeing it work was truly awesome. I gave it two 62-digit numbers and it returned the greatest common divisor in an instant. This algorithm is insanely fast, but I’m not done yet.

Parallelisation

So my code is fast, which is cool but not extraordinary. As I said, this is the same algorithm that the Python math library uses, although I beat Python since my C++ code, using the GMP library, is able to process the input faster.

I mentioned earlier that I have access to a good number of CPU cores, so I’m obligated to use them. I was able to find two algorithms for computing the GCD in parallel: the first due to Brent and Kung [3] and the second due to Chor and Goldreich [4]. In the future, I would like to read through these papers and implement a parallel GCD algorithm.

[1]: Knuth, Donald E. “The art of computer programming. Volume 2: Seminumerical algorithms.”

[2]: Möller, Niels. “On Schönhage’s algorithm and subquadratic integer GCD computation.” Mathematics of Computation 77.261 (2008): 589-607.

[3]: R. P. Brent and H. T. Kung, “A systolic algorithm for integer GCD computation,” 1985 IEEE 7th Symposium on Computer Arithmetic (ARITH), Urbana, IL, USA, 1985, pp. 118-125, doi: 10.1109/ARITH.1985.6158931.

[4]: Chor, B., Goldreich, O. An improved parallel algorithm for integer GCD. Algorithmica 5, 1–10 (1990). https://doi.org/10.1007/BF01840374