# Lehmer’s algorithm

**20 July 2023**

**Computing the Greatest Common Divisor of multiple precision
integers**

Here, I implement greatest common divisor algorithms in C++ and Fortran.

I slowly build up to implementing Lehmer’s algorithm for calculating the greatest common divisor on multiple precision integers and provide my own intuition for why it works.

In the future, I would like to write parallelised code for computing the greatest common divisor, utilising the algorithm of Chor and Goldreich [4], and potentially write a GPU-accelerated program in a similar manner.

I haven’t had much to do recently so I decided to see how fast I can write a program to calculate the greatest common divisor of two integers.

I’ll be testing several different algorithms, so it would be nice to have a set of test cases I can use to benchmark them - I can generate them with this script.

```
#!/usr/bin/env python3
import math
import random

random.seed(0)

with open("test-cases", "w") as file:
    for i in range(3, 21):
        number_of_bits = 2**i
        for test_case in range(1000):
            minimum = 2**(number_of_bits - 1)
            maximum = 2**number_of_bits - 1
            left = random.randint(minimum, maximum)
            right = random.randint(minimum, maximum)
            file.write(f"{left} {right} {math.gcd(left, right)}\n")
```

I’m generating test cases that are grouped by the number of bits the inputs contain. I start with test cases where both numbers contain \(2^3 = 8\) bits and go upwards in powers of two until I get to \(2^{20}\)-bit integers. \(2^{n - 1}\) is the smallest \(n\)-bit integer and \(2^n - 1\) is the largest integer that can be made using \(n\) bits. The `random.seed(0)` call at the top keeps things deterministic and reproducible.
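These bit-length bounds are easy to spot-check in Python:

```python
# 2**(n - 1) is the smallest n-bit integer; 2**n - 1 is the largest.
for n in range(3, 21):
    smallest = 2**(n - 1)
    largest = 2**n - 1
    assert smallest.bit_length() == n
    assert largest.bit_length() == n
    assert (smallest - 1).bit_length() == n - 1  # one less needs fewer bits
    assert (largest + 1).bit_length() == n + 1   # one more needs an extra bit
```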

I have access to a decent number of CPUs, so it would be nice to split the work across them.

```
#!/usr/bin/env python3
#SBATCH -t 1-0
#SBATCH --mem=128G
#SBATCH -c 128
N_EXAMPLES = 1000

import math
import random
import itertools
import multiprocessing

random.seed(0)

def generate_example(number_of_bits):
    minimum = 2**(number_of_bits - 1)
    maximum = 2**number_of_bits - 1
    left = random.randint(minimum, maximum)
    right = random.randint(minimum, maximum)
    return (left, right)

def create_testcase(example):
    left, right = example
    return left, right, math.gcd(left, right)

with open("test-cases", "w") as file:
    examples = itertools.chain.from_iterable([
        [generate_example(number_of_bits) for i in range(N_EXAMPLES)]
        for number_of_bits in [2**i for i in range(3, 21)]
    ])
    with multiprocessing.Pool(128) as pool:
        testcases = pool.imap(create_testcase, examples)
        for left, right, answer in testcases:
            file.write(f"{left} {right} {answer}\n")
```

Here, I create the example inputs in advance. I then split the task of calculating the greatest common divisor for each of these across the different CPUs.

For large numbers, calculating the GCD will take a while, so I used the default chunk size of 1. There are situations where it makes sense to increase the chunk size, but I don’t think there is much need here as my program will spend much more time calculating GCDs than it will spend on synchronisation.

The `SBATCH` directives at the top are arguments to the `sbatch` command for the Slurm workload manager, which I’m using to submit jobs to the machine.

# The naive algorithm

Ok, so you need to start somewhere.

We know that the greatest common divisor has to be less than or equal to the smaller of the two numbers, so we can try every single number up to the smaller of the two and check whether it divides both of them. Simple!

```
integer function naive_gcd(left, right)
    integer, intent(in) :: left, right
    integer :: smaller
    integer :: divisor
    logical :: divides_left, divides_right
    naive_gcd = 1
    smaller = min(left, right)
    do divisor = 1, smaller
        divides_left = mod(left, divisor) .eq. 0
        divides_right = mod(right, divisor) .eq. 0
        if (divides_left .and. divides_right) then
            naive_gcd = divisor
        end if
    end do
end function
```

Awesome! Now, how fast is it?

If the integers have \(n\) bits then we’re searching up to \(2^n\) integers, so this is quite bad. For 8 bits we need to search 255 numbers and for 16 bits this becomes 65535 numbers. Nowadays, integers are normally 32 bits and we would have to search through over 4 billion numbers. This is pretty awful.

We can do a tiny bit better here. We know that any divisor of some number \(n\) is either \(n\) itself or less than or equal to \(\frac{n}{2}\). Why?

This is because if there is a number, \(d\), that divides \(n\) and is not \(n\) itself, then it must multiply with some other number to make \(n\). Since \(d\) is not \(n\), this other number cannot be \(1\), so it must be larger than or equal to 2.

Let’s call this other number \(k\), so \(n = kd\) and \(k = \frac{n}{d}\). Since \(k \geq 2\), we have \(2 \leq \frac{n}{d}\). Multiplying both sides by \(d\), we get \(2d \leq n\), so \(d \leq \frac{n}{2}\). Sweet! Let’s code this up!

```
integer function naive_gcd_half(left, right)
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: divisor
    logical :: smaller_divides_larger
    logical :: divides_left, divides_right
    naive_gcd_half = 1
    smaller = min(left, right)
    larger = max(left, right)
    smaller_divides_larger = mod(larger, smaller) .eq. 0
    if (smaller_divides_larger) then
        naive_gcd_half = smaller
        return
    end if
    do divisor = 1, smaller / 2
        divides_left = mod(left, divisor) .eq. 0
        divides_right = mod(right, divisor) .eq. 0
        if (divides_left .and. divides_right) then
            naive_gcd_half = divisor
        end if
    end do
end function
```

This code has an extra check, as we need to make sure that the smaller number isn’t itself the greatest common divisor before we skip it and start searching smaller candidates.

This code is an improvement: we’re now searching \(\frac{2^n}{2} = 2^{n - 1}\) numbers, which still isn’t great, but it’s definitely progress.
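As a quick sanity check, we can confirm in Python that every proper divisor is at most half its number:

```python
# Any divisor of n other than n itself is at most n // 2.
for n in range(2, 1000):
    proper_divisors = [d for d in range(1, n) if n % d == 0]
    assert max(proper_divisors) <= n // 2
```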

We can make another improvement here - if we know that 2 isn’t a divisor, then we automatically know that 4, 6, 8, 10 and every other multiple of two cannot be divisors either, so we can skip all of them! We can repeat this for every candidate. This marking technique is the same idea as the sieve of Eratosthenes. Although eliminating bad candidates costs up to \(O(2^n \log 2^n)\) marking operations, each mark is a cheap store that lets us skip a much more expensive division, so in practice this is much faster.

```
integer function eratosthenes(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: divisor
    integer :: i
    logical, dimension(:), allocatable :: isnondivisor
    allocate(isnondivisor(left), source = .false.)
    eratosthenes = 0
    if (mod(right, left) .eq. 0) then
        eratosthenes = left
        return
    end if
    do divisor = 1, left / 2
        if (isnondivisor(divisor)) then
            ! continue is a no-op in Fortran; cycle skips to the next candidate
            cycle
        end if
        if ((mod(left, divisor) .eq. 0) .and. (mod(right, divisor) .eq. 0)) then
            eratosthenes = divisor
        else
            do i = 1, left / divisor
                isnondivisor(i * divisor) = .true.
            end do
        end if
    end do
end function
```

# Euclid’s original algorithm

Now we can start to get a bit smart about things and take a different approach. If we know some number, \(d\), divides both \(a\) and \(b\), what can we say about \(b - a\)? Well, \(d\) divides \(a\) so there’s some \(k_1\) such that \(a = k_1 d\). Similarly, \(d\) divides \(b\) so there’s some number, \(k_2\), such that \(b = k_2 d\). Using these, \(b - a = k_2 d - k_1 d = d (k_2 - k_1)\). So, \(d\) divides \(b - a\).

Why is this useful? This tells us that every number that divides both \(a\) and \(b\) will also divide \(b - a\), so the greatest common divisor of \(a\) and \(b\) will be the greatest common divisor of \(a\) and \(b - a\). This is helpful because we can now search for divisors of \(b - a\), which is smaller than the larger of the two numbers. This lends itself nicely to a recursive algorithm. I will write this iteratively because I can’t be certain that the Fortran compiler will perform tail call optimisation.

```
integer function naive_euclid(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next
    smaller = min(left, right)
    larger = max(left, right)
    do while ((smaller .ne. 0) .and. (smaller .ne. 1))
        next = larger - smaller
        larger = max(smaller, next)
        smaller = min(smaller, next)
    end do
    if (smaller .eq. 1) then
        naive_euclid = 1
    else
        naive_euclid = larger
    end if
end function
```

This was the original algorithm presented in Euclid’s *Elements*:

> Let A and C be the two given positive integers; it is required to find their greatest common divisor. If C divides A, then C is a common divisor of C and A, since it also divides itself. And it clearly is in fact the greatest, since no greater number than C will divide C. But if C does not divide A, then continually subtract the lesser of the numbers A, C from the greater, until some number is left that divides the previous one. This will eventually happen, for if unity is left, it will divide the previous number.

At each step of the iteration, I replace the larger of the two numbers, let’s call them \(a\) and \(b\), with \(b - a\), because the pair shares the same divisors. This is only slightly better, however: we still perform \(O(a) = O(2^n)\) arithmetic operations in the worst case, which is no better than what we had before.
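To see how badly the subtraction-only approach scales, we can count its iterations in Python (`subtractive_steps` is a helper of my own, mirroring the Fortran loop above):

```python
def subtractive_steps(a, b):
    """Count iterations of the subtraction-only algorithm above."""
    smaller, larger = min(a, b), max(a, b)
    steps = 0
    while smaller not in (0, 1):
        smaller, larger = sorted((larger - smaller, smaller))
        steps += 1
    return steps

# A lopsided pair needs roughly larger / smaller subtractions,
# so the step count is linear in the value of the inputs.
print(subtractive_steps(10**6, 3))  # hundreds of thousands of steps
```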

# The Euclidean algorithm

We can improve on this by noticing that repeatedly subtracting the same number leaves exactly the remainder after division by that number. This adjustment is quick to make: we can just replace `next = larger - smaller` with `next = mod(larger, smaller)`. After making this adjustment, the code for this algorithm looks like this.

```
integer function euclidean(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next
    smaller = min(left, right)
    larger = max(left, right)
    do while ((smaller .ne. 0) .and. (smaller .ne. 1))
        next = mod(larger, smaller)
        larger = max(smaller, next)
        smaller = min(smaller, next)
    end do
    if (smaller .eq. 1) then
        euclidean = 1
    else
        euclidean = larger
    end if
end function
```

We can show that with this adjustment we will always require fewer than \(\log_\varphi 3a\) steps, where \(\varphi\) is the golden ratio, and on average we perform \(O(\log a)\) steps [1].

This is significant because \(O(\log a) = O(\log 2^n) = O(n)\), so we now have a greatest common divisor algorithm whose expected number of steps is linear in the number of bits of our integers. If the computational cost of arithmetic operations is proportional to the number of bits involved, we get a complexity that scales as \(O(n^2)\). This is neat but we can do much better.
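We can check the \(\log_\varphi 3a\) bound empirically in Python on consecutive Fibonacci numbers, the classic worst case (`euclid_steps` is my own helper):

```python
import math

def euclid_steps(a, b):
    """Count iterations of the remainder-based algorithm."""
    steps = 0
    while b:
        a, b = b, a % b
        steps += 1
    return steps

phi = (1 + 5**0.5) / 2

# Consecutive Fibonacci numbers are the worst case for the Euclidean
# algorithm, yet even they stay within the log-phi bound.
a, b = 2, 1
while a < 10**18:
    assert euclid_steps(a, b) <= math.log(3 * a, phi)
    a, b = a + b, a
```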

# A binary algorithm

There’s an algorithm that’s better optimised for binary arithmetic. It exploits the fact that for binary numbers, division by two is just a bit-shift operation. The algorithm is based on the following principles.

- If \(a\) and \(b\) are both even then \(\gcd(a, b) = 2\gcd(\frac{a}{2}, \frac{b}{2})\)
- If \(a\) is odd and \(b\) is even then \(\gcd(a, b) = \gcd(\frac{b}{2}, a)\)
- If \(a\) and \(b\) are both odd then \(|a - b|\) is even and \(|a - b| < \max(a, b)\). Furthermore, \(\gcd(a, b) = \gcd(|a - b|, \min(a, b))\)
- \(\gcd(a, b) = \gcd(b - a, a)\), as in Euclid’s algorithm
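These identities are easy to spot-check in Python against `math.gcd`:

```python
import math
import random

random.seed(0)
for _ in range(1000):
    a = random.randint(1, 10**6)
    b = random.randint(1, 10**6)
    g = math.gcd(a, b)
    if a % 2 == 0 and b % 2 == 0:
        assert g == 2 * math.gcd(a // 2, b // 2)
    elif a % 2 == 1 and b % 2 == 0:
        assert g == math.gcd(b // 2, a)
    elif a % 2 == 1 and b % 2 == 1 and a != b:
        assert abs(a - b) % 2 == 0
        assert abs(a - b) < max(a, b)
        assert g == math.gcd(abs(a - b), min(a, b))
```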

With this algorithm, we begin by dividing both \(a\) and \(b\) by two while they are both even, maintaining a counter, \(k\), for how many times two divides both of them. After that, if either number is even, we can repeatedly divide it by two, leaving the counter unmodified. We can then replace the larger number with the difference of the two. This is my implementation of this algorithm.

```
integer function binary(left, right)
    implicit none
    integer, intent(in) :: left, right
    integer :: smaller, larger
    integer :: next
    integer :: k
    ! Initialising k in its declaration would give it the save attribute
    k = 0
    smaller = min(left, right)
    larger = max(left, right)
    do while ((mod(smaller, 2) .eq. 0) .and. (mod(larger, 2) .eq. 0))
        smaller = smaller / 2
        larger = larger / 2
        k = k + 1
    end do
    do while (smaller .ne. 0)
        do while (mod(larger, 2) .eq. 0)
            larger = larger / 2
        end do
        do while (mod(smaller, 2) .eq. 0)
            smaller = smaller / 2
        end do
        next = larger - smaller
        larger = max(smaller, next)
        smaller = min(smaller, next)
    end do
    ! ishft is the standard left-shift intrinsic
    binary = ishft(larger, k)
end function
```
```

This algorithm appears in a Chinese text, the *Chiu Chang Suan Shu*
or *Nine Chapters on the Mathematical Art*, written around 1 AD.
The following excerpt describes the above algorithm.

> If halving is possible, take half. Otherwise write down the denominator and the numerator, and subtract the smaller from the greater. Repeat until both numbers are equal. Simplify with this common value.

This algorithm also has a complexity of \(O(n^2)\), but it is significantly faster in practice because the halving steps are cheap bit shifts.

# Multiple precision arithmetic

You may have noticed that the test cases I generated include numbers that contain up to \(2^{20}\) bits, but the Fortran code I’ve written uses integers, which only support numbers that use fewer than 32 bits. \(32\) is much less than \(2^{20}\), so this poses a problem as I simply will not be able to fit the test cases into integers.

It’s now time to explore multiple precision arithmetic, which will allow me to store and operate on arbitrarily large numbers.

I had lots of ideas here for building my own multi-precision arithmetic library but, unfortunately, sense and reason got the better of me and I decided to use a pre-existing multi-precision arithmetic library - which, of course, is definitely the correct thing to do, albeit a bit less fun.

It’s also around here that I decided to switch to C++ - I love Fortran, but I didn’t really want to deal with trying to link libraries with gfortran, and I kept stumbling into dead links when searching for Fortran multi-precision libraries.

In C++, using the GNU multiple precision arithmetic library (GMP), the binary GCD algorithm and the Euclidean algorithm look like this.

```
void binary(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    int k = 0;
    mpz_inits(smaller, larger, nullptr);
    mpz_set(smaller, left);
    mpz_set(larger, right);
    if (mpz_cmp(smaller, larger) > 0) {
        mpz_swap(smaller, larger);
    }
    while (mpz_divisible_ui_p(smaller, 2) && mpz_divisible_ui_p(larger, 2)) {
        ++k;
        mpz_tdiv_q_ui(smaller, smaller, 2);
        mpz_tdiv_q_ui(larger, larger, 2);
    }
    while (mpz_cmp_ui(smaller, 0)) {
        while (mpz_divisible_ui_p(larger, 2)) {
            mpz_tdiv_q_ui(larger, larger, 2);
        }
        while (mpz_divisible_ui_p(smaller, 2)) {
            mpz_tdiv_q_ui(smaller, smaller, 2);
        }
        mpz_sub(larger, larger, smaller);
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
    }
    mpz_mul_2exp(larger, larger, k);
    mpz_set(result, larger);
    mpz_clears(smaller, larger, nullptr);
}

void euclidean(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    mpz_inits(smaller, larger, nullptr);
    if (mpz_cmp(left, right) < 0) {
        mpz_set(smaller, left);
        mpz_set(larger, right);
    }
    else {
        mpz_set(smaller, right);
        mpz_set(larger, left);
    }
    while (mpz_cmp_ui(smaller, 0)) {
        mpz_mod(larger, larger, smaller);
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
    }
    mpz_set(result, larger);
    mpz_clears(smaller, larger, nullptr);
}
```

# Lehmer’s algorithm

Lehmer’s algorithm is a GCD algorithm for arbitrary precision
arithmetic. This is the algorithm used in the Python standard library’s
`gcd` function. Integers in Python are represented in base \(2^{30}\):
an array of 30-bit digits, accompanied by a number that records the
size (the number of digits) and the sign.
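CPython exposes the digit size it uses, so we can inspect it directly (on most 64-bit builds each digit carries 30 bits; some builds use 15):

```python
import sys

# CPython reports how many bits each internal "digit" of an int carries
print(sys.int_info.bits_per_digit)  # 30 on typical 64-bit builds
print(sys.int_info.sizeof_digit)    # bytes of storage per digit
```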

The idea with Lehmer’s algorithm is to avoid multiple-precision operations and make use of single-precision operations where possible. We can do this after first noticing that the quotients between two random numbers are usually small - almost always less than 1000, as we can observe experimentally with the following program.

```
import Control.Arrow
import System.Random
import Data.List

seed = mkStdGen 0

generateNumbers :: StdGen -> [Int]
-- draw from (1, 2^20) so we never divide by zero
generateNumbers = take 1000 . unfoldr (Just . randomR (1, 2^20))

main = putStrLn (show percentage ++ "% of quotients were below 1000")
  where
    (numerators, denominators) = generateNumbers *** generateNumbers $ split seed
    quotients = zipWith div numerators denominators
    percentage = 100 * count (<= 1000) quotients `div` length quotients
    count = (length .) . filter
```

The above reports that `99% of quotients were below 1000`.
We use this fact by approximating the quotient of two large
multiple-precision numbers as the quotient of two numbers that can be
represented using single-precision words. In fact, we take two
approximations - one an overestimate and the other an underestimate.
Because these quotients are small, there are only a few possibilities
for the true quotient that lies between these two bounds. In fact,
about 68% of quotients are either 1, 2 or 3.
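We can check this figure empirically by collecting the quotients produced while running the Euclidean algorithm on random pairs (a quick Python sketch):

```python
import math
import random

random.seed(0)

def euclid_quotients(a, b):
    """Yield the quotients produced while running the Euclidean algorithm."""
    while b:
        yield a // b
        a, b = b, a % b

quotients = []
for _ in range(2000):
    a = random.randint(1, 2**64)
    b = random.randint(1, 2**64)
    quotients.extend(euclid_quotients(max(a, b), min(a, b)))

small = sum(q in (1, 2, 3) for q in quotients)
# Prints a figure close to 68%
print(f"{100 * small / len(quotients):.0f}% of quotients were 1, 2 or 3")
```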

We achieve these approximations by looking at the first few digits of each number. For the case of GNU MP, these digits will be machine words, but I will use base 10 for the purposes of the example.

If I’d like to find the greatest common divisor of 27182818 and 10000000 using the Euclidean algorithm, I’d begin by finding the quotient and remainder after dividing 27182818 by 10000000. I can do this by under- and over-approximating each of the two numbers: 27182818 lies between 27180000 and 27190000, and 10000000 lies between 10000000 and 10010000. We can thus bound the true quotient from below by \(\frac{27180000}{10010000}\) and from above by \(\frac{27190000}{10000000}\). Clearly, these are equal to \(\frac{2718}{1001}\) and \(\frac{2719}{1000}\), which can be computed using single-precision division. Both of these quotients round down to 2, so the true quotient, being bounded by these approximations, must also be 2.
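We can confirm this worked example with integer division in Python:

```python
# The true quotient is pinned between the single-precision approximations
assert 27182818 // 10000000 == 2  # the multi-precision quotient
assert 2718 // 1001 == 2          # under-approximation of the quotient
assert 2719 // 1000 == 2          # over-approximation of the quotient
```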

Lehmer’s algorithm works by using the Euclidean algorithm on these single-precision approximations until the quotients are no longer the same. At this point, since we know the true quotients accumulated from the Euclidean algorithm, we can use these quotients to perform the corresponding steps of the Euclidean algorithm on the multi-precision numbers without ever needing to perform a multi-precision division.
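To make this concrete, here is a sketch of the same idea in Python, following the textbook formulation (Knuth’s Algorithm L [1]) rather than my C++ code below; the 30-bit digit size and the helper name are my own choices:

```python
import math
import random

def lehmer_gcd(u, v):
    """Lehmer's GCD, following Knuth's Algorithm L, on simulated 30-bit digits."""
    DIGIT_BITS = 30
    u, v = max(abs(u), abs(v)), min(abs(u), abs(v))
    while v >= 1 << DIGIT_BITS:
        # Take the leading digits of both numbers, aligned on the larger one
        shift = u.bit_length() - DIGIT_BITS
        x, y = u >> shift, v >> shift
        # A, B, C, D accumulate the single-precision Euclidean steps
        # as a linear combination of the original u and v
        A, B, C, D = 1, 0, 0, 1
        while y + C != 0 and y + D != 0:
            # Under- and over-estimates of the next quotient
            q = (x + A) // (y + C)
            if q != (x + B) // (y + D):
                break  # the estimates disagree: stop accumulating
            A, C = C, A - q * C
            B, D = D, B - q * D
            x, y = y, x - q * y
        if B == 0:
            u, v = v, u % v  # rare multi-precision fallback
        else:
            # Apply the accumulated quotients to the multi-precision pair
            u, v = A * u + B * v, C * u + D * v
    while v:  # finish with the ordinary Euclidean algorithm
        u, v = v, u % v
    return u

random.seed(0)
for _ in range(200):
    a = random.getrandbits(512)
    b = random.getrandbits(512)
    assert lehmer_gcd(a, b) == math.gcd(a, b)
```

Knuth’s A, B, C, D play the role that a_l, b_l, a_s, b_s play in the C++ below.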

With the GNU MP library, each digit is stored as a machine word. These
words are called “limbs” and have the type `mp_limb_t`; I access them
using `mpz_getlimbn`. I came up with the following code.

```
void lehmer(mpz_t result, mpz_t left, mpz_t right) {
    mpz_t smaller, larger;
    mpz_t smaller_prime, larger_prime;
    mpz_inits(smaller, larger, smaller_prime, larger_prime, nullptr);

    /* Under- and over-estimates of each number */
    mp_limb_t smaller_u, smaller_o, larger_u, larger_o;
    /* Corresponding quotients and remainders */
    mp_limb_t q_u, r_u, q_o, r_o;
    mpz_set(smaller, left);
    mpz_set(larger, right);

    if (mpz_cmp(smaller, larger) > 0) {
        mpz_swap(smaller, larger);
    }
    /* Begin as with the Euclidean algorithm */
    while (mpz_cmp_ui(smaller, 0)) {
        if (mpz_cmp(smaller, larger) > 0) {
            mpz_swap(smaller, larger);
        }
        if (!mpz_cmp_ui(smaller, 0)) {
            break;
        }
        /* Get the most significant limb from each number */
        smaller_u = mpz_getlimbn(smaller, mpz_size(larger) - 1);
        larger_u = mpz_getlimbn(larger, mpz_size(larger) - 1);

        smaller_o = smaller_u + 1;
        larger_o = larger_u + 1;

        /*
         * The larger number is much larger than the smaller number.
         * We are forced to compute a multi-precision division.
         * This happens extremely rarely.
         */
        if (!smaller_u) {
            mpz_mod(larger, larger, smaller);
            continue;
        }
        // After some steps of the Euclidean algorithm
        // the two numbers become some linear combination
        // of the original values.
        // Use a, b to denote the coefficients
        // for the larger and smaller number.
        mp_limb_t a_l = 0, b_l = 1, a_s = 1, b_s = 0;
        while (1) {
            q_u = larger_u / smaller_o;
            q_o = larger_o / smaller_u;

            if (q_u != q_o) {
                if (a_l == 0 && b_l == 1 && a_s == 1 && b_s == 0) {
                    /* No single-precision step succeeded */
                    mpz_mod(larger, larger, smaller);
                    break;
                }
                mpz_mul_ui(smaller_prime, smaller, a_s);
                mpz_addmul_ui(smaller_prime, larger, b_s);

                mpz_mul_ui(larger_prime, smaller, a_l);
                mpz_addmul_ui(larger_prime, larger, b_l);

                /* Apply the accumulated quotients to the multi-precision pair */
                mpz_set(smaller, smaller_prime);
                mpz_set(larger, larger_prime);
                break;
            }
            // Perform an iteration of Lehmer's algorithm
            r_u = larger_u - q_u * smaller_o;
            r_o = larger_o - q_o * smaller_u;

            // This remainder is less than both of these numbers.
            // The new larger number is the original smaller number
            // and the new smaller number is the remainder.
            larger_u = smaller_o;
            larger_o = smaller_u;

            smaller_u = r_u;
            smaller_o = r_o;

            // We have
            // remainder = larger - q * smaller
            // larger' = smaller
            // smaller' = remainder
            // if larger = a_l * smaller_0 + b_l * larger_0
            //    smaller = a_s * smaller_0 + b_s * larger_0
            // then
            // smaller' = (a_l - q * a_s) * smaller_0 + (b_l - q * b_s) * larger_0
            // larger' = a_s * smaller_0 + b_s * larger_0
            // thus
            //
            // a_s' = a_l - q * a_s
            // b_s' = b_l - q * b_s
            //
            // a_l' = a_s
            // b_l' = b_s
            mp_limb_t _a_l = a_s, _b_l = b_s;
            a_s = a_l - q_u * a_s;
            b_s = b_l - q_o * b_s;
            a_l = _a_l;
            b_l = _b_l;
        }
    }
    mpz_set(result, larger);
    mpz_clears(smaller, larger, smaller_prime, larger_prime, nullptr);
}
```

Writing this was quite fun and seeing it work was truly awesome. I gave it two 62-digit numbers and it returned the greatest common divisor in an instant. This algorithm is insanely fast, but I’m not done yet.

# Parallelisation

So my code is fast. This is cool but not extraordinary. As I said, this is the same algorithm that the Python standard library uses, although I beat Python since my C++ program is able to process the input faster using the GMP library.

I mentioned earlier that I have acess to a good number of CPU cores so I’m obligated to use them. I was able to find two algorithms for computing the GCD in parallel. The first due to Brent and Kung [3] and the second due to Goldreich and Chor [4]. In the future, I would like to read through these papers and implement a parallel GCD algorithm.

[1]: Knuth, Donald E. “The art of computer programming. Volume 2: Seminumerical algorithms.”

[2]: Möller, Niels. “On Schönhage’s algorithm and subquadratic integer GCD computation.” Mathematics of Computation 77.261 (2008): 589-607.

[3]: R. P. Brent and H. T. Kung, “A systolic algorithm for integer GCD computation,” 1985 IEEE 7th Symposium on Computer Arithmetic (ARITH), Urbana, IL, USA, 1985, pp. 118-125, doi: 10.1109/ARITH.1985.6158931.

[4]: Chor, B., Goldreich, O. An improved parallel algorithm for integer GCD. Algorithmica 5, 1–10 (1990). https://doi.org/10.1007/BF01840374