Implementing the Brent-Kung addition circuit in hardware
13 September 2025
Recently I’ve been doing the HDLBits Verilog exercises; there’s about 200 of them but one in particular piqued my interest.
The exercise titled Adder100 asks you to create a 100-bit addition circuit that handles both carry in and carry out. I remembered that there is an algorithm for parallel addition that falls in thes AC0 complexity class.
AC0 is the complexity class for circuits of constant depth with respect to the size of the input where each logic gate in the circuit has polynomial fan in. That means every logic gate like AND has its number of inputs bounded by a polynomial function of the input size. So saying addition is in AC0 means we can we only need a constant number of logic gates provided each logic gate can accept a reasonable number of inputs.
The algorithm itself isn’t very complicated and is pretty much the same as how we’re taught to add number in school.
Take for example the binary numbers 1010
and
1011
. We add the numbers in each of the digit positions
like in the diagram below where we know the following rules for binary
addition
- \(0 + 0 = 0\)
- \(0 + 1 = 1\)
- \(1 + 0 = 1\)
- \(1 + 1 = 10\)
All of these additions produce one digit except the rule for \(1 + 1\) which produces two digits. If we didn’t have the case where both of the summand digits were 1 then we could find digits for the sum by simply adding the digits in the same position.
This doesn’t work in general because \(1 + 1\) generates a ‘carry’ bit which we need to add to the next digit, just like when we did addition in school. Visually that looks like this
When we need to add two one bits together, we set the carry bit of the next digit to 1. And, once we’ve calculated the carry bits we can add the three digits together in a similar way.
If the carry bit is 0 then we add the numbers as we usually would since adding 0 doesn’t change a number.
If, however, the carry bit is 1 then we have new addition rules
- \(1 + 0 + 0 = 1\)
- \(1 + 0 + 1 = 10\)
- \(1 + 1 + 0 = 10\)
- \(1 + 1 + 1 = 11\)
Most of these generate a carry! However, this is ok as we can continue this procedure of adding and generating carry bits until we don’t have a carry anymore.
The addition rules for the sum bit simplifies to just the xor of the bits we’re adding. So \(1 + 0 + 1\) has a sum bit of \(1 \oplus 0 \oplus 1 = 0\).
The rule for determining the carry is a bit different. I introduced it earlier by iteratively adding digits and moving the carry bit to the next place. However, we can think about it in a different way and look at a specific position then ask “when will the carry bit for this position be 1?”.
We do this by thinking about “generating” and “propagating” carry
bits. This works by noticing that for 2 digit summation we only have a
carry bit when both of the summands bits are 1. In SystemVerilog we can
test that for a position using a[j] & b[j]
which
evaluates to true when both a
and b
have 1
bits at position j
.
Even though a certain bit generates a carry bit, we need to ask if it makes it way to the position we care about. For example, the carry generated by the 1 + 1 addition makes its way to the final position following example
When looking at the carry bit for a certain position we can look at
all generator bits in front of it and check if the digits in between
allow it to propagate onwards. Looking at our addition rules, we can
propagate an carry 1 bit as long as we have at least one of our summand
bits is a 1. In SystemVerilog we can test this using
a[i] | b[i]
.
in SystemVerilog that looks like this
module top_module #(
parameter BITS = 4
)
(input logic [BITS - 1 : 0] a,
input logic [BITS - 1 : 0] b,
input logic cin,
output logic [BITS - 1 : 0] sum,
output logic cout
);logic [BITS - 1 : 0] carry;
assign carry[0] = cin;
genvar i, j, k;
generate
for (i = 1; i < BITS; i++) begin : loop
logic [i - 1 : 0] or_reduction;
for (j = 0; j < i; j++) begin : loop2
logic [i - 1 : j] does_generate;
assign does_generate[j] =
0) ?
(j == 0] & b[0] | a[0] & cin | b[0] & cin)
(a[
: a[j] & b[j];for (k = j + 1; k < i; k++) begin : loop3
assign does_generate[k] = a[k] | b[k];
end
assign or_reduction[j] = &does_generate;
end
assign carry[i] = |or_reduction;
end
endgenerate
assign sum = a ^ b ^ carry;
assign cout = a[BITS - 1] & b[BITS - 1] | carry[BITS - 1] & a[BITS - 1] | carry[BITS - 1] & b[BITS - 1];
endmodule
As a circuit it looks like this
However there’s a massive problem. In hardware we don’t have polynomial fan in: we have a fan-in of 2 so we can’t compute the chain of ANDs and ORs in one step which means this circuit doesn’t have constant depth. In fact this circuit has a pretty bad \(O(n)\) depth and \(O(n^2)\) gates.
It’s not that awful though, we can speed up my circuit by computing the AND and OR chain in parallel in a similar way to how you would do a parallel tree reduction in MPI. There’s pseudocode for this on the wikipedia page for the Brent-Kung adder.
module top_module #(
parameter BITS = 4
)(
input logic [BITS - 1 : 0] a,
input logic [BITS - 1 : 0] b,
input logic cin,
output logic [BITS - 1 : 0] sum,
output logic cout
);
logic [BITS - 1 : 0] g, p;
assign g = a & b;
assign p = a | b;
logic [BITS - 1 : 0] G [$clog2(BITS) : 0];
logic [BITS - 1 : 0] P [$clog2(BITS) : 0];
assign G[0] = g;
assign P[0] = p;
logic [BITS : 0] carry;
assign carry[0] = cin;
genvar step, i, j;
generate
for (step = 1; step <= $clog2(BITS); step++) begin : loop
for (i = 0; i < BITS; i++) begin : loop2
if (i < (1 << (step - 1))) begin
assign G[step][i] = G[step - 1][i];
assign P[step][i] = P[step - 1][i];
end else begin
assign G[step][i] = G[step - 1][i] | (P[step - 1][i] & G[step -1][i - (1 << (step - 1))]);
assign P[step][i] = P[step - 1][i] & P[step - 1][i - (1 << (step - 1))];
end
end
end
for (i = 1; i < BITS + 1; i++) begin : loop3
assign carry[i] = G[$clog2(BITS)][i - 1] | (P[$clog2(BITS)][i - 1] & cin);
end
endgenerate
assign sum = a ^ b ^ carry[BITS - 1 : 0];
assign cout = carry[BITS];
endmodule
This looks like this as a circuit
I wrote a small cocotb test bench for this
import cocotb
from cocotb.triggers import Timer
import random
@cocotb.test()
async def test(dut):
= len(dut.a)
BITS for a in range(2**(BITS - 1)):
for b in range(2 ** (BITS - 1)):
for cin in [0, 1]:
= a
dut.a.value = b
dut.b.value = cin
dut.cin.value await Timer(1, units="ns")
= int(dut.sum.value)
actual_sum = int(dut.cout.value)
actual_cout = a + b + cin
real_sum = real_sum & (2 ** BITS - 1)
expected_sum = (real_sum >> BITS) & 1
expected_cout
assert actual_sum == expected_sum, f"{a} + {b} + {cin}: expected {expected_sum=} but got {sum=}"
assert actual_cout == expected_cout, f"{a} + {b} + {cin}: expected {expected_cout=} but got {cout=}"
It needs a Makefile and there’s a guide in the cocotb documentation
SIM = verilator
TOPLEVEL_LANG = verilog
VERILOG_SOURCES = ../parallel-adder.sv
COCOTB_TEST_MODULES = test
TOPLEVEL = top_module
include $(shell cocotb-config --makefiles)/Makefile.sim
By default cocotb uses iverilog but I changed it to verilator since that’s what I have installed on my laptop.
I had to use /* verilator lint_off UNOPTFLAT */
pragmas
to get it to work, I don’t understand why though.
After doing this I ran make
and my tests all passed.
However, I’ve overcomplicated this and the site accepts the following solution
module top_module(
input [99:0] a, b,
input cin,
output cout,
output [99:0] sum
);assign {cout, sum} = a + b + cin;
endmodule
Even still, this was a fun exploration!