Multiplying two 32-bit numbers without using 64-bit variable

---15 Aug 2023---

In this post, we figure out how to multiplying two 32-bit positive numbers by only using 32-bit variables.

Suppose that we are given two positive numbers \(x,y \in [0, 2^{32}-1]\), the goal is to calculate \(z = x \times y\) where \(z \in [0, 2^{64} - 1]\) without using any 64-bit variable (and also, without using conditional expressions).

We use two variables, say \(z_1\) and \(z_0\), to represent \(z\) where \(z_1\) is 32 upper bits and \(z_0\) is 32 lower bits of \(z\), respectively. In other words, \(z = z_1 \times 2^{32} + z_0\). So, our goal now is to find \(z_1\) and \(z_0\) using only calculations on 32-bit variables.

The first and naive idea is to decompose \(x\) and \(y\) into:

\[x = x_1 \times 2^{16} + x_0,\] \[y = y_1 \times 2^{16} + y_0,\]

where \(x_0, x_1, y_0, y_1 \in [0, 2^{16}-1]\) are 16-bit numbers. We then have:

\[z = x_1y_1 \times 2^{32} + (x_0y_1 + x_1y_0) \times 2^{16} + x_0y_0\]

When we see the factor \(2^{32}\) here, we might say that let’s take \(z_1 = x_1y_1\) and \(z_0 = (x_0y_1 + x_1y_0) \times 2^{16} + x_0y_0\) !?

Be careful! It is a bit more complicated than that. Why? Let’s take a look on \(x_0y_0\). Notice that \(x_0\) and \(y_0\) are 16-bit numbers. So their product is maximum 32 bits. It is similar to \(x_0y_1\) and \(x_1y_0\). Thus, \((x_0y_1 + x_1y_0) \times 2^{16} + x_0y_0\) does not fit 32 bits of \(z_0\).

To deal with this, instead of considering two 32-bit parts \(z_1\) and \(z_0\), we consider \(z\) as four 16-bit parts \(a_3, a_2, a_1, a_0 \in [0, 2^{16}-1]\) such that:

\[z = a_3 \times 2^{48} + a_2 \times 2^{32} + a_1 \times 2^{16} + a_0,\]

\[z_1 = a_3 \times 2^{16} + a_2,\] \[z_0 = a_1 \times 2^{16} + a_0.\]

Now we sequentially find \(a_0\), \(a_1\), \(a_2\) and \(a_3\). Before starting, we need to introduce the following two functions for a 32-bit number \(v\):

\(\mathsf{UP}(v)\): returns the 16 upper bits of \(v\),
\(\mathsf{LO}(v)\): returns the 16 lower bits of \(v\).

Firstly, we start with \(a_0\). Since only \(x_0y_0\) involves in the value of \(a_0\), it can be sure that \(a_0\) is the 16 lower bits of \(x_0y_0\). We write:

\[a_0 = \mathsf{LO}(x_0y_0).\]

Secondly, we find \(a_1\). We have that \(x_0y_1\), \(x_1y_0\) and \(\mathsf{UP}(x_0y_0)\) involve in the value of \(a_1\). We need to calculate the sum of them and \(a_1\) is the 16 lower bits of the sum. However, we have to be careful with the sum’s calculation, otherwise, it can cause overflow. To have a closer look, we assign the maximum values for \(x_0, x_1, y_0, y_1\). Namely,

\[x_0 = x_1 = y_0 = y_1 = 65535 = (\texttt{FFFF})_{16},\]

then

\[x_0y_1 = x_1y_0 = 4294836225 = (\texttt{FFFE0001})_{16}\]

We can observe that it is still possible to add \(2 \times \texttt{FFFF}\) (twice of the maximum value of 16 bits) to \(x_0y_1\) or \(x_1y_0\) before it reaches the maximum value of 32 bits. Plus, we cannot calculate \(x_0y_1 + x_1y_0\) directly. So, we do the following steps:

\[a_1 = \mathsf{LO}(x_1y_0 + \mathsf{LO}(x_0y_1 + \mathsf{UP}(x_0y_0)))\]

Thirdly, we find \(a_2\). We have that \(x_1y_1\) and all the previous calculations (\(x_0y_1\), \(x_1y_0\) and \(\mathsf{UP}(x_0y_0)\)) involve in the value of \(a_2\) (in case their sums are propagatedly over 16 bits). We pay attention to the previous calculations in \(a_1\) where we only take the lower bits, including \(x_0y_1 + \mathsf{UP}(x_0y_0)\) and \(x_1y_0 + \mathsf{LO}(x_0y_1 + \mathsf{UP}(x_0y_0))\). Now:

\[\begin{align} a_2 = \mathsf{LO} &( \\ &x_1y_1 + \\ &\mathsf{UP}(x_0y_1 + \mathsf{UP}(x_0y_0)) + \\ &\mathsf{UP}(x_1y_0 + \mathsf{LO}(x_0y_1 + \mathsf{UP}(x_0y_0))) \\ &) \end{align}\]

Finally, we find \(a_3\). It is similar to \(a_2\), but we take the upper bits at the end:

\[\begin{align} a_3 = \mathsf{UP} &( \\ &x_1y_1 + \\ &\mathsf{UP}(x_0y_1 + \mathsf{UP}(x_0y_0)) + \\ &\mathsf{UP}(x_1y_0 + \mathsf{LO}(x_0y_1 + \mathsf{UP}(x_0y_0))) \\ &) \end{align}\]

NOTE: It is not necessary to calculate \(a_2\) and \(a_3\) separately, because:

\[z_1 = a_3 \times 2^{16} + a_2 = x_1y_1 + \mathsf{UP}(x_0y_1 + \mathsf{UP}(x_0y_0)) + \mathsf{UP}(x_1y_0 + \mathsf{LO}(x_0y_1 + \mathsf{UP}(x_0y_0))))\]

Voilà. Now we have \(z_1 = a_3 \times 2^{16} + a_2\) and \(z_0 = a_1 \times 2^{16} + a_0.\) Below is the C implementation and here is the link to github: https://github.com/nvietsang/mult32

#define UP(x) (x >> 16)      // 16 upper bits
#define LO(x) (x & 0xFFFF)   // 16 lower bits

/**
 * Given:
 *  x = x1 * 2^16 + x0
 *  y = y1 * 2^16 + y0
 * Compute:
 *  z = x * y
 * Return:
 *  z[1]: 32 upper bits
 *  z[0]: 32 lower bits
*/
void mult32(unsigned int *z, unsigned int x, unsigned int y) {
	unsigned int x1 = UP(x);
	unsigned int x0 = LO(x);
	unsigned int y1 = UP(y);
	unsigned int y0 = LO(y);

    unsigned int t;
    // z = z3 * 2^48 + z2 * 2^32 + z1 * 2^16 + z0
    // z[1] = z3 * 2^16 + z2 
    // z[0] = z1 * 2^16 + z0
    unsigned int z3, z2, z1, z0; 
    t = x0 * y0;
    z0 = LO(t);
    
    z1 = UP(t);
    t = x0 * y1 + z1;
    z1 = LO(t);
    z2 = UP(t);

    t = x1 * y0 + z1;
    z1 = LO(t);
    z2 = UP(t) + z2;
    t = x1 * y1 + z2;
    // z2 = LO(t);
    // z3 = UP(t);
    
    z[1] = t;
    // z[1] = (z3 << 16) | z2;
    z[0] = (z1 << 16) | z0;
}