$p$-adic Valuation and the bsf / tzcnt and popcnt Instructions

Today, we’ll be looking at $p$-adic valuation defined on the integers, with a specific look at a special formula for finding the $p$-adic valuation of $n!$ for any given $n.$ We won’t be as concerned about optimizations this time, since we’re looking at something with a known formula.

This will be a very quick post with a bit of mathematics and programming, but this time we’ll be looking at the disassembly of our program!

The $p$-adic Valuation and tzcnt

Definition. Let $p$ be a prime number. The $p$-adic valuation of an integer $n$ is defined by $${\nu_{p}\left(n\right)={\begin{cases}k &{\text{if }}n\neq 0, p^{k}\mid\mid n\\\infty &{\text{if }}n=0\end{cases}}}$$ where $p^{k}\mid\mid n$ means that $k$ is the highest power of $p$ which divides $n.$

How would we implement this in code? The easiest way, and the way we’ll be doing it (for most values of $p$), is to just divide $n$ by $p$ until we hit an integer which is no longer divisible by $p.$

auto p_adic_valuation( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ
{   // Assuming n != 0 before proceeding; this will be handled in main() before this is called.
    NTL::ZZ valuation{ 0 };

    while( n % p == 0 )
    {
        ++valuation;
        n /= p;
    }

    return valuation;
}

Therefore, we get the following for p_adic_valuation( 89756756756756756667, 3 ) (note that I picked $89756756756756756667 = 3^{6}\cdot 123123123123123123$):
The 3-adic valuation of 89756756756667 is 8.

And, indeed, $3^{8}\mid\mid 89756756756756756667$ as $89756756756756756667 = 3^{8} \cdot 13680347013680347$ and $3 \nmid 13680347013680347.$

I said we wouldn’t worry about optimizations, but can we specialize this for $p = 2$ in a way that bypasses division? The following proposition and its corollary should give us an idea.

Proposition. Let $b \gt 1$ and $n \neq 0$ be integers. Then, there exist unique integers $d\geq 0$ and $n_{0},\ldots, n_{d}$ such that $$n = \sum_{0\leq i\leq d} n_{i} b^{i},$$ where $0\leq n_{i} \lt b$ for each $0\leq i\leq d$ and $n_{d}\neq 0.$

This proposition essentially says that, for any $b\gt 1,$ there exists a unique base-$b$ representation of $n,$ as long as we don’t consider extra $0$s appearing on the left. This is obviously true when $n = 0,$ since we can only represent $0$ as $0$ without adding extra $0$s.

Before proving this, there’s a very simple corollary, which I’ll prove first. This will be a hint for how we could do the $p = 2$ case without repeated division.

Corollary. Let $p$ be prime and $n\neq 0$. Then, $\nu_{p}\left(n\right)$ is equal to the number of trailing $0$s in $n$’s base-$p$ representation.
Proof.
Suppose $\nu_{p}\left(n\right) = k.$ Then, $p^{k} \mid\mid n,$ which means that $n_{i} = 0$ for all $i \lt k,$ and, moreover, $n_{k} \neq 0$ since, otherwise, we could divide out another $p.$
QED

In computers, we very often deal with integers expressed in binary, which many modern CPU architectures can manipulate quite freely. We’ll get back to this once we prove the proposition.

Proof of Proposition.
If $n\lt 0,$ we can multiply by $-1$ to get $-n\gt 0,$ and $d$ will be the same as for $-n,$ while the $n_{i}$ are just the negatives of those for $-n.$ Therefore, we need only think of $n\gt 0.$

Using the division algorithm, there exist $q_{0}, r_{0}$ with $0 \leq q_{0} \lt n$ and $0\leq r_{0} \lt b$ such that $n = q_{0}b + r_{0}.$ Let $n_{0} = r_{0},$ and repeat the division algorithm on $q_{0}.$ Since each subsequent quotient is smaller than the previous, and each one is non-negative due to the non-negativity of $b$ and $n,$ this process will eventually stop, with $n_{i} = r_{i}$ for each $i$ and $d$ equal to the index of the first quotient in this process where $q_{d} = 0.$ This proves existence of a representation.

For uniqueness, consider $d’$ and $n_{i}’$ such that $$n = \sum_{0\leq i\leq d’} n_{i}’ b^{i}.$$

Suppose $d\neq d’.$ Without loss in generality, suppose $d\lt d’.$ Then, $$n_{d’}’ b^{d’}\geq b^{d’} \gt b^{d’} – 1 = \sum_{0\leq i\leq d’ – 1} \left(b – 1\right) b^{i} \geq \sum_{0\leq i\leq d} \left(b – 1\right) b^{i} \geq \sum_{0\leq i\leq d} n_{i} b^{i} = n,$$ a contradiction.

This also forces $n_{i} = n_{i}’$ for all $0\leq i\leq d,$ and so uniqueness is proved.
QED

The corollary to this conjecture means that we can use the number of trailing zeroes of the base-$p$ representation of $n$ gives the valuation, which we can do easily for $p = 2$ on a computer (since everything’s already encoded in base-$2$) using std::countr_zero. Let’s look at the code! Note that we have to do a little extra work, since we’re using the Number Theory Library for the big integer support.

auto p_adic_valuation( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ
{   // Assuming n != 0 before proceeding; this will be handled in main() before this is called.
    NTL::ZZ valuation{ 0 };

    if( p != 2 )
    {
        while( n % p == 0 )
        {
            ++valuation;
            n /= p;
        }
    }
    else
    {   // We have to get the raw bytes of our integer n.
        auto num_bytes{ NTL::NumBytes( n ) };
        std::vector< std::uint8_t > bytes(num_bytes, 0);

        // Returns integer in bytes in order from least significant to most significant byte.
        NTL::BytesFromZZ( bytes.data(), n, num_bytes );

        for( auto b : bytes )
        {
            if( b == 0 )
            {   // Skip the trailing 0 byte and add 8 to the valuation.
                valuation += 8;
            }
            else
            {   // Non-zero byte means we count the final trailing zeroes and end.
                valuation += std::countr_zero( b );
                break;
            }
        }
    }

    return valuation;
}

There are a few things to explain (which are partially explained by the comments).
First, since we’re using NTL, we have to get the raw bytes for our integer in a vector of bytes. The NTL::BytesFromZZ function returns the bytes in order from least significant to most significant, which means the bytes are in order from lowest-order byte to highest-order byte. This means we can loop over the bytes, skip over the bytes which are all $0$ (adding $8$ to the valuation, since that’s $8$ trailing $0$s for each $0$ byte), and then call the std::countr_zero function on the first non-zero byte we see and terminate the loop there.

The following assembly was generated using Visual Studio 2022’s C++ compiler and debugger. I set a break point at the loop, right clicked it, and went to the disassembly. I added comments to explain what’s happening and used a different syntax coloring to distinguish it from all the C++ stuff.

; for( auto b : bytes )
mov         rdi,rbx               ; Move the address of the first element into rdi (this is our iterator)
cmp         rbx,r15               ; Check if our iterator is already at the end
je          p_adic_valuation+1C0h ; If so, skip loop.  
movzx       eax,byte ptr [rdi]    ; Move the first byte into eax by dereferencing the iterator

; test if( b == 0 )
mov         r8,rsi
mov         rcx,qword ptr [rsi]
test        al,al
jne         p_adic_valuation+1B1h ; Jump to else branch when b != 0

; if( b == 0 ) branch
; valuation += 8
mov         edx,8                 ; Setting up valuation += 8
call        _ntl_gsadd            ; NTL's add function for NTL::ZZ, finishing up valuation += 8
inc         rdi                   ; Increment the iterator forward 1 byte
cmp         rdi,r15               ; Test if iterator is at the end
jne         p_adic_valuation+190h ; Jump back to beginning of loop if not done
jmp         p_adic_valuation+1C0h ; Get out of the loop if finished

; else branch
; valuation += std::countr_zero( b );
tzcnt       edx,eax               ; std::countr_zero( b )
call        _ntl_gsadd            ; the "valuation +=" part
; Done with the loop!
;

Notice the tzcnt instruction in the else branch. This instruction counts the number of trailing zeroes of (in this case) a 32-bit integer. If I was running this on, say, a Sandy Bridge Intel CPU, this would actually be decoded as rep bsf by the CPU (the rep part doesn’t do anything), since the Sandy Bridge architecture doesn’t support tzcnt directly. The opcode for tzcnt was already in the Sandy Bridge architecture, but was sort-of a placeholder to allow for backwards compatibility when tzcnt finally came around. Neat!

The difference between tzcnt and bsf is a bit subtle.
When bsf is called on 0, the Zero Flag (ZF) is set to 1, and the destination register is left unchanged (this is how it’s defined for AMD CPUs, and while this is also the typical behavior on Intel CPUs, the upper 32 bits of the full 64-bit destination register are “undefined” according to Intel).
When tzcnt is called on 0, it sets the Carry Flag (CF); on the other hand, it sets ZF when it’s output is 0 (that is, when the least-significant bit is set).
In most cases, these differences don’t change anything, so they’re pretty interchangeable, except that tzcnt is typically faster.

For full transparency, I did modify the assembly slightly beyond just adding comments. There was an unnecessary or eax, 0FFFFFF00h line before tzcnt. When using the intrinsic function, _tzcnt_u32( b ), directly, the or line isn’t there. I tested the std::countr_zero( b ) line with g++ using g++ -S -march=native p_adic_valuation.cpp -lntl -lgmp -lm -std=c++20 -fverbose-asm -O3 -masm=intel to generate the assembly, and the tzcnt instruction had no or before it. I even set /arch:AVX2 in the compiler flags in Visual Studio, which should make std::countr_zero( b ) emit tzcnt without any extra code, so I’m not sure what caused this other than the fact that I’m calling it on a std::uint8_t.

The $p$-adic Valuation of $n!$ and popcnt

Now that we’ve implemented the p_adic_valuation function, let’s look at what happens if we call it on $n!$ for $p = 2, 3, 5.$ I’ve implemented the factorial function in the obvious way, so I omit that code. The following calls p_adic_valuation( factorial( n ), p ) for $n = 20$ and $p = 2,3,5.$
$ ./p_adic_valuation 20
The 2-adic valuation of 2432902008176640000 is 18.
The 3-adic valuation of 2432902008176640000 is 8.
The 5-adic valuation of 2432902008176640000 is 4.

Can we come up with a formula for the $p$-adic valuation of $n!$ to use instead? Turns out, yes!

Proposition. Let $p$ be prime and $n\geq 1$ an integer. Then, $$\nu_{p}\left(n!\right) = \sum_{\ell\geq 1}\left\lfloor \frac{n}{p^{\ell}} \right\rfloor.$$

While this is being expressed as an infinite sum, this is actually a finite sum, since eventially $p^{\ell} > n,$ and so the floor of their quotient will be $0$ from that point on.

Proof.
Since $n!$ is the product of $1$ through $n,$ we can divide by $p$ for each multiple of $p$ in the product. There are $\left\lfloor \frac{n}{p} \right\rfloor$ of these.
Similarly, we may divide out another $p$ for any factor of $n!$ which is a multiple of $p^{2}.$ There are $\left\lfloor \frac{n}{p^{2}} \right\rfloor$ of these.
And so on.
QED

Corollary (Legendre’s Formula). Let $p$ be a prime number and $n\geq 0$ an integer. Then, $$\nu_p\left(n!\right) = \frac{n – s_{p}\left(n\right)}{p – 1},$$ where $$s_{p}\left(n\right) = \sum_{0\leq i\leq d} n_{i}$$ is the sum of the base-$p$ digits of $n.$
Proof.
First, suppose that $p \lt n.$
If we write $n$ in base-$p,$ note that $n_{0} = 0$ and $$\nu_{p}\left(n!\right) = \sum_{\ell\geq 1}\left\lfloor \frac{n}{p^{\ell}} \right\rfloor = \left( n_{1} + n_{2}p + \cdots + n_{d}p^{d-1} \right) + \left( n_{2} + n_{3}p + \cdots n_{d} p^{d-2} \right) + \cdots + n_{d}.$$

Combining like terms gives $$\nu_{p}\left(n!\right) = n_{1} + n_{2}\left(1 + p\right) + \cdots + n_{d}\left(1 + \cdots + p^{d – 1}\right)$$ $$= \frac{1}{p – 1} \sum_{1\leq i\leq d} n_{i}\left(p^{i} – 1\right) = \frac{1}{p – 1} \left( \sum_{1\leq i\leq d} n_{i}p^{i} -\sum_{1\leq i\leq d} n_{i} \right) = \frac{n – s_{p}\left(n\right)}{p-1}.$$

If $n = p,$ then the base-$p$ representation is just $1,$ so $$\nu_{p}\left(p!\right) = 1 = \frac{p – 1}{p – 1} = \frac{p – s_{p}\left(p\right)}{p – 1}.$$

If $p \gt n,$ then $$\nu_{p}\left(n\right) = 0 = \frac{n – n}{p – 1} = \frac{n – s_{p}\left(n\right)}{p – 1}.$$
QED

Let’s implement this in C++.

// Legendre formula
auto legendre( NTL::ZZ const &n, NTL::ZZ const &p ) -> NTL::ZZ
{   // Returns the p-adic valuation of n! using Legendre's formula
    // v(p; n!) = (n - s(p; n))/(p-1)
    return ( n - digit_sum( n, p ) ) / ( p - 1 );
}

auto digit_sum( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ
{   // Returns the sum of the digits of n in base-p. This is s(p; n).
    auto sum{ NTL::ZZ{ 0 } };
    while( n != 0 )
    {
        sum += n % p;
        n /= p;
    }
    return sum;
}

Now we can use legendre( n, p ) instead of p_adic_valuation( factorial( n ), p ) to get the same result!

But like before, let’s think about the special case when $p = 2.$ What is $s_{2}\left(n\right)?$ It’s just the sum of the bits of the binary expansion of $n,$ so what?

There’s actually a CPU instruction, apparently believed to have been implemented because the NSA needed it, that we can use here, called popcnt. What the popcnt instruction does is return the number of set bits inside a given register; this is precisely the base-$2$ digit sum!

So, let’s re-implement digit_sum to use std::popcount when $p = 2.$

auto digit_sum( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ
{   // Returns the sum of the digits of n in base-p. This is s(p; n).
    auto sum{ NTL::ZZ{ 0 } };
    if( p != 2 )
    {
        while( n != 0 )
        {
            sum += n % p;
            n /= p;
        }
    }
    else
    {   // We have to get the raw bytes of our integer n.
        auto num_bytes{ NTL::NumBytes( n ) };
        std::vector< std::uint8_t > bytes( num_bytes, 0 );

        // Returns integer in bytes in order from least significant to most significant byte.
        NTL::BytesFromZZ( bytes.data(), n, num_bytes );

        for( auto b : bytes )
        {
            sum += std::popcount( b );
        }
    }

    return sum;
}

And, again, we’re going to look at the generated assembly.

; for( auto b : bytes )
mov         rbp,rbx                  ; Move address of first element into rbp (this is our iterator)
cmp         rbx,r15                  ; Check if our iterator is already at the end
je          digit_sum+1BFh           ; If so, skip loop.

; sum += std::popcount( b );
movzx       eax,byte ptr [rbp]       ; Dereference iterator and put value in eax
popcnt      edx,eax                  ; Get the population count and put into edx
mov         r8,r14                   ; Set up for NTL's add function
mov         rcx,qword ptr [r14]      ; 
call        _ntl_gsadd               ; NTL's add function for NTL::ZZ
inc         rbp                      ; Increment iterator
cmp         rbp,r15                  ; Check if our iterator is already at the end
jne         digit_sum+1A0h           ; Jump back to beginning of loop if not done.
; Done with the loop!
;

Since Visual C++ seems to be emitting more code than necessary, I’ve modified it slightly to what you see above. The comments should be explanatory enough.

As for the extra code that Visual C++ is emitting:
Using std::popcount gives
popcnt cx, ax
movzx edx, cx

while using the intrinsic function __popcnt directly gives (roughly, up to some register shuffling) popcnt edx, eax. Testing with g++ shows that only the singular popcnt edx, eax is necessary, so I guess Visual C++ might not like that I’m calling std::popcount on std::uint8_t bytes.

Conclusion

We’ve just introduced the tzcnt and popcnt instructions through covering a fun number-theoretic fact about the prime factorization of $n!$ via Legendre’s formula and implementing it all in C++. I’m happy with how this blog post turned out, so I’m ending it here. As always, full source code can be found on GitHub.

This entry was posted in Number Theory, Programming. Bookmark the permalink.