Today, we’ll be looking at $p$-adic valuation defined on the integers, with a specific look at a special formula for finding the $p$-adic valuation of $n!$ for any given $n.$ We won’t be as concerned about optimizations this time, since we’re looking at something with a known formula.
This will be a very quick post with a bit of mathematics and programming, but this time we’ll be looking at the disassembly of our program!
The $p$-adic Valuation and tzcnt
Definition. Let $p$ be a prime number. The $p$-adic valuation of an integer $n$ is defined by $${\nu_{p}\left(n\right)={\begin{cases}k &{\text{if }}n\neq 0, p^{k}\mid\mid n\\\infty &{\text{if }}n=0\end{cases}}}$$ where $p^{k}\mid\mid n$ means that $k$ is the highest power of $p$ which divides $n.$
How would we implement this in code? The easiest way, and the way we’ll be doing it (for most values of $p$), is to just divide $n$ by $p$ until we hit an integer which is no longer divisible by $p.$
auto p_adic_valuation( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ { // Assuming n != 0 before proceeding; this will be handled in main() before this is called. NTL::ZZ valuation{ 0 }; while( n % p == 0 ) { ++valuation; n /= p; } return valuation; }
Therefore, we get the following for p_adic_valuation( 89756756756756756667, 3 )
(note that I picked $89756756756756756667 = 3^{6}\cdot 123123123123123123$):
The 3-adic valuation of 89756756756667 is 8.
And, indeed, $3^{8}\mid\mid 89756756756756756667$ as $89756756756756756667 = 3^{8} \cdot 13680347013680347$ and $3 \nmid 13680347013680347.$
I said we wouldn’t worry about optimizations, but can we specialize this for $p = 2$ in a way that bypasses division? The following proposition and its corollary should give us an idea.
Proposition. Let $b \gt 1$ and $n \neq 0$ be integers. Then, there exist unique integers $d\geq 0$ and $n_{0},\ldots, n_{d}$ such that $$n = \sum_{0\leq i\leq d} n_{i} b^{i},$$ where $0\leq n_{i} \lt b$ for each $0\leq i\leq d$ and $n_{d}\neq 0.$
This proposition essentially says that, for any $b\gt 1,$ there exists a unique base-$b$ representation of $n,$ as long as we don’t consider extra $0$s appearing on the left. This is obviously true when $n = 0,$ since we can only represent $0$ as $0$ without adding extra $0$s.
Before proving this, there’s a very simple corollary, which I’ll prove first. This will be a hint for how we could do the $p = 2$ case without repeated division.
Corollary. Let $p$ be prime and $n\neq 0$. Then, $\nu_{p}\left(n\right)$ is equal to the number of trailing $0$s in $n$’s base-$p$ representation.
Proof.
Suppose $\nu_{p}\left(n\right) = k.$ Then, $p^{k} \mid\mid n,$ which means that $n_{i} = 0$ for all $i \lt k,$ and, moreover, $n_{k} \neq 0$ since, otherwise, we could divide out another $p.$
QED
In computers, we very often deal with integers expressed in binary, which many modern CPU architectures can manipulate quite freely. We’ll get back to this once we prove the proposition.
Proof of Proposition.
If $n\lt 0,$ we can multiply by $-1$ to get $-n\gt 0,$ and $d$ will be the same as for $-n,$ while the $n_{i}$ are just the negatives of those for $-n.$ Therefore, we need only think of $n\gt 0.$
Using the division algorithm, there exist $q_{0}, r_{0}$ with $0 \leq q_{0} \lt n$ and $0\leq r_{0} \lt b$ such that $n = q_{0}b + r_{0}.$ Let $n_{0} = r_{0},$ and repeat the division algorithm on $q_{0}.$ Since each subsequent quotient is smaller than the previous, and each one is non-negative due to the non-negativity of $b$ and $n,$ this process will eventually stop, with $n_{i} = r_{i}$ for each $i$ and $d$ equal to the index of the first quotient in this process where $q_{d} = 0.$ This proves existence of a representation.
For uniqueness, consider $d’$ and $n_{i}’$ such that $$n = \sum_{0\leq i\leq d’} n_{i}’ b^{i}.$$
Suppose $d\neq d’.$ Without loss in generality, suppose $d\lt d’.$ Then, $$n_{d’}’ b^{d’}\geq b^{d’} \gt b^{d’} – 1 = \sum_{0\leq i\leq d’ – 1} \left(b – 1\right) b^{i} \geq \sum_{0\leq i\leq d} \left(b – 1\right) b^{i} \geq \sum_{0\leq i\leq d} n_{i} b^{i} = n,$$ a contradiction.
This also forces $n_{i} = n_{i}’$ for all $0\leq i\leq d,$ and so uniqueness is proved.
QED
The corollary to this conjecture means that we can use the number of trailing zeroes of the base-$p$ representation of $n$ gives the valuation, which we can do easily for $p = 2$ on a computer (since everything’s already encoded in base-$2$) using std::countr_zero
. Let’s look at the code! Note that we have to do a little extra work, since we’re using the Number Theory Library for the big integer support.
auto p_adic_valuation( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ { // Assuming n != 0 before proceeding; this will be handled in main() before this is called. NTL::ZZ valuation{ 0 }; if( p != 2 ) { while( n % p == 0 ) { ++valuation; n /= p; } } else { // We have to get the raw bytes of our integer n. auto num_bytes{ NTL::NumBytes( n ) }; std::vector< std::uint8_t > bytes(num_bytes, 0); // Returns integer in bytes in order from least significant to most significant byte. NTL::BytesFromZZ( bytes.data(), n, num_bytes ); for( auto b : bytes ) { if( b == 0 ) { // Skip the trailing 0 byte and add 8 to the valuation. valuation += 8; } else { // Non-zero byte means we count the final trailing zeroes and end. valuation += std::countr_zero( b ); break; } } } return valuation; }
There are a few things to explain (which are partially explained by the comments).
First, since we’re using NTL, we have to get the raw bytes for our integer in a vector of bytes. The NTL::BytesFromZZ
function returns the bytes in order from least significant to most significant, which means the bytes are in order from lowest-order byte to highest-order byte. This means we can loop over the bytes, skip over the bytes which are all $0$ (adding $8$ to the valuation, since that’s $8$ trailing $0$s for each $0$ byte), and then call the std::countr_zero
function on the first non-zero byte we see and terminate the loop there.
The following assembly was generated using Visual Studio 2022’s C++ compiler and debugger. I set a break point at the loop, right clicked it, and went to the disassembly. I added comments to explain what’s happening and used a different syntax coloring to distinguish it from all the C++ stuff.
; for( auto b : bytes ) mov rdi,rbx ; Move the address of the first element into rdi (this is our iterator) cmp rbx,r15 ; Check if our iterator is already at the end je p_adic_valuation+1C0h ; If so, skip loop. movzx eax,byte ptr [rdi] ; Move the first byte into eax by dereferencing the iterator ; test if( b == 0 ) mov r8,rsi mov rcx,qword ptr [rsi] test al,al jne p_adic_valuation+1B1h ; Jump to else branch when b != 0 ; if( b == 0 ) branch ; valuation += 8 mov edx,8 ; Setting up valuation += 8 call _ntl_gsadd ; NTL's add function for NTL::ZZ, finishing up valuation += 8 inc rdi ; Increment the iterator forward 1 byte cmp rdi,r15 ; Test if iterator is at the end jne p_adic_valuation+190h ; Jump back to beginning of loop if not done jmp p_adic_valuation+1C0h ; Get out of the loop if finished ; else branch ; valuation += std::countr_zero( b ); tzcnt edx,eax ; std::countr_zero( b ) call _ntl_gsadd ; the "valuation +=" part ; Done with the loop! ;
Notice the tzcnt
instruction in the else branch. This instruction counts the number of trailing zeroes of (in this case) a 32-bit integer. If I was running this on, say, a Sandy Bridge Intel CPU, this would actually be decoded as rep bsf
by the CPU (the rep
part doesn’t do anything), since the Sandy Bridge architecture doesn’t support tzcnt
directly. The opcode for tzcnt
was already in the Sandy Bridge architecture, but was sort-of a placeholder to allow for backwards compatibility when tzcnt
finally came around. Neat!
The difference between tzcnt
and bsf
is a bit subtle.
When bsf
is called on 0, the Zero Flag (ZF) is set to 1, and the destination register is left unchanged (this is how it’s defined for AMD CPUs, and while this is also the typical behavior on Intel CPUs, the upper 32 bits of the full 64-bit destination register are “undefined” according to Intel).
When tzcnt
is called on 0, it sets the Carry Flag (CF); on the other hand, it sets ZF when it’s output is 0 (that is, when the least-significant bit is set).
In most cases, these differences don’t change anything, so they’re pretty interchangeable, except that tzcnt
is typically faster.
For full transparency, I did modify the assembly slightly beyond just adding comments. There was an unnecessary or eax, 0FFFFFF00h
line before tzcnt
. When using the intrinsic function, _tzcnt_u32( b )
, directly, the or
line isn’t there. I tested the std::countr_zero( b )
line with g++ using g++ -S -march=native p_adic_valuation.cpp -lntl -lgmp -lm -std=c++20 -fverbose-asm -O3 -masm=intel
to generate the assembly, and the tzcnt
instruction had no or
before it. I even set /arch:AVX2
in the compiler flags in Visual Studio, which should make std::countr_zero( b )
emit tzcnt
without any extra code, so I’m not sure what caused this other than the fact that I’m calling it on a std::uint8_t
.
The $p$-adic Valuation of $n!$ and popcnt
Now that we’ve implemented the p_adic_valuation
function, let’s look at what happens if we call it on $n!$ for $p = 2, 3, 5.$ I’ve implemented the factorial
function in the obvious way, so I omit that code. The following calls p_adic_valuation( factorial( n ), p )
for $n = 20$ and $p = 2,3,5.$
$ ./p_adic_valuation 20
The 2-adic valuation of 2432902008176640000 is 18.
The 3-adic valuation of 2432902008176640000 is 8.
The 5-adic valuation of 2432902008176640000 is 4.
Can we come up with a formula for the $p$-adic valuation of $n!$ to use instead? Turns out, yes!
Proposition. Let $p$ be prime and $n\geq 1$ an integer. Then, $$\nu_{p}\left(n!\right) = \sum_{\ell\geq 1}\left\lfloor \frac{n}{p^{\ell}} \right\rfloor.$$
While this is being expressed as an infinite sum, this is actually a finite sum, since eventially $p^{\ell} > n,$ and so the floor of their quotient will be $0$ from that point on.
Proof.
Since $n!$ is the product of $1$ through $n,$ we can divide by $p$ for each multiple of $p$ in the product. There are $\left\lfloor \frac{n}{p} \right\rfloor$ of these.
Similarly, we may divide out another $p$ for any factor of $n!$ which is a multiple of $p^{2}.$ There are $\left\lfloor \frac{n}{p^{2}} \right\rfloor$ of these.
And so on.
QED
Corollary (Legendre’s Formula). Let $p$ be a prime number and $n\geq 0$ an integer. Then, $$\nu_p\left(n!\right) = \frac{n – s_{p}\left(n\right)}{p – 1},$$ where $$s_{p}\left(n\right) = \sum_{0\leq i\leq d} n_{i}$$ is the sum of the base-$p$ digits of $n.$
Proof.
First, suppose that $p \lt n.$
If we write $n$ in base-$p,$ note that $n_{0} = 0$ and $$\nu_{p}\left(n!\right) = \sum_{\ell\geq 1}\left\lfloor \frac{n}{p^{\ell}} \right\rfloor = \left( n_{1} + n_{2}p + \cdots + n_{d}p^{d-1} \right) + \left( n_{2} + n_{3}p + \cdots n_{d} p^{d-2} \right) + \cdots + n_{d}.$$
Combining like terms gives $$\nu_{p}\left(n!\right) = n_{1} + n_{2}\left(1 + p\right) + \cdots + n_{d}\left(1 + \cdots + p^{d – 1}\right)$$ $$= \frac{1}{p – 1} \sum_{1\leq i\leq d} n_{i}\left(p^{i} – 1\right) = \frac{1}{p – 1} \left( \sum_{1\leq i\leq d} n_{i}p^{i} -\sum_{1\leq i\leq d} n_{i} \right) = \frac{n – s_{p}\left(n\right)}{p-1}.$$
If $n = p,$ then the base-$p$ representation is just $1,$ so $$\nu_{p}\left(p!\right) = 1 = \frac{p – 1}{p – 1} = \frac{p – s_{p}\left(p\right)}{p – 1}.$$
If $p \gt n,$ then $$\nu_{p}\left(n\right) = 0 = \frac{n – n}{p – 1} = \frac{n – s_{p}\left(n\right)}{p – 1}.$$
QED
Let’s implement this in C++.
// Legendre formula auto legendre( NTL::ZZ const &n, NTL::ZZ const &p ) -> NTL::ZZ { // Returns the p-adic valuation of n! using Legendre's formula // v(p; n!) = (n - s(p; n))/(p-1) return ( n - digit_sum( n, p ) ) / ( p - 1 ); } auto digit_sum( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ { // Returns the sum of the digits of n in base-p. This is s(p; n). auto sum{ NTL::ZZ{ 0 } }; while( n != 0 ) { sum += n % p; n /= p; } return sum; }
Now we can use legendre( n, p )
instead of p_adic_valuation( factorial( n ), p )
to get the same result!
But like before, let’s think about the special case when $p = 2.$ What is $s_{2}\left(n\right)?$ It’s just the sum of the bits of the binary expansion of $n,$ so what?
There’s actually a CPU instruction, apparently believed to have been implemented because the NSA needed it, that we can use here, called popcnt
. What the popcnt
instruction does is return the number of set bits inside a given register; this is precisely the base-$2$ digit sum!
So, let’s re-implement digit_sum
to use std::popcount
when $p = 2.$
auto digit_sum( NTL::ZZ n, NTL::ZZ const &p ) -> NTL::ZZ { // Returns the sum of the digits of n in base-p. This is s(p; n). auto sum{ NTL::ZZ{ 0 } }; if( p != 2 ) { while( n != 0 ) { sum += n % p; n /= p; } } else { // We have to get the raw bytes of our integer n. auto num_bytes{ NTL::NumBytes( n ) }; std::vector< std::uint8_t > bytes( num_bytes, 0 ); // Returns integer in bytes in order from least significant to most significant byte. NTL::BytesFromZZ( bytes.data(), n, num_bytes ); for( auto b : bytes ) { sum += std::popcount( b ); } } return sum; }
And, again, we’re going to look at the generated assembly.
; for( auto b : bytes ) mov rbp,rbx ; Move address of first element into rbp (this is our iterator) cmp rbx,r15 ; Check if our iterator is already at the end je digit_sum+1BFh ; If so, skip loop. ; sum += std::popcount( b ); movzx eax,byte ptr [rbp] ; Dereference iterator and put value in eax popcnt edx,eax ; Get the population count and put into edx mov r8,r14 ; Set up for NTL's add function mov rcx,qword ptr [r14] ; call _ntl_gsadd ; NTL's add function for NTL::ZZ inc rbp ; Increment iterator cmp rbp,r15 ; Check if our iterator is already at the end jne digit_sum+1A0h ; Jump back to beginning of loop if not done. ; Done with the loop! ;
Since Visual C++ seems to be emitting more code than necessary, I’ve modified it slightly to what you see above. The comments should be explanatory enough.
As for the extra code that Visual C++ is emitting:
Using std::popcount
gives
popcnt cx, ax
movzx edx, cx
while using the intrinsic function __popcnt
directly gives (roughly, up to some register shuffling) popcnt edx, eax
. Testing with g++ shows that only the singular popcnt edx, eax
is necessary, so I guess Visual C++ might not like that I’m calling std::popcount
on std::uint8_t
bytes.
Conclusion
We’ve just introduced the tzcnt
and popcnt
instructions through covering a fun number-theoretic fact about the prime factorization of $n!$ via Legendre’s formula and implementing it all in C++. I’m happy with how this blog post turned out, so I’m ending it here. As always, full source code can be found on GitHub.