Hey! I'm David, cofounder of zkSecurity and the author of the Real-World Cryptography book. I was previously a crypto architect at O(1) Labs (working on the Mina cryptocurrency), before that I was the security lead for Diem (formerly Libra) at Novi (Facebook), and a security consultant for the Cryptography Services of NCC Group. This is my blog about cryptography and security and other related topics that I find interesting.
One awesome feature of Go is cross-compilation. One limitation is that we can only choose to build for some pre-defined architectures and OS, but we can't build per CPU-model. In the previous post I was talking about C programs, where the user actually chooses the CPU model when calling the Make. Go could probably have something like that but it wouldn't be gooy. One solution is to build for every CPU models anyway, and decide later what is good to be used. So one assembly code for SSE2, one code for AVX, one code for AVX-512.
Note that we do not need to use SSE3/SSE4 (or AVX2) as the interesting functions are contained in SSE2 (respectively AVX) which will have more support and be contained in greater versions of SSE (respectively AVX) anyway.
The official Blake2 implementation in Go actually uses SIMD instructions. Looking at it is a good way to see how SIMD coding works in Go.
In _amd64.go, they use the builtin init() function to figure out at runtime what is supported by the host architecture:
In the second solution, the runtime variables seems to be undocumented and only available since go1.7, they are probably filled via cpuid calls as well. Surprisingly, the internal/cpu package already has all the necessary functions to detect flavors of SIMD. See an example of use in the bytes package.
And that's it! Blake2's hashBlocks() function then dynamically decides which function to use at runtime:
func hashBlocks(h *[8]uint64, c *[2]uint64, flag uint64, blocks []byte) {
if useAVX2 {
hashBlocksAVX2(h, c, flag, blocks)
} else if useAVX {
hashBlocksAVX(h, c, flag, blocks)
} else if useSSE4 {
hashBlocksSSE4(h, c, flag, blocks)
} else {
hashBlocksGeneric(h, c, flag, blocks)
}
}
Because Go does not have intrisic functions for SIMD, these are implemented directly in assembly. You can look at the code in the relevant _amd64.s file. Now it's kind of tricky because Go has invented its own assembly language (based on Plan9) and you have to find out things the hard way. Instructions like VINSERTI128 and VPSHUFD are the SIMD instructions. MMX registers are M0...M7, SSE registers are X0...X15, AVX registers are Y0, ..., Y15. MOVDQA is called MOVO (or MOVOA) and MOVDQU is called MOVOU. Things like that.
As for AVX-512, Go probably still doesn't have instructions for that. So you'll need to write the raw opcodes yourself using BYTE (like here) and as explained here.
The Keccak Code Package repository contains all of the Keccak team's constructions, including for example SHA-3, SHAKE, cSHAKE, ParallelHash, TupleHash, KMAC, Keyak, Ketje and KangarooTwelve. ParallelHash and KangarooTwelve are two hash functions based on the same basis of SHA-3, but that can be sped up with parallelization. This makes these two hash functions really interesting, especially when hashing big files.
MMX, SSE, SSE2, AVX, AVX2, AVX-512
To support parallelization, a common way is to use SIMD instructions, a set of instructions generally available on any modern 64-bit architecture that allows computation on large blocks of data (64, 128, 256 or 512 bits). Using them to operate in blocks of data is what we often call vector/array programming, the compiler will sometimes optimize your code by automatically using these large SIMD registers.
SIMD instructions have been here since the 70s, and have become really common. This is one of the reason why image, sound, video and games all work so well nowadays. Generally, if you're on a 64-bit architecture your CPU will support SIMD instructions.
There are several versions of these instructions. On Intel's side these are called MMX, SSE and AVX instructions. AMD has SSE and AVX instructions as well. On ARM these are called NEON instructions.
MMX allows you to operate on 64-bit registers at once (called MM registers). SSE, SSE2, SSE3 and SSE4 all allow you to use 128-bit registers (XMM registers). AVX and AVX2 introduced 256-bit registers (YMM registers) and the more recent AVX-512 supports 512-bit registers (ZMM registers).
Looking at the list of architectures supported by the Keccak Code Package I see Haswell, which is of the same family and supports AVX2 as well. Compiling with it, I can run my KangarooTwelve code with AVX2 support, which parallelizes four runs of the Keccak permutation at the same time using these 256-bit registers!
In more details, the Keccak permutation goes through several rounds (12 for KangarooTwelve, 24 for ParallelHash) that need to serially operate on a succession of 64-bit lanes. AVX (no need for AVX2) 256-bit's registers allow four 64-bit lanes to be operated on at the same time. That's effectively four Keccak permutations running in parallel.
Intrisic Instructions
Intrisic functions are functions you can use directly in code, and that are later recognized and handled by the compiler.
In C, if you're compiling with GCC on an Intel/AMD architecture you can start using intrisic functions for SIMD by including x86intrin.h. Or you can use this script to include the correct file for different combination of compilers and architectures:
If we look at the reference implementation of KangarooTwelve in C we can see how they decided to use the AVX2 instructions. They first define a __m256i variable which will hold 4 lanes at the same time.
#define ANDnu256(a, b) _mm256_andnot_si256(a, b)
#define CONST256(a) _mm256_load_si256((const V256 *)&(a))
#define CONST256_64(a) (V256)_mm256_broadcast_sd((const double*)(&a))
#define LOAD256(a) _mm256_load_si256((const V256 *)&(a))
#define LOAD256u(a) _mm256_loadu_si256((const V256 *)&(a))
#define LOAD4_64(a, b, c, d) _mm256_set_epi64x((UINT64)(a), (UINT64)(b), (UINT64)(c), (UINT64)(d))
#define ROL64in256(d, a, o) d = _mm256_or_si256(_mm256_slli_epi64(a, o), _mm256_srli_epi64(a, 64-(o)))
#define ROL64in256_8(d, a) d = _mm256_shuffle_epi8(a, CONST256(rho8))
#define ROL64in256_56(d, a) d = _mm256_shuffle_epi8(a, CONST256(rho56))
#define STORE256(a, b) _mm256_store_si256((V256 *)&(a), b)
#define STORE256u(a, b) _mm256_storeu_si256((V256 *)&(a), b)
#define STORE2_128(ah, al, v) _mm256_storeu2_m128d((V128*)&(ah), (V128*)&(al), v)
#define XOR256(a, b) _mm256_xor_si256(a, b)
#define XOReq256(a, b) a = _mm256_xor_si256(a, b)
#define UNPACKL( a, b ) _mm256_unpacklo_epi64((a), (b))
#define UNPACKH( a, b ) _mm256_unpackhi_epi64((a), (b))
#define PERM128( a, b, c ) (V256)_mm256_permute2f128_ps((__m256)(a), (__m256)(b), c)
#define SHUFFLE64( a, b, c ) (V256)_mm256_shuffle_pd((__m256d)(a), (__m256d)(b), c)
And if you're wondering how each of these _mm256 function is used, you can check the same Intel documentation
a lot of keywords here are really interesting. But first, what is a Mersenne prime?
A mersenne prime is simply a prime \(p\) such that \(p=2^n - 1\). The nice thing about that, is that the programming way of writing such a number is
(1 << n) - 1
which is a long series of 1s.
A number modulo this prime can be any bitstring of the mersenne prime's length.
OK we know what's a Mersenne prime. How do we build our new public key cryptosystem now?
Let's start with a private key: (secret, privkey), two bitstrings of low hamming weight. Meaning that they do not have a lot of bits set to 1.
Now something very intuitive happens: the inverse of such a bitstring will probably have a high hamming weight, which let us believe that \(secret \cdot privkey^{-1} \pmod{p}\) looks random. This will be our public key.
Now that we have a private key and a public key. How do we encrypt ?
The paper starts with a very simple scheme on how to encrypt a bit \(b\).
\[ciphertext = (-1)^b \cdot ( A \cdot pubkey + B ) \pmod{p} \]
with \(A\) and \(B\) two public numbers that have low hamming weights as well.
We can see intuitively that the ciphertext will have a high hamming weight (and thus might look random).
If you are not convinced, all of this is based on actual proofs that such operations between low and high hamming weight bitstrings will yield other low or high hamming weight bitstrings. All of this really work because we are modulo a \(1111\cdots\) kind of number. The following lemmas taken from the paper are proven in section 2.1.
How do you decrypt such an encrypted bit?
This is how:
\[ciphertext \cdot privkey \pmod{p}\]
This will yield either a low hamming weight number → the original bit \(b\) was a \(0\),
or a high hamming weight number → the original bit \(b\) was a \(1\).
You can convince yourself by following the equation:
And again, intuitively you can see that everything is low hamming weight except for the value of \((-1)^b\).
This scheme doesn't look CCA secure nor practical. The paper goes on with an explanation of a more involved cryptosystem in section 6.
EDIT: there is already a reduction of the security estimates published on eprint.
Recently T. Duong, D. Bleichenbacher, Q. Nguyen and B. Przydatek released a crypto library intitled Tink. At its center lied an implementation of AES-GCM somehow different from the rest: it did not take a nonce as one of its input.
A few days ago, at the Crypto SummerSchool in Croatia, Nik Kinkel told me that he would generally recommend against letting developers tweak the nonce value, based on how AES-GCM tended to be heavily mis-used in the wild. For a recap, if a nonce is used twice to encrypt two different messages AES-GCM will leak the authentication key.
I think it's a fair improvement of AES-GCM to remove the nonce argument. By doing so, nonces have to be randomly generated. Now the new danger is that the same nonce is randomly generated twice for the same key. The birthday bound tells us that after \(2^{n/2}\) messages, \(n\) being the bit-size of a nonce, you have great odds of generating a previous nonce.
The optimal rekey point has been studied by Abdalla and Bellare and can computed with the cubic root of the nonce space. If more nonces are generated after that, chances of a nonce collision are too high. For AES-GCM this means that after \(2^{92/3} = 1704458900\) different messages, the key should be rotated.
This is of course assuming that you use 92-bit nonces with 32-bit counters. Some protocol and implementations will actually fix the first 32 bits of these 92-bit nonces reducing the birthday bound even further.
Isn't that a bit low?
Yes it kinda is. An interesting construction by Dan J. Bernstein called XSalsa20 (and that can be extended to XChacha20) allow us to use nonces of 192 bits. This would mean that you should be able to use the same key for up to \(2^{192/3} = 18446744073709551616\) messages. Which is already twice more that what a BIG INT can store in a database
It seems like Sponge-based AEADs should benefit from large nonces as well since their rate can store even more bits. This might be a turning point for these constructions in the last round of the CAESAR competition. There are currently 4 of these: Ascon, Ketje, Keyak and NORX.
With that in mind, is nonce mis-use resistance now fixed?
EDIT: Here is a list of recent papers on the subject:
It is heavily based on the official Go's x/crypto/sha3 library. But because of minor implementation details the relevant files have been copied and modified there so you do not need Go's SHA-3 implementation to run this package. Hopefully one day Go's SHA-3 library will be more flexible to allow other keccak construction to rely on it.
I have tested this implementation with different test vectors and it works fine. Note that it has not received proper peer review. If you look at the code and find issues (or not) please let me know!
This implementation does not yet make use of SIMD to parallelize the implementation. But we can already see improvements due to the smaller number of rounds:
100 bytes
1000 bytes
10,000 bytes
K12
761 ns/op
1875 ns/op
15399 ns/op
SHA3
854 ns/op
3962 ns/op
34293 ns/op
SHAKE128
668 ns/op
2853 ns/op
29661 ns/op
This was done with a very simple bench script on my 2 year-old macbook pro.
Adam Langley from Google wrote a blogpost yesterday called Maybe Skip SHA-3. It started some discussions both on HN and on r/crypto
Speed drives adoption, Daniel J. Bernstein (djb) probably understand this more than anyone else. And this is what led Adam Langley to decide to either stay on SHA-2 or move to BLAKE2. Is that a good advice? Should we all follow his steps?
I think it is important to remember that Google, as well as the other big players, have an agenda for speed. I'm not saying that they do not care about security, it would be stupid for me to say that, especially seeing all of the improvements we've had in the field thanks to them in the last few years (Project Zero, Android, Chrome, Wycheproof, Tink, BoringSSL, email encryption, Key Transparency, Certificate Transparency, ...)
What I'm saying is that although they care deeply about security, they are also looking for compromises to gain speed. This can be seen with the push for 0-RTT in TLS, but I believe we're seeing it here as well with a push for either KangarooTwelve (K12) or BLAKE2 instead of SHA-3/SHAKE and BLAKE (which are more secure versions of K12 and BLAKE2).
Adam Langley even went as far as to recommend folks to stay on SHA-2.
But how can we keep advising SHA-2 when we know it can be badly misused? Yes I'm talking about length extension attacks. These attacks that prevent you from using SHA-2(key|data) to create a Message Authentication Code (MAC).
We recently cared so much about misuses of AES-GCM (documented misuses!) that Adam Langley's previous blog post to the SHA-3 one is about AES-GCM-SIV.
We cared so much about simple APIs, that recently Tink simply removed the nonce argument from AES-GCM's API.
If we cared so much about documented misusage of crypto APIs, how can we not care about this undocumented misuse of SHA-2? Intuitively, if SHA-2 was to behave like a random oracle there would be no problem at all with the SHA-2(key|data) construction. Actually, none of the more secure hash functions like Blake and SHA-3 have this problem.
If we cared so much about simple APIs, why would we advise people to "fix" SHA-2 by truncating 256 bits of SHA-512's output (SHA-512/256)?
The reality is that you should use SHA-3. I'm making this as a broad recommendation for people who do not know much about cryptography. You can't go wrong with the NIST's standard.
Now let's see how we can dilute a good advice.
If you care about speed you should use SHAKE or BLAKE.
If you really care about speed you should use KangarooTwelve or BLAKE2.
If you really really care about speed you should use SHA-512/256. (edit: people are pointing to me that Blake2 is also faster than that)
If you really really really care about speed you should use CRC32 (don't, it's a joke).
How big of a security compromise are you willing to make? We know the big players have decided, but have they decided for you?
Is SHA-3 that slow?
Where is a hash function usually used? One major use is in signing messages. You hash the message before signing it because you can't sign something that is too big.
Here is a number taken from djb's benchmarks on how many cycles it takes to sign with ed25519: 187746
Here is a number taken from djb's benchmarks on how many cycles it takes to hash a byte with keccak512: 15
Spoiler Alert: you probably don't care about a few cycles.
Not to say there are no cases where this is relevant, but if you're doing a huge amount of hashing and you're hashing big messages, you should rather switch to a tree-hashing mode. KangarooTwelve, BLAKE2bp and BLAKE2sp all support that.
EDIT: The Keccak team just released a response as well.
Reading cryptographic implementations, you can quickly spot a lot of weirdness and oddities. Unfortunately, this is because very few cryptographic implementations aim for readability (hence the existence of readable implementations). But on the other hand, there are a lot of cool techniques in there which are interesting to learn about.
Bit-slicing is one I've never really talked about, so here is a self-explanatory tweet quoting Thomas Pornin:
But I'm here to talk about loop unrolling. Here's what wikipedia has to say about it:
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as the space-time tradeoff.
Here is an implementation of MD5 in go. You can see that the gen.go file is generating a different md5block.go file which will be the unrolled code. Well yeah, sometimes it's just easier than doing it by hand :]