david wong

Hey! I'm David, the author of the Real-World Cryptography book. I'm a crypto engineer at O(1) Labs on the Mina cryptocurrency, previously I was the security lead for Diem (formerly Libra) at Novi (Facebook), and a security consultant for the Cryptography Services of NCC Group. This is my blog about cryptography and security and other related topics that I find interesting.

Defcon: SHA-3 vs the world posted July 2017

I'll be speaking at the Defcon Crypto village again this year (my talk of last year is here).

It will be about recent hash functions, it will focus a lot on SHA-3 and it will try to avoid any of the recent controversy on which hash function is better (it will be hard but I will try to be neutral and fair).

It'll be recorded if you can't make it. If you can make it, head to the crypto village at 11am on Friday. See the Defcon Crypto Village schedule here. And here is the abstract:

Since Keccak has been selected as the winner of the SHA-3 competition in 2012, a myriad of different hash functions have been trending. From BLAKE2 to KangarooTwelve we'll cover what hash functions are out there, what is being used, and what you should use. Extending hash functions, we’ll also discover STROBE, a symmetric protocol framework derived from SHA-3.

sponge

2 comments

How big are TLS records during a handshake? posted June 2017

I've asked some TLS size questions on Twitter and got some nice results :]

People mostly got it right for the Client Hello. But it wasn't as easy for the Server Hello.

Client Hello → 212 bytes

Server Hello → 66 bytes

These are just numbers I got from a TLS 1.2 handshake with a random website. These numbers are influenced by the browser I use and the configuration of the server. But they should be close to that range anyway as the structure of a Client Hello or a Server Hello are quite simple.

A better question would be: what is the bigger message? and the Client Hello would always win. This is because the Server Hello only replies with one choice from the list of choices the client proposed. For example the server will choose only one ciphersuite from the 13 suites the client proposed. The server will choose one curve from the 3 different curves proposed by the client. The server will choose a single signature algorithm from the client's 10 propositions. And on and on...

Everyone (mostly) got this one!

Certificate → 2540 bytes

Obviously, this is the biggest message of the handshake by far. The number I got is from receiving two different certificates where each certificate is about a thousand bytes. This is because servers tend to send the full chain of certificates to the client, longer chains will increase the size of this message. Probably why there are propositions for a certification compression extension.

ServerKeyExchange → 338 bytes

ClientKeyExchange → 75 bytes

Both of these messages include the peer's public key during ephemeral key exchanges. But the ServerKeyExchange additionally contains the parameters of the key exchange algorithm and a signature of the server's public key. In my case, the signature was done with RSA-2048 and of size 256 bytes, while the NIST p256 public keys were of size 65 bytes.

Using ECDSA for signing, signatures could have been smaller. Using FFDH for the key agreement, public keys could have been bigger.
Tim Dierks also mentioned that using RSA-10000 would have drastically increased the size of the ServerKeyExchange.
Maybe a better question, again, would be which one is the bigger message.

People mostly got it right here!

The rest of the handshake is negligible:

ChangeCipherSpec is just 6 bytes indicating a switch to encryption, it will always be the same size no matter what kind of handshake you went through, most of its length comes from the record's header.

Finished is 45 bytes. Its content is a MAC of the handshake transcript, but an additional MAC is added to protect the integrity of the ciphertext (ciphertext expansion). Remember, Finished is the first (and only) encrypted message in a handshake.

comment on this story

Crypto training at Black Hat USA posted June 2017

I'll be back in Vegas this year to give the crypto training of Black Hat. The class is not full yet so hurry up if that is something that interests you.

It will be a blend of culture, exercises and technical dives. For 2 days, students get to learn all the cool crypto attacks, get to dive into some of them deeply, and get to interact via numerous exercises.

comment on this story

Noise+Strobe=Disco posted June 2017

disco

Noise is a protocol framework allowing you to build different lightweight TLS-like handshakes depending on your use case. Benefits are a short code size, very few dependencies, simplicity of the security guarantees and analysis. It focuses primarily on the initial asymmetric phase of the setup of a secure channel, but does leave you with two ciphers that you can use to read and write on both sides of the connection. If you want to know more, I wrote a readable implementation, and have a tutorial video.

Strobe is a protocol framework as well, focusing on the symmetric part of the protocol. Its simplicity boils down to only using one cryptographic primitive: the duplex construction. Which allows developers to benefit from an ultra short cryptographic code base supporting their custom-made symmetric protocols as well as their different needs of cryptographic functions. Indeed, Strobe can be used as well to instantiate a hash function, a key derivation function, a pseudo-random number generator, a message authentication code, an authenticated encryption with associated data cipher, etc... If you want to know more, I wrote a readable implementation and Mike Hamburg gave a talk at RWC.

Noise+Strobe=Disco. One of Noise's major character is that it keeps a running hash, digesting every message and allowing every new handshake message to mix the transcript in its encryption while authenticating previous messages received and sent. Strobe works like that naturally. Its duplex function absorbs every calls being made to the underlying primitive (the Keccak permutation), to the extent that every new operation is influenced by any operation that happened previously. These two common traits in Strobe and Noise led me to pursue a merge between the two: what if that running hash and symmetric state in Noise was simply Strobe's primitive? And what if at the end of a handshake Noise would just spew out two Strobe's objects also depending on the handshake transcript? I talked to Trevor Perrin about it and his elegant suggestion for a name (Disco) and my curiosity led to an implementation of what it would look like.

This is of course highly experimental. I modified the Noise's specification to see how much I could remove/simplify from it and the result is already enjoyable.

I've discussed the changes on the mailing list. But simply put: the CipherState has been removed, the SymmetricState has been replaced by calls to Strobe. This leaves us only with one object: the HandshakeState. Every symmetric algorithm has been removed (HDKF, HMAC, HASH, AEAD). The specification looks way shorter, while the Disco implementation is more than half the size of the Noise implementation.

The Strobe's calls naturally absorbs every operation, and can encrypt/decrypt the handshake messages even if no shared secret has been negotiated (with a non-keyed duplex construction), which simplifies corner cases where you would have to test if you have already negotiated a shared secret or not.

comment on this story

Readable implementation of the Noise protocol framework posted June 2017

I wrote an implementation of the Noise Protocol Framework. If you don't know what that is, it is a framework to create lightweight TLS-like protocols. If you do not want to use TLS because it is unnecessarily complicated, and you know what you're doing, Noise is the solution. You have different patterns for different usecase and everything is well explained for you to implement it smoothly.

To learn more about Noise you can also check this screencast I shot last year:

My current research includes merging this framework with the Strobe protocol framework I've talked about previously.

This led me to first implement a readable and understandable version of Noise here.

Note that this is highly experimental and it has not been thoroughly tested.

I also had to deviate from the specification when naming things because Golang:

  • doesn't use snake_case, but Noise does.
  • capitalizes function names to make them public, Noise does it for different reasons.
comment on this story

SIMD instructions in Go posted June 2017

One awesome feature of Go is cross-compilation. One limitation is that we can only choose to build for some pre-defined architectures and OS, but we can't build per CPU-model. In the previous post I was talking about C programs, where the user actually chooses the CPU model when calling the Make. Go could probably have something like that but it wouldn't be gooy. One solution is to build for every CPU models anyway, and decide later what is good to be used. So one assembly code for SSE2, one code for AVX, one code for AVX-512.

Note that we do not need to use SSE3/SSE4 (or AVX2) as the interesting functions are contained in SSE2 (respectively AVX) which will have more support and be contained in greater versions of SSE (respectively AVX) anyway.

The official Blake2 implementation in Go actually uses SIMD instructions. Looking at it is a good way to see how SIMD coding works in Go.

In _amd64.go, they use the builtin init() function to figure out at runtime what is supported by the host architecture:

func init() {
    useAVX2 = supportsAVX2()
    useAVX = supportsAVX()
    useSSE4 = supportsSSE4()
}

Which are calls to assembly functions detecting what is supported either via:

  1. a CPUID call directly for SSE4.
  2. calls to Golang's runtime library for AVX and AVX2.

In the second solution, the runtime variables seems to be undocumented and only available since go1.7, they are probably filled via cpuid calls as well. Surprisingly, the internal/cpu package already has all the necessary functions to detect flavors of SIMD. See an example of use in the bytes package.

And that's it! Blake2's hashBlocks() function then dynamically decides which function to use at runtime:

func hashBlocks(h *[8]uint64, c *[2]uint64, flag uint64, blocks []byte) {
    if useAVX2 {
        hashBlocksAVX2(h, c, flag, blocks)
    } else if useAVX {
        hashBlocksAVX(h, c, flag, blocks)
    } else if useSSE4 {
        hashBlocksSSE4(h, c, flag, blocks)
    } else {
        hashBlocksGeneric(h, c, flag, blocks)
    }
}

Because Go does not have intrisic functions for SIMD, these are implemented directly in assembly. You can look at the code in the relevant _amd64.s file. Now it's kind of tricky because Go has invented its own assembly language (based on Plan9) and you have to find out things the hard way. Instructions like VINSERTI128 and VPSHUFD are the SIMD instructions. MMX registers are M0...M7, SSE registers are X0...X15, AVX registers are Y0, ..., Y15. MOVDQA is called MOVO (or MOVOA) and MOVDQU is called MOVOU. Things like that.

As for AVX-512, Go probably still doesn't have instructions for that. So you'll need to write the raw opcodes yourself using BYTE (like here) and as explained here.

3 comments

SIMD instructions in crypto posted June 2017

The Keccak Code Package repository contains all of the Keccak team's constructions, including for example SHA-3, SHAKE, cSHAKE, ParallelHash, TupleHash, KMAC, Keyak, Ketje and KangarooTwelve. ParallelHash and KangarooTwelve are two hash functions based on the same basis of SHA-3, but that can be sped up with parallelization. This makes these two hash functions really interesting, especially when hashing big files.

MMX, SSE, SSE2, AVX, AVX2, AVX-512

To support parallelization, a common way is to use SIMD instructions, a set of instructions generally available on any modern 64-bit architecture that allows computation on large blocks of data (64, 128, 256 or 512 bits). Using them to operate in blocks of data is what we often call vector/array programming, the compiler will sometimes optimize your code by automatically using these large SIMD registers.

SIMD instructions have been here since the 70s, and have become really common. This is one of the reason why image, sound, video and games all work so well nowadays. Generally, if you're on a 64-bit architecture your CPU will support SIMD instructions.

There are several versions of these instructions. On Intel's side these are called MMX, SSE and AVX instructions. AMD has SSE and AVX instructions as well. On ARM these are called NEON instructions.

MMX allows you to operate on 64-bit registers at once (called MM registers). SSE, SSE2, SSE3 and SSE4 all allow you to use 128-bit registers (XMM registers). AVX and AVX2 introduced 256-bit registers (YMM registers) and the more recent AVX-512 supports 512-bit registers (ZMM registers).

How To Compile?

OK, looking back at the Keccak Code Package, I need to choose what architecture to compile my Keccak code with to take advantage of the parallelization. I have a macbook pro, but have no idea what kind version of SSE or AVX my CPU model supports. One way to find out is to use www.everymac.com → I have an Intel CPU Broadwell which seems to support AVX2!

Looking at the list of architectures supported by the Keccak Code Package I see Haswell, which is of the same family and supports AVX2 as well. Compiling with it, I can run my KangarooTwelve code with AVX2 support, which parallelizes four runs of the Keccak permutation at the same time using these 256-bit registers!

In more details, the Keccak permutation goes through several rounds (12 for KangarooTwelve, 24 for ParallelHash) that need to serially operate on a succession of 64-bit lanes. AVX (no need for AVX2) 256-bit's registers allow four 64-bit lanes to be operated on at the same time. That's effectively four Keccak permutations running in parallel.

Intrisic Instructions

Intrisic functions are functions you can use directly in code, and that are later recognized and handled by the compiler.

Intel has an awesome guide on these here. You just need to find out which function to use, which is pretty straight forward looking at the documentation.

doc

In C, if you're compiling with GCC on an Intel/AMD architecture you can start using intrisic functions for SIMD by including x86intrin.h. Or you can use this script to include the correct file for different combination of compilers and architectures:

#if defined(_MSC_VER)
     /* Microsoft C/C++-compatible compiler */
     #include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
     /* GCC-compatible compiler, targeting x86/x86-64 */
     #include <x86intrin.h>
#elif defined(__GNUC__) && defined(__ARM_NEON__)
     /* GCC-compatible compiler, targeting ARM with NEON */
     #include <arm_neon.h>
#elif defined(__GNUC__) && defined(__IWMMXT__)
     /* GCC-compatible compiler, targeting ARM with WMMX */
     #include <mmintrin.h>
#elif (defined(__GNUC__) || defined(__xlC__)) && (defined(__VEC__) || defined(__ALTIVEC__))
     /* XLC or GCC-compatible compiler, targeting PowerPC with VMX/VSX */
     #include <altivec.h>
#elif defined(__GNUC__) && defined(__SPE__)
     /* GCC-compatible compiler, targeting PowerPC with SPE */
     #include <spe.h>
#endif

If we look at the reference implementation of KangarooTwelve in C we can see how they decided to use the AVX2 instructions. They first define a __m256i variable which will hold 4 lanes at the same time.

typedef __m256i V256;

They then declare a bunch of them. Some of them will be used as temporary registers.

They then use unrolling to write the 12 rounds of Keccak. Which are defined via relevant AVX2 instructions:

    #define ANDnu256(a, b)          _mm256_andnot_si256(a, b)
    #define CONST256(a)             _mm256_load_si256((const V256 *)&(a))
    #define CONST256_64(a)          (V256)_mm256_broadcast_sd((const double*)(&a))
    #define LOAD256(a)              _mm256_load_si256((const V256 *)&(a))
    #define LOAD256u(a)             _mm256_loadu_si256((const V256 *)&(a))
    #define LOAD4_64(a, b, c, d)    _mm256_set_epi64x((UINT64)(a), (UINT64)(b), (UINT64)(c), (UINT64)(d))
    #define ROL64in256(d, a, o)     d = _mm256_or_si256(_mm256_slli_epi64(a, o), _mm256_srli_epi64(a, 64-(o)))
    #define ROL64in256_8(d, a)      d = _mm256_shuffle_epi8(a, CONST256(rho8))
    #define ROL64in256_56(d, a)     d = _mm256_shuffle_epi8(a, CONST256(rho56))
    #define STORE256(a, b)          _mm256_store_si256((V256 *)&(a), b)
    #define STORE256u(a, b)         _mm256_storeu_si256((V256 *)&(a), b)
    #define STORE2_128(ah, al, v)   _mm256_storeu2_m128d((V128*)&(ah), (V128*)&(al), v)
    #define XOR256(a, b)            _mm256_xor_si256(a, b)
    #define XOReq256(a, b)          a = _mm256_xor_si256(a, b)
    #define UNPACKL( a, b )         _mm256_unpacklo_epi64((a), (b))
    #define UNPACKH( a, b )         _mm256_unpackhi_epi64((a), (b))
    #define PERM128( a, b, c )      (V256)_mm256_permute2f128_ps((__m256)(a), (__m256)(b), c)
    #define SHUFFLE64( a, b, c )    (V256)_mm256_shuffle_pd((__m256d)(a), (__m256d)(b), c)

And if you're wondering how each of these _mm256 function is used, you can check the same Intel documentation

avx shuffle

Voila!

comment on this story