david wong

Hey! I'm David, a security consultant at Cryptography Services, the crypto team of NCC Group . This is my blog about cryptography and security and other related topics that I find interesting.

Readable implementation of the Noise protocol framework

posted June 2017

I wrote an implementation of the Noise Protocol Framework. If you don't know what that is, it is a framework to create lightweight TLS-like protocols. If you do not want to use TLS because it is unnecessarily complicated, and you know what you're doing, Noise is the solution. You have different patterns for different usecase and everything is well explained for you to implement it smoothly.

To learn more about Noise you can also check this screencast I shot last year:

My current research includes merging this framework with the Strobe protocol framework I've talked about previously.

This led me to first implement a readable and understandable version of Noise here.

Note that this is highly experimental and it has not been thoroughly tested.

I also had to deviate from the specification when naming things because Golang:

  • doesn't use snake_case, but Noise does.
  • capitalizes function names to make them public, Noise does it for different reasons.
comment on this story

SIMD instructions in Go

posted June 2017

One awesome feature of Go is cross-compilation. One limitation is that we can only choose to build for some pre-defined architectures and OS, but we can't build per CPU-model. In the previous post I was talking about C programs, where the user actually chooses the CPU model when calling the Make. Go could probably have something like that but it wouldn't be gooy. One solution is to build for every CPU models anyway, and decide later what is good to be used. So one assembly code for SSE2, one code for AVX, one code for AVX-512.

Note that we do not need to use SSE3/SSE4 (or AVX2) as the interesting functions are contained in SSE2 (respectively AVX) which will have more support and be contained in greater versions of SSE (respectively AVX) anyway.

The official Blake2 implementation in Go actually uses SIMD instructions. Looking at it is a good way to see how SIMD coding works in Go.

In _amd64.go, they use the builtin init() function to figure out at runtime what is supported by the host architecture:

func init() {
    useAVX2 = supportsAVX2()
    useAVX = supportsAVX()
    useSSE4 = supportsSSE4()
}

Which are calls to assembly functions detecting what is supported either via:

  1. a CPUID call directly for SSE4.
  2. calls to Golang's runtime library for AVX and AVX2.

In the second solution, the runtime variables seems to be undocumented and only available since go1.7, they are probably filled via cpuid calls as well. Surprisingly, the internal/cpu package already has all the necessary functions to detect flavors of SIMD. See an example of use in the bytes package.

And that's it! Blake2's hashBlocks() function then dynamically decides which function to use at runtime:

func hashBlocks(h *[8]uint64, c *[2]uint64, flag uint64, blocks []byte) {
    if useAVX2 {
        hashBlocksAVX2(h, c, flag, blocks)
    } else if useAVX {
        hashBlocksAVX(h, c, flag, blocks)
    } else if useSSE4 {
        hashBlocksSSE4(h, c, flag, blocks)
    } else {
        hashBlocksGeneric(h, c, flag, blocks)
    }
}

Because Go does not have intrisic functions for SIMD, these are implemented directly in assembly. You can look at the code in the relevant _amd64.s file. Now it's kind of tricky because Go has invented its own assembly language (based on Plan9) and you have to find out things the hard way. Instructions like VINSERTI128 and VPSHUFD are the SIMD instructions. MMX registers are M0...M7, SSE registers are X0...X15, AVX registers are Y0, ..., Y15. MOVDQA is called MOVO (or MOVOA) and MOVDQU is called MOVOU. Things like that.

As for AVX-512, Go probably still doesn't have instructions for that. So you'll need to write the raw opcodes yourself using BYTE (like here) and as explained here.

1 comment

SIMD instructions in crypto

posted June 2017

The Keccak Code Package repository contains all of the Keccak team's constructions, including for example SHA-3, SHAKE, cSHAKE, ParallelHash, TupleHash, KMAC, Keyak, Ketje and KangarooTwelve. ParallelHash and KangarooTwelve are two hash functions based on the same basis of SHA-3, but that can be sped up with parallelization. This makes these two hash functions really interesting, especially when hashing big files.

MMX, SSE, SSE2, AVX, AVX2, AVX-512

To support parallelization, a common way is to use SIMD instructions, a set of instructions generally available on any modern 64-bit architecture that allows computation on large blocks of data (64, 128, 256 or 512 bits). Using them to operate in blocks of data is what we often call vector/array programming, the compiler will sometimes optimize your code by automatically using these large SIMD registers.

SIMD instructions have been here since the 70s, and have become really common. This is one of the reason why image, sound, video and games all work so well nowadays. Generally, if you're on a 64-bit architecture your CPU will support SIMD instructions.

There are several versions of these instructions. On Intel's side these are called MMX, SSE and AVX instructions. AMD has SSE and AVX instructions as well. On ARM these are called NEON instructions.

MMX allows you to operate on 64-bit registers at once (called MM registers). SSE, SSE2, SSE3 and SSE4 all allow you to use 128-bit registers (XMM registers). AVX and AVX2 introduced 256-bit registers (YMM registers) and the more recent AVX-512 supports 512-bit registers (ZMM registers).

How To Compile?

OK, looking back at the Keccak Code Package, I need to choose what architecture to compile my Keccak code with to take advantage of the parallelization. I have a macbook pro, but have no idea what kind version of SSE or AVX my CPU model supports. One way to find out is to use www.everymac.com → I have an Intel CPU Broadwell which seems to support AVX2!

Looking at the list of architectures supported by the Keccak Code Package I see Haswell, which is of the same family and supports AVX2 as well. Compiling with it, I can run my KangarooTwelve code with AVX2 support, which parallelizes four runs of the Keccak permutation at the same time using these 256-bit registers!

In more details, the Keccak permutation goes through several rounds (12 for KangarooTwelve, 24 for ParallelHash) that need to serially operate on a succession of 64-bit lanes. AVX (no need for AVX2) 256-bit's registers allow four 64-bit lanes to be operated on at the same time. That's effectively four Keccak permutations running in parallel.

Intrisic Instructions

Intrisic functions are functions you can use directly in code, and that are later recognized and handled by the compiler.

Intel has an awesome guide on these here. You just need to find out which function to use, which is pretty straight forward looking at the documentation.

doc

In C, if you're compiling with GCC on an Intel/AMD architecture you can start using intrisic functions for SIMD by including x86intrin.h. Or you can use this script to include the correct file for different combination of compilers and architectures:

#if defined(_MSC_VER)
     /* Microsoft C/C++-compatible compiler */
     #include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
     /* GCC-compatible compiler, targeting x86/x86-64 */
     #include <x86intrin.h>
#elif defined(__GNUC__) && defined(__ARM_NEON__)
     /* GCC-compatible compiler, targeting ARM with NEON */
     #include <arm_neon.h>
#elif defined(__GNUC__) && defined(__IWMMXT__)
     /* GCC-compatible compiler, targeting ARM with WMMX */
     #include <mmintrin.h>
#elif (defined(__GNUC__) || defined(__xlC__)) && (defined(__VEC__) || defined(__ALTIVEC__))
     /* XLC or GCC-compatible compiler, targeting PowerPC with VMX/VSX */
     #include <altivec.h>
#elif defined(__GNUC__) && defined(__SPE__)
     /* GCC-compatible compiler, targeting PowerPC with SPE */
     #include <spe.h>
#endif

If we look at the reference implementation of KangarooTwelve in C we can see how they decided to use the AVX2 instructions. They first define a __m256i variable which will hold 4 lanes at the same time.

typedef __m256i V256;

They then declare a bunch of them. Some of them will be used as temporary registers.

They then use unrolling to write the 12 rounds of Keccak. Which are defined via relevant AVX2 instructions:

    #define ANDnu256(a, b)          _mm256_andnot_si256(a, b)
    #define CONST256(a)             _mm256_load_si256((const V256 *)&(a))
    #define CONST256_64(a)          (V256)_mm256_broadcast_sd((const double*)(&a))
    #define LOAD256(a)              _mm256_load_si256((const V256 *)&(a))
    #define LOAD256u(a)             _mm256_loadu_si256((const V256 *)&(a))
    #define LOAD4_64(a, b, c, d)    _mm256_set_epi64x((UINT64)(a), (UINT64)(b), (UINT64)(c), (UINT64)(d))
    #define ROL64in256(d, a, o)     d = _mm256_or_si256(_mm256_slli_epi64(a, o), _mm256_srli_epi64(a, 64-(o)))
    #define ROL64in256_8(d, a)      d = _mm256_shuffle_epi8(a, CONST256(rho8))
    #define ROL64in256_56(d, a)     d = _mm256_shuffle_epi8(a, CONST256(rho56))
    #define STORE256(a, b)          _mm256_store_si256((V256 *)&(a), b)
    #define STORE256u(a, b)         _mm256_storeu_si256((V256 *)&(a), b)
    #define STORE2_128(ah, al, v)   _mm256_storeu2_m128d((V128*)&(ah), (V128*)&(al), v)
    #define XOR256(a, b)            _mm256_xor_si256(a, b)
    #define XOReq256(a, b)          a = _mm256_xor_si256(a, b)
    #define UNPACKL( a, b )         _mm256_unpacklo_epi64((a), (b))
    #define UNPACKH( a, b )         _mm256_unpackhi_epi64((a), (b))
    #define PERM128( a, b, c )      (V256)_mm256_permute2f128_ps((__m256)(a), (__m256)(b), c)
    #define SHUFFLE64( a, b, c )    (V256)_mm256_shuffle_pd((__m256d)(a), (__m256d)(b), c)

And if you're wondering how each of these _mm256 function is used, you can check the same Intel documentation

avx shuffle

Voila!

comment on this story

A New Public-Key Cryptosystem via Mersenne Numbers

posted June 2017

A new paper made its way to eprint the other day.

paper

a lot of keywords here are really interesting. But first, what is a Mersenne prime?

A mersenne prime is simply a prime \(p\) such that \(p=2^n - 1\). The nice thing about that, is that the programming way of writing such a number is

(1 << n) - 1

which is a long series of 1s.

mersenne

A number modulo this prime can be any bitstring of the mersenne prime's length.

OK we know what's a Mersenne prime. How do we build our new public key cryptosystem now?

Let's start with a private key: (secret, privkey), two bitstrings of low hamming weight. Meaning that they do not have a lot of bits set to 1.

privkey

Now something very intuitive happens: the inverse of such a bitstring will probably have a high hamming weight, which let us believe that \(secret \cdot privkey^{-1} \pmod{p}\) looks random. This will be our public key.

Now that we have a private key and a public key. How do we encrypt ?

The paper starts with a very simple scheme on how to encrypt a bit \(b\).

\[ciphertext = (-1)^b \cdot ( A \cdot pubkey + B ) \pmod{p} \]

with \(A\) and \(B\) two public numbers that have low hamming weights as well.

We can see intuitively that the ciphertext will have a high hamming weight (and thus might look random).

If you are not convinced, all of this is based on actual proofs that such operations between low and high hamming weight bitstrings will yield other low or high hamming weight bitstrings. All of this really work because we are modulo a \(1111\cdots\) kind of number. The following lemmas taken from the paper are proven in section 2.1.

lemma

How do you decrypt such an encrypted bit?

This is how:

\[ciphertext \cdot privkey \pmod{p}\]

This will yield either a low hamming weight number → the original bit \(b\) was a \(0\),
or a high hamming weight number → the original bit \(b\) was a \(1\).

You can convince yourself by following the equation:

decryption

And again, intuitively you can see that everything is low hamming weight except for the value of \((-1)^b\).

value

This scheme doesn't look CCA secure nor practical. The paper goes on with an explanation of a more involved cryptosystem in section 6.

EDIT: there is already a reduction of the security estimates published on eprint.

comment on this story

Is Symmetric Security Solved?

posted June 2017

Recently T. Duong, D. Bleichenbacher, Q. Nguyen and B. Przydatek released a crypto library intitled Tink. At its center lied an implementation of AES-GCM somehow different from the rest: it did not take a nonce as one of its input.

A few days ago, at the Crypto SummerSchool in Croatia, Nik Kinkel told me that he would generally recommend against letting developers tweak the nonce value, based on how AES-GCM tended to be heavily mis-used in the wild. For a recap, if a nonce is used twice to encrypt two different messages AES-GCM will leak the authentication key.

I think it's a fair improvement of AES-GCM to remove the nonce argument. By doing so, nonces have to be randomly generated. Now the new danger is that the same nonce is randomly generated twice for the same key. The birthday bound tells us that after \(2^{n/2}\) messages, \(n\) being the bit-size of a nonce, you have great odds of generating a previous nonce.

The optimal rekey point has been studied by Abdalla and Bellare and can computed with the cubic root of the nonce space. If more nonces are generated after that, chances of a nonce collision are too high. For AES-GCM this means that after \(2^{92/3} = 1704458900\) different messages, the key should be rotated.

This is of course assuming that you use 92-bit nonces with 32-bit counters. Some protocol and implementations will actually fix the first 32 bits of these 92-bit nonces reducing the birthday bound even further.

Isn't that a bit low?

Yes it kinda is. An interesting construction by Dan J. Bernstein called XSalsa20 (and that can be extended to XChacha20) allow us to use nonces of 192 bits. This would mean that you should be able to use the same key for up to \(2^{192/3} = 18446744073709551616\) messages. Which is already twice more that what a BIG INT can store in a database

It seems like Sponge-based AEADs should benefit from large nonces as well since their rate can store even more bits. This might be a turning point for these constructions in the last round of the CAESAR competition. There are currently 4 of these: Ascon, Ketje, Keyak and NORX.

With that in mind, is nonce mis-use resistance now fixed?

EDIT: Here is a list of recent papers on the subject:

comment on this story

Implementation of Kangaroo Twelve in Go

posted June 2017

I've released an implementation of KangarooTwelve in Go

It is heavily based on the official Go's x/crypto/sha3 library. But because of minor implementation details the relevant files have been copied and modified there so you do not need Go's SHA-3 implementation to run this package. Hopefully one day Go's SHA-3 library will be more flexible to allow other keccak construction to rely on it.

I have tested this implementation with different test vectors and it works fine. Note that it has not received proper peer review. If you look at the code and find issues (or not) please let me know!

See here why you should use KangarooTwelve instead of SHA-3. But see here first why you should still not skip SHA-3.

This implementation does not yet make use of SIMD to parallelize the implementation. But we can already see improvements due to the smaller number of rounds:

100 bytes1000 bytes10,000 bytes
K12761 ns/op1875 ns/op15399 ns/op
SHA3854 ns/op3962 ns/op34293 ns/op
SHAKE128668 ns/op2853 ns/op29661 ns/op

This was done with a very simple bench script on my 2 year-old macbook pro.

comment on this story

Maybe you shouldn't skip SHA-3

posted June 2017

Adam Langley from Google wrote a blogpost yesterday called Maybe Skip SHA-3. It started some discussions both on HN and on r/crypto

Speed drives adoption, Daniel J. Bernstein (djb) probably understand this more than anyone else. And this is what led Adam Langley to decide to either stay on SHA-2 or move to BLAKE2. Is that a good advice? Should we all follow his steps?

I think it is important to remember that Google, as well as the other big players, have an agenda for speed. I'm not saying that they do not care about security, it would be stupid for me to say that, especially seeing all of the improvements we've had in the field thanks to them in the last few years (Project Zero, Android, Chrome, Wycheproof, Tink, BoringSSL, email encryption, Key Transparency, Certificate Transparency, ...)

What I'm saying is that although they care deeply about security, they are also looking for compromises to gain speed. This can be seen with the push for 0-RTT in TLS, but I believe we're seeing it here as well with a push for either KangarooTwelve (K12) or BLAKE2 instead of SHA-3/SHAKE and BLAKE (which are more secure versions of K12 and BLAKE2).

Adam Langley even went as far as to recommend folks to stay on SHA-2.

But how can we keep advising SHA-2 when we know it can be badly misused? Yes I'm talking about length extension attacks. These attacks that prevent you from using SHA-2(key|data) to create a Message Authentication Code (MAC).

We recently cared so much about misuses of AES-GCM (documented misuses!) that Adam Langley's previous blog post to the SHA-3 one is about AES-GCM-SIV. We cared so much about simple APIs, that recently Tink simply removed the nonce argument from AES-GCM's API.

If we cared so much about documented misusage of crypto APIs, how can we not care about this undocumented misuse of SHA-2? Intuitively, if SHA-2 was to behave like a random oracle there would be no problem at all with the SHA-2(key|data) construction. Actually, none of the more secure hash functions like Blake and SHA-3 have this problem.

If we cared so much about simple APIs, why would we advise people to "fix" SHA-2 by truncating 256 bits of SHA-512's output (SHA-512/256)?

The reality is that you should use SHA-3. I'm making this as a broad recommendation for people who do not know much about cryptography. You can't go wrong with the NIST's standard.

Now let's see how we can dilute a good advice.

If you care about speed you should use SHAKE or BLAKE.

If you really care about speed you should use KangarooTwelve or BLAKE2.

If you really really care about speed you should use SHA-512/256. (edit: people are pointing to me that Blake2 is also faster than that)

If you really really really care about speed you should use CRC32 (don't, it's a joke).

How big of a security compromise are you willing to make? We know the big players have decided, but have they decided for you?

Is SHA-3 that slow?

Where is a hash function usually used? One major use is in signing messages. You hash the message before signing it because you can't sign something that is too big.

Here is a number taken from djb's benchmarks on how many cycles it takes to sign with ed25519: 187746

Here is a number taken from djb's benchmarks on how many cycles it takes to hash a byte with keccak512: 15

Keccak has a page with more numbers for software performance

Spoiler Alert: you probably don't care about a few cycles.

Not to say there are no cases where this is relevant, but if you're doing a huge amount of hashing and you're hashing big messages, you should rather switch to a tree-hashing mode. KangarooTwelve, BLAKE2bp and BLAKE2sp all support that.

EDIT: The Keccak team just released a response as well.

8 comments

Loop unrolling

posted May 2017

Reading cryptographic implementations, you can quickly spot a lot of weirdness and oddities. Unfortunately, this is because very few cryptographic implementations aim for readability (hence the existence of readable implementations). But on the other hand, there are a lot of cool techniques in there which are interesting to learn about.

Bit-slicing is one I've never really talked about, so here is a self-explanatory tweet quoting Thomas Pornin:

But I'm here to talk about loop unrolling. Here's what wikipedia has to say about it:

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as the space-time tradeoff.

Want to see an example? Here's how Keccak was implemented in go

While the θ step could have easily fit in a loop:

for x := 0; x < 5; x++ {
    B[x] = a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4]
}

It was unrolled to allow for faster execution, getting rid of loop registers and checks for the end of the loop at every iteration:

// Round 1
bc0 = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20]
bc1 = a[1] ^ a[6] ^ a[11] ^ a[16] ^ a[21]
bc2 = a[2] ^ a[7] ^ a[12] ^ a[17] ^ a[22]
bc3 = a[3] ^ a[8] ^ a[13] ^ a[18] ^ a[23]
bc4 = a[4] ^ a[9] ^ a[14] ^ a[19] ^ a[24]
d0 = bc4 ^ (bc1<<1 | bc1>>63)

Here is an implementation of MD5 in go. You can see that the gen.go file is generating a different md5block.go file which will be the unrolled code. Well yeah, sometimes it's just easier than doing it by hand :]

Here is an implementation of SHA-1 in assembly.

It's assembly, I'm not going to try to explain that.

Here's another example with Keccak in C. The unrolling is done in the preprocessing phase via C macros like this one:

#define REPEAT5(e) e e e e e

That's it! If you still haven't got it, you might have the qualities to run for president.

If you know more tricks let me know!

comment on this story

StrobeGo

posted May 2017

I wrote an opinionated and readable implementation of the Strobe Protocol Framework in Go here.

It hasn't been thoroughly tested, and its main goal was to help me understand the specification of Strobe while kicking around ideas to challenge the spec.

It started a discussion on the mailing list if you're interested.

I hope that this implementation can help someone else understand the paper/spec better. The plan is to have a usable implementation by the time Strobe's specification is more stable (not that it isn't, its version is actually 1.0.2 at the moment).

Following is a list of divergence from the current specification:

Operate

The main Operate public function that allows you to interact with STROBE as a developer has a different API. In pseudo code:

Operate(meta bool, operation string, data []byte, length int, more bool) → (data []byte)

I've also written high level functions to call the different known operations like KEY, PRF, send_ENC, etc... One example:

// `meta` is appropriate for checking the integrity of framing data. 
func (s *strobe) send_MAC(meta bool, output_length int) []byte {
  return s.Operate(meta, "send_MAC", []byte{}, output_length, false)
}

The meta boolean variable is used to indicate if the operation is a META operation.

// operation is meta?
if meta {
    flags.add("M")
}

What is a META operation? From the specification:

For any operation, there is a corresponding "meta" variant. The meta operation works exactly the same way as the ordinary operation. The two are distinguished only in an "M" bit that is hashed into the protocol transcript for the meta operations. This is used to prevent ambiguity in protocol transcripts. This specification describes uses for certain meta operations. The others are still legal, but their use is not recommended. The usage of the meta flag is described below in Section 6.3.

Each operation that has a relevant META variant, also contains explanation on it. For example the ENC section:

send_meta_ENC and recv_meta_ENC are used for encrypted framing data.

The operations are called via their strings equivalent instead of their flag value in byte. This could in theory be irrelevant if only the high level functions were made available to the developers.

A length argument is also available for operations like PRF, send_MAC and RATCHET that require a length. A non-zero length will create a zero buffer of the required length, which is not optimal for the RATCHET operation.

// does the operation requires a length?
if operation == "PRF" || operation == "send_MAC" || operation == "RATCHET" {
    if length == 0 {
      panic("A length should be set for this operation.")
    }

    // create an empty data slice of the relevant size
    data = bytes.Repeat([]byte{0}, length)

} else {
    if length != 0 {
      panic("Output length must be zero except for PRF, send_MAC and RATCHET operations.")
    }

    // copy the data not to modify the original data
    data = make([]byte, len(data_))
    copy(data, data_)
}

Verbose

One of my grip with the Strobe's specification is that nothing is truly explained through words: the reader is expected to follow reference snippets of code written in Python. Everything has been optimized and it is quite hard to understand what is really happening (hence why I wrote this implementation in the first place). To remediate this, I wrote a clear path of what is happening for each operation. It is great for other readers who want to understand what is happening, but obviously not the optimal thing to do for an efficient implementation :)

It looks like a bunch of if and else if (Go does not have support for switch):

if operation == "AD" { // A
    // nothing happens
} else if operation == "KEY" { // AC
    cbefore = true
    forceF = true
} else if operation == "PRF" { // IAC
    cbefore = true
    forceF = true
    ...

_Duplex

I have omited any function name starting with _ as Go has a different way of making functions available out of a package: make the first letter upper case.

The _duplex function was written after Golang's official SHA-3 implementation and is not optimal. The reason is that SHA-3 does not need to XOR anything in the state until after either the end of the rate (block size) or the end of the input has been reached. Because of this, Go's SHA-3 implementation waits for these to happen while STROBE often cannot: STROBE often needs the result right away. This led to a few oddities.

pos

I do not use the pos variable as it is easily obtainable from the buffered state. This is another (good) consequence of Go's decision on waiting before running the Keccak function. Some other implementations might want to keep track of it.

cur_flags

The specification states that cur_flags should be initialized to None. Which is annoying to program (see how I did it for I0). This is why I took the decision that a flag cannot be zero, and I'm initializing cur_flags to zero.

I'm later asserting that the flag is valid, according to a list of valid flags

// operation is valid?
var flags flag
var ok bool
if flags, ok = OperationMap[operation]; !ok {
    panic("not a valid operation")
}

R is always R

The specification states that the initialization should be done with R = R + 2. This is to have a match with cSHAKE's padding when using some STROBE instances like Strobe-128/1600 and Strobe-256/1600.

I like having constants and not modifying them. So I initialize with a smaller rate (block_size) and do not conform with cSHAKE's specification. Is it important? I think yes. It does not make sense to use cSHAKE's padding on Strobe-128/800, Strobe-256/800 and Strobe-128/400 as their rate is smaller than cSHAKE's defined rate in its padding.

EDIT: I ended up reverting this change. As I only implemented Strobe-256/1600, it is easy to conform to cSHAKE by just initializing with R + 2 which will not modify the constant R.

3 comments

Crypto Blogging Award 2016

posted May 2017

I spent a lot of time reading this blogpost today and thought to myself: this is a great blog post. If there was a blogging award for security/crypto blogposts this would probably be the front runner for 2017.

But I can't really blog about it, because we're still waiting for a lot more good blog posts to come this year.

So what I decided to do instead, is to go through all the blog posts that I liked last year (it's easy, I have a list here) and find out which ones stood out from the rest.

please do not get offended if yours is not in there, I might have just missed it!

How does a blog post make it to the list? It has to be:

  • Interesting. I need to learn something out of it, whatever the topic is. If it's only about results I'm generally not interested.
  • Pedagogical. Don't dump your unfiltered knowledge on me, I'm dumb. Help me with diagrams and explain it to me like I'm 5.
  • Well written. I can't read boring. Bonus point if it's funny :)

So here it is, a list of my favorite Crypto-y/Security blogposts for the year of 2016:

Did I miss one? Make me know in the comments :]

comment on this story

Tamarin looking for sources

posted May 2017

Tamarin will sometimes be unable to find the real source of a premise ([premise]-->[conclusion]) and will end up spinning until the scary words "partial deconstructions" will be flashed at the screen.

loops

Raw sources is Tamarin trying to find the source of each of your rules. That's a bunch of pre-computation to help him prove things afterwards, because Tamarin works like that under the hood (or so I've heard): it searches for things backward.

(This might be a good reason to start designing your protocol from the end!)

Partial deconstructions can happen when for example you happen to have used a linear fact both in the premise and the conclusion of a rule:

rule creation:
  [Fr(~a)]-->[Something(~a)]

rule loop:
  [Something(a)]-->[Something(a), Out(a)]

This is bad, because when looking for the source of the loop rule, Tamarin will find the loop rule as a possible origin. This over and over and over.

Source lemmas are special lemmas that can be used during this pre-computation to get rid of such loops. They basically prove that for such loop rules, there is a beginning: a creation rule. Source lemmas are automatically proven using induction which is the relevant strategy to use when trying to prove such things.

Refined sources are the raw sources helped by the source lemmas. If your wrote the right source lemmas, you will get rid of the partial deconstructions in the refined sources and Tamarin will use the refined sources for its proofs instead of the raw sources.

Persistent Facts is the answer. Anytime we have a linear fact persisting across a rule, there is little reason (I believe) not to use a persistent fact (which also allows to model replays in the protocol). We can now rewrite our previous rules with a bang:

rule creation:
  [Fr(~a)]-->[!Something(~a)]

rule loop:
  [!Something(a)]-->[Out(a)]

This way Tamarin will always correlate this Fact to its very creation.

comment on this story

How to obtain a nice trace in Tamarin

posted May 2017

ugly trace

This is not a nice trace

Learning about the inners of Tamarin, I've came up to the point where I wanted to get a nice clean trace to convince myself that the protocol I wrote made sense.

The way to check for proper functioning is to create exist-trace lemmas.

Unfortunately, using Tamarin to prove such lemma will not always yield the nicest trace.

Here are some tricks I've used to obtain a simple trace. Note that these matter less when you're writing lemmas to "attack" the protocol (as you really don't want to impose too many restrictions that might not hold in practice).

1. Use a fresh "session id". A random session id can be added to different states and messages to identify a trace from another one.

rule client_start_handshake:
    [!Init($client, ~privkey, pubkey), !Init($server, ~priv, pub), Fr(~sid)]
    --[Start(~sid, $client, $server)]->
    [Out(<'1', ~sid, $client, $server, pubkey>)]

2. Use temporals to make sure that everything happens in order.

Ex sid, client, server, sessionkey #i #j.
      Start(sid, client, server) @ #i & 
      Session_Success(sid, 'server', server, client, sessionkey) @ #j &
    #i < #j

3. Make sure your identities are different. Start by defining your lemma by making sure a server cannot communicate with itself.

Ex sid, client, server, sessionkey #i #j.
      not(server = client) & ...

4. Use constraints to force peers to have a unique key per identity. Without that, you cannot force Tamarin to believe that one identity could create multiple keys. Of course you might want your model to allow many-keys per identity, but we're talking about a nice trace here :)

restriction one_key_per_identity: "All X key key2 #i #j. Gen(X, key) @ #i & Gen(X, key2) @ #j ==> #i = #j"

rule generate_identity_and_key:
    let pub = 'g'^~priv
    in
    [ Fr(~priv) ]--[Gen($A, pub)]->[ !Init($A, ~priv, pub) ]

5. Disable the adversary. This might be the most important rule here, and it ended up being the breakthrough that allowed me to find a nice trace. Unfortunately you cannot disable the adversary, but you can force Ins and Outs to happen via facts (which are not seen by the adversary, and remove any possible interaction from the adversary). For this you can replace every In and Out with InSecure and OutSecure and have these rules:

#ifdef not_debug
[OutSecure(a)] -->[Out(a)]
[In(a)] -->[InSecure(a)]
#endif

#ifdef debug
[OutSecure(a)]-->[InSecure(a)]
#endif

You can now call tamarin like this:

tamarin interactive . -Dnot_debug

And we now have a nice trace!

nice trace

Our protocol is an opportunistic Diffie-Hellman. You can see the two generate_identity_and_key rules at the top that creates the two different identities: a client and a server.

The second rule client_start_handshake is used to start the handshake. The client sends the first message 1 with a fresh session id ~sid.

The third rule server_handshake propagates the received session id to the next message and to its action fact (so that the lemma can keep the session id consistent).

The protocol ends in the last rule client_handshake_finish being used.

Voila!

PS: thanks to Katriel Cohn-Gordon for answering ten thousands of my questions about Tamarin.

comment on this story

Bug in KangarooTwelve's reference implementation

posted May 2017

When playing around with KangarooTwelve I noticed that the python implementation's right_encode() function was not consistent with the C version:

python implementation of right_encode

λ python right_encode.py            
right_encode(0): [0]
right_encode(1): [1][3]
right_encode(2): [2][3]
right_encode(255): [255][5]
right_encode(256): [1][0][6]
right_encode(257): [1][1][6]

C implementation of right_encode:

λ ./right_encode        
right_encode(0): 00
right_encode(1): 0101
right_encode(2): 0201
right_encode(255): ff01
right_encode(256): 010002
right_encode(257): 010102

The problem was that the python implementation used the function bytes() to create byte strings, which has breaking behaviors between python2 and python3.

python2:

>>> bytes([1])
'[1]'

python3:

>>> bytes([1])
b'\x01'

The fix was to use bytearray() instead which is consistent across the two versions of Python.

comment on this story

KangarooTwelve

posted May 2017

kangaroo twelve

In late 2008, 64 candidates took part in NIST's hashing competition, fighting to replace the SHA-2 family of hash functions.

Keccak won the SHA-3 competition, and became the FIPS 202 standard on August 5, 2015.

All the other contenders lost. The final round included Skein, JH, Grøstl and BLAKE.

Interestingly in 2012, right in the middle of the competition, BLAKE2 was released. Being too late to take part in the fight to become SHA-3, it just stood on its own awkwardly.

Fast forward to this day, you can now read an impressive list of applications, libraries and standards making use of BLAKE2. Argon2 the winner of the Password Hashing Competition, the Noise protocol framework, libsodium, ...

On their page you can read:

BLAKE2 is a cryptographic hash function faster than MD5, SHA-1, SHA-2, and SHA-3, yet is at least as secure as the latest standard SHA-3. BLAKE2 has been adopted by many projects due to its high speed, security, and simplicity.

And it is indeed for the speed that BLAKE2 is praised. But what exactly is BLAKE2?

BLAKE2 relies on (essentially) the same core algorithm as BLAKE, which has been intensively analyzed since 2008 within the SHA-3 competition, and which was one of the 5 finalists. NIST's final report writes that BLAKE has a "very large security margin", and that the the cryptanalysis performed on it has "a great deal of depth". The best academic attack on BLAKE (and BLAKE2) works on a reduced version with 2.5 rounds, whereas BLAKE2b does 12 rounds, and BLAKE2s does 10 rounds. But even this attack is not practical: it only shows for example that with 2.5 rounds, the preimage security of BLAKE2b is downgraded from 512 bits to 481 bits, or that the collision security of BLAKE2s is downgraded from 128 bits to 112 bits (which is similar to the security of 2048-bit RSA).

And that's where BLAKE2 shines. It's a modern and secure hash function that performs better than SHA-3 in software. SHA-3 is fast in hardware, this is true, but Intel does not have a good track record of quickly implementing cryptography in hardware. It is only in 2013 that SHA-2 joined Intel's set of instructions, along with SHA-1...

What is being done about it?

Some people are recommending SHAKE. Here is what the Go SHA-3 library says on the subject:

If you aren't sure what function you need, use SHAKE256 with at least 64 bytes of output. The SHAKE instances are faster than the SHA3 instances; the latter have to allocate memory to conform to the hash.Hash interface.

But why is SHA-3 slow in software? How can SHAKE be faster?

This is because SHA-3's parameters are weirdly big. It is hard to know exactly why. The NIST provided some vague explanations. Other reasons might include:

  • Not to be less secure than SHA-2. SHA-2 had 256-bit security against pre-image attacks so SHA-3 had to provide the same (or at least a minimum of) 256-bit security against these attacks. (Thanks to Krisztián Pintér for pointing this out.)
  • To be more secure than SHA-2. Since SHA-2 revealed itself to be more secure than we previously thought, there was no reason to develop a similar replacement.
  • To give some credits to the NIST. In view of people's concerns with NIST's ties with the NSA, they could not afford to act lightly around security parameters.

sponge

The following table shows the parameters of the SHAKE and SHA-3 constructions. To understand them, you need to understand two of the parameters:

  • The rate (r): it is the block size of the hash function. The smaller, the more computation will have to be done to process the whole input (more calls to the permutation (f)).
  • The capacity (c): it is the secret part of the hash function. You never reveal this! The bigger the better, but the bigger the smaller is your rate :)
algorithmratecapacitymode (d)output length (bits)security level (bits)
SHAKE12813442560x1Funlimited128
SHAKE25610885120x1Funlimited256
SHA3-22411524480x06224112
SHA3-25610885120x06256128
SHA3-384832 7680x06384192
SHA3-512576 10240x06512256

The problem here is that SHA-3's capacities is way bigger than it has to be.

512 bits of capacity (secret) for SHA3-256 provides the square root of security against (second pre-image attacks (that is 256-bit security, assuming that the output used is long enough). SHA3-256's output is 256 bits in length, which gives us the square root of security (128-bit) against collision attacks (birthday bound). We end up with the same theoretical security properties of SHA-256, which is questionably large for most applications.

Keccak even has a page to tune its security requirements which doesn't advise such capacities.

tune keccak

Is this really the end?

Of course not. In the end, SHA-3 is a big name, people will use it because they hear about it being the standard. It's a good thing! SHA-3 is a good hash function. People who know what they are doing, and who care about speed, will use BLAKE2 which is equally as good of a choice.

But these are not the last words of the Keccak's team, and not the last one of this blog post either. Keccak is indeed a surprisingly well thought construction relying on a single permutation. Used in the right way it HAS to be fast.

Mid-2016, the Keccak team came back with a new algorithm called KangarooTwelve. And you might have guessed, it is to SHA-3 what BLAKE2 is to BLAKE.

It came with 12 rounds instead of 24 (which is still a huge number compared to current cryptanalysis) and some sane parameters: a capacity of 256 bits like the SHAKE functions.

Tree hashing, a feature allowing you to parallelize hashing of large files, was also added to the mix! Although sc00bz says that they do not provide a "tree hashing" construction but rather a" leaves stapled to a pole" construction.

leaves stapled to a pole construction

credit: sc00bz

Alright, how does KangarooTwelve work?

First your input is mixed with a "customization string" (which allows you to get different results in different contexts) and the encoding of the length of the customization string (done with right_encode())

input kangarootwelve

Second if the message is "short" (≤8192bytes), the tree hashing is not used. The previous blob is padded and permuted with the sponge. Here the sponge call is Keccak with its default multi-rate padding 101*. The output can be as long as you want like the SHAKE functions.

short message kangarootwelve

Third if the message is "long", it is divided into "chunks" of 8192 bytes. Each chunk (except the first one) is padded and permuted (so effectively hashed). The end result is structured/concatenated/padded and passed through the Sponge again.

long message kangarootwelve

This allows each different chunk to be hashed in parallel on different cores or via some SIMD magic.

Before you go, here is a simple python implementation and the reference C implementation.

Daemen gave some performance estimations for KangarooTwelve during the rump session of Crypto 2016:

  • Intel Core i5-4570 (Haswell) 4.15 c/b for short input and 1.44 c/b for long input
  • Intel Core i5-6500 (Skylake) 3.72 c/b for short input and 1.22 c/b for long input
  • Intel Xeon Phi 7250 (Knights Landing) 4.56 c/b for short input and 0.74 c/b for long input

Thank-you for reading!

comment on this story