up previous
Next: Conclusions Up: Mercy: a fast large Previous: Description of Mercy


Design of Mercy

Existing approaches to large block ciphers use a few strong rounds based on existing cryptographic primitives. These ciphers cannot achieve speeds better than half that of the fastest block ciphers [2] or a third of the fastest stream ciphers [1]. Current block cipher speeds don't approach the needs of the design goals even before the extra penalties for doubling up, while those solutions based on stream ciphers pay a heavy penalty in key scheduling overhead that puts their speeds well below those needed.

Mercy addresses this by using more rounds of a weaker function. This makes more efficient use of the work done in earlier rounds to introduce confusion in later rounds, and is closer to a traditional block cipher approach drawn across a larger block. It also draws more state between disparate parts of the block, protecting against the certificational weaknesses identified in Section 2.

Balanced Feistel network

Balanced Feistel networks are certainly the best studied frameworks from which to build a block cipher, although I know of no prior work applying them to such large block sizes. They allow the design of the non-linear transformations to disregard efficiency of reversal and provide a familiar framework by which to analyse mixing. Balanced networks seem better suited to large block ciphers than unbalanced networks, since an unbalanced network is likely to have to do work proportional to the larger of the input and output data width.

Feistel ciphers normally omit the final swap, so that decryption has the same structure as encryption. However, Mercy implementations will typically encrypt blocks in-place, and the cost of having an odd number of swaps (forcing a real swap) would be high, so the last swap is not omitted.

Mercy's round function, while weaker than those used in [1], is considerably stronger than that of traditional Feistel ciphers, necessitating many fewer rounds. The larger block size allows more absolute work to be done in each round, while keeping the work per byte small.

Key schedule and S-boxes

The function $ N $ used in building the $ T $ box is that used for nonlinear subsitution in [7]; it is bijective and has known good properties against linear and differential cryptanalysis, such as low LCmax and DCmax in the terminology of [13]. We use this function to build known good key-dependent bijective substitutions using an extension of the technique outlined in [4]; however, rather than a simple XOR, the $ d $ mappings before and after $ N $ are drawn at random from the entire space of bijective affine functions on $ GF(2)^{8} $, of which there are approximately $ 2^{70.2} $ , by determining first the constant term $ d[0] $ and then drawing candidate basis values from the CPRNG to find a linearly independent set. Because the $ d $ functions are affine, the LCmax and DCmax of the composite function $ d_1 \circ N \circ d_0 $ (and siblings) will be the same as those of $ N $ itself. The composite functions will also be bijective since each of the components are, and hence all the bytes in each column of the $ T $ table will be distinct.

However, there are fewer possible composite functions than there are pairs $ d_0, d_1 $. In fact for each possible composite function, there are $ 255 \times 8 = 2040 $ pairs $ d_0, d_1 $ which generate it. This follows from the following two properties of $ N $:

$\displaystyle \forall a \in GF(2^8) - \{0\}: \forall x \in GF(2^8): a N( ax )$ $\displaystyle =$ $\displaystyle N(x)$  
$\displaystyle \forall b \in 0 \ldots 7: \forall x \in GF(2^8): N( x^{2^j} )$ $\displaystyle =$ $\displaystyle N(x)^{2^j}$  

Since both multiplication and squaring in $ GF(2^8) $ are linear (and hence affine) in $ GF(2)^8 $ (squaring because in a field of characteristic 2, $ (x + y)^2 = x^2 + xy + yx + y^2 = x^2 + y^2 $), both of these properties provide independent ways of mapping from any pair $ d_0, d_1 $ to pairs which will generate the same composite function, distinct in every case except $ (a, b) = (1, 0) $. Taking this into account, the number of distinct composite functions possible is approximately $ 2^{129.4} $, and there are $ 2^{4613.7} $functionally distinct keys in total (considering $ W_0,W_1 $ as well as $ T $).

Little attention has been paid to either the time or memory requirements of the key schedule, since key scheduling will be very infrequent and typically carried out in user space.

Operation M

As with [21], this operation is bijective, since from the least significant byte of the output the input to the $ T $ box can be inferred. We differ from [21] in using the most significant byte for the lookup rather than the least; this slightly improves the mixing gained from addition and seems to be very slightly faster on the author's machine.

Q state machine

A key-dependent state machine is an efficient way to bring about strong dependencies between widely separated parts of the block; since the state machine is reversible, changing a given input word guarantees that every subsequent state will be different and makes it slightly more likely than chance that every output word will be different.

The basis of the state machine is 32 parallel four-stage LFSRs based on the polynomial $ x^{4}+x^{3}+1 $. The input to each LFSR is provided by a nonlinear mixing chain based on carry bits from other LFSRs and taps into the state which are then fed into Operation M after addition-based mixing with the input. The use of an LFSR ensures that any pair of distinct inputs of the same length which leave the LFSR in the same state must be at least 5 words long.

Every $ T $ box lookup depends on the previous lookup, even across rounds. This goes against the design principles outlined in [18,5] which suggest that ciphers should be designed to make best use of the parallelism that modern processors can achieve, and to be wary of the memory latency of table lookups. A variant on $ Q $ which allows several $ T $ box lookups to take place in parallel by taking taps from later in the LFSR is easy to construct, but surprisingly did not result in any speed improvements on the target platform. Ciphers similar to Mercy which use this technique to improve parallelism may be appropriate for other architectures.

Since the Feistel rounds use XOR mixing, $ Q $ is also designed such that the first operation on the input is addition, as is the last operation on the output. This improves operation mixing, helping to frustrate attacks which model the cipher using a single algebraic group. XOR is also used within $ Q $ for the same reason.

The output is chosen for the property that, where $ Q(S,x)=(S',y) $, if either of $ S $ or $ S' $ is known and either of $ x $ or $ y $ is known, the two unknowns can be inferred. We prove this in four cases below:

  1. $ S,x $ known, $ S',y $ unknown: use $ Q $ directly.
  2. $ S',x $ known, $ S,y $ unknown: from Equation 1 we infer $ y=S'_{3}+M(S'_{1}+x) $ and $ S=(S'_{1},S'_{2},S'_{3},S'_{0}\oplus y) $.
  3. $ S,y $ known, $ S',x $ unknown: $ x=M^{-1}(y-S_{2})-S_{0} $(defined since $ M $ is bijective), then apply $ Q $ as normal.
  4. $ S',y $ known, $ S,x $ unknown: find $ S $as in 2 and $ x $ as in 3.

F function

These properties of $ Q $ are used to build a fast, bijective F function. A bijective F function lends resistance to a number of attacks, making 2 round iterative differential characteristics impossible and frustrating truncated differential cryptanalysis, as well as lending confidence that the input is being used efficiently. For a fixed spice $ G $, we infer $ F $'s input $ A $ given the output $ B $ as follows: the final state of the state machine is simply the last four words of the output: $ S_{n+4}=B_{n-4\ldots n-1} $; from this, we can run the state machine backwards up to $ S_{8} $ as with point 4 of Section 5.4 above, inferring $ A_{0\ldots n-5} $ as we do so. We then use $ A_{n-8\ldots n-5} $(which we just inferred) and $ G $ to infer $ S_{0} $ using point 2 of Section 5.4 above, which gives us our last four inputs $ A_{n-4\ldots n-1} $.


This F function does not provide full avalanche between input and output. I see no secure way to build a balanced full-avalanche F function for large blocks that isn't so much slower than Mercy's F function that the time cost would not be better spent on more rounds with the weaker F function.

Instead of providing full avalanche, the F function makes two weaker guarantees that together are almost as effective:

A full avalanche F function achieves avalanche across the whole block after three rounds. This construction does so after four rounds. In addition, in encryption and decryption every keytable lookup depends on the result from the previous lookup.

The partial collision attack from Section 2 will demonstrate that after six rounds Mercy's avalanche is imperfect, since only 384 bits of state are carried between some parts of the block, but such an attack would require that $ 2^{193} $ chosen plaintexts be encrypted, and is thus outside the security goals of the cipher. This suggests a distinction between perfect avalanche, and avalanche sufficient to defeat cryptanalysis; this distinction is useful since mechanisms for providing perfect avalanche, such as networks based on Fourier transforms (used in SAFER [12] and proposed by [16] for large block ciphers), can carry a heavy performance penalty on large blocks. This distinction is not useful on small blocks: if this attack is possible against a cipher with a 64-bit block, it will not require more than $ 2^{33} $ chosen plaintexts.


Mercy only whitens one half of the input and output, since the cost both in time and storage of whitening both halves would be significant, and since the primary function of whitening is to hide the input of the F function from attackers. On large block ciphers, whitening also serves to frustrate attacks based on creating inputs with special structures, such as attempts to induce a repeating state in the state machine of $ F $.

Linear and differential cryptanalysis

We do not prove Mercy resistant to either linear or differential cryptanalysis; a large block cipher meeting the performance goals that could be proven resistant would be a worthy goal. However, four features of the cipher lend resistance to these classes of attack.

First, the key dependent subsitutions are optimised against linear and differential cryptanalysis as described in Section 5.2.

Second, the LFSR-based construction of the $ Q $ state machine forces any input to the F function with active substitutions (in the terminology of [7]) to make at least three substitutions active. In practice, the intent of the F function design is that any difference in the input causing a difference in a $ T $ box substitution will cause all subsequent $ T $ box subsitution to be uncorrelated; avoiding this effect will be very hard for attackers. Most F functions cannot afford the luxury of 68 chained non-linear substitutions.

Third, the initial and final whitening should force attackers to look for difference propagations or linear correlations over at least four rounds.

Fourth, the ways in which key material is introduced in the F function should mean that inferring a suggested key from a given plaintext-ciphertext pair should be extremely difficult.


up previous
Next: Conclusions Up: Mercy: a fast large Previous: Description of Mercy