[Courses] [C] A tutorial on Blowfish

Sat Dec 28 22:54:29 EST 2002

On Mon, Dec 23, 2002 at 09:29:19PM -0500, Morgon Kanter wrote:
> 
> (...)
> This is function F, which is a BIT confusing, because of inconsistancies in 
> Schneier's explanation and Schneier's code:
> Divide Xl into 4 8-bit quarters: a, b, c, and d.
> ((S1[a] + S2[b] mod 2(exp)32) XOR S3[c]) + S4[d] mod 2(exp)32;
> 
> A bit confusing, no? However, unless mod does not mean modulus, here's what 
> Schneier's code says instead:
> ((S1[a] + S2[b]) XOR S3[c]) + S4[d]
> 
> No modulus at all in there. Maybe the book has an errata list I can check 
> somewhere, I'll look at it later.

actually, those seemingly inconsistent pieces of pseudo-code are OK,
although Schneier is making more or less silent assumptions about the
underlying architecture or width of long-ints being used for the
computations (32-bit). Also, he implicitly assumes that you realise
that, for X being any power of 2, and unsigned integers, the modulus
operation

  Y % X

yields the same result as bitwise AND-masking with X-1

  Y & (X-1)

This is not too surprising, when you look at the binary representation
of the numbers. For example, with X=256 and Y=12345, you'd get

  12345 % 256   ->  57
  12345 & 255   ->  57

Expressed in binary notation we have

  12345 : 00000000000000000011000000111001  (written at 32-bit width)
    256 : 00000000000000000000000100000000
    255 : 00000000000000000000000011111111

Bitwise masking (12345 & 255) can be visualised as

          00000000000000000011000000111001
        & 00000000000000000000000011111111
          ^^^^^^^^^^^^^^^^^^^^^^^^
          |
          those bits (where mask==0) are being cut off from (or
zeroed in) the first number, so we get the result:

          00000000000000000000000000111001

which is 57. In other words, this masks off all integer multiples of
256 from the original value.  And, you might have guessed it, the
integer division is related to the other part that's being cut off by
the mask (though that's not relevant for the issue at hand):

          00000000000000000011000000000000 >> 8
       == 00000000000000000000000000110000
       == 48
       == 12345 / 256

As binary numbers are a bit unwieldy to write, we usually use 
hexadecimal notation instead (where 255 (dec) is 0xff (hex),
for example), when we want to hint the reader that we're operating
with some value that's somehow special in its binary representation,
such as 0xffff == 1111111111111111, which is more obvious than 65535.

OK, back to the original issue. When using 32 bits for integer
computations (which corresponds to the native width of the internal
CPU registers of Pentiums and Athlons), we can only represent the lower
32 bits of any number. Larger values simply overflow, so any bits
beyond the 32nd bit are being discarded.  This is effectively the same
as AND-masking with (2(exp)32 - 1) == 0xffffffff, or "mod 2(exp)32".

This is the long-winded explanation of what Schneier assumed you
already know...

> 
> Here is the challenge for you all right now: write a function or macro for 
> function F. Assume the 4 S arrays are global variables. It will be 
> interesting to see how you split the 32-bit block into four 8-bit blocks.

you could define a macro like

  #define B(i,v)  ((v >> ((i-1)*8)) & 0xff)

where B(1,x) would extract the low-order 8-bit block and B(4,x) the
high-order 8-bit of some 32-bit value x. So, that function F() would
become

  unsigned long F(unsigned long v) {
      return ((S1[B(1,v)] + S2[B(2,v)]) ^ S3[B(3,v)]) + S4[B(4,v)];
  }

(assuming the Ss are defined as globals "unsigned char S1[256], S2[256], S3[256], S4[256];" )

As the code looks a bit unwieldy after expansion by the C preprocessor
(use option -E to have the compiler do the expansion only):

  unsigned long F(unsigned long v) {
      return ((S1[(( v  >> (( 1 -1)*8)) & 0xff) ] + S2[(( v  >> (( 2 -1)*8)) & 0xff) ]) ^
      S3[(( v  >> (( 3 -1)*8)) & 0xff) ]) + S4[(( v  >> (( 4 -1)*8)) & 0xff) ];
  }

you might think that you could just as well shorten the code to
read something like the following:

  unsigned long F(unsigned long v) {
      return ((S1[v & 0xff] + S2[(v>>=8) & 0xff]) ^ S3[(v>>=8) & 0xff]) + S4[(v>>=8) & 0xff];
  }

Right you are, in principle, but think twice before doing so.
Although the code does what it's supposed to do in this context, it
generally has the side effect of modifying the value v, so the order of
evaluation becomes important. In a different context this would most
probably yield unexpected results, e.g. when you try something along
similar lines:

  printf("%x %x %x %x\n", v & 0xff, (v>>=8) & 0xff, (v>>=8) & 0xff, (v>>=8) & 0xff );

Assuming that you've defined the value v to be 0xDDCCBBAA, I guess you
would expect to see printf() printing "aa bb cc dd", yet it prints
"dd dd cc bb". Why is that? It has to do with the order of evaluation,
as mentioned before. For the function call printf(), the compiler
pushes the function's arguments onto some internal stack, from where
the function can then access them. As a stack is a LIFO
(last-in-first-out) kind of thing, the arguments are pushed in reverse
order. And, to minimize unnecessary rearranging of values, the compiler
also *executes* the expressions in reverse order (i.e. "v & 0xff" is
evaluated last), which explains what you see...

Using the macro from above instead does circumvent this subtle pitfall,

  printf("%x %x %x %x\n", B(1,v), B(2,v), B(3,v), B(4,v) );

since the value of v is not being modified within the macro -- so it's
irrelevant that B(4,v) is executed before B(1,v).

Anyways, "optimizations" like this are usually unnecessary at best.
The compiler is pretty clever at tidying up things at compile time,
especially when it's being told to optimize (-O, -O2, etc.). For
example, when comparing the disassembled code of both functions
(F1() being the expanded version of the function using the B() macro),
one can see that at the machine code level, there are hardly any
differences. For the snippet of code (temp.c)

  unsigned char S1[256], S2[256], S3[256], S4[256];

  unsigned long F1(unsigned long v) {
      return ((S1[(( v  >> (( 1 -1)*8)) & 0xff) ] + S2[(( v  >> (( 2 -1)*8)) & 0xff) ]) ^
      S3[(( v  >> (( 3 -1)*8)) & 0xff) ]) + S4[(( v  >> (( 4 -1)*8)) & 0xff) ];
  }
  unsigned long F2(unsigned long v) {
      return ((S1[v & 0xff] + S2[(v>>=8) & 0xff]) ^ S3[(v>>=8) & 0xff]) + S4[(v>>=8) & 0xff];
  }

the commands

  $ cc -c -O2 temp.c
  $ objdump -d temp.o

produce:

  00000000 <F1>:
     0:   55                      push   %ebp
     1:   89 e5                   mov    %esp,%ebp
     3:   8b 55 08                mov    0x8(%ebp),%edx
     6:   89 d0                   mov    %edx,%eax
     8:   25 ff 00 00 00          and    $0xff,%eax
     d:   0f b6 88 00 00 00 00    movzbl 0x0(%eax),%ecx
    14:   89 d0                   mov    %edx,%eax
    16:   c1 e8 08                shr    $0x8,%eax
    19:   25 ff 00 00 00          and    $0xff,%eax
    1e:   0f b6 80 00 00 00 00    movzbl 0x0(%eax),%eax
    25:   01 c1                   add    %eax,%ecx
    27:   c1 ea 10                shr    $0x10,%edx
    2a:   89 d0                   mov    %edx,%eax
    2c:   25 ff 00 00 00          and    $0xff,%eax
    31:   0f b6 80 00 00 00 00    movzbl 0x0(%eax),%eax
    38:   31 c1                   xor    %eax,%ecx
    3a:   c1 ea 08                shr    $0x8,%edx
    3d:   0f b6 82 00 00 00 00    movzbl 0x0(%edx),%eax
    44:   01 c8                   add    %ecx,%eax
    46:   c9                      leave  
    47:   c3                      ret    

  00000048 <F2>:
    48:   55                      push   %ebp
    49:   89 e5                   mov    %esp,%ebp
    4b:   8b 55 08                mov    0x8(%ebp),%edx
    4e:   89 d0                   mov    %edx,%eax
    50:   25 ff 00 00 00          and    $0xff,%eax
    55:   0f b6 88 00 00 00 00    movzbl 0x0(%eax),%ecx
    5c:   c1 ea 08                shr    $0x8,%edx
    5f:   89 d0                   mov    %edx,%eax
    61:   25 ff 00 00 00          and    $0xff,%eax
    66:   0f b6 80 00 00 00 00    movzbl 0x0(%eax),%eax
    6d:   01 c1                   add    %eax,%ecx
    6f:   c1 ea 08                shr    $0x8,%edx
    72:   89 d0                   mov    %edx,%eax
    74:   25 ff 00 00 00          and    $0xff,%eax
    79:   0f b6 80 00 00 00 00    movzbl 0x0(%eax),%eax
    80:   31 c1                   xor    %eax,%ecx
    82:   c1 ea 08                shr    $0x8,%edx
    85:   0f b6 82 00 00 00 00    movzbl 0x0(%edx),%eax
    8c:   01 c8                   add    %ecx,%eax
    8e:   c9                      leave  
    8f:   c3                      ret    

Even without understanding the details, one can see that the compiler is
generating almost identical machine code for both versions of the function.

(If you'd like to know how to read the assembly code, just ask -- as
this is a C course, I don't want to risk getting any further off-topic
than I already am right now ;-)

Just one more thing.  In order to get around having to fiddle with
bit-operators, you might feel tempted to treat the 32-bit longint as an
array of 4 chars (char[4]), and craft a macro which simply reads the
required 8-bit sub-block from the respective memory location.
For this, you'd have to cast the longint to an array of chars like this:

  #define B(i,v)  (((unsigned char *)&v)[i-1])

But don't. Although this is of course possible in C, and you could get
it to work without problems on a particular platform, it's *unportable*
code. This is due to the endian-ness issue (little-endian vs. big-endian,
sometimes also called byte-sex), which means that the internal
representation of integer values differs from platform to platform.
More precisely, the integer 0xDDCCBBAA is represented as the byte
sequence AA BB CC DD on the typical I32-PC-architecture, whereas on
Macs and SGI/Sun/HP machines, for example, the same number is
represented as the sequence DD CC BB AA in memory.  Thus, if you treat
the longint as an array char s[4], you'd get different values for s[i],
for example, s[1]==0xBB on PCs, and s[1]==0xCC on SGIs.
(More on that here, for example: http://www.cs.umass.edu/~verts/cs32/endian.html )

Even if you yourself are happy that your code runs on your current
machine, sooner that you think, there's someone else who wants to
compile it for a different platform... Do her a favor.

OK guys, enough random ranting for today -- most of the real work that
needs to be done to implement the algorithm is still left for anyone
else who wants to try...

Almut