Differential Power Analysis on AES - Hands-On Single-Bit Attack

Differential Power Analysis (DPA) attacks on cryptosystems have been around at least since the seminal paper by Kocher et al. in 1999 [1]. Surprisingly, this remains a very relevant topic for modern crypto hardware: many papers have been published, and plenty of hardware and software exists to exploit such systems, for example the ChipWhisperer:

https://newae.com/tools/chipwhisperer/

However, I think there is still use for a basic follow-along example implementation.

There is an excellent introduction to DPA here:

https://research.kudelskisecurity.com/2014/07/16/power-analysis-basics/

The basic idea behind these differential analysis attacks is that if we can measure *anything* that depends on part of the key being correct, we can exploit that measurement to determine that part of the key. Electromagnetic emissions, run-time variations, even sound emissions could be used. In DPA there is usually a lot of noise in the data: a power trace does not represent only a particular register or bit operation, but the operation of the whole chip, plus electrical noise.

I am going to write the code in MATLAB for three reasons.

  • First, I am starting this work with some sample traces and loading routines from the book 'Hardware Security and Trust' [2]. Example code for the loading routines is already available, which makes it easy to get started and for others to duplicate.
  • Second, there is more code using the same traces available at https://github.com/Nodulaire/SCA-DPA, so I don't have to make big changes for the single-bit case.
  • Third, MATLAB is a very high-level language, which allows me to keep the code concise and easy to follow.

I use a binary file called traces-00112233445566778899aabbccddeeff.bin. This data comes from the book [2]:

We encrypted 200 plaintexts (file plaintext-00112233445566778899aabbccddeeff.txt), obtaining 200 ciphertexts (file ciphertext-00112233445566778899aabbccddeeff.txt). During the encryptions we measured power traces (traces-00112233445566778899aabbccddeeff.bin). Each trace has a length of 370,000 samples (in this case). Each sample is represented by an 8-bit unsigned value (i.e., the length of the file is 370,000 bytes * 200 traces = 74 MB).
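
A quick sanity check of the downloaded file size (a sketch using MATLAB's dir):

d = dir('traces-00112233445566778899aabbccddeeff.bin');
assert(d.bytes == 370000 * 200, 'unexpected trace file size'); % 74,000,000 bytes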

If your program/script is correct, then you should reveal the key 00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF.

Let's use MATLAB/Octave for this. First, load the traces into a variable:

numberOfTraces = 200;   % total number of traces in the file
traceSize = 370000;     % samples per trace in the file
offset = 0;             % we can limit the amount of data to be processed; not strictly necessary
segmentLength = 370000; % we can limit the amount of data to be processed; not strictly necessary
myfile = fopen('traces-00112233445566778899aabbccddeeff.bin','r');
traces = zeros(numberOfTraces, segmentLength); % preallocation; not strictly necessary, but speeds up memory access
for i = 1:numberOfTraces                % for each of the 200 traces
    fseek(myfile, offset, 'cof');       % skip to the segment of interest within this trace
    if (segmentLength + offset > traceSize)
        t = fread(myfile, traceSize - offset, 'uint8');  % clamp the read at the end of the trace
    else
        t = fread(myfile, segmentLength, 'uint8');       % read one full segment
    end
    fseek(myfile, max(0, traceSize - segmentLength - offset), 'cof'); % skip ahead to the start of the next trace
    traces(i, 1:numel(t)) = t;          % store t as row i of the 'traces' matrix
end
fclose(myfile);

and here is what a whole trace looks like:
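
A plot like this can be produced with:

plot(traces(1,:));                                % the first of the 200 traces
xlabel('sample'); ylabel('sample value (uint8)');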

And there are 200 of them. Now, how do we use that to extract the key? We make the assumption that the data in every trace consists of the sum of three components:

  • A constant term, C
  • Some random variations on top of the constant term, R
  • Some systematic variations on top of the constant term, due to the key computation, S

Ignore the systematic variations for a moment. If we exploit the law of large numbers and average a large number of traces, then we are left with the constant term, because the random variations average out. Here is a plot of the average of all 200 traces, essentially a plot of C:
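
In MATLAB, that averaging is a single call to mean:

averageTrace = mean(traces, 1); % average all 200 traces, sample by sample
plot(averageTrace);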

It looks very similar to the single trace, so we know the random terms R are small compared to the constant term C - but that does not really matter. The average of the random term should tend to zero as we average more and more traces. Let's look at that. Take the first 10 traces and average them, then take the next 10 traces and average them too. We then have two averages, each composed of the constant term and the average of the random variations. Call them C1 and R1 for the first average and C2 and R2 for the second. The constant term, by definition, is the same for all traces, so C1 and C2 both equal C.

This is important, because we can now do the following: subtract the two averages from one another. Call the result A = (C1 + R1) - (C2 + R2) = C + R1 - C - R2 = R1 - R2.

R1 and R2 each are averages of random values and should go to zero as we average more data traces. Thus A should go to zero. This is what it looks like when we subtract those two averages of 10 traces each:

group1 = zeros(1, segmentLength);
group2 = zeros(1, segmentLength);
numberOfTracesInGroup1 = 0;
numberOfTracesInGroup2 = 0;
for L = 1:10                                   % sum the first 10 traces
    group1(1,:) = group1(1,:) + traces(L,:);
    numberOfTracesInGroup1 = numberOfTracesInGroup1 + 1;
end
for L = 11:20                                  % sum the next 10 traces
    group2(1,:) = group2(1,:) + traces(L,:);
    numberOfTracesInGroup2 = numberOfTracesInGroup2 + 1;
end
group1(1,:) = group1(1,:) / numberOfTracesInGroup1; % average of group 1
group2(1,:) = group2(1,:) / numberOfTracesInGroup2; % average of group 2
groupDifference = group1(1,:) - group2(1,:);        % subtracting the averages

This is not quite zero, as we expected, but the sudden jump in the middle, which was part of the constant term, is no longer there. Instead of averaging 10 traces per group, here is the result if we average the first 100 traces and the second 100 traces and subtract those:
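
With the built-in mean, this version is a two-liner:

groupDifference100 = mean(traces(1:100,:), 1) - mean(traces(101:200,:), 1);
plot(groupDifference100);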

And we can see that we get closer to zero as we average more data. Next, let's look at the term we called "systematic variations on top of the constant term, due to the key computation", S. We assume that *somewhere* in that trace - we don't know where - the chip performs the AES computation. Part of that computation is that each key byte is XORed with a plaintext byte, followed by a lookup in a constant table called the S-box (SBOX). The following illustration is from:

https://www.researchgate.net/publication/267769417_Differential_Power_Analysis_in_AES_A_Crypto_Anatomy

And this is at the heart of the attack. Somewhere - we don't know where - in the data trace there is a bit whose value depends ONLY on the changing data and key bytes. The S-box is a constant, publicly known lookup table. Setting that bit either increases or decreases the power draw, but whichever it does, it does so every time. We know the data bytes being fed to the device, but we don't know the key. However, if we guess a key byte, then we can predict the outcome of this operation. For example, take trace #1 in the example data set and examine plaintext byte 1: 0x25, both of which we are given. We don't know the key byte, but there are only 256 possible values, so let's guess them one by one. Start with 0x00, then 0x01, and so on. The XOR of 0x25 and 0x00 is 0x25; the S-box lookup for that value is 0x3F. If we instead guessed that the key byte is 0x01, the XOR result would be 0x24; the S-box lookup for that value is 0x36. The least significant bit of the S-box result 0x3F is 1, and the least significant bit of the S-box result 0x36 is 0. And so on for all other key guesses up to 0xFF.
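
In code, checking a single key guess against a single plaintext byte looks like this (SBOX is the 1x256 AES S-box table declared in the full listing at the end of this post):

plaintextByte = hex2dec('25');  % first plaintext byte of trace #1
keyGuess = hex2dec('00');       % current guess for key byte 1
sboxOutput = SBOX(bitxor(plaintextByte, keyGuess) + 1); % +1 because MATLAB indexing starts at 1
predictedBit = bitget(sboxOutput, 1)                    % least significant bit: 1 for guess 0x00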

For every guess, sort the traces into two groups: one with all the traces for which the guessed key predicts a least significant bit of 1, and one with the rest. If the predictions are wrong, then the traces where the term S makes the power draw a bit higher are just as likely to land in group 1 as in group 2, and the difference of the group averages tends to zero. However, if the predictions are right, then we have two averages that are identical except for the term S at the particular point in time where the S-box lookup is performed.
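
The sorting can be done with an explicit loop, as in the full listing at the end, or more compactly with logical indexing; a sketch, assuming Hypothesis holds the S-box outputs per trace for the current guess:

lsb = bitget(Hypothesis(:, currentKeyByteGuess+1), 1); % predicted bit for every trace
avg1 = mean(traces(lsb == 1, :), 1);  % average of the traces predicted to set the bit
avg0 = mean(traces(lsb == 0, :), 1);  % average of the rest
groupDifference = abs(avg1 - avg0);   % a spike here indicates the correct guess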

Remember that plot of the difference between the first 100 traces and the second 100 traces? It is the result of predicting that the first 100 traces differ in some repeatable way from the second 100, and we can see that that prediction is wrong. However, if we use the correct key byte to sort the traces into groups, group 1 collects the traces whose predicted bit is 1 and group 2 the rest, and the subtraction of their averages looks like this:

And there is a spike in one place where there is a clear difference between the averages. That means we have found both the correct key byte and the place in the trace where the computation happens. The same can be done for all the other key bytes; there will be spikes in other places where the other key bytes cause an S-box table lookup.

This is a good point to recap what exactly we have done:

  • We have traces that contain a sum of three different components: C, R, and S.
  • S is related to the key; we are not interested in C and R.
  • By splitting the traces into two arbitrary sets, averaging each set (which averages out R), and then subtracting the two averages (which cancels C), we remove both C and R.
  • We make a prediction based on a guessed key byte and a known cryptographic operation.
  • If we guessed the key incorrectly, the result of the subtraction tends to zero everywhere.
  • If we guessed the key correctly, there is a spot somewhere in the subtraction that tends to a nonzero value.
  • Because AES is implemented as a bytewise algorithm, we only have to perform 256 guesses for each of the 16 key bytes of AES-128, which implies a loop of 16 × 256 = 4096 iterations over the set of traces (see the sketch below).
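
To make the recap concrete, here is a compact, vectorized sketch of the attack on a single key byte. This is my own condensation, not the book's code; it assumes the traces matrix, the plaintext matrix, and the SBOX table as defined elsewhere in this post:

targetByte = 1;                        % which of the 16 key bytes to attack
maxDiff = zeros(1, 256);
for guess = 0:255
    sboxOut = SBOX(bitxor(plaintext(:, targetByte), guess) + 1);     % power hypothesis
    lsb = bitget(sboxOut, 1);                                        % predicted least significant bit
    d = mean(traces(lsb == 1, :), 1) - mean(traces(lsb == 0, :), 1); % difference of group averages
    maxDiff(guess + 1) = max(abs(d));                                % biggest spike for this guess
end
[~, best] = max(maxDiff);
keyByte = best - 1                     % most likely value of this key byte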

Since this is a statistical method, the more traces we have, the likelier it is that we get a good result. Also, other bits can be attacked: the prediction example above used only the least significant bit, but there is no reason we can't use any of the other bits.
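
Attacking a different bit only requires changing the bit index passed to bitget in the grouping step of the listing below:

attackedBit = 2; % 1 = least significant bit ... 8 = most significant bit
predictedBit = bitget(Hypothesis(L, currentKeyByteGuess+1), attackedBit);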

At this point, let's do two things. First, the operation of interest seems to happen towards the beginning of the trace, so we can limit the range of data considered to speed up the computation:

offset = 50000; % we can limit the amount of data to be processed. Not strictly necessary
segmentLength = 40000; % we can limit amount of data to be processed. Not strictly necessary

Next, let's see whether the predicted key matches the real key when using the least significant bit.

Correct Key:                00 11 22 33 44 55 66 77 88 99 AA BB CC DD EE FF
Predicted(bit 0):           00 2A 22 33 44 55 66 77 04 99 AA BB B5 91 CC FF

We did find quite a few of the bytes, but there don't seem to be enough traces to recover them all this way. What if we use the other bits?

Predicted(bit 1):           00 11 22 C5 44 55 66 25 88 99 AA BB 9A 1A EE FF
Predicted(bit 2):           C8 B1 4C E5 D3 E0 6E 77 5D 37 49 82 D7 6B 3B B0
Predicted(bit 3):           FF D2 8A 33 3C 01 83 71 C1 99 AA 6E 47 DD EE FF
Predicted(bit 4):           0D D0 A4 41 A0 0C DC CC 05 85 DA D9 96 DD 03 EF
Predicted(bit 5):           49 AF 1D B4 71 C9 F3 1C FF A3 84 15 D3 65 FE DA
Predicted(bit 6):           30 C0 FA 2D 3A BB 57 B6 4F 30 7C EB AF 3A 6C 71
Predicted(bit 7):           50 BF 4A A1 0E 55 43 68 BE 7B 19 46 9C 85 20 7F
Most often predicted value: 00 ?? 22 33 44 55 66 77 ?? 99 AA BB ?? DD EE FF

Clearly some bits work better than others, and they also predict different key bytes to different extents. For example, bit 0 does not yield the correct 0x11 for the second key byte; bit 1 does, but it misses the correct 0x33 for the fourth key byte, which bit 0 found.

One solution, of course, is to acquire more traces, but in this case we only have the 200. An easy improvement is to count which values are predicted most often; that way we recover 13 of the 16 key bytes correctly. But this ignores some useful information: the individual bit predictions are not independent. There has to be a way to use all those bits at the same time, doesn't there? We'll look into that in the next section.
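
For reference, the "Most often predicted value" row above can be computed with MATLAB's mode function; a sketch, where predictedKeys is a hypothetical 8x16 matrix holding the eight per-bit predictions tabulated above:

mostOften = mode(predictedKeys, 1); % most frequent prediction in each key byte column
fprintf('%02X ', mostOften); fprintf('\n');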

It should be mentioned that the traces have to be well aligned. If they are not, the signal averages out over time and the spike gets lost in the noise.
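
If alignment were a problem, a common fix is to shift each trace to the lag that maximizes its cross-correlation with a reference trace. A minimal sketch (it assumes xcorr from the Signal Processing Toolbox; the example traces here are already aligned, so this step is not needed):

ref = traces(1, :);                        % use the first trace as the reference
for i = 2:size(traces, 1)
    [c, lags] = xcorr(traces(i, :), ref);  % cross-correlate against the reference
    [~, k] = max(c);
    traces(i, :) = circshift(traces(i, :), -lags(k)); % shift to the best alignment
end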

Here is the code used for producing the above data:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Matlab key recovery exercise template
%%
%% 2014, Florent Bruguier, PCM - CNFM
%% 2019, modified by Jan Beck
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% declaration of the SBOX (used to calculate the power hypothesis)
SBOX=[099 124 119 123 242 107 111 197 048 001 103 043 254 215 171 118 ...
      202 130 201 125 250 089 071 240 173 212 162 175 156 164 114 192 ...
      183 253 147 038 054 063 247 204 052 165 229 241 113 216 049 021 ...
      004 199 035 195 024 150 005 154 007 018 128 226 235 039 178 117 ...
      009 131 044 026 027 110 090 160 082 059 214 179 041 227 047 132 ...
      083 209 000 237 032 252 177 091 106 203 190 057 074 076 088 207 ...
      208 239 170 251 067 077 051 133 069 249 002 127 080 060 159 168 ...
      081 163 064 143 146 157 056 245 188 182 218 033 016 255 243 210 ...
      205 012 019 236 095 151 068 023 196 167 126 061 100 093 025 115 ...
      096 129 079 220 034 042 144 136 070 238 184 020 222 094 011 219 ...
      224 050 058 010 073 006 036 092 194 211 172 098 145 149 228 121 ...
      231 200 055 109 141 213 078 169 108 086 244 234 101 122 174 008 ...
      186 120 037 046 028 166 180 198 232 221 116 031 075 189 139 138 ...
      112 062 181 102 072 003 246 014 097 053 087 185 134 193 029 158 ...
      225 248 152 017 105 217 142 148 155 030 135 233 206 085 040 223 ...
      140 161 137 013 191 230 066 104 065 153 045 015 176 084 187 022];
NOBOX = linspace(0,255,256); % identity lookup table 0..255 (not used for the S-box attack)

%%%%%%%%%%%%%%%%%%%%% LOADING the DATA %%%%%%%%%%%%%%%%%%%%%
numberOfTraces = 200;
traceSize = 370000;
% columns and rows are used as inputs to the function loading the plaintext/ciphertext
% (myin is a loading helper that comes with the book's example code)
columns = 16;
rows = numberOfTraces;
plaintext = myin('plaintext-00112233445566778899aabbccddeeff.txt', columns, rows);

offset = 50000;        % we can limit the amount of data to be processed; not strictly necessary
segmentLength = 40000; % we can limit the amount of data to be processed; not strictly necessary
myfile = fopen('traces-00112233445566778899aabbccddeeff.bin','r');
traces = zeros(numberOfTraces, segmentLength); % preallocation; speeds up memory access
for i = 1:numberOfTraces % for each of the 200 traces
    fseek(myfile, offset, 'cof');                % skip to the segment of interest
    if (segmentLength + offset > traceSize)
        t = fread(myfile, traceSize - offset, 'uint8'); % clamp the read at the end of the trace
    else
        t = fread(myfile, segmentLength, 'uint8');
    end
    fseek(myfile, max(0, traceSize - segmentLength - offset), 'cof'); % skip to the next trace
    traces(i, 1:numel(t)) = t; % add t to the array of traces, called 'traces'
end
fclose(myfile);

byteStart = 1;
byteEnd = 16;
keyCandidateStart = 0;
keyCandidateStop = 255;
solvedKey = zeros(1, byteEnd);

% for every byte in the key do:
for currentKeyByte = byteStart:byteEnd
    currentKeyByte % echo progress
    for currentKeyByteGuess = keyCandidateStart:keyCandidateStop % iterate through all candidate key bytes, 0x00 to 0xFF
        currentKeyByteGuess % echo progress
        xorResult = bitxor(plaintext(:,currentKeyByte), currentKeyByteGuess); % first operation: XOR between the cleartext and the guessed key
        Hypothesis(:,currentKeyByteGuess+1) = SBOX(xorResult+1); % then the XOR result gets substituted; +1 because MATLAB indexing starts at 1
        group1 = zeros(1, segmentLength);
        group2 = zeros(1, segmentLength);
        numberOfTracesInGroup1 = 0;
        numberOfTracesInGroup2 = 0;
        % for all 200 traces: put each into one group or the other based on the predicted least significant bit
        for L = 1:numberOfTraces
            firstBit = bitget(Hypothesis(L,currentKeyByteGuess+1), 1); % the expected least significant bit from the hypothesis
            if firstBit == 1
                group1(1,:) = group1(1,:) + traces(L,:);
                numberOfTracesInGroup1 = numberOfTracesInGroup1 + 1;
            else
                group2(1,:) = group2(1,:) + traces(L,:);
                numberOfTracesInGroup2 = numberOfTracesInGroup2 + 1;
            end
        end
        % calculation of the average of each group
        group1(1,:) = group1(1,:) / numberOfTracesInGroup1; % average of group 1
        group2(1,:) = group2(1,:) / numberOfTracesInGroup2; % average of group 2
        groupDifference = abs(group1(1,:) - group2(1,:)); % subtract the averages and take the absolute value; we want the largest difference, positive or negative
        groupFin(currentKeyByteGuess+1,:) = groupDifference; % store the difference trace for this key byte guess
        maxDifference(currentKeyByteGuess+1) = max(groupDifference);
    end
    % now we have all the trace differences; find the one containing the largest value
    [traceDifferenceWithLargestValueX, traceDifferenceWithLargestValueY] = find(groupFin == max(groupFin(:)));
    solvedKey(1, currentKeyByte) = traceDifferenceWithLargestValueX(1) - 1; % row index - 1 = recovered key byte
end
solvedKey % print the recovered key

All the files used here are available at https://github.com/janbbeck/SideChannelMatlab

[1] Kocher, P., Jaffe, J. and Jun, B. (1999) 'Differential Power Analysis', Advances in Cryptology - CRYPTO '99, LNCS 1666, pp. 388–397, Springer-Verlag.

[2] Sklavos, N., Chaves, R., Di Natale, G. and Regazzoni, F. (2017) Hardware Security and Trust: Design and Deployment of Integrated Circuits in a Threatened Environment, Springer.

[3] https://newae.com/tools/chipwhisperer/

[4] https://research.kudelskisecurity.com/2014/07/16/power-analysis-basics/

[5] https://www.researchgate.net/publication/267769417_Differential_Power_Analysis_in_AES_A_Crypto_Anatomy

[6] https://github.com/Nodulaire/SCA-DPA

[7] Lo, O., Buchanan, W. J. and Carson, D. (2017) 'Power analysis attacks on the AES-128 S-box using differential power analysis (DPA) and correlation power analysis (CPA)', Journal of Cyber Security Technology, 1:2, pp. 88–107, DOI: 10.1080/23742917.2016.1231523.

[8] https://github.com/janbbeck/SideChannelMatlab