Symbolic Analysis With angr
I came across an interesting cybersecurity challenge which I think is an excellent example of the power of symbolic analysis.
The challenge is a binary which looks for a base64 encoded file and then simply checks that the file is the correct one by traversing it. The file is not provided, so presumably one is supposed to recreate the file based on the check routine. The interesting part is that the traversal is not a loop, but hardcoded in a very long routine which breaks some traditional tools.
- Disassembling with IDA works fine and is quick, but trying to build a graph view takes a whole day on my machine, followed by a warning that the maximum number of nodes is limited to 1000. We can increase that, but at somewhere between 100000 and 1000000 ida crashes instead of making a graph. In any case it's probably pointless since the function is so long.
- Trying HexRays to decompile the check routine results in a program crash.
- The RetDec IDA plugin seems to keep running forever. I don't know if it will ever finish
- RetDec command line version fails after taking a long time:
Running phase: Initialization ( 0.01s )
Running phase: LLVM ( 0.02s )
Running phase: Providers initialization ( 0.02s )
Running phase: Input binary to LLVM IR decoding ( 0.28s )
Running phase: LLVM ( 207465.34s )
Running phase: x87 fpu register analysis ( 207467.14s )
Running phase: Main function identification optimization ( 207467.63s )
Running phase: Libgcc idioms optimization ( 207467.92s )
Running phase: LLVM instruction optimization ( 207467.92s )
Running phase: Conditional branch optimization ( 207468.81s )
Running phase: Syscalls optimization ( 207477.50s )
Running phase: Stack optimization ( 207477.50s )
Running phase: Constants optimization ( 207495.33s )
Running phase: Function parameters and returns optimization ( 207519.06s )
Error: Decompilation to LLVM IR failed
- Cutter, interestingly, works fairly quickly in the pseudocode section, but limits the length of the code to 127 basic blocks as it's using the pdc decompilation command
- radare2 on the command line does a pretty good job decompiling with the pdd command:
$ r2pm -i r2dec
$ r2 a.out
[0x000007c0]> aa
[0x000007c0]> afl
... snip ...
0x000008f0 1 33 sym.badboy
0x00000911 1 31 sym.goodboy
0x00000930 10 405 main
0x00000ac5 127 3047 sym.check
... snip ...
[0x000007c0]> s sym.check
/* r2dec pseudo code output */
/* ch30.bin @ 0xac5 */
#include <stdint.h>
int64_t check (int32_t arg1) {
int32_t var_18h;
int32_t var_8h;
int32_t var_4h;
var_18h = rdi;// this is pointer to base64 encoded string from file
var_4h = 0;// initialize counter
var_8h = 0xf7;// copy constant to local variable
eax = var_4h;// copy counter value to eax
rdx = (int64_t) eax; //sign extend eax to rdx
rax = var_18h; // point rax to base64 encoded string from file
rax += rdx;// increment pointer by counter
eax = *(rax);// copy first character from string into eax
eax = (int32_t) al;// just keep low byte of character
al ^= 0xa3; // xor character with 0xa3
if (eax != var_8h) { // check if (first char xor 0xa3) == 0xf7 -- 'T'
eax = var_4h;
edi = var_4h;
badboy ();
}
... snip ...
var_4h++;
var_8h = 0x54;
eax = var_4h;
rdx = (int64_t) eax;
rax = var_18h;
rax += rdx;
eax = *(rax);
eax ^= 0x15;
eax = (int32_t) al;
/* Beware that this jump is a conditional jump.
* r2dec transformed it as a return, due being the
* last instruction. Please, check 'pdda' output
* for more hints. */
return void (*0x16b6)() ();
- Ghidra gobbles up resources and crashes when trying to create the pseudocode
The function goes character by character through the file, checks if it is the correct one and then moves on to the next. There are thousands of checks. So one approach is to take the pseudocode or disassembly listing and use that to create the proper string. I tried it and it works. But this is also a good case for using a symbolic analysis engine.
Angr is "angr is a python framework for analysing binaries. It combines both static and dynamic symbolic ("concolic") analysis, making it applicable to a variety of tasks." And the very interesting task we are going to use it for here is static symbolic analysis. In essence, angr can disassemble a binary and we analyse what series of actions will result in a desired outcome. The binary under analysis here has a pretty simple structure. After making some initial checks on the input file, such as length, it will pass the file contents to a check routine. The check
routine then calls one of two functions, goodboy
or badboy
. So what we can do with angr is determine which file contents result in the check
routine calling goodboy
instead of badboy
.
I find that a lot of the example content on how to use angr online does not function any more due to deprecation of API calls. Angr is being developed at a fast pace.....
Here are some resources on the details of symbolic execution:
So I think it's good to have another example out there.
Here I will use Ubuntu 18.10 in a clean virtual machine and set it up for angr with the following commands:
sudo apt update
sudo apt upgrade
sudo apt install python3-dev libffi-dev build-essential virtualenvwrapper
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv --python=$(which python3) angr && pip install angr
one can then run a script just by calling it like this:
python angr angrscript
There is a useful cheat sheet on the angr github:
https://github.com/angr/angr-doc/blob/master/CHEATSHEET.md
I think it's worth going through a simple example from angr.io in order to have a good starting point for the thing we are going to do here:
consider the code given:
import os
import angr
project = angr.Project("defcamp_quals_2015_r100", auto_load_libs=False)
path_group = project.factory.path_group()
path_group.explore(find=lambda path: 'Nice!' in path.state.posix.dumps(1))
print path_group.found[0].state.posix.dumps(0)
The imports are pretty self-explanatory.
project = angr.Project("defcamp_quals_2015_r100", auto_load_libs=False)
loads an executable into the current project. The executable is called defcamp_quals_2015_r100
and this executable takes a command line argument as a password. If the correct password is passed, then it prints a message: "Nice!
"
auto_load_libs=False
keeps angr from loading and analysing the libraries associated with the executable to save time. In this case all the logic we are interested in happens in the program, so the libraries need not be analysed.
After loading the executable, we have to decide on what analysis to perform, and in this case it's a path group. path_group = project.factory.path_group()
Then, we can look for the path group that causes the program to print "Nice!
" on stdout
, path_group.explore(find=lambda path: 'Nice!' in path.state.posix.dumps(1))
Finally, print the stdin
of the path group that leads us there: print path_group.found[0].state.posix.dumps(0)
This is very powerful! The executable at hand uses a file input, rather than stdin, so we need an angr script that works on that.
import angr
import claripy
project = angr.Project('ch30.bin', auto_load_libs=False) # load executable, don't analyse libaries
data = claripy.BVS('data', 0x100000*8) # declare how long the file can be; this is used by the concretize() function later
filename = 'test.txt' # name for simulated file to be passed to the executable
simfile = angr.SimFile(filename, content=bytestring) # declare contents for simulated file to be passed to the executable
state = project.factory.entry_state(args=['./executable', filename], fs={filename: simfile}) # initialize state machine to executable's entry point
simulationManager = project.factory.simulation_manager(state)
addr = project.loader.find_symbol('goodboy').rebased_addr # function we want to get to
avoid_addr = project.loader.find_symbol('badboy').rebased_addr # functin we want to avoid - allows angr to cull execution paths.
simulationManager.explore(find=[addr, addr2,addr3, addr4,addr5, addr6,addr7, addr8,addr9, addr10,addr11, addr12],avoid=avoid_addr, step_func=lambda lsimulationManager: lsimulationManager.drop(stash='avoid')) # find a way through to the goodboy routine
test = simulationManager.one_found.fs.get(filename).concretize() # actually patch together the file input that results in the desired path
print(repr(test))
This works, but it very very slow and memory intensive. I found that it was much faster to just simulate some way into the check routine, copy the concretized file input up to that point and start from that point:
import angr
import claripy
import logging
import time
t0 = time.time()
project = angr.Project('ch30.bin', auto_load_libs=False) # load executable, don't analyse libaries
bytestring = None
data = 'TVqQAAMAAAAEAAAA' # this is the start of the file as we know it up to now
stringAsList = list(data); # turn string into list
stringAsList.append(claripy.BVS('data', 0x1000*8)) # declare how long the file can be; this is used by the concretize() function later
bytestring = claripy.Concat(*stringAsList) # turn list into bytestring
filename = 'test.txt' # name for simulated file to be passed to the executable
simfile = angr.SimFile(filename, content=bytestring) # declare contents for simulated file to be passed to the executable
state = project.factory.entry_state(args=['./ch30.bin', filename], fs={filename: simfile}) # initialize state machine to executable's entry point
simulationManager = project.factory.simulation_manager(state)
simulationManager.active
print("start exploration")
logging.getLogger('angr.sim_manager').setLevel('INFO')
addr = project.loader.find_symbol('check').rebased_addr # + 51 * len(data) # go a little further into the check routine than last time
addr2 = addr + 1
addr3 = addr + 2
addr4 = addr + 3
addr5 = addr + 4
addr6 = addr + 5
addr7 = addr + 6
addr8 = addr + 7
addr9 = addr + 8
addr10 = addr + 9
addr11 = addr + 10
addr12 = addr + 11
avoid_addr = project.loader.find_symbol('badboy').rebased_addr
simulationManager.explore(find=[addr, addr2,addr3, addr4,addr5, addr6,addr7, addr8,addr9, addr10,addr11, addr12],avoid=avoid_addr, step_func=lambda lsimulationManager: lsimulationManager.drop(stash='avoid'))
print("start concretize")
test = simulationManager.one_found.fs.get(filename).concretize()
print(repr(test))
t1 = time.time()
print (t1-t0)
This results in a base64 encoded executable:
start concretize
b'TVqQAAMAAAAEAAAA
[.....]
nAA==\n\x00\x00\x00
[.....]
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1452.8494970798492
(angr) :~$
The rest of the challenge was straight forward - the executable was a windows PE file with no surprises.