By Brad Campbell –
Having worked with C and Assembly for most of my reverse-engineering endeavors, I decided to switch it up a bit and work with compiled Python bytecode. I really enjoyed creating this post and learned a great deal more about how Python works internally than I knew before. Resources are provided to allow the curious to explore these areas even further and to learn the things I did!
For those who are unaware, Python is an interpreted language that is commonly used for its combination of power, speed, and ease of use. Being an interpreted language, Python code is not compiled down into native machine code ahead of time. Rather than compiling the typed Python code into an assembly language, the interpreter compiles it into a language-specific bytecode format, which the Python virtual machine then executes.
Bytecode tends to be a representation of a higher-level language that can be more easily parsed and worked with than the source-code file. However, this leads to a rather significant loss of readability, since you will be reading not the original Python code but the output of the compilation process. Before code can be disassembled and reverse engineered, it helps to have at least a basic understanding of how it was translated from its original form into its current one.
When a Python source-code file is run, the file is not actually “run” per se. The compiler is first executed on the source file to turn it into bytecode, which is what ends up in a .pyc file. This entails multiple steps. One of the first is parsing the source file into an Abstract Syntax Tree (AST), which gives the rest of the compiler a structured form of the typed source code to work with. The next step runs the symbol-table pass over the AST, which determines the scope of each variable. Following this, the compiler generates a series of basic blocks from the AST to create the final representation of the code in memory, before it is optimized and passed to the Python Virtual Machine for execution.
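The stages described above can be poked at from Python itself. What follows is a minimal sketch, assuming a current Python 3 interpreter rather than the Python 2 used for the listings in this post; the source string and the file name sample01.py are only placeholders, and the standard-library ast and symtable modules expose a view of the parser and symbol-table passes rather than the compiler's internal data structures.

```python
import ast
import symtable

# Placeholder source, loosely based on the sample used later in this post.
SOURCE = "i = 4\nx = 2\ndef fn(a, b):\n    return a * b + 7\n"

# Step 1: parse the source text into an Abstract Syntax Tree.
tree = ast.parse(SOURCE, filename="sample01.py")
top_level_nodes = [type(node).__name__ for node in tree.body]

# Step 2: the symbol-table pass records which names live in which scope.
table = symtable.symtable(SOURCE, "sample01.py", "exec")
module_symbols = sorted(table.get_identifiers())
fn_table = table.get_children()[0]  # the only nested scope here is fn
fn_symbols = sorted(fn_table.get_identifiers())

# Step 3: compile the AST into a code object, the in-memory form of bytecode.
code = compile(tree, "sample01.py", "exec")

print(top_level_nodes)
print(module_symbols)
print(fn_symbols)
print(type(code).__name__)
```

The basic-block and optimization stages are internal to the compiler and are not exposed this way, so the sketch jumps straight from the symbol table to the finished code object.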
Luckily, Python already provides utilities in the standard library to assist in reading compiled Python bytecode, so third-party tools do not have to be downloaded, installed, and tested for accuracy (the standard-library tools always match the interpreter they ship with). The marshal module is used to serialize and deserialize code objects. This is how a .pyc file is loaded into memory for execution in the Python Virtual Machine. The load function of this module reads binary data and reconstructs it as a Python object. To create the .pyc file, the dump function serializes the code object into a binary representation that Python can read back, which is then written to disk. The companion dis module takes Python objects and displays their bytecode in disassembled form. This is invaluable to the budding reverse engineer, as it provides a standard way to display the compiled code in a human-readable format!
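As a rough illustration of how the two modules fit together, the sketch below serializes a code object with marshal, reads it back, and disassembles the result with dis. It assumes Python 3, uses the in-memory dumps/loads variants instead of a file, and the tiny source string is a placeholder; the opcode names printed will differ between Python versions.

```python
import dis
import io
import marshal

# Placeholder source; compile() gives us a code object to work with.
SOURCE = "def fn(a, b):\n    return a * b + 7\n"
code = compile(SOURCE, "sample01.py", "exec")

# marshal.dumps turns the code object into bytes, marshal.loads reverses it.
blob = marshal.dumps(code)
restored = marshal.loads(blob)

# dis.dis writes a human-readable listing; capture it in a string buffer.
buffer = io.StringIO()
dis.dis(restored, file=buffer)
listing = buffer.getvalue()

print(len(blob), "bytes of marshaled data")
print(listing)
```

Note that marshal's format is tied to the interpreter version, so data dumped by one Python release is not guaranteed to load in another.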
Inspecting the code object returned by marshal.load (since this is what Python runs when executing a .pyc file) yields useful attributes for reverse engineering the program. Namely, a tuple of the constant values and strings used (code.co_consts), variable names (code.co_names), and most importantly the bytecode itself! Disassembly can be generated by calling dis.disassemble() on the object (in this case the
code object returned from
Before going any further, it is important to note that marshal.load cannot be called directly on a freshly opened .pyc file. The call will raise ValueError: bad marshal data (unknown type code). This was quite confusing at first; however, I found a blog post written by Ned Batchelder in 2008 explaining this error. When a Python file is compiled, the marshaled code object is prepended with a 4-byte magic number indicating the version of Python the file is compatible with, and a 4-byte timestamp field that tracks the last modification time of the source file, so that it can be recompiled if the source file is updated. (This 8-byte header is the Python 2 layout used in this post; Python 3.3 and later use a longer header.)
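Skipping the header before calling marshal.load might look like the sketch below. It is written for Python 3, where the header has grown: to the best of my knowledge it is 12 bytes from 3.3 and 16 bytes from 3.7, against the 8 bytes of the Python 2 layout described here. The helper names pyc_header_size and load_pyc are made up for this example, and the sample file is created in a temporary directory.

```python
import importlib.util
import marshal
import os
import py_compile
import sys
import tempfile

def pyc_header_size(version=sys.version_info):
    """Return how many header bytes precede the marshaled code object."""
    if version >= (3, 7):
        return 16  # magic, flags, then mtime + source size (or a source hash)
    if version >= (3, 3):
        return 12  # magic, mtime, source size
    return 8       # magic, mtime: the Python 2 layout described in this post

def load_pyc(path):
    """Read a .pyc file and return (magic bytes, code object)."""
    with open(path, "rb") as handle:
        header = handle.read(pyc_header_size())
        code = marshal.load(handle)  # works now that the header is consumed
    return header[:4], code

# Build a throwaway .pyc so the example is self-contained.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "sample01.py")
with open(source_path, "w") as handle:
    handle.write("i = 4\nx = 2\n")
pyc_path = py_compile.compile(source_path, cfile=os.path.join(workdir, "sample01.pyc"))

magic, code = load_pyc(pyc_path)
print(magic == importlib.util.MAGIC_NUMBER)
print(code.co_names)
```

A .pyc can only be loaded this way by the same Python version that wrote it, which is exactly what the magic number is there to check.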
One last thing to note is that Python won’t generally save the compiled results for a program unless it is imported as a module rather than run as a script. To compile a single file, it can simply be imported from another module. A quick one-liner to do this is python -c "import filename", replacing filename with the name of the Python file to compile (without the .py extension). For larger compilations, Python includes a module that can be invoked to recursively compile all the files in a directory. This is called via python -m compileall ., where . is the directory you want compiled. For these demos, the former method will be used due to the simplicity of the code.
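The same directory-wide compilation can also be driven from a script. This is a small sketch assuming Python 3, where the compiled files land in a __pycache__ subdirectory rather than next to the source as in Python 2; the file names alpha.py and beta.py are placeholders written into a temporary directory.

```python
import compileall
import importlib.util
import os
import tempfile

# Create a scratch directory with two trivial source files.
workdir = tempfile.mkdtemp()
sources = []
for name in ("alpha.py", "beta.py"):
    path = os.path.join(workdir, name)
    with open(path, "w") as handle:
        handle.write("value = 1\n")
    sources.append(path)

# Equivalent in spirit to "python -m compileall ." run inside workdir.
# quiet=1 suppresses the per-file listing but still reports errors.
succeeded = compileall.compile_dir(workdir, quiet=1)

# cache_from_source reports where this interpreter stores each .pyc file.
pyc_paths = [importlib.util.cache_from_source(path) for path in sources]
found = [os.path.exists(path) for path in pyc_paths]

print(bool(succeeded), found)
```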
With this knowledge in hand, we can finally begin the process of taking a .pyc file apart and seeing what it looks like from an internal perspective. I will be using Eric Snow’s inspect_pyc program to take the file apart and look at the different code segments. This program takes the .pyc file and displays the code segments it finds, their attributes, and the disassembly of each one.
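To give a feel for what such a tool has to do, here is a minimal sketch of the core idea: nested functions show up as code objects inside the parent's co_consts, so a recursive walk finds every code segment. This is not Eric Snow's implementation; walk_code is a name invented for this example, and the code targets Python 3 and compiles the sample from source instead of reading a .pyc file.

```python
import types

# The sample program used in this post, compiled in memory.
SOURCE = """i = 4
x = 2
if i == x:
    print("Uhoh")

def fn(a, b):
    return a * b + 7

print(fn(i, x))
"""

def walk_code(code, depth=0):
    """Yield (depth, code object) for a code object and all nested ones."""
    yield depth, code
    for const in code.co_consts:
        # Function bodies are stored as constants of the enclosing code object.
        if isinstance(const, types.CodeType):
            yield from walk_code(const, depth + 1)

module_code = compile(SOURCE, "sample01.py", "exec")
for depth, code in walk_code(module_code):
    indent = "    " * depth
    print("{}co_name:     {!r}".format(indent, code.co_name))
    print("{}co_argcount: {}".format(indent, code.co_argcount))
    print("{}co_names:    {!r}".format(indent, code.co_names))
    print("{}co_varnames: {!r}".format(indent, code.co_varnames))
```

The exact contents of co_consts and co_names vary between Python versions, so the values printed here will not match the Python 2 listing line for line.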
The simple Python program I will be taking apart:
    i = 4
    x = 2
    if i == x:
        print("Uhoh")

    def fn(a,b):
        return a * b + 7

    print(fn(i, x))
Output from the inspect_pyc utility:
    root@1ec07f3af7cf:/opt# python inspect_pyc.py sample01.pyc
    ## inspecting pyc file ##
    filename:     sample01.pyc
    magic number: 0x(03 f3 0d 0a)
    code
       co_argcount:        0
       co_cellvars:        ()
       co_filename:        'sample01.py'
       co_firstlineno:     1
       co_flags:           0x00040
       co_freevars:        ()
       co_lnotab:          '\x06\x01\x06\x01\x0c\x01\x08\x02\t\x03'
       co_name:            '<module>'
       co_names:           ('i', 'x', 'fn')
       co_nlocals:         0
       co_stacksize:       3
       co_varnames:        ()
       co_consts
          0 4
          1 2
          2 'Uhoh'
          3 (code object)
             co_argcount:        2
             co_cellvars:        ()
             co_filename:        'sample01.py'
             co_firstlineno:     6
             co_flags:           0x00043
             co_freevars:        ()
             co_lnotab:          '\x00\x01'
             co_name:            'fn'
             co_names:           ()
             co_nlocals:         2
             co_stacksize:       2
             co_varnames:        ('a', 'b')
             co_consts
                0 None
                1 7
             co_code
                7c 00 00 7c 01 00 14 64 01 00 17 53
             disassembled:
                  7           0 LOAD_FAST                0 (a)
                              3 LOAD_FAST                1 (b)
                              6 BINARY_MULTIPLY
                              7 LOAD_CONST               1 (7)
                             10 BINARY_ADD
                             11 RETURN_VALUE
          4 None
       co_code
          64 00 00 5a 00 00 64 01 00 5a 01 00 65 00 00 65 01 00 6b 02 00 72 20 00 64 02 00 47 48 6e 00 00 64 03 00 84 00 00 5a 02 00 65 02 00 65 00 00 65 01 00 83 02 00 47 48 64 04 00 53
       disassembled:
            1           0 LOAD_CONST               0 (4)
                        3 STORE_NAME               0 (i)

            2           6 LOAD_CONST               1 (2)
                        9 STORE_NAME               1 (x)

            3          12 LOAD_NAME                0 (i)
                       15 LOAD_NAME                1 (x)
                       18 COMPARE_OP               2 (==)
                       21 POP_JUMP_IF_FALSE       32

            4          24 LOAD_CONST               2 ('Uhoh')
                       27 PRINT_ITEM
                       28 PRINT_NEWLINE
                       29 JUMP_FORWARD             0 (to 32)

            6     >>   32 LOAD_CONST               3 (<code object fn at 0x7f7efaeafbb0, file "sample01.py", line 6>)
                       35 MAKE_FUNCTION            0
                       38 STORE_NAME               2 (fn)

            9          41 LOAD_NAME                2 (fn)
                       44 LOAD_NAME                0 (i)
                       47 LOAD_NAME                1 (x)
                       50 CALL_FUNCTION            2
                       53 PRINT_ITEM
                       54 PRINT_NEWLINE
                       55 LOAD_CONST               4 (None)
                       58 RETURN_VALUE
    ## done inspecting pyc file ##
    root@1ec07f3af7cf:/opt#
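To make the co_code hex dump in the listing above less opaque, the sketch below decodes the twelve bytes of fn by hand. It assumes the Python 2.7 encoding that produced this output: one byte per opcode, followed by a two-byte little-endian argument for opcodes numbered 90 or higher. The opcode table covers only the five instructions fn uses, and decode is a name invented for this example; newer Python versions encode bytecode differently.

```python
# Opcode numbers for the few Python 2.7 instructions that fn uses.
OPNAMES = {
    0x14: "BINARY_MULTIPLY",
    0x17: "BINARY_ADD",
    0x53: "RETURN_VALUE",
    0x64: "LOAD_CONST",
    0x7C: "LOAD_FAST",
}
HAVE_ARGUMENT = 90  # in Python 2.7, opcodes >= 90 carry a 2-byte argument

def decode(hex_dump):
    """Turn a co_code hex dump into a list of (offset, name, argument)."""
    raw = bytearray.fromhex(hex_dump)
    instructions = []
    offset = 0
    while offset < len(raw):
        opcode = raw[offset]
        name = OPNAMES.get(opcode, "<unknown 0x%02x>" % opcode)
        if opcode >= HAVE_ARGUMENT:
            # The argument is an index into co_consts, co_varnames, etc.
            arg = raw[offset + 1] | (raw[offset + 2] << 8)
            instructions.append((offset, name, arg))
            offset += 3
        else:
            instructions.append((offset, name, None))
            offset += 1
    return instructions

FN_CO_CODE = "7c 00 00 7c 01 00 14 64 01 00 17 53"
for offset, name, arg in decode(FN_CO_CODE):
    print(offset, name, "" if arg is None else arg)
```

The offsets and arguments it prints line up with the disassembled column for fn: the arguments 0 and 1 after LOAD_FAST are positions in co_varnames, and the 1 after LOAD_CONST is a position in co_consts.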
The first code chunk that the analysis tool spits out contains the attributes that the code object carries for that segment. Most of these do not have much relevance for basic reverse engineering. The most important are
co_consts. These will give you the number of arguments the function takes (where applicable), the names of the variables contained within the function, and the constant values that the function needs. The presence of
co_consts is surprising, though not unwelcome in the slightest! Having a table of the constant values in a function is useful. Instead of encoding the values into the opcodes, values are referenced from offsets in their respective
co_consts array, same for variable names referenced from
To understand what is happening here, it is important to know how the Python Virtual Machine (PVM) works. The PVM is a stack-oriented architecture. If you are not familiar with how a stack works in computing, the second document in “More Reading”, titled “Executing Code in the Python Virtual Machine”, contains a great deal of information on how this works, and it would be useful to spend about 10 minutes reading the first two sections.
Every operation in Python depends on the stack. As such, nearly all the operations listed above are either pulling data off the stack or putting data on the stack. LOAD operations take a value, be it a constant, the value bound to a variable name, a function object, or anything of the sort, and push it onto the stack. STORE operations, on the other hand, pop a value from the stack and bind it to a variable. Mathematical operations, for example BINARY_ADD, pop two values, perform the operation on them, and push the result back onto the stack. These operations are simple enough and present little challenge in deciphering their meaning.
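The push and pop behavior described above can be traced with a toy interpreter. The sketch below replays the six instructions of fn from the listing on an ordinary Python list used as the stack. It is an illustration of the idea only, not how the real virtual machine is implemented, and run_fn is a name invented for this example.

```python
def run_fn(a, b):
    """Replay fn's bytecode on an explicit stack and return the result."""
    fast_locals = (a, b)  # values for co_varnames ('a', 'b')
    consts = (None, 7)    # fn's co_consts from the listing
    program = [
        ("LOAD_FAST", 0),
        ("LOAD_FAST", 1),
        ("BINARY_MULTIPLY", None),
        ("LOAD_CONST", 1),
        ("BINARY_ADD", None),
        ("RETURN_VALUE", None),
    ]
    stack = []
    for name, arg in program:
        if name == "LOAD_FAST":
            stack.append(fast_locals[arg])  # push a local variable
        elif name == "LOAD_CONST":
            stack.append(consts[arg])       # push a constant
        elif name == "BINARY_MULTIPLY":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)      # pop two, push the product
        elif name == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)      # pop two, push the sum
        elif name == "RETURN_VALUE":
            return stack.pop()              # top of stack is the result
    raise RuntimeError("program ended without RETURN_VALUE")

print(run_fn(4, 2))  # 4 * 2 + 7 = 15
```

The stack never holds more than two values at once here, which matches the co_stacksize of 2 reported for fn.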
And that’s about all there is to taking apart a compiled Python file! Learning the Python bytecode language would be an important step for those wishing to continue with this, and practicing taking apart various code structures to see how they are laid out in memory would also be productive. The extra readings and references below should provide a solid starting point for a budding reverse engineer.