This lab is to prepare you for project 3. I’m giving you a shared object file, but not the source code that was used to make it. You will dynamically load that shared object, then figure out how to call the three functions that are in it by reverse-engineering the assembly.
This lab starts off really easy, but just wait…
Before you get started: making gdb use the Intel syntax
gdb defaults to using the godawful AT&T syntax for x86, but you can change it. While logged into your VM, do this:
nano ~/.gdbinit
Inside that file, write this exactly:
set disassembly-flavor intel
and save. Now run gdb. If you see this:
/afs/pitt.edu/home/a/b/abc123/.gdbinit:1: Error in sourced command file:
then you made a typo. Go fix it.
Now, when you view disassembly in gdb, it will match the slides I gave you and will be way easier to understand overall.
Starting off
- wget these materials in your VM
  - unzip
  - rename abc123_lab7.c with your username
  - (I made compile.sh work no matter what _lab7.c is named so you don’t have to edit it this time)
- chmod +x compile.sh
- ./compile.sh to ensure it compiles properly
- ./lab7 should show:
run this like './lab7 ./mystery.so'
- ./lab7 ./mystery.so should show nothing!
Dynamic loading
At the top of _lab7.c, you’ll see #include <dlfcn.h>. This is the POSIX header for dynamic loading. It gives you access to four functions:
- void* dlopen(const char* filename, int flags)
  - loads a dynamic library.
- char* dlerror()
  - returns a string representation of the most-recent error caused by a dl function.
- void* dlsym(void* lib, const char* name)
  - gets the memory address of the symbol named name from the library lib (which was returned by dlopen).
- int dlclose(void* lib)
  - closes an opened library (we won’t be using this, but just so you know, it’s there).
Using these functions is actually really simple!
1. Loading a library
In main you’ll see // Delete this comment and write your code here.
Follow its instructions. Yes. Delete the comment. Why does everyone leave the comments. Why. Tell me why. EXPLAIN IT TO ME
- Do this:
void* lib = dlopen(argv[1], RTLD_NOW);
  - This loads the dynamic library whose filename was passed as the first argument to this program.
  - The RTLD_NOW tells the linker to do all the dynamic linking right now.
    - (The other mode is RTLD_LAZY which only performs linking as functions are called, which is useful if you load a HUGE dynamic library but only need a handful of things out of it.)
  - The return value is an opaque pointer. You’re not supposed to know what it points to. It’s just some abstract pointer to “a library.” You will pass that to the other dl functions.
    - It’s kinda like how FILE* is returned from fopen() and then passed to all the other file functions.
- Check if lib is NULL. If so,
  - Print out what dlerror() returns (it’s a string)
  - Call exit(1) to exit the program indicating an error occurred.
- Just do (void)lib; to make the compiler shut up for now.
If you compile this, it should behave like so:
$ ./lab7
run this like './lab7 ./mystery.so'
$ ./lab7 ajsfijasiofha
ajsfijasiofha: cannot open shared object file: No such file or directory
$ ./lab7 mystery.so
mystery.so: cannot open shared object file: No such file or directory
$ ./lab7 ./mystery.so
$
So when we give it a nonexistent file, it gives us an error. But also note that you have to write ./mystery.so in order for it to properly load the shared object. This is because without the ./, it instead looks for mystery.so in the system directories for shared objects, and it ain’t there.
2. Extracting symbols (function pointers) from the library
Now that you’ve successfully opened the library, you need to get symbols out of it. This is super easy.
- Remove the (void)lib; line you put to shut the compiler up.
- Do this.
void (*func1)() = dlsym(lib, "func1");
  - dlsym takes the library and the name of the symbol you want to look up.
    - It just… looks it up and returns a pointer to it.
    - In this case, we’re looking up a function symbol, so we put the result into a function pointer variable.
- Similarly to dlopen, dlsym returns NULL on failure. So, if func1 is NULL,
  - print out the dlerror() like before
  - exit(1) like before
    - gee do you think you should COPY AND PASTE that code?
    - or should you MAKE A FUNCTION???
- Repeat for func2 and func3.
  - Don’t forget to change the "func1" to "func2" in dlsym’s argument… aaa…
- Call all three functions like
func1(); func2(); func3();
  - You can technically write (*func1)() to call a function pointer but this just looks nicer
Now compile it, and run it like ./lab7 libz.so.1. libz.so is a popular compression and decompression library that is installed in your VM. The .1 at the end indicates the version to load, in this case version 1. You should see this:
$ ./lab7 libz.so.1
/lib/x86_64-linux-gnu/libz.so.1: undefined symbol: func1
$
That is, your program should complain about the lack of func1 and then exit. If it doesn’t do that, fix it.
Finally, call it on ./mystery.so a few times (the first two numbers will be random every time):
$ ./lab7 ./mystery.so
-594452135, -594187704, 0
no...
0
$ ./lab7 ./mystery.so
1373172057, 1373436488, 0
no...
0
$ ./lab7 ./mystery.so
-1165495975, -1165231544, 0
no...
0
Huh???
What’s going on?
./mystery.so contains 3 functions, func1, func2, and func3. You’ve successfully dynamically loaded them, but they’re malfunctioning when you call them because you didn’t pass the correct arguments to them.
Remember in 447 when you’d forget to put the arguments in the a registers before calling a function and it would do really weird stuff? That’s what’s happening here. These lines:
void (*func1)() = dlsym(lib, "func1");
//...
func1();
are wrong, because func1 is not a zero-argument function. All of these functions take 1 or more arguments!
Your task for the rest of the lab is:
- Figure out what arguments these three functions are expecting
- Change the function pointers in main to have the right number and types of arguments
- Change the calls in main to pass the right values for those arguments
Your goal is for your program to output this:
$ ./lab7 ./mystery.so
10, 20, 30
yes!
120
$
But by calling those functions, not just printing out those values :) I’m not that dumb :) the autograder is going to make sure you’re not just printing them out :) hahahaha :)))))))))
3. func1 and static analysis
There are two ways to reverse-engineer a piece of machine or assembly code:
- Static analysis, which means looking at the code and just figuring out what it does using your brain meats.
- This is better for small pieces of code that don’t do anything confusing.
- Dynamic analysis, which means figuring out what the code does by running it and seeing what happens.
- This is better for bigger pieces of code where you don’t really need to understand everything that’s happening.
Both techniques are useful in different situations, but for this lab, you can do everything with static analysis. The project will definitely be easier with some dynamic analysis ;)
So where do we start? With gdb! You can run gdb on shared objects too. No code will be run, but you can still use its tools to disassemble code, look at global variables, etc.
- Run gdb ./mystery.so.
- In gdb, do info functions.
  - This prints out a list of functions in the shared object.
  - Most of these are internal functions used to initialize and deinitialize global variables, but there are two groups of functions that stand out:
    - puts@plt, printf@plt, strcmp@plt: these are C standard library functions! The @plt stands for “procedure linkage table”. Basically it means these functions are coming from another shared library (libc.so to be exact) and aren’t actually in this file.
    - func1, func2, func3 - there they are.
- Now, do disas func1. This disassembles func1 (converts the machine code back into assembly language code.)
If you see this at the beginning of the code, you didn’t properly set up your ~/.gdbinit file as described at the beginning of the lab. You should never see any % in the disassembly.
0x0000000000001159 <+0>: endbr64
0x000000000000115d <+4>: push %rbp
0x000000000000115e <+5>: mov %rsp,%rbp
Okay. This is what you should see.
0x0000000000001159 <+0>: endbr64
0x000000000000115d <+4>: push rbp
0x000000000000115e <+5>: mov rbp,rsp
0x0000000000001161 <+8>: sub rsp,0x10
0x0000000000001165 <+12>: mov DWORD PTR [rbp-0x4],edi
0x0000000000001168 <+15>: mov DWORD PTR [rbp-0x8],esi
0x000000000000116b <+18>: mov DWORD PTR [rbp-0xc],edx
0x000000000000116e <+21>: mov ecx,DWORD PTR [rbp-0x8]
0x0000000000001171 <+24>: mov edx,DWORD PTR [rbp-0x4]
0x0000000000001174 <+27>: mov eax,DWORD PTR [rbp-0xc]
0x0000000000001177 <+30>: mov esi,eax
0x0000000000001179 <+32>: lea rax,[rip+0xe80] # 0x2000
0x0000000000001180 <+39>: mov rdi,rax
0x0000000000001183 <+42>: mov eax,0x0
0x0000000000001188 <+47>: call 0x1080 <printf@plt>
0x000000000000118d <+52>: nop
0x000000000000118e <+53>: leave
0x000000000000118f <+54>: ret
Some notes:
- endbr64 is a security feature and can be ignored.
- nop is a “no-op”, an instruction that does nothing, and can be ignored.
- DWORD PTR says this is performing a 32-bit load or store (see also the use of edi, esi etc.). Other sizes of loads and stores are indicated with: BYTE PTR (8-bit), WORD PTR (16-bit), and QWORD PTR (64-bit).
  - This is like how MIPS has e.g. lw, lh, lb (and ld on 64-bit MIPS)
Okay, now to figure out what this does. Reverse engineering is like solving a puzzle: you have some information about the code, but not all of it. So you kind of “push” that knowledge through the code a bit at a time, until you can understand all of it.
- You know the x86 calling convention - which registers are used for arguments, which for return values, what instructions are used to call and return, which registers are callee saved.
- You also know many of the C standard library functions and what their signatures are - strcmp, printf etc.
- Things like string constants can be extremely helpful.
Alright, let’s get started.
- Copy and paste the disassembly into a blank file in your code editor. You’re not going to save this file, you’re just doing this so that you can write notes around the code.
- Find the function prologue and epilogue, and put some visual separation between them and the body of the function.
- Go back to the calling convention slides to see what the prologue and epilogue look like.
- We don’t really care what the prologue and epilogue are doing; we’re just focusing on the juicy code in the body of the function.
- All the references to [rbp-whatever] are local variables.
  - Notice that the first thing the function does is put some values into local variables… what’s the significance of the registers that it’s storing into them? Go look at the calling conventionnnnnnnnn
    - remember that e.g. edi is just the lower 32 bits of rdi.
  - Once you know what those registers represent, you can come up with reasonable names for each of those stack locations (like you can replace [rbp-0x4] with [varname]).
  - So now ask yourself: how many arguments does this function expect? And how big (how many bytes) are they? That should give you some clue as to the types of those arguments.
- When the code switches from storing those variables to loading those variables, that’s because it’s moving on to the “next task.”
  - The next sequence of instructions ends with call. Gee, what do you think all the instructions before it are for?
    - Look at which registers it is setting, and compare that to the argument registers used in the calling convention. So how many arguments are being passed to this function?
    - Notice it says call 0x1080 <printf@plt> - gdb is helping you by saying “this line calls printf.”
    - You can ignore the mov eax, 0x0 line. It’s something to do with variadic functions.
    - You know what kind of thing is passed as the first argument to printf. And you see:
0x0000000000001179 <+32>: lea rax,[rip+0xe80] # 0x2000
0x0000000000001180 <+39>: mov rdi,rax
      - Yeah idk why the compiler decided to put the value in rax first either. Whatever.
      - The important bit is the [rip+0xe80] # 0x2000. This is computing an address relative to rip which is unimportant; gdb calculated that address and put it on the right: 0x2000. So this is putting the address 0x2000 into rax. It’s not loading a value from 0x2000, it’s passing the address itself.
Investigating address 0x2000.
Huh. The first argument to printf is the address 0x2000. You know that printf takes a string as its first argument. So… that means 0x2000 should be the address of a string, right?
Well let’s try printing the value at address 0x2000 by dereferencing it:
(gdb) p *0x2000
$1 = 539780133
(gdb)
Uh. Hm. That doesn’t look like a string. Well, it’s our fault for not telling gdb what kind of value is at 0x2000. It defaults to loading an int, but strings aren’t ints, they’re chars. So we can tell p to print a char with p/c:
(gdb) p/c *0x2000
$2 = 37 '%'
(gdb)
Ahaaa. That’s a character. And a character that you would expect to be passed to printf, right?
So let’s see what the next few characters are:
(gdb) p/c *0x2001
$3 = 100 'd'
(gdb) p/c *0x2002
$4 = 44 ','
(gdb)
%d, ..., yeah this looks like a printf format string alright. But printing out one character at a time is tedious. Thankfully there’s a command that prints out an entire string: x/s (which means “eXamine memory as a String”):
(gdb) x/s 0x2000
0x2000: "%d, %d, %d\n"
(gdb)
Oh well that’s just TOO EASY isn’t it. Well there you go! That’s the first argument to the printf call in func1! And this tells you exactly what types the arguments being passed are, hint hint.
Alright, you finish it off
Go back to your disassembly of func1 and try to write the C equivalent of the printf call, now that you know what the format string is. That tells you what types the local variables must be, and therefore what types the arguments to func1 must be.
Finally, in _lab7.c, you can update these lines:
// put the argument TYPES in here
// | (just the types separated by commas, no argument names)
// |
// v
void (*func1)( ) = dlsym(lib, "func1");
//...
// pass the argument VALUES in here to make it print "10, 20, 30"
// |
// v
func1( );
And done correctly, you should now get this when you run your lab7:
$ ./lab7 ./mystery.so
10, 20, 30
no...
0
If the numbers are printing out in a different order: yes, they are. But you’re the one calling the function from main. YOU get to decide what arguments to pass in. (Look carefully at the order in which the values are being printed in func1.)
4. func2 and control flow
Moving onto func2, you might notice something a little odd about this function. Where’s the prologue and epilogue? Well, I compiled this function with the optimization level set to 1. That removes some needless instructions but still keeps the code mostly like the original C.
The second odd thing about func2 is that the ret is in the middle?? I’m not sure why the compiler does this, but it put some of the function’s instructions after the ret and then has it jump back up before the ret, so it all works out in the end. It’s kinda ugly tho.
Last, this function has some control flow. It’s nothing too crazy, but it’s important that you learn how control flow looks in the disassembly. For example, you’ll see this line:
jne 0x11b9 <func2+41>
There are no labels because this was converted from machine code. But gdb gives you enough information to see where this jne is going: 0x11b9 is the address, and <func2+41> means “41 bytes into func2.” Every line of the function has that offset printed next to it, so you can see it’s coming down to this line:
0x00000000000011b9 <+41>: lea rdi,[rip+0xe5c] # 0x201c
See, there’s 0x00000000000011b9 <+41> at the beginning of the line. SO:
- Copy and paste the disassembly into your editor so you can take notes on it again.
- Find each of the control flow instructions. They’re easy to find in x86 cause they all start with j.
- Find where those instructions are going. Replace those cryptic addresses and offsets with labels like you’d use in assembly.
  - e.g. I’d replace jne 0x11b9 with jne _label2 and put _label2: before the 0x00000000000011b9 line.
  - this way, later, we can change those labels to meaningful names once we figure out what the code does.
- We’ve got some more function calls and more references to constants…
  - You know what strcmp takes and returns, and you know which register holds the return value.
  - Wait, where’s its first argument?
    - Well, did func2 change that register before calling strcmp? ;D
  - The jne after test could also be read as jnz (jump if not zero), hint hint.
- Try to figure out what C code would have produced this kind of control flow.
You should now have enough information to know how many arguments this function takes, what type(s) they are, and what value(s) to pass to make it say "yes!".
5. func3
This is a long function, but it’s not actually very complicated. This function was compiled without optimization so it does things in a fairly inefficient but easy-to-reverse-engineer way.
You know how to approach this by now! Tips:
- Just like before, find the prologue and epilogue and ignore them.
- Naming the local variables is very helpful for this one.
  - At first you might name them e.g. arg0, arg1, local0, local1 just so they look nicer.
  - But as you uncover more information about what they’re used for, you will be able to rename them to more meaningful names.
- cdqe is just a sign-extension instruction. It sign-extends the value in eax to fill rax. You don’t really need to know that to figure out this function.
- There is nothing weird or cryptic going on in this function. The control flow is something you use all the time, and parts of it should look familiar to you from CS 0447.
- In addition, the stuff inside the control flow should also look familiar in another way.
- Why would you ever multiply something by 4………? HMMMMMMMMMMMMMM
- Stop thinking about the assembly so literally and think more about “how would I do this same thing in C?”
- Remember that most high level language code becomes multiple assembly instructions. So try to find “clumps” or “groups” of instructions that are performing some higher-level purpose.
There are actually lots of things that you could pass to this function to get it to print out 120. But you will need to declare something in your main function…
Submission
You’ll submit your _lab7.c to gradescope as usual. The autograder will test your executable against other versions of mystery.so where the three functions all have the same signatures, but will do different things. Your program shouldn’t care what these functions do, just that they all take the same arguments.
The autograder may also give your program shared objects that don’t have one or more of the func1, func2, or func3 symbols, so you better make sure you’re checking the return values of dlsym and handling the errors properly!