Multi-file development

Just like you really shouldn’t put your entire program inside main, once your programs start getting big enough, you really shouldn’t put your entire program inside one file. C supports programs that consist of multiple files, but in a really weird, old-fashioned way (of course). C dates from a much earlier time when compilers were far more limited in their abilities for practical reasons, so we have to use some very strange workarounds to make it work.

Some of this is just a preview for stuff we’ll be talking about later in the course. So don’t worry if you don’t quite get it now, because we’ll cover it again later.

C’s compilation model

In Java, you run javac on a .java file, and this produces a .class file which contains machine code for the JVM. When you run java, it loads that .class file and any other needed .class files to run your program. Something similar happens in C, but at different times.

The process of compiling one file

Here’s something to keep in mind: gcc is not the compiler. I know, the name originally stood for “GNU C Compiler” (these days it’s the “GNU Compiler Collection”), but the gcc program itself is not the compiler. gcc is just an orchestrator (usually called a “compiler driver”) that runs several other programs in sequence to automate the compilation of C programs.

To compile a C file, gcc runs cc1, which is the real compiler. (Despite the name, cc is usually just another name for the gcc driver on Linux, not a separate compiler.) What does this compiler output? Assembly code! Yes, really! It outputs a text source file of assembly code. I tried it on the 1_hello_world.c example:

/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -masm=intel 1_hello_world.c

And it produced a file 1_hello_world.s (.s is a common file extension for assembly code), which contains some x86-64 assembly code. Try it yourself!

Well now we need to assemble it into machine code, right? That’s what the as(sembler) program does:

as -o 1_hello_world.o 1_hello_world.s

This produces an object file named 1_hello_world.o. An object file is kind of like a Java .class file - it contains machine code, but it’s not an executable file itself. But unlike Java, we don’t have an equivalent of java to run the object file. Instead, there’s one more step: we need to link the object file into an executable.

The ld program is what links object files (and libraries) into executables (and shared libraries). The actual command for invoking ld on this 1_hello_world.o file is way too long to show here, so let’s summarize with a diagram instead:

This is what gcc does for you every time you run it! The orange rectangles are source code text files. The purple rectangles are programs in the compilation toolchain. The blue rectangles are object and library files, and the green rectangle is an actual executable program.

Future diagrams will leave out the assembly source and assembler step, so you can imagine that the C compiler’s job is to convert .c source code files into .o object files.

What about Java? Does it have a linking step? Yes! Actually it has many linking steps that happen at runtime instead of before the program runs. That’s one of the jobs of the java VM itself. If you’ve ever heard of “classpath” and “class loader”, that’s part of Java’s linking stuff.

What happens when you compile multiple files?

The whole reason I spent all this time explaining what happens with one file is because this is what happens if you compile multiple files, like if you ran gcc one.c two.c three.c:

Yes, that’s right, the C compiler is run three times, and each time it only compiles one .c file. Here’s the big thing that you need to understand:

Every .c file is compiled completely independently of every other .c file.
Therefore code in one .c file cannot see the code in any other .c file.

The pieces of a program do not actually come together until the linking step, and the linker doesn’t know anything about C! It’s working with machine code!

OK like what the hell man? Why is it like this??

When C was developed, this was all totally normal. At the time, computers had like a few dozen KILOBYTES of memory at most and it was PHYSICALLY IMPOSSIBLE to fit multiple source code files in memory at once so this was just the most natural way of breaking things up. But now we’re in the year 20XX and we’re still doing this. Cause it works. Kinda.

Probably one of the most enduring parts of C’s legacy isn’t so much the language itself but this compilation and linking model (and the machine code ABIs that C uses). Virtually every operating system today expects that you are building and running executables in this way. We’ve piled all kinds of hacks onto the linking step to support “new” languages (like C++, which is pushing 40 years old at this point), but it’s still mostly the same thing that we were doing in 1971.

The advantage to this linking model is that the linker doesn’t really care where the machine code comes from. So you can mix together code from multiple programming languages in the same executable, as long as their compiler outputs machine code that conforms to the C ABI!

In the Java ecosystem, something similar has happened - there are now multiple languages which target the JVM. Their compilers output .class files which behave just like the ones javac produces, and the JVM doesn’t know or care that the original code was written in Scala or Kotlin or Clojure or whatever.

The preprocessor to the rescue

In order to work around this limitation, early on the C toolchain acquired an extra first step: the preprocessor. This is a plain text processing step that essentially performs “automatic copy and paste” on the input source code.

You know how you write #include <stdio.h> at the top of your programs? Every line that starts with # is actually a command (called a “directive”) to the preprocessor, not the compiler. The compiler never sees them. So what does #include do? It does the dumbest thing possible: it copies and pastes the contents of the entire file right there.

So if you have a file two.h that contains this (just ignore the #pragma once for now):

#pragma once
void my_function();

and a file one.c that contains:

#include "two.h"

int main() {
    my_function();
    return 0;
}

Then the preprocessor will convert one.c into something like this before handing it off to the compiler:

void my_function();

int main() {
    my_function();
    return 0;
}

So now, when you call my_function in main, the compiler knows its signature. You can actually see the output of the preprocessor with gcc -E, but be warned, it can be gigantic. For example, if you do this:

gcc -E 1_hello_world.c

It will print out 745 lines of code (the exact number varies from system to system). For a hello world program. Because that’s all of stdio.h, plus everything stdio.h itself includes.

Yeah.
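The same "dumb text replacement" idea applies to the other directives too. Here is a minimal sketch using #define. The constant and macro names (SECONDS_PER_MINUTE, MINUTES_TO_SECONDS) are made up for this illustration and don't come from any real header:

```c
/* Hypothetical constant and macro, invented for this example. */
#define SECONDS_PER_MINUTE 60
#define MINUTES_TO_SECONDS(m) ((m) * SECONDS_PER_MINUTE)

int minutes_to_seconds(int minutes) {
    /* The preprocessor rewrites the next line as plain text.
       What the compiler actually sees is: return ((minutes) * 60); */
    return MINUTES_TO_SECONDS(minutes);
}
```

If you run gcc -E on a file containing this, the #define lines should be gone from the output and the macro use should be replaced by its expansion, because the compiler never sees the directives themselves.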

Headers: bridging the gap between C files

Notice in the example above, I have one.c including two.h, which only has the prototype for my_function. So where the heck is the code? In two.c of course!

#include <stdio.h>
#include "two.h"

void my_function() {
    printf("Hello, I'm in another file!\n");
}

The way this all works is: two.h is #included in both one.c and two.c, so both times the compiler runs, it knows of the existence of my_function. Then, during the linking step, the linker figures out that main from one.o is trying to call my_function from two.o and links the caller and callee together. That’s why it’s called a linker.
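To make the roles concrete, here is a rough sketch that squashes the pieces into a single file so it can be compiled on its own. The function names (add_two, compute) are hypothetical, and in a real project the three parts would live in separate files, with the linker connecting the call to the definition:

```c
/* --- what a header like two.h would contribute (pasted in by #include) --- */
int add_two(int x);   /* prototype only: tells the compiler the signature */

/* --- what a file like one.c would contain --- */
int compute(void) {
    /* The compiler accepts this call because it has seen the prototype,
       even though it has not seen the function body yet. */
    return add_two(40);
}

/* --- what a file like two.c would contain --- */
int add_two(int x) {
    return x + 2;
}
```

In the real multi-file layout, the compiler would never see the body of add_two while compiling the caller. It would only leave a note in the object file saying "call something named add_two", and the linker would fill in the rest.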

Here’s a diagram of what’s happening (the red dashed arrows mean “is included by” and indicate a copy-and-paste performed by the preprocessor):

Header files and what goes in them

These .h files are called header files and are pretty much unique to C and C++. Compilers for most other programming languages can process multiple files at once and therefore don’t need header files - they just extract the information they need from all the source code files.

In most C projects, each source.c file will have a corresponding source.h file. This is such a common arrangement that IDEs will automatically create both for you, and code editors often have a shortcut to swap between the .c and .h files.

You can think of the header as describing the public interface of its .c file. That means it advertises what is available, but it does not actually contain any code. All the code (the private implementation) goes into the .c file. In this way you can kind of sort of think of it like the relationship between a Java interface (says what methods are available) and a class that implements that interface (implements those methods), but without the Java OOP system attached.

C doesn’t have any public or private the way Java does, but it can kind of do something similar:

- To make something public, put it (or if it’s a function, its prototype) in the header file.
- To make something private… don’t put it in the header. (And if it’s a function, declare it static too.)
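As a small sketch of that public/private split, here is what the .c side might look like. Both function names are hypothetical, and the assumption is that only scale_to_percent would have its prototype in the matching header:

```c
/* Private helper: `static` gives it internal linkage, so other object
   files cannot link against it. Its prototype stays out of the header. */
static int clamp_percent(int value) {
    if (value < 0)   return 0;
    if (value > 100) return 100;
    return value;
}

/* Public function: its prototype,
       int scale_to_percent(int part, int whole);
   would go in the header so other .c files can call it. */
int scale_to_percent(int part, int whole) {
    if (whole == 0) return 0;   /* avoid dividing by zero */
    return clamp_percent(part * 100 / whole);
}
```

Another .c file could define its own, completely different static function named clamp_percent, and the two would never collide at link time.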

DON'T put these in headers:
- Any code, ever. (unless you’re using C++ and it’s a template or an inline function kasdjlfjklasdf)
- Prototypes for functions that you want to be private to the .c file.
- Global variables. Not just because they’re bad, but because it’ll mess up linking really badly.
- Anything else that you want to be private to the .c file.

DO put these in headers:
- Prototypes for public functions that you want other .c files to use.
- Public structs.
- Public enums.
- Public typedefs.
- Public #defines (used for constants and macros).
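Here is a rough sketch of what a header following those rules could look like, next to the .c file it describes. Everything in it (shapes.h, Shape, ShapeKind, MAX_SIDES, shape_perimeter) is invented for illustration, and both halves are shown in one block only so it can be compiled as-is:

```c
/* ---- would live in shapes.h (a hypothetical header) ---- */
/* The real header would start with: #pragma once */

#define MAX_SIDES 8                                        /* public constant */

typedef enum { SHAPE_TRIANGLE, SHAPE_SQUARE } ShapeKind;   /* public enum */

typedef struct {                                           /* public struct + typedef */
    ShapeKind kind;
    int side_length;
} Shape;

int shape_perimeter(Shape s);                              /* public prototype, no body */

/* ---- would live in shapes.c, after #include "shapes.h" ---- */

/* Private helper: static, and deliberately not mentioned in the header. */
static int side_count(ShapeKind kind) {
    return kind == SHAPE_TRIANGLE ? 3 : 4;
}

int shape_perimeter(Shape s) {
    return side_count(s.kind) * s.side_length;
}
```

Notice that the header half contains only declarations, types, and a #define: nothing in it produces machine code on its own.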

Header FAQs

“So you mean every time I want to add a function to my .c file, I have to put its prototype in the header?”
- Only if you want other .c files to be able to call that function. If you don’t, don’t.
- But yes it means you have to duplicate that info in two different files. IDEs probably help with this.
“I included "header.h" which has all the functions I want to use but now I’m getting all these undefined reference errors. Why?”
- Those are linker errors, not compiler errors. Notice each error starts with /bin/ld.
- The header file just tells the compiler that those functions exist. It doesn’t tell the linker anything. Those functions are in another .o file so the linker goes “I dunno what these are lol”
- You need to list all the .c files on the gcc line, so that it will properly give all the .o files to the linker.
“What happens if I change a function’s signature in the .c file and forget to make the .h file match?”
- Bad things.
- If you have all the compiler warnings set to maximum and you include source.h in source.c, the compiler will probably catch your mistake and tell you that the function declaration doesn’t match the prototype.
- But if you don’t, well… have fun with bizarre runtime bugs that you can’t figure out! Because the linker has no idea what a function signature is and will gladly link together a caller and a callee who have different ideas of what the arguments and return value are supposed to be.
“What if I want a struct with private fields?”
- It’s kind of possible, but it’s all-or-nothing.
- If you put the whole struct definition in the header, all its fields are public.
- Instead, put only this in the header (e.g. mystery.h): typedef struct Mysterious Mysterious;
  - This is a “struct prototype” (officially, a forward declaration of an incomplete type). It says “there is a struct named Mysterious” but doesn’t say what’s in it.
- Then in mystery.h you can declare functions that take and return a Mysterious*, and that’s the only way other files can use a Mysterious. They can’t even use sizeof(Mysterious)!
- Finally in mystery.c, you can write the private definition of struct Mysterious { ... }; and use its fields as normal. Boom, private to one file.
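The private-struct trick from the last question can be sketched like this. The three function names (mysterious_create, mysterious_reveal, mysterious_destroy) and the secret field are assumptions made up for this example, and the two halves are shown in one block only so it compiles on its own:

```c
#include <stdlib.h>

/* ---- would live in mystery.h ---- */
typedef struct Mysterious Mysterious;        /* name only, no fields */

Mysterious *mysterious_create(int secret);   /* hypothetical public functions */
int mysterious_reveal(const Mysterious *m);
void mysterious_destroy(Mysterious *m);

/* ---- would live in mystery.c ---- */
struct Mysterious {
    int secret;                              /* visible only inside mystery.c */
};

Mysterious *mysterious_create(int secret) {
    Mysterious *m = malloc(sizeof *m);       /* sizeof works here: the type is complete */
    if (m != NULL) {
        m->secret = secret;
    }
    return m;                                /* NULL if allocation failed */
}

int mysterious_reveal(const Mysterious *m) {
    return m->secret;
}

void mysterious_destroy(Mysterious *m) {
    free(m);
}
```

With the real two-file layout, a file that only includes mystery.h can hold and pass around a Mysterious* but gets a compile error if it tries m->secret or sizeof(Mysterious). That is also why creation has to go through a function like mysterious_create: no other file knows how many bytes to allocate.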