Just like you really shouldn’t put your entire program inside `main`, once your programs start getting big enough, you really shouldn’t put your entire program inside one file. C supports programs that consist of multiple files, but in a really weird, old-fashioned way (of course). C dates from a much earlier time when compilers were far more limited in their abilities for practical reasons, so we have to use some very strange workarounds to make it work.
Some of this is just a preview for stuff we’ll be talking about later in the course. So don’t worry if you don’t quite get it now, because we’ll cover it again later.
C’s compilation model
In Java, you run `javac` on a `.java` file, and this produces a `.class` file which contains machine code for the JVM. When you run `java`, it loads that `.class` file and any other needed `.class` files to run your program. Something similar happens in C, but at different times.
The process of compiling one file
Here’s something to keep in mind: `gcc` is not the compiler. I know, it’s called the “GNU C compiler” but the `gcc` program itself is not the compiler. `gcc` is just an orchestrator that runs several other programs in a sequence in order to automate the compilation of C programs.
To compile a C file, `gcc` ends up calling `cc1`, which is the real compiler. (You might expect `cc` to be the compiler, but on most Linux systems `cc` is just another name for the same orchestrator.) What does this compiler output? Assembly code! Yes, really! It outputs a text source file of assembly code. I tried it on the `1_hello_world.c` example:
/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -masm=intel 1_hello_world.c
And it produced a file `1_hello_world.s` (`.s` is a common file extension for assembly code), which contains some x86-64 assembly code. Try it yourself!
Well now we need to assemble it into machine code, right? That’s what the `as`(sembler) program does:
as -o 1_hello_world.o 1_hello_world.s
This produces an object file named `1_hello_world.o`. An object file is kind of like a Java `.class` file - it contains machine code, but it’s not an executable file itself. But unlike Java, we don’t have an equivalent of `java` to run the object file. Instead, there’s one more step: we need to link the object file into an executable.
The `ld` program is what links object files (and libraries) into executables (and shared libraries). The actual command for invoking `ld` on this `1_hello_world.o` file is way too long to show here, so let’s summarize with a diagram instead:
This is what `gcc` does for you every time you run it! The orange rectangles are source code text files. The purple rectangles are programs in the compilation toolchain. The blue rectangles are object and library files, and the green rectangle is an actual executable program.
Future diagrams will leave out the assembly source and assembler step, so you can imagine that the C compiler’s job is to convert `.c` source code files into `.o` object files.
What about Java? Does it have a linking step? Yes! Actually it has many linking steps that happen at runtime instead of before the program runs. That’s one of the jobs of the `java` VM itself. If you’ve ever heard of “classpath” and “class loader”, that’s part of Java’s linking stuff.
What happens when you compile multiple files?
The whole reason I spent all this time explaining what happens with one file is because this is what happens if you compile multiple files, like if you ran `gcc one.c two.c three.c`:
Yes, that’s right, the C compiler `cc1` is run three times, and each time it only compiles one `.c` file. Here’s the big thing that you need to understand:
Every `.c` file is compiled completely independently of every other `.c` file.
Therefore code in one `.c` file cannot see the code in any other `.c` file.
The pieces of a program do not actually come together until the linking step, and the linker doesn’t know anything about C! It’s working with machine code!
OK like what the hell man? Why is it like this??
When C was developed, this was all totally normal. At the time, computers had like a few dozen KILOBYTES of memory at most and it was PHYSICALLY IMPOSSIBLE to fit multiple source code files in memory at once so this was just the most natural way of breaking things up. But now we’re in the year 20XX and we’re still doing this. Cause it works. Kinda.
Probably one of the most enduring parts of C’s legacy isn’t so much the language itself but this compilation and linking model (and the machine code ABIs that C uses). Virtually every operating system today expects that you are building and running executables in this way. We’ve piled all kinds of hacks onto the linking step to support “new” languages (like C++, which is pushing 40 years old at this point), but it’s still mostly the same thing that we were doing in 1971.
The advantage to this linking model is that the linker doesn’t really care where the machine code comes from. So you can mix together code from multiple programming languages in the same executable, as long as their compiler outputs machine code that conforms to the C ABI!
In the Java ecosystem, something similar has happened - there are now multiple languages which target the JVM. Their compilers output `.class` files which behave just like the ones `javac` produces, and the JVM doesn’t know or care that the original code was written in Scala or Kotlin or Clojure or whatever.
The preprocessor to the rescue
In order to work around this limitation, early on the C toolchain acquired an extra first step: the preprocessor. This is a plain text processing step that essentially performs “automatic copy and paste” on the input source code.
You know how you write `#include <stdio.h>` at the top of your programs? Every line that starts with `#` is actually a command (called a “directive”) to the preprocessor, not the compiler. The compiler never sees them. So what does `#include` do? It does the dumbest thing possible: it copies and pastes the contents of the entire file right there.
So if you have a file `two.h` that contains this (just ignore the `#pragma once` for now):
#pragma once
void my_function();
and a file `one.c` that contains:
#include "two.h"
int main() {
    my_function();
    return 0;
}
Then the preprocessor will convert `one.c` into something like this before handing it off to the compiler:
void my_function();
int main() {
    my_function();
    return 0;
}
So now, when you call `my_function` in `main`, the compiler knows its signature. You can actually see the output of the preprocessor with `gcc -E`, but be warned, it can be gigantic. For example, if you do this:
gcc -E 1_hello_world.c
It will print out 745 lines of code (on my machine, anyway; the exact count depends on your system). For a hello world program. Because that’s all of `stdio.h`.
Yeah.
Headers: bridging the gap between C files
Notice in the example above, I have `one.c` including `two.h`, which only has the prototype for `my_function`. So where the heck is the code? In `two.c` of course!
#include <stdio.h>
#include "two.h"
void my_function() {
    printf("Hello, I'm in another file!\n");
}
The way this all works is: `two.h` is `#include`d in both `one.c` and `two.c`, so both times the compiler runs, it knows of the existence of `my_function`. Then, during the linking step, the linker figures out that `main` from `one.o` is trying to call `my_function` from `two.o` and links the caller and callee together. That’s why it’s called a linker.
Here’s a diagram of what’s happening (the red dashed arrows mean “is included by” and indicate a copy-and-paste performed by the preprocessor):
Header files and what goes in them
These `.h` files are called header files and are pretty much unique to C and C++. Compilers for most other programming languages can process multiple files at once and therefore don’t need header files - they just extract the information they need from all the source code files.
In most C projects, each `source.c` file will have a corresponding `source.h` file. This is such a common arrangement that IDEs will automatically create both for you, and code editors often have a shortcut to swap between the `.c` and `.h` files.
You can think of the header as describing the public interface of its `.c` file. That means it advertises what is available, but it does not actually contain any code. All the code (the private implementation) goes into the `.c` file. In this way you can kind of sort of think of it like the relationship between a Java `interface` (says what methods are available) and a `class` that implements that interface (implements those methods), but without the Java OOP system attached.
C doesn’t have any `public` or `private` the way Java does, but it can kind of do something similar:
- to make something public, put it (or if it’s a function, its prototype) in the header file.
- to make something private… don’t put it in the header. (and if it’s a function, declare it `static` too.)
DON'T put these in headers | DO put these in headers |
---|---|
|
|
Header FAQs
- “So you mean every time I want to add a function to my `.c` file, I have to put its prototype in the header?”
  - Only if you want other `.c` files to be able to call that function. If you don’t, don’t.
  - But yes it means you have to duplicate that info in two different files. IDEs probably help with this.
- “I included `"header.h"` which has all the functions I want to use but now I’m getting all these `undefined reference` errors. Why?”
  - Those are linker errors, not compiler errors. Notice each error starts with `/bin/ld`.
  - The header file just tells the compiler that those functions exist. It doesn’t tell the linker anything. Those functions are in another `.o` file so the linker goes “I dunno what these are lol”
  - You need to list all the `.c` files on the `gcc` line, so that it will properly give all the `.o` files to the linker.
- “What happens if I change a function’s signature in the `.c` file and forget to make the `.h` file match?”
  - Bad things.
  - If you have all the compiler warnings set to maximum and you include `source.h` in `source.c`, the compiler will probably catch your mistake and tell you that the function declaration doesn’t match the prototype.
  - But if you don’t, well… have fun with bizarre runtime bugs that you can’t figure out! Because the linker has no idea what a function signature is and will gladly link together a caller and a callee who have different ideas of what the arguments and return value are supposed to be.
- “What if I want a `struct` with private fields?”
  - It’s kind of possible, but it’s all-or-nothing.
  - If you put the whole struct definition in the header, all its fields are public.
  - If you put this in the header (e.g. `mystery.h`): `typedef struct Mysterious Mysterious;`
    - This is a “struct prototype.” It says “there is a struct named `Mysterious`” but doesn’t say what’s in it.
  - Then in `mystery.h` you can declare functions that take and return a `Mysterious*`, and that’s the only way other files can use a `Mysterious`. They can’t even use `sizeof(Mysterious)`!
  - Finally in `mystery.c`, you can write the private definition of `struct Mysterious { ... };` and use its fields as normal. Boom, private to one file.