## Exam format

• When/where
• During class, here, like normal
• 75 minutes (4:00-5:15 PM)
• Closed-note, no calculator
• You may not have any notes, cheat sheets, etc. while taking the exam
• The math on the exam has been designed to be doable either in your head or very quickly on paper (e.g. 2 x 1 digit multiplication); if you find yourself needing a calculator, you did something wrong
• Keep numbers in scientific notation, do not take them out of it until the end
• Avoid division when you can - do reciprocal first, then multiply by that
• I literally design the test questions so the reciprocals are easy to do
• Length
• Very much like the first exam.
• 75 minutes
• Topic point distribution
• It is not cumulative, omg
• More credit for earlier topics (e.g. AND, OR, multiplexer)
• Less credit for more recent ones (e.g. microcode, pipelining)
• More credit for things I expect you to know because of your experience (labs, project)
• VERY ROUGHLY:
• ~30% Logic (combinational and sequential)
• ~40% CPU design
• ~25% Performance
• ~5% Other
• Kinds of questions
• Very much like the first exam.

Remember, these are just the things that people asked about. There may be topics on the exam not on this list; and there may be topics on this list that are not on the exam.

• Combinational logic
• Anything that doesn’t have memory (latches, flip flops, registers, RAM)
• Includes gates (AND, OR, NOT etc), plexers, arithmetic computations
• Boolean expressions/functions
• Boolean inputs, one (or more) boolean output
• (multiple boolean outputs are really separate expressions)
• Basically, if you can represent it as a truth table, it’s a boolean expression
• Turning a truth table into a boolean expression is extremely straightforward:
1. find every row of the truth table where the output is 1
2. for each of those, write a term that is all the input variables ANDed together, with bars (NOTs) on each variable that is 0 in that row
3. OR all those terms together. you will get a “sum-of-products” expression that is like `Y = term + term + term...`
• using the Engineering notation for these is very compact - `Y = AB + CD` or something
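The three steps above can be sketched in Python (a minimal illustration; the function name is made up, and `'` stands in for the NOT bar since we can't typeset overbars in plain code):

```python
# Build a sum-of-products expression from a truth table.
# The table maps input tuples (A, B) to the output Y.
def sum_of_products(table, names):
    terms = []
    for inputs, output in table.items():
        if output != 1:
            continue  # step 1: keep only rows where the output is 1
        # step 2: AND all input variables together, with ' (NOT)
        # on each variable that is 0 in this row
        term = "".join(n if v else n + "'" for n, v in zip(names, inputs))
        terms.append(term)
    return " + ".join(terms)  # step 3: OR all the terms together

# XOR truth table: Y = 1 when exactly one input is 1
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(sum_of_products(xor, "AB"))  # A'B + AB'
```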
• Propagation delay
• Propagation delay is how long it takes for a signal to pass through some circuit
• Nothing moves infinitely fast in the real world, so there are limits on how quickly we can compute things
• The propagation delay of a sequential circuit’s critical path (longest series of operations that cannot be done in parallel) limits clock speed
• Ripple carry
• Method of implementing multi-bit addition where the carry-out of each bit becomes the carry-in of the next higher bit
• Simple to implement, but linear time in the number of bits
• Double the number of bits? Doubles the time
• When the inputs change, it will produce invalid results for a while, because the carries must “ripple” from LSB to MSB
• The critical path is from the LSB’s input to the MSB’s output - hence why it’s linear time
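Here is the ripple-carry idea as a sketch in Python (bit lists are LSB-first; this mirrors the hardware loop, one full adder per bit):

```python
# Ripple-carry addition: the carry-out of each bit becomes the
# carry-in of the next higher bit, rippling from LSB to MSB.
def ripple_add(a_bits, b_bits):
    """a_bits and b_bits are lists of 0/1, index 0 = LSB."""
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry                            # full-adder sum bit
        carry = (a & b) | (a & carry) | (b & carry)  # full-adder carry-out
        result.append(s)
    return result, carry

# 0b011 (3) + 0b011 (3) = 0b110 (6); bits are written LSB-first here
print(ripple_add([1, 1, 0], [1, 1, 0]))  # ([0, 1, 1], 0)
```

The loop runs once per bit, which is exactly why the hardware version is linear time: each full adder must wait for the previous one's carry.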
• Sequential logic
• can remember things, unlike combinational logic
• any memory (flip flop, register, RAM), or any circuit that contains any memory
• relies on the clock signal to tell the memory components when to update their contents
• combined with combinational logic to make finite state machines
• Latches, flip-flops, and registers
• a latch is the simplest circuit that can remember 1 bit of information
• (there are actually several kinds of latches, but we only looked at the RS Latch)
• a flip-flop is a latch surrounded by some extra circuitry which:
• makes it more stable and less prone to oscillation
• makes it work with the clock signal
• may also give it a write enable input
• a flip-flop is a 1-bit register
• an n-bit register is n flip-flops
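A rising-edge flip-flop's behavior can be sketched as a little Python object (an illustrative model, not how you'd build one; the class and method names are made up):

```python
# A D flip-flop modeled as an object: it captures its input only on
# the rising edge of the clock, and only when write-enabled.
class DFlipFlop:
    def __init__(self):
        self.q = 0          # the stored bit (the output)
        self.prev_clk = 0   # last clock level seen

    def tick(self, clk, d, we=1):
        rising = self.prev_clk == 0 and clk == 1
        if rising and we:
            self.q = d      # capture D on the rising edge
        self.prev_clk = clk
        return self.q

ff = DFlipFlop()
ff.tick(1, 1)               # rising edge: Q becomes 1
ff.tick(0, 0)               # falling edge: Q holds
print(ff.tick(1, 0, we=0))  # rising edge but WE = 0: still 1
```

An n-bit register would just be n of these sharing one clock and one write enable.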
• Multiplication and division
• Addition is commutative and associative
• This means that the sub-steps of multiplication can be reordered and even done in parallel
• This gives us two practical multiplication algorithms:
• the slow, sequential, linear, grade-school multiplication algorithm is `O(n)` time (n = number of bits)
• This is the algorithm implemented with the FSM, with 3 registers and an adder
• the fast, combinational, parallel multiplication algorithm is `O(log n)` time
• This is the algorithm implemented as a tree of adders, no registers at all
• but this is a time-space tradeoff
• the linear time multiplier needs only `O(n)` 1-bit full adders
• while the logarithmic time multiplier needs `O(n^2)` 1-bit full adders - double the number of bits, quadruple the space needed for the circuitry!
• Division is made of multiple subtractions
• and subtraction is neither commutative nor associative
• which means the sub-steps of division must always be done in order
• therefore, division is always `O(n)`
• yes, even if you guess with the SRT algorithm
• this is the algorithm that looks a lot like the multiplication FSM, but remixed
• division is not slower “because of the remainder” or something. the remainder is calculated at the same time as the quotient.
• there are Logisim examples of all three things on the materials page!!
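The slow `O(n)` multiplier can be sketched as a shift-and-add loop, one iteration per bit standing in for one pass through the FSM (variable names here are illustrative, not the actual register names):

```python
# Grade-school (shift-and-add) multiplication: one loop iteration
# per bit of the multiplier - the software twin of the FSM version.
def shift_add_multiply(multiplicand, multiplier, n_bits):
    product = 0  # accumulator register
    for _ in range(n_bits):
        if multiplier & 1:            # low bit of multiplier set?
            product += multiplicand   # then add in the multiplicand
        multiplicand <<= 1            # shift multiplicand left
        multiplier >>= 1              # shift multiplier right
    return product

print(shift_add_multiply(6, 7, 4))  # 42
```

Because each iteration depends on the shifted values from the previous one, the loop is inherently sequential, which is the `O(n)` bottleneck the tree-of-adders design avoids.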
• FSMs (Finite State Machines)
• inputs, state, transition logic, outputs
• the transition logic determines the next state based on the current state and inputs
• the output logic determines the outputs from the current state (and optionally, also from the current inputs)
• this is the distinction between Moore and Mealy machines and I forget which is which but it’s not important for this class and you can look it up if you’re curious
• transition logic can be shown as either a state diagram (the nodes with arrows indicating transitions) or as a transition table (for each combination of state + inputs, show the “next” state)
• these two are equivalent - each arrow in the state diagram is a row in the transition table
• but the table is easier to mechanically translate into circuitry to implement the transition logic
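A transition table really is just a lookup: here is a tiny made-up FSM in Python (it detects two 1s in a row), where each dict entry corresponds to one arrow in the state diagram:

```python
# Transition table: for each (current state, input) pair, the next
# state. Each entry is one arrow in the equivalent state diagram.
transitions = {
    ("start", 0): "start",
    ("start", 1): "one",
    ("one",   0): "start",
    ("one",   1): "two",    # saw two 1s in a row
    ("two",   0): "start",
    ("two",   1): "two",
}

def run(bits):
    state = "start"
    for b in bits:
        state = transitions[(state, b)]  # one table lookup per clock
    return state

print(run([1, 0, 1, 1]))  # two
```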
• Parts of the CPU
• PC FSM controls the PC and lets it advance to next instruction, do absolute jumps, or do relative branches
• Absolute jumps just set the PC to some value (e.g. `PC = 0x80004004`)
• Relative branches add a number to the current PC to move forward or backwards by a certain amount (e.g. `PC = PC + 12`)
• Branches do not have a return address. I don't know why so many people wrote this on the exam; branches make a choice and never return. You are somehow confusing them with function calls
• MIPS’s `jal` instruction is an absolute jump, but it also sets `ra` to the address of the `jal` plus 4. totally different thing
• Instruction memory contains the instructions, and is addressed by the PC - corresponds to the `.text` segment of your program
• This is where instructions are. Instructions are not “in” the PC FSM.
• Control decodes the instruction and produces all the control signals for the rest of the CPU
• Control signals are things like write enables and MUX/DEMUX selects - they control what the other components do.
• Register file is an array of general-purpose registers; typically we can read and write multiple registers simultaneously
• ALU is the Arithmetic and Logic Unit - performs arithmetic and logic (bitwise) operations - add, subtract, AND, OR, NOT, shifts…
• Data memory contains variables that you can load or store - corresponds to the `.data` segment of your program
• Interconnect is all the wires and multiplexers that connect all of the above components together, so that data can be flexibly routed to different components depending on the instruction
• Phases of instruction execution
• Fetch: use PC to get the instruction from memory
• Decode: control decodes instruction and sets control signals
• eXecute: wait for ALU to do its work
• Memory: (only for loads and stores) do the load or store
• Writeback: (only for instructions that have a destination reg) put the result in the register file, not the memory
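The phases map directly onto the body of an interpreter loop. Here is a toy sketch in Python (the instruction format and register names are invented for illustration):

```python
# The execution phases as a straight-line interpreter loop.
regs = {"r0": 0, "r1": 5, "r2": 7}
data_mem = {0: 100}
instr_mem = [("add", "r0", "r1", "r2"),   # r0 = r1 + r2
             ("lw",  "r1", 0)]            # r1 = data_mem[0]

pc = 0
while pc < len(instr_mem):
    instr = instr_mem[pc]     # Fetch: use PC to get the instruction
    op = instr[0]             # Decode: figure out which instruction it is
    if op == "add":
        result = regs[instr[2]] + regs[instr[3]]  # eXecute: ALU does work
        regs[instr[1]] = result                   # Writeback: into reg file
    elif op == "lw":
        result = data_mem[instr[2]]               # Memory: do the load
        regs[instr[1]] = result                   # Writeback: into reg file
    pc += 1                   # advance the PC to the next instruction

print(regs)  # {'r0': 12, 'r1': 100, 'r2': 7}
```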
• How instructions are decoded/control the datapath
• the opcode identifies which instruction it is (`add, lw, beq,` etc)
• in a single-cycle machine, the control takes that opcode, and combinationally produces all the various control signals for the CPU (e.g. ALU operation, register write enable, memory write enable, jump enable etc.)
• for example, an `add` instruction might…
• ALUSrc = register (chooses the second input to the ALU)
• RegDataSrc = ALU (chooses what data to write into the register file)
• RegWrite = 1 (yes, we’re writing a value into the register file)
• MemWrite = 0 (no, we’re not storing a value into memory)
• and the `rd, rs, rt` signals come from the encoded instruction itself.
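Single-cycle control is conceptually a pure lookup: opcode in, a bundle of control signals out. A sketch in Python, using the same illustrative signal names as the `add` example above (not any particular real CPU's):

```python
# Control as a table: each opcode maps to the control signal values
# that steer the datapath for that instruction.
CONTROL = {
    "add": {"ALUSrc": "register",  "RegDataSrc": "ALU",
            "RegWrite": 1, "MemWrite": 0},
    "lw":  {"ALUSrc": "immediate", "RegDataSrc": "memory",
            "RegWrite": 1, "MemWrite": 0},
    "sw":  {"ALUSrc": "immediate", "RegDataSrc": None,
            "RegWrite": 0, "MemWrite": 1},
}

print(CONTROL["add"]["RegWrite"])  # 1 - add writes the register file
print(CONTROL["sw"]["MemWrite"])   # 1 - sw writes memory instead
```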
• Critical path + clock speed
• Critical path is the longest possible path through a circuit
• If it’s a sequential “loop-shaped” circuit, it’s the longest path “through the loop”
• Think of a race track with multiple routes
• The critical path is important because it’s the slowest operation that the circuit can perform…
• And therefore the clock cannot tick faster than that without breaking things
• The maximum clock speed is the reciprocal of the time it takes for a signal to propagate through the critical path
• e.g. if the critical path length is 2 ns (= 2 x 10^-9 s), then the maximum clock speed is the reciprocal of that - 500 MHz (= 5 x 10^8 Hz)
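That reciprocal calculation, written out:

```python
# Max clock speed = 1 / critical path delay. Keeping the number in
# scientific notation makes the reciprocal easy: 1/2 = 0.5, and the
# power-of-ten exponent just flips sign (10^-9 becomes 10^9).
critical_path_s = 2e-9               # 2 ns critical path
max_clock_hz = 1 / critical_path_s   # 0.5 x 10^9 = 5 x 10^8 Hz
print(round(max_clock_hz))           # 500000000, i.e. 500 MHz
```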
• Harvard vs. von Neumann and Single- vs. Multi-cycle
• In a single-cycle machine, every instruction takes one clock cycle.
• In a multi-cycle machine, instructions take 2 or more clock cycles.
• Harvard = 2 memories: one for instructions and one for data
• von Neumann = 1 memory: contains everything!
• We tend to prefer this - it’s just easier to deal with a single address space, a single “flavor” of pointer, a single “flavor” of loads/stores etc.
• There is a fundamental limitation on most memory: you cannot access two different addresses in one piece of memory at the same time.
• This is a practical issue - adding circuitry to do so would make the memory way more expensive and slower, so we just… don’t.
• If you want to make a single cycle machine, you must use a Harvard (2-memory) architecture
• because you cannot do the fetch and memory phases at the same time (within 1 cycle)
• If you want to make a von Neumann (1-memory) machine, you must make it multi-cycle
• that way we can use the same memory for fetch and memory phases, but at different times
• So,
• single-cycle => Harvard (that is, “single cycle implies Harvard” - if you want a simple single-cycle machine, you must accept that you will have two memories)
• von Neumann => multi-cycle (“von Neumann implies multi-cycle” - if you want a von Neumann architecture, you must build the CPU to be multi-cycle)
• single-cycle von Neumann is impossible to build
• (multi-cycle Harvard is useful for pipelined CPUs - separate instruction and data caches so one instruction can fetch at the same time another instruction does a load/store)
• Average CPI calculation
• In a multi-cycle machine, each instruction takes a certain number of cycles
• E.g. ALU = 4 cycles, loads = 10 cycles, stores = 8 cycles, jumps = 5 cycles, branches = 3 cycles
• If we run a test (benchmark) program, we can count how many of each instruction will be executed to come up with proportions for each kind of instruction
• E.g. 40% ALU instructions, 20% loads, 20% stores, 10% jumps, 10% branches
• Then CPI is the weighted average of those instruction classes
• E.g. (4 * 0.4) + (10 * 0.2) + (8 * 0.2) + (5 * 0.1) + (3 * 0.1) = 6.0
• You can then compare the CPI of different CPUs (different numbers of cycles) by using the same program (instruction proportions)
• You can also compare the performance of different programs (different instruction proportions) on the same CPU
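The weighted average above, written out in Python with the same example numbers:

```python
# Average CPI = weighted average of cycles per instruction class,
# using the cycle counts and benchmark mix from the example above.
cycles = {"alu": 4, "load": 10, "store": 8, "jump": 5, "branch": 3}
mix    = {"alu": 0.4, "load": 0.2, "store": 0.2, "jump": 0.1, "branch": 0.1}

cpi = sum(cycles[k] * mix[k] for k in cycles)
print(round(cpi, 1))  # 6.0
```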
• Performance equation
• Calculates how long a program will take to run on a given CPU
• (n instructions) x (CPI cycles per instruction) x (t seconds per cycle); or
• (n instructions) x (CPI cycles per instruction) x (1 / f Hz)
• nano is negative nine
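Putting the performance equation to work (the instruction count is a made-up example; the CPI and clock speed reuse the numbers from earlier sections):

```python
# Time = instructions x CPI x seconds-per-cycle (= 1/frequency).
n_instructions = 1e9    # 1 billion dynamic instructions (example)
cpi = 6.0               # average CPI from the weighted-average example
clock_hz = 500e6        # 500 MHz clock, so 1/f = 2 ns per cycle

seconds = n_instructions * cpi * (1 / clock_hz)
print(round(seconds, 6))  # 12.0
```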
• Kinds of control
• Hardwired single-cycle
• Entirely combinational: instruction goes in, control signals come out
• Simple to design, terrible performance
• Hardwired multi-cycle (FSM)
• Multiple steps/phases for each instruction
• Have to keep track of what phase we’re on (what FSM state we’re in)
• Number of phases is tailored to each instruction to avoid wasting time
• Each phase is still hardwired though
• Microcoded multi-cycle (FSM, but fancy)
• Like the FSM one, but the states and transition table can be reprogrammed
• Firmware is the “program” that implements the control FSM
• (Details on microcode below)
• Hybrid microcoded and hardwired
• Use hardwired control for really common and simple instructions
• Fall back on microcode for more complex operations
• Microcode!
• What is it?
• A way of designing multi-cycle control so that each ISA instruction is implemented as a sequence of “micro-instructions” that perform the various phases of execution.
• What are the benefits?
• FLEXIBILITY!
• While designing the CPU, you can change the instruction set, add instructions easily, etc. without having to change the circuitry of the CPU itself
• And if the microcode is in a writable ROM, we can update the CPU after it’s already been sold and installed in users’ computers
• What’s the downside?
• slower than a hardwired FSM because of the complexity - accessing the microcode ROM and decoding the microinstructions adds a bunch of propagation delay.
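The "sequence of micro-instructions per ISA instruction" idea can be sketched as a table in Python (the signal names and micro-instruction format are illustrative, not a real microarchitecture's):

```python
# Microcode sketch: the control "ROM" maps each opcode to a sequence
# of micro-instructions, each a bundle of control signals for one cycle.
MICROCODE_ROM = {
    "lw": [
        {"phase": "fetch",     "MemRead": 1},
        {"phase": "decode"},
        {"phase": "execute",   "ALUOp": "add"},  # compute the address
        {"phase": "memory",    "MemRead": 1},    # load the data
        {"phase": "writeback", "RegWrite": 1},
    ],
    "add": [
        {"phase": "fetch",     "MemRead": 1},
        {"phase": "decode"},
        {"phase": "execute",   "ALUOp": "add"},
        {"phase": "writeback", "RegWrite": 1},   # no memory phase needed
    ],
}

# Adding or changing an instruction means editing this table,
# not redesigning the circuitry - that's the flexibility win.
print(len(MICROCODE_ROM["lw"]))   # 5 cycles
print(len(MICROCODE_ROM["add"]))  # 4 cycles
```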
• Caching is keeping copies of recently-used data in a smaller but faster memory so it can be accessed more quickly in the near future
• Pipelining is partially overlapping instruction execution to improve throughput - more instructions complete per second, even though each individual instruction still takes just as long
• Superscalar CPUs can complete > 1 instruction per cycle by fetching and executing multiple instructions simultaneously (completely overlapping instruction execution)
• Out-of-order CPUs analyze several (a dozen or more) instructions in advance, then dynamically reorder them so they can be executed more quickly than they would as written