Please read this page about taking my exams!

## Exam format

**When/where**
- During class, here, like normal
- 75 minutes (4:00-5:15 PM)

**Closed-note, no calculator**
- You may not have any notes, cheat sheets, etc. to take the exam
- The math on the exam has been designed to be doable either in your head or very quickly on paper (e.g. 2 x 1 digit multiplication); **if you find yourself needing a calculator, you did something wrong**
- Keep numbers in scientific notation; do not take them out of it until the end
- Avoid division when you can - take the reciprocal first, then multiply by that
- I literally design the test questions to be easy to do reciprocals on

**Length**
- Very much like the first exam.
- 75 minutes

**Topic point distribution**
**It is not cumulative, omg**
- More credit for earlier topics (e.g. AND, OR, multiplexer)
- Less credit for more recent ones (e.g. microcode, pipelining)
- More credit for **things I expect you to know because of your experience** (labs, project)

**VERY ROUGHLY:**
- ~30% Logic (combinational and sequential)
- ~40% CPU design
- ~25% Performance
- ~5% Other

**Kinds of questions**
- Very much like the first exam.

## Things people asked about in the reviews

**Remember, these are just the things that people asked about. There may be topics on the exam not on this list; and there may be topics on this list that are not on the exam.**

**Combinational logic**
- Anything that *doesn't* have memory (latches, flip-flops, registers, RAM)
- Includes gates (AND, OR, NOT, etc.), plexers, arithmetic computations

**Boolean expressions/functions**
- Boolean inputs, one (or more) boolean outputs
  - (multiple boolean outputs are really separate expressions)
- Basically, if you can represent it as a truth table, it's a boolean expression
- Turning a truth table into a boolean expression is extremely straightforward:
  - find every row of the truth table where the output is 1
  - for each of those, write a term that is all the input variables ANDed together, with bars (NOTs) on each variable that is 0 in that row
  - OR all those terms together. you will get a "sum-of-products" expression like `Y = term + term + term...`
- using the Engineering notation for these is very compact - `Y = AB + CD` or something
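The truth-table-to-expression recipe above can be sketched in code. This is a minimal Python sketch (the function name is made up, and a `~` prefix stands in for the NOT bar):

```python
# Hypothetical sketch of the sum-of-products recipe: find the 1-rows,
# AND the variables (with ~ for a 0 input), then OR the terms together.

def sum_of_products(truth_table, names):
    """truth_table maps input tuples to outputs; names label the inputs."""
    terms = []
    for inputs, output in truth_table.items():
        if output == 1:                       # only the rows where the output is 1
            # AND all the variables together, with a bar (~) on each 0 input
            term = "".join(n if v else "~" + n for n, v in zip(names, inputs))
            terms.append(term)
    return " + ".join(terms) or "0"           # OR all the terms together

# XOR as a truth table: output is 1 when exactly one input is 1
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(sum_of_products(xor, "AB"))  # → ~AB + A~B
```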
**Propagation delay**
- **Propagation delay** is how long it takes for a signal to pass through some circuit
- Nothing moves infinitely fast in the real world, so there are limits on how quickly we can compute things
- The propagation delay of a sequential circuit's **critical path** (longest series of operations that cannot be done in parallel) limits clock speed

**Ripple carry**
- Method of implementing multi-bit addition where the carry-out of each bit becomes the carry-in of the next higher bit
- Simple to implement, but **linear time** in the number of bits
  - Double the number of bits? Doubles the time
- When the inputs change, it will produce **invalid results** for a while, because the carries must "ripple" from LSB to MSB
- The **critical path** is from the LSB's input to the MSB's output - hence why it's linear time
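The rippling can be simulated directly: a chain of 1-bit full adders where each carry-out feeds the next carry-in. A minimal Python sketch (function names are illustrative):

```python
# Sketch of a ripple-carry adder: each bit's carry-out becomes the
# next higher bit's carry-in, so the carry "ripples" from LSB to MSB.

def full_adder(a, b, cin):
    """One 1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length lists of bits, LSB first."""
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # carry ripples to the next bit
        result.append(s)
    return result, carry

# 0b011 (3) + 0b001 (1) = 0b100 (4), bits listed LSB first
print(ripple_carry_add([1, 1, 0], [1, 0, 0]))  # → ([0, 0, 1], 0)
```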

**Sequential logic**
- **can remember things**, unlike combinational logic
- any memory (flip-flop, register, RAM), or any circuit that contains any memory
- relies on the **clock signal** to tell the memory components when to update their contents
- combined with combinational logic to make finite state machines

**Latches, flip-flops, and registers**
- a **latch** is the **simplest circuit that can remember 1 bit of information**
  - (there are actually several kinds of latches, but we only looked at the **RS Latch**)
- a **flip-flop** is a latch surrounded by some extra circuitry which:
  - makes it more stable and less prone to oscillation
  - makes it work with the clock signal
  - may also give it a write enable input
- a flip-flop *is* a 1-bit register
  - an *n*-bit register is *n* flip-flops
**Multiplication and division**
- Multiplication is made of multiple **additions**
  - Addition is **commutative and associative**
  - This means that the sub-steps of multiplication can be **reordered** and even **done in parallel**
  - This gives us two practical multiplication algorithms:
    - the slow, sequential, linear, **grade-school** multiplication algorithm is `O(n)` time (n = number of bits)
      - This is the algorithm implemented with the **FSM,** with 3 registers and an adder
    - the fast, combinational, **parallel** multiplication algorithm is `O(log n)` time
      - This is the algorithm implemented as a **tree of adders,** no registers at all
  - but this is a **time-space tradeoff**
    - the linear time multiplier needs only `O(n)` 1-bit full adders
    - while the logarithmic time multiplier needs `O(n^2)` 1-bit full adders
    - double the number of bits, *quadruple* the space needed for the circuitry!
- Division is made of multiple **subtractions**
  - and **subtraction is neither commutative nor associative**
  - which means the sub-steps of division **must always be done in order**
  - therefore, division is **always** `O(n)`
    - yes, even if you guess with the SRT algorithm
  - This is the algorithm that looks a lot like the multiplication FSM, but remixed
  - the remainder is calculated at the same time as the quotient. **division is *not* slower "because of the remainder"** or something.
- **there are Logisim examples of all three things on the materials page!!**
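The grade-school algorithm above can be sketched as a loop of shift-and-add steps - one addition per bit of the multiplier, which is where the `O(n)` comes from. A hypothetical Python sketch:

```python
# Sketch of grade-school (shift-and-add) multiplication: for each set
# bit of b, add a copy of a shifted into that bit's position.

def shift_add_multiply(a, b, n_bits):
    """Multiply two unsigned numbers, one addition per bit of b."""
    product = 0
    for i in range(n_bits):
        if (b >> i) & 1:          # if bit i of b is set...
            product += a << i     # ...add a, shifted left by i places
    return product

print(shift_add_multiply(6, 7, 4))  # → 42
```

The hardware FSM version keeps `a`, `b`, and `product` in its 3 registers and does one add per cycle; the tree-of-adders version computes all the shifted copies at once and sums them in parallel.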
**FSMs (Finite State Machines)**
- inputs, state, transition logic, outputs
- the transition logic determines the *next* state based on the *current* state and inputs
- the output logic determines the outputs from the *current* state (and optionally, also from the current *inputs*)
  - this is the distinction between Moore and Mealy machines and I forget which is which but it's not important for this class and you can look it up if you're curious
- transition logic can be shown as either a **state diagram** (the nodes with arrows indicating transitions) or as a **transition table** (for each combination of state + inputs, show the "next" state)
  - these two are equivalent - each **arrow** in the state diagram is a **row** in the transition table
  - but the table is easier to mechanically translate into circuitry to implement the transition logic
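A transition table really is just a lookup: each (current state, input) pair gives the next state. A toy Python sketch (the "detect two 1s in a row" machine is a made-up example, not one from class):

```python
# Sketch of an FSM driven by a transition table: one dict entry per
# arrow in the state diagram. This toy machine looks for two 1s in a row.

transition = {
    ("idle", 0): "idle",
    ("idle", 1): "saw_one",
    ("saw_one", 0): "idle",
    ("saw_one", 1): "done",
    ("done", 0): "done",      # once done, stay done
    ("done", 1): "done",
}

def run_fsm(inputs, state="idle"):
    for bit in inputs:
        state = transition[(state, bit)]   # look up the "next state" row
    return state

print(run_fsm([0, 1, 1, 0]))  # → done
```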

**Parts of the CPU**
- **PC FSM** controls the PC and lets it advance to the next instruction, do absolute jumps, or do relative branches
  - **Absolute** jumps just set the PC to some value (e.g. `PC = 0x80004004`)
  - **Relative** branches *add* a number to the *current* PC to move forward or backwards by a certain amount (e.g. `PC = PC + 12`)
  - **Branches do not have a return address, I don't know why so many people put this on the exam, branches make a choice and never return, you are somehow confusing them with function calls**
    - MIPS's `jal` instruction is an absolute jump, but it also sets `ra` to the address of the `jal` plus 4. totally different thing
- **Instruction memory** contains the instructions, and is addressed by the PC
  - corresponds to the `.text` segment of your program
  - **This is where instructions are. Instructions are not "in" the PC FSM.**
- **Control** decodes the instruction and produces all the **control signals** for the rest of the CPU
  - **Control signals** are things like write enables and MUX/DEMUX selects - they control what the other components do.
- **Register file** is an array of general-purpose registers; typically we can read and write multiple registers simultaneously
- **ALU** is the Arithmetic and Logic Unit - performs arithmetic and logic (bitwise) operations - add, subtract, AND, OR, NOT, shifts…
- **Data memory** contains variables that you can load or store
  - corresponds to the `.data` segment of your program
- **Interconnect** is all the wires and multiplexers that connect all of the above components together, so that data can be flexibly routed to different components depending on the instruction

**Phases of instruction execution**
- **Fetch**: use PC to get the instruction from memory
- **Decode**: control decodes the instruction and sets control signals
- **eXecute**: wait for the ALU to do its work
- **Memory**: *(only for loads and stores)* do the load or store
- **Writeback**: *(only for instructions that have a destination reg)* put the result in the register file, not the memory

**How instructions are decoded/control the datapath**
- the **opcode** identifies *which* instruction it is (`add`, `lw`, `beq`, etc.)
- in a single-cycle machine, the control takes that opcode, and combinationally produces all the various control signals for the CPU (e.g. ALU operation, register write enable, memory write enable, jump enable, etc.)
- for example, an `add` instruction might set…
  - ALUOp = add *(makes the ALU add)*
  - ALUSrc = register *(chooses the second input to the ALU)*
  - RegDataSrc = ALU *(chooses what data to write into the register file)*
  - RegWrite = 1 *(yes, we're writing a value into the register file)*
  - MemWrite = 0 *(no, we're not storing a value into memory)*
  - and the `rd, rs, rt` signals come from the encoded instruction itself.
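Single-cycle control is essentially a lookup from opcode to a bundle of control signals. A hypothetical Python sketch - the `add` row matches the example above, but the values for `lw` and `sw` are illustrative guesses, not necessarily the exact signals from class:

```python
# Sketch of single-cycle control as a lookup table: opcode in,
# control signals out. The lw/sw rows are illustrative assumptions.

CONTROL = {
    "add": {"ALUOp": "add", "ALUSrc": "register",  "RegDataSrc": "ALU",
            "RegWrite": 1, "MemWrite": 0},
    "lw":  {"ALUOp": "add", "ALUSrc": "immediate", "RegDataSrc": "memory",
            "RegWrite": 1, "MemWrite": 0},
    "sw":  {"ALUOp": "add", "ALUSrc": "immediate", "RegDataSrc": None,
            "RegWrite": 0, "MemWrite": 1},
}

print(CONTROL["add"]["RegWrite"])  # → 1
```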
**Critical path + clock speed**
- Critical path is the **longest possible path** through a circuit
  - If it's a sequential "loop-shaped" circuit, it's the longest path "through the loop"
  - Think of a race track with multiple routes
- The critical path is important because it's the **slowest operation** that the circuit can perform…
  - And therefore **the clock cannot tick faster than that** without breaking things
- The **maximum** clock speed is the **reciprocal** of the time it takes for a signal to propagate through the critical path
  - e.g. if the critical path length is 2 ns (= 2 x 10^-9 s), then the maximum clock speed is the reciprocal of that - 500 MHz (= 5 x 10^8 Hz)
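That reciprocal is easy to check in code (using the same example numbers as above):

```python
# Max clock speed is the reciprocal of the critical path delay.

critical_path_s = 2e-9              # 2 ns critical path, in seconds
max_clock_hz = 1 / critical_path_s  # reciprocal → maximum clock speed

print(f"{max_clock_hz / 1e6:.0f} MHz")  # → 500 MHz
```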
**Harvard vs. von Neumann and Single- vs. Multi-cycle**
- In a single-cycle machine, **every instruction takes one clock cycle.**
- In a multi-cycle machine, **instructions take 2 or more clock cycles.**
- Harvard = **2 memories:** one for instructions and one for data
- von Neumann = **1 memory:** contains everything!
  - We tend to prefer this - it's just easier to deal with a single address space, a single "flavor" of pointer, a single "flavor" of loads/stores, etc.
- There is a fundamental limitation on most memory: **you cannot access two different addresses in one piece of memory at the same time.**
  - This is a practical issue - adding circuitry to do so would make the memory *way* more expensive and slower, so we just… don't.
- If you want to make a single-cycle machine, you **must** use a Harvard (2-memory) architecture
  - because you *cannot* do the fetch and memory phases *at the same time (within 1 cycle)*
- If you want to make a von Neumann (1-memory) machine, you **must** make it multi-cycle
  - that way we can use the same memory for the fetch and memory phases, but at *different times*
- So,
  - single-cycle => Harvard (that is, "single cycle implies Harvard" - if you want a simple single-cycle machine, you must accept that you will have two memories)
  - von Neumann => multi-cycle ("von Neumann implies multi-cycle" - if you want a von Neumann architecture, you must build the CPU to be multi-cycle)
  - single-cycle von Neumann is **impossible to build**
  - (multi-cycle Harvard is useful for pipelined CPUs - separate instruction and data *caches* so one instruction can fetch at the same time another instruction does a load/store)
**Average CPI calculation**
- In a multi-cycle machine, each instruction takes a certain number of cycles
  - E.g. ALU = 4 cycles, loads = 10 cycles, stores = 8 cycles, jumps = 5 cycles, branches = 3 cycles
- If we run a test (benchmark) program, we can count **how many of each instruction** will be executed to come up with proportions for each kind of instruction
  - E.g. 40% ALU instructions, 20% loads, 20% stores, 10% jumps, 10% branches
- Then CPI is the **weighted average** of those instruction classes
  - E.g. (4 x 0.4) + (10 x 0.2) + (8 x 0.2) + (5 x 0.1) + (3 x 0.1) = **6.0**
- You can then compare the CPI of **different CPUs** (different numbers of cycles) by using the same program (instruction proportions)
- You can also compare the performance of **different programs** (different instruction proportions) on the **same CPU**
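The weighted-average calculation above, written out in Python with the same example numbers:

```python
# Average CPI = sum over instruction classes of (cycles x proportion).
# Cycle counts and instruction mix are the example numbers from the text.

cycles = {"alu": 4, "load": 10, "store": 8, "jump": 5, "branch": 3}
mix    = {"alu": 0.4, "load": 0.2, "store": 0.2, "jump": 0.1, "branch": 0.1}

cpi = sum(cycles[k] * mix[k] for k in cycles)
print(round(cpi, 2))  # → 6.0
```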
**Performance equation**
- Calculates how long a program will take to run on a given CPU
- (*n* instructions) x (*CPI* cycles per instruction) x (*t* seconds per cycle); or
- (*n* instructions) x (*CPI* cycles per instruction) x (1 / *f* Hz)
- Be careful about your exponents and SI prefixes here
  - nano is negative nine
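The performance equation with some made-up numbers (the instruction count, CPI, and clock speed here are purely illustrative):

```python
# Runtime = n instructions x CPI cycles/instruction x (1 / f) seconds/cycle.

n   = 1e6    # instructions (made-up)
cpi = 6.0    # cycles per instruction (made-up)
f   = 500e6  # clock speed in Hz; 500 MHz = 5 x 10^8 (nano is negative nine!)

seconds = n * cpi * (1 / f)
print(round(seconds, 6))  # → 0.012
```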
**Kinds of control**
- Hardwired single-cycle
  - Entirely combinational: instruction goes in, control signals come out
  - Simple to design, terrible performance
- Hardwired multi-cycle (FSM)
  - Multiple steps/phases for each instruction
  - Have to keep track of what phase we're on (what FSM state we're in)
  - Number of phases is tailored to each instruction to avoid wasting time
  - Each phase is still hardwired, though
- Microcoded multi-cycle (FSM, but fancy)
  - Like the FSM one, but the states and transition table can be reprogrammed
  - Firmware is the "program" that implements the control FSM
  - (Details on microcode below)
- Hybrid microcoded and hardwired
  - Use hardwired control for really common and simple instructions
  - Fall back on microcode for more complex operations
**Microcode!**
- What is it?
  - A way of designing **multi-cycle control** so that each ISA instruction is implemented as a sequence of "micro-instructions" that perform the various phases of execution.
- What are the benefits?
  - FLEXIBILITY!
  - While designing the CPU, you can change the instruction set, add instructions easily, etc. without having to change the circuitry of the CPU itself
  - And if the microcode is in a writable ROM, we can *update the CPU after it's already been sold and installed* in users' computers
- What's the downside?
  - **slower than a hardwired FSM,** *because* of the complexity - accessing the microcode ROM and decoding the microinstructions adds a bunch of propagation delay.
**Caching** is keeping copies of recently-used data in a smaller but faster memory, so it can be accessed more quickly in the near future

**Pipelining** is *partially* overlapping instruction execution to improve throughput and complete instructions faster

**Superscalar** CPUs can complete > 1 instruction per cycle by fetching and executing multiple instructions simultaneously (*completely* overlapping instruction execution)

**Out-of-order** CPUs analyze several (a dozen or more) instructions in advance, then dynamically reorder them so they can be executed more quickly than they would as written