Chapter 3: Pipelining
Department of Computer Science
The University of Chicago
Last modified: Wed Feb 18 09:49:15 CST
Outline
The information in the text is pretty clear. I decided to condense it
into a conceptual outline, in a logical order a bit different from the
one in the text. I recommend that you read the text in order, but
correlate each section to my outline. You may skim the material in
3.7. Make sure that you know what additional complications arise with
floating point operations and other relatively slow instructions, but
you don't need to assimilate the material in detail. I will grill you
(in assignments and exams) on 3.1-3.6. I'll expect you to be able to do
the quantitative reasoning in 3.1-3.5, to explain the issues in 3.6
thoroughly, and to show a general appreciation of the added
complications in 3.7. Read 3.8-3.11 to help assimilate the earlier
material, and 3.12 for fun.
  - I. Pipelining is a technique for improving CPU performance with a
  given instruction set. Pipelining is normally transparent to the
  functionality of programs, and affects only performance.
 
  - II. How to design a pipeline.
    
      - Start with an unpipelined CPU.
 
      - Divide each instruction execution into segments (``stages'')
      taking equal time, using a small number (usually 1) of clock
      cycles per segment. The number of segments/stages is called the
      ``depth'' of the pipeline.
 
      - Divide the circuitry in the CPU datapath into sections
      corresponding to the execution stages. Keep the incrementing
      of the PC entirely in the first section.
 
      - Add memory (registers or latches) for every wire that goes
      from one section to the next.
 
      - Find and correct bugs (deal with ``hazards'') with
      additional circuitry.
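A minimal sketch of the sectioning and latching steps, in Python (the
three stages, their functions, and the latch fields are invented for
illustration; this is a toy pipe, not the DLX datapath): each stage
reads only the latch in front of it, all latches are overwritten
together on each clock edge, and the incrementing of the PC stays in
the first section.

    # Toy 3-stage pipeline skeleton: one function per datapath section, plus
    # interstage latches. Stage names and latch fields are made up; this is
    # not the DLX datapath.

    def fetch(state):
        """Section 1: produce the next 'instruction' (here just a number)."""
        return {"value": state["pc"]}

    def execute(latch_in):
        """Section 2: operate on whatever the previous latch holds."""
        return {"result": latch_in["value"] * 2}

    def writeback(latch_in, results):
        """Section 3: commit the result."""
        results.append(latch_in["result"])

    def clock(state, latches, results):
        """One clock edge: every section computes from its input latch, then
        all interstage latches are overwritten simultaneously."""
        new_latch1 = fetch(state)
        new_latch2 = execute(latches[0]) if latches[0] else None
        if latches[1]:
            writeback(latches[1], results)
        state["pc"] += 1                      # PC increment stays in section 1
        latches[0], latches[1] = new_latch1, new_latch2

    state, latches, results = {"pc": 0}, [None, None], []
    for _ in range(6):
        clock(state, latches, results)
    print(results)   # [0, 2, 4, 6]: one result per cycle once the pipe is full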
 
    
   
  - III. How a pipeline executes.
    
      - A. Under ideal conditions, a pipeline of depth d
      executes d instructions simultaneously: the dth
      stage of instruction 1 is simultaneous with the (d-1)th
      stage of instruction 2, the (d-2)th stage of instruction 3,
      ... , the 1st stage of instruction d.
 
      - B. When full simultaneous execution of d stages doesn't
      work (for reasons given below), ``stall'' one or more
      instructions in the pipeline, inserting ``bubbles'' amounting
      essentially to null operations.
 
      - C. Some pipeline designs execute instructions that may need to
      be skipped (e.g., after conditional branches). Such pipelines
      occasionally have to cancel a partially executed instruction in
      the pipeline. If the cancelled instruction has modified any sort
      of memory or machine state, the modification must be
      undone.
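The overlap in A and the bubble in B can be seen in a cycle-by-cycle
diagram. The sketch below (Python; the instruction names and the single
stall in front of i3 are invented) prints which stage each instruction
occupies in each cycle of a depth-5 pipeline.

    # Print a timing diagram for a depth-5 pipeline. stall_before[i] is the
    # number of bubble cycles inserted in front of instruction i.

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def diagram(instrs, stall_before):
        start, rows = 0, []
        for name, stall in zip(instrs, stall_before):
            start += 1 + stall            # one cycle behind its predecessor,
            rows.append((name, start))    # plus any bubbles in front of it
        last = rows[-1][1] + len(STAGES) - 1
        print(" " * 12 + " ".join(f"c{c:<3}" for c in range(1, last + 1)))
        for name, s in rows:
            cells = ["    "] * (s - 1) + [f"{st:<4}" for st in STAGES]
            cells += ["    "] * (last - len(cells))
            print(f"{name:<12}" + " ".join(cells))

    # Ideal overlap: in cycle 5 all five stages are busy with five instructions.
    diagram(["i1", "i2", "i3", "i4", "i5"], [0, 0, 0, 0, 0])
    # One bubble in front of i3 delays i3 and everything behind it by a cycle.
    diagram(["i1", "i2", "i3", "i4", "i5"], [0, 0, 1, 0, 0])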
 
    
   
  - IV. Performance of a pipeline.
    
      - A. Latency, throughput, overhead.
        
          - Pipelining keeps the latency of an individual
          instruction execution the same, or makes it worse.
 
          - Pipelining is intended to improve CPU
          throughput, which in turn improves the latency
          (total execution time) of programs that execute many
          instructions.
 
          - ``Overhead'' in pipelining usually refers to the extra
          circuitry and delay involved in a pipelined CPU (including
          extra clock skew due to the extra circuitry). If CPU work is
          conceived as the execution of separate instructions, one
          after another, delays from stalling and cancelling
          instructions are another sort of overhead.
 
        
       
      - B. Calculating pipeline throughput improvement.
        
The speedup attributed to pipelining depends, of course, on
        the unpipelined standard of comparison. We assume that the
        unpipelined and pipelined architectures have the same
        instruction set and the same number of clock cycles in the
        latency of each instruction, and that there is no
        cancellation. SCPI stands for the average number of
        Stall Cycles Per Instruction in the pipelined architecture.
          
Throughput pipelined       Clock cycle unpipelined    depth
----------------------  =  ----------------------- * --------
Throughput unpipelined     Clock cycle pipelined     1 + SCPI
          
        If there are cancellations, the average number of cycles per
        uncancelled instruction wasted on cancelled work must be added
        to the denominator along with SCPI.
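As a quick worked example of the formula (plain Python; the particular
numbers are invented, not taken from the text):

    def pipeline_speedup(cycle_unpipelined, cycle_pipelined, depth, scpi):
        """Throughput(pipelined) / Throughput(unpipelined), per the formula above."""
        return (cycle_unpipelined / cycle_pipelined) * depth / (1 + scpi)

    # A depth-5 pipeline whose clock is 10% slower than the unpipelined clock,
    # with an average of 0.4 stall cycles per instruction:
    print(pipeline_speedup(1.0, 1.1, 5, 0.4))   # about 3.25, well short of 5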
        
       
    
   
  - V. Hazards.
    
      - A. Types of hazards.
        
          - 1. ``Structural hazards'' occur when more than one
          instruction needs the same circuitry at a given time.
            
              - a. Within the CPU.
                
The obvious place for structural hazards is in the
                CPU, e.g. if the same ALU is used for address calculation
                as well as data arithmetic. Conflicts within the CPU are
                the result of failure to complete the design step of
                dividing the circuitry into sections.
               
              - b. Outside the CPU.
                
There may also be conflicts outside of the CPU,
                e.g. if more than one instruction needs to access memory
                at the same time. Such a memory conflict is easy to
                create, since the first stage of each instruction fetches
                the instruction from memory, and a later stage may do a
                load or a store.
               
              - c. Worth removing?
                
In principle, all structural hazards can be removed
                by additional circuitry, but the cost of the extra
                circuitry (which may require a longer clock cycle as
                well as the space and design costs) is not always
                justified.
               
            
           
          - 2. ``Data hazards'' occur when the interleaving of
          instructions causes a read and a write (or two writes) of the
          same storage to execute out of their original order. The
          conflicting references may be to a register or to memory, but
          register conflicts are the most common because of register
          allocation patterns. Whenever two references to the same
          storage are reversed, and at least one of them is a write, we
          get the wrong result (a concrete register example is sketched
          after this list).
 
          - 3. ``Control hazards'' occur when jumps or branches take
          control away from the instructions in the pipeline.
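Here is the concrete register example promised above, as a Python
sketch (the register numbers, values, and timing are all invented): on
a depth-5 pipeline the second instruction reads R1 in its stage 2, two
cycles before the first instruction writes R1 in its stage 5, so it
computes with a stale value unless the pipeline stalls or forwards.

    # Two back-to-back instructions on a depth-5 pipeline (made-up values):
    #   i1: R1 = R2 + R3    writes R1 in its stage 5, i.e. cycle 5
    #   i2: R4 = R1 - R5    reads  R1 in its stage 2, i.e. cycle 3
    regs = {"R1": 111, "R2": 10, "R3": 20, "R5": 1}   # R1 still holds stale 111

    i2_operand = regs["R1"]                 # cycle 3: i2's register read
    regs["R1"] = regs["R2"] + regs["R3"]    # cycle 5: i1's write-back

    i2_result = i2_operand - regs["R5"]
    print("i2 computed", i2_result, "from stale R1 =", i2_operand)   # 110
    print("but i1 actually produced R1 =", regs["R1"])               # 30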
 
        
       
      - B. Detecting hazards.
        
Hazard detection requires circuitry, and constitutes part
        of the overhead of pipelining. I was surprised that all the
        equality tests required for hazard detection are considered a
        reasonable amount of overhead circuitry for pipelining, but I
        guess that's just a failure of my intuitions about circuit
        size.
       
      - C. Surviving hazards.
        
          - 1. Stall. A stall is the only solution to a structural
          hazard, and a long enough stall works for all hazards.
            
              - a. When a hazard is detected in the Instruction Fetch
              stage, the usual way to stall is to insert a null
              operation as the next instruction in the pipeline, and
              suppress incrementing the PC. The null operation
              ``bubbles'' through the pipeline. This bubble adds
               directly to that unpleasant SCPI term in the
              denominator of the speedup formula.
 
              - b. If a hazard is detected after the first stage, then
              there are instructions already in the pipeline that must
              be delayed. I'm not sure whether this is done in real
              machines or not: I haven't found guidance in the text
              yet. Besides some sort of disabling/enabling lines
              running backwards from the circuitry for each stage to
              the previous, this might require more stable interstage
              storage. Does anyone know more about late initiation of
              stalls?
 
            
           - 2. Cancel. The impact of a branch control hazard is only
           known after the branch condition is tested. Unless we stall
           during every branch instruction, which is costly because it
           usually requires more than one stall cycle, there may be
           invalid instructions in the pipeline, and they must be
           cancelled. This may be more costly than stalling,
           particularly if a cancelled instruction can modify a
           register before the cancellation is known.
          
 
           - 3. Forward. Data hazards may sometimes be resolved by
           forwarding data from one stage of the pipeline to another,
           bypassing the usual register write and read. This requires
           extra circuitry, and it only works when the needed value has
           already been produced somewhere in the pipeline by the time
           the later instruction needs it, but it can be accomplished
           without a speed penalty (the equality tests involved are
           sketched below).
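The equality tests mentioned under B, and the bypass they enable, can
be sketched as below (Python; the latch names exmem and memwb, their
fields, and the sample values are assumptions for illustration, not the
text's notation): an ALU operand comes from the most recent pipeline
latch whose pending destination matches the source register, and from
the register file otherwise.

    # Choose an ALU operand, bypassing the register file when a not-yet-written
    # result in a later pipeline latch is destined for the same register.

    def forward_operand(rs1, regs, exmem, memwb):
        """exmem and memwb stand in for the EX/MEM and MEM/WB latches; each
        holds a destination register 'rd', its pending 'value', and a flag."""
        if exmem["writes_reg"] and exmem["rd"] == rs1:
            return exmem["value"]          # newest value wins: forward from EX/MEM
        if memwb["writes_reg"] and memwb["rd"] == rs1:
            return memwb["value"]          # otherwise forward from MEM/WB
        return regs[rs1]                   # no hazard: read the register file

    regs = {1: 111, 2: 10}
    exmem = {"writes_reg": True,  "rd": 1, "value": 30}
    memwb = {"writes_reg": False, "rd": 0, "value": 0}
    print(forward_operand(1, regs, exmem, memwb))   # 30, not the stale 111
    print(forward_operand(2, regs, exmem, memwb))   # 10, straight from Regs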
 
        
      - D. Cost of hazards.
         
           - 1. Structural and data hazards are often resolved by
           stalling for a single machine cycle.
 
           - 2. Control hazards may easily lead to 3 wasted machine
           cycles. Control hazards are usually the biggest performance
           worry for pipeline designers and compiler writers.
 
         
      
 
    
  
  - VI. Impact of pipelining on compilers.
    Compilers producing code for pipelined machines typically
    re-order instructions to fill delay slots. This is usually an
    optimization step after initial code generation. By interleaving
    independent arithmetic operations, compilers can avoid most delays
    from data hazards and structural hazards. With the delayed branch
    in VII E below, compilers can avoid most delays from control
    hazards as well. The usual approach is to first compile naive
    code, then find the delay cycles, then look for instructions that
    can move to cover the delays.
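A toy version of that last step, in Python (the tuple encoding of
instructions and the one-cycle load-use delay are assumptions for
illustration): count the load-use stalls in the naive order, then again
after the independent add has been moved into the delay slot.

    # Each instruction is (destination, [source registers]); 'loads' gives the
    # indices of load instructions, whose results carry a one-cycle delay.

    def stall_cycles(program, loads):
        """Count load-use stalls, assuming forwarding handles everything else."""
        stalls = 0
        for i in range(len(program) - 1):
            dest, _ = program[i]
            _, srcs = program[i + 1]
            if i in loads and dest in srcs:
                stalls += 1
        return stalls

    naive = [("R1", ["R0"]),        # load  R1 <- Mem[R0]
             ("R4", ["R1", "R5"]),  # sub   R4 <- R1 - R5   (uses the load)
             ("R6", ["R7", "R8"])]  # add   R6 <- R7 + R8   (independent)

    scheduled = [naive[0], naive[2], naive[1]]   # add fills the delay slot

    print(stall_cycles(naive, loads={0}))        # 1 stall
    print(stall_cycles(scheduled, loads={0}))    # 0 stalls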
  
  - VII. Impact of pipelining on instruction set design.
    If we expect pipelined circuitry when we design the
    instruction set, we will choose instructions that favor
    pipelining.
    
      - A. Simple and uniform instructions.
 
      - B. 2 operands rather than 3.
 
      - C. Load/store.
 
      - D. Consistent location of operands independent of op code.
 
      - E. Delayed branch instructions, or condition codes (a
      delayed branch is a lot like an implicit condition code).
 
    
    
  
A simple pipeline for DLX
Below is a table showing the 5 stages of execution for pipelining DLX
instructions. The arrays Regs and Mem and the single register PC are
global, and all other values are local to each instruction. Wherever a
global variable is written in stage i, and also read or written in some
earlier stage j < i, hazards will occur. They are data hazards if the
global variable in question is Regs or Mem, control hazards if it is
PC. There are four different pairs of entries in the table that display
hazards. Make sure that you find all four, and understand precisely
what the danger is. One of them is a bit subtle, and I haven't seen it
mentioned in the text.
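The sketch below follows the standard five-stage DLX breakdown, written
in register-transfer style; A, B, Imm, NPC, ALUout, cond, and LMD are
the per-instruction local values, and the entries in the text's own
table may be worded a little differently.

    Stage 1 (IF):   IR <- Mem[PC];  NPC <- PC + 4;  PC <- NPC     (all instructions)
    Stage 2 (ID):   A <- Regs[rs1];  B <- Regs[rs2];  Imm <- sign-extended immediate
    Stage 3 (EX):   ALU op:       ALUout <- A op B  (or A op Imm)
                    load, store:  ALUout <- A + Imm
                    branch:       ALUout <- NPC + Imm;  cond <- (A op 0)
    Stage 4 (MEM):  ALU op:       idle
                    load:         LMD <- Mem[ALUout]
                    store:        Mem[ALUout] <- B
                    branch:       if (cond) PC <- ALUout
    Stage 5 (WB):   ALU op:       Regs[rd] <- ALUout
                    load:         Regs[rd] <- LMD
                    store, branch: idle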
Think of two different reasons why the idle stage for ALU
instructions is stage 4 instead of stage 5.
Miscellaneous variant views of pipelining issues
  - Pipelining is a sort of parallel computation. The interleaving
  of operations in parallel computation is notorious for causing
  surprises. Notice that pipelining interleaves parts of
  instructions, so it is not safe to think of individual instructions
  as atomic events. Most machines have other sources of timing
  complications, such as interrupts, so the precise behavior of the
  pipeline is not determined by the instructions in it. Data hazards
  are even worse than you think at first. If an unresolved hazard was
  guaranteed to do operations in the same wrong order each
  time, at least we could discover that order and debug. But, with
  interrupts or other timing complications, a data hazard may lead to
  different results on different runs, depending on the timing of disk
  and network accesses.
 
  - The performance impact of pipelining is more subtle than you
  think at first, since compilers will naturally try to take advantage
  of the known behavior of the pipeline. Fortunately, this effect
  typically acts in the designer's favor.
 
  - The resolution of hazards with stalls and cancellations in
  effect reintroduces some of the latency in individual instructions
  as a degradation of throughput.
 
  - Recall the four principles for improving performance from the
  discussion of Chapter 1:
  
    - Identify and improve the bottleneck.
  
    - Cater to common patterns of behavior.
  
    - Overlap latency of different steps.
  
    - Keep the bottleneck busy.
  
  
  How do pipelining, forwarding, delayed branches, etc. relate to
  these principles? 
Maintained by Michael J. O'Donnell, email: odonnell@cs.uchicago.edu