Chapter 3: Pipelining
Department of Computer Science
The University of Chicago
Last modified: Wed Feb 18 09:49:15 CST
Outline
The information in the text is pretty clear. I decided to condense it
into a conceptual outline, in a logical order a bit different from the
one in the text. I recommend that you read the text in order, but
correlate each section to my outline. You may skim the material in
3.7. Make sure that you know what additional complications arise with
floating point operations and other relatively slow instructions, but
you don't need to assimilate the material in detail. I will grill you
(in assignments and exams) on 3.1-3.6. I'll expect you to be able to do
the quantitative reasoning in 3.1-3.5, to explain the issues in 3.6
thoroughly, and to show a general appreciation of the added
complications in 3.7. Read 3.8-3.11 to help assimilate the earlier
material, and 3.12 for fun.
  - I. Pipelining is a technique for improving CPU performance with a
  given instruction set. Pipelining is normally transparent to the
  functionality of programs, and affects only performance.
 
  - II. How to design a pipeline.
    
      - Start with an unpipelined CPU.
 
      - Divide each instruction execution into segments (``stages'')
      taking equal time, using a small number (usually 1) of clock
      cycles per segment. The number of segments/stages is called the
      ``depth'' of the pipeline.
 
      - Divide the circuitry in the CPU datapath into sections
      corresponding to the execution stages. Keep the incrementing
      of the PC entirely in the first section.
 
      - Add memory (registers or latches) for every wire that goes
      from one section to the next.
 
      - Find and correct bugs (deal with ``hazards'') with
      additional circuitry.
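A minimal sketch of the sectioning and latching steps, in Python (the
three stages, their functions, and the latch fields are invented for
illustration; this is a toy pipe, not the DLX datapath): each stage
reads only the latch in front of it, all latches are overwritten
together on each clock edge, and the incrementing of the PC stays in
the first section.

    # Toy 3-stage pipeline skeleton: one function per datapath section, plus
    # interstage latches. Stage names and latch fields are made up; this is
    # not the DLX datapath.

    def fetch(state):
        """Section 1: produce the next 'instruction' (here just a number)."""
        return {"value": state["pc"]}

    def execute(latch_in):
        """Section 2: operate on whatever the previous latch holds."""
        return {"result": latch_in["value"] * 2}

    def writeback(latch_in, results):
        """Section 3: commit the result."""
        results.append(latch_in["result"])

    def clock(state, latches, results):
        """One clock edge: every section computes from its input latch, then
        all interstage latches are overwritten simultaneously."""
        new_latch1 = fetch(state)
        new_latch2 = execute(latches[0]) if latches[0] else None
        if latches[1]:
            writeback(latches[1], results)
        state["pc"] += 1                      # PC increment stays in section 1
        latches[0], latches[1] = new_latch1, new_latch2

    state, latches, results = {"pc": 0}, [None, None], []
    for _ in range(6):
        clock(state, latches, results)
    print(results)   # [0, 2, 4, 6]: one result per cycle once the pipe is full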
 
    
   
  - III. How a pipeline executes.
    
      - A. Under ideal conditions, a pipeline of depth d
      executes d instructions simultaneously: the dth
      stage of instruction 1 is simultaneous with the (d-1)th
      stage of instruction 2, the (d-2)th stage of instruction 3,
      ... , the 1st stage of instruction d.
 
      - B. When full simultaneous execution of d stages doesn't
      work (for reasons given below), ``stall'' one or more
      instructions in the pipeline, inserting ``bubbles'' amounting
      essentially to null operations.
 
      - C. Some pipeline designs execute instructions that may need to
      be skipped (e.g., after conditional branches). Such pipelines
      occasionally have to cancel a partially executed instruction in
      the pipeline. If the cancelled instruction has modified any sort
      of memory or machine state, the modification must be
      undone.
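The overlap in A and the bubble in B can be seen in a cycle-by-cycle
diagram. The sketch below (Python; the instruction names and the single
stall in front of i3 are invented) prints which stage each instruction
occupies in each cycle of a depth-5 pipeline.

    # Print a timing diagram for a depth-5 pipeline. stall_before[i] is the
    # number of bubble cycles inserted in front of instruction i.

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]

    def diagram(instrs, stall_before):
        start, rows = 0, []
        for name, stall in zip(instrs, stall_before):
            start += 1 + stall            # one cycle behind its predecessor,
            rows.append((name, start))    # plus any bubbles in front of it
        last = rows[-1][1] + len(STAGES) - 1
        print(" " * 12 + " ".join(f"c{c:<3}" for c in range(1, last + 1)))
        for name, s in rows:
            cells = ["    "] * (s - 1) + [f"{st:<4}" for st in STAGES]
            cells += ["    "] * (last - len(cells))
            print(f"{name:<12}" + " ".join(cells))

    # Ideal overlap: in cycle 5 all five stages are busy with five instructions.
    diagram(["i1", "i2", "i3", "i4", "i5"], [0, 0, 0, 0, 0])
    # One bubble in front of i3 delays i3 and everything behind it by a cycle.
    diagram(["i1", "i2", "i3", "i4", "i5"], [0, 0, 1, 0, 0])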
 
    
   
  - IV. Performance of a pipeline.
    
      - A. Latency, throughput, overhead.
        
          - Pipelining keeps the latency of an individual
          instruction execution the same, or makes it worse.
 
          - Pipelining is intended to improve CPU
          throughput, which in turn improves the latency
          (total execution time) of programs that execute many
          instructions.
 
          - ``Overhead'' in pipelining usually refers to the extra
          circuitry and delay involved in a pipelined CPU (including
          extra clock skew due to the extra circuitry). If CPU work is
          conceived as the execution of separate instructions, one
          after another, delays from stalling and cancelling
          instructions are another sort of overhead.
 
        
       
      - B. Calculating pipeline throughput improvement.
        
The speedup attributed to pipelining depends, of course, on
        the unpipelined standard of comparison. We assume that the
        unpipelined and pipelined architectures have the same
        instruction set and the same number of clock cycles in the
        latency of each instruction, and that there is no
        cancellation. SCPI stands for the average number of
        Stall Cycles Per Instruction in the pipelined architecture.
          
Throughput pipelined       Clock cycle unpipelined    depth
----------------------  =  ----------------------- * --------
Throughput unpipelined     Clock cycle pipelined     1 + SCPI
          
        If there are cancellations, the average number of cycles per
        uncancelled instruction wasted on cancelled work must be added
        to the denominator along with SCPI.
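As a quick worked example of the formula (plain Python; the particular
numbers are invented, not taken from the text):

    def pipeline_speedup(cycle_unpipelined, cycle_pipelined, depth, scpi):
        """Throughput(pipelined) / Throughput(unpipelined), per the formula above."""
        return (cycle_unpipelined / cycle_pipelined) * depth / (1 + scpi)

    # A depth-5 pipeline whose clock is 10% slower than the unpipelined clock,
    # with an average of 0.4 stall cycles per instruction:
    print(pipeline_speedup(1.0, 1.1, 5, 0.4))   # about 3.25, well short of 5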
        
       
    
   
  - V. Hazards.
    
      - A. Types of hazards.
        
          - 1. ``Structural hazards'' occur when more than one
          instruction needs the same circuitry at a given time.
            
              - a. Within the CPU.
                
The obvious place for structural hazards is in the
                CPU, e.g. if the same ALU is used for address calculation
                as well as data arithmetic. Conflicts within the CPU are
                the result of failure to complete the design step of
                dividing the circuitry into sections.
               
              - b. Outside the CPU.
                
There may also be conflicts outside of the CPU,
                e.g. if more than one instruction needs to access memory
                at the same time. Such a memory conflict is easy to
                create, since the first stage of each instruction fetches
                the instruction from memory, and a later stage may do a
                load or a store.
               
              - c. Worth removing?
                
In principle, all structural hazards can be removed
                by additional circuitry, but the cost of the extra
                circuitry (which may require a longer clock cycle as
                well as the space and design costs) is not always
                justified.
               
            
           
          - 2. ``Data hazards'' occur when the interleaving of
          instructions causes a read and a write (or two writes) of the
          same storage to execute out of their original order. The
          conflicting references may be to a register or to memory, but
          register conflicts are the most common because of register
          allocation patterns. Whenever two references to the same
          storage are reversed, and at least one of them is a write, we
          get the wrong result (a concrete register example is sketched
          after this list).
 
          - 3. ``Control hazards'' occur when jumps or branches take
          control away from the instructions in the pipeline.
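Here is the concrete register example promised above, as a Python
sketch (the register numbers, values, and timing are all invented): on
a depth-5 pipeline the second instruction reads R1 in its stage 2, two
cycles before the first instruction writes R1 in its stage 5, so it
computes with a stale value unless the pipeline stalls or forwards.

    # Two back-to-back instructions on a depth-5 pipeline (made-up values):
    #   i1: R1 = R2 + R3    writes R1 in its stage 5, i.e. cycle 5
    #   i2: R4 = R1 - R5    reads  R1 in its stage 2, i.e. cycle 3
    regs = {"R1": 111, "R2": 10, "R3": 20, "R5": 1}   # R1 still holds stale 111

    i2_operand = regs["R1"]                 # cycle 3: i2's register read
    regs["R1"] = regs["R2"] + regs["R3"]    # cycle 5: i1's write-back

    i2_result = i2_operand - regs["R5"]
    print("i2 computed", i2_result, "from stale R1 =", i2_operand)   # 110
    print("but i1 actually produced R1 =", regs["R1"])               # 30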
 
        
       
      - B. Detecting hazards.
        
Hazard detection requires circuitry, and constitutes part
        of the overhead of pipelining. I was surprised that all the
        equality tests required for hazard detection are considered a
        reasonable amount of overhead circuitry for pipelining, but I
        guess that's just a failure of my intuitions about circuit
        size.
       
      - C. Surviving hazards.
        
          - 1. Stall. A stall is the only solution to a structural
          hazard, and a long enough stall works for all hazards.
            
              - a. When a hazard is detected in the Instruction Fetch
              stage, the usual way to stall is to insert a null
              operation as the next instruction in the pipeline, and
              suppress incrementing the PC. The null operation
              ``bubbles'' through the pipeline. This bubble adds
               directly to that unpleasant SCPI term in the
              denominator of the speedup formula.
 
              - b. If a hazard is detected after the first stage, then
              there are instructions already in the pipeline that must
              be delayed. I'm not sure whether this is done in real
              machines or not: I haven't found guidance in the text
              yet. Besides some sort of disabling/enabling lines
              running backwards from the circuitry for each stage to
              the previous, this might require more stable interstage
              storage. Does anyone know more about late initiation of
              stalls?
 
            
           - 2. Cancel. The impact of a branch control hazard is only
           known after the branch condition is tested. Unless we stall
           during every branch instruction, which is costly because it
           usually requires more than one stall cycle, there may be
           invalid instructions in the pipeline, and they must be
           cancelled. This may be more costly than stalling,
           particularly if a cancelled instruction can modify a
           register before the cancellation is known.
          
 
           - 3. Forward. Data hazards may sometimes be resolved by
           forwarding data from one stage of the pipeline to another,
           bypassing the usual register write and read. This requires
           extra circuitry, and it only works when the needed value has
           already been produced somewhere in the pipeline by the time
           the later instruction needs it, but it can be accomplished
           without a speed penalty (the equality tests involved are
           sketched below).
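The equality tests mentioned under B, and the bypass they enable, can
be sketched as below (Python; the latch names exmem and memwb, their
fields, and the sample values are assumptions for illustration, not the
text's notation): an ALU operand comes from the most recent pipeline
latch whose pending destination matches the source register, and from
the register file otherwise.

    # Choose an ALU operand, bypassing the register file when a not-yet-written
    # result in a later pipeline latch is destined for the same register.

    def forward_operand(rs1, regs, exmem, memwb):
        """exmem and memwb stand in for the EX/MEM and MEM/WB latches; each
        holds a destination register 'rd', its pending 'value', and a flag."""
        if exmem["writes_reg"] and exmem["rd"] == rs1:
            return exmem["value"]          # newest value wins: forward from EX/MEM
        if memwb["writes_reg"] and memwb["rd"] == rs1:
            return memwb["value"]          # otherwise forward from MEM/WB
        return regs[rs1]                   # no hazard: read the register file

    regs = {1: 111, 2: 10}
    exmem = {"writes_reg": True,  "rd": 1, "value": 30}
    memwb = {"writes_reg": False, "rd": 0, "value": 0}
    print(forward_operand(1, regs, exmem, memwb))   # 30, not the stale 111
    print(forward_operand(2, regs, exmem, memwb))   # 10, straight from Regs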
 
        
      - D. Cost of hazards.
         
           - 1. Structural and data hazards are often resolved by
           stalling for a single machine cycle.
 
           - 2. Control hazards may easily lead to 3 wasted machine
           cycles. Control hazards are usually the biggest performance
           worry for pipeline designers and compiler writers.
 
         
      
 
    
  
  - VI. Impact of pipelining on compilers.
    Compilers producing code for pipelined machines typically
    re-order instructions to fill delay slots. This is usually an
    optimization step after initial code generation. By interleaving
    independent arithmetic operations, compilers can avoid most delays
    from data hazards and structural hazards. With the delayed branch
    in VII E below, compilers can avoid most delays from control
    hazards as well. The usual approach is to first compile naive
    code, then find the delay cycles, then look for instructions that
    can move to cover the delays.
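A toy version of that last step, in Python (the tuple encoding of
instructions and the one-cycle load-use delay are assumptions for
illustration): count the load-use stalls in the naive order, then again
after the independent add has been moved into the delay slot.

    # Each instruction is (destination, [source registers]); 'loads' gives the
    # indices of load instructions, whose results carry a one-cycle delay.

    def stall_cycles(program, loads):
        """Count load-use stalls, assuming forwarding handles everything else."""
        stalls = 0
        for i in range(len(program) - 1):
            dest, _ = program[i]
            _, srcs = program[i + 1]
            if i in loads and dest in srcs:
                stalls += 1
        return stalls

    naive = [("R1", ["R0"]),        # load  R1 <- Mem[R0]
             ("R4", ["R1", "R5"]),  # sub   R4 <- R1 - R5   (uses the load)
             ("R6", ["R7", "R8"])]  # add   R6 <- R7 + R8   (independent)

    scheduled = [naive[0], naive[2], naive[1]]   # add fills the delay slot

    print(stall_cycles(naive, loads={0}))        # 1 stall
    print(stall_cycles(scheduled, loads={0}))    # 0 stalls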
  
  - VII. Impact of pipelining on instruction set design.
    If we expect pipelined circuitry when we design the
    instruction set, we will choose instructions that favor
    pipelining.
    
      - A. Simple and uniform instructions.
 
      - B. 2 operands rather than 3.
 
      - C. Load/store.
 
      - D. Consistent location of operands independent of op code.
 
      - E. Delayed branch instructions, or condition codes (a
      delayed branch is a lot like an implicit condition code).
 
    
    
  
A simple pipeline for DLX
Below is a table showing the 5 stages of execution for pipelining DLX
instructions. The arrays Regs and Mem and the single register PC are
global, and all other values are local to each instruction. Wherever a
global variable is written in stage i, and also read or written in some
earlier stage j < i, hazards will occur. They are data hazards if the
global variable in question is Regs or Mem, control hazards if it is
PC. There are four different pairs of entries in the table that display
hazards. Make sure that you find all four, and understand precisely
what the danger is. One of them is a bit subtle, and I haven't seen it
mentioned in the text.
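The sketch below follows the standard five-stage DLX breakdown, written
in register-transfer style; A, B, Imm, NPC, ALUout, cond, and LMD are
the per-instruction local values, and the entries in the text's own
table may be worded a little differently.

    Stage 1 (IF):   IR <- Mem[PC];  NPC <- PC + 4;  PC <- NPC     (all instructions)
    Stage 2 (ID):   A <- Regs[rs1];  B <- Regs[rs2];  Imm <- sign-extended immediate
    Stage 3 (EX):   ALU op:       ALUout <- A op B  (or A op Imm)
                    load, store:  ALUout <- A + Imm
                    branch:       ALUout <- NPC + Imm;  cond <- (A op 0)
    Stage 4 (MEM):  ALU op:       idle
                    load:         LMD <- Mem[ALUout]
                    store:        Mem[ALUout] <- B
                    branch:       if (cond) PC <- ALUout
    Stage 5 (WB):   ALU op:       Regs[rd] <- ALUout
                    load:         Regs[rd] <- LMD
                    store, branch: idle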
Think of two different reasons why the idle stage for ALU
instructions is stage 4 instead of stage 5.
Miscellaneous variant views of pipelining issues
  - Pipelining is a sort of parallel computation. The interleaving
  of operations in parallel computation is notorious for causing
  surprises. Notice that pipelining interleaves parts of
  instructions, so it is not safe to think of individual instructions
  as atomic events. Most machines have other sources of timing
  complications, such as interrupts, so the precise behavior of the
  pipeline is not determined by the instructions in it. Data hazards
  are even worse than you think at first. If an unresolved hazard was
  guaranteed to do operations in the same wrong order each
  time, at least we could discover that order and debug. But, with
  interrupts or other timing complications, a data hazard may lead to
  different results on different runs, depending on the timing of disk
  and network accesses.
 
  - The performance impact of pipelining is more subtle than you
  think at first, since compilers will naturally try to take advantage
  of the known behavior of the pipeline. Fortunately, this effect
  typically acts in the designer's favor.
 
  - The resolution of hazards with stalls and cancellations in
  effect reintroduces some of the latency in individual instructions
  as a degradation of throughput.
 
  - Recall the four principles for improving performance from the
  discussion of Chapter 1:
  
    - Identify and improve the bottleneck.
  
    - Cater to common patterns of behavior.
  
    - Overlap latency of different steps.
  
    - Keep the bottleneck busy.
  
  
  How do pipelining, forwarding, delayed branches, etc. relate to
  these principles? 
Maintained by Michael J. O'Donnell, email: odonnell@cs.uchicago.edu