In computer architecture, pipelining is a technique for implementing instruction-level parallelism (ILP) within a single processor. The idea is that the execution of multiple instructions overlaps in time, so every part of the processor is kept busy doing something.

Incoming instructions are divided into a series of sequential steps, each handled by a different hardware component, so the steps of different instructions can proceed in parallel. For example, RISC-style instructions usually take the following five steps, and so use a five-stage pipeline:

  • Fetch instruction from memory.
  • Read registers and decode instruction.
  • Execute operation or calculation.
  • Access an operand in memory (if necessary).
  • Write the result into a register (if necessary).
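The overlap these five stages create can be visualised with a short sketch (a toy illustration, not a real simulator; the stage names IF/ID/EX/MEM/WB and the assumption of no hazards are mine):

```python
# Show which instruction occupies each stage of a classic five-stage
# pipeline in every clock cycle, assuming one instruction issues per
# cycle and nothing ever stalls.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """Return, per clock cycle, which instruction occupies each stage."""
    total_cycles = n_instructions + len(STAGES) - 1
    diagram = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            instr = cycle - s  # instruction i enters stage s at cycle i + s
            if 0 <= instr < n_instructions:
                row[stage] = f"i{instr}"
        diagram.append(row)
    return diagram

for cycle, row in enumerate(pipeline_diagram(5)):
    print(cycle, row)
```

Five instructions finish in 9 cycles instead of 25: once the pipeline is full (cycle 4 here), all five stages are busy at once and one instruction completes every cycle.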

Hazards

When can’t we pipeline? We have a few different cases:

  • Structural hazards: two instructions need the same piece of hardware in the same clock cycle (for example, a single memory port shared between instruction fetch and data access). Most ISAs don’t have this problem since they’re built to support pipelining from the get-go.
  • Data hazards: an instruction must wait for a result that an earlier instruction has not yet produced. Memory accesses notably take much longer than ordinary logic delays. We can sometimes fix this with extra hardware that forwards (or bypasses) a result directly from one pipeline stage to another, or with compiler optimisations that reorder instructions.
    • When an instruction uses a value that the immediately preceding load is still fetching, even forwarding can’t help: the value isn’t available in time, so the pipeline must stall for a cycle on this load-use data hazard. The inserted empty slot is called a pipeline stall (or bubble).
  • Control hazards: these arise mainly from branches, where the next instruction to fetch depends on a condition that hasn’t been evaluated yet. Often we’re not interested in waiting for the branch to resolve, which motivates branch prediction: guess the outcome, keep fetching along the predicted path, and flush the pipeline if the guess was wrong.
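The load-use case can be made concrete with a toy scheduler (a sketch under simplified assumptions: instructions are `(op, dest, sources)` tuples, forwarding covers every dependence except a load feeding the very next instruction):

```python
# Detect load-use hazards and insert a one-cycle bubble where needed.
# ALU->ALU dependences are assumed to be handled by forwarding, but a
# value loaded from memory is not ready in time for the next cycle.

def needs_stall(prev, curr):
    """True if `curr` reads a register that `prev` is still loading."""
    prev_op, prev_dest, _ = prev
    _, _, curr_sources = curr
    return prev_op == "load" and prev_dest in curr_sources

def schedule(instructions):
    """Insert bubbles wherever a load-use hazard forces a stall."""
    out = []
    for i, instr in enumerate(instructions):
        if i > 0 and needs_stall(instructions[i - 1], instr):
            out.append(("bubble", None, ()))
        out.append(instr)
    return out

program = [
    ("load", "x1", ("x2",)),       # x1 <- mem[x2]
    ("add",  "x3", ("x1", "x4")),  # uses x1 immediately: load-use hazard
    ("sub",  "x5", ("x3", "x6")),  # ALU->ALU: forwarding handles this
]
print(schedule(program))  # a bubble appears between the load and the add
```

A compiler can often hide such a bubble by moving an independent instruction into the slot instead of wasting it.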

Multiple-issue parallelism

Multiple-issue pipelines improve ILP by replicating parts of the processor (multiple datapaths and their control logic) within a single core, so that several instructions can be launched per clock cycle. This comes at a higher silicon cost. Modern CPUs aim to issue 3-6 instructions per clock cycle, which, combined with deep pipelines, can mean up to 20 instructions in flight at a single time.

This adds a lot of overhead (scheduling logic, circuit cost) but brings a big performance gain.

Note that most programs can’t actually sustain that many instructions per clock cycle. Dependencies (e.g., memory accesses through pointers) create serial bottlenecks, and the latency of memory accesses themselves can vary widely.
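A back-of-the-envelope model shows why achieved throughput falls short of the peak issue width (the numbers below are illustrative assumptions, not figures from the text):

```python
# Crude throughput model: cycles per instruction (CPI) starts at the
# ideal 1/issue_width, then grows with dependence stalls and cache
# misses; instructions per cycle (IPC) is its reciprocal.

def effective_ipc(issue_width, stall_fraction, miss_rate, miss_penalty):
    """Estimate achieved IPC given per-instruction stall overheads."""
    ideal_cpi = 1.0 / issue_width
    cpi = ideal_cpi + stall_fraction + miss_rate * miss_penalty
    return 1.0 / cpi

# A 4-wide machine where 10% of instructions stall a cycle on a
# dependence and 2% miss in cache with a 50-cycle penalty:
print(round(effective_ipc(4, 0.10, 0.02, 50), 2))
```

Even modest miss rates dominate: the hypothetical 4-wide machine above achieves well under 1 IPC, which is why memory behaviour, not issue width, often limits real programs.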