Last updated: Sunday, September 27th 2020, 7:48 pm

1、An Overview of Pipeline

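A pipeline works like a factory assembly line: each worker performs only one step (each hardware unit performs only one function), several steps are in progress at the same time (multiple instructions execute at once, each in a different stage), and a product is finished every time the last step completes (so throughput is very high).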


The classic RISC-V pipeline has five stages:

Fetch instruction from memory (IF)
Read registers and decode the instruction (ID)
Execute the operation or calculate an address (EX)
Access an operand in data memory (MEM) [if necessary]
Write the result into a register (WB) [if necessary]
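To make the overlap concrete, here is a minimal Python sketch of an ideal five-stage pipeline with no hazards. The function names `pipeline_schedule` and `total_cycles` and the three sample instructions are illustrative assumptions, not part of any RISC-V tool.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline (no hazards)."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters IF in cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule

def total_cycles(n_instructions):
    """Fill the pipeline once, then one instruction completes per cycle."""
    return n_instructions + len(STAGES) - 1

# Three independent instructions (hypothetical example, chosen to avoid hazards)
program = ["add x1, x2, x3", "sub x4, x5, x6", "or x7, x8, x9"]
sched = pipeline_schedule(program)
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

In cycle 3 all three instructions are in flight at once, each in a different stage, and the whole program needs `total_cycles(3)` cycles instead of 15.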

Designing instruction sets for pipelining

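RISC-V instructions are all the same length. This restriction simplifies instruction fetch in the first stage and decode in the second.
[x86 instructions vary in length from 1 byte to 15 bytes. Recent x86 implementations first translate them into simpler operations.]
RISC-V has only a few instruction formats, and the source and destination register fields sit in the same position in every format.
[This lets the second stage start reading the register file while it is still determining the instruction type.]
In RISC-V, memory operands appear only in load and store instructions (ordinary ALU instructions read their operands directly from the register file in the second stage). This means we can calculate the memory address in the execute stage and access memory in the following stage.
[If instructions could operate on operands in memory, as in x86, stages 3 and 4 would expand into an address stage, a memory stage, and an execute stage.]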

Pipeline control


2、Pipeline Hazards

There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards.

1、Structural Hazard

When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.

[A hazard caused by insufficient hardware resources]

This is also why IF and MEM use separate memories (an instruction memory and a data memory).

2、Data Hazard

When a planned instruction cannot execute in the proper clock cycle because data that are needed to execute the instruction are not yet available.

[A hazard caused by an operand not being ready]

Three classic solutions:

Reorder code
Stall the pipeline (for one or more cycles)
Bypass or forward (pass the result computed by an earlier instruction directly to the instruction that needs it)

①Reorder code

//code segment in C
a = b + e;
c = b + f;

//generated RISC-V code for above segment
ld x1, 0(x31)    // Load b    1
ld x2, 8(x31)    // Load e    2
add x3, x1, x2   // b + e     3
sd x3, 24(x31)   // Store a   4
ld x4, 16(x31)   // Load f    5
add x5, x1, x4   // b + f     6
sd x5, 32(x31)   // Store c   7

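/************** Notes *****************
1. Forwarding removes the dependence of 3 on 1 (a loaded value reaches an instruction two slots later without a stall; with forwarding, an ALU instruction causes no data hazard for the next instruction)
2. Forwarding also resolves the dependence of each sd on the preceding add
3. Still to be resolved: 3 vs 2 and 6 vs 5         */

//************* Solution ***************
Move instruction 5 up so that it sits between instructions 2 and 3.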
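The effect of the reordering can be checked with a small Python sketch. It assumes full forwarding, so the only remaining stall is a one-cycle bubble when an instruction reads the register loaded by the ld immediately before it. The helpers `parse`, `load_use_stalls` and `total_cycles` are hypothetical names for this illustration and only understand the ld, sd and add forms used above.

```python
import re

def parse(line):
    """Split an instruction into (op, dest, sources); handles ld, sd, add only."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"x\d+", rest)
    if op == "sd":                  # sd rs2, offset(rs1): reads both, writes none
        return op, None, regs
    return op, regs[0], regs[1:]    # ld rd, offset(rs1) / add rd, rs1, rs2

def load_use_stalls(program):
    """Count instructions that read the result of the ld directly before them."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        p_op, p_dest, _ = parse(prev)
        _, _, c_srcs = parse(curr)
        if p_op == "ld" and p_dest in c_srcs:
            stalls += 1
    return stalls

def total_cycles(program, stages=5):
    """Ideal pipeline time plus one bubble per load-use stall."""
    return len(program) + (stages - 1) + load_use_stalls(program)

original = [
    "ld x1, 0(x31)", "ld x2, 8(x31)", "add x3, x1, x2", "sd x3, 24(x31)",
    "ld x4, 16(x31)", "add x5, x1, x4", "sd x5, 32(x31)",
]
# Instruction 5 (ld x4) moved up between instructions 2 and 3
reordered = [
    "ld x1, 0(x31)", "ld x2, 8(x31)", "ld x4, 16(x31)", "add x3, x1, x2",
    "sd x3, 24(x31)", "add x5, x1, x4", "sd x5, 32(x31)",
]
print(load_use_stalls(original), load_use_stalls(reordered))
```

Under these assumptions the original order incurs two stalls (3 after 2, 6 after 5) and the reordered sequence none, so it finishes two cycles earlier.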

②Bypassing


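The and instruction needs x2, but x2 only becomes valid once the first instruction writes it back (that is, in the first half of cycle 5).
The same applies to the or instruction.
❓Assumption: register writes happen in the first half of the clock cycle and register reads in the second half.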

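The core idea of bypassing: the result computed by an earlier instruction does not wait until cycle 5 to be written back to the register file; it is forwarded early to the ALU operand inputs of the instructions that follow.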


EX hazard (the needed value is in the EX/MEM pipeline register)

if  (EX/MEM.RegWrite
and  (EX/MEM.RegisterRd ≠ 0)
and  (EX/MEM.RegisterRd = ID/EX.RegisterRs1)) ForwardA = 10

if  (EX/MEM.RegWrite
and  (EX/MEM.RegisterRd ≠ 0)
and  (EX/MEM.RegisterRd = ID/EX.RegisterRs2)) ForwardB = 10

MEM hazard

if  (MEM/WB.RegWrite
and  (MEM/WB.RegisterRd ≠ 0)
and  not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs1)) 
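     // First check that this is not an EX hazard; if it is, the newer EX/MEM result (previous instruction) must be forwarded, not the MEM/WB result (the instruction before that)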
and  (MEM/WB.RegisterRd = ID/EX.RegisterRs1)) ForwardA = 01

if  (MEM/WB.RegWrite
and  (MEM/WB.RegisterRd ≠ 0)
and  not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs2))
and  (MEM/WB.RegisterRd = ID/EX.RegisterRs2)) ForwardB = 01


The forwarding unit generates the control signals that select the ALU input operands.
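The EX-hazard and MEM-hazard conditions can be folded into one priority check per operand. The following Python sketch is a software model only; `PipeReg` and `forward_select` are hypothetical names, and the return values mirror the ForwardA/ForwardB encodings 10, 01 and 00.

```python
from dataclasses import dataclass

@dataclass
class PipeReg:
    """The fields of EX/MEM or MEM/WB that the forwarding unit inspects."""
    reg_write: bool
    rd: int

def forward_select(rs, ex_mem, mem_wb):
    """Return the 2-bit mux select for one ALU operand whose source register is rs."""
    ex_hazard = ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == rs
    if ex_hazard:
        return 0b10  # newest value: ALU result held in EX/MEM
    mem_hazard = mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == rs
    if mem_hazard:
        return 0b01  # older value held in MEM/WB
    return 0b00      # no hazard: use the value read from the register file

# Example: an instruction writing x2 sits in EX/MEM while the instruction
# in EX reads x2 (rs1) and x5 (rs2).
ex_mem = PipeReg(reg_write=True, rd=2)
mem_wb = PipeReg(reg_write=False, rd=0)
forward_a = forward_select(2, ex_mem, mem_wb)
forward_b = forward_select(5, ex_mem, mem_wb)
```

Checking the EX hazard first gives the same result as the explicit `not(...)` term in the MEM-hazard condition: when both pipeline registers hold a matching destination, the more recent EX/MEM value wins.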

③Stalls

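When a stall is needed: if an instruction tries to read a register that the immediately preceding load instruction writes, forwarding alone cannot resolve the hazard (the lw does not produce its result until the fourth stage, MEM).

Hazard detection unit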

if  (ID/EX.MemRead and
	((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
		(ID/EX.RegisterRd = IF/ID.RegisterRs2)))
stall the pipeline

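The hazard detection unit operates in the ID stage.
How to stall the following instructions: keep the PC register and the IF/ID pipeline register unchanged.
Insert a nop: an instruction that performs no operation and changes no state.
Implementation: set all the control signals to 0. These control values are still passed forward every clock cycle, but they do no harm: with the controls at 0, no register or memory is written.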
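The detection condition and the stall action can be modelled in a few lines of Python. This is a sketch under stated assumptions: `load_use_hazard` and `id_stage` are hypothetical names, the control word is reduced to three signals (MemWrite is an assumed name alongside the RegWrite and MemRead used above), and the PC advances by 4 per instruction.

```python
NOP_CONTROL = {"RegWrite": 0, "MemRead": 0, "MemWrite": 0}

def load_use_hazard(id_ex_mem_read, id_ex_rd, if_id_rs1, if_id_rs2):
    """A load is in EX and the instruction in ID reads the load's destination."""
    return bool(id_ex_mem_read) and id_ex_rd in (if_id_rs1, if_id_rs2)

def id_stage(pc, decoded_control, hazard):
    """Return (next_pc, control word written into ID/EX) for one cycle.

    On a hazard the PC is held (and with it IF/ID), and an all-zero
    control word, i.e. a bubble, is sent down the pipeline instead.
    """
    if hazard:
        return pc, dict(NOP_CONTROL)
    return pc + 4, decoded_control

# Example: a load into x2 is in EX; the instruction in ID reads x2 and x5.
hazard = load_use_hazard(True, 2, 2, 5)
and_control = {"RegWrite": 1, "MemRead": 0, "MemWrite": 0}
next_pc, id_ex_control = id_stage(100, and_control, hazard)
```

Because the PC does not move, the stalled instruction is decoded again in the next cycle, by which time the load has reached MEM and its result can be forwarded.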


3、Control Hazard

An instruction must be fetched at every clock cycle to sustain the pipeline, yet in our design the decision about whether to branch doesn't occur until the MEM pipeline stage.
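(With the optimization described later, the branch can be resolved in the ID stage.)

[A hazard caused by the delay in determining the correct instruction to fetch]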

①Branch

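Assume the branch is not taken (if it is taken, the instructions already fetched and decoded must be discarded)
Reduce the branch delay (resolve the branch earlier to reduce the number of instructions flushed)
- Compute the branch target address (the IF/ID pipeline register already holds the PC and the immediate field)
- Evaluate the branch condition: this needs extra forwarding and hazard detection hardware. [The branch condition may depend on a result that is still in the pipeline.]
  Two difficulties:
  Ⅰ. The ALU forwarding unit described earlier works in the EX stage, so a new forwarding unit is needed in the ID stage. Equality test logic is also needed (XOR the two register values bitwise, then OR all the result bits together).
  Ⅱ. The data may not be ready in time to forward to ID. If the preceding instruction is an ALU instruction, we must stall one cycle; if it is an lw, we must stall two cycles.
  In addition, a new control signal, IF.flush, turns the instruction that was just fetched into a nop.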
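Dynamic branch prediction (predict using cached information from earlier executions of the branch)
(If the branch was taken the last time it executed, begin fetching new instructions from the same place as the last time.)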
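- A branch prediction buffer, also called a branch history table, is a small memory indexed by the low-order bits of the branch instruction's address.
- With this kind of buffer we do not actually know whether the prediction is the right one: it may have been set by another branch whose address has the same low-order bits.
- On a misprediction, the incorrectly fetched instructions are discarded, the prediction bit is inverted and written back to its original entry (❓ this needs a buffer), and fetch and execution continue along the correct path.
- A branch prediction buffer can be implemented as a small, special buffer accessed with the instruction address during the IF pipe stage. If the branch is predicted taken, fetching begins from the branch target.
- To improve accuracy on highly regular branches (for example a loop branch that is taken nine times in a row and then not taken once at loop exit), a 2-bit prediction scheme can be used: the prediction must be wrong twice before it changes.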
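- Correlating predictors: use not only information about the local branch but also the global behavior of recently executed branches. A typical correlating predictor keeps two 2-bit predictors for each branch and chooses between them based on whether the last executed branch was taken, so the global branch behavior can be thought of as adding additional index bits for the prediction lookup.
- Tournament branch predictor: uses multiple predictors for each branch and tracks which one gives the best results. A typical tournament predictor holds two predictions for each branch index, one based on local information and one based on global branch behavior. A selector chooses which of the two to use.
- Conditional move instruction: rather than changing the PC as a branch does, a conditional move changes the destination register of the move depending on a condition. In the ARMv8 instruction set architecture, CSEL X8, X11, X4, NE copies X11 into X8 if the condition codes indicate a nonzero result, and copies X4 into X8 otherwise.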

②Exception

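Control is the most challenging aspect of processor design: it is both the hardest part to get right and the toughest part to make fast.
The hardest part of control, in turn, is implementing exceptions and interrupts: events other than branches that change the normal flow of instruction execution.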

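When an exception occurs, the basic actions the processor must take are:

Save the address of the offending instruction in the SEPC (supervisor exception program counter)
Transfer control to the operating system at a specified address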

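To handle the exception, the OS must know what caused it:

Use a cause register (Supervisor Exception Cause Register, or SCAUSE), which has a field that indicates the reason for the exception
Use vectored interrupts: the address to which control is transferred is determined by the cause of the exception, possibly added to a base register that points to the memory range for vectored interrupts. For example, we could use the following exception vector addresses to identify the type of exception.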