TSK3000A Pipeline
The TSK3000A uses a five-stage execution pipeline. Each instruction is therefore executed in five successive stages, as summarized in Figure 1 and detailed in the sections that follow.
Figure 1: Instruction Fetch (IF) | Instruction Decode (ID) | Execute (EX) | Memory Access (MA) | Write Back (WB)
Instruction Fetch Stage (IF)
In this stage, the content of the Program Counter is used to access memory and fetch the next instruction to be executed.
Instruction Decode Stage (ID)
During this stage, the instruction is decoded and the required operands are retrieved from the general purpose registers (GPRs) or special function registers (SFRs).
Execute Stage (EX)
Any calculations are performed during this stage. This includes effective address calculation for Load or Store instructions. The next Program Counter value is also calculated during this stage of the pipeline so that branches, where applicable, can be executed.
Some initial pre-calculation for memory decoding is also performed in this stage.
Memory Access Stage (MA)
If the instruction being executed is a Load or Store, the data memory is accessed during this stage. The effective address calculated in the Execute stage is applied to the data memory and the read or write is performed in accordance with the instruction type.
Write Back Stage (WB)
During this stage, the result of the calculation from the Execute stage, or the data loaded in the Memory Access stage, is written to the general purpose registers or special function registers.
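As an illustration of the five stages described above, the following minimal Python sketch walks a single Load instruction through them one after another. It is not TSK3000A code: the tuple instruction encoding, the register names (r1, r2), the dictionary-based memories and the word-indexed Program Counter are all assumptions made for the example.

```python
def run_load(pc, instruction_memory, registers, data_memory):
    """Walk one hypothetical 'load' instruction through the five stages in turn."""
    # IF: the Program Counter selects the instruction to fetch.
    instruction = instruction_memory[pc]

    # ID: decode the fields and read the base register operand.
    op, dest, base, offset = instruction
    assert op == "load", "this sketch only models a Load instruction"
    base_value = registers[base]

    # EX: compute the effective address and the next Program Counter value.
    address = base_value + offset
    next_pc = pc + 1  # assumption: word-indexed PC, no branch taken

    # MA: apply the effective address to data memory and perform the read.
    loaded = data_memory[address]

    # WB: update the destination register with the loaded value.
    registers[dest] = loaded
    return next_pc


if __name__ == "__main__":
    registers = {"r1": 100, "r2": 0}
    instruction_memory = {0: ("load", "r2", "r1", 8)}
    data_memory = {108: 42}
    next_pc = run_load(0, instruction_memory, registers, data_memory)
    print(next_pc, registers)
```

In the real processor the stages operate concurrently on different instructions; the sketch only shows what each stage contributes to one instruction.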
Simultaneous Instruction Execution
The technique of pipelining allows for the simultaneous execution of a number of different instructions, each instruction being at a different stage in the pipeline. For the TSK3000A, up to five different instructions can be executed simultaneously in the processor's pipeline, as illustrated in Figure 2.
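The overlap described above can be sketched with a small Python model that lists which instruction occupies each stage on every clock cycle. This is an idealized illustration only: it assumes one instruction enters the IF stage per cycle and that no hazards occur, and the instruction names (i1, i2, ...) are placeholders.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]


def pipeline_schedule(instructions):
    """Return one dict per clock cycle mapping stage name -> instruction (or None).

    Idealized model: one instruction enters IF each cycle and nothing stalls.
    """
    if not instructions:
        return []
    total_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth  # instruction that entered IF `depth` cycles ago
            row[stage] = instructions[index] if 0 <= index < len(instructions) else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    program = ["i1", "i2", "i3", "i4", "i5", "i6"]
    for cycle, row in enumerate(pipeline_schedule(program), start=1):
        cells = " ".join(f"{stage}={row[stage] or '--':<3}" for stage in STAGES)
        print(f"cycle {cycle:2d}: {cells}")
```

From the fifth cycle onward every stage holds a different instruction, which is the situation in which five instructions are in execution at once. In this model, N instructions complete in N + 4 cycles rather than 5N.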
Pipeline Hazards
With a pipelined processor such as the TSK3000A, there are a number of events that can disrupt the pipeline, lowering its overall instruction execution rate.
Data Forwarding Hazards
If an instruction in the Execute stage requires the result of a previous instruction as one of its operands, and that previous instruction is still in the pipeline, the result has not yet been written back to the registers. Without special handling, the current instruction could not proceed until the prior instruction had passed through the Write Back stage.
To avoid stalling the pipeline in this case, the TSK3000A "forwards" the data from the prior instruction, making it immediately available to the current instruction in the Execute stage. This process happens transparently to the application software.
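The idea behind forwarding can be sketched in Python as an operand selection step: before using the register file, the Execute stage checks whether an older instruction still in the pipeline is about to write the register it needs. The function name, the register names and the list of pending results are hypothetical, and the sketch does not claim to reflect the actual forwarding paths inside the TSK3000A.

```python
def read_operand(reg, register_file, in_flight):
    """Return the value an instruction in EX should use for register `reg`.

    `in_flight` lists results of older instructions that have not yet been
    written back, newest first, as (destination_register, value) pairs.
    """
    for dest, value in in_flight:
        if dest == reg:
            return value  # forwarded: the newest pending result wins
    return register_file[reg]  # no pending write: the register file is current


if __name__ == "__main__":
    regs = {"r1": 10, "r2": 20, "r3": 0}
    # An older 'add r3, r1, r2' has computed 30 but has not reached WB yet.
    pending = [("r3", 30)]
    print(read_operand("r3", regs, pending))  # value taken from the pipeline
    print(read_operand("r1", regs, pending))  # value taken from the register file
```

Searching the pending results newest first matters when two in-flight instructions write the same register: the dependent instruction must see the most recent value.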
Long Instruction Hazards
Some instructions, notably multiplies and divides, require more than one cycle to execute. In these cases the pipeline is stalled until the instruction completes.
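The cost of such stalls can be estimated with a short calculation, sketched here in Python. The model assumes that the whole pipeline waits for every extra Execute cycle and that no other hazard occurs. The cycle counts passed in are illustrative inputs, not documented TSK3000A latencies.

```python
PIPELINE_DEPTH = 5


def total_cycles(execute_cycles):
    """Cycles needed to run a sequence, given each instruction's EX cycle count.

    Assumes the whole pipeline stalls for every extra EX cycle and that no
    other hazard occurs.
    """
    if not execute_cycles:
        return 0
    if any(cycles < 1 for cycles in execute_cycles):
        raise ValueError("every instruction needs at least one EX cycle")
    fill = PIPELINE_DEPTH - 1  # cycles before the first instruction completes
    stalls = sum(cycles - 1 for cycles in execute_cycles)
    return fill + len(execute_cycles) + stalls


if __name__ == "__main__":
    print(total_cycles([1, 1, 1, 1]))  # four single-cycle instructions
    print(total_cycles([1, 4, 1, 1]))  # second instruction occupies EX for 4 cycles
```

Under these assumptions, replacing one single-cycle instruction with one that needs four Execute cycles adds three stall cycles to the sequence.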
Load Hazards
If the instruction in the Execute stage requires the result of a Load instruction that is in the Memory Access stage, that data is not yet available because it has not been loaded from memory. In this case the processor stalls the first half of the pipeline and lets the memory access complete, effectively inserting a NOP instruction into the instruction flow. Again, this is transparent to the application software.
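The effect of a load hazard on the instruction flow can be sketched in Python as follows. The `Instr` tuple, the operation names and the register names are hypothetical, and the model only covers the case described above, where the instruction immediately after a Load uses the loaded register.

```python
from collections import namedtuple

# Hypothetical instruction record: operation, destination register, source registers.
Instr = namedtuple("Instr", "op dest sources")
NOP = Instr("nop", None, ())


def insert_load_stalls(program):
    """Return the instruction flow as seen by the EX stage, with a NOP after
    every load whose result is needed by the very next instruction."""
    flow = []
    for instr in program:
        previous = flow[-1] if flow else None
        if previous is not None and previous.op == "load" and previous.dest in instr.sources:
            flow.append(NOP)  # the stall appears as a NOP in the flow
        flow.append(instr)
    return flow


if __name__ == "__main__":
    program = [
        Instr("load", "r2", ("r1",)),
        Instr("add", "r3", ("r2", "r4")),  # needs r2 straight away: one stall
        Instr("load", "r5", ("r1",)),
        Instr("sub", "r6", ("r3", "r4")),  # independent of r5: no stall
        Instr("add", "r7", ("r5", "r6")),  # r5 was loaded two instructions earlier
    ]
    print([instr.op for instr in insert_load_stalls(program)])
```

The second Load in the example causes no stall because an independent instruction separates it from its consumer. A compiler or assembly programmer can use the same reordering to hide the load delay.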
Branch Hazards
When the processor encounters a branch or jump in the Execute stage and decides to take it, the instructions in the IF and ID stages were fetched from the sequential path, but execution will continue from a different location.
In this case, on the next rising clock edge (the beginning of the next clock cycle) as the new Program Counter value is loaded, the processor will kill the instruction that is being loaded from the instruction memory, effectively converting it into a NOP. The instruction that was in the ID stage will move into the EX stage and be executed. This instruction is said to be in the 'branch delay slot'.
The instruction that immediately follows a Branch or Jump instruction is therefore executed before the first instruction at the new address. This technique allows the processor to lose only one cycle when taking a branch. Optimizing compilers will attempt to fill the branch delay slot with a useful instruction, increasing the overall throughput of the processor.
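The resulting execution order can be sketched with a small Python model. The program representation (a list of name and target pairs), the instruction names and the always-taken branch are assumptions for illustration. The model does not cover a branch placed in the delay slot of another branch.

```python
def execution_order(program, start=0, max_steps=100):
    """Return the names of the instructions in the order they execute.

    `program` is a list of (name, target) pairs; `target` is the index to jump
    to for an always-taken branch, or None for an ordinary instruction.
    """
    order = []
    pc = start
    pending_target = None  # set by a branch, applied after its delay slot
    while 0 <= pc < len(program) and len(order) < max_steps:
        name, target = program[pc]
        order.append(name)
        if pending_target is not None:
            # This instruction occupied the delay slot; now redirect execution.
            pc, pending_target = pending_target, None
        elif target is not None:
            pending_target = target  # branch taken: the next instruction still runs
            pc += 1
        else:
            pc += 1
    return order


if __name__ == "__main__":
    program = [
        ("add", None),         # index 0
        ("branch", 4),         # index 1: taken branch to index 4
        ("delay_slot", None),  # index 2: executed before the branch target
        ("skipped", None),     # index 3: never executed
        ("target", None),      # index 4
        ("next", None),        # index 5
    ]
    print(execution_order(program))
```

The instruction at index 2 runs even though the branch before it is taken, while the instruction at index 3 is never executed. If no useful instruction can be moved into the delay slot, a NOP is typically placed there instead.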