If designers had depended only on increasing the clock frequency of the processor, CPU performance enhancement would have stalled years ago, as the laws of physics limit how fast a processor can run. Instead, designers have focused on internal structure as well as clock speed to enhance performance. Since the days of the basic 8088 processor, clock speed has gone up by about 650 times, but performance has improved by over 6500 times, which shows how big the contribution of internal structure to performance enhancement is. Some of the major architectural improvements are given next.


Basic microprocessors dealt with 4-bit or 8-bit data at a time, so if we had to operate on 16-bit data, a 4-bit processor would take at least 4 cycles. Nowadays we deal with 16-bit, 32-bit, or wider data at a time, and hence we get higher processing speeds.
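For instance, a 4-bit ALU can still add two 16-bit numbers, but only by chaining four nibble-wide additions with carry, i.e. roughly four times the work of a 16-bit ALU. A toy Python sketch of the idea (the function name is mine, purely illustrative):

```python
def add16_on_4bit_alu(a, b):
    """Add two 16-bit values using only 4-bit (nibble) additions."""
    result, carry = 0, 0
    for i in range(4):  # four passes, one per nibble
        nib = ((a >> (4 * i)) & 0xF) + ((b >> (4 * i)) & 0xF) + carry
        carry = nib >> 4          # carry out of this nibble
        result |= (nib & 0xF) << (4 * i)
    return result & 0xFFFF

print(hex(add16_on_4bit_alu(0x1234, 0x0FCD)))  # 0x2201
```

A 16-bit processor does the same addition in a single pass.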


Nowadays separate processors, called co-processors, are used to process floating-point numbers and other math operations which may take a lot of time on a simple processor. The main processor just hands the data over to the co-processor and waits for the results. Hence FPUs have been a great help in enhancing the speed of processors. E.g. the 8087 is the co-processor for the 8086. Hmmm… this is similar to the case where a person carries a calculator with him just to enhance his performance.


In the basic architecture, the processor first fetches an instruction, then executes it, then fetches the next instruction, and so on. In a pipelined architecture, fetching and execution go on in parallel: first one instruction is fetched, and then, while that instruction is being executed, the 2nd instruction is fetched. In the next cycle, while the 2nd instruction is executed, the third is fetched, and so on. This way fetching and execution proceed in parallel, and we get enhanced performance from the system.

So a 4-instruction program gets executed in 5 cycles, while without pipelining it takes 8 cycles (1 cycle each for fetch and execution).

So speedup=8/5=1.6
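The arithmetic above can be checked with a tiny Python sketch (helper names are mine, for a 2-stage fetch/execute pipeline):

```python
def cycles_without_pipeline(m):
    # each instruction needs 1 fetch cycle + 1 execute cycle
    return 2 * m

def cycles_with_pipeline(m):
    # 2-stage pipeline: 2 cycles for the first instruction,
    # then one more instruction completes every cycle
    return m + 1

m = 4
print(cycles_without_pipeline(m))  # 8
print(cycles_with_pipeline(m))     # 5
print(cycles_without_pipeline(m) / cycles_with_pipeline(m))  # 1.6
```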

But there is a problem in pipelining in the case of branch instructions: when a branch instruction is taken, we have to jump to some new address, so the instruction which has already been fetched has to be flushed.


E.g. –

1. Mov a,b

2. Add a,c

3. Jmp  10

4. Mov b, a



10. Mov c,a

And corresponding pipelining diagram would be

Now we can see there is a wastage of 1 complete cycle due to one branch instruction, so the speedup is lower.
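The wasted cycle can be tallied the same way (a sketch under the same 2-stage assumption; one flushed slot per taken branch):

```python
def cycles_with_branches(m, taken_branches):
    # (m + 1) base cycles for a 2-stage pipeline,
    # plus 1 flushed fetch slot per taken branch
    return (m + 1) + taken_branches

print(cycles_with_branches(4, 1))       # 6 cycles instead of 5
print(8 / cycles_with_branches(4, 1))   # speedup drops from 1.6 to ~1.33
```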


Suppose we have an n-segment pipeline system:

If the first segment takes T1 seconds, the second segment takes T2, and so on,

then the clock period T is given by

T = max(T1, T2, …, Tn) + latch time

And the clock frequency is given by F = 1/T
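As a quick sketch (the stage delays in nanoseconds are made-up illustrative numbers):

```python
def pipeline_clock(stage_delays_ns, latch_ns):
    # the clock period must accommodate the slowest stage plus latch overhead
    period = max(stage_delays_ns) + latch_ns
    return period, 1.0 / period  # period in ns, frequency in GHz

T, F = pipeline_clock([30, 30, 35, 45, 30], 5)
print(T)  # 50 ns period: set by the slowest (45 ns) stage plus the latch
print(F)  # 0.02 GHz = 20 MHz
```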

Suppose we have a total of n segments/stages in the pipeline and m tasks. Then:

Total time taken to execute m instructions = (m+n-1)*T, while without pipelining the time taken is m*n*T

SPEEDUP (ignoring branch instructions) = S(n) = m*n*T / ((m+n-1)*T) = mn / (m+n-1)


Divide the numerator and denominator by m: S(n) = n / (1 + (n-1)/m). For very large m, the speedup approaches n, i.e. the efficiency S(n)/n approaches 1.
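This limit is easy to check numerically (a quick Python sketch):

```python
def speedup(m, n):
    # S(n) = m*n / (m + n - 1), branch-free pipeline
    return m * n / (m + n - 1)

n = 5
for m in (10, 100, 10_000):
    print(m, speedup(m, n))
# as m grows, S(n) approaches n = 5,
# so the efficiency S(n)/n approaches 1
```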

And each branch instruction costs n-1 extra cycles, as shown below for a 5-stage pipeline system. The five stages of the pipeline are:

S1. Instruction fetch

S2. Instruction decode

S3. Operand fetch

S4. Operation execution

S5. Result saving

Let the sample program be

  1. I1
  2. I2
  3. I3 (branch instruction to I4)
  4. I6
  5. I7
  6. I8
  7. I9
  8. I4

And the pipelining system diagram is

So if we have an n-stage pipeline and m instructions, and p is the probability that an instruction is a branch, then m*p is the number of branch instructions, so:

Total number of clock cycles = (m+n-1) + m*p*(n-1)

CPI (no. of clock cycles per instruction) = {(m+n-1) + m*p*(n-1)} / m
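Both formulas can be tried out with a short sketch (the numbers are illustrative):

```python
def total_cycles(m, n, p):
    # (m + n - 1) cycles to fill and drain the pipeline,
    # plus (n - 1) penalty cycles for each of the m*p branches
    return (m + n - 1) + m * p * (n - 1)

def cpi(m, n, p):
    # average clock cycles per instruction
    return total_cycles(m, n, p) / m

print(cpi(1000, 5, 0.0))  # 1.004: close to the ideal CPI of 1
print(cpi(1000, 5, 0.2))  # 1.804: each branch costs n-1 = 4 flushed cycles
```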


The branch predictor attempts to guess, before execution actually reaches the branch instruction, where the program will jump (or branch) next, allowing the prefetch and decode unit to retrieve the corresponding instructions and data in advance so that they are already available when the CPU requests them. This way we reduce the occurrence of the extra cycles due to branch instructions and hence enhance performance. However, if branches are predicted poorly, we may have to flush the entire pipeline.


If we use more than one pipeline in the processor architecture, it is called a superscalar architecture. In this architecture, multiple instructions are executed in one clock cycle. These processors employ out-of-order execution governed by data dependences: the instructions are evaluated, and only independent instructions are executed in parallel.


Mov a, c

Mov r1, b

Mov r2, r3

All these instructions are independent and hence can be executed in parallel, while the instructions

Mov a, b

Mov c, a

are dependent and cannot be executed in parallel.

So this way dependences are checked, and instructions are executed in parallel if found independent.
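The check itself can be modeled crudely in Python; `raw_dependent` and the (dest, src) tuple encoding of a Mov are my own illustrative choices, not a real scheduler:

```python
def raw_dependent(first, second):
    """True if `second` must wait for `first` (read-after-write or
    write-after-write on the same destination). Each instruction is
    modeled as a (dest, src) pair, as in 'Mov dest, src'."""
    dest1, _src1 = first
    dest2, src2 = second
    return src2 == dest1 or dest2 == dest1

print(raw_dependent(("a", "c"), ("r1", "b")))  # False: no shared registers
print(raw_dependent(("a", "b"), ("c", "a")))   # True: second reads 'a'
```

A real superscalar core performs this kind of comparison in hardware across all the instructions in its issue window.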
