Main Memory Design

We generally have standard memory chips available and must interconnect several of them to obtain the memory specification required for our system design. Suppose we have the following standard chip available:

As it is a chip of 1024 (2^10) words with each word of 4 bits, we require 10 address lines to access each word and 4 bidirectional data lines. We also have two active-low control signals, WR bar and CS bar. WR bar selects whether we are writing data into the RAM or reading data out of it, while CS bar (chip select) is used to select the memory chip. Whenever CS bar is low the chip is selected; when CS bar is high the device is disabled and the chip reduces its power consumption.

If we require a memory chip of size M * N and we have a standard memory of m * n, then

No. of standard chips required= (M * N) / (m * n)
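As an illustration, this relation can be sketched in Python (the helper name is hypothetical, and it assumes M and N are multiples of m and n):

```python
# Number of standard m x n chips needed to build an M x N memory,
# assuming M is a multiple of m and N a multiple of n.
def chips_required(M, N, m, n):
    return (M * N) // (m * n)

print(chips_required(1024, 8, 1024, 4))  # 2: widen the word to 1K x 8
print(chips_required(4096, 4, 1024, 4))  # 4: deepen the address space to 4K x 4
```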

Q- If we require a memory of 1K * 8 bits then show the arrangement using the chip given above to achieve the configuration needed.

Ans: As we need to increase the width of the data lines, we connect the two chips horizontally, with the same address lines and control signals feeding both memory chips, and concatenate their data lines as shown:

No. of standard chips required= (1024 * 8) / (1024 * 4) = 2

Q- If we require a memory of 4K * 4 bits then show the arrangement using the chip given above to achieve the configuration needed.

Ans: In this case we need to increase the addressing capability, so we arrange the chips vertically: address lines A0 – A9 are common, while A10 – A11 are used to select the chip, as shown below:

No. of standard chips required= (4096 * 4) / (1024 * 4) = 4

00 0000000000 to 00 1111111111 is address range of chip at bottom

01 0000000000 to 01 1111111111 is address range of chip 2nd from bottom

10 0000000000 to 10 1111111111 is address range of chip 3rd from bottom

11 0000000000 to 11 1111111111 is address range of topmost chip
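The two-bit chip-select decoding above can be sketched as follows (the helper name is hypothetical):

```python
# Decode a 12-bit address for the 4K x 4 arrangement:
# A11..A10 select one of the four 1K chips, A9..A0 address a word inside it.
def decode(addr):
    chip   = (addr >> 10) & 0b11   # upper two bits -> chip select
    offset = addr & 0x3FF          # lower ten bits -> word inside the chip
    return chip, offset

print(decode(0b00_0000000000))  # (0, 0): bottom chip, first word
print(decode(0b11_1111111111))  # (3, 1023): topmost chip, last word
```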


Processor Memory

Memory in a computer system can be classified into the following groups:

  1. Internal memory
  2. Primary memory
  3. Secondary memory

INTERNAL MEMORY refers to the registers of CPU which hold temporary results. These registers are very fast.

PRIMARY MEMORY is the storage area where all programs are executed. The CPU can directly access only those items that are stored in primary memory. Primary memory is fast, but slower than the internal registers by a factor of about 10. Over a short period of time, the addresses generated by the CPU tend to cluster around small regions of main memory. This phenomenon is known as locality of reference. Hence, these days a small but fast memory called cache memory is installed between the CPU and primary memory.

SECONDARY MEMORY refers to storage media like hard disks, CDs etc.

Memory has two major access modes:

  1. Sequential access mode: Here memory can be accessed only in serial order, i.e. to access the 4th memory location we would first have to move past the first three locations, and only then can we access the 4th. This is also called serial access mode. Here the access time depends on the location: the farther the location is from the starting point of the memory, the longer the access time. E.g. magnetic tape.
  2. Random access mode: Any memory location can be accessed at random, i.e. to access the 4th location we can go directly to it. So the access time is independent of the location being accessed.

There is also a memory known as semi-random access memory, which offers both modes. E.g. the memory is divided into different tracks or sectors with a head for each sector: we can move to any sector randomly, but within a sector we can only move serially.

Memory can also be categorized as follows:

Passive memory: Memory in which no processing activity takes place. Only two types of operations can be carried out on it: read and write. E.g. all the memories we use in daily life.

Active memory: This intelligent memory system has a small processor associated with every memory cell and this processor can perform operations like increment & addition and hence can improve the performance of the main processor.


RAM (Random Access Memory)

This is a volatile memory (i.e. it loses its data whenever power is lost). It can be of two types, as shown:

This is static RAM, where two transistors are used to store a single bit and can hold the data without external assistance as long as power is supplied to the circuit. The circuit is also known as a flip-flop or bi-stable multivibrator. If Q is high, we get a high input to transistor T1 and hence Q bar is low; as Q bar is low, we get a LOW input to transistor T2 and hence Q stays HIGH. We can see how the output is maintained by the circuit itself, so we don't need to refresh the circuit again and again.


This is dynamic RAM, where we store the bits in capacitors and must recharge the capacitors periodically because the stored charge leaks away. It is called dynamic RAM because the circuit has to be refreshed again and again, and dedicated circuitry is required just to refresh the capacitors. DRAM is cheaper than SRAM.

ROM (Read Only Memory)

This is a non volatile memory which is used to store permanent codes. Mainly this is of following types:

Mask ROM: This type of ROM is programmed by the manufacturer during fabrication and is mass produced, hence it is quite inexpensive.

PROM (Programmable Read Only Memory): This type of ROM can be programmed by the user in the working environment but cannot be re-programmed.

EPROM (Erasable Programmable ROM): This type of ROM can be reprogrammed, hence the name. It is useful when programs are in the development stage.


Flash Memory

Flash memory is a solid-state, non-volatile, rewritable memory that works like RAM and a hard-disk drive combined. It resembles conventional memory, coming in the form of discrete chips, modules, or memory cards. Just as with DRAM and SRAM, bits of electronic data are stored in memory cells, and just as with a hard disk drive, flash memory is non-volatile, retaining its data even when the power is turned off. Although it has advantages over both RAM (non-volatility) and hard disk (no moving parts), the following are the reasons why flash memory is not a viable replacement for either:

  1. Because of its design, flash memory must be erased in blocks of data rather than single bytes like RAM.
  2. It has a higher cost
  3. The memory cells in a flash chip have a limited lifespan of around 100,000 write cycles, making it an inappropriate alternative to RAM for use as a PC's main memory.


Cache Memory

This memory is used as an intermediary between the CPU and RAM. It is the fastest form of storage and basically decreases the delay in the interaction between CPU and RAM. Data is first transferred to cache memory, and from there it is used by the CPU. Cache memory is implemented using SRAM and is used to bridge the performance gap between the processor and the RAM.

WRITE BACK CACHE: Here, changes made to the cache are not transferred immediately to main memory; they are transferred only when the corresponding data is about to be replaced in the cache. This avoids unnecessary writes to memory.

WRITE THROUGH CACHE: Here, whatever changes are made to the cache are transferred immediately to main memory. This policy is easy to implement but may lead to unnecessary writes to memory.
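The contrast between the two policies can be sketched with a toy model (the class and method names are hypothetical, and this is nowhere near a real cache controller; it only shows when the write reaches memory):

```python
# Toy single-level "cache" contrasting the two write policies.
class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.line = {}
    def write(self, addr, value):
        self.line[addr] = value
        self.memory[addr] = value      # every write goes to memory too

class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.line = {}
        self.dirty = set()
    def write(self, addr, value):
        self.line[addr] = value
        self.dirty.add(addr)           # memory is NOT updated yet
    def evict(self, addr):
        if addr in self.dirty:         # write back only on replacement
            self.memory[addr] = self.line[addr]
            self.dirty.discard(addr)
        self.line.pop(addr, None)

mem = {}
wb = WriteBackCache(mem)
wb.write(0x10, 7)
print(0x10 in mem)   # False: write-back defers the memory update
wb.evict(0x10)
print(mem[0x10])     # 7: the value reaches memory only on eviction
```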


Memory Hierarchy

A memory system is often designed using different technologies to achieve optimum performance, so it is a multilevel organization. The following is a typical computer's memory hierarchy:


Hit Ratio (h)

It is defined as the fraction of references to the cache memory that are successful. Roughly speaking, if h = 0.9, then 9 out of 10 times we access the cache for a particular piece of data, we will find it there.

Q- Calculate the average access time if the cache access time is 150 ns, the memory access time is 900 ns, and the cache hit ratio is h = 0.8. Assume that every time we go to main memory we get the data we want.

Ans: tav = h * tc + (1 – h) * (tc + tm)

When there is a cache hit, we only need to go to the cache memory for the data, so only time tc is taken; otherwise we have to access main memory after unsuccessfully accessing the cache, so the total time taken in that case is tc + tm.

So tav = 0.8 * 150 + (1 – 0.8) * (150 + 900) = 120 + 210 = 330 ns

If there were no cache memory, the average time taken would have been 900 ns, since every access would go to main memory, so tav = 900 ns.

So we see there is a nearly 3× improvement (900/330 ≈ 2.7) in the performance of the system by employing CACHE memory in the hierarchy.
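Plugging the formula tav = h * tc + (1 – h) * (tc + tm) into a short sketch (the function name is hypothetical):

```python
# Average access time with a cache: h is the hit ratio,
# t_c the cache access time and t_m the memory access time (both in ns).
def avg_access_time(h, t_c, t_m):
    return h * t_c + (1 - h) * (t_c + t_m)

print(round(avg_access_time(0.8, 150, 900)))  # 330 ns
print(round(900 / avg_access_time(0.8, 150, 900), 2))  # speedup vs no cache
```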


Instruction Encoding

This topic is intended to build a better understanding of microprocessor instructions, which can be very helpful during an interview at any embedded company. We'll see how instruction opcodes are decided and how they vary with the number of instructions and the instruction format. This is a very easy topic and everyone should go through it.

In general an instruction has 2 components:

  1. Op-code field
  2. Address field

The op-code field tells how the data is to be manipulated, and the address field gives the addresses of the operands. The address field may contain zero, one, two, or more addresses. Consider the instruction

  Mov                A, B

Here Mov is the opcode field and A, B is the address field.

Depending on the different addressing modes one can have different instruction formats:

  1. Zero address instructions
  2. One address instructions
  3. Two address instructions

The size of the instruction word is decided by the designer, depending upon the number of instructions required and the instruction formats. Suppose we have an 8-bit instruction word and each address field is 3 bits. We'll see how many instructions we can have in each format.

Zero address instructions:

As there is no address to be specified, we can use all 8 bits for the opcode, hence we can have 2^8 = 256 opcodes, i.e. 256 instructions with no address fields.

One address instructions:

As we have to specify one 3-bit address field, we use 3 bits for the address and the remaining 5 bits for the opcode. Hence we can have 2^5 = 32 opcodes, i.e. 32 instructions with one address field.

  Opcode (p4 p3 p2 p1 p0)    Address (a2 a1 a0)

  00000                      a2 a1 a0
  00001                      a2 a1 a0
  00010                      a2 a1 a0
  00011                      a2 a1 a0
  ...                        ...
  11111                      a2 a1 a0

Two address instructions:

Here we have to specify two 3-bit address fields, so we use 6 bits for the addresses and the remaining 2 bits for the opcode. So we can have 2^2 = 4 opcodes and hence 4 instructions with two address fields.

And we can assign different opcodes to different instructions that have similar address fields.

Now, if we reduce the number of two-address instructions from 4 to 3, we can have 8 more one-address instructions, as

And if we now reduce the number of one-address instructions from 8 to 7, we can accommodate 8 more zero-address instructions.

So we can have 3 two-address instructions, 7 one-address instructions, and 8 zero-address instructions within a single 8-bit instruction word.
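The counting argument above can be sketched numerically (the helper name is hypothetical):

```python
# Opcode budget for an 8-bit instruction word with 3-bit address fields.
WORD_BITS, ADDR_BITS = 8, 3

def max_opcodes(n_addr_fields):
    # Bits left over after the address fields all go to the opcode.
    return 2 ** (WORD_BITS - n_addr_fields * ADDR_BITS)

print(max_opcodes(0))  # 256 zero-address instructions
print(max_opcodes(1))  # 32 one-address instructions
print(max_opcodes(2))  # 4 two-address instructions

# Expanding opcodes: giving up one of the 4 two-address opcodes frees a
# 2-bit escape prefix; the next 3 bits then encode 8 one-address opcodes.
# Giving up one of those 8 frees another 8 zero-address opcodes.
two, one, zero = 4 - 1, 8 - 1, 8
print((two, one, zero))  # (3, 7, 8), the mix derived in the text
```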




Processor Performance Enhancement

If designers had depended only on increasing the clock frequency of the processor, CPU performance enhancement could have stalled years ago, as the laws of physics limit how fast a processor can run. Instead, designers have focused on the internal structure as well as clock speeds to enhance performance. Since the days of the basic 8088 processors, clock speed has gone up by 650 times but performance has been enhanced by over 6500 times, which shows how large the contribution of the internal structure to performance enhancement has been. Some of the major architectural improvements are given next.


Wider Data Words

Basic microprocessors dealt with 4-bit or 8-bit data at a time, so a 4-bit processor would take at least 4 cycles to operate on 16-bit data. Nowadays we deal with 16-bit, 32-bit, or wider data at a time, hence we achieve higher processing speeds.


Floating Point Unit (Co-processor)

Nowadays separate processors, called co-processors, are used to process floating-point numbers and other math operations that may take a lot of time on a simple processor. The main processor just signals the co-processor, sends it the data, and waits for the results. Hence FPUs have been a great help in enhancing processor speed. E.g. the 8087 is the co-processor for the 8086. This is similar to a person carrying a calculator with him just to enhance his performance.


Pipelining

In a basic (non-pipelined) architecture, an instruction is first fetched, then executed, then the next instruction is fetched, and so on. In a pipelined architecture, fetching and execution go on in parallel: first one instruction is fetched, and then, while the fetched instruction is being EXECUTED, the 2nd instruction is fetched. In the next cycle, while the 2nd instruction is executed, the third is fetched, and so on. This way fetching and execution proceed in parallel, so we get enhanced system performance.

So a 4-instruction program gets executed in 5 cycles, while without pipelining it takes 8 cycles (one each for fetch and execution).

So speedup=8/5=1.6

But pipelining has a problem in the case of branch instructions. When there is a branch instruction, we have to jump to some new address, so the instruction that has already been fetched has to be flushed.


Eg –

1. Mov a,b

2. Add a,c

3. Jmp  10

4. Mov b, a



10. Mov c,a

And corresponding pipelining diagram would be

Now we can see there is a wastage of one complete cycle due to a single branch instruction, so the speedup would be lower.


Suppose we have an n-segment pipelined system:

If the first segment takes time T1, the second segment takes T2, and so on, then the clock period T is given by

T = max(T1, T2, …, Tn) + latch time

And the clock frequency is given by F = 1/T.

Suppose we have a total of n segments/stages in the pipelined system and m tasks; then

Total time taken to execute m instructions = (m + n – 1) * T, while without pipelining the time taken is m * n * T.

SPEEDUP (ignoring branch instructions) = S(n) = m * n * T / ((m + n – 1) * T) = mn / (m + n – 1)


Dividing numerator and denominator by m gives S(n) = n / (1 + (n – 1)/m); as m becomes very large, the speedup approaches n, i.e. the efficiency S(n)/n approaches 1.
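The speedup formula can be checked with a short sketch (the function name is hypothetical):

```python
# Pipeline speedup for n stages and m tasks, ignoring branches:
# S(n) = m*n / (m + n - 1).
def speedup(m, n):
    return m * n / (m + n - 1)

print(speedup(4, 2))  # 1.6: the earlier 4-instruction, 2-stage example (8/5)
print(round(speedup(10**6, 5), 3))  # approaches n = 5 for very large m
```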

And for a single branch instruction there would be n – 1 extra cycles, as shown below for a 5-stage pipeline. The five stages are:

S1. Instruction fetch

S2. Instruction decode

S3. Operand fetch

S4. Operation execution

S5. Result saving

Let the sample program be

  1. I1
  2. I2
  3. I3(branch instruction to I4)
  4. I6
  5. I7
  6. I8
  7. I9
  8. I4

And the pipelining system diagram is

So if we have an n-stage pipeline, m instructions, and p is the probability of a branch instruction, then m*p is the number of branch instructions, so

Total number of clock cycles = (m + n – 1) + m*p*(n – 1)

CPI (no. of clock cycles per instruction) = {(m+n-1) + m*p*(n-1)} / m
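The CPI formula can be sketched as follows (the function name is hypothetical):

```python
# CPI with branch penalty: n stages, m instructions, branch probability p.
# Total cycles = (m + n - 1) + m*p*(n - 1), per the derivation above.
def cpi(m, n, p):
    cycles = (m + n - 1) + m * p * (n - 1)
    return cycles / m

print(round(cpi(1000, 5, 0.0), 3))  # ~1.004: near one instruction per cycle
print(round(cpi(1000, 5, 0.2), 3))  # ~1.804: branches add stall cycles
```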


Branch Prediction

The branch predictor attempts to guess, before execution actually reaches the branch instruction, where the program will jump (or branch) next, allowing the prefetch and decode unit to retrieve the corresponding instructions and data in advance so that they are already available when the CPU requests them. This reduces the occurrence of extra cycles due to branch instructions and hence enhances performance. However, if branches are predicted poorly, we may have to flush the entire pipeline.


Superscalar Architecture

If more than one pipeline is used in the processor architecture, it is called a superscalar architecture. In this architecture multiple instructions are executed in one clock cycle. These processors employ out-of-order execution governed by data dependencies: the instructions are evaluated, and only independent instructions are executed in parallel.


Mov a, c

Mov r1, b

Mov r2, r3

All these instructions are independent and hence can be executed in parallel while the instructions

Mov a, b

Mov c, a

are dependent and can not be executed in parallel.

So, in this way, dependencies are checked, and instructions found to be independent are executed in parallel.
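A very naive version of this dependence check can be sketched for two "mov dst, src" instructions (the function name is hypothetical, and a real scheduler also has to consider write-after-read and write-after-write hazards):

```python
# Naive check: can the second mov run in parallel with the first?
# Each instruction is a (destination, source) register pair.
def depends(first, second):
    dst1, _src1 = first
    dst2, src2 = second
    # second reads or overwrites what first writes -> must stay ordered
    return dst1 in (dst2, src2)

print(depends(("a", "c"), ("r1", "b")))  # False: independent, can pair up
print(depends(("a", "b"), ("c", "a")))   # True: mov c, a needs a's new value
```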


Measuring Processor Performance

Clock speed cannot be the sole criterion for measuring processor performance. However, within the same family of microprocessors (i.e. with the same internal architecture) we can compare clock speeds to check which is faster: a 1.2 GHz Tualatin-core Pentium III, for example, is roughly 20% faster than a 1.0 GHz Tualatin-core Pentium III. But an AMD Athlon XP 3000+, which actually runs at 2.167 GHz, may be faster than an Intel Pentium 4 running at 3.06 GHz, depending on the application. The comparison is complicated because different CPUs have different strengths and weaknesses. For example, the Athlon is generally faster than the Pentium 4 clock for clock on both integer and floating-point operations, but the Pentium 4 has an extended instruction set that may allow it to run optimized software literally twice as fast as the Athlon.



Q- What does the clock frequency mentioned in a processor's configuration (e.g. a 1.7 GHz processor) mean?

Ans: This is the maximum clock frequency at which the processor can run. Each instruction takes a fixed number of clock cycles to execute, and the higher the clock frequency, the shorter the clock cycle time. So the time taken to execute an instruction decreases as the clock frequency increases. Hence, the higher the clock frequency the better (but only within the same family of processors).

Q- Why is there a limit on clock frequency in any processor?

Ans: There are two different reasons that limit the clock frequency:

  1. Delays
  2. Heat

Delay: There is always a fixed amount of time delay in executing a particular operation. A chip is nothing more than a collection of thousands of transistors and the wires that hook them together, and a transistor is just an on/off switch. Whenever there is a state change, we have to charge up or drain off electrons, so there is a finite delay in changing transistor states. There is also a delay in the transmission of signals. This imposes a limit on the clock frequency.

So if we have n different operations and t is the maximum of the delays of those n operations, then the maximum clock frequency is f = 1/t Hz.

Heat: Heat is also dissipated with every change of transistor state. The more transistor state switches there are, the more heat is dissipated, so this is another reason for the limit on the clock frequency.

The industry is continuously working on decreasing delays and on technology to cool the chips, hence clock frequencies keep increasing.


Q- What happens if we apply a clock faster than the rated clock frequency?

Ans: Then there would not be enough time for a particular operation to complete and we may get absurd outputs. The chip may also get destroyed due to overheating, although some chips may work properly when overclocked if they are cooled externally.


Processor architecture types


CISC (Complex Instruction Set Computer)

In this architecture complex instructions are used; each instruction does a lot of work but takes many clock cycles. E.g. 8086, 8088.


RISC (Reduced Instruction Set Computer)

In this architecture simple instructions are used, and one instruction takes only one clock cycle to execute. It is a common misunderstanding that RISC systems have fewer instructions; actually, "reduced" means that the instructions are very simple and take very little time to execute. E.g. MIPS and SPARC.


Von Neumann Architecture

In this architecture data and program are stored together, and programs are fetched from memory for execution by the CPU. The following pattern of instruction execution is followed:

  1. Instruction fetch: the instruction and the necessary data are fetched from memory.
  2. Decode: the instruction and data are separated and the instruction is decoded.
  3. Execute: the instruction is executed, the corresponding data is manipulated, and the results are stored.

e.g.  8086 etc


Harvard Architecture

In this architecture memory is divided into two parts: data and instructions. In a pure Harvard architecture there are two separate memories, with the instruction memory storing instructions only. Many DSPs (Digital Signal Processors) are modified Harvard architectures, designed to access three distinct memory areas simultaneously: the program instructions, the signal data samples, and the filter coefficients.


Endianness

Different computers store their multi-byte data words (i.e. 16-, 32-, or 64-bit words) in RAM in different ways. Each individual byte in a multi-byte word is still separately addressable. Some computers store the most significant byte of a word at the lowest address, while others store the most significant byte at the highest address. This distinction is known as endianness.

Computers that order data with the least significant byte at the lowest address are known as "Little Endian", and computers that order data with the most significant byte at the lowest address are known as "Big Endian". It is easier for a human to view multi-byte data dumped to a screen one byte at a time if it is ordered Big Endian (i.e. most significant byte at the lowest address).
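The two byte orders can be demonstrated with Python's standard struct module:

```python
# How the 32-bit word 0x12345678 is laid out byte by byte
# in each convention (lowest address first).
import struct

word = 0x12345678
print(struct.pack("<I", word).hex())  # little endian: 78563412
print(struct.pack(">I", word).hex())  # big endian:    12345678
```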

The user cannot tell the difference between computers that use the different formats. However, difficulty arises when different types of computers attempt to communicate with one another over a network.



Modern processors have the following internal components:

Execution Unit: This unit processes the instructions.

Bus Interface Unit: This unit interfaces with memory.

Primary cache: This is also called Level 1 or L1 cache. It is a small amount of very fast memory that allows the CPU to retrieve data immediately.

Branch predictor: This is used to predict branch instructions.

Floating point unit: This part is used to manipulate floating point numbers.