



















## **Memory Hierarchy Technology**

- Random Access:
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be "refreshed" regularly
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (as long as power is on)
- "Not-so-random" Access Technology:
  - Access time varies from location to location and from time to time
  - Examples: disk, tape drive, CD-ROM
- The next two lectures will concentrate on random access technology
  - The Main Memory: DRAMs
  - Caches: SRAMs

CS420/520 memory.11, UC. Colorado Springs, Adapted from ©UCB97 & ©UCB03

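## Levels of the Memory Hierarchy

Levels are listed from the upper level (faster) to the lower level (larger). The last three columns describe the unit moved between a level and the level below it.

| Level         | Capacity      | Access time  | Cost                                          | Unit transferred to/from the level below | Unit size     | Managed by       |
|---------------|---------------|--------------|-----------------------------------------------|------------------------------------------|---------------|------------------|
| CPU registers | 100s of bytes | <10s ns      |                                               | Instr. operands                          | 1-8 bytes     | prog./compiler   |
| Cache         | K-M bytes     | 10-100 ns    | \$.01-.001/bit                                | Blocks                                   | 8-128 bytes   | cache controller |
| Main memory   | M bytes       | 100 ns-1 us  | \$.01-.001                                    | Pages                                    | 512-4K bytes  | OS               |
| Disk          | G bytes       | ms           | 10<sup>-3</sup>-10<sup>-4</sup> cents         | Files                                    | Mbytes        | user/operator    |
| Tape          | infinite      | sec-min      | 10<sup>-6</sup>                               |                                          |               |                  |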







Consider an eight-word (block size is one word) direct mapped cache. Show the contents of the cache as it responds to a series of requests (decimal word addresses): 22, 26, 22, 26, 16, 3, 16, 18

| Decimal address | 22    | 26    | 22    | 26    | 16    | 3     | 16    | 18    |
|-----------------|-------|-------|-------|-------|-------|-------|-------|-------|
| Binary address  | 10110 | 11010 | 10110 | 11010 | 10000 | 00011 | 10000 | 10010 |
| Hit or miss     | miss  | miss  | hit   | hit   | miss  | miss  | hit   | miss  |
| Cache index     | 110   | 010   | 110   | 010   | 000   | 011   | 000   | 010   |
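The request sequence above can be replayed with a short simulation. This is a minimal sketch, assuming the cache index is the word address modulo the number of blocks (the low three bits here); the function name `simulate_direct_mapped` is made up for illustration.

```python
def simulate_direct_mapped(addresses, num_blocks=8):
    """Replay word addresses against a direct-mapped cache with one-word blocks.

    Returns a list of (address, index, outcome) tuples.
    """
    cache = [None] * num_blocks          # cache[i] = word address currently held at index i
    trace = []
    for addr in addresses:
        index = addr % num_blocks        # low-order address bits select the cache entry
        if cache[index] == addr:
            outcome = "hit"
        else:
            outcome = "miss"
            cache[index] = addr          # fetch the word, replacing the previous occupant
        trace.append((addr, index, outcome))
    return trace


requests = [22, 26, 22, 26, 16, 3, 16, 18]
for addr, index, outcome in simulate_direct_mapped(requests):
    print(f"{addr:2d}  {addr:05b}  index {index:03b}  {outcome}")
```

The final request (18) misses because it maps to index 010, which still holds address 26.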





## Definition of a Cache Block

- Cache Block: the cache data that has its own cache tag
- Our previous "extreme" example:
  - 4-byte Direct Mapped cache: Block Size = 1 Byte
  - Took advantage of Temporal Locality: if a byte is referenced, it will tend to be referenced again soon.
  - Did not take advantage of Spatial Locality: if a byte is referenced, its adjacent bytes will be referenced soon.
- In order to take advantage of Spatial Locality: increase the block size

| Valid | Cache Tag | Direct Mapped Cache Data |
|-------|-----------|--------------------------|
|       |           | Byte 0                   |
|       |           | Byte 1                   |
|       |           | Byte 2                   |
|       |           | Byte 3                   |
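To illustrate why a larger block captures spatial locality, the sketch below counts misses for a sequential byte scan in a 4-byte direct mapped cache, once with 1-byte blocks and once with a single 4-byte block. The 4-byte block configuration and the function name `count_misses` are assumptions chosen for illustration, not part of the slide.

```python
def count_misses(addresses, cache_bytes, block_bytes):
    """Count misses in a direct-mapped cache for a list of byte addresses."""
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks               # one tag per cache block
    misses = 0
    for addr in addresses:
        block_addr = addr // block_bytes     # which memory block holds this byte
        index = block_addr % num_blocks      # cache entry that block maps to
        tag = block_addr // num_blocks       # remaining high-order bits
        if tags[index] != tag:
            misses += 1
            tags[index] = tag                # load the whole block
    return misses


sequential = list(range(8))                  # bytes 0..7 touched in order
print(count_misses(sequential, cache_bytes=4, block_bytes=1))
print(count_misses(sequential, cache_bytes=4, block_bytes=4))
```

With 1-byte blocks every access in the scan is a miss; with a 4-byte block only the first byte of each block misses.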













Miss rate by cache size:

| Block size (bytes) | 4K    | 16K   | 64K   | 256K  |
|--------------------|-------|-------|-------|-------|
| 16                 | 8.57% | 3.94% | 2.04% | 1.09% |
| 32                 | 7.24% | 2.87% | 1.35% | 0.70% |
| 64                 | 7.00% | 2.64% | 1.06% | 0.51% |
| 128                | 7.78% | 2.77% | 1.02% | 0.49% |
| 256                | 9.51% | 3.29% | 1.15% | 0.49% |

**Figure 5.17** Actual miss rate versus block size for five different-sized caches in Figure 5.16. Note that for a 4 KB cache, 256-byte blocks have a higher miss rate than 32-byte blocks. In this example, the cache would have to be 256 KB in order for a 256-byte block to decrease misses.

Average memory access time by cache size:

| Block size (bytes) | Miss penalty | 4K        | 16K       | 64K       | 256K     |
|--------------------|--------------|-----------|-----------|-----------|----------|
| 16                 | 82           | 8.027     | 4.231     | 2.673     | 1.89     |
| 32                 | 84           | **7.082** | 3.411     | 2.134     | 1.58     |
| 64                 | 88           | 7.160     | **3.323** | **1.933** | **1.44** |
| 128                | 96           | 8.469     | 3.659     | 1.979     | 1.47     |
| 256                | 112          | 11.651    | 4.685     | 2.288     | 1.54     |

**Figure 5.18** Average memory access time versus block size for five different-sized caches in Figure 5.16. Block sizes of 32 and 64 bytes dominate. The smallest average time per cache size is boldfaced.

Average memory access time = hit time + miss rate × miss penalty
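The formula can be checked against the 4K column of the two tables. This is a sketch under one assumption: a hit time of 1 clock cycle, which the slide does not state but which reproduces the tabulated access times. The names `amat`, `MISS_PENALTY`, and `MISS_RATE_4K` are illustrative.

```python
HIT_TIME = 1  # assumption: 1 clock cycle, inferred because it reproduces the table values

# Miss penalty (clock cycles) and 4K-cache miss rate (%) per block size in bytes.
MISS_PENALTY = {16: 82, 32: 84, 64: 88, 128: 96, 256: 112}
MISS_RATE_4K = {16: 8.57, 32: 7.24, 64: 7.00, 128: 7.78, 256: 9.51}


def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


amat_4k = {
    block: amat(HIT_TIME, MISS_RATE_4K[block] / 100, MISS_PENALTY[block])
    for block in MISS_PENALTY
}
for block, cycles in amat_4k.items():
    print(f"{block:3d}-byte blocks: {cycles:.3f} cycles")

best_block = min(amat_4k, key=amat_4k.get)   # block size with the smallest access time
print("best block size for the 4K cache:", best_block)
```

The larger miss penalty of big blocks is what pushes the minimum toward 32-byte blocks for the 4K cache, even though 64-byte blocks have the lower miss rate.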













Question 1: Where can a block be placed?

| Scheme name       | Number of sets                            | Blocks per set            |
|-------------------|-------------------------------------------|---------------------------|
| Direct mapped     | Number of blocks in cache                 | 1                         |
| Set associative   | Number of blocks in cache / associativity | Associativity             |
| Fully associative | 1                                         | Number of blocks in cache |

Question 2: How is a block found?

| Associativity   | Location method | Comparisons required    |
|-----------------|-----------------|-------------------------|
| Direct mapped   | Index           | 1                       |
| Set associative | Index the set   | Degree of associativity |
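The placement rules above reduce to one calculation: the number of sets is the number of blocks divided by the associativity, and a block maps to the set given by its block address modulo the number of sets. A minimal sketch follows; the function names and the 8-block cache with block address 12 are assumptions chosen for illustration.

```python
def num_sets(num_blocks, associativity):
    """Number of sets = number of blocks in cache / associativity."""
    if num_blocks % associativity != 0:
        raise ValueError("associativity must divide the number of blocks")
    return num_blocks // associativity


def set_index(block_address, num_blocks, associativity):
    """Set that a memory block maps to; the block may go in any way of that set."""
    return block_address % num_sets(num_blocks, associativity)


NUM_BLOCKS = 8       # hypothetical cache size in blocks
BLOCK_ADDRESS = 12   # hypothetical memory block address

for name, ways in [("direct mapped", 1), ("2-way set associative", 2),
                   ("fully associative", NUM_BLOCKS)]:
    print(f"{name}: {num_sets(NUM_BLOCKS, ways)} sets, "
          f"block {BLOCK_ADDRESS} -> set {set_index(BLOCK_ADDRESS, NUM_BLOCKS, ways)}, "
          f"{ways} tag comparison(s)")
```

Direct mapped and fully associative are the two endpoints of the same scheme: associativity 1 and associativity equal to the number of blocks.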



- 4B cache, 1B blocks, direct mapped cache
- 4 blocks, 4 sets

| Current contents | Access address | Hit/Miss (which) |
|------------------|----------------|------------------|
| 4, 1, 2, 7       | 4              | Hit              |
| 4, 1, 2, 7       | 8              | Compulsory miss  |
| 8, 1, 2, 7       | 5              | Compulsory miss  |
| 8, 5, 2, 7       | 1              | Conflict miss    |
| 8, 1, 2, 7       | 6              | Compulsory miss  |
| 8, 1, 6, 7       | 4              | Capacity miss    |

- First access to a block -> compulsory miss
- N = 4 distinct blocks used since the last access to the block -> capacity miss
- Any other miss -> conflict miss
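The classification rule can be sketched by running a fully associative LRU cache of the same size next to the direct mapped one: a miss on a block never seen before is compulsory, a miss the fully associative cache would also take is a capacity miss, and the rest are conflict misses. The slide does not say in what order the initial contents 4, 1, 2, 7 were loaded, so the warm-up order `[2, 7, 1, 4]` below is an assumption, as is the function name `classify`.

```python
from collections import OrderedDict


def classify(accesses, initial_contents, warmup_order):
    """Label each access as a hit or as a compulsory/conflict/capacity miss."""
    num_blocks = len(initial_contents)
    direct = list(initial_contents)                 # direct[i] = block held at index i
    seen = set(initial_contents)                    # blocks referenced at least once
    # Shadow fully associative LRU cache: least recently used first, most recent last.
    lru = OrderedDict((block, None) for block in warmup_order)
    results = []
    for block in accesses:
        index = block % num_blocks
        if direct[index] == block:
            kind = "hit"
        elif block not in seen:
            kind = "compulsory miss"                # first reference ever
        elif block in lru:
            kind = "conflict miss"                  # full associativity would have hit
        else:
            kind = "capacity miss"                  # even full associativity misses
        direct[index] = block
        seen.add(block)
        if block in lru:
            lru.move_to_end(block)                  # mark as most recently used
        else:
            if len(lru) >= num_blocks:
                lru.popitem(last=False)             # evict the least recently used block
            lru[block] = None
        results.append((block, kind))
    return results


# Assumed warm-up order for the initial contents (not given on the slide).
outcome = classify([4, 8, 5, 1, 6, 4], initial_contents=[4, 1, 2, 7],
                   warmup_order=[2, 7, 1, 4])
for block, kind in outcome:
    print(block, kind)
```

A different warm-up order can change the conflict/capacity split, since it changes which blocks the shadow LRU cache evicts first.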











Data cache misses per 1000 instructions, by associativity and replacement policy (block size 64B; 10 SPEC2000 benchmarks):

| Size  | Two-way LRU | Two-way Random | Two-way FIFO | Four-way LRU | Four-way Random | Four-way FIFO |
|-------|-------------|----------------|--------------|--------------|-----------------|---------------|
| 16KB  | 114.1       | 117.3          | 115.5        | 111.7        | 115.1           | 113.3         |
| 64KB  | 103.4       | 104.3          | 103.9        | 102.4        | 102.3           | 103.1         |
| 256KB | 92.2        | 92.1           | 92.5         | 92.1         | 92.1            | 92.5          |

What does it tell us?

- For the largest cache, there is little difference between the replacement schemes.
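How two of these policies differ can be seen on a toy trace. The sketch below counts misses for LRU and FIFO in a small fully associative cache; the trace and the capacity of 3 blocks are made up for illustration and have nothing to do with the SPEC2000 measurements in the table. Random replacement is left out so the result stays deterministic.

```python
from collections import OrderedDict


def count_policy_misses(trace, capacity, policy):
    """Count misses in a fully associative cache using 'LRU' or 'FIFO' replacement."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    cache = OrderedDict()                    # front = next victim
    misses = 0
    for block in trace:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)     # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)        # evict the block at the front
        cache[block] = None
    return misses


toy_trace = [1, 2, 3, 1, 4, 1, 2]            # hypothetical block reference sequence
lru_misses = count_policy_misses(toy_trace, capacity=3, policy="LRU")
fifo_misses = count_policy_misses(toy_trace, capacity=3, policy="FIFO")
print("LRU misses:", lru_misses)
print("FIFO misses:", fifo_misses)
```

On this trace FIFO evicts block 1 although it was just reused, which costs it one extra miss compared with LRU.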











Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets).

- Write Mem[100];
- Write Mem[100];
- Read Mem[200];
- Write Mem[200];
- Write Mem[100];

What are the numbers of hits and misses when using write not allocate and write allocate, respectively?

- Write not allocate: m, m, m, h, m (4 misses and 1 hit)
- Write allocate: m, h, m, h, h (2 misses and 3 hits)

Observation: write through is typically combined with write not allocate.
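The two answers can be reproduced with a few lines of simulation. This is a minimal sketch, assuming the cache is large enough that nothing is ever evicted (the "many cache entries" of the example), so a set of resident addresses is enough; the function name `simulate_writes` is illustrative.

```python
def simulate_writes(ops, write_allocate):
    """Return 'h'/'m' per operation for a fully associative cache that never evicts."""
    resident = set()                         # addresses currently held in the cache
    outcomes = []
    for kind, addr in ops:
        if addr in resident:
            outcomes.append("h")
        else:
            outcomes.append("m")
            # Read misses always allocate; write misses allocate only under write allocate.
            if kind == "read" or write_allocate:
                resident.add(addr)
    return outcomes


ops = [("write", 100), ("write", 100), ("read", 200), ("write", 200), ("write", 100)]
print("write not allocate:", simulate_writes(ops, write_allocate=False))
print("write allocate:    ", simulate_writes(ops, write_allocate=True))
```

Under write not allocate, address 100 never enters the cache because it is only ever written, so all three writes to it miss.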





## Memory Performance Example I

Q1: Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a machine (M1) has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all misses, determine how much faster a machine (M2) would run with a perfect cache that never missed. The frequency of all loads and stores in gcc is 36%.

Answer:

- Instruction miss cycles = I × 2% × 40 = 0.80 I
- Data miss cycles = I × 36% × 4% × 40 = 0.576 I ≈ 0.58 I
- M1: the CPI with memory stalls is 2 + (0.80 + 0.58) = 2 + 1.38 = 3.38
- M2: the CPI with a perfect cache is 2
- The performance with the perfect cache is: Perf\_M2 / Perf\_M1 = CPI\_M1 / CPI\_M2 = 3.38 / 2 = 1.69

Q2: Suppose we speed up the machine by reducing the CPI from 2 to 1 without changing the clock rate or the memory system. How much faster is M2 than M1?
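The arithmetic in Q1 can be written as a small function so that the base CPI is a parameter. This is a sketch using only the numbers given in the question; the function name `cpi_with_stalls` and its parameter names are illustrative.

```python
def cpi_with_stalls(base_cpi, instr_miss_rate, data_miss_rate, mem_ref_freq, miss_penalty):
    """CPI including memory stall cycles per instruction."""
    instr_stalls = instr_miss_rate * miss_penalty               # every instruction is fetched
    data_stalls = mem_ref_freq * data_miss_rate * miss_penalty  # only loads/stores access data
    return base_cpi + instr_stalls + data_stalls


# Q1: base CPI of 2, 2% instruction misses, 4% data misses, 36% loads/stores, 40-cycle penalty.
cpi_m1 = cpi_with_stalls(base_cpi=2, instr_miss_rate=0.02, data_miss_rate=0.04,
                         mem_ref_freq=0.36, miss_penalty=40)
cpi_m2 = 2                                   # perfect cache: no stall cycles
speedup = cpi_m1 / cpi_m2                    # same clock rate and instruction count
print(f"M1 CPI = {cpi_m1:.2f}, M2 is {speedup:.2f}x faster")
```

Q2 can be explored by calling the same function with `base_cpi=1` and dividing by a perfect-cache CPI of 1; the stall cycles stay the same, so they make up a larger share of the total.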


























## Eleven Advanced Cache Optimizations

- Reducing the hit time: small and simple caches, way prediction, and trace caches
- Increasing cache bandwidth: pipelined caches, multi-banked caches, and non-blocking caches
- Reducing the miss penalty: critical word first and merging write buffers
- Reducing the miss rate: compiler optimization
- Reducing the miss penalty or miss rate via parallelism: hardware pre-fetching and compiler pre-fetching
- Where to read about the memory hierarchy:
  - CO 4: Chapter 5
  - CA 5: Appendix B



- The Principle of Locality: Temporal Locality vs. Spatial Locality
- Four Questions for Any Cache
  - Where to place a block in the cache
  - How to locate a block in the cache
  - Replacement: Random, LRU, NRU, LFU
  - Write policy: Write through vs. Write back
    - Write miss: Write Allocate vs. Write Not Allocate
    - Write buffer
- Three Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
  - Capacity Misses: increase cache size
- Three general options to improve cache performance:
  - Reduce the miss rate (increase the hit rate),
  - Reduce the miss penalty, or
  - Reduce the time to hit in the cache