CS520 S99 Final

CS520 S99 Final
Please email me your answer by 3/20 Saturday 11:59pm

I/O System Design Example (similar to page 530 exercise)

Given CPU 500MIPS; 32-byte-wide memory, 60 ns cycle time;

100 MB/s I/O Bus with room for 20 ultra wide SCSI-2 buses/controller (also called strings); ultra wide SCSI-2 buses can transfer 40 MB/s, 15 disks/bus;

A ultra wide SCSI-2 bus controller costs $300 and adds 0.5 ms overhead to Disk I/O; OS uses 10K CPU instructions for a disk I/O.

You have choices of large disk — 20 GB; small disk — 5 GB; ($0.1/MB).

Assume that a computer system with CPU+mem+I/O Bus (without SCSI controller and hard drives) cost $3000.

Disks rotate at 10000 RPM with 8 ms average seek time & 40 MB/s transfer time. The I/O system requires 1000 GB storage capacity and average I/O size is 32 KB. Evaluate the cost per I/O per second (IOPS) of using small and large drives. Assume that every disk I/O requires an average seek and rotational delay; all devices are used in 100% capacity; and work load is evenly divided among all disks.

What are the IOPS for disks, SCSI strings, memory, I/O bus, and CPU?
With the design goal of having all the hard drives operating at their full potential and allow multiple computer system. What will be the final system configuration and its cost?

Byte Ordering and Memory Alignment.

The following data structure is used in recording simulation data.

struct timedLocation {

unsigned char msgType;

short sessionID;

double longitude; // double precision FP #

double latitude;

} tloc;

How many bytes will be allocated for tloc on a SPARC?
If tloc.sessionID is assigned with value of 3, and tloc.session is allocated at a memory location starting at 2000, what are the values for bytes at address 2000 and 2001? Express them in hexadecimal and assume this is big-endian machine.
If the binary data of tloc variable was created by a SPARC and then read in by a PC, what is the tloc.sessionID value that will be print out or evaluated by the PC?

Problem 3. Pipeline hazards and Pipeline Scheduling.

Given the following DLX code that computes Y=a*X+b*Y where X and Y are integer vectors of 100 elements.

        ADDI    R1, R0, #1         ;; keep the value of i in register R1
        LW        R2, 1500(R0)    ;; keep the value of a in R2
        LW        R3, 2500(R0)    ;; keep the value of b in R3
        ADDI     R4, R0, #100    ;; keep the loop count, 100, in R4
L1:   SLLI      R5, R1, #2         ;; multiply i by 4
         LW        R6, 5000(R5)    ;; load X[i] with its address = 5000+R5
         MULT    R6, R2, R6       ;; a*X[i]
         LW        R7, 6000(R5)    ;; load Y[i] with its address = 6000+R5
         MULT    R7, R3, R7       ;; b*Y[i}
         ADD      R7, R6, R7        ;; a*X[i]+b*Y[i]
         SW        6000(R5), R7    ;; Y[i]=a*X[i]+b*Y[i]
         ADDI     R1, R1, #1         ;; i++
         SLE       R8, R1, R3
         BNEZ    R8, L1
L2:    SW        2000(R0), R1    ;; save final value of i to memory

Is there a pipeline hazard between MULT R7, R3, R7 and ADD R7, R6, R7? If there is a pipeline hazard, how to solve it?
Is there a pipeline hazard between ADD R7, R6, R7 and SW 6000(R5), R7? If there is a pipeline hazard, how to solve it?
The ADDI R1, R1, #1 can be scheduled to be executed after LW R6, 5000(R5) to fill the delay slot and avoid one stall cycle. But there is another instruction that is also a good candidate to fill that delay slot (probably a better candidate.) Please identify the instruction.
Show the improved code after pipeline-scheduling is applied to avoid all possible pipeline hazards.

Assume the latencies as shown in Figure 4.2 page 224. The following code add the two elements of vectors a and b and assign it to vector c.

Loop: LD F0, 400(R1) ; Load a[i] from 400(R1)

LD F2, 200(R1) ; Load b[i] from 200(R1)

ADDD F4,F2,F0 ; F4=a[i]+b[i]

SD 0(R1), F4 ; c[i]=a[i]+b[i]

SUB R1,R1,#8 ; Decrease the index

BNEZ R1, Loop

Without loop unrolling, how many clock cycles does it take to finish a loop element?
Unwind the loop three times, show the code after renaming registers, removing redundant instructions, and scheduling the code to optimize its performance.
How many clock cycles does it take to finish an element in the improved code?
Assume a DLX superscalar which can issues 4 independent instructions (any combination of integer and FP operations) in one cycle and has plenty of functional units to carry out the concurrent execution of instructions. Repeat b) and c).

CS520 S99 Final Please email me your answer by 3/20 Saturday 11:59pm

CS520 S99 Final
Please email me your answer by 3/20 Saturday 11:59pm