Problem 1. I/O System Design Example (similar to the exercise on page 530).
Given: a 500 MIPS CPU; 32-byte-wide memory with a 60 ns cycle time; a 100 MB/s I/O bus with room for 20 Ultra Wide SCSI-2 buses/controllers (also called strings). Each Ultra Wide SCSI-2 bus can transfer 40 MB/s and supports 15 disks. An Ultra Wide SCSI-2 bus controller costs $300 and adds 0.5 ms of overhead to each disk I/O. The OS uses 10K CPU instructions per disk I/O.
You have a choice of a large disk (20 GB) or a small disk (5 GB), at $0.1/MB.
Assume that a computer system with CPU + memory + I/O bus (without SCSI controllers and hard drives) costs $3000.
Disks rotate at 10000 RPM with an 8 ms average seek time and a 40 MB/s transfer rate. The I/O system requires 1000 GB of storage capacity, and the average I/O size is 32 KB. Evaluate the cost per I/O per second (IOPS) of using small and large drives. Assume that every disk I/O requires an average seek and rotational delay, that all devices are used at 100% of capacity, and that the workload is evenly divided among all disks.
a) What are the IOPS for the disks, SCSI strings, memory, I/O bus, and CPU?
b) With the design goal of having all the hard drives operate at their full potential, and allowing multiple computer systems, what will the final system configuration be, and what will it cost?
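The per-component IOPS arithmetic can be sketched as follows. This is only an illustration under stated assumptions: decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes), an average rotational delay of half a revolution, and the 0.5 ms controller overhead counted as part of each disk I/O's service time. The function names are hypothetical and not part of the problem.

```python
# Assumption: decimal units (1 KB = 1000 B, 1 MB = 10**6 B).
KB = 1000
MB = 10**6

IO_SIZE = 32 * KB  # average I/O size in bytes


def disk_iops(seek_ms=8.0, rpm=10_000, transfer_mb_s=40.0, controller_ms=0.5):
    """IOPS of one disk: seek + half a rotation + transfer + controller overhead.

    Assumption: the controller overhead is added to every disk I/O.
    """
    rotation_ms = 0.5 * 60_000 / rpm                      # average rotational delay
    transfer_ms = IO_SIZE / (transfer_mb_s * MB) * 1000   # time to move one I/O
    return 1000 / (seek_ms + rotation_ms + transfer_ms + controller_ms)


def bandwidth_iops(bytes_per_second):
    """IOPS of a component that is limited only by its bandwidth."""
    return bytes_per_second / IO_SIZE


def component_iops():
    """Maximum IOPS each component could sustain on its own."""
    return {
        "disk": disk_iops(),
        "scsi_string": bandwidth_iops(40 * MB),
        "memory": bandwidth_iops(32 / 60e-9),  # 32 bytes every 60 ns
        "io_bus": bandwidth_iops(100 * MB),
        "cpu": 500e6 / 10_000,                 # 500 MIPS / 10K instructions per I/O
    }


for name, iops in component_iops().items():
    print(f"{name:12s} {iops:10.1f} IOPS")
```

Comparing these figures shows which component saturates first; the configuration question then comes down to how many disks, strings, and computer systems are needed before the disks themselves become the limit.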
Problem 2. Byte Ordering and Memory Alignment.
The following data structure is used in recording simulation data.
struct timedLocation {
    unsigned char msgType;
    short sessionID;
    double longitude;   // double-precision FP number
    double latitude;
} tloc;
a) How many bytes will be allocated for tloc on a SPARC?
b) If tloc.sessionID is assigned the value 3, and tloc.sessionID is allocated at a memory location starting at address 2000, what are the values of the bytes at addresses 2000 and 2001? Express them in hexadecimal and assume a big-endian machine.
c) If the binary data of the tloc variable was created by a SPARC and then read in by a PC, what tloc.sessionID value will be printed or evaluated by the PC?
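The alignment and byte-order reasoning behind these questions can be checked with a small sketch. It assumes a 1-byte char, a 2-byte short, an 8-byte double, and natural alignment (each field aligned to its own size, the struct padded to its largest alignment); a real compiler or ABI may differ. The helper `layout` is hypothetical.

```python
import struct


def layout(fields):
    """Return ({name: offset}, total size) under natural alignment.

    Each field is (name, size). Assumption: a field is aligned to its own
    size, and the struct is padded to a multiple of its largest alignment.
    """
    offsets, offset, max_align = {}, 0, 1
    for name, size in fields:
        offset = -(-offset // size) * size      # round up to the field's alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, size)
    total = -(-offset // max_align) * max_align  # tail padding
    return offsets, total


# Assumed field sizes: 1-byte char, 2-byte short, 8-byte double.
TIMED_LOCATION = [("msgType", 1), ("sessionID", 2), ("longitude", 8), ("latitude", 8)]

offsets, size = layout(TIMED_LOCATION)
print(offsets, size)

# The value 3 stored as a 16-bit integer by a big-endian machine ...
big_endian = struct.pack(">h", 3)
print([f"0x{b:02X}" for b in big_endian])

# ... and the same two bytes interpreted by a little-endian machine.
(as_little_endian,) = struct.unpack("<h", big_endian)
print(as_little_endian)
```

The last step reads the raw bytes without any conversion, which is what happens when a binary file written on one machine is loaded directly into the same struct on a machine with the opposite byte order.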
Problem 3. Pipeline Hazards and Pipeline Scheduling.
The following DLX code computes Y = a*X + b*Y, where X and Y are integer vectors of 100 elements.
      ADDI R1, R0, #1      ;; keep the value of i in register R1
      LW   R2, 1500(R0)    ;; keep the value of a in R2
      LW   R3, 2500(R0)    ;; keep the value of b in R3
      ADDI R4, R0, #100    ;; keep the loop count, 100, in R4
L1:   SLLI R5, R1, #2      ;; multiply i by 4
      LW   R6, 5000(R5)    ;; load X[i] with its address = 5000+R5
      MULT R6, R2, R6      ;; a*X[i]
      LW   R7, 6000(R5)    ;; load Y[i] with its address = 6000+R5
      MULT R7, R3, R7      ;; b*Y[i]
      ADD  R7, R6, R7      ;; a*X[i]+b*Y[i]
      SW   6000(R5), R7    ;; Y[i]=a*X[i]+b*Y[i]
      ADDI R1, R1, #1      ;; i++
      SLE  R8, R1, R4      ;; R8 = 1 if i <= loop count (R4)
      BNEZ R8, L1          ;; repeat while i <= 100
L2:   SW   2000(R0), R1    ;; save final value of i to memory
a) Is there a pipeline hazard between MULT R7, R3, R7 and ADD R7, R6, R7? If so, how can it be resolved?
b) Is there a pipeline hazard between ADD R7, R6, R7 and SW 6000(R5), R7? If so, how can it be resolved?
c) ADDI R1, R1, #1 can be scheduled to execute after LW R6, 5000(R5) to fill the delay slot and avoid one stall cycle. However, another instruction is also a good (probably better) candidate to fill that delay slot. Identify that instruction.
d) Show the improved code after pipeline scheduling is applied to avoid all possible pipeline hazards.
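A first step in answering the hazard questions is to find read-after-write (RAW) dependences between instruction pairs. The sketch below models each instruction as a hypothetical (destination, sources) tuple; it only detects the dependence. Whether that dependence actually costs stall cycles depends on the pipeline's forwarding paths and operation latencies, which are not modelled here.

```python
def raw_hazard(producer, consumer):
    """Registers written by `producer` that `consumer` reads (RAW dependences)."""
    dest, _ = producer
    _, sources = consumer
    return {dest} & set(sources) if dest else set()


# Each instruction is modelled as (destination register, source registers).
# A store has no destination; it reads both the address base and the data.
MULT_R7 = ("R7", ("R3", "R7"))   # MULT R7, R3, R7
ADD_R7 = ("R7", ("R6", "R7"))    # ADD  R7, R6, R7
SW_Y = (None, ("R5", "R7"))      # SW   6000(R5), R7
ADDI_I = ("R1", ("R1",))         # ADDI R1, R1, #1

print(raw_hazard(MULT_R7, ADD_R7))  # {'R7'}
print(raw_hazard(ADD_R7, SW_Y))     # {'R7'}
print(raw_hazard(ADDI_I, SW_Y))     # set()
```

An empty result for a pair means the second instruction does not read a register the first one writes, which is one of the conditions to check before moving an instruction into a delay slot.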
Problem 4. Assume the latencies shown in Figure 4.2 on page 224. The following code adds the corresponding elements of vectors a and b and assigns the sum to vector c.
Loop: LD   F0, 400(R1)     ; Load a[i] from 400(R1)
      LD   F2, 200(R1)     ; Load b[i] from 200(R1)
      ADDD F4, F2, F0      ; F4=a[i]+b[i]
      SD   0(R1), F4       ; c[i]=a[i]+b[i]
      SUB  R1, R1, #8      ; Decrease the index
      BNEZ R1, Loop
a) Without loop unrolling, how many clock cycles does it take to finish one element (one loop iteration)?
b) Unroll the loop three times. Show the code after renaming registers, removing redundant instructions, and scheduling the code to optimize its performance.
c) How many clock cycles does it take to finish an element in the improved code?
d) Assume a DLX superscalar that can issue 4 independent instructions (any combination of integer and FP operations) in one cycle and has plenty of functional units to carry out the concurrent execution of instructions. Repeat b) and c).
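Counting cycles for a given schedule can be mechanized with a small issue-time model. The sketch below is an assumption-laden illustration: it models a single-issue, in-order pipeline, the latency values are placeholders (not taken from the cited figure) and should be replaced with the figure's numbers, and it does not model the branch delay slot or any stall between an integer operation and a dependent branch. The function `issue_cycles` and the instruction tuples are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer, keyed by
# (producer kind, consumer kind). ASSUMED placeholder values for illustration;
# substitute the numbers from the latency figure the problem cites.
ASSUMED_LATENCY = {
    ("load", "fp_alu"): 1,
    ("fp_alu", "store"): 2,
    ("fp_alu", "fp_alu"): 3,
    ("load", "store"): 0,
}


def issue_cycles(program, latency):
    """Issue cycle of each instruction on a single-issue, in-order pipeline.

    `program` is a list of (kind, dest, sources). An instruction issues at
    least one cycle after its predecessor, and no sooner than `stall + 1`
    cycles after the producer of any of its source registers.
    """
    ready = {}  # register -> (producer kind, producer issue cycle)
    cycles, cycle = [], 0
    for kind, dest, sources in program:
        cycle += 1
        for reg in sources:
            if reg in ready:
                prod_kind, prod_cycle = ready[reg]
                stall = latency.get((prod_kind, kind), 0)
                cycle = max(cycle, prod_cycle + stall + 1)
        cycles.append(cycle)
        if dest:
            ready[dest] = (kind, cycle)
    return cycles


LOOP = [
    ("load", "F0", ("R1",)),         # LD   F0, 400(R1)
    ("load", "F2", ("R1",)),         # LD   F2, 200(R1)
    ("fp_alu", "F4", ("F2", "F0")),  # ADDD F4, F2, F0
    ("store", None, ("R1", "F4")),   # SD   0(R1), F4
    ("int_alu", "R1", ("R1",)),      # SUB  R1, R1, #8
    ("branch", None, ("R1",)),       # BNEZ R1, Loop
]

print(issue_cycles(LOOP, ASSUMED_LATENCY))  # [1, 2, 4, 7, 8, 9]
```

Gaps between consecutive issue cycles are the stalls that unrolling and rescheduling try to fill; feeding an unrolled, reordered instruction list to the same function gives the cycle count to divide by the number of elements processed per iteration.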