The existing FDDI and ATM networks offer high-speed bandwidth among
connected units. Explore the design issues in using these high-speed
networks as an interconnection network that connects processors and
storage systems, and compare a design using FDDI with one using ATM.
In particular, examine the protocol functionality provided by the
high-speed communication protocols in these networks, compare it with
the functionality provided by buses or interconnection networks, and
design additional network primitives to facilitate the construction
of distributed applications.
(Suggested by Richard Lary, Peter T. McLean, and David W. Thiel of
Digital Equipment Corporation.)
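As a starting point, here is a minimal sketch of what such network
primitives might look like for remote storage access; the names
(net_read_block, net_write_block) and the fixed 4KB transfer unit are
assumptions for illustration, not part of any FDDI or ATM standard.

    /* Hypothetical remote-storage primitives layered on a high-speed
     * network (FDDI or ATM).  All names and sizes are illustrative. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NET_BLOCK 4096          /* assumed transfer unit */

    typedef struct { uint32_t node; uint32_t lun; } net_dev_t;

    /* Read one block from a remote storage device into buf.
     * A real version would build an FDDI frame or ATM cell stream;
     * this stub just zero-fills to keep the sketch runnable. */
    int net_read_block(net_dev_t dev, uint64_t blkno, void *buf) {
        (void)dev; (void)blkno;
        memset(buf, 0, NET_BLOCK);
        return 0;                   /* 0 = success */
    }

    /* Write one block to a remote storage device. */
    int net_write_block(net_dev_t dev, uint64_t blkno, const void *buf) {
        (void)dev; (void)blkno; (void)buf;
        return 0;
    }

    int main(void) {
        net_dev_t disk = { 3, 0 };  /* node 3, logical unit 0 */
        char buf[NET_BLOCK];
        if (net_read_block(disk, 17, buf) == 0)
            printf("read block 17 from node %u\n", disk.node);
        return 0;
    }

Part of the project is deciding which of these functions belongs in
the network protocol itself and which belongs in the host interface.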
Multimedia Computing. With advances in video and voice processing
equipment and in communication networks, the prices of both the
equipment and the communication bandwidth are falling. Some big
companies, such as IBM and Apple, are emphasizing multimedia
computing. One of the problems with existing computer architectures
in supporting multimedia computing is that the low-speed bus cannot
handle the traffic among video equipment, the frame buffer, and
communication devices. Assume that we replace these low-speed buses
with small high-speed switches, such as cross-point switches or ATM
switches. Design the primitives for this new multimedia computer
architecture. You will have to collect the operations that are needed
to support those devices. You can design these primitives at the
programming level or implement them at the ISA level. Discuss the
trade-offs of these two approaches.
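As one hedged illustration of what such primitives might look like at
the programming level, consider connection setup and teardown on a
cross-point switch; every name here (sw_connect, sw_disconnect, the
port assignments) is hypothetical:

    /* Illustrative primitives for a cross-point switch that replaces
     * the bus among video equipment, frame buffer, and network
     * devices.  All names and port assignments are assumptions. */
    #include <stdio.h>

    enum { PORT_CPU, PORT_VIDEO_IN, PORT_FRAMEBUF, PORT_NET, NPORTS };

    static int xbar[NPORTS];        /* xbar[dst] = src, or -1 if idle */

    /* Establish a point-to-point path through the switch. */
    int sw_connect(int src, int dst) {
        if (xbar[dst] != -1) return -1;   /* destination port busy */
        xbar[dst] = src;
        return 0;
    }

    /* Tear the path down when the stream ends. */
    void sw_disconnect(int dst) { xbar[dst] = -1; }

    int main(void) {
        for (int i = 0; i < NPORTS; i++) xbar[i] = -1;
        /* Route live video straight to the frame buffer, bypassing
         * the CPU and memory bus entirely. */
        if (sw_connect(PORT_VIDEO_IN, PORT_FRAMEBUF) == 0)
            printf("video -> framebuffer path established\n");
        sw_disconnect(PORT_FRAMEBUF);
        return 0;
    }

An ISA-level design would instead expose connect/disconnect as
privileged instructions; comparing the two is the trade-off question.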
Evaluate a "stall" cache. There are several ways to fix the problem
of loads and stores taking extra cycles in RISC machines that must
occur when there is a single 32-bit bus to the outside world. One way
is double cycle the bus as done in the MIPS R2000. Another way is to
add an instruction cache on chip. Rather than include a complete
instruction cache, SUN has come up with an idea called a "stall
cache". The cache only contains the instructions needed during a load
or store access, i.e., the instructions that follow the load or
store. (If you know what the branch target buffer is on the AMD
29000, this is the same philosophy for a different problem.) The
conjecture is that a small fully associative cache (say 32 entries)
will substantially reduce the impact of loads and stores while
maintaining a single bus. Preliminary results suggest a 10%
performance improvement and that random replacement is superior to
LRU (see exercise 8.11 on page 494 for an explanation why this might
be true.) (Suggested by Ed Frank of Sun Microsystems.)
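A minimal simulation skeleton for such a study might look like the
following; the 32-entry size and random replacement follow the
description above, but the trace format, function names, and
word-granularity tags are assumptions:

    /* Skeleton of a stall-cache simulator: a small fully associative
     * cache of instruction addresses with random replacement.
     * Feed it the addresses of instructions following loads/stores. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define ENTRIES 32              /* size conjectured in the text */

    static uint32_t tag[ENTRIES];
    static int      valid[ENTRIES];
    static long     hits, misses;

    static void access_stall_cache(uint32_t addr) {
        uint32_t t = addr >> 2;     /* assume 4-byte instructions */
        for (int i = 0; i < ENTRIES; i++)
            if (valid[i] && tag[i] == t) { hits++; return; }
        misses++;
        int victim = rand() % ENTRIES;   /* random replacement */
        tag[victim] = t; valid[victim] = 1;
    }

    int main(void) {
        /* Stand-in trace: a loop body re-executing the same follower
         * instructions; a real trace reader would replace this. */
        for (int rep = 0; rep < 1000; rep++)
            for (uint32_t a = 0x1000; a < 0x1100; a += 4)
                access_stall_cache(a);
        printf("hits %ld  misses %ld  hit rate %.3f\n",
               hits, misses, (double)hits / (hits + misses));
        return 0;
    }

Swapping the victim-selection line for an LRU policy lets you test
the random-vs-LRU claim directly on real traces.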
Distributed system cost. What is the most economical way to
configure distributed systems in terms of local memory, remote memory
(in the file system), local disk, remote disk, network bandwidth, and
so on? This would require collecting the costs of the options and
then estimating a performance model. (Suggested by Andy Bechtolsheim
of Sun Microsystems.)
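A toy version of such a cost-performance model, with every price,
latency, and hit ratio invented purely for illustration, might start
like this:

    /* Toy cost-performance model for a distributed configuration.
     * All prices, latencies, and hit ratios below are invented; the
     * project is to replace them with measured and quoted values. */
    #include <stdio.h>

    int main(void) {
        /* Assumed costs (dollars) and access times (microseconds). */
        double mb_local_mem = 64,  cost_mem = mb_local_mem * 100.0;
        double mb_local_dsk = 300, cost_dsk = mb_local_dsk * 5.0;
        double cost_net     = 1500;          /* network interface */
        double t_mem = 1, t_dsk = 20000, t_net_dsk = 25000;

        /* Assumed fractions of accesses served at each level. */
        double h_mem = 0.95, h_dsk = 0.04, h_net = 0.01;

        double total_cost = cost_mem + cost_dsk + cost_net;
        double avg_time   = h_mem * t_mem + h_dsk * t_dsk
                          + h_net * t_net_dsk;

        printf("cost $%.0f, avg access %.1f us, cost*time %.0f\n",
               total_cost, avg_time, total_cost * avg_time);
        return 0;
    }

Sweeping the configuration parameters and minimizing cost times
average access time is one plausible way to frame "most economical."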
6) Generalized Benchmarks. Two popular benchmark kernels, the
Livermore Loops and Linpack, summarize performance as a single
number. While this provides some information, it is not enough to
understand what is really going on. In addition, for certain types of
architectures, performance is a function of input size, yet these two
programs measure execution time for only a few input sizes. This
project involves rewriting either or both benchmarks to provide more
information about processor performance than that single number.
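One direction is to time a kernel such as DAXPY (the inner loop of
Linpack) over a sweep of input sizes and report MFLOPS at each size,
so that cache-size effects become visible; in the sketch below the
sizes and repetition counts are arbitrary illustrative choices:

    /* Time DAXPY across input sizes and report MFLOPS per size, so
     * performance as a function of size is visible.  Sizes and
     * repetition counts are arbitrary illustrative choices. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    int main(void) {
        for (int n = 1 << 8; n <= 1 << 20; n <<= 2) {
            double *x = malloc(n * sizeof *x);
            double *y = malloc(n * sizeof *y);
            for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

            int reps = (1 << 24) / n;        /* keep total work similar */
            clock_t t0 = clock();
            for (int r = 0; r < reps; r++)
                daxpy(n, 3.0, x, y);
            double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (sec <= 0) sec = 1e-9;        /* timer-granularity guard */

            /* 2 flops per element per call (multiply + add). */
            printf("n=%8d  %7.1f MFLOPS\n", n,
                   2.0 * n * reps / sec / 1e6);
            free(x); free(y);
        }
        return 0;
    }

A curve of MFLOPS versus n says far more about the memory hierarchy
than any single number can.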
Synthetic SPEC benchmark. One of the problems with the SPEC
benchmark suite is that it takes hours or days to run on a simulator
of a new machine. If someone could come up with a program that ran in
10,000 cycles and accurately predicted the SPEC rating, they would
have the undying admiration of microprocessor designers around the
world. The idea would be to measure the SPEC benchmarks from every
perspective and then write an assembly language program that has the
proper profile for a particular architecture. The point of writing an
assembly language program is to prevent targeted compilers from
discarding most of the synthetic program. The proper profile includes
the instruction mix, cache miss rate, branch-taken frequency,
instruction sequence dependences for superscalar machines, "density"
of floating-point operations, and anything else you can think of to
make it realistic. The idea would be to try this program on a variety
of models to see how accurately it predicts performance. (Suggested
by Jim Slager of Sun Microsystems.)
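A first cut at such a generator might work from a measured profile
and emit assembly text to stdout; the fractions and the MIPS-flavored
mnemonics below are placeholders, not measured SPEC statistics:

    /* Emit a synthetic MIPS-flavored instruction stream whose static
     * mix matches a target profile.  The fractions are placeholders;
     * matching dynamic behavior (misses, dependences) is the real
     * work of the project.  This only prints text. */
    #include <stdio.h>
    #include <stdlib.h>

    struct klass { const char *templ; double frac; };

    static struct klass mix[] = {
        { "addu  $%d, $%d, $%d",    0.45 },  /* integer ALU */
        { "lw    $%d, 0($%d)",      0.25 },  /* loads */
        { "sw    $%d, 0($%d)",      0.12 },  /* stores */
        { "beq   $%d, $%d, L%d",    0.10 },  /* branches */
        { "add.d $f%d, $f%d, $f%d", 0.08 },  /* floating point */
    };
    #define NK (sizeof mix / sizeof mix[0])

    int main(void) {
        const int N = 64;                    /* instructions to emit */
        for (int i = 0; i < N; i++) {
            double p = (double)rand() / RAND_MAX, acc = 0;
            for (size_t k = 0; k < NK; k++) {
                acc += mix[k].frac;
                if (p <= acc || k + 1 == NK) {
                    /* Random registers; a real generator would also
                     * schedule dependences deliberately.  Unused
                     * printf arguments are ignored, per C. */
                    printf("\t");
                    printf(mix[k].templ, rand() % 8 + 8,
                           rand() % 8 + 8, rand() % 8 + 8);
                    printf("\n");
                    break;
                }
            }
        }
        return 0;
    }

Getting the static mix right is the easy part; the interesting work
is shaping addresses and dependences so the dynamic profile matches.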
8) Benchmarks for I/O. Exercise 9.15.
Benchmarks for multiprocessors or networks of multicomputers.
Exercises 10.8, 10.9, 10.10, 10.11, 10.12.
5) Instruction statistics tools. Exercise 4.22.
Improving DLX tools. We have the source code of the DLX tools
(compiler, simulator, assembler), but there are bugs that need to be
fixed. Try to port the tools to the DEC3100, fix the bugs, and
improve their performance.
7) New DLX tools. Any of Exercises 6.21, 6.22, 6.23, 7.13, 7.14, 8.16.
Link-time instrumentation for cache simulation.
One of the limitations of cache traces is the phenomenal slowdown in
execution needed to collect the information, plus the storage space
required to save it. Borg et al. propose adding statistics-collecting
instructions to basic blocks at link time to collect cache
information. This allows them to collect information on any program
that runs on their instruction set, for runs hundreds of times longer
than any previous cache study, with surprising results. While they
need to rerun the program to try different cache parameters,
instrumented execution is so much faster that it takes much less time
than trace-based collection. Write such a linker for the MIPS
instruction set and collect cache statistics of your own. Again,
several teams can work on this project, since each can pick different
programs and cache parameters to study.
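The heart of the injected code is small: each instrumented memory
reference calls into an in-process cache model that updates counters
instead of writing a trace record. A hedged sketch of that callback
(the direct-mapped geometry and all names are assumptions) follows:

    /* Sketch of the routine a link-time instrumenter would make
     * every memory reference call.  The direct-mapped geometry and
     * all names here are illustrative assumptions. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINES 1024              /* 1024 lines x 16-byte blocks */
    #define BLOCK_SHIFT 4

    static uint32_t line_tag[LINES];
    static int      line_valid[LINES];
    static long     refs, misses;

    /* The instrumenter inserts a call to this before each load and
     * store, passing the effective address in an argument register. */
    void record_ref(uint32_t addr) {
        uint32_t blk = addr >> BLOCK_SHIFT;
        uint32_t idx = blk % LINES;
        refs++;
        if (!line_valid[idx] || line_tag[idx] != blk) {
            misses++;
            line_tag[idx] = blk; line_valid[idx] = 1;
        }
    }

    /* Arranged to run at program exit to dump the statistics. */
    void dump_stats(void) {
        printf("refs %ld  misses %ld  miss rate %.4f\n",
               refs, misses, (double)misses / refs);
    }

    int main(void) {           /* stand-in for an instrumented program */
        for (uint32_t a = 0; a < 1u << 16; a += 4) record_ref(a);
        dump_stats();
        return 0;
    }

The linker's job is to insert the call sites and preserve register
state around them; the model above is what those calls feed.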
Hand-tuned simulator. Exercise 5.19.
Write a simulator for the 8086 on the DEC3100, Macintosh, or Sun SPARC.
9) I/O overhead experiments. Find the instructions per I/O required
to deliver 8KB requests in various configurations. If this number is
well modeled, we will better understand the effect of MIPS on
increasing throughput when machines have lots of disks. Some
interesting comparisons might be:
1. Raw vs. file overhead.
2. Cost of DMA for cache invalidation and memory bus utilization.
3. Cycle cost of optimizations such as buffer cache management.
4. Instruction counts on different machines.
5. The effect of different disk controllers.
Most of this work could be done with tracing and editing tricks.
(Suggested by Rich Clewett of Sun Microsystems.)
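Instructions per I/O are hard to count portably, but CPU time per
8KB request is a usable proxy for a first experiment. The sketch
below times buffered file reads (comparison 1 in the list above would
repeat it against a raw device); the file path and request count are
placeholder assumptions:

    /* Time 8KB read() requests against a file as a proxy for per-I/O
     * CPU overhead.  The path and request count are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    #define REQ   8192              /* the 8KB request size above */
    #define NREQS 1000

    int main(void) {
        int fd = open("testfile", O_RDONLY);   /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(REQ);
        clock_t t0 = clock();
        for (int i = 0; i < NREQS; i++) {
            if (read(fd, buf, REQ) < REQ)
                lseek(fd, 0, SEEK_SET);        /* wrap at end of file */
        }
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%d requests, %.1f us of CPU per 8KB request\n",
               NREQS, sec * 1e6 / NREQS);
        free(buf); close(fd);
        return 0;
    }

Dividing the measured CPU time per request by the machine's cycle
time and CPI estimate gives a rough instructions-per-I/O figure.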
I/O performance experiments. Exercise 9.15.
If you have some topic you would like to study, write a one-page
proposal and let me know your plan by 2/24.
You can work alone or in a group of two.
You have to choose your project by 2/24 and show me the composition
of your team and your initial plan of attack.