CS520 Computer Architecture Semester Project List

Use WARTS (Wisconsin Architectural Research Tool Set) to analyze program behavior and cache performance.

The goal here is to learn about new tools such as

QPT for precise program profiling and memory reference traces
   (there is a din.c that converts the .QTrace format to the .din format for DineroIII),
CPROF for annotating the source code lines that generate cache misses,
DineroIII and Tycho for analyzing miss rates for cache designs,
EEL (Executable Editing Library) for building tools to analyze and modify an executable (compiled) program.

The source code is in ~cs520/warts. The tools were developed on SPARC machines. We have chico and groucho (Sun SPARCclassic) for compiling and installing the toolset.
Level-2 cache design system. CacheSimulator was created by Nick Patterson. It modifies DineroIII to simulate Level-1 and Level-2 cache systems. Create scripts that use CacheSimulator to suggest a Level-2 cache size for an application. You may want to use QPT to profile the applications of interest and then use its memory reference trace output to drive CacheSimulator.
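
As a rough illustration of the kind of script meant here, the sketch below sweeps candidate Level-2 sizes over a din-format trace and reports the smallest one that meets a target miss rate. It does not call CacheSimulator or DineroIII. The direct-mapped model, the 16-byte block size, and the names parse_din, l2_miss_rate, and suggest_l2_size are all assumptions made for the example. It also assumes each din line is a label (0 = data read, 1 = data write, 2 = instruction fetch) followed by a hexadecimal address.

```python
def parse_din(lines):
    """Yield (label, address) pairs from din-format text lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1], 16)


class DirectMappedCache:
    """Toy direct-mapped cache (an assumption; real tools model much more)."""

    def __init__(self, size_bytes, block_bytes=16):
        self.block_bytes = block_bytes
        self.num_blocks = size_bytes // block_bytes
        self.tags = [None] * self.num_blocks
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the block and return False."""
        block = address // self.block_bytes
        index = block % self.num_blocks
        self.accesses += 1
        if self.tags[index] == block:
            return True
        self.tags[index] = block
        self.misses += 1
        return False


def l2_miss_rate(trace, l1_bytes, l2_bytes):
    """Local Level-2 miss rate: only Level-1 misses reach Level 2."""
    l1 = DirectMappedCache(l1_bytes)
    l2 = DirectMappedCache(l2_bytes)
    for _label, address in trace:
        if not l1.access(address):
            l2.access(address)
    return l2.misses / l2.accesses if l2.accesses else 0.0


def suggest_l2_size(trace, l1_bytes, candidate_sizes, target_miss_rate):
    """Return the smallest candidate size meeting the target, or None."""
    trace = list(trace)  # the trace is replayed once per candidate
    for size in sorted(candidate_sizes):
        if l2_miss_rate(trace, l1_bytes, size) <= target_miss_rate:
            return size
    return None


# Made-up trace: ten passes over sixteen 16-byte blocks of data reads.
demo_lines = ["0 %x" % (16 * b) for _ in range(10) for b in range(16)]
demo_trace = list(parse_din(demo_lines))
print(suggest_l2_size(demo_trace, 64, [128, 256, 512], 0.2))
```

A real script would replace the made-up trace with the converted QPT output and the toy model with runs of CacheSimulator.
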
UC Berkeley S98 project list
Port a JVM to symmetric multiprocessor systems and take full advantage of the hardware's multiple processors.
SPECjvm98 available! SPEC has released its first benchmark for comparing Java virtual machine (JVM) client platforms. Study it and suggest improvements.
Study the ARM processor.
Previous semester project list:

Cost and Performance
1) How should caches be distributed among the hosts, I/O controllers, and
   disks of a parallel computer system? Note that the caches at each of
   the three levels exploit different locality information and, with
   multiple units at each level, your design needs to consider the
   coherence issue. One subproblem is to treat this as a constraint
   satisfaction problem: assuming there is only a limited amount of cache
   to be distributed, what is the most efficient way to distribute it?
   Come up with mathematical formulae for this problem.
   (Suggested by Richard Lary, Peter T. McLean, and David W. Thiel of
   Digital Equipment Corporation)

2) The existing FDDI and ATM networks offer high bandwidth among
   connected units. Explore the design issues in using these high-speed
   networks as an interconnection network that connects processors and
   storage systems. Compare the design(s) using FDDI with those using ATM.
   In particular, examine the protocol functionality
   provided by the high-speed communication protocols in these networks,
   compare it with the functionality provided by buses or
   interconnection networks, and design additional network primitives to
   facilitate the design of distributed applications, such as distributed
   resource locking.
   (Suggested by Richard Lary, Peter T. McLean, and David W. Thiel of
   Digital Equipment Corporation)

3) Multimedia Computing. With the advances in video and voice processing
   equipment and communication networks, we see prices decreasing for both
   the equipment and communication bandwidth. Some of the big companies,
   such as IBM and Apple, are emphasizing multimedia computing. One of
   the problems in supporting multimedia computing on existing computer
   architectures is that the low-speed bus cannot handle the
   traffic among video equipment, frame buffer, and communication
   devices. Assume that we replace the low-speed
   bus with small high-speed switches, such as cross-point switches
   or ATM switches. Design the primitives for this new multimedia
   computer architecture. You have to collect the operations that are needed
   to support those devices. You can design these primitives at the programming
   level or implement them at the ISA level. Discuss the trade-offs between these
   two approaches.

4) Evaluate a "stall" cache. There are several ways to fix the problem
   of loads and stores taking extra cycles in RISC machines that must
   occur when there is a single 32-bit bus to the outside world. One way
   is to double-cycle the bus, as done in the MIPS R2000. Another way is to
   add an instruction cache on chip. Rather than include a complete
   instruction cache, SUN has come up with an idea called a "stall
   cache". The cache only contains the instructions needed during a load
   or store access, i.e., the instructions that follow the load or
   store. (If you know what the branch target buffer is on the AMD
   29000, this is the same philosophy for a different problem.) The
   conjecture is that a small fully associative cache (say 32 entries)
   will substantially reduce the impact of loads and stores while
   maintaining a single bus. Preliminary results suggest a 10%
   performance improvement and that random replacement is superior to
   LRU (see exercise 8.11 on page 494 for an explanation of why this might
   be true). (Suggested by Ed Frank of Sun Microsystems.)
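
One well-known case in which random replacement beats LRU is a loop that is slightly larger than the cache. The sketch below is only an illustration of that effect, not a model of Sun's stall cache. The function name simulate, the 33-address loop, and the fixed random seed are assumptions chosen for the example. It counts misses in a 32-entry fully associative cache under both policies.

```python
import random
from collections import OrderedDict


def simulate(trace, entries, policy, seed=0):
    """Return the number of misses in a fully associative cache.

    policy is "lru" or "random"; seed makes the random policy repeatable.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are addresses; order tracks recency for LRU
    misses = 0
    for addr in trace:
        if addr in cache:
            if policy == "lru":
                cache.move_to_end(addr)  # mark as most recently used
            continue
        misses += 1
        if len(cache) >= entries:
            if policy == "lru":
                cache.popitem(last=False)  # evict the least recently used
            else:
                del cache[rng.choice(list(cache))]  # evict a random entry
        cache[addr] = True
    return misses


# A loop over 33 addresses, repeated 100 times, with a 32-entry cache.
loop_trace = [addr for _ in range(100) for addr in range(33)]
lru_misses = simulate(loop_trace, 32, "lru")
random_misses = simulate(loop_trace, 32, "random")
print("LRU misses:", lru_misses)
print("random misses:", random_misses)
```

With LRU, every access in this loop misses, because the entry evicted is always the one needed soonest. Random replacement usually evicts something that is not needed for a while, so most accesses hit.
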

5) Distributed system cost. What is the most economical way to
   configure distributed systems in terms of local memory, remote
   memory (in file system), local disk, remote disk, network bandwidth,
   and so on? This would require collecting costs of the options and
   then estimating a performance model. (Suggested by Andy Bechtolsheim
   of Sun Microsystems.)

Benchmarks.
6) Generalized Benchmarks. Two popular benchmark kernels, Livermore
   Loops and Linpack, summarize performance as a single number. While
   providing some information, this is not enough to understand what is
   really going on. In addition, for certain types of architectures,
   performance is a function of input size. However, these two programs
   measure execution time for only a few input sizes. This project
   involves rewriting either or both benchmarks to provide more
   information about processor performance rather than that single
   number.
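
To make "performance as a function of input size" concrete, here is a minimal sketch that times a DAXPY-style loop (the kind of inner kernel Linpack relies on) over several vector lengths and reports a rate for each one instead of a single number. The names daxpy and sweep, the chosen sizes, and the best-of-three timing are assumptions for illustration. In Python the interpreter overhead dominates, so a real version of this project would be written in Fortran or C.

```python
import time


def daxpy(a, x, y):
    """Return a*x + y element by element (2 floating-point ops per element)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def sweep(sizes, repeats=3):
    """Map each vector length to the best observed rate in MFLOPS."""
    results = {}
    for n in sizes:
        x = [1.0] * n
        y = [2.0] * n
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            daxpy(3.0, x, y)
            best = min(best, time.perf_counter() - start)
        # Guard against a zero reading from a coarse timer.
        results[n] = 2 * n / best / 1e6 if best > 0 else float("inf")
    return results


for n, mflops in sweep([100, 1000, 10000, 100000]).items():
    print(n, round(mflops, 1))
```

Plotting the rate against the vector length is one way to expose cache and vector-length effects that a single summary number hides.
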

7) Synthetic SPEC benchmark. One of the problems of the SPEC benchmark
   suite is that it takes hours or days to run on a simulator of a new
   machine. If someone could come up with a program that ran in
   10,000 cycles and accurately predicted the SPEC rating, they would
   have the undying admiration of microprocessor designers around the
   world. The idea would be to measure the SPEC benchmarks from all
   perspectives and then write an assembly language program that has the
   proper profile for a particular architecture. The idea of writing an
   assembly language program is to prevent targeted compilers from
   discarding most of the synthetic program. The proper profile includes
   instruction mix, cache miss rate, branch taken frequency, instruction
   sequence dependencies for superscalar machines, "density" of floating
   point operations, and anything else you can think of to make it
   realistic. The idea would be to try this program on a variety of
   models to see how accurately it predicts performance. (Suggested by
   Jim Slager of Sun Microsystems.)

8) Benchmarks for I/O. Exercise 9.15.

9) Benchmarks for multiprocessors or network of multicomputers.
Exercises 10.8, 10.9, 10.10, 10.11, 10.12.

Tools
5) Exercise 4.22 Instruction statistics tools.

6) Improving DLX tools. We have the source code of the DLX tools (compiler,
simulator, assembler), but there are bugs that need to be fixed.
Try to port them to the DEC3100, fix the bugs, and improve their performance.

7) New DLX tools. Any of Exercises 6.21, 6.22, 6.23, 7.13, 7.14, 8.16.

8) Link-time instrumentation for cache simulation.
   One of the limitations of cache traces is the phenomenal slowdown in
   execution to collect the information, plus the storage space required
   to save it. Borg et al [1990] propose adding statistics-collecting
   instructions to basic blocks at link time to collect cache
   information. This allows them to collect information on any
   program that runs on their instruction set for hundreds of times
   longer than any previous cache studies, with surprising results.
   While they need to rerun the program to try different cache
   parameters, it is so much faster that it takes much less time than
   trace-based collection. Write such a linker for the MIPS instruction
   set and collect cache statistics of your own. Again, there can be
   several teams working on this project since they can pick different
   architectures.

8) Hand-tuned simulator. Exercise 5.19.
Write a simulator for the 8086 on a DEC3100, Macintosh, or Sun SPARC.

Input/Output
9) I/O overhead experiments. Find the instructions per I/O required to
   deliver 8KB requests in various configurations. If this number is
   well modeled, we will better understand the effect of MIPS in
   increasing throughput when machines have lots of disks. Some
   interesting comparisons might be:
   1. Raw vs. file overhead.
   2. Cost of DMA for cache invalidation and memory bus utilization.
   3. Cycle cost of optimizations such as buffer cache management and
      seek optimizations.
   4. Instruction counts on different machines.
   5. The effect of different disk controllers.
   Most work could be done with trace and editing tricks.
   (Suggested by Rich Clewett of Sun Microsystems.)
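
The reason the instructions-per-I/O figure matters can be shown with simple arithmetic: a processor that spends all of its instructions on I/O can complete at most (MIPS x 10^6) / (instructions per I/O) requests per second. The sketch below works this out. The function names and the sample figures (a 20-MIPS machine, 20,000 instructions per I/O) are made-up assumptions, not measurements.

```python
def max_ios_per_second(mips, instructions_per_io):
    """Upper bound on I/Os per second if the CPU does nothing but I/O."""
    return mips * 1e6 / instructions_per_io


def cpu_limited_bandwidth_mb(mips, instructions_per_io, request_kb=8):
    """The same bound expressed in MB/s for fixed-size requests."""
    return max_ios_per_second(mips, instructions_per_io) * request_kb / 1024


# Hypothetical machine: 20 MIPS, 20,000 instructions per 8KB request.
print(max_ios_per_second(20, 20000))        # 1000.0 I/Os per second
print(cpu_limited_bandwidth_mb(20, 20000))  # 7.8125 MB/s
```

Once the disks together can deliver more than this bound, adding disks no longer raises throughput; only more MIPS or fewer instructions per I/O does.
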

10) I/O performance experiments. Exercise 9.15.

Or, if you have a topic of your own that you would like to study, write a
one-page proposal and let me know your plan by 2/24.

You can work alone or in a group of two.
You have to choose your project by 2/24 and show me the composition of
your team and your initial plan of attack.