AP06_2.vp 1 Introduction We have implemented the CE program, which can simu- late the behavior of caches inside the SMP system a the software basis. The user must have the source code of the program in C language and modify it (explicitly include memory operations for CE purposes). The cache model for one CPU considered here corre- sponds to the structure of L1 – L3 caches on most modern memory architectures [1]. We consider a multilevel set-associa- tive cache. The number of sets is denoted by h. If h is equal to 1 then the cache is called fully associative. One set consists of s independent blocks (called lines in Intel terminology). If s is equal to 1 then the cache is called direct mapped. The size of the data part of a cache in bytes is denoted by DCS. The cache block size in bytes is denoted by BS. We assume write-back caches with the LRU block replacement strategy. Obviously, DC s B hS S� � � . In shared memory systems, each CPU (and each cache) is connected by the shared bus to the shared memory. A co- herence protocol is used to maintain memory coherence according to the specified consistency model. 2 General assumptions of the CE We make following assumptions during the implementa- tion of the CE: 1. The CE is designed for SMP systems using bus snoopy cache coherence protocols. 2. The whole cache size is used only for data. We omit bus transactions for instruction reading. 3. The operand read does not cross the cache block boundary. 4. We assume only write-back caches with the LRU replace- ment strategy. 5. We assume that all caches in each level are the same; caches have at most 3 levels. 6. We assume that all bus transactions are atomic. 7. We assume random bus arbitration. 3 An example of an application of the CE Application of the CE is very easy and straightforward. © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 47 Czech Technical University in Prague Acta Polytechnica Vol. 46 No. 2/2006 A Simple Cache Emulator for Evaluating Cache Behavior for SMP Systems I. Šimeček Every modern CPU uses a complex memory hierarchy, which consists of multiple cache memory levels. It is very difficult to predict the behavior of this hierarchy for a given program (for details see [1, 2]). The situation is even worse for systems with a shared memory. The most important example is the case of SMP (symmetric multiprocessing) systems [3]. The importance of these systems is growing due to the multi-core feature of the newest CPUs. The Cache Emulator (CE) can simulate the behavior of caches inside an SMP system and compute the number of cache misses during a computation. All measurements are done in the “off-line” mode on a single CPU. The CE uses its own emulated cache memory for an exact simulation. This means that no other CPU activity influences the behavior of the CE. This work extends the Cache Analyzer introduced in [4]. Keywords: cache hierarchy, cache emulator, symmetric multiprocessing, MESI protocol. Fig. 1: Overview of cache parameters Fig. 2: Example of an SMP system 4 Comparison between CE and HW CPU cache monitors CE has many important advantages in comparison to HW CPU cache monitors: � The user can measure the behavior of the SMP system on the uniprocessor system. � CE is supported on any platform, because CE is imple- mented as a GNU C library and can easily be included in any program. � The measurements are not influenced by other processes due to the “off-line” mode of the measurement. � External HW CPU cache monitors are very expensive and platform-dependent. Modern CPUs also have internal cache counters (for example, in the IA-32 architecture they are called “performance counters”), but they are available only for privileged users and are hard-to-use. � CE can measure effects of memory functions not supported by the CPU core (for example a “read once” operation) and can serve for the development of more effective cache hierarchies. � The user can easily change the cache configuration or the number of CPUs for the measurement by the #defines state- ment. Parameters for real systems can easily be included from predefined files. � The user can measure the cache behavior only in an area of interest. � Conditional memory operations are supported. � The “off-line mode” guarantees that small quantities of cache misses can also be exactly measured. CE also has potential drawbacks: � Memory operations in CE are about 1000 times slower than HW memory reads because in CE all read (or write) operations are simulated by the software. The exact slow- down ratio depends on the type of measured task and the complexity of the simulated SMP system. This drawback is reduced by the fact that the user can measure the cache be- havior only in the area of interest. � CE requires additional memory for its own emulated cache memory. � Only data caches are assumed, not TLB or other parts of the memory architecture. � Only the numbers of cache misses are measured, not effects of these misses. Some memory latencies, conflicts or stalls occurring during code execution can be overlapped by other computation, so they do not result in performance degradation. � Some cache misses cannot be measured because it is diffi- cult to expressed these memory operations on the source code level (for example, those caused by stack operations in calling subroutines � CALL-RET sequences). � For caches which can hold both data and instructions, the effect of loading instructions into the cache is omitted. This drawback is not usually significant, because the code-sizes of the inner loops are much smaller than the data-sizes used by these loops. � The current version supports only the MESI protocol, write-back caches with the LRU replacement strategy. A version that also supports different coherence protocols or cache configurations is under development. 5 Solution of coherency misses The number of coherency cache misses is strongly influ- enced by the exact times of the memory requests execution and the type of bus arbitration. Since no SW cache emulator is able to predict these times exactly among the whole SMP sys- tem, the memory request ordering is solved on a statistical ba- sis. We assume that the whole CPU enters the section of global memory operation in approximately the same time and that there are no preferences for memory requests. For each exe- cution of this global memory operation inside the loop, a ran- dom ordering of CPU memory requests is generated, because we assume random bus arbitration. Under this ordering, the memory operations get serialized. 6 Validation of CE In order to validate CE, we ran following subroutines from the linear algebra package: � sparse matrix-vector multiplication, � Cholesky factorization, � matrix-matrix multiplication. 48 © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ Acta Polytechnica Vol. 46 No. 2/2006 Czech Technical University in Prague The example code The example code with CE calls s � 0.0; s � 0.0; Cache_init(); for i � 1 to n do for i � 1 to n do CPU no. 1: s� � A[i] * B[i]; s� � A[i]; * B[i]; Cache_read(1,A� i); Cache_read(1,B� i); CPU no. j: c[ j ] � s; Cache_write( j,c� j); Cache_print(); Table 1: Example of an application of the CE All routines run in 2 forms: 1. The original code was measured by the Performance Ana- lyzer HW cache monitor on this SMP system: SunFire V880, 8 UltraSPARC III Cu processors, each CPU has a 64 KB L1 cache and an 8 MB L2 cache, run- ning OS Solaris 10. 2. The modified code was emulated by CE. All cache param- eters were equal to the real SMP system configuration (see above) on this HW configuration: An Intel Celeron 2,4 MHz, 512 MB RAM, a 128 KB L2 cache, an 8 KB L1 cache running OS Windows XP with Intel C compiler version 7.01. The results from these tests are very similar. Of course, they are not exactly the same, because the real measurement and the emulation are inexact in different ways, as discussed above. For simplicity, we can say that the differences between these results were smaller than 20 %. 7 Conclusions We have implemented a cache emulator to study quantita- tive parameters of the cache behavior in the SMP systems during different types of computation. We have also discussed the advantages and drawbacks of this emulator. The main ad- vantage of this emulator is that a user can simulate the cache behavior of any SMP system on the uniprocessor system. The emulator has been verified on different types of usual tasks. The results were similar to those obtained from the HW cache monitor. The errors in the estimations are due to minor simplifying assumptions in the CE. Acknowledgment This work was supported by MŠMT under research pro- gram MSM6840770014. References [1] Wadleigh, K. R., Crawford, I. L.: Software optimization for high performance computing. Hewlett-Packard professional books, 2000. [2] Kennedy, K., Allen, J. R.: Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kauf- mann Publishers Inc., 2002. [3] Bik, A., Girkar, M., Grey, P., Tian, X.: “Efficient Ex- ploitation of Parallelism on Pentium III and Pentium 4 Processor-Based Systems.” Intel Technology Journal, 2001, No. 5. [4] Tvrdík, P., Šimeček, I.: “Software Cache Analyzer.” Pro- ceedings of CTU Workshop, 2005, p. 180–181. Ing. Ivan Šimeček phone: +420 224 357 268 e-mail: xsimecek@fel.cvut.cz Department of Computer Science Czech Technical University Faculty of Electrical Engineering Technická 2 166 27 Prague 6, Czech Republic © Czech Technical University Publishing House http://ctn.cvut.cz/ap/ 49 Czech Technical University in Prague Acta Polytechnica Vol. 46 No. 2/2006