Question:

The performance of a snooping cache-coherent multiprocessor depends on many detailed implementation issues that determine how quickly a cache responds with data in an exclusive or M state block. In some implementations, a CPU read miss to a cache block that is exclusive in another processor's cache is faster than a miss to a block in memory. This is because caches are smaller, and thus faster, than main memory. Conversely, in some implementations, misses satisfied by memory are faster than those satisfied by caches. This is because caches are generally optimized for "front side" or CPU references, rather than "back side" or snooping accesses.
For the multiprocessor illustrated in Figure 4.37, consider the execution of a sequence of operations on a single CPU where
• CPU read and write hits generate no stall cycles.
• CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied by memory and cache, respectively.
• CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
• A writeback of a block, either due to a conflict or another processor's request to an exclusive block, incurs an additional Nwriteback stall cycles.
Consider two implementations with different performance characteristics summarized in Figure 4.38.
Consider the following sequence of operations assuming the initial cache state in Figure 4.37. For simplicity, assume that the second operation begins after the first completes (even though they are on different processors):
P1: read 110
P15: read 110
For Implementation 1, the first read generates 80 stall cycles because the read is satisfied by P0's cache: P1 stalls for 70 cycles while it waits for the block, and P0 stalls for 10 cycles while it writes the block back to memory in response to P1's request. The second read by P15 generates 100 stall cycles because its miss is satisfied by memory. Thus this sequence generates a total of 180 stall cycles.
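The accounting in this worked example can be sketched in a few lines of Python. This is only an illustration, not part of the exercise: the names `StallCosts` and `miss_stalls` are hypothetical, and the three cost values are the ones quoted in the example for Implementation 1 (100 for memory, 70 for cache, 10 for a writeback). Ninvalidate is left out because the example does not use it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StallCosts:
    """Stall cycles per event (hypothetical container; fields mirror the text's symbols)."""
    n_memory: int     # miss satisfied by memory (Nmemory)
    n_cache: int      # miss satisfied by another processor's cache (Ncache)
    n_writeback: int  # additional cycles when a block is written back (Nwriteback)


def miss_stalls(costs: StallCosts, source: str, writeback: bool = False) -> int:
    """Return the stall cycles charged for one miss.

    source is "memory" or "cache"; writeback adds the extra writeback cost.
    """
    if source == "memory":
        cycles = costs.n_memory
    elif source == "cache":
        cycles = costs.n_cache
    else:
        raise ValueError(f"unknown source: {source!r}")
    if writeback:
        cycles += costs.n_writeback
    return cycles


# Values taken from the worked example for Implementation 1.
impl1 = StallCosts(n_memory=100, n_cache=70, n_writeback=10)

# P1: read 110 -> supplied by P0's cache, and P0 writes the block back.
first = miss_stalls(impl1, "cache", writeback=True)
# P15: read 110 -> supplied by memory, no writeback.
second = miss_stalls(impl1, "memory")
total = first + second

print(first, second, total)  # prints "80 100 180"
```

The same function could be reused for the sequences below once each access has been classified as a hit, a miss satisfied by memory, or a miss satisfied by a cache; that classification depends on the initial state in Figure 4.37.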
For the following sequences of operations, how many stall cycles are generated by each implementation?
a. P0: read 120
P0: read 128
P0: read 130
b. P0: read 100
P0: write 108
P0: write 130
c. P1: read 120
P1: read 128
P1: read 130
d. P1: read 100
P1: write 108
P1: write 130