Question:
(i) Define clock skew and clock drift. [2 marks]
(ii) A client running Cristian's Algorithm observes a local clock time of 1399157100.00s at the start of its RPC, and 1399157100.10s at the end of its RPC. The RPC returns a server timestamp of 1399157100.05s. What client-server clock skew will the algorithm calculate? Justify your answer. [2 marks]
(iii) Using Cristian's Algorithm as the underlying primitive, propose a time synchronisation algorithm that measures and compensates for clock drift. [2 marks]
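The calculation in part (ii) can be sketched directly. This is a minimal illustration of the standard Cristian's-algorithm estimate (server timestamp assumed taken half-way through the round trip); the function name and variables are my own, not from the question.

```python
def cristian_skew(t_start, t_end, t_server):
    """Skew estimate from one Cristian's-algorithm exchange.

    The server is assumed to have stamped the reply half-way through
    the round trip, so when the reply arrives the server's clock reads
    roughly t_server + rtt/2.  The skew is that estimate minus the
    client's clock at arrival (t_end).
    """
    rtt = t_end - t_start
    return (t_server + rtt / 2) - t_end

# Timestamps from part (ii):
skew = cristian_skew(1399157100.00, 1399157100.10, 1399157100.05)
```

With these inputs the server timestamp lies exactly half-way between the two client readings, so the computed skew is zero (up to floating-point rounding). For part (iii), one approach is to repeat this measurement at two well-separated local times t1 and t2 and estimate the drift rate as (skew2 − skew1)/(t2 − t1), then apply a gradual rate correction rather than a step change.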
Question:
(a) ... the SRAM banks are square and the time taken for a signal to travel along the edge of an SRAM bank is much less than your network's clock cycle time. [5 marks]
(b) To implement a set-associative LLC we may spread each set across multiple banks, i.e. each "way" of the set will be in a different bank. The different associative ways will have different access latencies depending on their distance from the cache controller. How might we optimise the placement of lines in particular banks (or ways) to minimise the cache's average access latency? Remember to consider the cost of moving lines. [6 marks]
(c) How might the SRAM banks be efficiently interconnected so that the cache's access time is constant regardless of which bank is accessed? [4 marks]
(d) Why might it be advantageous to be able to manage the amount of LLC used by each co-scheduled thread in a chip multiprocessor? [5 marks]
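The trade-off in part (b) can be illustrated with a toy model. This is a hedged sketch, not a real NUCA design: it assumes a single set whose ways sit in banks of increasing latency, and a simple promote-on-hit migration policy; the latencies and migration cost are made-up parameters.

```python
def average_latency(accesses, way_latency, promote=False, move_cost=2):
    """Toy model of a single cache set whose associative ways live in
    banks at different distances from the controller, so each way has
    its own access latency (way_latency[0] is the closest/fastest).

    With promote=True a hit swaps the line one way closer to the
    controller, paying move_cost extra cycles for the migration.
    Misses are charged the farthest way's latency and fill an empty
    way if one exists, otherwise evict the line in the farthest way.
    """
    ways = [None] * len(way_latency)
    total = 0
    for tag in accesses:
        if tag in ways:
            i = ways.index(tag)
            total += way_latency[i]
            if promote and i > 0:
                ways[i - 1], ways[i] = ways[i], ways[i - 1]
                total += move_cost
        else:
            total += way_latency[-1]
            if None in ways:
                ways[ways.index(None)] = tag
            else:
                ways[-1] = tag
    return total / len(accesses)
```

On a skewed stream (one hot line among several cold ones) promotion amortises its migration cost quickly; on a stream with no reuse locality it keeps paying migration costs without settling, which is exactly the cost part (b) asks you to weigh.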
Question:
This question is concerned with connected undirected graphs in which each edge has a weight, and with spanning trees in such graphs.
(a) Explain what is meant by the translation strategy, and outline briefly the steps of a translation-based proof of correctness. [3 marks]
(b) Give an algorithm for finding a maximum spanning tree that runs in O(E + V log V) time. Explain why your algorithm's running time is as required. [8 marks]
(c) Prove rigorously that your algorithm is correct. [9 marks]
[Note: You may refer to algorithms from lecture notes without quoting the code. You may use results from lecture notes without proof, but you must state them clearly.]

Question:
(a) A superscalar processor may speculatively execute loads even when one or more earlier stores have not yet computed their memory addresses. In practice, we would need to restart execution from the speculative load if a memory-carried dependency is subsequently detected.
(i) With the help of some additional hardware it is possible to record which loads cause such ordering violations. Briefly outline how this could be done and how such a record could be used to help improve performance. [3 marks]
(ii) Why might such a scheme unnecessarily delay the issuing of a load, even when the mechanism correctly recalls that the load has caused an ordering violation between a store and a load in the past? [4 marks]
(b) Why might it also be advantageous for a superscalar processor to predict whether a particular load will hit or miss in the processor's L1 data cache? [3 marks]
(c) You are asked to design hardware to run artificial neural network applications in a high-performance and energy-efficient manner. Such workloads can typically make good use of many multiply-accumulate (MAC) units operating in parallel and of narrow datatypes. Your system is required to support a range of different neural networks that vary considerably in the type of computations they perform.
You consider three approaches: (1) to use a multicore processor; (2) to design a single domain-specific accelerator; (3) to compose your design from two or more domain-specific accelerators, where each is specialised for a different type of neural network.
(i) What are the advantages and disadvantages of each approach? [6 marks]
(ii) Describe one possible way of organising the multicore processor and a possible choice for the architecture(s) of its individual cores. Briefly justify your design decisions.

Question:
(a) A 4KB, blocking, private L1 cache with 16B lines sees the following sequence of accesses from its core:
0x00001000 Load
0x00001010 Store
0x00002000 Load
0x00001010 Load
0x00003000 Load
0x00001010 Store
0x00001010 Store
0x00002000 Load
0x00001000 Load
0x00002000 Load
Assuming a write-allocate cache that is initially empty and implements the least-recently-used (LRU) replacement algorithm, what is the hit rate if the cache is (i) direct-mapped; (ii) fully-associative; (iii) 2-way set-associative? [6 marks]
(b) If the core supports out-of-order execution, how might a non-blocking cache bring performance benefits? [4 marks]
(c) How might the core's load/store queue be used to reduce the number of memory accesses seen by the cache? [4 marks]
(d) Assume that this core and cache are part of a chip multiprocessor, with the cache connected to a shared L2 via a bus that maintains coherence through a snooping MESI protocol. What sequence of steps would be taken if another core wanted to load from 0x00001010 after the given sequence had finished? [6 marks]

Question:
(a) How do superblock and trace scheduling differ? [4 marks]
(b) How might a programmer improve the performance of a program given detailed knowledge of a processor's memory hierarchy? [6 marks]
(c) Larger-scale networks, i.e. those involving chip-to-chip or longer-distance communications, have been designed for many years. What new challenges and constraints are introduced when designing on-chip networks?
[6 marks]
(d) As fabrication technologies scale, the performance of wires improves slowly relative to that of transistors. Why is this particularly problematic when attempting to increase the performance of superscalar processors?

Question:
(a) What dependencies exist between the instructions in the code fragment below? Identify both true data dependencies and name dependencies, and for each name dependence indicate whether it is an antidependence or an output dependence. [4 marks]
LI R1, 25       /* R1 = 25 */
LI R2, 8        /* R2 = 8 */
ADD R1, R1, R2  /* R1 = R1 + R2 */
LD R2, 0(R1)    /* R2 = mem[R1] */
(b) How would a hardware register renaming mechanism remove the name dependencies? Illustrate your answer by providing a version of the code showing the destination and source registers for each instruction after renaming has taken place. Clearly state what free physical registers you assume are available prior to renaming. [4 marks]
(c) Why is the removal of name dependencies beneficial within a superscalar processor? [4 marks]
(d) In addition to removing name dependencies, for what other purposes may register renaming hardware be used in a superscalar processor? [4 marks]
(e) The out-of-order execution of ALU instructions in a superscalar processor is constrained only by the availability of functional units and by true data dependencies. Why must the out-of-order execution of memory instructions (e.g. load and store instructions) be constrained further? [4 marks]

Question:
(a) Why is it important to concentrate on improving the common case (e.g. the most commonly used operations and resources) when designing a microprocessor? [4 marks]
(b) What is the major difference between a very long instruction word (VLIW) processor and a dynamically-scheduled superscalar processor? What impact does this have on the complexity of the implementation in each case? [4 marks]
(c) When designing a VLIW processor, why might variable-length instruction bundles be preferred over fixed-length instructions?
[4 marks]
(d) Some VLIW processors contain additional hardware to permit memory reference speculation.
(i) What optimisation does memory reference speculation permit? [4 marks]
(ii) Briefly describe the additional hardware required to support this type of speculation.

Question:
(a) What features of a processor's instruction set are desirable if a pipelined implementation is planned? [5 marks]
(b) The performance of a processor typically improves when a modest number of pipeline stages are created. Why does it become difficult to maintain near-linear performance gains with deeper pipelines? [5 marks]
(c) Clustered superscalar processors partition functional units into clusters. Data forwarding within a cluster operates as normal, allowing dependent instructions to execute on consecutive clock cycles. Communication between clusters normally incurs an additional delay of 1 or 2 clock cycles. The clustering idea may also be extended to include the issue buffer (also known as the issue window).
(i) What problem does clustering attempt to solve? [5 marks]
(ii) Assume a processor has two symmetric clusters that contain both functional units and an issue buffer. In this processor, instructions must be steered to a particular cluster before they are inserted into an issue buffer. What should the two basic goals of a good steering policy be?

Question:
(a) Describe briefly six factors which might influence or constrain the design of a new processor. [6 marks]
(b) The performance of a superscalar processor is often enhanced with hardware to support the following:
branch prediction
register renaming
out-of-order execution
the speculative reordering of load instructions
strided prefetching
(i) Sketch an assembly language program that would benefit from the use of all of these techniques when executed on a superscalar processor. Briefly describe how each of the techniques helps to improve the performance of your program.
[10 marks]
(ii) Briefly outline two example programs for which the adoption of the techniques listed would not provide a significant performance improvement. [4 marks]

Question:
(a) What hardware and software techniques may be used to reduce the number of conflict misses experienced by a direct-mapped cache? [4 marks]
(b) How might a hardware prefetcher that is capable of detecting and prefetching non-unit strides be implemented? [4 marks]
(c) How can the MESI cache coherency protocol be exploited to ensure that a test-and-set instruction is performed atomically, without the need to lock down the bus for multiple cycles? [4 marks]
(d) Moore's law predicts that we will be able to integrate a very large number of processor cores onto a single chip in the near future. What constraints and challenges may limit our ability to exploit these chip multiprocessors? [8 marks]

Question:
(a) Why do modern processors typically exploit multiple cores rather than a single multithreaded processor? [5 marks]
(b) What does support for vector chaining and tailgating allow in a vector processor? [5 marks]
(c) How might recent advances in die stacking help to improve microprocessor performance and reduce costs? [5 marks]
(d) In what ways might parallelism be exploited to reduce the power consumption of a microprocessor?
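The non-unit-stride prefetcher asked about in part (b) of the cache question above is classically built as a reference prediction table. This is a minimal software sketch of that idea under stated assumptions: the class and field names are illustrative, and a real RPT would be a small set-associative SRAM with an added confidence/state field.

```python
class StridePrefetcher:
    """Sketch of a reference prediction table (RPT) indexed by the PC
    of a load instruction.  Each entry records the last address that
    the load accessed and the last observed stride; once the same
    non-zero stride is seen twice in a row, the next address in the
    pattern is returned as a prefetch candidate.
    """

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load at (pc, addr); return an address to prefetch,
        or None while the stride is still being trained."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, stride = entry
        new_stride = addr - last_addr
        self.table[pc] = (addr, new_stride)
        if new_stride == stride and stride != 0:
            return addr + new_stride  # stride confirmed, even if non-unit
        return None
```

Because the table is indexed by PC rather than by address, each static load trains its own stride, so a loop striding through a matrix column (a non-unit stride) is detected just as easily as a sequential scan.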