Question: Assignment 2 Benchmarking the memory subsystem Deadline: Thursday February 18, 2016 1 Analysis How is the memory subsystem structured on your machine. On Linux system

Assignment 2 Benchmarking the memory subsystem Deadline: Thursday February 18, 2016 1 Analysis How is the memory subsystem structured on your machine. On Linux system likwid-topology (dev by Google) is a good tool to know how the memory system is architectured How much DRAM memory? Memory speed? Memory Technology How much bandwidth can your processor draw from the bus? You will need to discover width of the memory bus and clock speed of the memory bus. Full duplex/half duplex? (On intel processor, ARK often indicates something.) What is the highest level of cache? How big is it? How is it shared across core? Same question for all levels of cache. 2 Bandwidth The purpose of this section is to measure the maximum memory bandwidth of the different components of the system. The easiest way to ensure that is to do the minimum amount of arithmetic operation per byte of data. For each level of the memory hierarchy, measure read, write and read/write bandwidth. The easiest way of doing this is to have each core do a measurable number of memory transfer on a piece of data of a particular size. Plot each bandwidth as a function of the size of the data each core work on. To measure read bandwidth, the easiest test is often to simply compute the sum of an array. To measure write bandwidth, the easiest test is often to set a memory region to zero. To measure read/write bandwidth, the easiest test is often to copy an array in an other one. The measurement itself can be an issue because of the fill-in (what happens at the beginning) and flushout (what happens at the end). A reasonable way to measure is to loop over your main operation multiple time and time only the middle loop iterations to make sure the measurement are carried out while all the cores are busy. 3 Latency The best way to measure memory latency is to perform memory operations that are not easily predicted. Linked list are certainly king in that context. Write an element-less singly linked list. (To be clear it is simply an array of integer next that refers to itself so that you traverse it by doing current = next[current];.) For different size of the list, measure the time it takes to follow a large number of links and report the time per link followed. You will need to set the list to see what you want to see. 1 the instruction cost of the traversal by setting the list to be core bound. the latency by shuffling the order of the list, but be careful not to build a short cycle. the TLB latency cost by ensuring that you are jumping memory pages every time in a hard to predict fashion. 4 Help 4.1 Make sure bandwidth is memory bound Look at the assembly to ensure the it is mostly IO. You can estimate a sufficient arithmetic to IO instruction by looking at flops/bandwidth ratio. 4.2 alignment Be wary of of alignment. Look at the difference between mm256 load ps and mm256 loadu ps. 4.3 Other interesting things mm256 stream load si256 makes a non-temporal read (wont be cached)

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!