Question: Data speculation refers to the execution of an instruction before some logically preceding instructions on which it may be data dependent. Data speculation implies optimistically
Data speculation refers to the execution of an instruction before some logically preceding instructions on which it may be data dependent. Data speculation implies optimistically assuming the absence of a dependence; thus a given instruction is speculatively executed by assuming no dependence and in case the assumption turns out to be false, a recovery mechanism is initiated. Load instructions usually have long latency and are thus the best candidates for speculative execution. Starting the execution of an instruction earlier can hide its latency and could result in a better schedule. In this part, a 1-wide VLIW VEX assembly file data.s is given to you (1-wide VLIW implies a sequential code with no parallelism), and your job is to first remove false data dependencies (WAW and WAR) by register renaming and then perform data speculation by hoisting loads ahead of stores which they have potential conflicts with and by inserting speculation check instructions chk.s for a recovery in case of mis-speculation. By moving these loads at earlier points will allow us to schedule dependent operations (which use the results of the loaded values in registers) earlier too; thus, the overall schedule may be improved. Speculation does not come free; the checks must be done, in our case in software through VEX instructions and the payoffs are determined by the benefits of speculations vs its costs (checks). The way you will implement this will be as follows: we will focus on a memory dependence (as discussed earlier) from a store to a subsequent load that indicates the MAY dependence (this MAY dependence indicates that the memory addresses of the store and load may be the same at runtime indicating a dependence). During scheduling, you will suppress this dependence with a corresponding check insertion (thru a complex instruction check) at the original load position. You will then run the scheduling algorithm on this modified dependence graph (use the Critical Path - Fan-Out - Source heuristic order). After the schedule is generated, if the load is moved up over the store on which it was dependent, you will convert the load into a dismissible load and you will expand the check instruction as shown in examples below. If the scheduler does not move the load up over the store, the load instruction remains non-speculative and the check instruction is redundant and is removed. First, we give the details of register and memory dependences below. Register dependency: There are three types of data dependencies due to registers: RAW, WAW, WAR. Only the first is the true dependency and thus should be respected, while the others can be removed by register renaming as shown below. Note that the change in (3) is the side effect of breaking false data dependency between (1) and (2), and must be carried out for the correctness of the program. The number of available general purpose registers and their usage can be found in table 11 and table 8 of the VEX manual (page 36 and page 14). Please pick registers from $r0.3 to $r0.56 for renaming. When you remove the dependence thru renaming, ensure you remove the corresponding dependence edge from the graph; when you remove the dependence constraint, the scheduler will have more freedom to arrange instructions and will result in better code. WAW between (1) and (2) (1) c0 add $r0.4 = $r0.x + 10 -----> c0 add $r0.4 = $r0.x + 10 (2) c0 add $r0.4 = $r0.x + 20 -----> c0 add $r0.5 = $r0.x + 20 (3) c0 add $r0.x = $r0.4 + 5 -----> c0 add $r0.x = $r0.5 + 5 WAR between (1) and (2) (1) c0 add $r0.x = $r0.2 + 10 -----> c0 add $r0.x = $r0.2 + 10 (2) c0 add $r0.2 = $r0.x + 20 -----> c0 add $r0.5 = $r0.x + 20 (3) c0 add $r0.x = $r0.2 + 3 -----> c0 add $r0.x = $r0.5 + 3 Speculative load: You have to perform data speculation by replacing normal loads with dismissible loads and hoisting them ahead of stores with which they have potential conflicts with and then inserting speculation check instructions chk.s(m1, m2, r) for a recovery in case of mis-speculation at their original places. The primary reason that stops moving a load above a store is the memory dependence - the store could write to the same location as load and the results in such cases would be incorrect. Thus, at runtime, one must check if there is an address conflict between store and load and re-execute the load at the current position in case there is a conflict. An example is given below in which (2) is scheduled before (1) by speculating that 0[$r0.3] is not equal to 0x10[$r0.2]. Also, a chk.s() is inserted at the original load position place for checking if the prediction is correct and recovering if it fails. One possible way to implement this is to first suppress the may dependence between a store and corresponding load and run the scheduler; if the load is hoisted above the store, the load should be converted to a dismissible load and a check must be inserted after the store to detect and correct the speculation. If the load is scheduled below the store, the dependence is respected anyway and there is no need for a check nor for the conversion of the load. c0 ldw.d $r0.4 = 0[$r0.3] (1) c0 stw 0x10[$r0.2] = $r0.5 --> c0 stw 0x10[$r0.2] = $r0.5 (2) c0 ldw $r0.4 = 0[$r0.3] --> chk.s(0x10[$r0.2],0[$r0.3], r0.4) You are free to choose any implementation of check as you may want, make sure you go for as efficient an implementation as you can because you want this overhead to be smallest. Sample expansion of checking instructions (this is a new example, not the continuation of the above) Suppose the load instruction is c0 $r0.12 = 0x10[$r0.6] and the store instruction stw 0x14[$r0.3] = $r0.15, then the checking instruction is chk.s(0x10[$r0.6], 0x14[$r0.3] = $r0.15, $r0.6). One possible expansion of chk instruction is as follows: c0 add $r0.52 = $r0.6, 0x10 # ldw.d $r0.12 = 0x10[$r0.6] c0 add $r0.53 = $r0.3, 0x14 # stw 0x14[$r0.3] = $r0.15 c0 cmpne $b0.0 = $r0.52, $r0.53 c0 brf $b0.0, Label1 Label0: ... [original instructions] ... Label1: c0 ldw.d $r0.12 = 0x10[$r0.6] c0 goto Label0 In the above implementation we first calculate the two actual addresses through two add instructions in registers $r0.52 and $r0.53 and then compare them and if they are equal, $b0.0 will be false when the control goes to Label1 and re-executes the original load and then executes a goto that goes to the next instruction. Note that the code for Label1 is hoisted at the end of the function or procedure. Thus, on a mis-speculation path we encounter a major penalty of two branches and on the speculation path we encounter an overhead of first 4 instructions. Thus, the optimization makes sense when load latency is quite high (tens of cycles) and load distance (latency between loading of a value and its use) is small. For simplicity, in case there are lots of prior stores before a load instruction, please consider the most recent previous store only. In addition, you don't need to account for resources required by the checking code and you may assume chk.s() takes 0 cycle to complete. High-Level Overview: Use the data dependence graph built in Problem 2 from the 1-wide VEX assembly code, and each node represents one operation in the assembly code Annotate each edge with one or more where di is the delay as defined in the original listing algorithm, ri is the shared register between the two nodes (mark - if it is a memory dependence) and ti is the type of data dependency (either False or True for register dependences; either MUST or MAY for memory dependences) Perform the register renaming by choosing free registers from $r0.3 to $r0.56 and then remove appropriate annotations on corresponding edges or even edges themselves (if no annotations are left) from the graph. [Hint] Dont forget the side effects by examining appropriate edges with annotation of that shared register down to the tail. Remove the may dependence between a store and subsequent loads and then re-run the scheduling algorithm on the modified dependence graph. [Hint] You may have to fill in slots with NOPs when there are no non-nop instructions that can be scheduled given current constraints. Template code: A code template is provided to you in vliwSpeculation.cpp. Complete the required functions to perform the list scheduling. Feel free to define any helper functions in the code but DO NOT modify any of the provided functions. Instruction format: In vliwSpeculation.cpp, the helper function readInput returns a subset of instructions in the input assembly file that need to be scheduled. Note that that subset may contain ;; delimiters and NOP instructions (their formats are given below) which you should discard when you are doing scheduling. All instructions in that vector are well formatted, and you can append them to your scheduling output directly. However, you may have to append NOP instructions to your output when there are no non-nop instructions which can be scheduled given current constraints and ;; delimiters to the output when all the slots of current VLIW instruction word are filled up. Format of NOP instruction: Format of ;; delimiter: ;;< > Deliverables: vliwSpeculation.cpp Arguments: i. Input Filename - Name of 1-wide VEX assembly file of instructions to schedule ii. Output Filename - Name of 4-wide VEX schedule for input instructions iii. Mode - Controls whether register renaming and speculation optimizations are executed Mode = 0 - No register renaming or speculation in scheduled code Should be same output as Mode = 5 in previous part. Mode = 1 - Register renaming but no speculation in scheduled code Mode = 2 - Register renaming and speculation in scheduled code This code should be commentated well to describe the algorithms and data structures used. Team member contributions should be labelled in the comments for functions, or even specific blocks of code. Ex. // Author: gburdell3 The following 4-wide VEX assembly files of scheduled instruction using register renaming and/or data speculation: 4_wide_rename_data.s - 4-wide VEX assembly file from scheduling with register renaming. 4_wide_rename_spec_data.s - 4-wide VEX assembly file from scheduling with register renaming and speculation . Part2_Report.pdf which contains the following: A count of the number of speculative loads that were created in the schedule. A graph of the performance comparison between non-speculation no-renaming 4-wide code, renaming 4-wide code, and speculative and renaming 4-wide code (plot the corresponding timings in ms side by side). Use the load latency of 20 cycles. Please schedule with "visible" (to human eyes) latencies and not rely on the hardware pipelining ability.
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
