In this exercise, we look at how software techniques can extract instruction-level parallelism (ILP) in a common vector loop. The following loop is the so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector of length 100. Initially, x1 is set to the base address of array X and x2 is set to the base address of Y:

Assume the functional unit latencies as shown in the following table. Assume a one-cycle delayed branch that resolves in the ID stage. Assume that results are fully bypassed.

a. Assume a single-issue pipeline. Show how the loop would look both unscheduled by the compiler and after compiler scheduling for both floating-point operation and branch delays, including any stalls or idle clock cycles. What is the execution time (in cycles) per element of the result vector, Y, unscheduled and scheduled? How much faster must the clock be for processor hardware alone to match the performance improvement achieved by the scheduling compiler? (Neglect any possible effects of increased clock speed on memory system performance.)
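The last question in part (a) reduces to a ratio of cycle counts: execution time is cycles multiplied by the clock period, so hardware running the unscheduled loop must raise its clock rate by the factor that scheduling removes. The following is a minimal sketch of that relationship. The function name `required_clock_speedup` and the cycle counts in the usage note are hypothetical placeholders, not the answer to the exercise, which depends on the latency table.

```python
def required_clock_speedup(unscheduled_cycles, scheduled_cycles):
    """Factor by which the clock rate must rise so the unscheduled loop
    takes the same time per element as the scheduled loop."""
    if unscheduled_cycles <= 0 or scheduled_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    # time = cycles * clock_period, so matching times requires
    # new_rate / old_rate = unscheduled_cycles / scheduled_cycles
    return unscheduled_cycles / scheduled_cycles
```

With made-up counts of 20 unscheduled and 10 scheduled cycles per element, `required_clock_speedup(20, 10)` evaluates to 2.0, meaning the clock would have to run twice as fast.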

b. Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop overhead instructions. How many times must the loop be unrolled? Show the instruction schedule. What is the execution time per element of the result?
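The transformation asked for in part (b) can be illustrated at a high level before it is written out in assembly. The sketch below is an assumption-laden illustration, not the scheduled answer: the function name `daxpy_unrolled` and the default unroll factor of 4 are hypothetical, and the inner `for` loop stands in for the straight-line copies of the loop body that a compiler would emit.

```python
def daxpy_unrolled(a, x, y, unroll=4):
    """DAXPY with the loop body replicated `unroll` times per iteration.

    Assumes len(x) is a multiple of `unroll`, so no cleanup loop is needed.
    """
    n = len(x)
    if unroll <= 0 or n != len(y) or n % unroll != 0:
        raise ValueError("lengths must match and be a multiple of unroll")
    i = 0
    while i < n:
        # One unrolled body: `unroll` independent element updates.
        for k in range(unroll):
            y[i + k] = a * x[i + k] + y[i + k]
        # A single index update and loop test per unrolled body,
        # which is what collapsing the loop overhead means.
        i += unroll
    return y
```

Because the element updates inside one unrolled body do not depend on each other, a scheduler is free to interleave their loads, multiplies, adds, and stores to hide functional unit latency.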

c. Assume a VLIW processor with instructions that contain five operations, as shown in Figure 3.20. We will compare two degrees of loop unrolling. First, unroll the loop 6 times to extract ILP and schedule it without any stalls (i.e., completely empty issue cycles), collapsing the loop overhead instructions, and then repeat the process but unroll the loop 10 times. Ignore the branch delay slot. Show the two schedules. What is the execution time per element of the result vector for each schedule? What percent of the operation slots are used in each schedule? How much does the size of the code differ between the two schedules? What is the total register demand for the two schedules?
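Two of the metrics requested in part (c) follow directly from a finished schedule: cycles per element is the schedule length divided by the unroll factor, and slot utilization is the number of useful operations divided by the total number of slots. A minimal sketch, assuming the five-operation instruction format stated in the exercise; the function name `vliw_metrics` and the numbers in the usage note are hypothetical and are not the exercise's answer.

```python
def vliw_metrics(useful_ops, cycles, unroll, slots_per_instruction=5):
    """Return (cycles per element, fraction of operation slots used)."""
    if min(useful_ops, cycles, unroll, slots_per_instruction) <= 0:
        raise ValueError("all arguments must be positive")
    total_slots = cycles * slots_per_instruction
    if useful_ops > total_slots:
        raise ValueError("more operations than available slots")
    return cycles / unroll, useful_ops / total_slots
```

For example, a made-up schedule with 20 useful operations in 8 cycles for 4 unrolled iterations gives `vliw_metrics(20, 8, 4)` equal to `(2.0, 0.5)`, that is, 2 cycles per element with half of the slots used.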

Figure 3.20

      addi    x4, x1, #800    ; x4 = upper bound for X
foo:  fld     F2, 0(x1)       ; (F2) = X(i)
      fmul.d  F4, F2, F0      ; (F4) = a*X(i)
      fld     F6, 0(x2)       ; (F6) = Y(i)
      fadd.d  F6, F4, F6      ; (F6) = a*X(i) + Y(i)
      fsd     F6, 0(x2)       ; Y(i) = a*X(i) + Y(i)
      addi    x1, x1, #8      ; increment X index
      addi    x2, x2, #8      ; increment Y index
      sltu    x3, x1, x4      ; test: continue loop?
      bnez    x3, foo         ; loop if needed
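As a high-level cross-check of what the assembly loop computes, it can be sketched in Python. This is a minimal sketch and not part of the original exercise: the function name `daxpy` and the use of plain Python lists are assumptions made for illustration. The byte offset mirrors the assembly, where each double occupies 8 bytes, so the 800-byte bound corresponds to 100 elements. One difference to note: the assembly tests the condition at the bottom of the loop, while this sketch tests it at the top, so only the sketch handles an empty vector.

```python
def daxpy(a, x, y):
    """Compute y[i] = a*x[i] + y[i] for every element, in place."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    offset = 0                  # byte offset, like the pointers in x1 and x2
    upper = 8 * len(x)          # 8 bytes per double; 800 for 100 elements
    while offset < upper:       # same comparison as sltu/bnez
        i = offset // 8
        y[i] = a * x[i] + y[i]  # fld, fmul.d, fld, fadd.d, fsd
        offset += 8             # addi: step to the next double
    return y
```

Calling `daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])` returns `[12.0, 24.0, 36.0]`.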

