Question:

Assume you have the following code:

void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    int length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}

and you modify the code to use 4-way loop unrolling and four parallel accumulators. Measurements for this function on the x86-64 architecture show that it achieves a CPE of 2.0 for all types of data.
Assuming the model of the Intel i7 architecture shown in class (one branch unit, two arithmetic units, one load unit, and one store unit), the performance of this loop with any arithmetic operation cannot get below a CPE of 2.0 because of ______.
When the same 4×4 code is compiled for the IA32 architecture, it achieves a CPE of 2.75, worse than the CPE of 2.25 achieved with just four-way unrolling. The most likely reason this occurs is because ______.
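
For reference, the transformation the question describes (4-way loop unrolling with four parallel accumulators) might look like the sketch below. This is only an illustration under stated assumptions: the name inner4_4x4, the minimal vec_rec struct, and the choice of long for data_t are hypothetical stand-ins, not the course's actual definitions. Note that each element processed still requires two loads, one from each vector.

```c
/* Assumption: data_t is long here; the question applies to any arithmetic type. */
typedef long data_t;

/* Hypothetical minimal stand-in for the course's vector abstraction. */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

static long vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Inner product with 4-way unrolling and four parallel accumulators.
   The name inner4_4x4 is illustrative only. */
void inner4_4x4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    long limit = length - 3;   /* last index at which a full group of 4 fits */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    /* Process four elements per iteration, each into its own accumulator,
       so the four multiply-add chains are independent of one another. */
    for (i = 0; i < limit; i += 4) {
        sum0 += udata[i]     * vdata[i];
        sum1 += udata[i + 1] * vdata[i + 1];
        sum2 += udata[i + 2] * vdata[i + 2];
        sum3 += udata[i + 3] * vdata[i + 3];
    }

    /* Finish any remaining elements (at most three). */
    for (; i < length; i++)
        sum0 += udata[i] * vdata[i];

    /* Combine the partial sums. */
    *dest = (sum0 + sum1) + (sum2 + sum3);
}
```

The four accumulators plus the two data pointers, the index, and the limit are all live inside the loop, which is worth keeping in mind when comparing a target with many registers against one with few.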