Question: Assume you have the following code:

void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long int i;
    int length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}

and you modify the code to use four-way loop unrolling and four parallel accumulators. Measurements for this function with the x
architecture show it achieves a CPE of for all types of data.
Assuming the model of the Intel i architecture shown in class (one branch unit, two arithmetic units, one load unit, and one store unit),
the performance of this loop with any arithmetic operation cannot get below CPE because of
When the same code is compiled for the IA architecture, it achieves a CPE of worse than the CPE of achieved
with just four-way unrolling. The most likely reason this occurs is because
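
For reference, the modification the question describes (four-way loop unrolling with four parallel accumulators) could be sketched as follows. This is only an illustration under stated assumptions: the question does not show the definitions of vec_ptr, vec_length, or get_vec_start, so the versions below are minimal hypothetical stand-ins; data_t is assumed to be long; the multiply-and-add loop body is assumed from the function's role as an inner product; and the name inner4_unrolled is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the question does not show these definitions. */
typedef long data_t;                      /* assumed element type */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

static long vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Inner product with 4-way unrolling and four independent accumulators. */
void inner4_unrolled(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    long limit = length - 3;              /* i < limit guarantees i+3 < length */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    assert(vec_length(v) == length);      /* assumed: vectors have equal length */

    /* Four elements per iteration. Each accumulator forms its own dependency
       chain, so the four multiply-add sequences do not wait on one another.
       Every element still needs two loads: udata[i] and vdata[i]. */
    for (i = 0; i < limit; i += 4) {
        sum0 += udata[i]     * vdata[i];
        sum1 += udata[i + 1] * vdata[i + 1];
        sum2 += udata[i + 2] * vdata[i + 2];
        sum3 += udata[i + 3] * vdata[i + 3];
    }

    /* Finish the remaining 0 to 3 elements one at a time. */
    for (; i < length; i++) {
        sum0 += udata[i] * vdata[i];
    }

    /* Combine the partial sums. */
    *dest = (sum0 + sum1) + (sum2 + sum3);
}
```

With integer data the result matches the simple loop exactly. With floating-point data, regrouping the additions across four accumulators can change rounding slightly. The sketch keeps nine values live in the loop (four accumulators, two pointers, the index, the limit, and a temporary product), which is worth keeping in mind when comparing targets with different numbers of registers.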
