
Question:

Amazon Web Services (AWS) offers a wide variety of “computing instances,” which are machines configured to target different applications and scales. AWS prices tell us useful data about the Total Cost of Ownership (TCO) of various computing devices, particularly as computer equipment is often depreciated on a 3-year schedule. As of July 2017, a dedicated, compute-oriented “c4” computing instance includes two x86 chips with 20 physical cores in total. It rents on-demand for $1.75/hour, or $17,962 for 3 years. In contrast, a dedicated “p2” computing instance also has two x86 chips but with 36 cores in total, and adds 16 NVIDIA K80 GPUs. A p2 rents on-demand for $15.84/hour, or $184,780 for 3 years.

a. The c4 instance uses Intel Xeon E5-2666 v3 (Haswell) processors. The p2 instance uses Intel Xeon E5-2686 v4 (Broadwell) processors. Neither part number is listed officially on Intel's product website, which suggests that these parts are specially built for Amazon by Intel. The E5-2660 v3 part has a similar core count to the E5-2666 v3 and has a street price of around $1500. The E5-2697 v4 part has a similar core count to the E5-2686 v4 and has a street price of around $3000. Assume that the non-GPU portion of the p2 instance would have a price proportional to the ratio of street prices. What is the TCO, over 3 years, for a single K80 GPU?
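
One way to set up the arithmetic for part (a) is sketched below in Python. It is only a sketch under the assumptions the question states: the non-GPU portion of the p2 is taken to cost the c4's 3-year price scaled by the ratio of CPU street prices, and whatever is left of the p2 price is split evenly across the 16 GPUs. The function name `k80_tco` and its parameter names are made up for illustration.

```python
# Figures quoted in the question text (US dollars).
C4_TCO_3YR = 17_962      # c4 price for 3 years
P2_TCO_3YR = 184_780     # p2 price for 3 years
C4_CPU_STREET = 1_500    # street price of the E5-2660 v3 (stand-in for the c4 CPU)
P2_CPU_STREET = 3_000    # street price of the E5-2697 v4 (stand-in for the p2 CPU)
GPUS_PER_P2 = 16         # K80 GPUs in one p2 instance


def k80_tco(c4_tco=C4_TCO_3YR, p2_tco=P2_TCO_3YR,
            c4_cpu=C4_CPU_STREET, p2_cpu=P2_CPU_STREET,
            gpus=GPUS_PER_P2):
    """Estimate the 3-year TCO of one K80 GPU (hypothetical helper)."""
    # Assumption from the question: non-GPU cost scales with CPU street price.
    price_ratio = p2_cpu / c4_cpu
    non_gpu_cost = c4_tco * price_ratio
    # Whatever remains of the p2 price is attributed to the GPUs.
    gpu_cost_total = p2_tco - non_gpu_cost
    return gpu_cost_total / gpus


print(f"Estimated 3-year TCO per K80: ${k80_tco():,.2f}")
```

With the numbers above, the price ratio is 2, so the non-GPU portion comes to $35,924 and the estimate is about $9,300 per GPU.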

b. Suppose that you have a compute- and throughput-dominated workload that runs at rate 1 on the c4 instance and at rate T on the GPU-accelerated p2 instance. How large must T be for the GPU-based solution to be more cost-effective? Suppose that each general-purpose CPU core can compute at a rate of about 30G single-precision FLOPS. Ignoring the CPUs of the p2 instance, what fraction of peak K80 FLOPS would be required to reach the same rate of computation as the c4 instance?
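
The two quantities in part (b) can be sketched as follows. The break-even T is read here as the ratio of the 3-year prices, and the c4 rate as core count times per-core FLOPS. The question does not give a peak K80 FLOPS figure, so `k80_peak_gflops` is left as a caller-supplied parameter rather than a value from the text; whether to count one GPU or all 16 is also an interpretation, exposed as `gpus`. All function names are hypothetical.

```python
def break_even_speedup(c4_tco=17_962, p2_tco=184_780):
    """Speedup T at which the p2 matches the c4 in work per dollar."""
    # p2 does T units of work for p2_tco; c4 does 1 unit for c4_tco.
    # Equal cost-effectiveness when T / p2_tco == 1 / c4_tco.
    return p2_tco / c4_tco


def c4_peak_gflops(cores=20, gflops_per_core=30):
    """Aggregate single-precision rate of the c4 CPUs, in GFLOPS."""
    return cores * gflops_per_core


def required_k80_fraction(k80_peak_gflops, gpus=16):
    """Fraction of peak GPU FLOPS needed to match the c4 rate.

    k80_peak_gflops is NOT given in the question; the caller must supply it.
    """
    return c4_peak_gflops() / (k80_peak_gflops * gpus)


print(f"Break-even T: {break_even_speedup():.2f}")
print(f"c4 rate: {c4_peak_gflops()} GFLOPS")
```

The p2 is the more cost-effective choice only when T exceeds the break-even value, which is a little over 10 with these prices.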

c. AWS also offers "f1" instances that include 8 Xilinx UltraScale+ VU9P FPGAs. They rent at $13.20/hour, or $165,758 for 3 years. Each VU9P device includes 6840 DSP slices, which can perform 27 × 18-bit integer multiply-accumulate operations (recall that one multiply-accumulate counts as two "operations"). At 500 MHz, what is the peak multiply-accumulate operations/cycle that an f1-based system might achieve, counting all 8 FPGAs toward the computation total? Assuming that the integer operations on the FPGAs can substitute for floating-point operations, how does this compare to the peak single-precision multiply-accumulate operations/cycle of the GPUs of the p2 instance? How do they compare in terms of cost-effectiveness?
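
A rough sketch of the f1 side of part (c) follows. It assumes each DSP slice completes one multiply-accumulate per cycle, which is how the question appears to intend it but is not stated outright. The GPU side needs a peak K80 figure that the question does not supply, so only a generic dollars-per-throughput helper is shown for that comparison. Function names are hypothetical.

```python
def f1_peak(dsp_slices=6840, fpgas=8, clock_hz=500e6, ops_per_mac=2):
    """Peak throughput of one f1 instance.

    Assumes one multiply-accumulate per DSP slice per cycle.
    Returns (MACs per cycle, operations per cycle, operations per second).
    """
    macs_per_cycle = dsp_slices * fpgas
    ops_per_cycle = macs_per_cycle * ops_per_mac
    ops_per_second = ops_per_cycle * clock_hz
    return macs_per_cycle, ops_per_cycle, ops_per_second


def dollars_per_tera_ops(tco_dollars, ops_per_second):
    """3-year cost per tera-operation/second of peak throughput."""
    return tco_dollars / (ops_per_second / 1e12)


macs, ops, ops_s = f1_peak()
print(f"f1: {macs:,} MACs/cycle, {ops:,} ops/cycle, {ops_s / 1e12:.2f} Tops/s")
print(f"f1: ${dollars_per_tera_ops(165_758, ops_s):,.0f} per peak Tops/s")
```

The same `dollars_per_tera_ops` call with the p2 price ($184,780) and an aggregate peak rate for its 16 GPUs would give the figure to compare against.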
