Essay Question
3. (20%) Assume a GPU architecture that contains 10 SIMD processors. Each SIMD instruction has a
width of 32 and each SIMD processor contains 8 lanes for single-precision arithmetic and load/store
instructions, meaning that each non-diverged SIMD instruction can produce 32 results every 4
cycles. Assume a kernel that has divergent branches that cause on average 80% of threads to be
active. Assume that 70% of all SIMD instructions executed are single-precision arithmetic and 20%
are load/store. Since not all memory latencies are covered, assume an average SIMD instruction issue
rate of 0.85. Assume that the GPU has a clock speed of 1.5 GHz. Please compute the throughput, in
GFLOP/sec, for this kernel on this GPU.
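
One way to approach the computation is sketched below. It is a worked example under stated assumptions, not an official answer key: it assumes each single-precision arithmetic result counts as one floating-point operation, that load/store and other instructions contribute no FLOPs, and that the 0.85 issue rate applies to each SIMD processor independently. The function name `kernel_gflops` and its parameter names are illustrative only. Since 32 results every 4 cycles is 8 results per cycle per SIMD processor, the throughput is the product of the clock rate, the processor count, the per-cycle results, and the three derating fractions.

```python
def kernel_gflops(clock_ghz, num_simd_processors, simd_width, cycles_per_instruction,
                  active_fraction, fp_fraction, issue_rate):
    """Estimate sustained single-precision throughput in GFLOP/sec.

    Assumption: one arithmetic result equals one FLOP; load/store
    instructions contribute nothing to the FLOP count.
    """
    # Peak results per cycle for one SIMD processor (32 results / 4 cycles = 8).
    results_per_cycle = simd_width / cycles_per_instruction
    # Peak across the whole GPU, in giga-results per second.
    peak = clock_ghz * num_simd_processors * results_per_cycle
    # Derate for branch divergence, instruction mix, and issue stalls.
    return peak * active_fraction * fp_fraction * issue_rate


throughput = kernel_gflops(
    clock_ghz=1.5,
    num_simd_processors=10,
    simd_width=32,
    cycles_per_instruction=4,
    active_fraction=0.80,  # threads active after divergence
    fp_fraction=0.70,      # share of instructions that are SP arithmetic
    issue_rate=0.85,       # average SIMD instruction issue rate
)
print(f"{throughput:.2f} GFLOP/sec")  # prints "57.12 GFLOP/sec"
```

Under these assumptions the peak is 1.5 × 10 × 8 = 120 GFLOP/sec, and the derating factors 0.80 × 0.70 × 0.85 = 0.476 reduce it to about 57.12 GFLOP/sec. The 20% load/store share does not enter the product directly; it only explains part of the 30% of instructions that are not arithmetic.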