@HaomingJiang
2017-11-11
Haoming Jiang, Wenjie Yao
Platform: Bridges, K80 (1-4 GPUs) or P100 (1-2 GPUs)
Denote the position of each entry of the cube with a tuple (i, j, k). Basically, we want to rearrange the computing order of the entries so that we can divide all entries into successive parts, where the entries in the same part can be computed in parallel.
Observation 1: For a fixed (i, j), any two entries (i, j, k1) and (i, j, k2) are not directly related, which means they can be computed in parallel.
Observation 2: Any stripe (i, j), i.e. the set of entries sharing the same (i, j), depends only on its 8 neighbor stripes.
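To make Observation 2 concrete, here is a small helper (the name stripes_independent is ours, not from the original code) that states when two distinct stripes may be updated concurrently:
// Sketch: under Observation 2, two distinct stripes (i1, j1) and (i2, j2) can be
// updated concurrently as long as neither is among the other's 8 neighbors,
// i.e. they differ by more than 1 in i or in j.
__host__ __device__ static inline bool stripes_independent(int i1, int j1, int i2, int j2)
{
    int di = i1 > i2 ? i1 - i2 : i2 - i1;   // |i1 - i2|
    int dj = j1 > j2 ? j1 - j2 : j2 - j1;   // |j1 - j2|
    return di > 1 || dj > 1;
}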
Based on the two observations above, we can derive the following coloring method, where all entries of the same color can be computed in parallel (a small sketch of the color function follows the list):
1. i%2==0 && j%2==0, color=0
2. i%2==0 && j%2==1, color=1
3. i%2==1 && j%2==0, color=2
4. i%2==1 && j%2==1, color=3
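The color of a stripe can be read off directly from the parity of its indices; a minimal sketch (the name stripe_color is ours):
// Sketch: map a stripe (i, j) to one of the four colors listed above.
// Stripes of the same color always differ by at least 2 in i or in j,
// so by Observation 2 they never depend on each other.
__host__ __device__ static inline int stripe_color(int i, int j)
{
    return (i % 2) * 2 + (j % 2);   // 0, 1, 2, 3 as in cases 1-4 above
}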
We evenly divide the (i, j) pairs of the same color into one part per block, so that each GPU block handles its own share of the pairs.
// Pseudocode
block_id  = get_block_id();
thread_id = get_thread_id();
for (iter = 1:max_iterations)
    for (color_id = 0:3)
    {
        // Each block processes its own share of the (i, j) pairs of this color.
        for ((i, j) in ijpairs[color_id][block_id])
            // Threads of the block split the k range [0, n) evenly.
            for (k = (n*thread_id)/thread_num : (n*(thread_id+1))/thread_num - 1)
                update(i, j, k);
        synchronize(); // or barrier(): wait for all blocks before the next color
    }
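A minimal CUDA sketch of how this pseudocode could be realized is below. It assumes the (i, j) pairs of each color have already been partitioned across blocks on the host; the IJ struct, the offsets layout, and the empty update() are our placeholders, not the original implementation. It uses one kernel launch per color so that the end of each launch plays the role of synchronize().
// Sketch (CUDA). Assumptions, not from the original code:
//   pairs   - the (i, j) pairs of ONE color, grouped so that block b owns the
//             slice [offsets[b], offsets[b+1])
//   n       - extent of the cube in the k direction
//   update  - placeholder for the actual entry update rule
struct IJ { int i, j; };

__device__ void update(int i, int j, int k)
{
    // actual update of entry (i, j, k) goes here
}

__global__ void color_sweep(const IJ *pairs, const int *offsets, int n)
{
    int block_id   = blockIdx.x;
    int thread_id  = threadIdx.x;
    int thread_num = blockDim.x;

    // Loop over the (i, j) pairs owned by this block.
    for (int p = offsets[block_id]; p < offsets[block_id + 1]; ++p) {
        int i = pairs[p].i;
        int j = pairs[p].j;
        // Threads of the block split the k range [0, n) evenly, as in the pseudocode.
        int k_begin = (n * thread_id) / thread_num;
        int k_end   = (n * (thread_id + 1)) / thread_num;
        for (int k = k_begin; k < k_end; ++k)
            update(i, j, k);
    }
}

// Host-side driver (sketch): one launch per color.
//   for (int iter = 0; iter < max_iterations; ++iter)
//       for (int c = 0; c < 4; ++c)
//           color_sweep<<<num_blocks, threads_per_block>>>(pairs_d[c], offsets_d[c], n);
Because consecutive launches on the default CUDA stream execute in order, each kernel boundary already acts as the barrier between colors, so no explicit device synchronization is needed inside the color loop.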
K80:
Number of cores: 4992
Total Memory Bandwidth: 480 GB/s
Memory: 24 GB GDDR5
Double-precision Performance: 2.91 Teraflops
Single-precision Performance: 8.73 Teraflops
P100:
Number of cores: 3584
Total Memory Bandwidth: 732 GB/s
Memory: 16 GB HBM2
Double-precision Performance: 4.7 Teraflops
Single-precision Performance: 9.3 Teraflops
PCIe x16 Interconnect Bandwidth (K80, P100): 32 GB/s
NVIDIA NVLink™ Interconnect Bandwidth (P100): 160 GB/s