
CUDA Lab Report

GPU Parallel Computing: Course Labs and Report


陈实 SA17011008, School of Computer Science, October 31, 2017

1. Experiment Environment

For the environment setup, see the separate note first:
Installing the Ubuntu experiment environment

2. Ubuntu Samples Test

The labs in this course run on two platforms: Ubuntu (a server provided by the school) and Windows (my own laptop). The problems encountered along the way are recorded here for sharing.

2.1 deviceQuery

Let's try a few examples. First cd into the sample directory:

    ubuntu@ubuntu:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$

Build with make, then run ./deviceQuery to see the server's GPU information:


The output, pasted as text:

    Detected 4 CUDA Capable device(s)

    Device 0: "Tesla K80"
      CUDA Driver Version / Runtime Version          8.0 / 8.0
      CUDA Capability Major/Minor version number:    3.7
      Total amount of global memory:                 11440 MBytes (11995578368 bytes)
      (13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
      GPU Max Clock rate:                            824 MHz (0.82 GHz)
      Memory Clock rate:                             2505 Mhz
      Memory Bus Width:                              384-bit
      L2 Cache Size:                                 1572864 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
      Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Devices 1, 2, and 3 (all "Tesla K80") report exactly the same properties, differing only in PCI Bus ID: 6, 133, and 134 respectively.

    > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
    > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
    > Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
    > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
    > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
    > Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
    > Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
    > Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
    > Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : Yes
    > Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
    > Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
    > Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : Yes

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = Tesla K80, Device1 = Tesla K80, Device2 = Tesla K80, Device3 = Tesla K80
    Result = PASS
The listing shows four CUDA devices, yet I remember the machine was provisioned with only two K80 cards. A quick check explains this: each Tesla K80 carries two GK210 dies, so two cards appear as four devices. Each GK210 die has 15 SMX arrays (2880 stream processors) on silicon, of which only 13 are enabled here; at 192 cores per SMX that gives 13 × 192 = 2496 stream processors per die, matching the deviceQuery output, and twice that per card. With 24 GB of memory per card, this is quite a capable GPU. I did not examine the other parameters closely; they can wait until needed.

To make the later examples convenient, go up to the samples root directory and run make there; this generates an executable for every program in the subdirectories.

Once "Finished building CUDA samples" scrolls by, the build is complete.

2.2 An nvcc problem

I wanted to compile one of the Windows examples by hand on the server, but typing nvcc made it look as if it were not installed. On second thought, make had just compiled the samples, so nvcc had to be there; more likely the environment variables were simply not set. Looking in the CUDA install directory, nvcc was sitting right there:

    cd /usr/local/cuda/bin/


Time to configure the environment variables:

    vim ~/.bashrc
    # Environment variables were already set here, but the CUDA path was wrong.
    # Fix it and add the bin directory; append these lines at the end:
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
    export CUDA_HOME=/usr/local/cuda
    export PATH=$PATH:/usr/local/cuda/bin
    # Save, then reload the environment:
    source ~/.bashrc

Testing nvcc again, it is now found.

3. Installing a VNC Server

The server cannot directly run code that opens a GUI, for example this simple Python snippet:

    # -*- coding:utf-8 -*-
    from PIL import Image
    import matplotlib.pyplot as plt

    bg_pic = Image.open('01.jpg')
    plt.figure()
    plt.imshow(bg_pic)
    plt.axis('off')
    plt.show()

Since a GUI may be needed later, I installed VNC; here is a brief record.

Install the service:

    sudo apt-get install vnc4server

Edit the file below, adding your own entry:

    vim /etc/sysconfig/vncservers
    VNCSERVERS="3:ubuntu"
    VNCSERVERARGS[3]="-geometry 1920x1080 -alwaysshared"

Under your own user, run the command to start the VNC process:

    vncserver
    # enter the password twice when prompted

Still under your own user, edit /home/ubuntu/.vnc/xstartup so that it reads:

    #!/bin/sh
    # Uncomment the following two lines for normal desktop:
    # unset SESSION_MANAGER
    # exec /etc/X11/xinit/xinitrc
    export XKL_XMODMAP_DISABLE=1
    unset SESSION_MANAGER
    unset DBUS_SESSION_BUS_ADDRESS
    gnome-panel &
    gnome-settings-daemon &
    metacity &
    nautilus &
    gnome-terminal &

Maintenance:

    vncserver -kill :3   # stop
    vncserver :3         # start

With that, connecting from vncviewer on Windows makes it possible to run programs with a GUI.

4. Windows Test

My own laptop already had a CUDA environment installed from my graduation project on reinforcement learning, where it accelerated tensorflow-gpu and similar stacks. I took no notes back then and the installation posed no real difficulty, so the steps are omitted here.

4.1 deviceQuery

On Windows we program CUDA from Visual Studio; the Windows environment in this report is VS2013. Running the deviceQuery example first gives:

    CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "GeForce GT 755M"
      CUDA Driver Version / Runtime Version          8.0 / 8.0
      CUDA Capability Major/Minor version number:    3.0
      Total amount of global memory:                 2048 MBytes (2147483648 bytes)
      ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
      GPU Max Clock rate:                            1020 MHz (1.02 GHz)
      Memory Clock rate:                             2700 Mhz
      Memory Bus Width:                              128-bit
      L2 Cache Size:                                 262144 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
      Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 755M
    Result = PASS

4.2 Creating a CUDA Project and Testing the Environment

If the installation went through cleanly, Visual Studio should already offer a CUDA template; select it and create a new CUDA project.

The new project comes with a built-in CUDA example. Running it produces output, which confirms the CUDA runtime environment is working.

5. CUDA Experiments

5.1 First Program: init.cu

The first step is to understand how a CUDA program initializes, so I created a new CUDA file and wrote the code based on the lecture slides, working through it to understand it.

The slide code does not explain what each line does, so parts of it are modified here for a deeper understanding. The code:

    #include <cuda_runtime.h>
    #include <iostream>
    using namespace std;

    // 2017-10-23
    // 陈实 SA17011008
    // CUDA initialization
    bool InitCUDA()
    {
        int count;
        // Get the number of CUDA-capable devices
        cudaGetDeviceCount(&count);
        // No suitable hardware
        if (count == 0) {
            cout << "No usable device";
            return false;
        }
        int i;
        // Check each device's properties; if the major version is at least 1,
        // consider the device found
        for (i = 0; i < count; i++) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
                if (prop.major >= 1) {
                    // Print, and set a breakpoint here to inspect in the debugger
                    cout << "Device " << i << ": " << prop.major << "." << prop.minor << endl;
                    break;
                }
            }
        }
        if (i == count) {
            cout << "No CUDA device with compute capability 1.x or higher found" << endl;
            return false;
        }
        cudaSetDevice(i);
        return true;
    }

    int main()
    {
        if (InitCUDA())
            cout << "Initialization succeeded!" << endl;
        return 0;
    }

By inspecting the prop variable we can observe many device parameters and print the CUDA compute capability our own device supports.
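As an illustration, a few more cudaDeviceProp fields can be printed inside the device loop; this extra printout is my own addition for inspection, not part of the slide code:

    // Extra inspection output (my addition): more cudaDeviceProp fields
    cout << "Name: " << prop.name << endl;
    cout << "Global memory: " << prop.totalGlobalMem / (1024 * 1024) << " MB" << endl;
    cout << "Multiprocessors: " << prop.multiProcessorCount << endl;
    cout << "Max threads per block: " << prop.maxThreadsPerBlock << endl;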
Result on Windows: compute capability 3.0.

Upload the file to the server with WinSCP, then compile and run:

    nvcc init.cu -o init.out

Result on the Ubuntu server: compute capability 3.7.

5.2 Primality Testing

5.2.1 Initial Test

This experiment takes primality testing as its subject: test whether a number is prime with the unoptimized brute-force algorithm and measure the time.
The test value is planned to be the largest prime in int range: 2147483647.
I wrote a simple program whose kernel is as follows:

    __global__ static void is_prime_g(int *x, bool *result)
    {
        if (*x == 0 || *x == 1)
        {
            *result = false;
            return;
        }
        for (int j = 2; j < *x; j++)
        {
            if (*x % j == 0)
            {
                *result = false;
                return;
            }
        }
        *result = true;
    }
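For completeness, here is a minimal sketch of the host side that drives this kernel; the report does not show that part, so the variable names below (dev_x, dev_result) are illustrative assumptions:

    // Hypothetical host-side driver for is_prime_g (names are illustrative)
    int x = 2147483647;
    bool result;
    int *dev_x;
    bool *dev_result;
    cudaMalloc((void**)&dev_x, sizeof(int));
    cudaMalloc((void**)&dev_result, sizeof(bool));
    cudaMemcpy(dev_x, &x, sizeof(int), cudaMemcpyHostToDevice);
    is_prime_g<<<1, 1>>>(dev_x, dev_result); // one GPU thread does all the work
    cudaMemcpy(&result, dev_result, sizeof(bool), cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
    cudaFree(dev_result);

Launched as <<<1, 1>>>, the entire trial-division loop runs on a single GPU thread, which is why this version cannot be expected to beat the CPU.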

GPU running time is obtained through the CUDA event API:

    cudaEvent_t start, stop;
    float Gpu_time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // kernel call
    //........
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&Gpu_time, start, stop); // GPU timing
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("Gpu time is: %f ms\n", Gpu_time);
CPU time is recorded with clock(), running the function many times in a loop and averaging. Note that clock() returns ticks, not milliseconds, so the difference must be scaled by CLOCKS_PER_SEC:

    begin = clock(); // start timing
    for (int i = 0; i < n; i++)
    {
        //......
    }
    end = clock(); // stop timing
    // clock() returns ticks; convert to milliseconds before averaging
    printf("%d runs took %f ms in total, %f ms on average\n",
           n,
           (end - begin) * 1000.0 / CLOCKS_PER_SEC,
           (end - begin) * 1000.0 / CLOCKS_PER_SEC / n);

The experiment records:
Windows: [timing screenshot]
K20: [timing screenshot]
K80: [timing screenshot]

Scaling the number up by a factor of 10:
Windows: errored out [screenshot]
K20: [timing screenshot]
K80: [timing screenshot]

Because the Windows GPU errored out, the experiments below run only on the servers.
Changing the thread count to 256:

Prime 2048261:
K20: [timing screenshot]
K80: [timing screenshot]

Prime 20232347:
K20: [timing screenshot]
K80: [timing screenshot]

The efficiency did not improve. Changing the thread count to 1024 and recompiling, the runtime actually got worse; most likely the program structure is at fault. The approach on the lecture slides does not apply directly here because the problem is organized differently, so the thread count cannot simply be tuned the same way.

5.2.2 Optimized Test

So let's change the approach: each thread tests whether the candidate is divisible by its share of the potential divisors, writes its verdict into shared memory, and the verdicts are then combined; if no thread found a divisor, the number is prime. Timing still uses the CUDA event API rather than the method on the slides. The CPU is no longer timed (it would be meaningless) and is used only to verify that the result is correct.

Following the lecture material, the modified code is as follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <iostream>
    #include <time.h>
    using namespace std;

    #define THREAD_NUM 1
    #define BLOCK_NUM 1

    // host code
    // Generate the array of candidate divisors: 2, 3, ..., size-1
    void GenerateNumbers(long *number, int size)
    {
        for (int i = 0; i < size - 2; i++) {
            number[i] = i + 2;
        }
    }

    // device code
    // Kernel: each thread tests a strided subset of the candidate divisors.
    // (Two bugs in the first draft are fixed here: the divisor index must be
    // num[i], and each thread must accumulate its verdict across loop
    // iterations instead of overwriting shared[tid] on every pass.)
    __global__ static void IsPrime(long *num, bool *result, int TEST)
    {
        extern __shared__ bool shared[];
        const int tid = threadIdx.x; // thread index within the block
        const int bid = blockIdx.x;  // block index within the grid
        bool divisible = false;
        // The array holds TEST-2 candidates (2 .. TEST-1)
        for (int i = bid * THREAD_NUM + tid; i < TEST - 2; i += BLOCK_NUM * THREAD_NUM)
        {
            if (TEST % num[i] == 0) // divisor found
                divisible = true;
        }
        shared[tid] = divisible;
        __syncthreads(); // barrier: all verdicts written
        if (tid == 0)
        {
            result[bid] = false;
            for (int i = 0; i < THREAD_NUM; i++)
            {
                if (shared[i])
                    result[bid] = true;
            }
        }
    }

    // Naive CPU primality test
    bool is_prime(int x)
    {
        if (x == 0 || x == 1)
            return false;
        for (int j = 2; j < x; j++)
        {
            if (x % j == 0)
                return false;
        }
        return true;
    }

    // host code
    int main()
    {
        int TEST;
        cin >> TEST;
        long *data = new long[TEST];
        GenerateNumbers(data, TEST); // generate the candidates to test
        // Declare and allocate device memory
        long *gpudata;
        bool *result;
        cudaMalloc((void**)&gpudata, sizeof(long) * TEST);
        cudaMalloc((void**)&result, sizeof(bool) * BLOCK_NUM);
        // Copy the data over
        cudaMemcpy(gpudata, data, sizeof(long) * TEST, cudaMemcpyHostToDevice);
        // Timing via the event API
        cudaEvent_t start, stop;
        float Gpu_time;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        // Launch the kernel
        IsPrime<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(bool)>>>(gpudata, result, TEST);
        // Stop timing
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&Gpu_time, start, stop); // GPU timing
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        printf("Gpu time is: %f ms\n", Gpu_time);
        bool sum[BLOCK_NUM];
        // Copy the result back
        cudaMemcpy(&sum, result, sizeof(bool) * BLOCK_NUM, cudaMemcpyDeviceToHost);
        // Free device memory
        cudaFree(gpudata);
        cudaFree(result);
        // Combine per-block verdicts (not timed)
        bool isprime = true;
        for (int i = 0; i < BLOCK_NUM; i++)
        {
            if (sum[i])
            {
                isprime = false;
                break;
            }
        }
        // GPU verdict
        if (isprime)
            printf("GPU:%d is a prime\n", TEST);
        else
            printf("GPU:%d is not a prime\n", TEST);
        // CPU verification
        clock_t begin, end; // start/end marks
        begin = clock(); // start timing
        if (is_prime(TEST))
            printf("CPU:%d is a prime\n", TEST);
        else
            printf("CPU:%d is not a prime\n", TEST);
        end = clock(); // stop timing
        printf("Cpu time is: %d\n", (int)(end - begin)); // raw clock() ticks; divide by CLOCKS_PER_SEC for seconds
        return 0;
    }
Each concrete thread (thread tid of block bid) handles one or several candidate divisors, and there are BLOCK_NUM * THREAD_NUM threads in total. To test the number TEST, every candidate 2, 3, 4, ... up to TEST-1 must be tried as a divisor, i.e. TEST-2 numbers in all. These candidates are assigned to the threads in order, and the leftovers wrap around: indices i and i + BLOCK_NUM * THREAD_NUM are handled by the same thread of the same block, as the worked example below shows.
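To make the mapping concrete, here is a small worked example (illustrative numbers, not the values used in the tests below):

    // Suppose BLOCK_NUM = 2 and THREAD_NUM = 4, i.e. 8 threads in total.
    // Start index g = bid * THREAD_NUM + tid; stride = BLOCK_NUM * THREAD_NUM = 8.
    //   thread (bid=0, tid=0) tests num[0], num[8],  num[16], ...
    //   thread (bid=0, tid=3) tests num[3], num[11], num[19], ...
    //   thread (bid=1, tid=0) tests num[4], num[12], num[20], ...
    //   thread (bid=1, tid=3) tests num[7], num[15], num[23], ...
    // Together the 8 threads cover every candidate divisor exactly once.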

Test 1:

    #define THREAD_NUM 1
    #define BLOCK_NUM 1

K20: [timing screenshot]
K80: [timing screenshot]

Test 2:

    #define THREAD_NUM 1024
    #define BLOCK_NUM 1

K20: [timing screenshot]
K80: [timing screenshot]

The results show that raising the thread count brought no speedup. Suspecting the GPU timing was being misused, I checked with nvprof:

    ubuntu@ubuntu:~/cuda_test$ nvprof ./prime_3.out
    202261583
    ==1365== NVPROF is profiling process 1365, command: ./prime_3.out
    Gpu time is: 0.072064 ms
    GPU:202261583 is a prime
    CPU:202261583 is a prime
    Cpu time is: 811364
    ==1365== Profiling application: ./prime_3.out
    ==1365== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     99.97%  236.60ms         1  236.60ms  236.60ms  236.60ms  [CUDA memcpy HtoD]
      0.03%  62.752us         1  62.752us  62.752us  62.752us  IsPrime(long*, bool*, int)
      0.00%  3.3600us         1  3.3600us  3.3600us  3.3600us  [CUDA memcpy DtoH]
    ==1365== API calls:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     58.06%  337.27ms         2  168.64ms  375.87us  336.90ms  cudaMalloc
     40.76%  236.75ms         2  118.37ms  37.190us  236.71ms  cudaMemcpy
      0.52%  3.0149ms       364  8.2820us     246ns  279.56us  cuDeviceGetAttribute
      0.43%  2.5004ms         4  625.10us  615.91us  629.75us  cuDeviceTotalMem
      0.15%  878.57us         2  439.28us  170.20us  708.37us  cudaFree
      0.04%  232.24us         4  58.060us  56.043us  60.786us  cuDeviceGetName
      0.02%  124.54us         1  124.54us  124.54us  124.54us  cudaEventSynchronize
      0.01%  43.690us         1  43.690us  43.690us  43.690us  cudaLaunch
      0.00%  9.5710us         2  4.7850us  1.7290us  7.8420us  cudaEventCreate
      0.00%  9.3820us         2  4.6910us  2.9250us  6.4570us  cudaEventRecord
      0.00%  7.5690us         3  2.5230us     270ns  6.7980us  cudaSetupArgument
      0.00%  5.0280us        12     419ns     245ns     762ns  cuDeviceGet
      0.00%  4.3560us         1  4.3560us  4.3560us  4.3560us  cudaEventElapsedTime
      0.00%  2.9730us         3     991ns     281ns  2.0150us  cuDeviceGetCount
      0.00%  2.8080us         2  1.4040us     740ns  2.0680us  cudaEventDestroy
      0.00%  1.9640us         1  1.9640us  1.9640us  1.9640us  cudaConfigureCall

The profile confirms the computation itself is very short: the IsPrime kernel takes only about 63 µs, while the host-to-device copy of the candidate array dominates at about 237 ms. There is essentially nothing to gain from optimizing the kernel; the copy is the real cost, and the natural fix is to avoid copying the array at all, which is what the next version does.

5.2.3 Further Optimization

Because of the choice of problem, a before/after comparison of the optimizations cannot really be measured here. Still, following the other ideas from the slides, the kernel is rewritten using the block and grid built-in variables and returns its verdict by writing a flag value, which works quite well. The code:

  1. #include "stdio.h"
  2. #include "stdlib.h"
  3. #include<iostream>
  4. #include <cuda_runtime.h>
  5. using namespace std;
  6. //define kernel
  7. __global__ void prime_kernel(int *d_mark, int N)
  8. {
  9. int i = blockIdx.x * 256 + threadIdx.x + 16 + threadIdx.y;//threadIdx.x*16
  10. if (i >= 2 && N%i == 0)//判断条件正确
  11. *d_mark = 0;
  12. }
  13. //原始素数求法
  14. bool is_prime(int x)
  15. {
  16. if (x == 0 || x == 1)
  17. return false;
  18. for (int j = 2; j < x; j++)
  19. {
  20. if (x%j == 0)
  21. return false;
  22. }
  23. return true;
  24. }
  25. int main()
  26. {
  27. int N;
  28. cin >> N;
  29. int *h_mark;
  30. h_mark = (int *)malloc(sizeof(int)* 1);
  31. *h_mark = 1;//if *h_mark == 1 N is a prime number; else N is not a prime number
  32. //分配空间
  33. int *d_mark;
  34. cudaMalloc((void **)&d_mark, sizeof(int));
  35. // 拷贝数据
  36. cudaMemcpy(d_mark, h_mark, sizeof(int), cudaMemcpyHostToDevice);
  37. // 设置执行参数
  38. dim3 block(16, 16); //可以用一维实现,(256,1)
  39. dim3 grid(4, 1);
  40. //api计时
  41. cudaEvent_t start, stop;
  42. float Gpu_time;
  43. cudaEventCreate(&start);
  44. cudaEventCreate(&stop);
  45. cudaEventRecord(start, 0);
  46. //执行kernel
  47. prime_kernel << <block, grid >> >(d_mark, N);//block和grid位置反了
  48. //api计时结束
  49. cudaEventRecord(stop, 0);
  50. cudaEventSynchronize(stop);
  51. cudaEventElapsedTime(&Gpu_time, start, stop); //GPU 测时
  52. cudaEventDestroy(start);
  53. cudaEventDestroy(stop);
  54. printf("Gpu time is: %f ms\n", Gpu_time);
  55. // 结果拷贝
  56. cudaMemcpy(h_mark, d_mark, sizeof(int), cudaMemcpyDeviceToHost);
  57. //GPU输出
  58. if (*h_mark == 1)
  59. printf( "%d is a prime number\n", N);
  60. else
  61. printf( "%d is not a prime number\n", N);
  62. //释放
  63. cudaFree(d_mark);
  64. //cpu输出
  65. if (is_prime(N))
  66. printf("CPU:%d is a prime\n", N);
  67. else
  68. printf("CPU:%d is not a prime\n", N);
  69. return 0;
  70. }
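One habit worth adding right after any launch (my own addition, not in the original code): query the launch error, since a bad execution configuration otherwise fails silently and the flag simply keeps its old value:

    // Minimal launch-error check (illustrative addition)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));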

K20: [timing screenshot]
K80: [timing screenshot]

5.3 Summary

This experiment went through the following steps:
1. Serial: test whether a number is prime on the CPU.
2. Move to the GPU: rewrite the test as a kernel function.
3. Basic parallelization: each thread tests whether the number is divisible by its share of the candidates, writes the verdict to shared memory, and the verdicts are combined; if nothing divides it, the number is prime.
4. Optimization: rewrite the kernel again using the block and grid built-in variables.

6. Lessons Learned

1. Learned the basic principles of CUDA programming, and can now write simple CUDA programs that run faster than an unoptimized CPU version.
2. Gained a first understanding of CUDA parallelization.
3. Became familiar with the CUDA tooling, and improved at using Linux.
