�� . �.�. ��

��

��-�� 

��

��  ��

�� 

01.04.10 ��

��  ��

�� .�., �� .�.

�� CUDA ��

�� -��

�� 3.1: �� -�� , �� , ��

��

2012

�� CUDA ��

�� .�., �� .�. �� -�� . � �� : �� , 2012. � 53 �.

��. � ��-�� CUDA � �� . �� , � �� (��) �� (�� -��). �� , �� MPI. �� , �� . �� .

�� -�� , ��

01.04.07 �� , �� , �� , � �� .

��

�� 4

1 �� GPU 5

1.1 �� GPU-�� 7

1.2 �� 10

1.3 �� C �� CUDA 12

1.4 �� CUDA �� Windows 15

1.5 �� CUDA �� 17

2 �� (�� ) 19

3 �� CUDA: �� -�� 21

3.1. �� 22

3.2 �� 26

3.3 �� 32

4 �� 36

5 �� 37

�� 38

�� 39

�� . �� CUDA: �� -�� 41

�� . �� CUDA+MPI: �� -�� 45

��

�� , �� -�� . �� -�� CUDA (�� . Compute Unified Device Architecture), �� Nvidia [1]. CUDA �� , �� SIMD (�� . Single Instruction Multiple Data), �.�. �� .

�� , �� . � �� CUDA � �� , �� (��) �� -��. �� , �� MPI.

�� . ��, �� -�� (��, �� , �� , �� .�.) � �� , � �� , �� .

�� GPU

� �� : �� , �� , �� [2]. �� , �� , �� , � �� . �� : ��-, ��- � ��, ��, �� , �� , �� , ��, �� .

�� 1016 �� (10 ��) �� Linpack (�� ) [3]. �� ? ��, �� , �� , �� , ��:

-�� : �� 100 �� 1 �� [4];

-�� : �� 1016 �� 106 �� [4];

-��-�� (Coupled Cluster Singles and Doubles, CCSD) �� (~50) �� 1014 ��; �� (�� ).

�� , �� GPU ��, �� . �� , � �� , � �� CPU �� (�� 2 �� 6, �� 2012��.). �� , �� GPU, �� , �� . � ��, �� GPU, �� 5-30 �� [5]. �� (�� 100-�� !) �� , �� SSE (Streaming SIMD Extensions, �� SIMD-�� ), �� GPU. �� GPU �� SSE-�� CPU (�� NVIDIA):

- �� : � 12 �� CPU (12x);

- �� (non-bonded force calc): 8-16x;

- �� (�� ): 40-120x � 7x.

��. 1 �� GPU-�� CPU. �� NVidia 2008 �.

�� GPU-��

�� CUDA � �� -�� NVIDIA, �� , �� . CUDA �� [1, 6-7], �� GeForce �� (�� GeForce 8, GeForce 9, GeForce 200), � �� Quadro � Tesla.

�� (multiprocessors). �� CUDA-�� (CUDA cores), �� (SFU), ��, � �� (shared memory) �, �� (�� ).

�� SIMT (Single Instruction, Multiple Thread). �� CUDA �� (kernels), �� CPU � �� (threads). �� , �� . �� CUDA-��, �� .

�� (thread block) �� , �� . �� (�� [7] �� CUDA API). �� . �� , �� .

�� (warps) �� 32 �� (�� ), �� (�� SIMD � Single Instruction, Multiple Data). �� , �� , �� . �� (�� if), �� . �� (� �� , �� , �� ).

� �� , �� (grid of thread blocks). �� , �� : �� , �� (�� ).

��. 2 �� CUDA

�� (��-, ��- �� ), �� threadIdx. � �� , �� (��-, ��- �� ) �� blockIdx. �� . 2. �� .x,.y,.z.

��

�� . �� . �� CPU �� , �� (� �� GPU �� ALU).

� CUDA �� GPU �� , �� , �� (��. ��. 1) [1].

Table 1. Memory types in CUDA

Memory type	Access	Scope	Speed (location)
registers	R/W	per-thread	high (on-chip)
local	R/W	per-thread	low (DRAM)
shared	R/W	per-block	high (on-chip)
global	R/W	per-grid	low (DRAM)
constant	R/O	per-grid	high (on-chip L1 cache)
texture	R/O	per-grid	high (on-chip L1 cache)

��  � �� , �� , �� 256 �� 1.5 �� (� �� 4 �� Tesla). �� , �� 100 ��/� �� NVIDIA, �� . �� , �� load � store, � �� .

��  � �� , � �� . �� , �� .

��  � �� 16-�� (� �� ) �� . �� , �� , �� . �� , �� . �� : �� , �� (ALU) � ��, �� .

��  � �� 64 �� (�� GPU), �� . �� 8 �� . �� .

��  � �� , �� . �� , �� . �� 8 �� . ��, �� .

��, �� , ��, �� , �� . �� . �� (CPU) �� R/W �� , �� (�� DRAM GPU) � �� CPU � GPU (�� CUDA API).

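C language extensions for CUDA

CUDA programs (the source files usually have the .cu extension) are written in an extended C and compiled with nvcc.

The extensions that CUDA adds to C consist of:

function specifiers, which state where a function runs and where it can be called from;
variable specifiers, which select the type of memory used for a variable;
the kernel launch directive, which sets both the data and the thread hierarchy;
built-in variables, which hold information about the current thread;
the runtime, which includes additional data types.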

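Table 2. Function specifiers in CUDA

Specifier	Runs on	Callable from
__device__	device	device
__global__	device	host
__host__	host	host

The __host__ and __device__ specifiers can be combined (the function can then run both on the GPU and on the CPU, and the compiler generates code for both platforms automatically). The __global__ and __host__ specifiers cannot be combined.

The __global__ specifier marks a kernel, and the function must return void.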

__global__ void myKernel ( float * a, float * b, float * c )
{
    // element handled by this thread (valid for a single-block launch)
    int index = threadIdx.x;

    c [index] = a [index] * b [index];
}
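To check what the kernel computes, it can help to have a host-side reference. The sketch below is plain C, not CUDA: each loop iteration plays the role of one GPU thread, and the loop counter stands in for threadIdx.x. The function name multiply_reference and the explicit length n are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the element-wise product computed by myKernel.
   Each iteration corresponds to one GPU thread; `index` stands in for
   threadIdx.x. The name and the length parameter `n` are hypothetical. */
void multiply_reference(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t index = 0; index < n; ++index) {
        c[index] = a[index] * b[index];
    }
}
```

Comparing the array copied back from the GPU against this loop is a simple way to validate a kernel before optimizing it.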

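Functions that run on the GPU (__device__ and __global__) are subject to the following restrictions:

their address cannot be taken (except for __global__ functions);
recursion is not supported;
static variables inside a function are not supported;
a variable number of arguments is not supported.

The placement of variables in GPU memory is set with the specifiers __device__, __constant__ and __shared__. Their use is also restricted:

these specifiers cannot be applied to the fields of a structure (struct or union);
the variables can be used only within one file and cannot be declared extern;
__constant__ variables can be written only from the CPU, through special functions;
__shared__ variables cannot be initialized at declaration.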

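Built-in variables

The language adds the following special variables:

gridDim - the size of the grid (type dim3);
blockDim - the size of a block (type dim3);
blockIdx - the index of the current block within the grid (type uint3);
threadIdx - the index of the current thread within the block (type uint3);
warpSize - the size of a warp (type int).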
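These variables are usually combined into one global index per thread. The sketch below reproduces that arithmetic for the x component in plain C, so it can be checked on the host. The struct uint3_host and the function global_index_x are hypothetical stand-ins for illustration, not part of the CUDA API.

```c
#include <assert.h>

/* Hypothetical host-side stand-in for the CUDA uint3/dim3 built-ins. */
typedef struct { unsigned int x, y, z; } uint3_host;

/* Global x index of a thread: threads before this block, plus the
   thread's offset inside its own block. */
unsigned int global_index_x(uint3_host block_idx, uint3_host block_dim,
                            uint3_host thread_idx)
{
    assert(thread_idx.x < block_dim.x);  /* a thread index lies inside its block */
    return block_idx.x * block_dim.x + thread_idx.x;
}
```

In a kernel the same expression would read blockIdx.x * blockDim.x + threadIdx.x. For example, thread 5 of block 2 with 256 threads per block handles element 2 * 256 + 5 = 517.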

The language adds 1/2/3/4-component vector types built from the basic types: char1, char2, char3, char4, uchar1, uchar2, uchar3, uchar4, short1, short2, short3, short4, ushort1, ushort2, ushort3, ushort4, int1, int2, int3, int4, uint1, uint2, uint3, uint4, long1, long2, long3, long4, ulong1, ulong2, ulong3, ulong4, float1, float2, float3, float4, double2.

The components are accessed by the names x, y, z and w. Vectors are created with functions of the form make_<typeName>.

int2 a = make_int2 ( 1, 7 );

float3 u = make_float3 ( 1, 2, 3.4f );

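Note that these types (unlike those of the GLSL/Cg/HLSL shading languages) do not support component-wise vector operations, i.e. two vectors cannot simply be added with the "+" operator; the addition has to be written out explicitly for each component.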
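A minimal sketch of the explicit per-component addition, written in plain C so that it compiles without the CUDA headers. The names float3_host, make_float3_host and add_float3 are hypothetical and only mirror float3 and make_float3.

```c
/* Hypothetical host-side mirror of the CUDA float3 type. */
typedef struct { float x, y, z; } float3_host;

/* Mirrors make_float3: builds a vector from its three components. */
float3_host make_float3_host(float x, float y, float z)
{
    float3_host v = { x, y, z };
    return v;
}

/* Vector sum written out component by component, since "+" is not
   defined for the vector types. */
float3_host add_float3(float3_host a, float3_host b)
{
    return make_float3_host(a.x + b.x, a.y + b.y, a.z + b.z);
}
```

In device code the same body would work on float3 directly, typically inside a small __device__ helper.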

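The dim3 type, used to specify dimensions, is also added. It is based on uint3 but has a regular constructor that initializes all unspecified components to 1.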

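Kernel launch directive

A kernel is launched on the GPU with the following construct:

kernelName <<<Dg,Db,Ns,S>>> ( args )

Here kernelName is the name (or address) of the corresponding __global__ function; Dg is a variable (or value) of type dim3 that gives the dimension and size of the grid (in blocks); Db is a variable (or value) of type dim3 that gives the dimension and size of a block (in threads); Ns is an optional variable (or value) of type size_t that gives the number of additional bytes of shared memory to allocate dynamically for the launch (on top of the statically allocated shared memory); S is an optional variable (or value) of type cudaStream_t that selects the stream (CUDA stream) in which the call runs, 0 by default. args are the arguments passed to kernelName.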
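For a one-dimensional problem, Dg is commonly chosen as the smallest number of blocks of Db threads that covers all n elements. The helper below sketches that rounding-up division in plain C. The name blocks_for is an assumption for this illustration and not a CUDA function.

```c
#include <assert.h>

/* Smallest number of blocks of `threads_per_block` threads that covers
   `n` elements (integer division rounded up, without overflow). */
unsigned int blocks_for(unsigned int n, unsigned int threads_per_block)
{
    assert(threads_per_block > 0);
    return n / threads_per_block + (n % threads_per_block != 0 ? 1u : 0u);
}
```

When n is not a multiple of the block size, the last block contains threads whose index is n or larger, so a kernel launched this way normally guards its body with a bounds check on the global index.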

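The language also adds the __syncthreads function, which synchronizes all threads of a block. It returns control only after every thread of the block has called it, i.e. when all the code preceding the call has been executed (so its results can be relied on). This function is very convenient for organizing conflict-free access to shared memory.

CUDA also supports all mathematical functions of the standard C library, but for performance it is better to use their float versions (rather than double), for example sinf. In addition, CUDA provides a further set of mathematical functions (__sinf, __powf, etc.) that are less accurate but noticeably faster than sinf, powf, etc.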

�� CUDA �� Windows

CUDA � �� -�� NVidia, �� , �� CUDA (�� :�https://developer.nvidia.com/cuda-gpus). �� :

�� , �� CUDA;
�� CUDA Toolkit;
�� NVidia, �� https://developer.nvidia.com/cuda-downloads
�� (Microsoft Visual Studio, Netbeans, etc.).

� �� CUDA (�� 4.0) � �� Microsoft Visual Studio. ��, �� Microsoft Visual Studio 2010.

�� :

�� CUDA, �� NVidia ForceWare (��. http://www.nvidia.com/drivers). �� , �� , �� (Release Notes) CUDA Toolkit.
�� (CUDA Developer Drivers) �� CUDA Toolkit �� NVidia: https://developer.nvidia.com/cuda-toolkit. �� , ��, �� Microsoft Visual Studio. GPU Computing SDK �� , �� Microsoft Visual Studio, � �� , �� .
�� (CUDA Toolkit) � �� . �� CUDA �� C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0. � �� (� �� CUDA_LIB_PATH, CUDA_INC_PATH), �� .
�� (Microsoft Visual Studio) ��

Tools | Options | Projects and Solutions | VC++ Directories.

�� Include files �� :

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\include

�� Library files �� :

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\Win32 (�� C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\x64 �� 64-�� )

�� GPU Computing SDK � �� . �� , �� Microsoft Visual Studio.
�� CUDA �� (�� CUDA Toolkit): ��

Project | Custom Build Rules�, �� CUDA Runtime API Build Rule. �� CUDA Runtime API � ��, �� CUDA-��. �� nvcc �� , �� .

�� CUDA ��

�� CUDA �� SDK ��. GPU Computing. �� bandwidthTest, �� C:\ Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK #.#\C\bin\win32\ Release � Windows XP � � %ProgramData%\NVIDIA Corporation\ NVIDIA GPU Computing SDK #.#\C\bin\win32 � �� Windows Vista � ��. (�� 64-� �� Windows �� win64\Release). � �� , �� . ��. 3.

��. 3 �� bandwidthTest�

�� (�� ) � �� . �� : ��, �� , � ��, �� , �� .

�� , ��, �� CUDA, � �� .

��.

1) ��, �� CUDA Toolkit �� CUDA Toolkit Visual Studio Integration (�� ).

2) CUDA Toolkit �� 3.1 � �� C:\CUDA �� , �� CUDA Toolkit. � �� 3.2 �� .

�� (�� )

�� , �� . �� (��. �� . 4), �� nVidia� Tesla.

��. 4. �� . ��.

�� :

�� Flagman QD820 (8 �� AMD� Opteron� SixCore, 16 � DIMM 4096Mb DDR-II, 4 � HDD 300Gb SerialATA 10000rpm);
�� AMD� Opteron� SixCore, 16 � DIMM 4096Mb DDR-II, 4 � HDD 300Gb SerialATA 10000rpm);
2 �� Flagman WX240T.2 (2 �� Intel� Xeon� X5550, 12 � DIMM 2048Mb DDR-III, 4 �� nVidia� Tesla� C1060 4096Mb DDR-III);
6 �� Flagman WP120N.2 (�� Intel� Core� i7 I7-950, 6 � 2048Mb DDR-III, 2 �� nVidia� Tesla� C1060 4096Mb DDR-III).

�� : �� : 20 Gbit/s InfiniBand (�� Mellanox MTS3600); �� : 1 Gbit/s (�� 3Com, 24 ports).

�� Windows � �� , �� . �.�. �� 419 (�� ) � �� 537 (�� , �� ).

�� (�� 537), �� :

��

C:\Windows\system32\mstsc.exe

�� IP-�� :

�� Lab25-6 � 85.143.6.98:3389

�� Lab25-7 � 85.143.6.98:3390

�� Lab25-8 � 85.143.6.98:3391

�� :

�� : cluster\Theorlab

��: Theorlab

�� .

�� CUDA: �� -��

� �� CUDA � �� , �� (��) �� -�� (��).

�� (Master equation), �� n2 ��, �� n � �� . �� (��-��, �� ) ��, � ��, �� . �� , �� , �� n ��, �� m ��, �� m �� 1000-10000 �� . �� , �� (�� ), �� . �� , �� , �� GPU-�� 100% ��. �� , �� n ��.

�� (�� ) � �� , �� (�� ) � ��.

3.1. �� 

� �� : �� , �� , �� , �� .�. �� , �� (��) �� , � �� [8, 9].

�� . �� . �� [8], �� , �� [10].

�� , �� [25] (��. 5). �� , �� , �� , �� . �� . �� . �� . �� (J.E. Mooij) �� (��. ��. 5 (�)), �� [11-13]. � �� . �� (��) �� [14,15], � �� , �.�. �� (��. ��. 5 (�)).

��. 5. �� (�) �� [11] � (�) �� [15]. �� (�� ), �� .

� �� , �� [11]:

, (1)

�� - �� , �� , - �� , � � - �� . �� [16-18], � �� . � �� , �� , �� , � �� , �� T � �� .

�� . ��, �� (~��), �� (�� ) ��, ��, �� . �� , �� . �� [19], �� , �� :

, (2)

�� - �� .

�� (� �� ), �� N �� ( - �� )

. (3)

� �� , �� [20]. �� , � �� , ��

, (4)

�� . �� :

. (4)

��, �� , �� j-�� , �� , �� .6.

��, �� , �� . �� M ��, �� M �� 1000-10000, �� , �� . �� , �� (�� ), �� . �� , �� , �� GPU-�� , ��. �� .

��. 6 �� , �� -��.

3.2 �� 

�� , �� GPU � �� CUDA [1,5]. �� , �� , �� GPU, �� , �� (SIMT).

��, �� [17-19]. ��, �� , �� . �� , �� , �� GPU, �� . �� GPU �� MPI [21, 22], ��. �� .

�� (��. �� .7):

�� , �� (�� ) � �� , �� GPU;

�� , ��, �� GPU;
�� , � �� .

��. 7 �� , �� (��)� �� , �� .6. �� , �� GPU-��.

�� GPU-��. � �� . �� Master, �� CPU, �� , �� , �� .

�� 3. �� MPI �� GPU ��

Master (myid = 0)	Slave (myid = 1 ... N, where N is the number of GPUs)
��
void master () { float data[2];//�� . �� float res[3]; //�� . ��, �� data[2] init_io(); init_net(); while (load_data ((float ) data)) { if (! is_free ()) { receive_result (res); save_result (res); } send_data ((float ) data); } while (! are_all_free ()) { receive_result (res); save_result (res); } close_io(); send_end (); }	void slave () { float data[2]; float res[3]; int j; j = 0; int myid; MPI_Comm_rank (MPI_COMM_WORLD, &myid); int device = myid % 2; cudaSetDevice(device); cudaDeviceProp deviceProp; cudaGetDeviceProperties(&deviceProp, device); while (receive_data ((float *) data)) { calc (data, res,deviceProp.maxThreadsDim[0], deviceProp.multiProcessorCount); send_result (res); j = j + 1; } }
��
init_io() � �� , � �� . init_net() � �� n, � �� n = 0, �� slove-�� . load_data - �� (�� ) if (! is_free ()) - �� , �� , �� . �� slove � �� (�� ), �� send_data � �� . �� TAG_WMSG �� slove-�� (MPI_ANY_SOURCE) �� master �� , �� , � �� msg = W_OK, �� (�� data). ��, �� TAG_DATA. �� (�� ) ��: n = n + 1.	�� receive_data �� master: �� ?�. �� slove-�� (msg) � �� TAG_WMSG, �� . �� msg = W_OK � �� .
�� (slove) � �� GPU � �� . �� n = N.
Master �� , �� is_free() �� receive_result, �� slove-�� (MPI_ANY_SOURCE), � �� : n = n - 1. �� , �� save_result.	�� calc.cpp � �� GPU c �� (myid % 2) �� �� void montecarlo( singlecomplex a_dev, singlecomplex b_dev, float rand_dev,int N, int T, float Am, float eps,int maxThreadsPerBlock, int NumMultiprocessors) �� gpu.cu* �� master. �� send_result. ��, �� TAG_RESULT.
�� , �� master �� while, � �� . �� .
are_all_free () � �� master �� n = 0 (n � �� ) �� : receive_result � save_result. close_io() � �� , �� n = 0. send_end () � �� slove-�� : �� . �� msg = W_NO c �� TAG_WMSG.	�� msg = W_NO � �� TAG_WMSG, �� receive_data �� .
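The slave code above picks its GPU with myid % 2, which fits nodes that carry two GPUs each. The sketch below generalizes that mapping to an arbitrary number of devices per node. The function name device_for_rank and the parameter devices_per_node are assumptions for this illustration; in the real program the result would be passed to cudaSetDevice.

```c
#include <assert.h>

/* Maps an MPI rank to a local GPU index by cycling through the devices
   of a node. With devices_per_node == 2 this is the myid % 2 rule used
   in the slave code. */
int device_for_rank(int myid, int devices_per_node)
{
    assert(myid >= 0 && devices_per_node > 0);
    return myid % devices_per_node;
}
```

This assumes that consecutive ranks are placed on the same node, so that no two processes on one node end up sharing a device while another stays idle.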

�� GPU �� : ��, �� . � �� : �� (�� ), �� (�� ), �� (�� ) � �� (�� , �� ). �� , �� , �� . �� , ��, �� , �� . �� , �� , ��, �� . � �� , �� GPU �� , �� GPU, �� .

��, �� ~ , �� M � �� . � �� , � �� GPU, �� , �.�. �� [23]. �� (M > 103) � �� . �� GPU � �� CUDA �� CURAND. �� , �� GPU � 50 �� , �� PU [24].

3.3 �� 

�� = 5 T, �� , � �� , �� [20], �� . �� 5%, �� .

�� . 8 �� , �� , �� .

��. 8 �� (�� ).

� ��. 3. �� , �� c �� CPU (Intel Core i7�960 �� 1 ��) � �� GPU (Nvidia Tesla �1060). �� GPU �� CUDA. ��, �� , � �� . �� . 4 �� CPU �� 170 ��. �� , �� , � �� .

�� 4. �� CPU � GPU

��	�� CPU, �.	�� GPU, �.
980	17.082	0.609
1980	35.303	0.656
3840	68.166	0.686
7680	136.454	0.874
15360	272.705	1.654
30720	545.458	3.260
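The speedup implied by each row of Table 4 is the ratio of the CPU time to the GPU time. The helper below is a small plain-C sketch of that arithmetic; the function name speedup is chosen for this illustration.

```c
#include <assert.h>

/* Speedup of the GPU run relative to the CPU run for the same problem size. */
double speedup(double cpu_seconds, double gpu_seconds)
{
    assert(gpu_seconds > 0.0);
    return cpu_seconds / gpu_seconds;
}
```

For the first row this gives 17.082 / 0.609, about 28, and for the last row 545.458 / 3.260, about 167, which agrees with the roughly 170-fold speedup quoted in the text. The gain grows with problem size because the GPU time stays almost flat until the device is fully loaded.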

�� (��. ��.9), �� [17-19], �� 2D-�� (�, 0) 100�100 ��, �� , �� 10000 �� (�� 15360 ��) �, ��, �� .

��. 10 �� .

�� . 10 � � ��. 5 �� , �� , �� GPU, �� 20 GPU �� 97%.

��. 10 �� GPU �� GPU.

�� 5. �� 10000 �� GPU c �� MPI

�� (10000 ��) �� GPU
Configuration	Number of GPUs	Time, s	Speedup
GPU (Tesla C1060)	1	6020	~66x vs. CPU
GPU (Tesla C1060) + MPI	5	1228	~323x vs. CPU, ~5x vs. one GPU
GPU (Tesla C1060) + MPI	20	311	~1273x vs. CPU, ~19.4x vs. one GPU
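The parallel efficiency behind the figures in Table 5 can be checked directly: it is the single-GPU time divided by the product of the number of GPUs and the multi-GPU time. The plain-C helper below sketches this; the name parallel_efficiency is an assumption for this illustration.

```c
#include <assert.h>

/* Parallel efficiency: fraction of the ideal n-fold speedup that is
   actually achieved when going from one GPU to n_gpus GPUs. */
double parallel_efficiency(double t_single, double t_parallel, int n_gpus)
{
    assert(t_parallel > 0.0 && n_gpus > 0);
    return t_single / (n_gpus * t_parallel);
}
```

With the measured times, 5 GPUs give 6020 / (5 * 1228), about 0.98, and 20 GPUs give 6020 / (20 * 311), about 0.97, in line with the 97% efficiency stated for 20 GPUs.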
