-
Notifications
You must be signed in to change notification settings - Fork 0
gamer-project/gpu_testbed_poisson-gravity
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
===================================================== || || || GPU Poisson and Gravity solvers for GAMER || || || ===================================================== ----------------------------------------------------------------------------------------------------------------- Versiona 1.0 10/17/2008 =============================== 1. contain both CPU and GPU solvers 2. asynchronous test 3. the version 1.0 only works for a. FLU_NXT2 = 8 b. POT_GHOST_SIZE = 2 ( 5-points stencil ) 4. the entire potential data of a single patch are loaded into the shared memory before entering the SOR iteration 5. CUCAL_Poisson_Global.cu : only use the global memory CUCAL_Poisson_Sharel.cu : use three slices of shared memory (including both even and odd grids) CUCAL_Poisson_Share2.cu : use three slices of shared memory (including only even or odd grids) CUCAL_Poisson_Share3.cu : load all potential data into shared memory in the begining performance : Share3 > Share2 > Share1 > Global but to fit all potential data into shared memory, the FLU_NXT2 is restricted to 8 (16 is too large) Versiona 1.1 10/18/2008 =============================== 1. optimization a. coalescing memory access for the global memory (both read and write) b. do NOT upload the ghost-zone data of the potential c. no bank conflict for writing data into the shared memory (still with bank conflict for reading data) 2. works for 3P_STENCIL and 5P_STENCIL 3. still only works for FLU_NXT2 = 8 4. speed-up ratio over CPU POT_GPU_NPATCH : 128 PATCH_SIZE : 8 FLU_NXT2 : 8 POT_GHOST_SIZE : 2 POT_NXT : 12 a. POT_MAX_ITER : 10 CPU Processing Time : 13.327 ms CPU SOR Time : 10.321 ms CPU Advance Time : 2.586 ms GPU Processing Time : 2.193 ms GPU Invoking Time : 1.007 ms Speedup Ratio : 6.077063 b. POT_MAX_ITER : 20 CPU Processing Time : 22.184 ms CPU SOR Time : 19.184 ms CPU Advance Time : 2.579 ms GPU Processing Time : 2.491 ms GPU Invoking Time : 1.009 ms Speedup Ratio : 8.905661 c. POT_MAX_ITER : 50 CPU Processing Time : 48.114 ms CPU SOR Time : 45.119 ms CPU Advance Time : 2.567 ms GPU Processing Time : 3.386 ms GPU Invoking Time : 1.008 ms Speedup Ratio : 14.209687 c. POT_MAX_ITER : 1000 CPU Processing Time : 869.129 ms CPU SOR Time : 866.145 ms CPU Advance Time : 2.579 ms GPU Processing Time : 31.602 ms GPU Invoking Time : 1.009 ms Speedup Ratio : 27.502342 5. somehow the performance has suddenly dropped about 2 ms (also drop for the version 1.0 ) ~ before the performance dropped, the speed-up ratio over CPU for POT_MAX_ITER = 10 is about 8 Versiona 1.2 03/10/2009 =============================== 1. construct two kernels "CUCAL_Poisson_10cube" and "CUCAL_Poisson_14cube" for POT_GHOST_SIZE = 1 and 3, respectively 2. do NOT advance fluid inside the Poisson solver 3. determine the termination criteria for the SOR iteration 4. use GPU streams for the asynchronous data copy between CPU and GPU 5. performance : POT_GHOST_SIZE = 1 G_POT_GPU_NPATCH = 256 CPU Processing Time : 43.377 ms CPU SOR Time : 42.061 ms CPU Advance Time : 0.000 ms GPU Processing Time : 2.696 ms GPU Invoking Time : 0.611 ms Speedup Ratio : 16.089390 CPU -> GPU : 0.66 ms GPU -> CPU : 0.38 ms Kernel : 1.76 ms =================================== POT_GHOST_SIZE = 3 G_POT_GPU_NPATCH = 256 CPU Processing Time : 180.158 ms CPU SOR Time : 178.591 ms CPU Advance Time : 0.000 ms GPU Processing Time : 9.831 ms GPU Invoking Time : 1.605 ms Speedup Ratio : 18.325500 CPU -> GPU : 1.51 ms GPU -> CPU : 0.40 ms Kernel : 7.92 ms Versiona 1.2.1 04/19/2009 =============================== 1. work with CUDA 2.1 2. add some C++ header files in order to be compiled in the NCHC GPU cluster Versiona 1.2.2 04/25/2009 =============================== 1. for POT_GHOST_SIZE == 1 ( the function "CUCAL_PoissonSolver_10cube.cu" ), store the density field in the shared memory --> better performance Versiona 1.3 04/29/2009 =============================== 1. add the function "CUCAL_Poisson_18cube" --> work with POT_NXT_IN == 18 2. since 18*18*18*4 = 22.78Kb > 16Kb, it CANNOT be stored in the shared memory --> we store half the data in the per-thread registers 3. the internal potential is initialized as zero in both CPU and GPU solvers Versiona 2.0 09/14/2009 =============================== 1. too many modifications ... 2. integrate into two GPU kernels (1) CUCAL_Poisson_10to14cube : works for POT_GHOST_SIZE = 1, 2, 3 --> use only the shared memory to store all potential data (2) CUCAL_Poisson_16to18cube : works for POT_GHOST_SIZE = 4, 5 --> use both the shared memory and registers to store all potential data 3. power-law initial condition for the density field --> output the radial distribution of potential and also compare to the analytical solutions 4. perform spatial interpolation in GPU --> only send the coarse-grid potential into GPU 5. only send PATCH_SIZE^3 fine-grid potential back to CPU 6. re-format the Makefile 7. add the header file "Symbolic_Constant.h" 8. rename several variables to be consistent with GAMER 9. the CPU_PoissonSolver can work with arbitrary POT_GHOST_SIZE --> it also performs the spatial interpolation 10. count the average number of SOR iterations in the CPU Poisson solver Versiona 2.1 09/24/2009 =============================== 1. add the "conservative quadratic interpolation" in both CPU and GPU Poisson solvers --> use the input parameter "IntScheme" to control the interpolation scheme Versiona 2.2 01/13/2010 =============================== 1. add the "quadratic interpolation" in both CPU and GPU Poisson solvers --> set "IntScheme == 4" to use this interpolation scheme 2. for POT_GHOST_SIZE == 3, the symbolic constant POT_BLOCK_SIZE_Z is set to "2" in order to have enough registers Versiona 2.3 02/25/2010 =============================== 1. Include the CPU and GPU Gravity solvers. 2. The potential output array now stores "GRA_NXT^3" data instead of "PATCH_SIZE^3" data. Versiona 2.3.1 03/09/2010 =============================== 1. For the Gravity solver, we replace "[PATCH_SIZE][PATCH_SIZE][PATCH_SIZE][NCOMP]" by "[NCOMP][PATCH_SIZE][PATCH_SIZE][PATCH_SIZE]". --> It is mainly for the out-of-core + GAMER computing (GAMER.1.0.beta4.0). Versiona 2.4 07/27/2010 =============================== 1. Input the POT_GPU_NPGROUP, GPU_NSTREAM, GPU_ID, INT_SCHEME, and OPT__GRA_P5_GRADIENT from the command line. --> Please refer to "./PoissonGravity -h" for the usage instruction. --> If these parameters are not set by the command-line input, they will be set to the default values according to the device properties. --> (a) GPU_NSTREAM = 1/4 for compute capability == 1.0 / > 1.0 (b) FLU_GPU_NPGROUP = 2*GPU_NSTREAM*(# of multiprocessors) (c) GPU_ID = 0 (d) INT_SCHEME = 4 (e) OPT__GRA_P5_GRADIENT = true 2. Replace the "float' by "real" in all CUDA functions. Versiona 2.5 08/31/2010 =============================== 1. Revise the function "CUAPI_Set_Diagnose_Device". 2. The file "CUAPI.cu" is separated into several CUAPI_XXX files. 3. Add the header file "CUCAL.h". 4. Specify the virtual and GPU architectures in the Makefile. 5. Support the Fermi architecture. --> Add the option "FERMI" for the optimization in Fermi GPUs. 6. All CUCAL_XXX functions are renamed as CUPOT_XXX. 7. Declare the "volatile" qualifier for the reduction operation. Versiona 2.6 12/12/2010 =============================== 1. Replace all "!Disp" operation by "Disp^1" --> The "!Disp" operation does NOT work in CUDA.3.2 2. Classify all files --> Add directories "CPU_Solver, GPU_Solver, GPU_API, Header" 3. Add the header "CUAPI.h" --> Replace all "CUDA_SAFE_CALL" and "CUT_CHECK_ERROR" by "CUDA_CHECK_ERROR" defined in the file "CUAPI.h" 4. Output the total memory requirement of the GPU solver 5. Rename the file "CUCAL.h" as "CUPOT.h" 6. Record the elapsed time per patch per step and the number of grids per sec 7. Record the CPU information 8. Replace "fabs" by "FABS" 9. Set the cacahe configuration in Fermi GPUs to "cudaFuncCachePreferShared" 10. Type casting all literal numbers by (real) in CPU solvers. 11. The variables "Mp, Mm" are declared as constant variables and set in the function "CUAPI_Set_Diagnose_Device" 12. For Fermi GPUs, we save density in the shared memory for higher performance 13. Support "OpenMP parallelization" in all CPU solvers. --> Add the option "OPENMP" in the Makefile 14. Add the option "INTEL" --> use the Intel compiler 15. The variables storing the size of memory allocation in the function "CUAPI_MemAllocate_PoissonGravity.cu" is declared as "long" 16. The default number of patch groups in Fermi GPUs is set to "2*GPU_NStream*DeviceProp.multiProcessorCount" ***BUG FIXED*** 1. Fix the bug in "CUPOT_PoissonSolver_16to18cube.cu" --> Add "__syncthreads()" after "Residual_Total_Old = s_Residual_Total[0];" --> Ensure that "Residual_Total_Old" records the correct value (before "s_Residual_Total[0]" is modified by the thread 0) Versiona 2.6.1 07/16/2011 ============================ 1. Re-order the issue of CUDA streams for higher performance Versiona 3.0 08/07/2011 ============================ 1. Support CPU/GPU multigrid solvers --> Add the "POT_SCHEME" option in the makefile --> Only work with Fermi GPUs --> Higher convergence rate as compared with SOR solver, but the performance is lower than SOR solver for PATCH_SIZE == 8 2. Correct the reduction operation in "CUPOT_PoissonSolver_SOR_10to14cube.cu" and "CUPOT_PoissonSolver_SOR_16to18cube.cu" --> update versions from 2.4 to 2.5 3. Add the command-line option "-o" for data output 4. Performance of GPU SOR solver with double precision is highly improved Versiona 3.0.1 05/05/2012 ============================ 1. Rearrange the implementation of CUDA streams to enable the overlapping of data transfer and kernel execution even for the GPUs not capable of the concurrrent upstream/downstream memory copies. 2. Replace the device property "deviceOverlap" by "asyncEngineCount" in the function "CUAPI_DiagnoseDevice.cu" to query both the overlapping between memory copy and kernel execution and the overlapping between upstream and downstream memory copies. Versiona 3.1.0 03/06/2016 ============================ 1. Optimized for K40 GPU ==> Fine tune POT_BLOCK_SIZE_Z, GPU_NSTREAM 2. Support the new makefile option "GPU_ARCH=FERMI/KERPLER" 3. No longer support TESLA GPU (compute capability < 2.0) 4. Update Poisson solvers to the latest versions ***************************************************************************************** After v3.0, please use the new CPU/GPU solvers adopted in (or after) GAMER.1.0.beta5.2.0 --> support "IntScheme_t" data type and GRA_GHOST_SIZE == 0 in "CUPOT_PoissonSolver_SOR_16to18cube" ***************************************************************************************** BUG 08/03/2011 ============================ Unfinished Works 11/13/2011 ============================ 1. Further optimize the GPU multigrid solver if PATCH_SIZE > 8 is required
About
Performance testbed for the GPU Poisson-gravity solver
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published