NERSC GPU hackathon (Dec 2021)
This page summarises preparations for the hackathon on 2, 8, 9 and 10 December 2021. We will use CoreNEURON+NMODL with a few sets of test model data.
NOTE: any issues that need to be handled after the hackathon should be created in https://github.com/neuronsimulator/gpuhackathon/issues
- Olli - Install Caliper, Jinja2, PyYAML, Pytest and Sympy on Ascent (via Spack or standalone?). Blocked due to an access issue.
- Pramod - Update the Ascent instructions to enable NMODL (same as NERSC). Blocked due to an access issue.
- Olli - The channel benchmark fails on Ascent while loading the input dataset. Should we re-generate the data on Ascent?
- Ioannis - Generate and copy the input dataset for solver analysis and optimisation during the hackathon. Add the corresponding info to this wiki page.
- Omar - List all OpenACC directives and APIs used in CoreNEURON + NMODL. See https://github.com/neuronsimulator/gpuhackathon/blob/main/overview.md
- Olli - Common profiling/benchmarking script that can compare and plot OpenMP vs OpenACC runtimes.
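The profiling/benchmarking-script task above could start from a small sketch like the following. It assumes only that each run log contains a `Solver Time : <seconds>` line, as printed by `special-core` (see the sample output further down this page); the function names and inlined logs are illustrative, not an existing script.

```python
import re

def solver_time(log_text):
    """Extract the 'Solver Time : <seconds>' value from a special-core log."""
    match = re.search(r"Solver Time\s*:\s*([0-9.]+)", log_text)
    if match is None:
        raise ValueError("no 'Solver Time' line found")
    return float(match.group(1))

def compare(openacc_log, openmp_log):
    """Return (acc_time, omp_time, omp/acc ratio) for two run logs."""
    t_acc = solver_time(openacc_log)
    t_omp = solver_time(openmp_log)
    return t_acc, t_omp, t_omp / t_acc

# Tiny demonstration with inlined logs; in practice these would be read
# from files produced by OpenACC and OpenMP runs of the same model.
acc = "Info : 2 GPUs shared by 2 ranks per node\nSolver Time : 16.2698\n"
omp = "Info : 2 GPUs shared by 2 ranks per node\nSolver Time : 18.1000\n"
t_acc, t_omp, ratio = compare(acc, omp)
print(f"OpenACC {t_acc:.2f}s  OpenMP {t_omp:.2f}s  ratio {ratio:.2f}x")
```

Plotting (e.g. with matplotlib) could then be layered on top of the extracted timings.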
First Day Presentation Preparation
- Pramod + Olli: 3 min Introduction
- Ioannis + Christos: Solver introduction to mentors
- Pramod + Omar + Olli: Compute loop + DSL code generation introduction to mentors
- Alex: Check the neuron-gpu-nightly execution on Perlmutter. Does it work with and without MPI?
- Alex: Check that the NEURON + CoreNEURON - InterViews and NEURON + CoreNEURON + NMODL - InterViews builds work on Perlmutter & Ascent.
- Alex: Should we automatically disable InterViews on Cray and IBM Power systems via CMake detection?
- Alex: Check if any improvements are needed for building from source on the hackathon systems.
- Nico + Alex: Make the usage of the OpenACC APIs (`acc_memcpy_to_device` and `acc_copyin`) portable to OpenMP as well as OpenACC. In short, https://github.com/BlueBrain/CoreNeuron/blob/master/coreneuron/gpu/nrn_acc_manager.cpp should support both OpenACC and OpenMP.
- Nico + Alex: Test first the ring test and then the channel-benchmark for the correctness of the migration.
- Pramod: Review OpenACC/OpenMP interoperability via API calls, see e.g. https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866.
- Omar: Add OpenMP offload example like https://bbpcode.epfl.ch/reps/user/kumbhar/cunrnran123/tree/mainacc.c (still using CUDA Random123) -- it moved to GitLab https://bbpgitlab.epfl.ch/hpc/user/cunrnran123
- Omar: Try Random123 directly in OpenMP offload regions.
- Omar: Anything to discuss with / ask the mentors about Random123 portability on different GPUs?
- Add an option in CoreNEURON that only executes the Hines solver?
- Find a way to dynamically set a good value for the `nwarp` variable (used to distribute the cells).
- Ioannis + Christos: Profile a realistic model and analyse different performance metrics with the mentors to understand the limitations.
- Ioannis + Christos: With the current algorithm, investigate possible performance-improvement opportunities.
- Ioannis + Christos: Compare the performance of the OpenACC, OpenMP and CUDA implementations.
- Ioannis + Christos: Would the solver benefit from using special memory types?
- Ioannis + Christos: Look into algorithmic improvements to expose more parallelism (if suitable).
- Olli - Update the codegen backend to emit OpenMP constructs as well as OpenACC (ongoing)
- Omar - Review code generation backend implementation https://github.com/BlueBrain/nmodl/pull/780
- Olli + Omar - Review performance aspects with the mentors (e.g. `present` vs `map`)
- Olli + Pramod - Review the corresponding OpenMP directives and APIs used in place of OpenACC
- Build system changes to enable OpenMP target offload as well as OpenACC
- Olli: Try a simple OpenMP + OpenACC offload test (e.g. with the built-in ring test)
- Olli: Update the Hines solver in CoreNEURON with OpenMP (ongoing)
- Christos + Ioannis to review: https://github.com/BlueBrain/CoreNeuron/pull/693
- Pramod: Measure performance difference between OpenMP and OpenACC runs
- Olli: In the CoreNEURON event-communication part, start tackling OpenACC pragmas and API calls that have not yet been converted to OpenMP
- Pramod: Discuss compiler bug / error with NVIDIA colleague (Brent): https://forums.developer.nvidia.com/t/enabling-openmp-offload-breaks-openacc-code/196643/6
- Pramod: Prepare a slide describing the goals - what we currently do and what we would like to do.
- Pramod: Organise / plan a discussion with one or a few NVIDIA compiler engineers. The central question is ABI compatibility across different CUDA + OpenMP + OpenACC runtimes/compilers.
- First, check/fix https://github.com/BlueBrain/CoreNeuron/issues/594
- Re-measure the slowdown from using unified memory (ringtest + channel-benchmark)
- Identify which data structures are causing the slowdown. How to identify this easily with the tools?
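On the `nwarp` question above, one possible starting point is a simple occupancy-style heuristic. The function and the default numbers below are purely illustrative assumptions, not CoreNEURON's actual cell-distribution logic:

```python
def pick_nwarp(ncell, num_sm=80, warps_per_sm=4, min_cells_per_warp=1):
    """Hypothetical heuristic: use enough warps to fill the GPU, but never
    more warps than there are cells (each warp needs work to do)."""
    target = num_sm * warps_per_sm          # warps needed to saturate the device
    return max(1, min(ncell // min_cells_per_warp, target))

# e.g. a V100 has 80 SMs: small models are limited by the cell count,
# large models by the device saturation target.
print(pick_nwarp(1320))   # limited by the saturation target (320)
print(pick_nwarp(100))    # limited by the cell count (100)
```

A real implementation would presumably query the device (SM count, warp occupancy) at runtime instead of hard-coding the defaults.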
These are instructions to build and run on Ascent (login1.ascent.olcf.ornl.gov).
module load nvhpc/21.9 python cmake flex
module swap cuda/10.1.243 cuda/10.2.89
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
# clone repository
cd $HOME
git clone --branch hackathon_main git@github.com:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir -p build && cd build
cmake .. -G Ninja -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/
cmake --build . --parallel
As $HOME is not writable, create your own directory inside the project directory:
mkdir -p /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
cd /ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/$USER
Now run a tiny functional test on the GPU:
- Allocate a node
bsub -P GEN170 -J neuron -W 90 -nnodes 1 -alloc_flags "gpumps" -Is $SHELL
- Make sure to load necessary modules:
module load nvhpc/21.9 python cmake
module swap cuda/10.1.243 cuda/10.2.89
module use /autofs/nccsopen-svm1_proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper
- Run simple function test on the GPU:
OMP_NUM_THREADS=1 jsrun --gpu_per_rs 1 -n 1 $HOME/CoreNeuron/build/bin/ppc64le/special-core -e 1 -d $HOME/CoreNeuron/tests/integration/ring --gpu --mpi
- Run a channel-benchmark test
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 \
jsrun --gpu_per_rs 2 -n 2 $HOME/CoreNeuron/build/benchmark/ppc64le/special-core \
--datpath=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1320-cells-2-ranks/ \
--mpi --gpu --cell-permute=2 --tstop=100
- The system-wide modules only go up to NVHPC 21.7, which has known issues with NEURON. We have inserted our own installation of NVHPC 21.9 (module `nvidia/21.9`) into the Cray Programming Environment setup.
- Note that NVHPC 21.9 is configured to use the system GCC 7.5 standard library, while we have built some other dependencies with GCC 9.2. This seems to be "close enough".
- `-tp haswell` or `-DR123_USE_SSE=0` is required because `nvc++` defaults to `-tp zen` on the Perlmutter nodes, which defines `__ABM__` and causes Random123 to try to include `intrin.h`, which fails.
# clone repository
git clone --branch hackathon_main git@github.com:BlueBrain/CoreNeuron.git # or git clone --branch hackathon_main https://github.com/BlueBrain/CoreNeuron.git
cd CoreNeuron && mkdir -p build && cd build
# allocate node
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 4 --account=ntrain9_g
# Use our own hand-crafted modules for cuda 11.4 (from nvhpc/21.9) and PrgEnv-nvidia for nvhpc 21.9
module use /global/cfs/cdirs/ntrain9/neuron/modules
# Also, spack-generated modules for dependencies (caliper, ninja, py-*)
module use /global/cfs/cdirs/ntrain9/neuron/spack_modules/cray-sles15-zen2
# Load modules: prefer CUDA 11.4 from NVHPC/21.9
module swap cuda cuda/11.4.1
module load cmake nvidia/21.9 python caliper ninja py-pytest py-pyyaml py-jinja2 py-sympy
# Build CoreNEURON
cmake .. -G Ninja \
-DCORENRN_ENABLE_CALIPER_PROFILING=ON \
-DCORENRN_ENABLE_GPU=ON \
-DCORENRN_ENABLE_NMODL=ON \
-DCORENRN_NMODL_FLAGS="sympy --analytic" \
-DCORENRN_EXTERNAL_BENCHMARK_DATA=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021 \
-DCMAKE_INSTALL_PREFIX=../install \
-DCMAKE_CUDA_COMPILER=nvcc \
-DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" \
-DCMAKE_CXX_COMPILER=CC \
-DCMAKE_CUDA_ARCHITECTURES=80 \
-DCORENRN_NMODL_DIR=/global/cfs/cdirs/ntrain9/neuron/spack/cray-sles15-zen2/gcc-9.3.0/nmodl-0.3.0.20111126-m2kos252sgvxkq7xltv5w35e4irae7gj
cmake --build . --parallel
ctest --output-on-failure -j 16
If you are working on the NMODL code generation, omit the `-DCORENRN_NMODL_DIR=..` option or point it at your own install directory.
If you haven't allocated a session yet:
salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus 2 --account=ntrain9_g
Note that we set `NVCOMPILER_ACC_SYNCHRONOUS=1` below so that we get correct timings for individual kernels; otherwise kernel timings are incorrect due to asynchronous launches.
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 CALI_CONFIG=runtime-report,calc.inclusive srun -n 2 benchmark/x86_64/special-core -e 1 --datpath=../tests/integration/ring --mpi --gpu --cell-permute=2
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 84a7f50 (2021-11-25 18:03:04 +0100)
Additional mechanisms from files
Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
num_mpi=2
num_omp_thread=1
Info : 2 GPUs shared by 2 ranks per node
...
NVCOMPILER_ACC_SYNCHRONOUS=1 OMP_NUM_THREADS=1 CALI_CONFIG=runtime-report,calc.inclusive \
srun -n 2 benchmark/x86_64/special-core \
--datpath=$CFS/ntrain9/neuron/nersc-gpu-hackathon-dec-2021/channel-benchmark-all-1320-cells-2-ranks/ \
--mpi --gpu --cell-permute=2 --tstop=100
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 de4e433 (2021-11-26 08:50:58 +0100)
Additional mechanisms from files
Ca.mod CaDynamics_DC0.mod CaDynamics_E2.mod Ca_HVA.mod Ca_HVA2.mod Ca_LVAst.mod CoreConfig.mod Ih.mod Im.mod K_Pst.mod K_Tst.mod KdShu2007.mod NaTa_t.mod NaTg.mod NaTs2_t.mod Nap_Et2.mod ProbAMPANMDA_EMS.mod ProbGABAAB_EMS.mod ProfileHelper.mod SK_E2.mod SKv3_1.mod TTXDynamicsSwitch.mod VecStim.mod cacumm.mod cacummb.mod cagk.mod cal2.mod can2.mod cat.mod exp2syn.mod expsyn.mod gap.mod h.mod halfgap.mod hh.mod kadist.mod kaprox.mod kca.mod kd.mod kd2.mod kdb.mod kdrbca1.mod kdrca1.mod km.mod kmb.mod na3n.mod naxn.mod netstim.mod netstim_inhpoisson.mod new_calcium_channels.mod passive.mod pattern.mod stim.mod svclmp.mod
num_mpi=2
num_omp_thread=1
Info : 2 GPUs shared by 2 ranks per node
....
Solver Time : 16.2698
Simulation Statistics
Number of cells: 1320
Number of compartments: 648360
Number of presyns: 3961320
Number of input presyns: 0
Number of synapses: 3960002
Number of point processes: 7921322
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 11969
Number of spikes with non negative gid-s: 11969
Path Min time/rank Max time/rank Avg time/rank Time %
main 40.376340 40.378758 40.377549 99.399072
checkpoint 0.000001 0.000002 0.000002 0.000004
output-spike 0.004200 0.004211 0.004205 0.010353
simulation 16.269795 16.269813 16.269804 40.052045
spike-exchange 0.019603 0.041487 0.030545 0.075194
spike-exchange 0.019578 0.041452 0.030515 0.075120
communication 0.000886 0.000906 0.000896 0.002206
imbalance 0.018654 0.040551 0.029603 0.072874
timestep 16.225238 16.247191 16.236214 39.969356
state-update 4.167419 4.176959 4.172189 10.270849
state-SKv3_1 0.084673 0.086011 0.085342 0.210090
state-SK_E2 0.086975 0.087625 0.087300 0.214910
state-ProbGABAAB_EMS 0.176391 0.182205 0.179298 0.441385
state-ProbAMPANMDA_EMS 0.512746 0.521091 0.516919 1.272520
state-nax 0.135361 0.136370 0.135866 0.334466
state-NaTg 0.121694 0.121713 0.121704 0.299603
state-Nap_Et2 0.091850 0.092020 0.091935 0.226320
state-na3 0.126686 0.127354 0.127020 0.312690
state-K_Tst 0.083087 0.083141 0.083114 0.204605
state-K_Pst 0.086288 0.086488 0.086388 0.212665
state-kmb 0.078860 0.079250 0.079055 0.194613
state-KdShu2007 0.077877 0.078125 0.078001 0.192018
state-kdr 0.094169 0.095289 0.094729 0.233198
state-kdrb 0.092489 0.092745 0.092617 0.227999
state-kdb 0.089317 0.089467 0.089392 0.220060
state-kd2 0.087586 0.092668 0.090127 0.221869
state-kca 0.363553 0.364731 0.364142 0.896423
state-kap 0.109070 0.109326 0.109198 0.268817
state-kad 0.112513 0.113974 0.113244 0.278776
state-Ih 0.098592 0.100036 0.099314 0.244485
state-hd 0.093146 0.094326 0.093736 0.230754
state-cat 0.121305 0.121554 0.121430 0.298928
state-can 0.121067 0.122852 0.121960 0.300233
state-Ca_LVAst 0.129896 0.130772 0.130334 0.320849
state-cal 0.108788 0.109597 0.109192 0.268804
state-Ca_HVA2 0.142881 0.143095 0.142988 0.351999
state-cagk 0.097391 0.098042 0.097716 0.240553
state-cacum 0.234768 0.235131 0.234950 0.578385
state-cacumb 0.175342 0.176663 0.176002 0.433273
state-IClamp 0.002348 0.002558 0.002453 0.006039
state-CaDynamics_DC0 0.113226 0.113724 0.113475 0.279346
state-pas 0.002896 0.003032 0.002964 0.007297
update 0.116486 0.117232 0.116859 0.287677
second-order-cur 0.001943 0.002165 0.002054 0.005056
matrix-solver 6.071176 6.123533 6.097355 15.010108
setup-tree-matrix 4.745456 4.748895 4.747175 11.686317
cur-SKv3_1 0.104343 0.105509 0.104926 0.258301
cur-SK_E2 0.097890 0.098530 0.098210 0.241768
cur-ProbGABAAB_EMS 0.252629 0.255678 0.254154 0.625660
cur-ProbAMPANMDA_EMS 0.903469 0.903876 0.903672 2.224608
cur-nax 0.113687 0.115076 0.114382 0.281578
cur-NaTg 0.113117 0.113994 0.113556 0.279544
cur-Nap_Et2 0.084684 0.084767 0.084725 0.208572
cur-na3 0.095460 0.096301 0.095880 0.236033
cur-K_Tst 0.076803 0.076846 0.076825 0.189122
cur-K_Pst 0.083091 0.083296 0.083193 0.204801
cur-kmb 0.084998 0.085193 0.085095 0.209483
cur-KdShu2007 0.079360 0.079602 0.079481 0.195662
cur-kdr 0.084818 0.085221 0.085019 0.209296
cur-kdrb 0.083571 0.083837 0.083704 0.206058
cur-kdb 0.074649 0.075234 0.074942 0.184487
cur-kd2 0.074704 0.075905 0.075305 0.185380
cur-kca 0.102095 0.102144 0.102119 0.251392
cur-kap 0.093734 0.094259 0.093996 0.231395
cur-kad 0.096892 0.096943 0.096917 0.238586
cur-Ih 0.099069 0.099392 0.099230 0.244280
cur-hd 0.079998 0.080359 0.080179 0.197379
cur-cat 0.103624 0.103911 0.103767 0.255449
cur-can 0.118858 0.119161 0.119010 0.292971
cur-Ca_LVAst 0.123353 0.126097 0.124725 0.307041
cur-cal 0.122882 0.123123 0.123003 0.302800
cur-Ca_HVA2 0.142775 0.143094 0.142934 0.351868
cur-cagk 0.122285 0.123171 0.122728 0.302125
cur-cacum 0.081801 0.082285 0.082043 0.201969
cur-cacumb 0.074288 0.075040 0.074664 0.183803
cur-IClamp 0.085479 0.088503 0.086991 0.214149
cur-CaDynamics_DC0 0.064449 0.064505 0.064477 0.158726
cur-ttx_ion 0.063089 0.063417 0.063253 0.155713
cur-ca_ion 0.100195 0.100529 0.100362 0.247065
cur-k_ion 0.063816 0.064154 0.063985 0.157515
cur-na_ion 0.066107 0.066272 0.066190 0.162941
cur-pas 0.109251 0.109293 0.109272 0.268999
deliver-events 0.982300 1.005465 0.993883 2.446681
net-receive-ProbGABAAB_EMS 0.000950 0.001073 0.001012 0.002490
net-receive-ProbAMPANMDA_EMS 0.003695 0.003919 0.003807 0.009372
net-buf-receive-ExpSyn 0.003570 0.003671 0.003621 0.008913
net-buf-receive-Exp2Syn 0.003701 0.003835 0.003768 0.009276
net-buf-receive-ProbGABAAB_EMS 0.068577 0.070341 0.069459 0.170990
net-buf-receive-ProbAMPANMDA_EMS 0.090366 0.094045 0.092206 0.226986
update-net-receive-buf 0.399891 0.406366 0.403129 0.992398
net-receive-buf-cpu2gpu 0.379896 0.385496 0.382696 0.942098
net-receive-buf-order 0.003375 0.003533 0.003454 0.008503
check-threshold 0.182114 0.184026 0.183070 0.450671
finitialize 2.094235 2.094272 2.094254 5.155510
spike-exchange 0.000035 0.105358 0.052696 0.129725
spike-exchange 0.000031 0.105354 0.052693 0.129715
communication 0.000019 0.000022 0.000020 0.000050
imbalance 0.000006 0.105331 0.052669 0.129656
cur-SKv3_1 0.000031 0.000032 0.000031 0.000078
cur-SK_E2 0.000029 0.000031 0.000030 0.000074
cur-ProbGABAAB_EMS 0.000072 0.000073 0.000073 0.000178
cur-ProbAMPANMDA_EMS 0.000265 0.000267 0.000266 0.000655
cur-nax 0.000034 0.000034 0.000034 0.000084
cur-NaTg 0.000032 0.000034 0.000033 0.000081
cur-Nap_Et2 0.000025 0.000026 0.000025 0.000063
cur-na3 0.000029 0.000030 0.000029 0.000073
cur-K_Tst 0.000023 0.000024 0.000024 0.000058
cur-K_Pst 0.000025 0.000026 0.000025 0.000063
cur-kmb 0.000026 0.000027 0.000027 0.000065
cur-KdShu2007 0.000024 0.000024 0.000024 0.000059
cur-kdr 0.000026 0.000026 0.000026 0.000064
cur-kdrb 0.000025 0.000025 0.000025 0.000062
cur-kdb 0.000023 0.000023 0.000023 0.000057
cur-kd2 0.000024 0.000024 0.000024 0.000059
cur-kca 0.000031 0.000032 0.000031 0.000078
cur-kap 0.000027 0.000028 0.000027 0.000068
cur-kad 0.000028 0.000030 0.000029 0.000071
cur-Ih 0.000029 0.000030 0.000029 0.000073
cur-hd 0.000024 0.000025 0.000024 0.000060
cur-cat 0.000030 0.000032 0.000031 0.000076
cur-can 0.000035 0.000035 0.000035 0.000086
cur-Ca_LVAst 0.000035 0.000036 0.000035 0.000087
cur-cal 0.000036 0.000037 0.000036 0.000090
cur-Ca_HVA2 0.000039 0.000041 0.000040 0.000098
cur-cagk 0.000037 0.000038 0.000037 0.000092
cur-cacum 0.000025 0.000026 0.000025 0.000063
cur-cacumb 0.000024 0.000025 0.000024 0.000060
cur-IClamp 0.000027 0.000028 0.000027 0.000068
cur-CaDynamics_DC0 0.000021 0.000022 0.000022 0.000053
cur-ttx_ion 0.000026 0.000026 0.000026 0.000064
cur-ca_ion 0.000028 0.000029 0.000029 0.000070
cur-k_ion 0.000018 0.000019 0.000018 0.000046
cur-na_ion 0.000022 0.000023 0.000022 0.000055
cur-pas 0.000034 0.000034 0.000034 0.000084
update-net-receive-buf 0.000021 0.000023 0.000022 0.000054
load-model 21.857876 21.860913 21.859395 53.812170
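A runtime report like the one above can be post-processed to rank the hottest kernels. This is a sketch, assuming the `Path / Min / Max / Avg / Time %` column layout shown; the function names are illustrative, not part of Caliper:

```python
def parse_report(text):
    """Parse Caliper runtime-report lines into (path, avg_time) pairs,
    assuming the 'Path  Min  Max  Avg  Time %' layout shown above."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip headers, blank lines, deeper output
        path, *nums = parts
        try:
            vals = [float(x) for x in nums]
        except ValueError:
            continue  # non-numeric columns, e.g. the header row
        rows.append((path, vals[2]))  # keep path and avg time/rank
    return rows

def hottest(text, n=3):
    """Return the n report entries with the largest average time."""
    return sorted(parse_report(text), key=lambda r: -r[1])[:n]

sample = """\
main 40.376340 40.378758 40.377549 99.399072
matrix-solver 6.071176 6.123533 6.097355 15.010108
state-kca 0.363553 0.364731 0.364142 0.896423
"""
print(hottest(sample, 2))
```

Note that this flat parse ignores the nesting of the report; inclusive parent regions (e.g. `main`, `simulation`) will always rank above their children.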
To produce a useful profile with NSight Systems we need to configure Caliper to emit NVTX markers (`CALI_CONFIG=nvtx`) and tell NSight Systems to record regions with names that are not registered strings (`NSYS_NVTX_PROFILER_REGISTER_ONLY=0`).
To avoid profiling model initialisation and setup, you may want to record only the actual simulation (`--capture-range=nvtx --nvtx-capture=simulation`).
Additionally, NSight Systems seems to have trouble profiling multiple OpenMP host threads launching GPU kernels at once, so you may want to disable that (`OMP_NUM_THREADS=1`).
Taken together, an example prefix could be:
CALI_CONFIG=nvtx OMP_NUM_THREADS=1 nsys profile --env-var NSYS_NVTX_PROFILER_REGISTER_ONLY=0 --cuda-um-gpu-page-faults=true --cuda-um-cpu-page-faults=true --trace=cuda,nvtx,openacc,openmp --capture-range=nvtx --nvtx-capture=simulation ./x86_64/special-core ...
Some other notes:
- During development, just use the simple ring test (for quick iteration).
- Install NMODL master into the project space `$CFS/ntrain9/neuron` so that mentors and other people can use a standard version for profiling and other non-codegen-related tasks.
- Repository and branch containing many tests of OpenACC and OpenMP interoperability: https://github.com/iomaganaris/openMP_GPU/tree/magkanar/example. See also the code mentioned in https://forums.developer.nvidia.com/t/openacc-and-openmp-data-interoperability/181866/2