abouteiller released this 19 Nov 13:51
Curated Change log
Added
- PaRSEC API 4.0.
- Add DTD CUDA support including NEW tiles in DTD.
- Add ROCm/HIP device support.
- Add IrisXE/Level0 device support (experimental).
- Enable users to manage their own data copies without PaRSEC interfering. Data copies are marked as being owned by PaRSEC or not and managed by PaRSEC or not. A data copy owned by PaRSEC can be reclaimed by PaRSEC when its reference count reaches 0, a data copy managed by PaRSEC can be copied / moved onto a different device, while a data copy not managed by PaRSEC will never be moved by the runtime.
- Add an info system, and introduce two info hooks. See `parsec/class/info.h` for details. The info system allows the user to register info objects with different levels of structures and dynamic objects in the PaRSEC runtime.
- PTG supports user-defined routines to move data between GPU and CPU, and user-defined sizes for buffers allocated on the GPU.
- PTG supports reshaping data propagated between local tasks and the specification of two types on accesses to data collections.
- PINS logs `SCHEDULE_BEGIN` and `SCHEDULE_END` events to better track the task lifecycle.
- Detect and report oversubscribed binding of core resources.
- PaRSEC Thread binding can be disabled (`bind_threads 0` MCA parameter).
- Load balancing between GPUs can be tuned (`device_load_balance_skew` MCA parameter).
- Load balancing exclusivity between CPU/GPUs can be disabled (`device_load_balance_allow_cpu` MCA parameter).
- Data sent in messages can be of variable size.
- New API `parsec_context_query` can be used to obtain information on the system, like the number of devices, ranks, etc.
- New active-message communication API gives low-level access to the PaRSEC communication system to DSLs.
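
The owned/managed distinction for data copies in the list above can be read as two independent flags. The following is a toy Python model of those rules only, not PaRSEC code: the `DataCopy` class, its fields, and the `release` and `try_move` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataCopy:
    """Toy stand-in for a data copy (hypothetical; not a PaRSEC type)."""
    owned_by_runtime: bool    # runtime may reclaim it when refcount hits 0
    managed_by_runtime: bool  # runtime may copy/move it across devices
    device: int = 0
    refcount: int = 1
    reclaimed: bool = False


def release(copy: DataCopy) -> bool:
    """Drop one reference; reclaim only runtime-owned copies at zero."""
    if copy.refcount <= 0:
        raise ValueError("no reference left to release")
    copy.refcount -= 1
    if copy.refcount == 0 and copy.owned_by_runtime:
        copy.reclaimed = True
    return copy.reclaimed


def try_move(copy: DataCopy, target_device: int) -> bool:
    """Move the copy only when the runtime manages it."""
    if not copy.managed_by_runtime:
        return False
    copy.device = target_device
    return True
```

In this model a user copy created with both flags set to `False` stays on its device and is never reclaimed, which mirrors the "without PaRSEC interfering" wording above.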
Changed
- Single letter command line options have been replaced with `--mca` parameters. `--help` is now `--parsec-help`.
- Renamed symbols related to data distribution to properly prefix them with the `parsec_` prefix. The old symbols have been deprecated.
- DTD interface change: the global array parsec_dtd_arena_datatypes is replaced with functions to create, destroy, and get arena datatypes for DTD, and these objects now live inside the parsec context.
- `PARSEC_SUCCESS` changed to `0` (from `-1`), all values for `PARSEC_ERR_XYZ` changed.
- PaRSEC now requires CMake 3.21.
- PaRSEC profiling tools now require Python 3.x.
- PaRSEC profiling system no longer requires local dictionaries to be identical between ranks.
- `time_estimate` functions can be used to control task load balancing (replaces `weight` PTG property).
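
Launch scripts that assembled PaRSEC command lines may need updating after the option changes above. The helpers below are a minimal sketch, assuming MCA parameters are passed as `--mca <name> <value>` pairs; `mca_args` and `migrate_help_flag` are hypothetical helper names, and no mapping from the old single-letter options is attempted because the notes do not list one.

```python
def mca_args(params):
    """Flatten {name: value} into ['--mca', name, value, ...].

    Assumes the '--mca <name> <value>' form; values are stringified.
    """
    args = []
    for name, value in params.items():
        args += ["--mca", str(name), str(value)]
    return args


def migrate_help_flag(argv):
    """Rewrite the old --help flag to --parsec-help; keep the rest as-is."""
    return ["--parsec-help" if arg == "--help" else arg for arg in argv]
```

For instance, `mca_args({"bind_threads": 0})` yields `['--mca', 'bind_threads', '0']`, which can be appended to an application's argument list.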
Deprecated
- Data distribution symbols without the `parsec_` prefix. Further documentation (including a sed script) can be found in `contrib/renaming`.
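
For code bases that cannot run the provided sed script directly, the same kind of whole-identifier rename can be sketched in Python. This is not the script from `contrib/renaming`: the `RENAMES` entries are hypothetical placeholders, and the real old/new symbol pairs should be taken from that directory.

```python
import re

# Hypothetical mapping for illustration only; the actual old/new names
# are documented in contrib/renaming.
RENAMES = {
    "example_matrix_t": "parsec_example_matrix_t",
    "example_matrix_init": "parsec_example_matrix_init",
}


def rename_symbols(source, renames=RENAMES):
    """Replace whole identifiers only.

    Word boundaries keep already-prefixed names and longer identifiers
    that merely contain an old name untouched.
    """
    if not renames:
        return source
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], source)
```

Because `_` counts as a word character, an identifier that already carries the `parsec_` prefix is not matched a second time, so the function is safe to run repeatedly on the same file.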
Removed
- PaRSEC API 3.0
- RECURSIVE Device support (this is temporary and will be restored in a future version).
- Removed obsolete `dbp2paje` tool; `h5totrace` is the replacement tool to use. This removes the optional dependency on GTG.
- Removed all command line options not prefixed by `--mca`, except for `--parsec-help` and `--parsec-version`.
- Using more than `PARSEC_GPU_MAX_WORKSPACE` workspaces per device will now cause an error (instead of computing incorrect values).
- PTG property `weight` (replaced by `time_estimate`).
Fixed
- DTD Termination detection would occasionally assert.
- Multiple bugs with GPU data ownership causing crashes and incorrect results when executing with more than 1 GPU.
- Device-to-device memory copies would not work in some scenarios.
- Suboptimal ordering of members in broadcast tree could cause performance reduction.
- Cray MPI and MPICH would crash in `MPI_Cancel` and when using `NULL` datatypes.
- Do not report incorrect flops/s capabilities (`device_show_capabilities` MCA parameter).
- On some systems PaRSEC would allocate more GPU memory than is available on the device.
- Performance with large number of GPU tasks with the same priority would be poor due to overhead of sorting by priority.
Known Bugs
- PaRSEC Thread binding ignores externally provided binding (e.g., a cpuset enforced by `srun`); see issue ICLDisco/dplasma#9.
- Enabling the `RECURSIVE` device will cause crashes (it is disabled by default in this release); see issues #548, #541.
- Running out of GPU memory when using the NEW keyword in PTG may cause deadlocks; see issue #527.
Security
Merged Pull Requests
List of merged pull requests
- [BBT#582] bugfix/atomic lifo: The offsetof was incorrect leading to lifo padding being wrong in external lifo by @abouteiller in #316
- First sketch of a github action for building by @bosilca in #309
- Miscellaneous profiling fixes by @omor1 in #320
- Per-language compiler flags by @therault in #326
- [BBT#541] A new way to install the internal headers by @bosilca in #322
- Doc/GitHub by @abouteiller in #330
- Provide a temporary fix for the flag detection. by @bosilca in #336
- We need BISON 3, and try to automatically pick the brew variant on Mac OSX by @abouteiller in #331
- Clean strings usages in CMake. by @bosilca in #340
- Allow the runtime to compile even when PTG support is not possible. by @bosilca in #332
- Work around GCC bug for atomic_thread_fence with memory order acquire by @devreal in #343
- Fix parsec_future: volatile and memory barriers by @devreal in #342
- Reshape test: variable used for polling should be volatile by @devreal in #344
- Dust off the cmake_modules by @abouteiller in #346
- New CMake versions use MPI_ROOT to find MPI by @abouteiller in #345
- Fallback using a compatible HWLOC. by @bosilca in #341
- hotfix: compile failure when Ayudame not found by @abouteiller in #348
- Fix/quick fixes by @bosilca in #350
- Update issue template to make it easier to read and easier to fill-up by @abouteiller in #349
- Update the installation instructions by @abouteiller in #354
- Cleanup/ptgpp assignments by @abouteiller in #352
- Apply -g3 to DEBUG only, set default config to Release by @abouteiller in #347
- Profiling msync and header commit by @therault in #337
- Removing hard flex/bison dependency: only devs need to run the parser by @abouteiller in #335
- Hicma/recursive by @bosilca in #328
- Fix/deprecated support by @bosilca in #362
- Add the filename to the generated profiling event name. by @bosilca in #359
- Fix atomics on macosX not working properly (missing header) by @abouteiller in #356
- Remove never compiled in '64bit' lifo implementation by @abouteiller in #360
- Fix/many small updates by @bosilca in #363
- Make the ParsecCompilerFlags.cmake self contained by @abouteiller in #364
- Profiling fix: parsec_init(NULL, NULL) by @therault in #339
- GitHub runner with spack by @bosilca in #333
- Update PAPI SDE to fit the current API by @therault in #365
- ucontext is not supported on OSX. by @bosilca in #366
- recursive cb type was not correct by @abouteiller in #368
- Since new policy, setting the non-cache variable creates an empty cache by @abouteiller in #367
- Do now allow spack to be updated automatically. by @bosilca in #375
- flex: on some machines, flex cannot work if parsec/utils is not created by @abouteiller in #374
- Attempt to backport the revamp of the communication engine by @devreal in #380
- Respect DISTDIR is provided. by @bosilca in #383
- [RFC] profiling tools: more efficient cross-stream event matching by @omor1 in #372
- Hash table: count used buckets only when needed by @devreal in #379
- Print the debug rank from device_show_statistics by @abouteiller in #386
- Handle error in CUDA/HIP module init and configurable max_streams by @therault in #351
- Update to a newer spack compiler by @bosilca in #392
- Make the PUSHOUT and other DTD GPU concepts generic by @abouteiller in #387
- Workaround current CUDA/HIP "solution suspicious" bug... by @therault in #381
- dtd_bench_simple_gemm.c relies on non-standard cblas.h file by @therault in #317
- profiling tools: improve large buffer performance by @omor1 in #390
- Add a load-balancing skew so that we favor locality up to a configurable limit by @abouteiller in #389
- Fix redistribute wrapper by @devreal in #395
- [BBT#509] Dtd cuda with new by @therault in #318
- Only enable CUDA language if supported. by @bosilca in #404
- [BBT#572] Implement hash table API providing a key handle during lock by @devreal in #307
- profiling tools: fix format for padded structures by @omor1 in #402
- Don't pass the execution stream around to recursive calls. by @bosilca in #413
- Protect the inline functions by they device support. by @bosilca in #415
- Allow the DSL to provide a task_snprintf function and use it when displaying the DOT by @devreal in #409
- Execution stream keeps the highest priority task for local execution. by @bosilca in #399
- TTG/termdet by @therault in #391
- More data transfer statistics between devices by @therault in #426
- Remove startup tasks from DTD by @therault in #425
- tools/profiling: PTT v2 by @omor1 in #418
- Fix/more warnings by @bosilca in #416
- Fix possible access to unowned memory in `parsec_dtd_task_class_add_chore` by @DSMishler in #427
- fix remote_dep_mpi.c by @cflinto in #429
- bugfix for profiling without multiprocessing by @DSMishler in #432
- Project_dyn test missing libm dependency by @abouteiller in #433
- Ttg/termdet dynamic PTG by @therault in #430
- Update API versions in examples by @abouteiller in #435
- Make sure PaRSEC compiles for all rwlock implementations. by @bosilca in #434
- comm: set parsec_tls_execution_stream in comm thread by @omor1 in #422
- Use normal for loops to iterate over local index variables when they are a range. by @therault in #329
- [BBT#559] Add llp scheduler: local lifo with priorities by @devreal in #325
- [BBT#536] Add AMD RoCM/HIP device by @abouteiller in #315
- TSL variables are not static by default. by @bosilca in #437
- Allow overwriting of the completion and enqueue callbacks. by @bosilca in #439
- Bring back support for changing the PaRSEC communicator by @bosilca in #401
- Fix: termination detector race condition by @therault in #438
- Fixes/tls in comm thread by @therault in #440
- profiling: disambiguation between certain MPI events by @omor1 in #376
- TTG/building system by @therault in #424
- A small script to help us find which files might need copyright update. by @therault in #442
- Remove dependency on argv[0] by @abouteiller in #445
- Make configurable the treshold for warning about wrong binding on by @abouteiller in #453
- Remove some warnings about unused variables by @abouteiller in #451
- Fix statistics management with multiple GPU. by @bosilca in #454
- Debug output about reshapping was too verbose by @abouteiller in #456
- Topic/update spack by @bosilca in #462
- Increase function name length limit in debug output by @devreal in #463
- initial flags modification PR. Open to review. by @DSMishler in #461
- Introduce parsec_taskpool_wait and parsec_taskpool_test by @devreal in #411
- Fix profile dtd by @therault in #475
- Correctly identify the Intel compiler. by @bosilca in #472
- Proper case and imported target for hwloc by @abouteiller in #457
- Use the profiling key macros by @bosilca in #476
- No more tag collisions. by @bosilca in #477
- profiling: fix multiple EXEC_END events in some circumstances by @omor1 in #421
- DSL profiling by @therault in #469
- Put END_C_DECLS at the end of device_cuda.h by @devreal in #480
- Drop support for python2. by @bosilca in #482
- Hotfix dtd dsl and warning by @therault in #488
- Use pip to build and install the python support. by @bosilca in #489
- Termdet callback order by @therault in #494
- Fix the branching test in distributed by @therault in #495
- Fix profiling tests by @therault in #490
- Log all types of runtime-system events in task_profiler, and ensure that all time of compute threads is accounted for by @devreal in #410
- Fix missing loop counter increment in pins task profiler by @devreal in #498
- Close files between operations. by @bosilca in #484
- Make parsec_data_t::device_copies a flexible array member by @devreal in #499
- Fix an issue about GPU statistics (Issue#505) by @QingleiCao in #506
- Do not cancel persistent requests with cray-mpich, it is broken. by @abouteiller in #519
- Make sure data_in is not NULL on GPU when accessing nb_elts by @QingleiCao in #492
- volatile uint32 is not always a valid type for MPI_SUM by @abouteiller in #524
- Profiling corrected. by @josephjohnjj in #526
- Remove all options that are not prefixed by --mca by @abouteiller in #447
- Implement a universal hash function by @devreal in #528
- Can't use NULL to pass-in MPI datatypes with MPICH derivatives by @abouteiller in #518
- Update to the SDE available in PAPI-7.0.1 by @therault in #522
- More flexible paranoid checks in data_dist/matrix implementations by @therault in #530
- Handle the tertiary case for startup tasks. by @bosilca in #508
- Fix some errors with ctest for profiling by @abouteiller in #523
- pip install --prefix has version-dependent behaviors by @therault in #532
- Fix old typo in termdet modules management: by @therault in #533
- Allow for tag registration/deregistration any time. by @bosilca in #521
- Topic/device naming by @bosilca in #493
- Gpu workspace fix by @therault in #510
- Fix bug: some scenarios would call a nullptr function in profiling by @therault in #535
- Profiling hotfix: error in python3 when passing {char[64]} as conversion type by @therault in #536
- Limit the number of recv requests to ensure there is space for sends by @devreal in #538
- Flex flags bugfix and python venv log by @abouteiller in #537
- Fix hash function generation for PTG by @therault in #479
- Small warnings from clang14 by @abouteiller in #542
- Print -wflags found only when results differ from cache by @abouteiller in #540
- Fix the start/stop test. by @bosilca in #546
- Gpu fix load balancing by @therault in #517
- Tear-down parsec_ce after high-level communication by @devreal in #549
- relabel mislabelled tests by @abouteiller in #550
- Find the right spack environment. by @bosilca in #559
- Add device async/again support. by @bosilca in #544
- hotfix by @therault in #562
- Profiling and PAPI SDE updates by @therault in #565
- Fix the management of GPU copies. by @bosilca in #563
- cmake -> CMAKE_COMMAND by @evaleev in #567
- Correct the logic for passing over CPU/RECURSIVE devices by @abouteiller in #557
- HAVE_PEER_ACCESS is always present in all relevant versions of CUDA or by @abouteiller in #572
- CUDA: disable timer support on cuda events by @devreal in #576
- Pick a stable spack branch by @bosilca in #578
- Add an option to skip HWLOC compat run by @bosilca in #582
- Allow locals and parameters to be defined via CMake. by @bosilca in #583
- Prevent race condition in accelerator copies management by @bosilca in #575
- Fix/rocm detect and unknown device warning by @abouteiller in #577
- [python] do not build python support unless pandas is available by @evaleev in #584
- Refactor GPU device to increase code factorization between the devices. by @therault in #570
- Disable recursive device by default (temporarily) by @abouteiller in #585
- remote_dep: rotate bcast topology around root by @omor1 in #481
- Discover atomic support for __int128_t. by @bosilca in #587
- Fix warnings. by @bosilca in #591
- Typo in level zero component by @therault in #592
- dbpreader: missing corner case in cache building by @therault in #593
- Skip profiling for task classes without profiling information by @DSMishler in #594
- cmake logic fixes for level zero and half-installed systems by @therault in #596
- l0: dpccpp can't create output files in the build dir if the enclosing by @abouteiller in #597
- gpu: some errors introduced during gpu despecialization caused deadlocks by @abouteiller in #598
- ze: the queue need to be reset when task completes by @abouteiller in #599
- Fixes for GPU memory oversubscription by @devreal in #602
- If the DSL defines a task_snprintf function, use that function. by @therault in #603
- Make sure the w2r task has a stageout set by @devreal in #604
- Consistently use size_t for nb_elts in data and flows by @devreal in #605
- Minor cleanup of the DTD parameters manipulation. by @bosilca in #609
- Fix the parsec_future_t. by @bosilca in #608
- Topic/add evaluate keyword by @bosilca in #569
- Relative path, symbolic links, and python examples by @therault in #606
- Update process_name.c. Fixes the issue #610 by @bimalgaudel in #611
- CMAKE: Bring back checks for atomic CAS by @devreal in #595
- HOTFIX: make the default number of devices be all the devices seen by… by @therault in #613
- bugfix: in dtd sometimes the cpu incarnation and gpu incarnations are by @abouteiller in #616
- Re-enable CI tests for cuda caps as they now work again. by @abouteiller in #614
- Add capability of saving GPU statistics and printing diff vs saved stats by @abouteiller in #558
- Fix the argument _NSGetExecutablePath. by @bosilca in #620
- Add a context-level query capability. by @bosilca in #621
- Prevent CI from running OOM when oversubscribing GPUs by @abouteiller in #629
- Fix CUDA protection macro use by @bosilca in #632
- Move comm profiling initialization into comm thread by @devreal in #626
- Cleanup/cosmetics by @abouteiller in #631
- Alternative solution to the CI problem with GPUs by @therault in #633
- ci: all tests must use parsec_addtest by @abouteiller in #635
- [BBT#237] Allow sender to send data of any size. by @bosilca in #321
- bugfix: dtd taskpool destructor should work symetric to contructor by @abouteiller in #637
- fixes memory leaks by @BrieucNicolas in #639
- Fix the lack of direct GPU to GPU communications in multi-device runs. by @therault in #642
- Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) by @abouteiller in #648
- Initialize the parsec's HWLOC subsystem before starting threads. by @abouteiller in #650
- Fix overflow when calling parsec_data_create by @QingleiCao in #646
- cmakery: let find_package find HIP v6 by @abouteiller in #652
- Fix computation of available memory on gpu (avoid truncation and conversions) by @abouteiller in #651
- bugfix: when hip is not found, its ok. by @abouteiller in #656
- Add sanity check for free memory by @devreal in #658
- Explicit message when outputing the warning about being unable to allocate memory in GPU code by @therault in #655
- config: osx would not find bison on newer fink/brew by @abouteiller in #657
- Consolidated error handling when GPU only tests execute on CPU systems by @abouteiller in #644
- Add the number of copies evicted in the statistics of the devices. by @therault in #666
- Fix use of calloc. by @bosilca in #669
- Add: mca control for cpu load balancing (and don't report Gflops figures for cpus we can't determine) by @abouteiller in #663
- Suffix-increment is deprecated on volatile variables in C++ by @devreal in #674
- show-caps: don't report flops for unknown cuda devs, report peer access by @abouteiller in #672
- Apply does not release user-defined memory by @QingleiCao in #676
- Release lock in create_w2r_task if readers are readers are not zero by @devreal in #678
- w2r task should unlock the lock if readers are not 0 by @devreal in #682
- Refactored CI by @G-Ragghianti in #667
- C11 atomic lock alignment in data_t by @abouteiller in #685
- The device task is now released by the DSL by @bosilca in #688
- Reenable the memory eviction code by @bosilca in #679
- List ordered push: search from back if lower than pivot by @devreal in #693
- Add an icl platform file, move saturn platform to legacy by @abouteiller in #692
- Contrib/copycheck by @abouteiller in #574
- bugfix: dtd would not run cpu hooks when compiled with cuda by @abouteiller in #697
- v4.0.2411 changelog by @abouteiller in #699
- Fix a race condition in DTD for the local termdet by @therault in #698
- Bring back support for MPI allow_overtake by @bosilca in #704
- Fix function name for parsec_atomic_fetch_add_int64 by @devreal in #705
New Contributors
- @omor1 made their first contribution in #320
- @devreal made their first contribution in #343
- @DSMishler made their first contribution in #427
- @cflinto made their first contribution in #429
- @QingleiCao made their first contribution in #506
- @josephjohnjj made their first contribution in #526
- @evaleev made their first contribution in #567
- @bimalgaudel made their first contribution in #611
- @BrieucNicolas made their first contribution in #639
- @G-Ragghianti made their first contribution in #667
Full Changelog: https://github.com/ICLDisco/parsec/commits/parsec-4.0.2411