parsec-4.0.2411

@abouteiller released this 19 Nov 13:51 (commit cdb2e7f)

Curated Change log

Added

  • PaRSEC API 4.0.
  • Add DTD CUDA support including NEW tiles in DTD.
  • Add ROCm/HIP device support.
  • Add IrisXE/Level0 device support (experimental).
  • Enable users to manage their own data copies without PaRSEC interfering. Each data copy is marked as owned by PaRSEC or
    not, and as managed by PaRSEC or not. A data copy owned by PaRSEC can be reclaimed by PaRSEC when its reference count
    reaches 0. A data copy managed by PaRSEC can be copied or moved onto a different device, while a data copy not managed
    by PaRSEC will never be moved by the runtime.
  • Add an info system, and introduce two info hooks. See parsec/class/info.h for details. The info system allows the user to register info objects with different levels of structures and dynamic objects in the PaRSEC runtime.
  • PTG supports user-defined routines to move data between GPU and CPU, and user-defined sizes for buffers allocated on the GPU.
  • PTG supports reshaping data propagated between local tasks and the specification of two types on accesses to data collections.
  • PINS logs SCHEDULE_BEGIN and SCHEDULE_END events to better track the task lifecycle.
  • Detect and report oversubscribed binding of core resources.
  • PaRSEC Thread binding can be disabled (bind_threads 0 MCA parameter).
  • Load balancing between GPUs can be tuned (device_load_balance_skew MCA parameter).
  • Load balancing exclusivity between CPU/GPUs can be disabled (device_load_balance_allow_cpu MCA parameter).
  • Data sent in messages can be of variable size.
  • New API parsec_context_query can be used to obtain information on the system, like the number of devices, ranks, etc.
  • New active-message communication API gives low-level access to the PaRSEC communication system to DSLs.
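
The owned/managed distinction for data copies described above can be sketched as two independent flags. This is a minimal illustration of the stated rules only: the flag names, the `data_copy_t` struct, and both helper functions are hypothetical and are not PaRSEC identifiers.

```c
#include <stdbool.h>

/* Hypothetical flags for this sketch only; not PaRSEC identifiers. */
enum {
    COPY_OWNED_BY_RUNTIME   = 1 << 0, /* the runtime may reclaim the copy */
    COPY_MANAGED_BY_RUNTIME = 1 << 1  /* the runtime may copy/move the copy */
};

/* Hypothetical, simplified stand-in for a data copy. */
typedef struct {
    unsigned flags;    /* combination of the flags above */
    int      refcount; /* number of outstanding references */
} data_copy_t;

/* An owned copy becomes reclaimable once its reference count reaches 0. */
static bool can_reclaim(const data_copy_t *c)
{
    return (c->flags & COPY_OWNED_BY_RUNTIME) != 0 && c->refcount == 0;
}

/* Only a managed copy may be copied or moved onto a different device. */
static bool can_move(const data_copy_t *c)
{
    return (c->flags & COPY_MANAGED_BY_RUNTIME) != 0;
}
```

In this model, a copy with neither flag set (a user-managed copy) is never reclaimed and never moved, whatever its reference count.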

Changed

  • Single letter command line options have been replaced with --mca parameters. --help is now --parsec-help.
  • Renamed symbols related to data distribution to properly prefix them with the parsec_ prefix. The old symbols have been deprecated.
  • DTD interface change: the global array parsec_dtd_arena_datatypes is replaced with functions to create, destroy, and get arena
    datatypes for DTD, and these objects now live inside the parsec context.
  • PARSEC_SUCCESS changed to 0 (from -1), all values for PARSEC_ERR_XYZ changed.
  • PaRSEC now requires CMake 3.21.
  • PaRSEC profiling tools now require Python 3.x.
  • PaRSEC profiling system no longer requires local dictionaries to be identical across ranks.
  • time_estimate functions can be used to control task load balancing (replaces weight PTG property).
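
Because PARSEC_SUCCESS changed from -1 to 0, a check that hard-codes the old numeric value silently breaks, while one that compares against the symbolic constant keeps working. The sketch below illustrates the difference. The `OLD_`/`NEW_` macros and both helper functions are stand-ins invented for this example; real code would include the PaRSEC headers and compare against PARSEC_SUCCESS directly.

```c
#include <stdbool.h>

/* Stand-in values for this sketch; only the -1 -> 0 change comes from the notes. */
#define OLD_PARSEC_SUCCESS (-1)
#define NEW_PARSEC_SUCCESS 0

/* Fragile: hard-codes the pre-4.0 numeric value of success. */
static bool ok_hardcoded(int rc)
{
    return rc == -1;
}

/* Robust: compares against whichever value the headers define as success. */
static bool ok_symbolic(int rc, int success_value)
{
    return rc == success_value;
}
```

Since all PARSEC_ERR_XYZ values changed as well, the same reasoning applies to error checks: compare against the named constants, not literals.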

Deprecated

  • Data distribution symbols without the parsec_ prefix. Further documentation (including a
    sed script) can be found in contrib/renaming.

Removed

  • PaRSEC API 3.0
  • RECURSIVE Device support (this is temporary and will be restored in a future version).
  • Removed obsolete dbp2paje tool; h5totrace is the replacement tool to use. This removes the optional dependency on GTG.
  • Removed all command line options not prefixed by --mca, except for --parsec-help and --parsec-version.
  • Using more than PARSEC_GPU_MAX_WORKSPACE workspaces per device will now cause an error (instead of computing incorrect values).
  • PTG property weight (replaced by time_estimate).

Fixed

  • DTD Termination detection would occasionally assert.
  • Multiple bugs with GPU data ownership causing crashes and incorrect results when executing with more than 1 GPU.
  • Device-to-device memory copies would not work in some scenarios.
  • Suboptimal ordering of members in broadcast tree could cause performance reduction.
  • Cray MPI and MPICH would crash in MPI_Cancel and when using NULL datatypes.
  • Do not report incorrect flops/s capabilities (device_show_capabilities MCA parameter).
  • On some systems PaRSEC would allocate more GPU memory than is available on the device.
  • Performance with large number of GPU tasks with the same priority would be poor due to overhead of sorting by priority.

Known Bugs

  • PaRSEC Thread binding ignores externally provided binding (e.g., a cpuset enforced by srun); see issue ICLDisco/dplasma#9.
  • Enabling the RECURSIVE device will cause crashes (it is disabled by default in this release); see issues #548, #541.
  • Running out of GPU memory when using the NEW keyword in PTG may cause deadlocks; see issue #527.

Security

Merged Pull Requests

  • [BBT#582] bugfix/atomic lifo: The offsetof was incorrect leading to lifo padding being wrong in external lifo by @abouteiller in #316
  • First sketch of a github action for building by @bosilca in #309
  • Miscellaneous profiling fixes by @omor1 in #320
  • Per-language compiler flags by @therault in #326
  • [BBT#541] A new way to install the internal headers by @bosilca in #322
  • Doc/GitHub by @abouteiller in #330
  • Provide a temporary fix for the flag detection. by @bosilca in #336
  • We need BISON 3, and try to automatically pick the brew variant on Mac OSX by @abouteiller in #331
  • Clean strings usages in CMake. by @bosilca in #340
  • Allow the runtime to compile even when PTG support is not possible. by @bosilca in #332
  • Work around GCC bug for atomic_thread_fence with memory order acquire by @devreal in #343
  • Fix parsec_future: volatile and memory barriers by @devreal in #342
  • Reshape test: variable used for polling should be volatile by @devreal in #344
  • Dust off the cmake_modules by @abouteiller in #346
  • New CMake versions use MPI_ROOT to find MPI by @abouteiller in #345
  • Fallback using a compatible HWLOC. by @bosilca in #341
  • hotfix: compile failure when Ayudame not found by @abouteiller in #348
  • Fix/quick fixes by @bosilca in #350
  • Update issue template to make it easier to read and easier to fill-up by @abouteiller in #349
  • Update the installation instructions by @abouteiller in #354
  • Cleanup/ptgpp assignments by @abouteiller in #352
  • Apply -g3 to DEBUG only, set default config to Release by @abouteiller in #347
  • Profiling msync and header commit by @therault in #337
  • Removing hard flex/bison dependency: only devs need to run the parser by @abouteiller in #335
  • Hicma/recursive by @bosilca in #328
  • Fix/deprecated support by @bosilca in #362
  • Add the filename to the generated profiling event name. by @bosilca in #359
  • Fix atomics on macosX not working properly (missing header) by @abouteiller in #356
  • Remove never compiled in '64bit' lifo implementation by @abouteiller in #360
  • Fix/many small updates by @bosilca in #363
  • Make the ParsecCompilerFlags.cmake self contained by @abouteiller in #364
  • Profiling fix: parsec_init(NULL, NULL) by @therault in #339
  • GitHub runner with spack by @bosilca in #333
  • Update PAPI SDE to fit the current API by @therault in #365
  • ucontext is not supported on OSX. by @bosilca in #366
  • recursive cb type was not correct by @abouteiller in #368
  • Since new policy, setting the non-cache variable creates an empty cache by @abouteiller in #367
  • Do not allow spack to be updated automatically. by @bosilca in #375
  • flex: on some machines, flex cannot work if parsec/utils is not created by @abouteiller in #374
  • Attempt to backport the revamp of the communication engine by @devreal in #380
  • Respect DISTDIR is provided. by @bosilca in #383
  • [RFC] profiling tools: more efficient cross-stream event matching by @omor1 in #372
  • Hash table: count used buckets only when needed by @devreal in #379
  • Print the debug rank from device_show_statistics by @abouteiller in #386
  • Handle error in CUDA/HIP module init and configurable max_streams by @therault in #351
  • Update to a newer spack compiler by @bosilca in #392
  • Make the PUSHOUT and other DTD GPU concepts generic by @abouteiller in #387
  • Workaround current CUDA/HIP "solution suspicious" bug... by @therault in #381
  • dtd_bench_simple_gemm.c relies on non-standard cblas.h file by @therault in #317
  • profiling tools: improve large buffer performance by @omor1 in #390
  • Add a load-balancing skew so that we favor locality up to a configurable limit by @abouteiller in #389
  • Fix redistribute wrapper by @devreal in #395
  • [BBT#509] Dtd cuda with new by @therault in #318
  • Only enable CUDA language if supported. by @bosilca in #404
  • [BBT#572] Implement hash table API providing a key handle during lock by @devreal in #307
  • profiling tools: fix format for padded structures by @omor1 in #402
  • Don't pass the execution stream around to recursive calls. by @bosilca in #413
  • Protect the inline functions by their device support. by @bosilca in #415
  • Allow the DSL to provide a task_snprintf function and use it when displaying the DOT by @devreal in #409
  • Execution stream keeps the highest priority task for local execution. by @bosilca in #399
  • TTG/termdet by @therault in #391
  • More data transfer statistics between devices by @therault in #426
  • Remove startup tasks from DTD by @therault in #425
  • tools/profiling: PTT v2 by @omor1 in #418
  • Fix/more warnings by @bosilca in #416
  • Fix possible access to unowned memory in parsec_dtd_task_class_add_chore by @DSMishler in #427
  • fix remote_dep_mpi.c by @cflinto in #429
  • bugfix for profiling without multiprocessing by @DSMishler in #432
  • Project_dyn test missing libm dependency by @abouteiller in #433
  • Ttg/termdet dynamic PTG by @therault in #430
  • Update API versions in examples by @abouteiller in #435
  • Make sure PaRSEC compiles for all rwlock implementations. by @bosilca in #434
  • comm: set parsec_tls_execution_stream in comm thread by @omor1 in #422
  • Use normal for loops to iterate over local index variables when they are a range. by @therault in #329
  • [BBT#559] Add llp scheduler: local lifo with priorities by @devreal in #325
  • [BBT#536] Add AMD RoCM/HIP device by @abouteiller in #315
  • TSL variables are not static by default. by @bosilca in #437
  • Allow overwriting of the completion and enqueue callbacks. by @bosilca in #439
  • Bring back support for changing the PaRSEC communicator by @bosilca in #401
  • Fix: termination detector race condition by @therault in #438
  • Fixes/tls in comm thread by @therault in #440
  • profiling: disambiguation between certain MPI events by @omor1 in #376
  • TTG/building system by @therault in #424
  • A small script to help us find which files might need copyright update. by @therault in #442
  • Remove dependency on argv[0] by @abouteiller in #445
  • Make configurable the threshold for warning about wrong binding on by @abouteiller in #453
  • Remove some warnings about unused variables by @abouteiller in #451
  • Fix statistics management with multiple GPU. by @bosilca in #454
  • Debug output about reshaping was too verbose by @abouteiller in #456
  • Topic/update spack by @bosilca in #462
  • Increase function name length limit in debug output by @devreal in #463
  • initial flags modification PR. Open to review. by @DSMishler in #461
  • Introduce parsec_taskpool_wait and parsec_taskpool_test by @devreal in #411
  • Fix profile dtd by @therault in #475
  • Correctly identify the Intel compiler. by @bosilca in #472
  • Proper case and imported target for hwloc by @abouteiller in #457
  • Use the profiling key macros by @bosilca in #476
  • No more tag collisions. by @bosilca in #477
  • profiling: fix multiple EXEC_END events in some circumstances by @omor1 in #421
  • DSL profiling by @therault in #469
  • Put END_C_DECLS at the end of device_cuda.h by @devreal in #480
  • Drop support for python2. by @bosilca in #482
  • Hotfix dtd dsl and warning by @therault in #488
  • Use pip to build and install the python support. by @bosilca in #489
  • Termdet callback order by @therault in #494
  • Fix the branching test in distributed by @therault in #495
  • Fix profiling tests by @therault in #490
  • Log all types of runtime-system events in task_profiler, and ensure that all time of compute threads is accounted for by @devreal in #410
  • Fix missing loop counter increment in pins task profiler by @devreal in #498
  • Close files between operations. by @bosilca in #484
  • Make parsec_data_t::device_copies a flexible array member by @devreal in #499
  • Fix an issue about GPU statistics (Issue#505) by @QingleiCao in #506
  • Do not cancel persistent requests with cray-mpich, it is broken. by @abouteiller in #519
  • Make sure data_in is not NULL on GPU when accessing nb_elts by @QingleiCao in #492
  • volatile uint32 is not always a valid type for MPI_SUM by @abouteiller in #524
  • Profiling corrected. by @josephjohnjj in #526
  • Remove all options that are not prefixed by --mca by @abouteiller in #447
  • Implement a universal hash function by @devreal in #528
  • Can't use NULL to pass-in MPI datatypes with MPICH derivatives by @abouteiller in #518
  • Update to the SDE available in PAPI-7.0.1 by @therault in #522
  • More flexible paranoid checks in data_dist/matrix implementations by @therault in #530
  • Handle the tertiary case for startup tasks. by @bosilca in #508
  • Fix some errors with ctest for profiling by @abouteiller in #523
  • pip install --prefix has version-dependent behaviors by @therault in #532
  • Fix old typo in termdet modules management: by @therault in #533
  • Allow for tag registration/deregistration any time. by @bosilca in #521
  • Topic/device naming by @bosilca in #493
  • Gpu workspace fix by @therault in #510
  • Fix bug: some scenarios would call a nullptr function in profiling by @therault in #535
  • Profiling hotfix: error in python3 when passing {char[64]} as conversion type by @therault in #536
  • Limit the number of recv requests to ensure there is space for sends by @devreal in #538
  • Flex flags bugfix and python venv log by @abouteiller in #537
  • Fix hash function generation for PTG by @therault in #479
  • Small warnings from clang14 by @abouteiller in #542
  • Print -wflags found only when results differ from cache by @abouteiller in #540
  • Fix the start/stop test. by @bosilca in #546
  • Gpu fix load balancing by @therault in #517
  • Tear-down parsec_ce after high-level communication by @devreal in #549
  • relabel mislabelled tests by @abouteiller in #550
  • Find the right spack environment. by @bosilca in #559
  • Add device async/again support. by @bosilca in #544
  • hotfix by @therault in #562
  • Profiling and PAPI SDE updates by @therault in #565
  • Fix the management of GPU copies. by @bosilca in #563
  • cmake -> CMAKE_COMMAND by @evaleev in #567
  • Correct the logic for passing over CPU/RECURSIVE devices by @abouteiller in #557
  • HAVE_PEER_ACCESS is always present in all relevant versions of CUDA or by @abouteiller in #572
  • CUDA: disable timer support on cuda events by @devreal in #576
  • Pick a stable spack branch by @bosilca in #578
  • Add an option to skip HWLOC compat run by @bosilca in #582
  • Allow locals and parameters to be defined via CMake. by @bosilca in #583
  • Prevent race condition in accelerator copies management by @bosilca in #575
  • Fix/rocm detect and unknown device warning by @abouteiller in #577
  • [python] do not build python support unless pandas is available by @evaleev in #584
  • Refactor GPU device to increase code factorization between the devices. by @therault in #570
  • Disable recursive device by default (temporarily) by @abouteiller in #585
  • remote_dep: rotate bcast topology around root by @omor1 in #481
  • Discover atomic support for __int128_t. by @bosilca in #587
  • Fix warnings. by @bosilca in #591
  • Typo in level zero component by @therault in #592
  • dbpreader: missing corner case in cache building by @therault in #593
  • Skip profiling for task classes without profiling information by @DSMishler in #594
  • cmake logic fixes for level zero and half-installed systems by @therault in #596
  • l0: dpccpp can't create output files in the build dir if the enclosing by @abouteiller in #597
  • gpu: some errors introduced during gpu despecialization caused deadlocks by @abouteiller in #598
  • ze: the queue need to be reset when task completes by @abouteiller in #599
  • Fixes for GPU memory oversubscription by @devreal in #602
  • If the DSL defines a task_snprintf function, use that function. by @therault in #603
  • Make sure the w2r task has a stageout set by @devreal in #604
  • Consistently use size_t for nb_elts in data and flows by @devreal in #605
  • Minor cleanup of the DTD parameters manipulation. by @bosilca in #609
  • Fix the parsec_future_t. by @bosilca in #608
  • Topic/add evaluate keyword by @bosilca in #569
  • Relative path, symbolic links, and python examples by @therault in #606
  • Update process_name.c. Fixes the issue #610 by @bimalgaudel in #611
  • CMAKE: Bring back checks for atomic CAS by @devreal in #595
  • HOTFIX: make the default number of devices be all the devices seen by… by @therault in #613
  • bugfix: in dtd sometimes the cpu incarnation and gpu incarnations are by @abouteiller in #616
  • Re-enable CI tests for cuda caps as they now work again. by @abouteiller in #614
  • Add capability of saving GPU statistics and printing diff vs saved stats by @abouteiller in #558
  • Fix the argument _NSGetExecutablePath. by @bosilca in #620
  • Add a context-level query capability. by @bosilca in #621
  • Prevent CI from running OOM when oversubscribing GPUs by @abouteiller in #629
  • Fix CUDA protection macro use by @bosilca in #632
  • Move comm profiling initialization into comm thread by @devreal in #626
  • Cleanup/cosmetics by @abouteiller in #631
  • Alternative solution to the CI problem with GPUs by @therault in #633
  • ci: all tests must use parsec_addtest by @abouteiller in #635
  • [BBT#237] Allow sender to send data of any size. by @bosilca in #321
  • bugfix: dtd taskpool destructor should work symmetric to constructor by @abouteiller in #637
  • fixes memory leaks by @BrieucNicolas in #639
  • Fix the lack of direct GPU to GPU communications in multi-device runs. by @therault in #642
  • Compute CPU and GPU versions without lying during kernel epilog (enable TTG/PTG versioning to coexist) by @abouteiller in #648
  • Initialize the parsec's HWLOC subsystem before starting threads. by @abouteiller in #650
  • Fix overflow when calling parsec_data_create by @QingleiCao in #646
  • cmakery: let find_package find HIP v6 by @abouteiller in #652
  • Fix computation of available memory on gpu (avoid truncation and conversions) by @abouteiller in #651
  • bugfix: when hip is not found, its ok. by @abouteiller in #656
  • Add sanity check for free memory by @devreal in #658
  • Explicit message when outputting the warning about being unable to allocate memory in GPU code by @therault in #655
  • config: osx would not find bison on newer fink/brew by @abouteiller in #657
  • Consolidated error handling when GPU only tests execute on CPU systems by @abouteiller in #644
  • Add the number of copies evicted in the statistics of the devices. by @therault in #666
  • Fix use of calloc. by @bosilca in #669
  • Add: mca control for cpu load balancing (and don't report Gflops figures for cpus we can't determine) by @abouteiller in #663
  • Suffix-increment is deprecated on volatile variables in C++ by @devreal in #674
  • show-caps: don't report flops for unknown cuda devs, report peer access by @abouteiller in #672
  • Apply does not release user-defined memory by @QingleiCao in #676
  • Release lock in create_w2r_task if readers are not zero by @devreal in #678
  • w2r task should unlock the lock if readers are not 0 by @devreal in #682
  • Refactored CI by @G-Ragghianti in #667
  • C11 atomic lock alignment in data_t by @abouteiller in #685
  • The device task is now released by the DSL by @bosilca in #688
  • Reenable the memory eviction code by @bosilca in #679
  • List ordered push: search from back if lower than pivot by @devreal in #693
  • Add an icl platform file, move saturn platform to legacy by @abouteiller in #692
  • Contrib/copycheck by @abouteiller in #574
  • bugfix: dtd would not run cpu hooks when compiled with cuda by @abouteiller in #697
  • v4.0.2411 changelog by @abouteiller in #699
  • Fix a race condition in DTD for the local termdet by @therault in #698
  • Bring back support for MPI allow_overtake by @bosilca in #704
  • Fix function name for parsec_atomic_fetch_add_int64 by @devreal in #705

New Contributors

Full Changelog: https://github.com/ICLDisco/parsec/commits/parsec-4.0.2411