TG Llama CCL Perf Master #16640

Open
16 tasks
SeanNijjar opened this issue Jan 10, 2025 · 0 comments
Labels
op_cat: ccl, perf (for issues tracking performance problems/improvements), tg-llama

SeanNijjar commented Jan 10, 2025

CCL Add New Commands for Perf

Reduce Scatter

All Reduce

March to 12GBps per link

CCL Backend Kernels Optimization:

Optimize main loop and noc burst commands:

  • CCL V2 (Reference Interpreter Kernel) Optimizations #16255
  • (partly on branch) Optimize outer loop overheads
    • Measured improvement (on branch): flattening the try-advance switch-case statements and merging the advance and completion checks gives big savings of ~150-170ns per outer loop iteration (one iteration per command or packet, whichever is more frequent), which can represent a 15-20% improvement
  • (on branch) Use "one_packet" data flow APIs
    • Estimated improvements: save 50-100ns per CB packet
  • Allow multiple open packets in flight to avoid blocking read/write barriers for commands
  • Optimize packet header initialization
    • a surprising amount of time is spent on this
    • enable caching
    • simplify packet structure (see Fabric EDM Optimizations)
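
One way to picture the "enable caching" item above: fill in the header fields that stay constant for a whole command once, then touch only the per-packet fields in the hot loop. The sketch below is illustrative only. `PacketHeader`, its fields, and `CachedHeader` are hypothetical names and do not reflect the actual fabric packet header layout.

```cpp
#include <cstdint>

// Hypothetical header layout, for illustration only.
// This is NOT the real fabric/EDM packet header.
struct PacketHeader {
    uint8_t  command_type;   // constant for the lifetime of a command
    uint8_t  dest_chip;      // constant for the lifetime of a command
    uint16_t payload_bytes;  // changes per packet
    uint32_t reserved;
    uint64_t dest_noc_addr;  // changes per packet
};

// Caches the invariant part of the header so the per-packet path
// only writes the fields that actually change.
class CachedHeader {
public:
    CachedHeader(uint8_t command_type, uint8_t dest_chip) : hdr_{} {
        // One-time initialization, done outside the main loop.
        hdr_.command_type = command_type;
        hdr_.dest_chip = dest_chip;
    }

    // Per-packet update: two stores instead of a full re-initialization.
    const PacketHeader& next(uint64_t dest_noc_addr, uint16_t payload_bytes) {
        hdr_.dest_noc_addr = dest_noc_addr;
        hdr_.payload_bytes = payload_bytes;
        return hdr_;
    }

private:
    PacketHeader hdr_;
};
```

Whether this saves time in the real kernel depends on how many header fields are truly invariant per command, which is also why simplifying the packet structure is listed alongside caching.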

Fabric EDM Optimization:

  • Merge command type and noc command type in packet header.
    • @tt-aho has in progress commit here:
  • Decouple worker acks from EDM acks.
    • This will be essential if we keep one worker per link
    • move EDM to raw read/write offsets into channel
    • Update worker connection open/close to get credits and channel offsets
      • currently each worker requires EDM in the loop for every channel buffer (ack)
        • Instead, EDM and worker should be able to manage read/write pointers independently without blocking
      • Will change the state machine as a side effect, so it may negate the next item
  • Short circuit sender states from send to worker ack
  • Enable write-atomic operation
    • Use available back 16B to store atomic inc info
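
The "decouple worker acks from EDM acks" item above can be sketched as a channel in which the worker owns a write counter and the EDM owns a read counter, so neither side waits for a per-buffer ack from the other. This is a minimal single-threaded model under stated assumptions: `ChannelState`, `worker_try_push`, and `edm_try_pop` are hypothetical names, the slot count is assumed to be a power of two, and real hardware would need the counters to be visible across cores, which is not modeled here.

```cpp
#include <cstdint>

// Hypothetical channel bookkeeping, for illustration only.
// num_slots is assumed to be a power of two so counter wraparound stays consistent.
struct ChannelState {
    uint32_t num_slots;   // channel capacity in buffer slots
    uint32_t slot_bytes;  // size of one buffer slot
    uint32_t wr_count;    // total slots written; advanced only by the worker
    uint32_t rd_count;    // total slots drained; advanced only by the EDM
};

// Credits available to the worker: capacity minus slots still in flight.
inline uint32_t free_slots(const ChannelState& ch) {
    return ch.num_slots - (ch.wr_count - ch.rd_count);
}

// Worker side: claim the next slot if a credit is available.
// Returns the raw byte offset into the channel via offset_out.
inline bool worker_try_push(ChannelState& ch, uint32_t& offset_out) {
    if (free_slots(ch) == 0) {
        return false;  // no credits; the worker retries later
    }
    offset_out = (ch.wr_count % ch.num_slots) * ch.slot_bytes;
    ++ch.wr_count;
    return true;
}

// EDM side: drain the oldest written slot if there is one.
inline bool edm_try_pop(ChannelState& ch, uint32_t& offset_out) {
    if (ch.wr_count == ch.rd_count) {
        return false;  // nothing to send
    }
    offset_out = (ch.rd_count % ch.num_slots) * ch.slot_bytes;
    ++ch.rd_count;
    return true;
}
```

In this model the worker would learn `num_slots`, `slot_bytes`, and its starting credits when the connection is opened, which is roughly what "update worker connection open/close to get credits and channel offsets" describes.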

Perf Reporting

SeanNijjar added the op_cat: ccl, perf (for issues tracking performance problems/improvements), and tg-llama labels on Jan 10, 2025
SeanNijjar changed the title from "TG Llama CCL Perf Burndown" to "TG Llama CCL Perf Master" on Jan 10, 2025