BH prep #5174

Closed
16 of 23 tasks
pgkeller opened this issue Feb 7, 2024 · 11 comments

@pgkeller
Contributor

pgkeller commented Feb 7, 2024

This is a list of todos for BH.
Must-haves to run anything:

Phase1 for BH bring up - target 5/2

@abhullar-tt

Reem

Almeet/David


OLD NOTES BELOW:

Infra Flow - Versim scramble

  1. Bring up Versim in GitLab
  2. Get the build working on public GitHub
  3. Fork to GitLab and test on Versim

Development flow (MVP: metal running slow dispatch with a few ops) -

  1. Do the build stuff (make blackhole arch available and ability to build), fork into gitlab, validate versim submodule build.
  2. bring up runtime with slow dispatch and do metal "hello-world" in the github
  3. bring up simple ops with simple config (single core, native numeric) in the github
  4. bring up less simple ops in the github
  5. bring up key features/tool (watcher) + bring in key fixes (64 byte alignment) in the github

note: if we can staff (2) - (5) pretty soon, we should try to test on versim; if not, we should test on the cards.
note: this flow gives us versim as a backup platform in case things don't work on the cards, but developing and testing on both sides (github/gitlab) is very cumbersome.

TODO:

  • Need a runtime guy to bring up runtime and address the runtime related items.
  • Need to define the simple operator configuration

Phase2 for BH 30-day milestone & Open Source BH SW - target 5/17
Metal Goal - Single Tensix OP
[ ] Versim on CI
[ ] MatMul workload to stress-test single Tensix core

Phase3 for BH 60-day milestone - target 6/17
Metal goal - Multi-Tensix OP
[ ] MatMul workload to stress-test Multi Tensix cores
[ ]

To be prioritized -->
[ ] ? NOC/tensix shared access (need to enumerate)
[ ] Eth IRAM? TBD if BH has IRAM on ethernet. If true - need changes for Eth support

Performance changes/new features (not required to run):
[ ] NOC has a RISC-NOC command fifo which allows more non-blocking transactions in flight (legacy interface still works)

  • NOC non-rectangular multicast
  • how to use interrupts w/ metallium
  • risc vector units

SFPU/I Optimizations / New Features:

  • scoreboarding, can remove NOPs
  • use new insns: arcip, aexp, conditionals
  • explore RISC access to dst_reg and if/how to expose

Debug/analysis:

  • NOC traffic counters to help find network congestion
@aliuTT
Contributor

aliuTT commented Feb 7, 2024

For performance changes:

  • using both Ethernet RISCs for bidirectional dataflow. This would require some stress testing to make sure the Ethernet hardware works as expected.
  • try dram risc

@mo-tenstorrent
Contributor

At some point I overheard BH has interrupts support, at least on BRISC, if I remember correctly. Should confirm its functionality.

@DrJessop
Contributor

DrJessop commented Feb 9, 2024

"64B-alignment for reads" Just wanted to make sure not a typo, since in GS/WH it's 32B-alignment.

@DrJessop
Contributor

DrJessop commented Feb 9, 2024

Also, I recall there being some issues with there being a relatively small NOC max packet size of 8KB. Wondering if BH will have the same limitation.
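For illustration only, a minimal Python sketch of what an 8KB packet limit implies for software: any larger transfer has to be split into chunks no bigger than the limit. The function name `split_into_packets` and the constant `NOC_MAX_PACKET_BYTES` are hypothetical and not taken from the metal codebase; whether BH keeps the same limit is still the open question above.

```python
from typing import Iterator, Tuple

# 8KB limit as recalled in this thread for earlier parts; the BH value is unconfirmed.
NOC_MAX_PACKET_BYTES = 8 * 1024


def split_into_packets(
    addr: int, size: int, max_packet: int = NOC_MAX_PACKET_BYTES
) -> Iterator[Tuple[int, int]]:
    """Yield (address, length) pairs that cover [addr, addr + size) in order."""
    if size < 0 or max_packet <= 0:
        raise ValueError("size must be >= 0 and max_packet must be > 0")
    offset = 0
    while offset < size:
        # The last chunk may be shorter than max_packet.
        chunk = min(max_packet, size - offset)
        yield addr + offset, chunk
        offset += chunk
```

Under these assumptions a 20KB transfer becomes two full 8KB packets followed by one 4KB packet.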

@pgkeller
Contributor Author

pgkeller commented Feb 9, 2024

"64B-alignment for reads" Just wanted to make sure not a typo, since in GS/WH it's 32B-alignment.

not a typo, needs to change
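To make the change concrete, here is a minimal Python sketch of rounding an address up to the required read alignment. The table `READ_ALIGNMENT_BYTES` and the helpers `align_up` and `is_aligned` are hypothetical names, not metal APIs; only the 32B (GS/WH) and 64B (BH) values come from this thread.

```python
# Hypothetical table; only the byte values come from the discussion above.
READ_ALIGNMENT_BYTES = {"grayskull": 32, "wormhole": 32, "blackhole": 64}


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


def is_aligned(addr: int, arch: str) -> bool:
    """Check whether addr meets the read alignment assumed for arch."""
    return addr % READ_ALIGNMENT_BYTES[arch] == 0
```

With these assumptions, address 96 is valid under 32B alignment but has to move up to 128 under 64B alignment, which is why buffers laid out for GS/WH may need revisiting.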

@tt-rkim
Collaborator

tt-rkim commented Apr 2, 2024

Some risks I see from infra / CI side:

  • Versim cannot run on cloud network / metal CI unfortunately... we need some serious rethinking around CI / dev for this
  • Similarly, this means versim-related source files cannot be checked into metal GH repo
  • Unknown timelines from syseng for HW delivery - both non-ethernet and ethernet
  • Fuzz testing our infra scripts + knowledge + syseng tools on BM BH systems
  • Availability of BH in cloud

@TT-billteng
Collaborator

perhaps we need to fork TTMetal on gitlab side and make necessary modifications

@jliangTT

jliangTT commented Apr 9, 2024

Versim cannot run on cloud network / metal CI unfortunately... we need some serious rethinking around CI / dev for this

my 2c: versim is a stop-gap that lets us front-load some of the software development (building and bringing up our sw stack, plus some unit tests) until hardware comes back. Once the hardware comes back, testing and development will move to hardware and versim should become less relevant. As such, CI on versim may be throwaway work.

@jliangTT

per @rtawfik01 earlier on the current coverage on versim for Buda -

Unary datacopy -> start with single tile, single core, bfloat16, then test more combinations, e.g. more tiles, different data formats, etc.
Unary SFPU ops
Eltwise binary ops
Reduce ops
Matrix multiply/Convs
Here we can start more testing combination ops/graphs -> layernorms, softmaxes, feedforwards, etc

With buda, we tested all kernels, but we started with the list above
And always try every op with the simplest scenario (bfloat 16, single core, single tile), then add combinations as the tests pass
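As a rough sketch of the ordering described above, the Python snippet below enumerates test configurations per op with the simplest scenario (bfloat16, single core, single tile) first, then configurations that change progressively more axes. All names and axis values other than that simplest scenario are placeholders for illustration, not the actual Buda or metal test matrix.

```python
from itertools import product
from typing import Iterator, Tuple

# Placeholder axes; index 0 on each axis is the simplest scenario from the thread.
OPS = ["unary_datacopy", "unary_sfpu", "eltwise_binary", "reduce", "matmul"]
DATA_FORMATS = ["bfloat16", "float16", "float32"]
CORE_COUNTS = [1, 2, 4]
TILE_COUNTS = [1, 2, 8]


def bringup_order() -> Iterator[Tuple[str, str, int, int]]:
    """Yield (op, data_format, cores, tiles), simplest configuration first per op."""
    configs = list(
        product(range(len(DATA_FORMATS)), range(len(CORE_COUNTS)), range(len(TILE_COUNTS)))
    )
    # Fewer non-default axes first, then smaller deviations from the default.
    configs.sort(key=lambda c: (sum(1 for i in c if i), sum(c)))
    for op in OPS:
        for f, c, t in configs:
            yield op, DATA_FORMATS[f], CORE_COUNTS[c], TILE_COUNTS[t]
```

A runner built on this would stop advancing within an op as soon as a configuration fails, which matches the "add combinations as the tests pass" approach.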

Also, Reem is getting the llk submodule ready, and we should let her know once metal builds with the current blackhole arch compiling through the metal stack.

@abhullar-tt
Contributor

abhullar-tt commented Jul 31, 2024

TODOs as of 31/07/2024

  • [Blackhole xfunc] Get faster BH machines #10976
  • Run microbenchmarks to make sure they are functional on current BH dev machines
  • Run microbenchmarks on faster BHs to get accurate BW measurements
  • Run MMs and Convs that saturate BW
  • Op/LLK parity

FYI @davorchap

@prajaramanTT prajaramanTT added this to the BHLD milestone Jan 10, 2025
@abhullar-tt
Contributor

Most of the items from the initial list have been completed. Outstanding work has been broken out into separate issues and tracked in https://github.com/orgs/tenstorrent/projects/50/views/1
