-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove CommandQueue redirecting usages straight to HWCQ #17219
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty straightforward IMO, thanks for tackling this Artem!
Do we want to rename HWCommandQueue
to CommandQueue
though?
friend void EnqueueTraceImpl(CommandQueue& cq, uint32_t trace_id, bool blocking); | ||
friend void EnqueueProgramImpl(CommandQueue& cq, Program& program, bool blocking); | ||
friend void EnqueueReadBufferImpl( | ||
CommandQueue& cq, | ||
std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>> buffer, | ||
void* dst, | ||
const BufferRegion& region, | ||
bool blocking); | ||
friend void EnqueueWriteBufferImpl( | ||
CommandQueue& cq, | ||
const std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>>& buffer, | ||
HostDataType src, | ||
const BufferRegion& region, | ||
bool blocking); | ||
friend void EnqueueGetBufferAddrImpl(void* dst_buf_addr, const Buffer* buffer); | ||
friend void EnqueueRecordEventImpl( | ||
CommandQueue& cq, const std::shared_ptr<Event>& event, tt::stl::Span<const SubDeviceId> sub_device_ids); | ||
friend void EnqueueWaitForEventImpl(CommandQueue& cq, const std::shared_ptr<Event>& event); | ||
friend void FinishImpl(CommandQueue& cq, tt::stl::Span<const SubDeviceId> sub_device_ids); | ||
friend CommandQueue; | ||
friend detail::Program_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First time I looked at HWCQ I could not understand what it queues if there are no public methods to enqueue. Then realized we are doing a trick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Artem, this gets the ball rolling on an important change :) Left a few comments.
@@ -228,114 +226,21 @@ struct CommandInterface { | |||
tt::stl::Span<const SubDeviceId> sub_device_ids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please remove this while we're at it? This was used for deprecated CommandQueue
functionality.
Same for RuntimeArgsMetadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
inline static uint32_t num_passthrough_cqs = 0; | ||
}; | ||
|
||
} // namespace v0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for removing this class and cleaning things up. We have a bunch of command structs defined in this file, ex: EnqueueTraceCommand
, EnqueueProgramCommand
and EnqueueTerminateCommand
.
These are used inside the individual enqueue impls to do command specific processing. These should not be in this file. Can we please move them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Want to make this in the PR after this is merged together with file names.
"Command Queue could not finish: device hang due to illegal NoC transaction. See {} for details.", | ||
tt::watcher_get_log_file_name()); | ||
} | ||
|
||
void EnqueueTrace(CommandQueue& cq, uint32_t trace_id, bool blocking) { | ||
void EnqueueTrace(HWCommandQueue& cq, uint32_t trace_id, bool blocking) { | ||
detail::DispatchStateCheck(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All Enqueue*
APIs are defined here. Wondering if this is the right way of doing things long term. I think eventually these should be moved out of this file since these are just wrappers to member functions, and don't really have anything to do with the CommandQueue
impl itself. Perhaps tt_metal.cpp
or some other API specific file would be more suited here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. They don't belong here long term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not moving them right now to minimize changes in this PR.
@@ -22,20 +22,18 @@ | |||
#include "worker_config_buffer.hpp" | |||
#include "program_impl.hpp" | |||
#include "trace_buffer.hpp" | |||
#include "hardware_command_queue.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to fix file names and includes in a separate PR
@@ -104,10 +102,48 @@ class HWCommandQueue { | |||
void set_expected_num_workers_completed_for_sub_device(uint32_t sub_device_index, uint32_t num_workers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tt-asaigal these methods will get cleaned up? These were added mainly for MeshCommandQueue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
tt_metal/impl/device/device.cpp
Outdated
@@ -998,13 +996,15 @@ void Device::init_command_queue_device() { | |||
} | |||
|
|||
void Device::initialize_synchronous_sw_cmd_queue() { | |||
TT_THROW("Slow Dispatch is not supported in this change. Help needed to make it right 😱"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix, cause this breaks tests
private: | ||
uint32_t id; | ||
uint32_t id_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add default initialization if you are already here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but not in this PR.
I had to touch these members to comply with what command_queue
gave before
const BufferRegion& region, | ||
bool blocking, | ||
tt::stl::Span<const SubDeviceId> sub_device_ids = {}); | ||
void enqueue_read_buffer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a time to get rid of this overload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha. not time yet. minimizing changes only to the core of it.
void enqueue_wait_for_event(const std::shared_ptr<Event>& sync_event, bool clear_count = false); | ||
|
||
void enqueue_write_buffer( | ||
const std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>>& buffer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi @nhuang-tt
TEST_F(CommandQueueFixture, TestCannotAccessCommandQueueForClosedDevice) { | ||
EXPECT_NO_THROW(device_->command_queue()); | ||
CloseDevice(device_); | ||
EXPECT_ANY_THROW(device_->command_queue()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about keeping this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no old CommandQueue anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the old CommandQueue is removed.
This test technically works for both the old CommandQueue and the HardwareCommandQueue so I was wondering if it was worth keeping around.
…to query cq from device. cq might not be present in case of slow dispatch
### Ticket #17208 ### Problem description Over time `CommandQueue` stopped playing any particular function. Today it is simply proxying calls to HWCQ via creating temporary command structures and immediately routing their execution with a big switch statement, calling into hwcq methods. This adds unreasonable complexity. After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt , we agreed to remove this class. ### What's changed Remove CommandQueue class. Redirect all usage to HWCQ. HWCommandQueue is renamed to CommandQueue Note: * command_queue.hpp was not removed. It contains definitions of commands used by hardware_command_queue, I would prefer us to clean this up in the next PR. ### Checklist - [x] [Post commit CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350) - [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261) - [x] [T3k Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389) -
### Ticket #17208 ### Problem description Over time `CommandQueue` stopped playing any particular function. Today it is simply proxying calls to HWCQ via creating temporary command structures and immediately routing their execution with a big switch statement, calling into hwcq methods. This adds unreasonable complexity. After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt , we agreed to remove this class. ### What's changed Remove CommandQueue class. Redirect all usage to HWCQ. HWCommandQueue is renamed to CommandQueue Note: * command_queue.hpp was not removed. It contains definitions of commands used by hardware_command_queue, I would prefer us to clean this up in the next PR. ### Checklist - [x] [Post commit CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350) - [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261) - [x] [T3k Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389) -
…17219) ### Ticket tenstorrent#17208 ### Problem description Over time `CommandQueue` stopped playing any particular function. Today it is simply proxying calls to HWCQ via creating temporary command structures and immediately routing their execution with a big switch statement, calling into hwcq methods. This adds unreasonable complexity. After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt , we agreed to remove this class. ### What's changed Remove CommandQueue class. Redirect all usage to HWCQ. HWCommandQueue is renamed to CommandQueue Note: * command_queue.hpp was not removed. It contains definitions of commands used by hardware_command_queue, I would prefer us to clean this up in the next PR. ### Checklist - [x] [Post commit CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350) - [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261) - [x] [T3k Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389) -
Ticket
#17208
Problem description
Over time
CommandQueue
stopped playing any particular function.Today it is simply proxying calls to HWCQ via creating temporary command structures and immediately routing their execution with a big switch statement, calling into hwcq methods.
This adds unreasonable complexity.
After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt , we agreed to remove this class.
What's changed
Remove CommandQueue class.
Redirect all usage to HWCQ.
HWCommandQueue is renamed to CommandQueue
Note:
Checklist