Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CommandQueue redirecting usages straight to HWCQ #17219

Merged
merged 10 commits into from
Jan 29, 2025

Conversation

ayerofieiev-tt
Copy link
Member

@ayerofieiev-tt ayerofieiev-tt commented Jan 28, 2025

Ticket

#17208

Problem description

Over time CommandQueue stopped playing any particular function.
Today it is simply proxying calls to HWCQ via creating temporary command structures and immediately routing their execution with a big switch statement, calling into hwcq methods.
This adds unreasonable complexity.

After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt , we agreed to remove this class.

What's changed

Remove CommandQueue class.
Redirect all usage to HWCQ.
HWCommandQueue is renamed to CommandQueue

Note:

  • command_queue.hpp was not removed. It contains definitions of commands used by hardware_command_queue, I would prefer us to clean this up in the next PR.

Checklist

@ayerofieiev-tt ayerofieiev-tt changed the title Remove CommandQueue class redirecting usages straight to HWCQ Remove CommandQueue redirecting usages straight to HWCQ Jan 28, 2025
Copy link
Contributor

@omilyutin-tt omilyutin-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty straightforward IMO, thanks for tackling this Artem!

Do we want to rename HWCommandQueue to CommandQueue though?

tt_metal/impl/device/device.cpp Outdated Show resolved Hide resolved
Comment on lines -191 to -216
friend void EnqueueTraceImpl(CommandQueue& cq, uint32_t trace_id, bool blocking);
friend void EnqueueProgramImpl(CommandQueue& cq, Program& program, bool blocking);
friend void EnqueueReadBufferImpl(
CommandQueue& cq,
std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>> buffer,
void* dst,
const BufferRegion& region,
bool blocking);
friend void EnqueueWriteBufferImpl(
CommandQueue& cq,
const std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>>& buffer,
HostDataType src,
const BufferRegion& region,
bool blocking);
friend void EnqueueGetBufferAddrImpl(void* dst_buf_addr, const Buffer* buffer);
friend void EnqueueRecordEventImpl(
CommandQueue& cq, const std::shared_ptr<Event>& event, tt::stl::Span<const SubDeviceId> sub_device_ids);
friend void EnqueueWaitForEventImpl(CommandQueue& cq, const std::shared_ptr<Event>& event);
friend void FinishImpl(CommandQueue& cq, tt::stl::Span<const SubDeviceId> sub_device_ids);
friend CommandQueue;
friend detail::Program_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time I looked at HWCQ I could not understand what it queues if there are no public methods to enqueue. Then realized we are doing a trick.

Copy link
Contributor

@tt-asaigal tt-asaigal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem, this gets the ball rolling on an important change :) Left a few comments.

@@ -228,114 +226,21 @@ struct CommandInterface {
tt::stl::Span<const SubDeviceId> sub_device_ids;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please remove this while we're at it? This was used for deprecated CommandQueue functionality.
Same for RuntimeArgsMetadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

inline static uint32_t num_passthrough_cqs = 0;
};

} // namespace v0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing this class and cleaning things up. We have a bunch of command structs defined in this file, ex: EnqueueTraceCommand, EnqueueProgramCommand and EnqueueTerminateCommand.
These are used inside the individual enqueue impls to do command specific processing. These should not be in this file. Can we please move them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Want to make this in the PR after this is merged together with file names.

"Command Queue could not finish: device hang due to illegal NoC transaction. See {} for details.",
tt::watcher_get_log_file_name());
}

void EnqueueTrace(CommandQueue& cq, uint32_t trace_id, bool blocking) {
void EnqueueTrace(HWCommandQueue& cq, uint32_t trace_id, bool blocking) {
detail::DispatchStateCheck(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All Enqueue* APIs are defined here. Wondering if this is the right way of doing things long term. I think eventually these should be moved out of this file since these are just wrappers to member functions, and don't really have anything to do with the CommandQueue impl itself. Perhaps tt_metal.cpp or some other API specific file would be more suited here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. They don't belong here long term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not moving them right now to minimize changes in this PR.

@@ -22,20 +22,18 @@
#include "worker_config_buffer.hpp"
#include "program_impl.hpp"
#include "trace_buffer.hpp"
#include "hardware_command_queue.hpp"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to fix file names and includes in a separate PR

@@ -104,10 +102,48 @@ class HWCommandQueue {
void set_expected_num_workers_completed_for_sub_device(uint32_t sub_device_index, uint32_t num_workers);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tt-asaigal these methods will get cleaned up? These were added mainly for MeshCommandQueue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -998,13 +996,15 @@ void Device::init_command_queue_device() {
}

void Device::initialize_synchronous_sw_cmd_queue() {
TT_THROW("Slow Dispatch is not supported in this change. Help needed to make it right 😱");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix, cause this breaks tests

private:
uint32_t id;
uint32_t id_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add default initialization if you are already here :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but not in this PR.
I had to touch these members to comply with what command_queue gave before

const BufferRegion& region,
bool blocking,
tt::stl::Span<const SubDeviceId> sub_device_ids = {});
void enqueue_read_buffer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a time to get rid of this overload.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha. not time yet. minimizing changes only to the core of it.

void enqueue_wait_for_event(const std::shared_ptr<Event>& sync_event, bool clear_count = false);

void enqueue_write_buffer(
const std::variant<std::reference_wrapper<Buffer>, std::shared_ptr<Buffer>>& buffer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll note

Copy link
Contributor

@tt-dma tt-dma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines -24 to -28
TEST_F(CommandQueueFixture, TestCannotAccessCommandQueueForClosedDevice) {
EXPECT_NO_THROW(device_->command_queue());
CloseDevice(device_);
EXPECT_ANY_THROW(device_->command_queue());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about keeping this test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no old CommandQueue anymore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the old CommandQueue is removed.

This test technically works for both the old CommandQueue and the HardwareCommandQueue so I was wondering if it was worth keeping around.

…to query cq from device. cq might not be present in case of slow dispatch
@ayerofieiev-tt ayerofieiev-tt changed the title Remove CommandQueue redirecting usages straight to HWCQ 17208: Remove CommandQueue redirecting usages straight to HWCQ Jan 29, 2025
@ayerofieiev-tt ayerofieiev-tt changed the title 17208: Remove CommandQueue redirecting usages straight to HWCQ #17208: Remove CommandQueue redirecting usages straight to HWCQ Jan 29, 2025
@ayerofieiev-tt ayerofieiev-tt changed the title #17208: Remove CommandQueue redirecting usages straight to HWCQ Remove CommandQueue redirecting usages straight to HWCQ Jan 29, 2025
@ayerofieiev-tt ayerofieiev-tt linked an issue Jan 29, 2025 that may be closed by this pull request
@ayerofieiev-tt ayerofieiev-tt merged commit 58f9654 into main Jan 29, 2025
10 checks passed
@ayerofieiev-tt ayerofieiev-tt deleted the ay/hwcq_friends branch January 29, 2025 20:00
williamlyTT pushed a commit that referenced this pull request Jan 30, 2025
### Ticket
#17208

### Problem description
Over time `CommandQueue` stopped playing any particular function.
Today it is simply proxying calls to HWCQ via creating temporary command
structures and immediately routing their execution with a big switch
statement, calling into hwcq methods.
This adds unreasonable complexity. 

After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt ,
we agreed to remove this class.

### What's changed
Remove CommandQueue class.
Redirect all usage to HWCQ.
HWCommandQueue is renamed to CommandQueue

Note:
* command_queue.hpp was not removed. It contains definitions of commands
used by hardware_command_queue, I would prefer us to clean this up in
the next PR.

### Checklist
- [x] [Post commit
CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350)
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261)
- [x] [T3k
Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389)
-
yieldthought pushed a commit that referenced this pull request Jan 31, 2025
### Ticket
#17208

### Problem description
Over time `CommandQueue` stopped playing any particular function.
Today it is simply proxying calls to HWCQ via creating temporary command
structures and immediately routing their execution with a big switch
statement, calling into hwcq methods.
This adds unreasonable complexity. 

After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt ,
we agreed to remove this class.

### What's changed
Remove CommandQueue class.
Redirect all usage to HWCQ.
HWCommandQueue is renamed to CommandQueue

Note:
* command_queue.hpp was not removed. It contains definitions of commands
used by hardware_command_queue, I would prefer us to clean this up in
the next PR.

### Checklist
- [x] [Post commit
CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350)
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261)
- [x] [T3k
Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389)
-
nikileshx pushed a commit to nikileshx/tt-metal that referenced this pull request Feb 3, 2025
…17219)

### Ticket
tenstorrent#17208

### Problem description
Over time `CommandQueue` stopped playing any particular function.
Today it is simply proxying calls to HWCQ via creating temporary command
structures and immediately routing their execution with a big switch
statement, calling into hwcq methods.
This adds unreasonable complexity. 

After the sync with @tt-asaigal @dmakoviichuk-tt @cfjchu @omilyutin-tt ,
we agreed to remove this class.

### What's changed
Remove CommandQueue class.
Redirect all usage to HWCQ.
HWCommandQueue is renamed to CommandQueue

Note:
* command_queue.hpp was not removed. It contains definitions of commands
used by hardware_command_queue, I would prefer us to clean this up in
the next PR.

### Checklist
- [x] [Post commit
CI](https://github.com/tenstorrent/tt-metal/actions/runs/13023052350)
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/runs/13023279261)
- [x] [T3k
Frequent](https://github.com/tenstorrent/tt-metal/actions/runs/13021799389)
-
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace CommandQueue in MeshDevice
8 participants