[Feature][connector-file-base] The number of files allocated to subtasks is unbalanced. #8451

Open · 3 tasks done
JeremyXin (Contributor) opened this issue on Jan 4, 2025 · 0 comments
Search before asking

  • I have searched the existing feature requests and found no similar requirement.

Description

When I read files using HdfsFile as the source, the output log showed that some subtasks were assigned multiple files while the remaining subtasks were assigned none. As a result, some subtasks sit idle and process no file reads while others must read multiple files, which degrades performance. The log output, with sensitive HDFS path information removed, is as follows:

2025-01-02 17:04:34,572 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 0 is assigned to [hdfs://xxx,hdfs://xxx,hdfs://xxx]
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 1 is assigned to []
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [2]
2025-01-02 17:04:34,574 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 2 is assigned to []
... (all assigned to [])
2025-01-02 17:04:34,577 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [9]
2025-01-02 17:04:34,577 INFO [s.c.s.f.s.BaseFileSourceReader] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=50002}] - Closed the bounded File source
2025-01-02 17:04:34,578 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 9 is assigned to [hdfs://xxx]

After analyzing the source code, I found that the existing allocation algorithm assigns each file to a subtask by taking the file path's hashcode modulo the parallelism, which is effectively random and can leave some subtasks with no files at all. In my opinion, could we use a round-robin file allocation algorithm instead, so that the file load of each SubTask is balanced and processing performance improves? A sketch of the idea follows.
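To make the proposal concrete, here is a minimal, self-contained sketch. It is not the actual FileSourceSplitEnumerator code; hashOwner and roundRobinAssign are hypothetical helper names used only to contrast hashcode-based owner selection with round-robin assignment:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitAssignmentSketch {

    // Simplified model of the current behaviour: the owning subtask is
    // derived from the hash of the file path, so several files can collide
    // on one subtask while other subtasks receive nothing.
    static int hashOwner(String filePath, int parallelism) {
        return (filePath.hashCode() & Integer.MAX_VALUE) % parallelism;
    }

    // Proposed behaviour: deal files out round-robin so the per-subtask
    // file counts differ by at most one.
    static Map<Integer, List<String>> roundRobinAssign(List<String> files, int parallelism) {
        Map<Integer, List<String>> assignment = new HashMap<>();
        for (int i = 0; i < files.size(); i++) {
            assignment.computeIfAbsent(i % parallelism, k -> new ArrayList<>())
                      .add(files.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> files = List.of("hdfs://a", "hdfs://b", "hdfs://c", "hdfs://d");
        int parallelism = 4;

        // Hash-based: counts may be skewed, depending on how the paths hash.
        Map<Integer, List<String>> hashed = new HashMap<>();
        for (String f : files) {
            hashed.computeIfAbsent(hashOwner(f, parallelism), k -> new ArrayList<>()).add(f);
        }
        System.out.println("hash-based:  " + hashed);

        // Round-robin: each of the 4 subtasks receives exactly one file.
        System.out.println("round-robin: " + roundRobinAssign(files, parallelism));
    }
}
```

With round-robin assignment the per-subtask file counts differ by at most one, regardless of how the path hashes happen to collide. If file sizes vary widely, sorting the files by size before dealing them out could balance the byte load as well, but that is beyond this sketch.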

Usage Scenario

This feature can improve file processing performance whenever a file-based connector is used as the source.

Related issues

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct.