Search before asking
I have searched the existing feature requests and found no similar one.
Description
When I read files using HdfsFile as the source, I found from the output log that some subtasks were assigned multiple files while the remaining subtasks were assigned none. As a result, some subtasks sit idle and process no file reads while others must handle multiple files, which degrades performance. The log output, with sensitive HDFS path information removed, is as follows:
2025-01-02 17:04:34,572 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 0 is assigned to [hdfs://xxx,hdfs://xxx,hdfs://xxx]
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 1 is assigned to []
2025-01-02 17:04:34,573 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [2]
2025-01-02 17:04:34,574 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 2 is assigned to []
... (all assigned to [])
2025-01-02 17:04:34,577 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - Assigned splits to reader [9]
2025-01-02 17:04:34,577 INFO [s.c.s.f.s.BaseFileSourceReader] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=50002}] - Closed the bounded File source
2025-01-02 17:04:34,578 INFO [.s.s.FileSourceSplitEnumerator] [BlockingWorker-TaskGroupLocation{jobId=927125150820728833, pipelineId=1, taskGroupId=1}] - SubTask 9 is assigned to [hdfs://xxx]
After analyzing the source code, I found that the existing file allocation algorithm assigns each file essentially at random, based on the remainder of the file path's hashcode divided by the parallelism. In my opinion, could a round-robin file allocation algorithm be used instead, so that the file load of each SubTask is balanced and processing performance improves? The two strategies are sketched below.
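For illustration only, here is a minimal, self-contained sketch contrasting the two assignment strategies. This is not SeaTunnel's actual FileSourceSplitEnumerator code: the class and method names are hypothetical, and the hash-based variant only approximates the current behavior described above.

```java
import java.util.List;

public class SplitAssignmentDemo {

    // Hash-based owner selection (approximates the current behavior):
    // files whose path hashcodes collide modulo the parallelism all land
    // on the same subtask, which can leave other subtasks with no files.
    static int hashOwner(String splitId, int parallelism) {
        return Math.floorMod(splitId.hashCode(), parallelism);
    }

    // Round-robin owner selection (the proposal): the i-th discovered
    // file goes to subtask i % parallelism, so the per-subtask counts
    // never differ by more than one.
    static int roundRobinOwner(int splitIndex, int parallelism) {
        return splitIndex % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        List<String> splits = List.of(
                "hdfs://ns/data/part-0000", "hdfs://ns/data/part-0001",
                "hdfs://ns/data/part-0002", "hdfs://ns/data/part-0003",
                "hdfs://ns/data/part-0004", "hdfs://ns/data/part-0005");

        // Count how many files each subtask receives under each strategy.
        int[] byHash = new int[parallelism];
        int[] byRoundRobin = new int[parallelism];
        for (int i = 0; i < splits.size(); i++) {
            byHash[hashOwner(splits.get(i), parallelism)]++;
            byRoundRobin[roundRobinOwner(i, parallelism)]++;
        }

        for (int subTask = 0; subTask < parallelism; subTask++) {
            System.out.printf("SubTask %d: hash=%d, roundRobin=%d%n",
                    subTask, byHash[subTask], byRoundRobin[subTask]);
        }
    }
}
```

With round-robin assignment the split counts per subtask differ by at most one, whereas the hash-based variant can cluster several files on one subtask, matching the skew shown in the log above.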
Usage Scenario
This feature can improve file processing performance when a file connector is used as the source.
Related issues
No response
Are you willing to submit a PR?
Code of Conduct