[CELEBORN-1679] Estimated ApplicationDiskUsage in cluster should be multiplied by worker count. #2865

Closed
@Z1Wu Z1Wu commented Oct 30, 2024

What changes were proposed in this pull request?

Assumption: for a given application, shuffle data is distributed evenly across all workers, so the application's disk usage on one worker can be used to estimate its disk usage across the whole cluster.

Logic for estimating application disk usage:

  1. Get the application's disk usage on one worker from that worker's heartbeat. This is taken as the expected disk usage on every worker.
  2. Multiply the expected per-worker disk usage by the current number of workers to approximate the application's total disk usage across the cluster.
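
The two steps above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code in this patch; `ProposedEstimate` and `estimateClusterUsage` are hypothetical names.

```scala
object ProposedEstimate {
  // perWorkerUsage: bytes the application uses on the one reporting worker
  //                 (taken from that worker's heartbeat).
  // workerCount:    current number of workers in the cluster.
  def estimateClusterUsage(perWorkerUsage: Long, workerCount: Int): Long = {
    require(perWorkerUsage >= 0, "usage must be non-negative")
    require(workerCount >= 0, "worker count must be non-negative")
    // Relies on the stated assumption that every worker holds about the same amount.
    perWorkerUsage * workerCount
  }
}
```

For example, `ProposedEstimate.estimateClusterUsage(10L, 4)` gives 40, which is only meaningful if the even-distribution assumption holds.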

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

[Two screenshots were attached here; the images are not preserved.]

@Z1Wu Z1Wu force-pushed the fix/app_disk_usage branch 3 times, most recently from f36b9cc to f1a7c63 Compare October 30, 2024 16:19
… cluster should be multiplied by worker size.
@Z1Wu Z1Wu force-pushed the fix/app_disk_usage branch from f1a7c63 to b1b6c74 Compare October 30, 2024 16:20
@Z1Wu
Contributor Author

Z1Wu commented Oct 31, 2024

cc @FMX

@FMX
Contributor

FMX commented Nov 1, 2024

Thanks for this PR, but the assumption is not sound. Every worker reports its disk usage metrics to the master node through the worker heartbeat.
You cannot multiply by the worker count, because all workers already report these metrics.

@FMX FMX left a comment

This change is incorrect for the following reasons:

  1. The master node collects the total disk usage from all workers, so this value should not be multiplied by the number of workers. Multiplying the usage by the worker count would result in an inflated and inaccurate total, significantly exceeding the actual usage.
  2. Additionally, shuffle data may not be evenly distributed among the workers, particularly with Celeborn workers that support the 'LOADAWARE' slot assignment policy.
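
A small sketch with made-up numbers may help illustrate the second point. It assumes one application whose data is skewed across four workers; the worker names and sizes are hypothetical, and `SkewExample` is not part of Celeborn.

```scala
object SkewExample {
  // Hypothetical usage (in GiB) of one application on each of four workers.
  val usageByWorker: Map[String, Long] = Map(
    "worker-a" -> 40L,
    "worker-b" -> 5L,
    "worker-c" -> 3L,
    "worker-d" -> 2L
  )

  // Actual cluster-wide usage is the sum over all workers.
  val actualTotal: Long = usageByWorker.values.sum

  // Estimate obtained by scaling a single worker's report by the worker count.
  def scaledEstimate(workerId: String): Long =
    usageByWorker(workerId) * usageByWorker.size
}
```

With these numbers the actual total is 50 GiB, while scaling the report from `worker-a` gives 160 GiB and scaling the report from `worker-d` gives 8 GiB, so the estimate depends heavily on which worker happens to report.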

@Z1Wu
Contributor Author

Z1Wu commented Nov 1, 2024

Thanks for your review. The two issues you mentioned are reasonable.

But if I've understood correctly, the current implementation appears to treat an application's usage on a single worker as that application's usage across the entire cluster, as shown in the code below:

// org.apache.celeborn.common.meta.AppDiskUsageSnapShot#updateAppDiskUsage
// param: usage -> the application's disk usage on one worker, e.g. worker A
def updateAppDiskUsage(appId: String, usage: Long): Unit = {
    // drop the application's old disk usage entry from topNItems
    val dropIndex = topNItems.indexWhere(usage => usage != null && usage.appId == appId)
    if (dropIndex != -1) {
        drop(dropIndex)
    }
    // find the insert position that preserves the sorted order
    val insertIndex = findInsertPosition(usage)
    // store the application's disk usage on worker A in topNItems as its disk usage in the cluster
    if (insertIndex != -1) {
        shift(insertIndex)
        topNItems(insertIndex) = AppDiskUsage(appId, usage)
    }
}

Because of this, the reported application disk usage would be significantly lower than the application's actual usage across the cluster.
To get an accurate figure for application disk usage in the cluster, the Master would need to maintain a data structure that records each application's usage on every worker. This information can be obtained from the heartbeats sent by the workers. Such a data structure would have a space complexity of O(m * n), where m is the number of workers and n is the number of currently active applications. WDYT?
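
A minimal sketch of such a structure, under the assumption that each heartbeat carries a map from application ID to bytes used on that worker. `AppUsageTracker` and its methods are hypothetical names, not existing Celeborn classes.

```scala
import scala.collection.mutable

// Hypothetical master-side tracker; not Celeborn's actual implementation.
class AppUsageTracker {
  // workerId -> (appId -> bytes used on that worker); at most m * n entries.
  private val usageByWorker = mutable.Map.empty[String, Map[String, Long]]

  // Replace the worker's previous report with the one from its latest heartbeat.
  def onHeartbeat(workerId: String, appUsage: Map[String, Long]): Unit = {
    usageByWorker(workerId) = appUsage
  }

  // Forget a worker that has left the cluster.
  def removeWorker(workerId: String): Unit = {
    usageByWorker.remove(workerId)
    ()
  }

  // Cluster-wide usage of one application: sum over every worker's latest report.
  def clusterUsage(appId: String): Long =
    usageByWorker.values.map(_.getOrElse(appId, 0L)).sum
}
```

Replacing a worker's whole report on each heartbeat keeps the totals from double counting, at the cost of the O(m * n) memory mentioned above.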

@FMX
Contributor

FMX commented Nov 4, 2024

@Z1Wu Thank you for your enthusiasm! The feature you're interested in has been addressed in this pull request. I recommend removing the AppDiskUsageMetric and the estimatedAppDiskUsage values from the worker's heartbeat, as they are now outdated.

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Nov 24, 2024
@turboFei
Member

Thanks @Z1Wu, we plan to remove the code for the old top app usages.
#2949

I think we can close this PR now.

@turboFei turboFei closed this Nov 27, 2024