fixChangeFeedHangWhenUsingStaleContainerRid #43729

Draft: wants to merge 6 commits into base: main

tvaron3 (Member) commented Jan 7, 2025

Follow-up to PR #43114.

Issue:

When the change feed is queried with an invalid continuation token (in this case the container was re-created, so the continuationToken carries a stale containerRid), the SDK returns incorrect results or hangs.

Root Cause:

  • Container re-created with the same RU or the same feedRanges -> the SDK reuses the token (LSN) to resume reading from the new container, which can cause missing data.

  • Container re-created with different RU or different feedRanges, so that the feedRange in the continuationToken spans multiple partitions of the new container -> the SDK hangs (endless retries):

     1.   In `populateFeedRangeFilteringHeaders`, the SDK detects that the feedRange spans multiple partitions and throws `PartitionKeyRangeGoneException`.
     2.   `ChangeFeedFetcher.FeedRangeContinuationFeedRangeGoneRetryPolicy`
           -  handles the `PartitionKeyRangeGoneException` and tries to find overlapping ranges based on the continuationToken's collectionRid; a null list is returned
           -  retries with the same feedRange from the continuationToken
           -  repeats step 1
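
The retry cycle described above can be modeled in isolation. The following is a minimal sketch, not the SDK's code: `RangeGoneException`, `RANGES_BY_RID`, `populateFilteringHeaders`, `findOverlappingRanges`, and `runPreFixLoop` are hypothetical stand-ins, and the `maxAttempts` cap exists only so the example terminates (the behavior reported in this PR has no such cap).

```java
import java.util.List;
import java.util.Map;

final class StaleRidRetryLoopSketch {

    /** Hypothetical stand-in for the SDK's PartitionKeyRangeGoneException. */
    static final class RangeGoneException extends RuntimeException {
        RangeGoneException(String message) {
            super(message);
        }
    }

    /** Hypothetical routing data: only the re-created container's rid is known. */
    static final Map<String, List<String>> RANGES_BY_RID =
        Map.of("ridNew", List.of("pkRange0", "pkRange1"));

    /** Step 1: fail when the token's feed range maps to more than one partition. */
    static void populateFilteringHeaders(String currentRid) {
        List<String> overlapping = RANGES_BY_RID.get(currentRid);
        if (overlapping != null && overlapping.size() > 1) {
            throw new RangeGoneException("feed range spans " + overlapping.size() + " partitions");
        }
    }

    /** Step 2: the lookup uses the token's rid, so a stale rid yields null. */
    static List<String> findOverlappingRanges(String tokenRid) {
        return RANGES_BY_RID.get(tokenRid);
    }

    /** Returns the number of attempts made; maxAttempts only keeps the sketch finite. */
    static int runPreFixLoop(String tokenRid, String currentRid, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                populateFilteringHeaders(currentRid);
                return attempts; // the request would be sent here
            } catch (RangeGoneException e) {
                if (findOverlappingRanges(tokenRid) != null) {
                    return attempts; // a real policy would split the feed range here
                }
                // null result: retry with the same feed range, which fails the same way
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Stale token rid against a re-created container: every attempt fails identically.
        System.out.println(runPreFixLoop("ridOld", "ridNew", 5));
    }
}
```

With a stale token rid the loop uses every attempt it is given, because nothing changes between retries.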
    

Fixes:

  • Container re-created with the same RU or the same feedRanges -> always populate `x-ms-cosmos-intended-collection-rid` for changeFeed requests, so that a 400/1024 is eventually bubbled up to the customer.
  • Container re-created with different RU or different feedRanges, so that the feedRange in the continuationToken spans multiple partitions of the new container -> the SDK internally creates a 400/1024 exception, refreshes the container cache once, and retries. If the retry still fails, the 400/1024 is bubbled up to the customer.
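
The "refresh once, retry, then bubble up" flow in the second fix can be sketched as below. This is an illustrative model under stated assumptions, not the code in this PR: `StaleContainerRidException`, `ContainerCache`, `attempt`, and `readWithSingleRefresh` are hypothetical names, and comparing the token's rid with the cached rid is a simplification of the actual feedRange check.

```java
import java.util.function.Supplier;

final class RefreshOnceRetrySketch {

    /** Hypothetical stand-in for an error with status 400 and sub-status 1024. */
    static final class StaleContainerRidException extends RuntimeException {
        final int statusCode = 400;
        final int subStatusCode = 1024;

        StaleContainerRidException(String message) {
            super(message);
        }
    }

    /** Hypothetical container cache; ridLookup models asking the service for the current rid. */
    static final class ContainerCache {
        private final Supplier<String> ridLookup;
        private String cachedRid;
        int refreshCount;

        ContainerCache(String initialRid, Supplier<String> ridLookup) {
            this.cachedRid = initialRid;
            this.ridLookup = ridLookup;
        }

        String rid() {
            return cachedRid;
        }

        void refresh() {
            cachedRid = ridLookup.get();
            refreshCount++;
        }
    }

    /** Simplified check (assumption): a token rid that differs from the cached rid is stale. */
    static String attempt(String tokenRid, ContainerCache cache) {
        if (!tokenRid.equals(cache.rid())) {
            throw new StaleContainerRidException(
                "token rid " + tokenRid + " != container rid " + cache.rid());
        }
        return "change feed page from " + cache.rid();
    }

    /** Refresh the cache once on 400/1024, retry, and let a second failure propagate. */
    static String readWithSingleRefresh(String tokenRid, ContainerCache cache) {
        try {
            return attempt(tokenRid, cache);
        } catch (StaleContainerRidException first) {
            cache.refresh();
            return attempt(tokenRid, cache); // a second failure reaches the caller
        }
    }

    public static void main(String[] args) {
        // Stale cache, valid token: one refresh is enough.
        ContainerCache staleCache = new ContainerCache("ridOld", () -> "ridNew");
        System.out.println(readWithSingleRefresh("ridNew", staleCache));

        // Stale token: the refresh does not help, so the error surfaces instead of looping.
        ContainerCache freshCache = new ContainerCache("ridNew", () -> "ridNew");
        try {
            readWithSingleRefresh("ridOld", freshCache);
        } catch (StaleContainerRidException e) {
            System.out.println(e.statusCode + "/" + e.subStatusCode + " surfaced to the caller");
        }
    }
}
```

Bounding the recovery to a single refresh is what turns the hang into a terminal error: either the refreshed cache resolves the mismatch, or the caller sees the 400/1024 and can discard the stale continuation token.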

@github-actions github-actions bot added the Cosmos label Jan 7, 2025
tvaron3 (Member, Author) commented Jan 7, 2025

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

azure-sdk (Collaborator)

API change check

API changes are not detected in this pull request.

2 participants