[Filestore] Implement sharding for directories #2674

Open
qkrorlqr opened this issue Dec 11, 2024 · 3 comments
@qkrorlqr
Collaborator

Right now sharding is done only for regular inodes (i.e. files). Directories, symlinks, etc. are managed by a single tablet per logical filesystem. This is a bottleneck for:

  • directory listing
  • file creation, renaming and unlinking
  • inode lookup by parent id + child name

For example, file creation is currently limited to roughly 5-10k creations per second per logical filesystem.

We need to implement sharding for directories. This leads to the need for something like distributed transactions - most notably for the rename operation: either the source directory shard or the target directory shard may be unable to perform the operation, so we need to either commit the operation in both shards or in neither. The most straightforward way to implement this is 2PC. Since all of our transactions have at most 2 participants, and each participant is reliable and has access to persistent storage, we can assign the transaction coordinator role to one of the participants (e.g. the source directory shard). 2PC then turns into something that is pretty simple to implement (a sketch follows the steps below):

  1. source directory shard receives the request and checks whether it can perform it - if it can, it locks the directory for unlinking and locks the source name in that directory for any kind of operation
  2. source directory shard sends the request to the target directory shard
  3. if target directory shard returns a success code, source directory shard commits the change on its side and releases the locks
  4. if target directory shard returns an error code, source directory shard just releases the locks
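A minimal sketch of this flow, with the source shard acting as coordinator. All types and names below (TRenameRequest, TSourceShard, EStatus, the in-memory lock tables) are illustrative assumptions, not the actual Filestore code:

```cpp
// Hypothetical sketch of the simplified 2PC rename flow with the source
// directory shard acting as coordinator. All types, names and the in-memory
// lock tables are illustrative - they are not the actual Filestore code.

#include <functional>
#include <set>
#include <string>
#include <utility>

enum class EStatus { Ok, Rejected, Error };

struct TRenameRequest {
    std::string SrcDir;
    std::string SrcName;
    std::string DstDir;
    std::string DstName;
};

class TSourceShard {
public:
    // Step 1: check that the rename can be performed locally and take the locks.
    EStatus PrepareRename(const TRenameRequest& req) {
        if (!CanUnlink(req.SrcDir, req.SrcName)) {
            return EStatus::Error;
        }
        DirsLockedForUnlink.insert(req.SrcDir);
        LockedNames.insert({req.SrcDir, req.SrcName});
        return EStatus::Ok;
    }

    // Steps 2-4: forward the request to the target shard, then either commit
    // the local change (on success) or just release the locks (on error).
    EStatus FinishRename(
        const TRenameRequest& req,
        const std::function<EStatus(const TRenameRequest&)>& sendToTargetShard)
    {
        const EStatus targetStatus = sendToTargetShard(req);
        if (targetStatus == EStatus::Ok) {
            CommitLocalChange(req);   // step 3: remove the source-side entry
        }
        ReleaseLocks(req);            // steps 3 and 4: locks go away either way
        return targetStatus;
    }

private:
    bool CanUnlink(const std::string&, const std::string&) { return true; }
    void CommitLocalChange(const TRenameRequest&) { /* drop the source NodeRef */ }

    void ReleaseLocks(const TRenameRequest& req) {
        DirsLockedForUnlink.erase(req.SrcDir);
        LockedNames.erase({req.SrcDir, req.SrcName});
    }

    std::set<std::string> DirsLockedForUnlink;
    std::set<std::pair<std::string, std::string>> LockedNames;
};
```

Since the coordinator holds its locks until the target shard answers, an in-flight rename also has to survive a coordinator restart - that is what the OpLog-based plan in the comments below takes care of.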
@qkrorlqr added the filestore label Dec 11, 2024
@qkrorlqr changed the title from "[Filestore] Implement sharding for all inodes (not only regular inodes)" to "[Filestore] Implement sharding for directories" Dec 11, 2024
@qkrorlqr self-assigned this Dec 28, 2024
@qkrorlqr
Collaborator Author

qkrorlqr commented Dec 30, 2024

TODO: safety check (in case our GUIDs turn out not to be unique) - check that the NodeAttr in the TCreateNodeResponse from the shard matches the corresponding TCreateNodeRequest

UPD: done
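For reference, a tiny sketch of the kind of check meant here, with assumed (not actual) field names on both messages:

```cpp
// Illustrative version of the check: the node described by the shard's
// TCreateNodeResponse should match the TCreateNodeRequest that produced it.
// The field names below are assumptions, not the real protobuf schema.

#include <cstdint>
#include <string>

struct TCreateNodeRequestSketch {
    std::string Name;
    std::uint32_t Mode = 0;
};

struct TNodeAttrSketch {
    std::string Name;
    std::uint32_t Mode = 0;
};

struct TCreateNodeResponseSketch {
    TNodeAttrSketch NodeAttr;
};

// Returns false if the shard answered for a different node than the one we
// asked it to create - e.g. because of a GUID collision or a misrouted reply.
bool MatchesRequest(
    const TCreateNodeRequestSketch& request,
    const TCreateNodeResponseSketch& response)
{
    return response.NodeAttr.Name == request.Name
        && response.NodeAttr.Mode == request.Mode;
}
```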

@qkrorlqr
Collaborator Author

qkrorlqr commented Jan 8, 2025

RenameNode implementation plan

Important rules:

  • directory creation in a shard is allowed only if LastNodeId == 1 in the main tablet (i.e. the main tablet doesn't manage ANY nodes) - this eliminates the case where we would need to convert a local node into an external node (a guard for this is sketched after the list)
  • all nodes are either referenced by <RootNodeId, GUID> or are external nodes - this serves the same goal and also simplifies the directory structure: its max depth in each shard is now 1 or 2
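A rough sketch of the first rule as a guard on the main tablet side - the type and field names here are assumptions for illustration only:

```cpp
// Illustrative guard for the first rule: directory creation may be routed to
// a shard only while the main tablet has never allocated a node of its own,
// i.e. LastNodeId is still at its initial value. Names are assumptions.

#include <cstdint>

constexpr std::uint64_t InitialLastNodeId = 1;

struct TMainTabletStateSketch {
    std::uint64_t LastNodeId = InitialLastNodeId;
};

// If the main tablet already manages local nodes, creating directories in
// shards would later require converting local nodes into external ones -
// exactly the case the rule is meant to avoid.
bool CanCreateDirectoryInShard(const TMainTabletStateSketch& state) {
    return state.LastNodeId == InitialLastNodeId;
}
```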

RenameNodeRequest will be sent from TStorageServiceActor to the shard in charge of the source directory. This shard will play the role of the transaction coordinator.

If both source and destination directories are managed by the same tablet, RenameNode works in the same way it works right now. It's a simple way to achieve 2 things:

  • keep RenameNode logic the same for shardless filesystems
  • optimize the case when a file is moved within the same directory - this should happen a lot, e.g. for cases like "create tmp file" -> "populate tmp file" -> "mv file.tmp file"

Otherwise the destination is managed by another shard. The logic is then as follows (a sketch is given after these steps):

  1. We perform a special tx in the coordinator which "locks" the source NodeRef (making any operation - at least any modifying operation - return E_REJECTED) AND adds a RenameNodeInDestinationRequest to the OpLog table
  2. The coordinator sends that RenameNodeInDestinationRequest to the destination shard (RenameNodeInDestinationRequest from OpLog should be sent upon LoadState as well - just like any other request in OpLog)
  3. Destination shard performs the RenameNodeInDestination tx which does the part of the normal RenameNode tx related to the destination directory
  4. Upon receiving RenameNodeInDestinationResponse, coordinator removes the corresponding RenameNodeInDestinationRequest from OpLog AND removes source NodeRef in the same tx

That seems to be it.
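A minimal sketch of the coordinator side of this flow, with the OpLog modeled as an in-memory map; all types and names are assumptions rather than the actual tablet code:

```cpp
// Hypothetical sketch of the coordinator (source shard) side of a cross-shard
// RenameNode. The OpLog is modeled as a plain map; in a real tablet it would
// be a local DB table written inside the same tx as the lock. All names are
// assumptions, not the actual tablet code.

#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

struct TRenameNodeInDestinationRequestSketch {
    std::uint64_t OpId = 0;
    std::string DstDir;
    std::string DstName;
    std::string ChildGuid;   // the node being moved
};

class TCoordinatorShardSketch {
public:
    // Step 1: a single tx that "locks" the source NodeRef and persists the
    // request in the OpLog so it can be replayed after a restart.
    std::uint64_t PrepareRename(
        const std::string& srcDir,
        const std::string& srcName,
        TRenameNodeInDestinationRequestSketch req)
    {
        req.OpId = ++LastOpId;
        LockedNodeRefs.insert({srcDir, srcName});   // further ops => E_REJECTED
        OpLog[req.OpId] = req;
        return req.OpId;
    }

    // Steps 2-4: send the request to the destination shard; on success remove
    // the OpLog entry and the source NodeRef in one tx and release the lock.
    void ExecuteRename(
        std::uint64_t opId,
        const std::string& srcDir,
        const std::string& srcName,
        const std::function<bool(const TRenameNodeInDestinationRequestSketch&)>&
            sendToDestinationShard)
    {
        const auto it = OpLog.find(opId);
        if (it == OpLog.end()) {
            return;   // already completed earlier (e.g. after an OpLog replay)
        }
        if (sendToDestinationShard(it->second)) {
            RemoveSourceNodeRef(srcDir, srcName);    // same tx as the erase below
            OpLog.erase(it);
            LockedNodeRefs.erase({srcDir, srcName});
        }
        // On failure the entry stays in the OpLog and is sent again upon
        // LoadState, like any other OpLog request.
    }

private:
    void RemoveSourceNodeRef(const std::string&, const std::string&) {}

    std::uint64_t LastOpId = 0;
    std::map<std::uint64_t, TRenameNodeInDestinationRequestSketch> OpLog;
    std::set<std::pair<std::string, std::string>> LockedNodeRefs;
};
```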

@qkrorlqr
Collaborator Author

qkrorlqr commented Jan 17, 2025

TODOs left after #2838:

  • DupCache for RenameNodeInDestination
  • RenameNodeInDestination request replay from OpLog upon tablet start - DONE
  • NodeRefs locking - DONE
  • uts for the non-happy path - DONE
  • implement NodeType checks (via extra GetNodeAttr calls or by storing NodeType in NodeRefs) - one possible shape is sketched below
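One possible shape for the NodeType option - storing the type next to the NodeRef and checking rename compatibility without an extra GetNodeAttr round trip. All names here are assumptions:

```cpp
// Illustrative sketch of the "store NodeType in NodeRefs" option: keeping the
// node type next to the reference lets a rename validate type compatibility
// without an extra GetNodeAttr round trip. All names are assumptions.

#include <optional>
#include <string>

enum class ENodeTypeSketch { Regular, Directory, Symlink };

struct TNodeRefSketch {
    std::string ChildGuid;
    ENodeTypeSketch NodeType = ENodeTypeSketch::Regular;
};

// POSIX rename semantics: a directory may only replace a directory, and a
// non-directory may only replace a non-directory (or nothing at all).
bool RenameTypesCompatible(
    const TNodeRefSketch& source,
    const std::optional<TNodeRefSketch>& existingTarget)
{
    if (!existingTarget) {
        return true;   // nothing is being overwritten
    }
    const bool srcIsDir = source.NodeType == ENodeTypeSketch::Directory;
    const bool dstIsDir = existingTarget->NodeType == ENodeTypeSketch::Directory;
    return srcIsDir == dstIsDir;
}
```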
