[Filestore] Implement sharding for directories #2674

Open
qkrorlqr opened this issue Dec 11, 2024 · 3 comments
@qkrorlqr
Collaborator

Right now sharding is done only for regular inodes (i.e. files). Directories, symlinks, etc. are managed by a single tablet per logical filesystem. This is a bottleneck for:

  • directory listing
  • file creation, renaming and unlinking
  • inode lookup by parent id + child name

For example, file creation is currently limited to roughly 5-10k creations per second per logical filesystem.

We need to implement sharding for directories. This leads to the need for something like distributed transactions - most notably for the rename operation: either the source directory shard or the target directory shard may be unable to perform the operation, so we need to either commit the operation in both shards or in neither. The most straightforward way to implement this is 2PC. Since all of our transactions have at most 2 participants, and each participant is reliable and has access to persistent storage, we can assign the transaction coordinator role to one of the participants (e.g. the source directory shard). 2PC then turns into something that is pretty simple to implement (a sketch follows the steps below):

  1. source directory shard receives the request and checks whether it can perform it - if it can, it locks the directory for unlinking and locks the source name in that directory for any kind of operation
  2. source directory shard sends the request to the target directory shard
  3. if target directory shard returns a success code, source directory shard commits the change on its side and releases the locks
  4. if target directory shard returns an error code, source directory shard just releases the locks
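A minimal sketch of this flow, with the source shard acting as coordinator. All types and names below (TRenameRequest, TSourceShard, EStatus, the in-memory lock tables) are illustrative assumptions, not the actual Filestore code:

```cpp
// Hypothetical sketch of the simplified 2PC rename flow with the source
// directory shard acting as coordinator. All types, names and the in-memory
// lock tables are illustrative - they are not the actual Filestore code.

#include <functional>
#include <set>
#include <string>
#include <utility>

enum class EStatus { Ok, Rejected, Error };

struct TRenameRequest {
    std::string SrcDir;
    std::string SrcName;
    std::string DstDir;
    std::string DstName;
};

class TSourceShard {
public:
    // Step 1: check that the rename can be performed locally and take the locks.
    EStatus PrepareRename(const TRenameRequest& req) {
        if (!CanUnlink(req.SrcDir, req.SrcName)) {
            return EStatus::Error;
        }
        DirsLockedForUnlink.insert(req.SrcDir);
        LockedNames.insert({req.SrcDir, req.SrcName});
        return EStatus::Ok;
    }

    // Steps 2-4: forward the request to the target shard, then either commit
    // the local change (on success) or just release the locks (on error).
    EStatus FinishRename(
        const TRenameRequest& req,
        const std::function<EStatus(const TRenameRequest&)>& sendToTargetShard)
    {
        const EStatus targetStatus = sendToTargetShard(req);
        if (targetStatus == EStatus::Ok) {
            CommitLocalChange(req);   // step 3: remove the source-side entry
        }
        ReleaseLocks(req);            // steps 3 and 4: locks go away either way
        return targetStatus;
    }

private:
    bool CanUnlink(const std::string&, const std::string&) { return true; }
    void CommitLocalChange(const TRenameRequest&) { /* drop the source NodeRef */ }

    void ReleaseLocks(const TRenameRequest& req) {
        DirsLockedForUnlink.erase(req.SrcDir);
        LockedNames.erase({req.SrcDir, req.SrcName});
    }

    std::set<std::string> DirsLockedForUnlink;
    std::set<std::pair<std::string, std::string>> LockedNames;
};
```

Since the coordinator holds its locks until the target shard answers, an in-flight rename also has to survive a coordinator restart - that is what the OpLog-based plan in the comments below takes care of.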
@qkrorlqr added the filestore label Dec 11, 2024
@qkrorlqr changed the title from "[Filestore] Implement sharding for all inodes (not only regular inodes)" to "[Filestore] Implement sharding for directories" Dec 11, 2024
@qkrorlqr self-assigned this Dec 28, 2024
@qkrorlqr
Collaborator Author

qkrorlqr commented Dec 30, 2024

TODO: safety check (in case our GUIDs turn out not to be unique) - check that the NodeAttr in the TCreateNodeResponse from the shard matches the corresponding TCreateNodeRequest

UPD: done
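For reference, a tiny sketch of the kind of check meant here, with assumed (not actual) field names on both messages:

```cpp
// Illustrative version of the check: the node described by the shard's
// TCreateNodeResponse should match the TCreateNodeRequest that produced it.
// The field names below are assumptions, not the real protobuf schema.

#include <cstdint>
#include <string>

struct TCreateNodeRequestSketch {
    std::string Name;
    std::uint32_t Mode = 0;
};

struct TNodeAttrSketch {
    std::string Name;
    std::uint32_t Mode = 0;
};

struct TCreateNodeResponseSketch {
    TNodeAttrSketch NodeAttr;
};

// Returns false if the shard answered for a different node than the one we
// asked it to create - e.g. because of a GUID collision or a misrouted reply.
bool MatchesRequest(
    const TCreateNodeRequestSketch& request,
    const TCreateNodeResponseSketch& response)
{
    return response.NodeAttr.Name == request.Name
        && response.NodeAttr.Mode == request.Mode;
}
```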

@qkrorlqr
Collaborator Author

qkrorlqr commented Jan 8, 2025

RenameNode implementation plan

Important rules:

  • directory creation in a shard is allowed only if LastNodeId == 1 in the main tablet (i.e. the main tablet doesn't manage ANY nodes) - this eliminates the case where we would need to convert a local node into an external node (a guard for this is sketched after the list)
  • all nodes are either referenced by <RootNodeId, GUID> or are external nodes - this serves the same goal and also simplifies the directory structure: its max depth in each shard is now 1 or 2
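A rough sketch of the first rule as a guard on the main tablet side - the type and field names here are assumptions for illustration only:

```cpp
// Illustrative guard for the first rule: directory creation may be routed to
// a shard only while the main tablet has never allocated a node of its own,
// i.e. LastNodeId is still at its initial value. Names are assumptions.

#include <cstdint>

constexpr std::uint64_t InitialLastNodeId = 1;

struct TMainTabletStateSketch {
    std::uint64_t LastNodeId = InitialLastNodeId;
};

// If the main tablet already manages local nodes, creating directories in
// shards would later require converting local nodes into external ones -
// exactly the case the rule is meant to avoid.
bool CanCreateDirectoryInShard(const TMainTabletStateSketch& state) {
    return state.LastNodeId == InitialLastNodeId;
}
```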

RenameNodeRequest will be sent from TStorageServiceActor to the shard in charge of the source directory. This shard will play the role of the transaction coordinator.

If both source and destination directories are managed by the same tablet, RenameNode works in the same way it works right now. It's a simple way to achieve 2 things:

  • keep RenameNode logic the same for shardless filesystems
  • optimize the case when a file is moved within the same directory - this should happen a lot, e.g. for cases like "create tmp file" -> "populate tmp file" -> "mv file.tmp file"

Otherwise the destination is managed by another shard. The logic is then as follows (a sketch is given after these steps):

  1. We perform a special tx in the coordinator which "locks" the source NodeRef (making any operation - at least any modifying operation - return E_REJECTED) AND adds a RenameNodeInDestinationRequest to the OpLog table
  2. The coordinator sends that RenameNodeInDestinationRequest to the destination shard (RenameNodeInDestinationRequest from OpLog should be sent upon LoadState as well - just like any other request in OpLog)
  3. Destination shard performs the RenameNodeInDestination tx which does the part of the normal RenameNode tx related to the destination directory
  4. Upon receiving RenameNodeInDestinationResponse, coordinator removes the corresponding RenameNodeInDestinationRequest from OpLog AND removes source NodeRef in the same tx

That seems to be it.
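A minimal sketch of the coordinator side of this flow, with the OpLog modeled as an in-memory map; all types and names are assumptions rather than the actual tablet code:

```cpp
// Hypothetical sketch of the coordinator (source shard) side of a cross-shard
// RenameNode. The OpLog is modeled as a plain map; in a real tablet it would
// be a local DB table written inside the same tx as the lock. All names are
// assumptions, not the actual tablet code.

#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

struct TRenameNodeInDestinationRequestSketch {
    std::uint64_t OpId = 0;
    std::string DstDir;
    std::string DstName;
    std::string ChildGuid;   // the node being moved
};

class TCoordinatorShardSketch {
public:
    // Step 1: a single tx that "locks" the source NodeRef and persists the
    // request in the OpLog so it can be replayed after a restart.
    std::uint64_t PrepareRename(
        const std::string& srcDir,
        const std::string& srcName,
        TRenameNodeInDestinationRequestSketch req)
    {
        req.OpId = ++LastOpId;
        LockedNodeRefs.insert({srcDir, srcName});   // further ops => E_REJECTED
        OpLog[req.OpId] = req;
        return req.OpId;
    }

    // Steps 2-4: send the request to the destination shard; on success remove
    // the OpLog entry and the source NodeRef in one tx and release the lock.
    void ExecuteRename(
        std::uint64_t opId,
        const std::string& srcDir,
        const std::string& srcName,
        const std::function<bool(const TRenameNodeInDestinationRequestSketch&)>&
            sendToDestinationShard)
    {
        const auto it = OpLog.find(opId);
        if (it == OpLog.end()) {
            return;   // already completed earlier (e.g. after an OpLog replay)
        }
        if (sendToDestinationShard(it->second)) {
            RemoveSourceNodeRef(srcDir, srcName);    // same tx as the erase below
            OpLog.erase(it);
            LockedNodeRefs.erase({srcDir, srcName});
        }
        // On failure the entry stays in the OpLog and is sent again upon
        // LoadState, like any other OpLog request.
    }

private:
    void RemoveSourceNodeRef(const std::string&, const std::string&) {}

    std::uint64_t LastOpId = 0;
    std::map<std::uint64_t, TRenameNodeInDestinationRequestSketch> OpLog;
    std::set<std::pair<std::string, std::string>> LockedNodeRefs;
};
```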

@qkrorlqr
Collaborator Author

qkrorlqr commented Jan 17, 2025

TODOs left after #2838:

  • DupCache for RenameNodeInDestination
  • RenameNodeInDestination request replay from OpLog upon tablet start - DONE
  • NodeRefs locking - DONE
  • uts for the non-happy path - DONE
  • implement NodeType checks (via extra GetNodeAttr calls or by storing NodeType in NodeRefs) - one possible shape is sketched below
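One possible shape for the NodeType option - storing the type next to the NodeRef and checking rename compatibility without an extra GetNodeAttr round trip. All names here are assumptions:

```cpp
// Illustrative sketch of the "store NodeType in NodeRefs" option: keeping the
// node type next to the reference lets a rename validate type compatibility
// without an extra GetNodeAttr round trip. All names are assumptions.

#include <optional>
#include <string>

enum class ENodeTypeSketch { Regular, Directory, Symlink };

struct TNodeRefSketch {
    std::string ChildGuid;
    ENodeTypeSketch NodeType = ENodeTypeSketch::Regular;
};

// POSIX rename semantics: a directory may only replace a directory, and a
// non-directory may only replace a non-directory (or nothing at all).
bool RenameTypesCompatible(
    const TNodeRefSketch& source,
    const std::optional<TNodeRefSketch>& existingTarget)
{
    if (!existingTarget) {
        return true;   // nothing is being overwritten
    }
    const bool srcIsDir = source.NodeType == ENodeTypeSketch::Directory;
    const bool dstIsDir = existingTarget->NodeType == ENodeTypeSketch::Directory;
    return srcIsDir == dstIsDir;
}
```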
