refactor(v2): prune performance #1034

Open
kocubinski opened this issue Dec 31, 2024 · 0 comments

A pruning event deletes all tree nodes invalidated (orphaned) at an arbitrary prune version that is less than the tree's current version. Given the earliest version of the tree on disk m and a prune version n, v2 calculates the set difference all_nodes(m -> n) - orphaned_at(n) and, importantly, writes that set to disk in a new shard rather than deleting from the old one, because inserting into a B+Tree is markedly faster than deleting from it. After completion the old shard(s) are dropped, and a complete version of the tree at version n is in shard n.
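To make the set difference concrete, here is a minimal in-memory sketch. The names `NodeKey`, `surviving_nodes`, and the `orphaned_at` map are made up for illustration, and it assumes orphaned_at(n) means "orphaned at or before version n"; the real v2 code works against SQLite shards, not Python lists.

```python
from typing import Iterable, NamedTuple


class NodeKey(NamedTuple):
    """Hypothetical node identity: the version that wrote it plus a sequence."""
    version: int
    sequence: int


def surviving_nodes(
    all_nodes: Iterable[NodeKey],
    orphaned_at: dict[NodeKey, int],
    prune_version: int,
) -> list[NodeKey]:
    """Return all_nodes(m -> n) - orphaned_at(n) for n = prune_version."""
    # Assumption: a node is dead at n if it was orphaned at or before n.
    dead = {key for key, at in orphaned_at.items() if at <= prune_version}
    # Keep nodes written at or before n that are still referenced at n.
    return [
        key
        for key in all_nodes
        if key.version <= prune_version and key not in dead
    ]
```

The returned list is what would be inserted into the new shard; nothing is deleted from the old one.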

There are some opportunities in the current implementation for performance improvements.

Rate Limiting

The current rate limit mechanism is a global pruning limit. This works, but we can probably do better.

Right now pruning begins at prune_height + keep_versions even though a new shard is created at prune_height + 1. Pruning could instead start at prune_height and be rate limited (slowed way down) to reduce I/O pressure. This is probably as simple as smaller batches with a configurable pause between them to hold a target insert rate.
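A rough sketch of what "smaller batches and a configurable pause" could look like. `rate_limited_batches`, `write_batch`, `batch_size`, and `target_rows_per_sec` are all hypothetical names, not existing v2 configuration; the clock and sleep functions are injectable only so the pacing can be checked without waiting.

```python
import time
from itertools import islice
from typing import Callable, Iterable


def rate_limited_batches(
    rows: Iterable,
    batch_size: int,
    target_rows_per_sec: float,
    write_batch: Callable[[list], None],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Write rows in small batches, pausing so the average insert rate
    stays at or below target_rows_per_sec. Returns the rows written."""
    it = iter(rows)
    written = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return written
        start = clock()
        write_batch(batch)  # e.g. one INSERT transaction into the new shard
        written += len(batch)
        # Time this batch is allowed to take at the target rate, minus the
        # time the write actually took; sleep off whatever is left.
        budget = len(batch) / target_rows_per_sec
        remaining = budget - (clock() - start)
        if remaining > 0:
            sleep(remaining)
```

If the write itself is slower than the budget, no pause is added, so the limiter only ever slows pruning down to the target and never tries to catch up.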

SQLite JOIN

The current set difference calculation is CPU and memory intensive. It naively loads all orphans into a map and filters branch and leaf nodes against that map. The reason for this design is to keep leaf and branch orphan writes fast: there is no index on the orphan tables. If there were, a SQLite JOIN statement could be used to populate the new shard, which is probably vastly more efficient.

Pruning could build the orphan indexes on (version, sequence) and then execute the JOIN statements. Of course this still costs CPU and I/O, so it's not clear where the inflection point (on tree size) is for it to be more efficient, but it probably is in some cases.
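Something like the following is what I have in mind, sketched with Python's `sqlite3` against a single in-memory database. The table and column names (`tree`, `orphan`, `tree_new`, `orphaned_at`) are placeholders, not the actual v2 schema, and real shards live in separate files with separate leaf and branch tables. It also assumes an orphan counts against prune version n when it was orphaned at or before n.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for an old shard (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tree     (version INT, sequence INT, bytes BLOB);
        CREATE TABLE orphan   (version INT, sequence INT, orphaned_at INT);
        CREATE TABLE tree_new (version INT, sequence INT, bytes BLOB);
        """
    )
    conn.executemany(
        "INSERT INTO tree VALUES (?, ?, x'00')",
        [(1, 1), (1, 2), (2, 1), (3, 1), (4, 1)],
    )
    conn.executemany(
        "INSERT INTO orphan VALUES (?, ?, ?)",
        [(1, 2, 2), (2, 1, 4)],
    )
    conn.commit()
    return conn


def prune_with_join(conn: sqlite3.Connection, prune_version: int) -> int:
    """Populate tree_new with nodes alive at prune_version via an anti-join."""
    # Build the index only at prune time so normal orphan writes stay unindexed.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS orphan_idx ON orphan (version, sequence)"
    )
    conn.execute(
        """
        INSERT INTO tree_new (version, sequence, bytes)
        SELECT t.version, t.sequence, t.bytes
        FROM tree AS t
        LEFT JOIN orphan AS o
          ON o.version = t.version
         AND o.sequence = t.sequence
         AND o.orphaned_at <= :n
        WHERE t.version <= :n
          AND o.version IS NULL   -- no matching orphan row: node survives
        """,
        {"n": prune_version},
    )
    conn.commit()
    return conn.execute("SELECT count(*) FROM tree_new").fetchone()[0]
```

Running `prune_with_join(make_demo_db(), 3)` copies three of the five demo rows into `tree_new`. Whether the index build plus JOIN beats the in-memory map on real trees is exactly the open question above and would need benchmarking.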
