cache: fix long initialization times #811

Open
Andrew7234 opened this issue Dec 2, 2024 · 0 comments

@Andrew7234
Collaborator

Nexus currently uses pogreb to cache static node queries, which reduces the load on oasis-node and improves performance during reindexing. We maintain a separate cache for each runtime, e.g. /rpc-cache/indexer/consensus.

Pogreb has a known limitation: its recovery process rebuilds the entire index, which can take hours to days for large databases. Recovery is triggered whenever a lockfile is found in the cache directory. This happens often, for example if Nexus is interrupted while it is rebuilding the index.
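
As a rough illustration of the trigger described above, the check below reports whether a leftover lock file is sitting in a cache directory before the cache is opened, so a slow start can at least be logged up front. This is only a sketch: `recoveryPending` is a hypothetical helper, not part of Nexus or pogreb, and the lock file name `"lock"` is an assumption that should be verified against the pogreb version in use.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// recoveryPending reports whether a file named lockName exists in cacheDir.
// lockName is supplied by the caller because the exact name is an assumption
// here; check what the pogreb version in use actually writes.
func recoveryPending(cacheDir, lockName string) (bool, error) {
	_, err := os.Stat(filepath.Join(cacheDir, lockName))
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		// No lock file (or no cache directory yet): nothing to recover.
		return false, nil
	}
	return false, err
}

func main() {
	// Path taken from the issue; "lock" is an assumed file name.
	pending, err := recoveryPending("/rpc-cache/indexer/consensus", "lock")
	if err != nil {
		fmt.Println("could not check lock file:", err)
		return
	}
	if pending {
		fmt.Println("lock file present: opening this cache will likely trigger a full index rebuild")
	} else {
		fmt.Println("no lock file found")
	}
}
```

A check like this does not avoid the rebuild; it only makes the cause of a long startup visible in the logs.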

The consensus cache is by far the largest and thus the most problematic. The following are the cache sizes for each runtime on mainnet Nexus.

root@mainnet-oasis-indexer-analyzer-754947d498-ltl5f:/rpc-cache/indexer# du -sh *
28K	cipher
24K	cipher.backup
7.2T	consensus
5.1G	consensus.backup
87G	emerald
2.2G	emerald.backup
28K	pontusx
57G	sapphire
796M	sapphire.backup

And testnet Nexus

root@testnet-oasis-indexer-analyzer-5b87dc57c9-p9bcq:/rpc-cache/indexer# du -sh *
28K	cipher
24K	cipher.backup
4.5T	consensus
149M	consensus.backup
6.2G	emerald
76M	emerald.backup
3.5G	pontusx_dev
63M	pontusx_dev.backup
5.7G	pontusx_test
63M	pontusx_test.backup
14G	sapphire
80M	sapphire.backup

The Nexus logs show that testnet consensus cache initialization takes ~6 days for 4.5 TB. (I wasn't able to find logs for mainnet consensus, but it is slower.)
[Screenshot of the Nexus logs, 2024-12-01]

We should find a way to eliminate these long initialization times. One option is to shard the consensus cache into multiple pogreb databases, for instance one per network upgrade (mainnet/cobalt/damask/eden), so that an interrupted shard only forces a rebuild of its own index. We could also explore alternative local key-value stores to replace pogreb.

Side note: The logs indicate that the preBackup call might also be taking a long time, especially its deleteFiles call. It would be worth adding log messages with timings and testing on the production deploy to see whether there is an easy optimization. When I checked on Dec 1, the consensus mainnet database held ~1800 segment files and 200 backups (*.bac.bac); a large number of files in the cache directory may be what slows the deleteFiles calls down.
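
The timing suggested above could be added with a small wrapper like the one below. It is a sketch, not the Nexus code: `timed` and `countBySuffix` are hypothetical helpers, and the `deleteFiles` closure in `main` is a stand-in for the real call, whose signature is not shown in this issue.

```go
package main

import (
	"log"
	"os"
	"strings"
	"time"
)

// timed runs fn, logs how long it took under label, and returns the
// elapsed time together with fn's error.
func timed(label string, fn func() error) (time.Duration, error) {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)
	log.Printf("%s took %s (err=%v)", label, elapsed, err)
	return elapsed, err
}

// countBySuffix counts the names that end in suffix, e.g. ".bac".
func countBySuffix(names []string, suffix string) int {
	n := 0
	for _, name := range names {
		if strings.HasSuffix(name, suffix) {
			n++
		}
	}
	return n
}

func main() {
	dir := "." // replace with the cache directory to inspect
	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Fatal(err)
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	// Log the file counts next to the timing so the two can be correlated.
	log.Printf("%d entries in %s, %d ending in .bac", len(names), dir, countBySuffix(names, ".bac"))

	// Stand-in for the real deleteFiles call.
	deleteFiles := func() error {
		time.Sleep(10 * time.Millisecond)
		return nil
	}
	if _, err := timed("deleteFiles", deleteFiles); err != nil {
		log.Printf("deleteFiles failed: %v", err)
	}
}
```

Logging the file count alongside the duration would show directly whether deleteFiles slows down as segment and backup files accumulate.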
