cache: fix long initialization times #811

Andrew7234 · 2024-12-02T07:11:55Z

Nexus currently uses pogreb to cache static node queries, which reduces the load on oasis-node and improves performance during reindexing. We maintain a separate cache for each runtime, e.g. /rpc-cache/indexer/consensus.

Pogreb has a known limitation: during recovery it rebuilds the entire index, which can take hours to days for large databases. Recovery is triggered when a lockfile is found in the cache directory, which often happens when Nexus is interrupted while rebuilding the index.

The consensus cache is by far the largest and thus the most problematic. The following are the cache sizes for each runtime on mainnet Nexus.

root@mainnet-oasis-indexer-analyzer-754947d498-ltl5f:/rpc-cache/indexer# du -sh *
28K	cipher
24K	cipher.backup
7.2T	consensus
5.1G	consensus.backup
87G	emerald
2.2G	emerald.backup
28K	pontusx
57G	sapphire
796M	sapphire.backup

And on testnet Nexus:

root@testnet-oasis-indexer-analyzer-5b87dc57c9-p9bcq:/rpc-cache/indexer# du -sh *
28K	cipher
24K	cipher.backup
4.5T	consensus
149M	consensus.backup
6.2G	emerald
76M	emerald.backup
3.5G	pontusx_dev
63M	pontusx_dev.backup
5.7G	pontusx_test
63M	pontusx_test.backup
14G	sapphire
80M	sapphire.backup

The Nexus logs show that initializing the testnet consensus cache takes ~6 days for 4.5TB. (I wasn't able to find logs for the mainnet consensus cache, but its initialization is slower.)

We should find a way to eliminate these long initialization times. One option is to shard the consensus cache into multiple pogreb dbs, for instance one per upgrade (mainnet/cobalt/damask/eden). We could also explore replacing pogreb with an alternative local kv store.
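A minimal sketch of how the per-upgrade sharding could route a request to a shard directory by block height. The `shard` type, the `shardDir` helper, and especially the start heights are hypothetical placeholders, not the real upgrade heights; each returned directory would hold its own pogreb db. The benefit assumes shards are opened lazily, so that an interruption leaves a lockfile only in the shards that were open at the time.

```go
package main

import (
	"path/filepath"
	"sort"
)

// shard describes one pogreb db covering the heights of a single upgrade.
type shard struct {
	name        string // subdirectory name, e.g. "cobalt"
	startHeight int64  // first block height (inclusive) stored in this shard
}

// shards must be sorted by startHeight.
// The heights below are PLACEHOLDERS, not the real upgrade heights.
var shards = []shard{
	{"mainnet", 0},
	{"cobalt", 1_000},
	{"damask", 2_000},
	{"eden", 3_000},
}

// shardDir returns the directory of the shard that owns the given height.
func shardDir(baseDir string, height int64) string {
	// Find the first shard that starts after height; the one before it owns height.
	i := sort.Search(len(shards), func(i int) bool {
		return shards[i].startHeight > height
	})
	if i == 0 {
		i = 1 // heights below the first start fall into the first shard
	}
	return filepath.Join(baseDir, shards[i-1].name)
}
```

With the placeholder heights, `shardDir("/rpc-cache/indexer/consensus", 1500)` resolves to the `cobalt` subdirectory. Queries that are not keyed by height would need a separate routing rule, which this sketch does not cover.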

Side note: The logs indicate that the preBackup call might also be taking a long time, especially its deleteFiles call. It would be good to add some log messages/timings and test them on the production deploy to see whether there is an easy optimization. When I checked on Dec 1 there were ~1800 segment files and 200 backups (*.bac.bac) in the consensus mainnet database, so the large number of files in the cache directory may be what slows the deleteFiles calls down.
