Currently, larger trees run into an index length limit which prevents more basal nodes from being uploaded correctly. There are a few potential solutions to this problem I would like to explore:
Fulltext index (pros: easy to implement; cons: it also has a length limit, indexing is slower, and full-text search is not really designed for this use, so it may run into issues)
Manual indexing with a Python script (e.g. store the LICA index like the tree source index, which still has an upper limit below 100k; a Python script would load the data with a hashmap built as part of the load process, instead of loading with Cypher and CSV files)
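To make the hashmap-on-load idea concrete, here is a minimal sketch. The function name, the row format, and the ids are all illustrative assumptions, not from the codebase; the point is that keying each LICA by the frozen set of its children sidesteps any database key-length limit.

```python
# Hypothetical sketch: build the LICA lookup as an in-memory hashmap during
# load, instead of relying on a database index over long concatenated keys.
# Names (build_lica_index, lica_id) are illustrative, not from the codebase.
from collections import defaultdict

def build_lica_index(rows):
    """rows: iterable of (lica_id, child_accession) pairs."""
    children = defaultdict(set)
    for lica_id, child in rows:
        children[lica_id].add(child)
    # Key each LICA by the frozen set of its children so lookups are O(1)
    # regardless of how many children there are.
    return {frozenset(kids): lica_id for lica_id, kids in children.items()}

index = build_lica_index([("L1", "A1"), ("L1", "A2"), ("L2", "A3")])
print(index[frozenset({"A1", "A2"})])  # -> L1
```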
An array index on accession id integers (not feasible: the limit is 8167 bytes for an index entry, and each element costs 4 + (length * 4) bytes, so around 186 samples would be the maximum for any LICA).
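For reference, the back-of-envelope arithmetic behind that ceiling, assuming accession ids of roughly 10 characters (the id length is my assumption):

```python
# Check the array-index capacity estimate: a single index entry holds at
# most 8167 bytes, and each element costs 4 + (length * 4) bytes.
INDEX_LIMIT = 8167               # max bytes for one index entry
ID_LENGTH = 10                   # assumed accession id length
per_element = 4 + ID_LENGTH * 4  # 44 bytes per id

max_samples = INDEX_LIMIT // per_element
print(max_samples)  # 185 -- roughly the ~186-sample ceiling noted above
```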
LICA_index nodes with the out-degree stored, connected only to the samples; we could then loop through the child samples, match on all of them, and verify that the out-degree is equal. This is likely slow.
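The matching logic amounts to a subset check plus an exact out-degree check, so a candidate cannot match with extra children. A toy illustration over made-up data (the real query would run in the graph database, not in Python):

```python
# Toy illustration of the out-degree check: a candidate LICA matches only if
# it connects to every queried child sample AND its out-degree equals the
# query size (so it has no extra children). Data here is made up.
graph = {
    "LICA_1": {"A", "B"},
    "LICA_2": {"A", "B", "C"},
}

def find_lica(graph, samples):
    wanted = set(samples)
    for node, out_edges in graph.items():
        if wanted <= out_edges and len(out_edges) == len(wanted):
            return node
    return None

print(find_lica(graph, ["A", "B"]))       # LICA_1
print(find_lica(graph, ["A", "B", "C"]))  # LICA_2
```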
No index. This will get progressively slower as names get longer, since more and more files will need to be searched or held in memory. (Not feasible: this takes 12+ hours for trees with 25+ taxa.)
Separate indexing for LICA nodes. Provide a lookup table in SQLite to translate a composite index (a unique id + source pair) into a list of children, and vice versa. SQLite should work up to ~195M nodes for this purpose, at which point the bottleneck would be tree building/dating.
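A minimal sketch of that lookup table, assuming the composite key is (source, node id) and children are stored as a JSON array; table and column names are illustrative:

```python
# Hypothetical SQLite lookup table translating a composite (source, node id)
# key to its child list; the reverse direction would add an index on the
# children column. Table/column names are illustrative.
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE lica_lookup (
        source   TEXT NOT NULL,
        node_id  TEXT NOT NULL,
        children TEXT NOT NULL,          -- JSON-encoded child accession list
        PRIMARY KEY (source, node_id)    -- the composite index
    )
""")
conn.execute(
    "INSERT INTO lica_lookup VALUES (?, ?, ?)",
    ("tree_42", "n7", json.dumps(["A1", "A2", "A3"])),
)

# Forward lookup: composite key -> children.
row = conn.execute(
    "SELECT children FROM lica_lookup WHERE source=? AND node_id=?",
    ("tree_42", "n7"),
).fetchone()
print(json.loads(row[0]))  # ['A1', 'A2', 'A3']
```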
"Composite" index for tree source and node: two string indexes, one with the tree and one with an auto-generated node number. A text index has a cap of around 32 kB and handles CONTAINS queries efficiently. Maintain both for each LICA and filter on a substring in each, thereby mimicking a composite index but with a much higher upper limit. It should also be more robust to the upcoming index overhaul in version 5. I'll test this one out next.
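The filtering step can be sketched in plain Python (the delimiter scheme and field names are my assumptions; in practice both fields would be indexed strings queried with CONTAINS):

```python
# Sketch of the two-string-index idea: each LICA carries a tree-source string
# and an auto-generated node-number string; filtering both with a substring
# ("CONTAINS") match mimics a composite index. Data is illustrative.
licas = [
    {"id": "L1", "tree_key": "|tree_42|", "node_key": "|n7|"},
    {"id": "L2", "tree_key": "|tree_42|", "node_key": "|n8|"},
    {"id": "L3", "tree_key": "|tree_99|", "node_key": "|n7|"},
]

def composite_lookup(licas, source, node):
    # Delimiters guard against "n7" accidentally matching "n71".
    tree_sub, node_sub = f"|{source}|", f"|{node}|"
    return [l["id"] for l in licas
            if tree_sub in l["tree_key"] and node_sub in l["node_key"]]

print(composite_lookup(licas, "tree_42", "n7"))  # ['L1']
```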
I may have found a solution! This preprint by Chaudhury et al. discusses a slightly different approach to generating TAG nodes. It solves two problems: the indexing (they use bitstrings on load) and the order dependence (a problem I only recently encountered while generating clades). I'll keep reading, but this may fix both major issues! There will still be an upper limit on what is indexable, 256,000 bits, but that seems reasonable. I'll keep working on it.
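The core of the bitstring trick, as I understand it, can be sketched in a few lines (taxon names and helpers here are illustrative, not from the preprint): give each taxon one bit, encode a clade as the OR of its taxa's bits, and test containment with AND. Because OR is commutative, the encoding is order-independent.

```python
# Sketch of the bitstring idea: assign each taxon one bit, represent any
# clade as the OR of its taxa's bits, and test containment with bitwise AND.
# The encoding is order-independent and needs no string index.
taxa = ["A", "B", "C", "D"]
bit = {name: 1 << i for i, name in enumerate(taxa)}

def encode(clade):
    mask = 0
    for taxon in clade:
        mask |= bit[taxon]
    return mask

left = encode(["A", "B"])
root = encode(["A", "B", "C", "D"])

print(encode(["B", "A"]) == left)   # True: order does not matter
print(left & root == left)          # True: {A, B} is contained in the root
```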