Update indexing of LICA nodes for larger graphs #16

glstott · 2022-08-12T20:06:07Z

Currently, larger trees run into an index length limit which prevents more basal nodes from being uploaded correctly. There are a few potential solutions to this problem I would like to explore:

Fulltext index (pros: easy to implement; cons: it also has a limit, indexing is slower, and it's not really what a fulltext index is designed for, so it may run into issues).
Manual indexing with a Python script (e.g. store the LICA index like the tree source index). This still has an upper limit of less than 100k. The script would load the data and build a hashmap as part of the load process, instead of loading with Cypher and CSV files.
An array index on accession ID integers. Not feasible: the limit is 8167 for an index and the size of each element is 4+(length*4), so around 186 samples would be the max for any LICA.
LICA_index nodes with the out-degree stored, connected only to the samples. We could then loop through the child samples, match on all of them, and verify that the out-degree is equal. This is likely slow.
No index. This gets progressively slower with longer and longer names, since more and more files need to be searched or held in memory. Not feasible: this takes 12+ hours for trees with 25+ taxa.
Separate indexing for LICA nodes. Provide a lookup table in SQLite to translate a composite index (unique ID + source pair) into a list of children and vice versa. SQLite should work up to ~195M nodes for this purpose, at which point the bottleneck would be tree building/dating.
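
To make the SQLite option concrete, here is a rough sketch of what such a lookup table could look like, using Python's built-in sqlite3 module. The table name `lica_child`, the column names, and the helper functions are all hypothetical; nothing here is implemented in the repo yet, and this has not been checked against large trees.

```python
import sqlite3


def open_lookup(path=":memory:"):
    """Open (or create) the hypothetical LICA lookup database."""
    conn = sqlite3.connect(path)
    # One row per (source tree, LICA node id, child sample).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lica_child ("
        " source TEXT NOT NULL, node_id INTEGER NOT NULL, child TEXT NOT NULL,"
        " PRIMARY KEY (source, node_id, child))"
    )
    # Secondary index to support lookups that start from a child sample.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS lica_by_child ON lica_child (child, source)"
    )
    return conn


def add_lica(conn, source, node_id, children):
    """Register a LICA node and the samples below it."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO lica_child VALUES (?, ?, ?)",
            [(source, node_id, c) for c in set(children)],
        )


def children_of(conn, source, node_id):
    """Composite index (source, node_id) -> sorted list of children."""
    rows = conn.execute(
        "SELECT child FROM lica_child"
        " WHERE source = ? AND node_id = ? ORDER BY child",
        (source, node_id),
    )
    return [r[0] for r in rows]


def find_lica(conn, source, children):
    """List of children -> node_id within one source, or None if absent."""
    wanted = set(children)
    with conn:
        # A temp table avoids SQLite's cap on bound parameters per statement.
        conn.execute(
            "CREATE TEMP TABLE IF NOT EXISTS wanted (child TEXT PRIMARY KEY)"
        )
        conn.execute("DELETE FROM wanted")
        conn.executemany("INSERT INTO wanted VALUES (?)", [(c,) for c in wanted])
    # A node matches when it has exactly len(wanted) children and all of
    # them appear in the wanted set.
    row = conn.execute(
        "SELECT node_id FROM lica_child WHERE source = ?"
        " GROUP BY node_id"
        " HAVING COUNT(*) = ? AND SUM(child IN (SELECT child FROM wanted)) = ?",
        (source, len(wanted), len(wanted)),
    ).fetchone()
    return None if row is None else row[0]
```

The reverse lookup scans every node of a source, so it would probably need a smarter query (e.g. starting from the rarest child) before it scales to large trees.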

glstott · 2022-09-12T20:52:55Z

"Composite" index for tree source and node: two string indexes, one on the tree and one on the auto-generated node number. A text index has a cap of around 32 KB and handles contains queries efficiently. Maintain both for each LICA and filter on a substring for each, thereby mimicking a composite index but with a much higher upper limit. It will also be more robust to the upcoming index overhaul in version 5. I'll test this one out next.

glstott · 2022-10-17T14:40:12Z

I may have found a solution! This preprint by Chaudhury et al. discusses a slightly different approach for generating TAG nodes. It solves two problems: the indexing (they use bitstrings on load) and the order dependence (a problem I only recently encountered when generating clades). I'll keep reading, but this may fix both major issues! There will still be an upper limit to what is indexable, 256,000 bits, but that seems reasonable. I'll keep working on it.
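
For anyone following along, here is a minimal sketch of the general bitstring idea as I understand it, not the preprint's actual algorithm. It assumes every sample gets a fixed bit position up front; the function names are made up for illustration.

```python
def build_bit_positions(taxa):
    """Assign each sample a fixed bit position (hypothetical helper)."""
    return {name: i for i, name in enumerate(sorted(set(taxa)))}


def clade_key(samples, positions):
    """OR together one bit per sample; the result ignores input order."""
    key = 0
    for name in samples:
        key |= 1 << positions[name]
    return key


def is_subclade(inner, outer):
    """True when every sample in `inner` is also in `outer`."""
    return inner & outer == inner


def to_bitstring(key, width):
    """Render a key as a fixed-width string of 0s and 1s for storage."""
    return format(key, "0{}b".format(width))


positions = build_bit_positions(["C", "A", "B", "D"])
ac = clade_key(["A", "C"], positions)
ca = clade_key(["C", "A"], positions)
abc = clade_key(["A", "B", "C"], positions)
print(to_bitstring(ac, len(positions)), ac == ca, is_subclade(ac, abc))
```

Because the key depends only on which samples are present, the same clade gets the same key no matter what order the trees or samples are loaded in. The key is one bit per sample, so the number of samples is bounded by whatever bit limit the index allows.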

glstott · 2022-10-18T18:21:58Z

https://link.springer.com/content/pdf/10.1007/978-3-319-21233-3.pdf

glstott self-assigned this Sep 2, 2022
