Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update indexing of LICA nodes for larger graphs #16

Open
2 of 6 tasks
glstott opened this issue Aug 12, 2022 · 3 comments
Open
2 of 6 tasks

Update indexing of LICA nodes for larger graphs #16

glstott opened this issue Aug 12, 2022 · 3 comments
Assignees

Comments

@glstott
Copy link
Owner

glstott commented Aug 12, 2022

Currently, larger trees run into an index length limit which prevents more basal nodes from being uploaded correctly. There are a few potential solutions to this problem I would like to explore:

  • Fulltext index (pros: easy to implement; cons: also has a limit, slower indexing, not really what its designed for so may run into issues)
  • Manual indexing with a python script (e.g. store LICA index like tree source index (still has an upper limit less than 100k, python script that loads data with hashmap built in as part of load process instead of loading using cypher and csv files)
  • An array index on accession id integers (Not feasible. Limit is 8167 for an index, sizing for each element is 4+(length*4) so around 186 samples would be the max for any LICA).
  • LICA_index nodes with out degree stored, connected to just the samples, we could then loop through the child samples, match on all, and verify the out degree is equal. This is likely slow.
  • No index. This will get progressively slower with longer and longer names as more and more files will need to be searched/housed in memory. (not feasible. This takes 12+hours for trees with 25+ taxa)
  • Separate indexing for LICA nodes. Provide a lookup table in SQLite to translate a composite index (unique id+source pair) into list of children and vice versa. SQLite should work up to ~195M nodes for this purpose @ which point the bottleneck would be treebuilding/dating.
@glstott glstott self-assigned this Sep 2, 2022
@glstott
Copy link
Owner Author

glstott commented Sep 12, 2022

  • "Composite" index for tree source and node. Two string indexes, one with the tree and one with the node # auto-generated. Text index has a cap of around 32kb. Efficiently uses contains queries. Maintain both for each LICA, filter on substring for each, thereby mimicking a composite index but with a much higher upper limit. It will also be more robust to the upcoming index overhaul in version 5. I'll test this one out next.

@glstott
Copy link
Owner Author

glstott commented Oct 17, 2022

I may have found a solution! This preprint by chaudhury et al. discusses a slightly different approach for generating TAG nodes. It solves two problems, the indexing (they use bitstrings on load), and the order dependence (which is a problem I only recently encountered with generating clades). I'll keep reading, but this may fix both major issues! There will still be an upper limit to what is indexable, 256,000 bits, but that seems reasonable. I'll keep working on it.

@glstott
Copy link
Owner Author

glstott commented Oct 18, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant