-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BTrees in Motoko. #396
base: master
Are you sure you want to change the base?
BTrees in Motoko. #396
Changes from 5 commits
c35b3d9
b95320f
d81a1c3
142a0d0
ffd99ec
61334ec
0d06506
273fb6f
9db84ca
4393a81
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
/// Imperative sequences as B-Trees. | ||
|
||
import A "Array"; | ||
import I "Iter"; | ||
import List "List"; | ||
import Option "Option"; | ||
import Order "Order"; | ||
import P "Prelude"; | ||
import Prim "mo:⛔"; | ||
|
||
module { | ||
|
||
/// Constants we use to shape the tree. | ||
/// See https://en.wikipedia.org/wiki/B-tree#Definition | ||
module Constants { | ||
let MAX_CHILDREN = 4; | ||
}; | ||
|
||
public type Compare<K> = { | ||
compare : (K, K) -> Order.Order | ||
}; | ||
|
||
public type Data<K, V> = [(K, V)]; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Curious, what's the benefit of the
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a slight space savings to the tuple (1 word per instance). Otherwise, I'd prefer the record with labeled fields. I actually had that same record definition initially, but recalled this recent review from Claudio for the HashMap improvements. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, why is
I'm probably missing something as it's early and I definitely didn't get enough sleep last night 😅 Is there a reference implementation you used for this that I can peek through? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
|
||
public type Index<K, V> = { | ||
data : Data<K, V>; | ||
trees : [Tree<K, V>]; | ||
}; | ||
|
||
public type Tree<K, V> = { | ||
#index : Index<K, V>; | ||
#data : Data<K, V>; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can the Tree variant's values be renamed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, except that I'd prefer to call it "internal" and "leaf", since both are kinds of "nodes" in my mind. "Index" seemed more natural when I was misunderstanding the structure of a B-Tree, and thinking of it as a B+-Tree (I guess?) where the data is only at the leaves; then the internal nodes only store keys, and serve as an index. Now, I've adopted the standard B-Tree definition where key-value pairs can be internal too, and the name "index" is more inappropriate. BTW, as one reference for "standard", I've been looking at the Rust implementation in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense, I'm fine with both of those as long as there's a comment above the type definition for the reader. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Went with "internal" and "leaf". What should the comment say? |
||
}; | ||
|
||
func find_data<K, V>(data : Data<K, V>, find_k : K, c : Compare<K>) : ?V { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
So the function signature can become
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what's the advantage of that? Less cognitive overhead in some way? I actually like the record with a function inside pattern, even with just one function: If I want to add debugging later, or any other operations, it helps to have a record in place already to place those operations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't change the API because of concern about adding debugging. If you want backwards compatibility and a more flexible API, why not just make the function parameters a record?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay. I've removed that record-of-function from the public-facing API used in tests, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, it was indeed useful to keep in in the checking API (just for tests), since then I could debug by printing out keys from within the BTree code. Without having a |
||
for ((k, v) in data.vals()) { | ||
if (c.compare(k, find_k) == #equal) { return ?v }; | ||
}; | ||
return null | ||
}; | ||
|
||
func find<K, V>(t : Tree<K, V>, k : K, c : Compare<K>) : ?V { | ||
switch t { | ||
case (#data(d)) { return find_data<K, V>(d, k, c) }; | ||
case (#index(i)) { | ||
for (j in I.range(0, i.data.size())) { | ||
switch (c.compare(k, i.data[j].0)) { | ||
case (#equal) { return ?i.data[j].1 }; | ||
case (#less) { return find<K, V>(i.trees[j], k, c) }; | ||
case _ { } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, true! OTOH, that would not be a "baby step". : ) Ideally, we'd profile both variations, at various choices of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there any difference in performance in terms of what it gets transformed down to? I'm planning on writing an imperative RBTree later this week to test out the RBTree performance issues I was running into in #390 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It might end up being the case that a BTree with index/sub-tree size 16/32/64 removes this issue since the tree is more shallow and there's less rebalancing steps required 🤷♂️ Just for context, the reason the insertion performance is important to me is that I'd prefer to find a balanced data structure solution that avoids hitting the message instruction limit and doesn't rely on deterministic time slicing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, I think B-Trees will be shallow for any realistic internal node size (16, 32, etc.). As mentioned above, in Motoko, any self-contained tail-recursive function (not two mutually-recursive ones) will become a single while loop without a stack during compilation. For any realistic BTree, that stack-free loop should run at most a handful of iterations (less than 10), so the cycle limit will certainly not matter for searching, even if you need to search for a bunch of records in random positions all within one message response time. |
||
} | ||
}; | ||
find<K, V>(i.trees[i.data.size()], k, c) | ||
}; | ||
}; | ||
}; | ||
|
||
/// Check that a B-Tree instance observes invariants of B-Trees. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this module is just for testing/debugging, should When I import the default BTree module like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This module is static. The compiler will not compile static code that you import but do not use. (However, class instances will always contain all methods, regardless of usage. Not applicable here.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why? It's more enclosed and encapsulated here. There is no way to "hide" a top-level module in |
||
/// Invariants ensure performance is what we expect. | ||
/// For testing and debugging. | ||
public module Check { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Didn't know we could have more than one module in a file! I learn something new every day :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep! And they can even vary in whether they are public or private, and nest to any depth. All imports have to be at the top of the file though (very unlike Rust). |
||
|
||
type CompareOp<K> = { | ||
compare : (?K, ?K) -> Order.Order | ||
}; | ||
|
||
func compareOp<K>(c : Compare<K>) : CompareOp<K> = { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. similar comment on the |
||
compare = func (k1 : ?K, k2 : ?K) : Order.Order { | ||
switch (k1, k2) { | ||
case (null, null) { assert false; loop {} }; | ||
case (null, _) #less; | ||
case (_, null) #greater; | ||
case (?k1, ?k2) c.compare(k1, k2) | ||
} | ||
} | ||
}; | ||
|
||
public func check<K, V>(c : Compare<K>, t : Tree<K, V>) { | ||
rec(null, compareOp(c), t, null) | ||
}; | ||
|
||
func rec<K, V>(lower : ?K, c : CompareOp<K>, t : Tree<K, V>, upper : ?K) { | ||
switch t { | ||
case (#data(d)) { data(lower, c, d, upper) }; | ||
case (#index(i)) { index(lower, c, i, upper) }; | ||
} | ||
}; | ||
|
||
func data<K, V>(lower : ?K, c : CompareOp<K>, d : Data<K, V>, upper : ?K) { | ||
var prev_k : ?K = null; | ||
for ((k, _) in d.vals()) { | ||
assert (c.compare(prev_k, ?k) != #greater); | ||
assert (c.compare(lower, ?k) != #greater); | ||
assert (c.compare(?k, upper) != #greater); | ||
matthewhammer marked this conversation as resolved.
Show resolved
Hide resolved
|
||
prev_k := ?k; | ||
} | ||
}; | ||
|
||
func index<K, V>(lower : ?K, c : CompareOp<K>, i : Index<K, V>, upper : ?K) { | ||
assert (i.data.size() + 1 == i.trees.size()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does this keep the invariant? I thought There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps I don't understand the question entirely, but the arrays of (sub) From that work, I get that this property holds (related to
(abusing notation to compare sets of keys with a single key; what I mean is every key in each set observes the relevant comparison operator holding.) I found it very helpful to look at pictures that people drew, and draw some myself. For instance, the one on the wikipedia page shows this pattern where keys are shown between consecutive subtrees.. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PS: I strongly suspect that this invariant code is wrong. It's a baby step that will need to be revised, probably a few times. Thanks for taking a look! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Based on briefly looking at that wikipedia page image, let me know if I'm understanding this correctly... So for each new level/sub-tree the "trees" at that level get instantiated to be a specific size array with each data slot containing a default "empty" value until filled. I understand this invariant a bit better now given my response to this comment of yours. It has to do that you have both a data array and a sub-trees array - which I argue could be folded into one array, and just have a single data tuple (and not a data tuple array). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm cribbing the type structure from this Rust implementation from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What makes the B-Tree "tricky" as a simple type def, IMO is the combination of array structure and tree structure, and how it mixes them together in an efficient implementation that uses arrays to organize things of the same type. In a "usual" internal node, there is a key-value between each pair of consecutive sub-trees, giving a key that separates these sub-trees' keys. But there is no key-value pair on either end of the sub-tree list, so in total, there is one fewer key-value pairs than sub-trees. For instance, see this line of the checking algorithm. The Rust implementation linked above does this, based on me reading the comments (not the code itself). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it, I understand completely now! Thanks for explaining. |
||
data(lower, c, i.data, upper); | ||
for (j in I.range(0, i.trees.size())) { | ||
let lower_ = if (j == 0) { lower } else { ?(i.data[j - 1].0) }; | ||
let upper_ = if (j == i.data.size()) { upper } else { ?(i.data[j]).0 }; | ||
rec<K, V>(lower_, c, i.trees[j], upper_) | ||
} | ||
}; | ||
}; | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the max 4? https://panthema.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html shows that 32-128 perform considerably better at large n.
I think it therefore might make sense to allow the developer to configure larger child values (i.e. 4, 16, 32, 64, 128, 256). I'd be curious to run some performance tests inserting a running counter or batch of elements (ordered or unordered) to see what difference this might make as the tree grows in size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to write simple tests before adjusting this number into something that varies, which I agree is desirable.