Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTrees in Motoko. #396

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
110 changes: 110 additions & 0 deletions src/BTree.mo
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
/// Imperative sequences as B-Trees.

import A "Array";
import I "Iter";
import List "List";
import Option "Option";
import Order "Order";
import P "Prelude";
import Prim "mo:⛔";

module {

/// Constants we use to shape the tree.
/// See https://en.wikipedia.org/wiki/B-tree#Definition
module Constants {
let MAX_CHILDREN = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the max 4? https://panthema.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html shows that 32-128 perform considerably better at large n.

I think it therefore might make sense to allow the developer to configure larger child values (i.e. 4, 16, 32, 64, 128, 256). I'd be curious to run some performance tests inserting a running counter or batch of elements (ordered or unordered) to see what difference this might make as the tree grows in size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to write simple tests before adjusting this number into something that varies, which I agree is desirable.

};

public type Compare<K> = {
compare : (K, K) -> Order.Order
};

public type Data<K, V> = [(K, V)];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, what's the benefit of the Data<K, V> type being the tuple [(K, V)] over something like

public type Data<K, V> = {
  key: K;
  value: V;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a slight space savings to the tuple (1 word per instance). Otherwise, I'd prefer the record with labeled fields.

I actually had that same record definition initially, but recalled this recent review from Claudio for the HashMap improvements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why is Data an array and not just (K, V)? I thought the data of each leaf in the index was already folded into the index type here

  public type Index<K, V> = {
    data : Data<K, V>; // data value pointing to this index
    trees : [Tree<K, V>]; // each of these sub-trees in the array has their own (K, V)
  };

I'm probably missing something as it's early and I definitely didn't get enough sleep last night 😅

Is there a reference implementation you used for this that I can peek through?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


public type Index<K, V> = {
data : Data<K, V>;
trees : [Tree<K, V>];
};

public type Tree<K, V> = {
#index : Index<K, V>;
#data : Data<K, V>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the Tree variant's values be renamed #node and #leaf? Index makes it sound more like a hash table type of lookup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, except that I'd prefer to call it "internal" and "leaf", since both are kinds of "nodes" in my mind.

"Index" seemed more natural when I was misunderstanding the structure of a B-Tree, and thinking of it as a B+-Tree (I guess?) where the data is only at the leaves; then the internal nodes only store keys, and serve as an index.

Now, I've adopted the standard B-Tree definition where key-value pairs can be internal too, and the name "index" is more inappropriate.

BTW, as one reference for "standard", I've been looking at the Rust implementation in std, and the helpful comments in particular. Hopefully this implementation will be much (much!) shorter, at least initially.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'm fine with both of those as long as there's a comment above the type definition for the reader.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with "internal" and "leaf".

What should the comment say?

};

func find_data<K, V>(data : Data<K, V>, find_k : K, c : Compare<K>) : ?V {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c can just be of type (K, K) -> Order.Order.

So the function signature can become

func find_data<K, V>(data : Data<K, V>, find_k : K, compareTo: (K, K) -> Order.Order) : ?V {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the advantage of that? Less cognitive overhead in some way?

I actually like the record with a function inside pattern, even with just one function: If I want to add debugging later, or any other operations, it helps to have a record in place already to place those operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't change the API because of concern about adding debugging.

If you want backwards compatibility and a more flexible API, why not just make the function parameters a record?

func find_data<K, V>(options: {
  data : Data<K, V>;
  find_k : K;
  compareTo: (K, K) -> Order.Order;
  // add new optional parameters here later if desired
}) : ?V {
 ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I've removed that record-of-function from the public-facing API used in tests, etc.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, it was indeed useful to keep in in the checking API (just for tests), since then I could debug by printing out keys from within the BTree code.

Without having a show function for the key type, there's no way for the generic code to print them during debugging phases of development.

for ((k, v) in data.vals()) {
if (c.compare(k, find_k) == #equal) { return ?v };
};
return null
};

func find<K, V>(t : Tree<K, V>, k : K, c : Compare<K>) : ?V {
switch t {
case (#data(d)) { return find_data<K, V>(d, k, c) };
case (#index(i)) {
for (j in I.range(0, i.data.size())) {
switch (c.compare(k, i.data[j].0)) {
case (#equal) { return ?i.data[j].1 };
case (#less) { return find<K, V>(i.trees[j], k, c) };
case _ { }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since trees is an ordered array, this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be a great place to do a modified binary search on the i.trees array to find the correct subtree to navigate down.

Yes, true!

OTOH, that would not be a "baby step". : )

Ideally, we'd profile both variations, at various choices of m (the branching factor for B-Trees seems to be called m by everyone for some reason) and we'd have some way to know if there's a benefit, and how much. There may be little benefit when m is small, or it may even be worse (?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are there any (scalability) concerns about the call stack size since we're using recursion vs. a loop?

Within the compiler, tail recursion to the same function gets transformed into a loop, with no stack/recursion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference in performance in terms of what it gets transformed down to? I'm planning on writing an imperative RBTree later this week to test out the RBTree performance issues I was running into in #390

Copy link
Contributor

@ByronBecker ByronBecker Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might end up being the case that a BTree with index/sub-tree size 16/32/64 removes this issue since the tree is more shallow and there's less rebalancing steps required 🤷‍♂️

Just for context, the reason the insertion performance is important to me is that I'd prefer to find a balanced data structure solution that avoids hitting the message instruction limit and doesn't rely on deterministic time slicing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think B-Trees will be shallow for any realistic internal node size (16, 32, etc.).

As mentioned above, in Motoko, any self-contained tail-recursive function (not two mutually-recursive ones) will become a single while loop without a stack during compilation.

For any realistic BTree, that stack-free loop should run at most a handful of iterations (less than 10), so the cycle limit will certainly not matter for searching, even if you need to search for a bunch of records in random positions all within one message response time.

}
};
find<K, V>(i.trees[i.data.size()], k, c)
};
};
};

/// Check that a B-Tree instance observes invariants of B-Trees.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this module is just for testing/debugging, should Check be in a different file?

When I import the default BTree module like import BTree "mo:base/BTree";, does it shake out the Check, or does it include it and bloat the import size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it shake out the Check, or does it include it and bloat the import size?

This module is static. The compiler will not compile static code that you import but do not use. (However, class instances will always contain all methods, regardless of usage. Not applicable here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this module is just for testing/debugging, should Check be in a different file?

Why? It's more enclosed and encapsulated here. There is no way to "hide" a top-level module in base.

/// Invariants ensure performance is what we expect.
/// For testing and debugging.
public module Check {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know we could have more than one module in a file! I learn something new every day :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! And they can even vary in whether they are public or private, and nest to any depth.

All imports have to be at the top of the file though (very unlike Rust).


type CompareOp<K> = {
compare : (?K, ?K) -> Order.Order
};

func compareOp<K>(c : Compare<K>) : CompareOp<K> = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment on the Compare type to above

compare = func (k1 : ?K, k2 : ?K) : Order.Order {
switch (k1, k2) {
case (null, null) { assert false; loop {} };
case (null, _) #less;
case (_, null) #greater;
case (?k1, ?k2) c.compare(k1, k2)
}
}
};

public func check<K, V>(c : Compare<K>, t : Tree<K, V>) {
rec(null, compareOp(c), t, null)
};

func rec<K, V>(lower : ?K, c : CompareOp<K>, t : Tree<K, V>, upper : ?K) {
switch t {
case (#data(d)) { data(lower, c, d, upper) };
case (#index(i)) { index(lower, c, i, upper) };
}
};

func data<K, V>(lower : ?K, c : CompareOp<K>, d : Data<K, V>, upper : ?K) {
var prev_k : ?K = null;
for ((k, _) in d.vals()) {
assert (c.compare(prev_k, ?k) != #greater);
assert (c.compare(lower, ?k) != #greater);
assert (c.compare(?k, upper) != #greater);
matthewhammer marked this conversation as resolved.
Show resolved Hide resolved
prev_k := ?k;
}
};

func index<K, V>(lower : ?K, c : CompareOp<K>, i : Index<K, V>, upper : ?K) {
assert (i.data.size() + 1 == i.trees.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this keep the invariant? I thought i.data was just [(K, V)], whereas i.trees could be any size array specified by the BTree child size

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I don't understand the question entirely, but the arrays of (sub)trees and keys (in data) are highly related under my current understanding of B-Trees, and the way they work in the rust implementation.

From that work, I get that this property holds (related to assert (i.data.size() + 1 == i.trees.size());)

For each pair of consecutive subtrees (t1, t2), there is a key k12 (and value v12) such that keysOf(t1) < k12 < keysOf(t2)

(abusing notation to compare sets of keys with a single key; what I mean is every key in each set observes the relevant comparison operator holding.)

I found it very helpful to look at pictures that people drew, and draw some myself. For instance, the one on the wikipedia page shows this pattern where keys are shown between consecutive subtrees..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: I strongly suspect that this invariant code is wrong. It's a baby step that will need to be revised, probably a few times. Thanks for taking a look!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on briefly looking at that wikipedia page image, let me know if I'm understanding this correctly...

So for each new level/sub-tree the "trees" at that level get instantiated to be a specific size array with each data slot containing a default "empty" value until filled.

I understand this invariant a bit better now given my response to this comment of yours.

It has to do that you have both a data array and a sub-trees array - which I argue could be folded into one array, and just have a single data tuple (and not a data tuple array).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@matthewhammer matthewhammer Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes the B-Tree "tricky" as a simple type def, IMO is the combination of array structure and tree structure, and how it mixes them together in an efficient implementation that uses arrays to organize things of the same type.

In a "usual" internal node, there is a key-value between each pair of consecutive sub-trees, giving a key that separates these sub-trees' keys.

But there is no key-value pair on either end of the sub-tree list, so in total, there is one fewer key-value pairs than sub-trees.

For instance, see this line of the checking algorithm.

The Rust implementation linked above does this, based on me reading the comments (not the code itself).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I understand completely now! Thanks for explaining.

data(lower, c, i.data, upper);
for (j in I.range(0, i.trees.size())) {
let lower_ = if (j == 0) { lower } else { ?(i.data[j - 1].0) };
let upper_ = if (j == i.data.size()) { upper } else { ?(i.data[j]).0 };
rec<K, V>(lower_, c, i.trees[j], upper_)
}
};
};

}