Batch NewLeaf node work in base API #349
Conversation
Signed-off-by: Ignacio Hagopian <[email protected]>
```go
cfg := GetConfig()

// C1.
var c1poly [NodeWidth]Fr
var c1 *Point
count := fillSuffixTreePoly(c1poly[:], values[:NodeWidth/2])
containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
	c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
	c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
if containsEmptyCodeHash {
	// Clear out values of the cached point.
	c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
	c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
	// Calculate the remaining part of c1 and add to the base value.
	partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
	c1 = new(Point)
	c1.Add(&EmptyCodeHashPoint, partialc1)
} else {
	c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
}

// C2.
var c2poly [NodeWidth]Fr
count = fillSuffixTreePoly(c2poly[:], values[NodeWidth/2:])
c2 := cfg.CommitToPoly(c2poly[:], NodeWidth-count)

// Root commitment preparation for calculation.
stem = stem[:StemSize] // enforce a 31-byte length
var poly [NodeWidth]Fr
poly[0].SetUint64(1)
StemFromBytes(&poly[1], stem)
toFrMultiple([]*Fr{&poly[2], &poly[3]}, []*Point{c1, c2})
```
We stop doing work in `NewLeafNode` and allow `c1`, `c2` and `commitment` to be `nil`.
```go
func (n *InternalNode) findNewLeafNodes(newLeaves []*LeafNode) []*LeafNode {
	for idx := range n.cow {
		child := n.children[idx]
		if childInternalNode, ok := child.(*InternalNode); ok && len(childInternalNode.cow) > 0 {
			newLeaves = childInternalNode.findNewLeafNodes(newLeaves)
		} else if leafNode, ok := child.(*LeafNode); ok {
			if leafNode.commitment == nil {
				newLeaves = append(newLeaves, leafNode)
			}
		}
	}
	return newLeaves
}
```
See L634 for the context of this; it's quite simple.
```go
// New leaf nodes.
newLeaves := make([]*LeafNode, 0, 64)
newLeaves = n.findNewLeafNodes(newLeaves)
if len(newLeaves) > 0 {
	batchCommitLeafNodes(newLeaves)
}
```
OK, so now in `(*InternalNode).Commit()` we can't continue assuming all `LeafNode`s are prepared. We're now responsible for detecting any newly created `LeafNode`s and doing the CPU-heavy work.
Note that I mention *new* leaf nodes: existing `LeafNode`s were kept up to date with the usual diff-updating.
What `findNewLeafNodes(...)` does is quite simple: walk the tree and look for `LeafNode`s that have `commitment == nil`, which signals a newly created leaf node.
`batchCommitLeafNodes(...)` is a twist on the previous `BatchNewLeafNodes(...)` that we had in the specialized API that was removed. I'll comment on it later below.
```go
if n.commitment == nil {
	n.values[index] = value
	return
}
```
If we receive an update for a key in a leaf that was created before, we simply put the value and move on.
We don't do diff-updating for obvious reasons: there's no previous commitment.
This is good: if a new leaf node is created and then touched and re-touched in the same block execution, those updates are very cheap now.
```go
if n.commitment == nil {
	for i, v := range values {
		if len(v) != 0 && !bytes.Equal(v, n.values[i]) {
			n.values[i] = v
		}
	}
	return
}
```
Same here when we update multiple keys.
that's great, but could you just add a comment so that we remember why that is when debugging later on? Same thing with L991
Looks like adding comments breaks tests? 🤷 ha
Looking...
@gballet, `go test ./... -race` looks to be working fine on my machine. I wonder if this is one of those weird CI things we've seen before, where the wrong code gets pulled.
Do you mind running `go test ./... -race` on your machine, to double-check?
```diff
@@ -1297,6 +1307,7 @@ func (n *LeafNode) GetProofItems(keys keylist) (*ProofElements, []byte, [][]byte
 // Serialize serializes a LeafNode.
 // The format is: <nodeType><stem><bitlist><c1comm><c2comm><children...>
 func (n *LeafNode) Serialize() ([]byte, error) {
+	n.Commit()
```
Now we always make sure the leaf node is committed before serializing it.
```go
func batchCommitLeafNodes(leaves []*LeafNode) {
	minBatchSize := 8
	if len(leaves) < minBatchSize {
		commitLeafNodes(leaves)
		return
	}

	batchSize := len(leaves) / runtime.NumCPU()
	if batchSize < minBatchSize {
		batchSize = minBatchSize
	}

	var wg sync.WaitGroup
	for start := 0; start < len(leaves); start += batchSize {
		end := start + batchSize
		if end > len(leaves) {
			end = len(leaves)
		}
		wg.Add(1)
		go func(leaves []*LeafNode) {
			defer wg.Done()
			commitLeafNodes(leaves)
		}(leaves[start:end])
	}
	wg.Wait()
}
```
OK, so if you remember, this is what we call once we've collected all the new leaf nodes in `(*InternalNode).Commit()`.
The idea in this method is the following:
- If we have fewer than 8 new leaves, avoid spinning up any new goroutine and do the work serially (but still batching the work; you'll see `commitLeafNodes(...)` below).
- If we have more than 8 new leaves, we split the work into groups of at least 8 leaves each.
The point of the minimum is that even if we have 16 cores and 16 leaves, it doesn't make sense to spin up 16 goroutines, since that's a lot of goroutine overhead for the amount of work. This logic makes sure each goroutine does the work for at least 8 leaf nodes, so the overhead of spinning it up is well justified.
That's just a good practice of not blindly splitting work across the number of cores, but making sure each worker has enough to do to justify the scheduling overhead.
Note that 8 is a magic number I hand-waved based on the usual amount of CPU work per leaf. This leans toward staying single-core, and going multi-core (up to the N cores in your machine) only when clearly justified. We can also experiment with this number; the rest of the logic will adapt on its own.
```go
func commitLeafNodes(leaves []*LeafNode) {
	cfg := GetConfig()

	c1c2points := make([]*Point, 2*len(leaves))
	c1c2frs := make([]*Fr, 2*len(leaves))
	for i, n := range leaves {
		// C1.
		var c1poly [NodeWidth]Fr
		count := fillSuffixTreePoly(c1poly[:], n.values[:NodeWidth/2])
		containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
			c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
			c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
		if containsEmptyCodeHash {
			// Clear out values of the cached point.
			c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
			c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
			// Calculate the remaining part of c1 and add to the base value.
			partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
			n.c1 = new(Point)
			n.c1.Add(&EmptyCodeHashPoint, partialc1)
		} else {
			n.c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
		}

		// C2.
		var c2poly [NodeWidth]Fr
		count = fillSuffixTreePoly(c2poly[:], n.values[NodeWidth/2:])
		n.c2 = cfg.CommitToPoly(c2poly[:], NodeWidth-count)

		c1c2points[2*i], c1c2points[2*i+1] = n.c1, n.c2
		c1c2frs[2*i], c1c2frs[2*i+1] = new(Fr), new(Fr)
	}

	toFrMultiple(c1c2frs, c1c2points)

	var poly [NodeWidth]Fr
	poly[0].SetUint64(1)
	for i, nv := range leaves {
		StemFromBytes(&poly[1], nv.stem)
		poly[2] = *c1c2frs[2*i]
		poly[3] = *c1c2frs[2*i+1]

		nv.commitment = cfg.CommitToPoly(poly[:], 252)
	}
}
```
This is what each goroutine (if spun up) runs for its batch. This code isn't new, but a twist/mixture of what we did previously in `NewLeafNode` and the previous `BatchNewLeafNode`.
```go
// ***Insert the key pairs with optimized strategy & methods***
rand = mRand.New(mRand.NewSource(42)) //skipcq: GSC-G404
tree = genRandomTree(rand, treeInitialKeyValCount)
randomKeyValues = genRandomKeyValues(rand, migrationKeyValueCount)

now = time.Now()
// Create LeafNodes in batch mode.
nodeValues := make([]BatchNewLeafNodeData, 0, len(randomKeyValues))
curr := BatchNewLeafNodeData{
	Stem:   randomKeyValues[0].key[:StemSize],
	Values: map[byte][]byte{randomKeyValues[0].key[StemSize]: randomKeyValues[0].value},
}
for _, kv := range randomKeyValues[1:] {
	if bytes.Equal(curr.Stem, kv.key[:StemSize]) {
		curr.Values[kv.key[StemSize]] = kv.value
		continue
	}
	nodeValues = append(nodeValues, curr)
	curr = BatchNewLeafNodeData{
		Stem:   kv.key[:StemSize],
		Values: map[byte][]byte{kv.key[StemSize]: kv.value},
	}
}
// Append last remaining node.
nodeValues = append(nodeValues, curr)

// Create all leaves in batch mode so we can optimize cryptography operations.
newLeaves := BatchNewLeafNode(nodeValues)
if err := tree.(*InternalNode).InsertMigratedLeaves(newLeaves, nil); err != nil {
	t.Fatalf("failed to insert key: %v", err)
}

batchedRoot := tree.Commit().Bytes()
if _, err := tree.(*InternalNode).BatchSerialize(); err != nil {
	t.Fatalf("failed to serialize batched tree: %v", err)
}
batchedDuration += time.Since(now)

if unbatchedRoot != batchedRoot {
	t.Fatalf("expected %x, got %x", unbatchedRoot, batchedRoot)
}
```
All this is gone, since we no longer have a base API vs. a batch API; the fast version *is* the base API.
We can evaluate later whether it makes sense to keep this test/benchmark; for now it can give useful information, as shown in the PR description.
```diff
@@ -1232,7 +1190,7 @@ func genRandomKeyValues(rand *mRand.Rand, count int) []keyValue {
 	return ret
 }
 
-func BenchmarkBatchLeavesInsert(b *testing.B) {
+func BenchmarkNewLeavesInsert(b *testing.B) {
```
Now this benchmark uses the base API, since that should be the fastest way to insert new leaves; so it got simpler.
Before:

```
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBatchLeavesInsert-32    13    85987042 ns/op    51667848 B/op    114859 allocs/op
PASS
ok      github.com/gballet/go-verkle    2.546s
```
After with 8:

```
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkNewLeavesInsert-32    14    86549506 ns/op    49047049 B/op    87847 allocs/op
PASS
ok      github.com/gballet/go-verkle    2.428s
```
So it doesn't seem to have a great performance impact on my machine. The reduction in allocations, however, is quite welcome.
Tweaking the batch size doesn't seem to have any impact. I will run it on the replay benchmark as soon as I've fixed the bugs in my new conversion code.
This PR removes the new-ish specialized API for inserting new leaves that we created in #343, and does some internal refactors to get equal performance just using the base API (i.e: `Insert(key, value)`).

The high-level idea is to avoid doing CPU-heavy work in `NewLeafNode(...)` and let the `commitment` be nil. We make some code sections aware that this case can happen; it simply means the node is an uncommitted new leaf node. In particular, when we call:

- `(*InternalNode).Commit(...)`: we stop assuming the leaves are prepared. Instead, we detect all uncommitted leaves and use the batch-style creation of a set of uncommitted leaves that I created in #343 (Overlay tree migration explorations), with some twists but the same idea.
- `(*LeafNode).Commit(...)`: we commit that single `LeafNode` using the same batch method but with a single element, so the logic is reduced.

I also refined the way the new-leaf-node batching works by creating hot paths for single-leaf-node creation, and only going full steam with all cores if there's enough work to do. (More about this in the PR comments.)

Note that we aren't doing exactly #314. There's no COW here in leaves, since we're exclusively focusing on optimizing freshly created leaves, which is exactly the case for overlay tree key/value migrations. But this optimization will also be exploited under normal block-execution circumstances, since new/fresh leaves are created there too.

The logic for updating values in existing leaves remains the same as today: diff updating. If we ever want to optimize this case, it might make sense to introduce the COW idea. But that needs more justification, since it introduces a new map and more logical complexity. This PR doesn't introduce extra memory overhead.
Below are the comparisons between the current new-ish batch API for key/value migration and the new version of `Insert(key, value)`.

Synthetic insertion

As a refresher, this was a simulation of a pre-existing tree with X random key/values, measuring how long it takes to insert another Y random key/values.

Below I repeat the output of our current `master` branch. Note that the important number is `batched XXms`, since that shows how fast the new APIs we introduced run:

Below is the same test/benchmark in this PR, i.e: using `Insert(key, value)`, what we called "unbatched" above (i.e: no special APIs used):

Benchmark-style measurement

In the previous PR, I also introduced a more formal benchmark.

Before (using the new-ish batched API), it took ~101ms per op:

After (this PR), using plain `Insert(key, value)` takes ~102ms per op:

TL;DR: calls to `Insert(..., ...)` that create new `LeafNode`s will get faster tree `Commit(...)` performance.

I tested this branch in the replay benchmark and got similar performance. I inspected why, and it's because all block executions only create a few new leaves, so this optimization of batching newly created leaves isn't exploited that much. Probably when importing "heavier" blocks and being post-Byzantium, we could appreciate the difference more. It might still be useful to double-check our current replay benchmark on the reference machine.