Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batch NewLeaf node work in base API #349

Closed
wants to merge 7 commits into from
Closed

Batch NewLeaf node work in base API #349

wants to merge 7 commits into from

Conversation

jsign
Copy link
Collaborator

@jsign jsign commented Apr 26, 2023

This PR removes the new-ish specialized API for inserting new leaves that we created in #343, and does some internal refactors to have equal performance just using the base API (i.e: Insert(key, value)).

The high-level idea is to avoid doing CPU-heavy work in NewLeafNode(...) and let the commitment be nil. We add awareness in some code sections that case can happen, and simply means that’s an uncommited new leaf node.

In particular, when we call:

  • (*InternalNode).Commit(...): we stop assuming the leafs are prepared. Instead, we detect all uncommited leaves and do the batch-way of creating a set of uncommited leaves that I created in Overlay tree migration explorations #343 but with some twists (but the same idea).
  • (LeafNode).Commit(...): we only commit that LeafNode using the same batch method but with a single element; so the logic is reduced.

I also refined the way the new-leaf-node-batching works but creating hot-paths for single-leaf node creation and only going full steam using cores if there’s enough work to do. (More about this in PR comments).

Note that we aren’t doing exactly #314. There’s no COW here in leaves since we’re exclusively focusing on optimizing freshly created leaves which is exactly the case for overlay tree key/value migrations. But, this optimization will also be exploited under normal block-execution circumstances since new/fresh leaves will be created too.

The logic for updating values in existing leaves remains the same as today; doing diff updating. If we ever want to optimize this case, then it might make sense to introduce the COW idea. But that needs more justification since it introduces a new map and more logical complexity. This PR doesn’t introduce extra memory overhead.

Below are the comparisons between the current-newish batch API for key/value migration and the new version of Insert(key, value).

Synthetic insertion

As a refresher, this was a simulation of a pre-existing tree with X random key/values, and then how long it takes to insert another random Y key/values.

Below I repeat the output of our current master branch. Note that the important number is batched XXms since that is showing how fast the new APIs that we introduced runs:

Assuming 0 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 43ms, batched 23ms, 1.89x
        If 2000 extra key-values are migrated: unbatched 78ms, batched 39ms, 2.01x
        If 5000 extra key-values are migrated: unbatched 192ms, batched 87ms, 2.21x
        If 8000 extra key-values are migrated: unbatched 304ms, batched 137ms, 2.22x
Assuming 500 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 44ms, batched 29ms, 1.50x
        If 2000 extra key-values are migrated: unbatched 81ms, batched 44ms, 1.82x
        If 5000 extra key-values are migrated: unbatched 193ms, batched 95ms, 2.03x
        If 8000 extra key-values are migrated: unbatched 308ms, batched 144ms, 2.14x
Assuming 1000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 51ms, batched 34ms, 1.47x
        If 2000 extra key-values are migrated: unbatched 88ms, batched 52ms, 1.67x
        If 5000 extra key-values are migrated: unbatched 200ms, batched 102ms, 1.95x
        If 8000 extra key-values are migrated: unbatched 317ms, batched 150ms, 2.10x
Assuming 2000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 63ms, batched 47ms, 1.32x
        If 2000 extra key-values are migrated: unbatched 99ms, batched 63ms, 1.58x
        If 5000 extra key-values are migrated: unbatched 214ms, batched 113ms, 1.88x
        If 8000 extra key-values are migrated: unbatched 328ms, batched 165ms, 1.99x
Assuming 5000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 101ms, batched 86ms, 1.17x
        If 2000 extra key-values are migrated: unbatched 137ms, batched 99ms, 1.38x
        If 5000 extra key-values are migrated: unbatched 252ms, batched 151ms, 1.67x
        If 8000 extra key-values are migrated: unbatched 370ms, batched 204ms, 1.81x

Below, is the same test/benchmark but in this PR. i.e: using Insert(key, value) , what we called “unbatched” above. (i.e: no special APIs used):

Assuming 0 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 23ms
        If 2000 extra key-values are migrated: unbatched 38ms
        If 5000 extra key-values are migrated: unbatched 85ms
        If 8000 extra key-values are migrated: unbatched 132ms
Assuming 500 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 29ms
        If 2000 extra key-values are migrated: unbatched 44ms
        If 5000 extra key-values are migrated: unbatched 92ms
        If 8000 extra key-values are migrated: unbatched 139ms
Assuming 1000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 36ms
        If 2000 extra key-values are migrated: unbatched 52ms
        If 5000 extra key-values are migrated: unbatched 100ms
        If 8000 extra key-values are migrated: unbatched 146ms
Assuming 2000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 51ms
        If 2000 extra key-values are migrated: unbatched 66ms
        If 5000 extra key-values are migrated: unbatched 113ms
        If 8000 extra key-values are migrated: unbatched 162ms
Assuming 5000 key/values touched by block execution:
        If 1000 extra key-values are migrated: unbatched 93ms
        If 2000 extra key-values are migrated: unbatched 110ms
        If 5000 extra key-values are migrated: unbatched 157ms
        If 8000 extra key-values are migrated: unbatched 207ms

Benchmark-style measurement

In the previous PR, I also introduced a more formal benchmark.

Before (using new-ish batched API) took ~101ms per op:

$ go test . -run=none -bench=BenchmarkBatchLeavesInsert -memprofile=mem.out -cpuprofile=cpu.out
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 7 3800XT 8-Core Processor            
BenchmarkBatchLeavesInsert-16                 10         100468352 ns/op        51978069 B/op     124714 allocs/op
PASS
ok      github.com/gballet/go-verkle    3.056s

After (this PR) the use of plain Insert(key, value) takes ~102ms per op:

$ go test . -run=none -bench=BenchmarkNewLeavesInsert -memprofile=mem.out -cpuprofile=cpu.out
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 7 3800XT 8-Core Processor            
BenchmarkNewLeavesInsert-16           12          98242143 ns/op        49037674 B/op      87698 allocs/op
PASS
ok      github.com/gballet/go-verkle    3.639s

TL;DR:

  • Synthetic benchmarks: as fast as the specialized API.
  • Benchmark: as fast as the specialized API, and some fewer allocs.
  • API surface reduced again to usual APIs.
  • Not only can overlay-tree key/val migration use the usual APIs, but block-execution cases of Insert(..., ...) that create new LeafNodes will have a faster tree Commit(...) performance.

I tested this branch in the replay benchmark and got a similar performance. I inspected why, and it’s because all block executions only have a few new leaves created, so this optimization of batching newly created leaves isn’t exploited that much. Probably when importing “heavier” blocks and being post-byzantium, we could appreciate the difference more. It might still be useful to double-check our current replay benchmark in the reference machine.

jsign added 6 commits April 26, 2023 12:51
Signed-off-by: Ignacio Hagopian <[email protected]>
Signed-off-by: Ignacio Hagopian <[email protected]>
Signed-off-by: Ignacio Hagopian <[email protected]>
Signed-off-by: Ignacio Hagopian <[email protected]>
Comment on lines -248 to -279
cfg := GetConfig()

// C1.
var c1poly [NodeWidth]Fr
var c1 *Point
count := fillSuffixTreePoly(c1poly[:], values[:NodeWidth/2])
containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
if containsEmptyCodeHash {
// Clear out values of the cached point.
c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
// Calculate the remaining part of c1 and add to the base value.
partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
c1 = new(Point)
c1.Add(&EmptyCodeHashPoint, partialc1)
} else {
c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
}

// C2.
var c2poly [NodeWidth]Fr
count = fillSuffixTreePoly(c2poly[:], values[NodeWidth/2:])
c2 := cfg.CommitToPoly(c2poly[:], NodeWidth-count)

// Root commitment preparation for calculation.
stem = stem[:StemSize] // enforce a 31-byte length
var poly [NodeWidth]Fr
poly[0].SetUint64(1)
StemFromBytes(&poly[1], stem)
toFrMultiple([]*Fr{&poly[2], &poly[3]}, []*Point{c1, c2})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We stop doing work in NewLeafNode and allow c1, c2 and commitment to be nil.

Comment on lines +615 to +627
func (n *InternalNode) findNewLeafNodes(newLeaves []*LeafNode) []*LeafNode {
for idx := range n.cow {
child := n.children[idx]
if childInternalNode, ok := child.(*InternalNode); ok && len(childInternalNode.cow) > 0 {
newLeaves = childInternalNode.findNewLeafNodes(newLeaves)
} else if leafNode, ok := child.(*LeafNode); ok {
if leafNode.commitment == nil {
newLeaves = append(newLeaves, leafNode)
}
}
}
return newLeaves
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See L634 to understand the context for this, it's quite simple.

Comment on lines +634 to +639
// New leaf nodes.
newLeaves := make([]*LeafNode, 0, 64)
newLeaves = n.findNewLeafNodes(newLeaves)
if len(newLeaves) > 0 {
batchCommitLeafNodes(newLeaves)
}
Copy link
Collaborator Author

@jsign jsign Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so now in (*InternalNode).Commit() we can't continue assuming all LeafNodes are prepared.
We're responsible now to detect any newly created LeafNodes, and do the CPU-heavy work.
Note that I mention new leaf nodes. Existing LeafNodes were kept up to date with the usual diff-updating.

What findeNewLeafNodes(...) does is quite simple: simply walk the tree and look for LeafNode that have commitment == nil, which signals is a newly created leaf node.

The batchCommitLeafNodes(...) is a twist of the previous BatchNewLeafNodes(...) that we had in the specialized API that was removed. I'll comment on it later below.

Comment on lines +991 to +994
if n.commitment == nil {
n.values[index] = value
return
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we receive an update for a key in a leaf that was created before, we simply put the value and move on.
We don't do diff-updating for obvious reasons: there's no previous commitment.

This is good. If a new leaf node is created, and touched and re-touched in the same block execution, it's very cheap now.

Comment on lines +1020 to +1027
if n.commitment == nil {
for i, v := range values {
if len(v) != 0 && !bytes.Equal(v, n.values[i]) {
n.values[i] = v
}
}
return
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here when we update multiple keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's great, but could you just add a comment so that we remember why that is when debugging later on? Same thing with L991

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gballet, done in 104f849.

Copy link
Collaborator Author

@jsign jsign May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like adding comments breaks tests? 🤷 ha
Looking...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gballet, go test ./... -race looks to be working fine in my machine. I wonder if this one of those weird CI things that happened before of pulling wrong code.

Do you mind doing go test ./... -race on your machine? To double check.

@@ -1297,6 +1307,7 @@ func (n *LeafNode) GetProofItems(keys keylist) (*ProofElements, []byte, [][]byte
// Serialize serializes a LeafNode.
// The format is: <nodeType><stem><bitlist><c1comm><c2comm><children...>
func (n *LeafNode) Serialize() ([]byte, error) {
n.Commit()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we always make sure that leaf node is committed.

Comment on lines +1542 to +1567
func batchCommitLeafNodes(leaves []*LeafNode) {
minBatchSize := 8
if len(leaves) < minBatchSize {
commitLeafNodes(leaves)
return
}

batchSize := len(leaves) / runtime.NumCPU()
if batchSize < minBatchSize {
batchSize = minBatchSize
}

var wg sync.WaitGroup
for start := 0; start < len(leaves); start += batchSize {
end := start + batchSize
if end > len(leaves) {
end = len(leaves)
}
wg.Add(1)
go func(leaves []*LeafNode) {
defer wg.Done()
commitLeafNodes(leaves)
}(leaves[start:end])
}
wg.Wait()
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so if you remember this is what we called when we collected all the new leaf nodes in (*InternalNode).Commit().

The idea in this method is the following:

  • If we have less than 8 new leaves, avoid spinning up any new goroutine and do the work serially (but batching work; you'll see commitLeafNode(...) below).
  • If we have more than 8 new leaves, we split the work in groups of 8 leaves at a minimum.

The idea of the last point is that even if we have 16 cores and 16 leaves, it doesn't make sense to spin up 16 goroutines since that's a lot of goroutine overhead for the amount of work. What this logic will do is always make sure each goroutine is at least doing work for 8 leaf nodes, so the overhead of spinning up goroutines is well justified.

That's just a "good practice" of not always assuming that you should split up work in the number of cores, but just be sure each one will have enough work to do to justify the scheduling overhead.

Note that "8" is a magic number that I handwaved it considering what's usually the amount of CPU work. This is leaning more into going single core, and going multi-core (up to N cores in you machine) if clearly justified. We can also play moving this number and the rest will accommodate alone.

Comment on lines +1569 to +1613
func commitLeafNodes(leaves []*LeafNode) {
cfg := GetConfig()

c1c2points := make([]*Point, 2*len(leaves))
c1c2frs := make([]*Fr, 2*len(leaves))
for i, n := range leaves {
// C1.
var c1poly [NodeWidth]Fr
count := fillSuffixTreePoly(c1poly[:], n.values[:NodeWidth/2])
containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
if containsEmptyCodeHash {
// Clear out values of the cached point.
c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
// Calculate the remaining part of c1 and add to the base value.
partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
n.c1 = new(Point)
n.c1.Add(&EmptyCodeHashPoint, partialc1)
} else {
n.c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
}

// C2.
var c2poly [NodeWidth]Fr
count = fillSuffixTreePoly(c2poly[:], n.values[NodeWidth/2:])
n.c2 = cfg.CommitToPoly(c2poly[:], NodeWidth-count)

c1c2points[2*i], c1c2points[2*i+1] = n.c1, n.c2
c1c2frs[2*i], c1c2frs[2*i+1] = new(Fr), new(Fr)
}

toFrMultiple(c1c2frs, c1c2points)

var poly [NodeWidth]Fr
poly[0].SetUint64(1)
for i, nv := range leaves {
StemFromBytes(&poly[1], nv.stem)
poly[2] = *c1c2frs[2*i]
poly[3] = *c1c2frs[2*i+1]

nv.commitment = cfg.CommitToPoly(poly[:], 252)
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what each goroutine (if spinned) will work for their batch. This code isn't new, but a twist/mixture of what we did previously, NewLeafNode and the previous BatchNewLeafNode.

Comment on lines -1161 to -1201
// ***Insert the key pairs with optimized strategy & methods***
rand = mRand.New(mRand.NewSource(42)) //skipcq: GSC-G404
tree = genRandomTree(rand, treeInitialKeyValCount)
randomKeyValues = genRandomKeyValues(rand, migrationKeyValueCount)

now = time.Now()
// Create LeafNodes in batch mode.
nodeValues := make([]BatchNewLeafNodeData, 0, len(randomKeyValues))
curr := BatchNewLeafNodeData{
Stem: randomKeyValues[0].key[:StemSize],
Values: map[byte][]byte{randomKeyValues[0].key[StemSize]: randomKeyValues[0].value},
}
for _, kv := range randomKeyValues[1:] {
if bytes.Equal(curr.Stem, kv.key[:StemSize]) {
curr.Values[kv.key[StemSize]] = kv.value
continue
}
nodeValues = append(nodeValues, curr)
curr = BatchNewLeafNodeData{
Stem: kv.key[:StemSize],
Values: map[byte][]byte{kv.key[StemSize]: kv.value},
}
}
// Append last remaining node.
nodeValues = append(nodeValues, curr)

// Create all leaves in batch mode so we can optimize cryptography operations.
newLeaves := BatchNewLeafNode(nodeValues)
if err := tree.(*InternalNode).InsertMigratedLeaves(newLeaves, nil); err != nil {
t.Fatalf("failed to insert key: %v", err)
}

batchedRoot := tree.Commit().Bytes()
if _, err := tree.(*InternalNode).BatchSerialize(); err != nil {
t.Fatalf("failed to serialize batched tree: %v", err)
}
batchedDuration += time.Since(now)

if unbatchedRoot != batchedRoot {
t.Fatalf("expected %x, got %x", unbatchedRoot, batchedRoot)
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this is gone since now we don't have a base API vs batch API anymore; the fast version is the base API.
We can evaluate later if it makes sense to keep this test/benchmark; but for now can give useful information as shown in the PR description.

@@ -1232,7 +1190,7 @@ func genRandomKeyValues(rand *mRand.Rand, count int) []keyValue {
return ret
}

func BenchmarkBatchLeavesInsert(b *testing.B) {
func BenchmarkNewLeavesInsert(b *testing.B) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this benchmark uses the base API since should be the fastest way to insert new leaves; so got simpler.

@jsign jsign marked this pull request as ready for review April 26, 2023 17:17
@jsign jsign requested a review from gballet April 26, 2023 17:17
Copy link
Member

@gballet gballet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before:

goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor            
BenchmarkBatchLeavesInsert-32    	      13	  85987042 ns/op	51667848 B/op	  114859 allocs/op
PASS
ok  	github.com/gballet/go-verkle	2.546s

After with 8:

goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor            
BenchmarkNewLeavesInsert-32    	      14	  86549506 ns/op	49047049 B/op	   87847 allocs/op
PASS
ok  	github.com/gballet/go-verkle	2.428s

So it doesn't seem to have a great performance impact on my machine. The reduction in allocations, however, is quite welcome.

Tweaking the batch size doesn't seem to have any impact. I will run it on the replay benchmark as soon as I fixed the bugs in my new conversion code.

Comment on lines +1020 to +1027
if n.commitment == nil {
for i, v := range values {
if len(v) != 0 && !bytes.Equal(v, n.values[i]) {
n.values[i] = v
}
}
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's great, but could you just add a comment so that we remember why that is when debugging later on? Same thing with L991

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants