Batch NewLeaf node work in base API #349
Conversation
Signed-off-by: Ignacio Hagopian <[email protected]>
```go
cfg := GetConfig()

// C1.
var c1poly [NodeWidth]Fr
var c1 *Point
count := fillSuffixTreePoly(c1poly[:], values[:NodeWidth/2])
containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
	c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
	c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
if containsEmptyCodeHash {
	// Clear out values of the cached point.
	c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
	c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
	// Calculate the remaining part of c1 and add to the base value.
	partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
	c1 = new(Point)
	c1.Add(&EmptyCodeHashPoint, partialc1)
} else {
	c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
}

// C2.
var c2poly [NodeWidth]Fr
count = fillSuffixTreePoly(c2poly[:], values[NodeWidth/2:])
c2 := cfg.CommitToPoly(c2poly[:], NodeWidth-count)

// Root commitment preparation for calculation.
stem = stem[:StemSize] // enforce a 31-byte length
var poly [NodeWidth]Fr
poly[0].SetUint64(1)
StemFromBytes(&poly[1], stem)
toFrMultiple([]*Fr{&poly[2], &poly[3]}, []*Point{c1, c2})
```
We stop doing work in `NewLeafNode` and allow `c1`, `c2` and `commitment` to be `nil`.
```go
func (n *InternalNode) findNewLeafNodes(newLeaves []*LeafNode) []*LeafNode {
	for idx := range n.cow {
		child := n.children[idx]
		if childInternalNode, ok := child.(*InternalNode); ok && len(childInternalNode.cow) > 0 {
			newLeaves = childInternalNode.findNewLeafNodes(newLeaves)
		} else if leafNode, ok := child.(*LeafNode); ok {
			if leafNode.commitment == nil {
				newLeaves = append(newLeaves, leafNode)
			}
		}
	}
	return newLeaves
}
```
See L634 for the context of this; it's quite simple.
```go
// New leaf nodes.
newLeaves := make([]*LeafNode, 0, 64)
newLeaves = n.findNewLeafNodes(newLeaves)
if len(newLeaves) > 0 {
	batchCommitLeafNodes(newLeaves)
}
```
OK, so now in `(*InternalNode).Commit()` we can't continue assuming all `LeafNode`s are prepared. We're now responsible for detecting any newly created `LeafNode`s and doing the CPU-heavy work.
Note that I mention *new* leaf nodes: existing `LeafNode`s were kept up to date with the usual diff-updating.
What `findNewLeafNodes(...)` does is quite simple: walk the tree and look for `LeafNode`s that have `commitment == nil`, which signals a newly created leaf node.
`batchCommitLeafNodes(...)` is a twist on the previous `BatchNewLeafNodes(...)` that we had in the specialized API that was removed. I'll comment on it later below.
```go
if n.commitment == nil {
	n.values[index] = value
	return
}
```
If we receive an update for a key in a leaf that was created before, we simply put the value and move on.
We don't do diff-updating for obvious reasons: there's no previous commitment.
This is good: if a new leaf node is created and then touched and re-touched in the same block execution, those updates are very cheap now.
```go
if n.commitment == nil {
	for i, v := range values {
		if len(v) != 0 && !bytes.Equal(v, n.values[i]) {
			n.values[i] = v
		}
	}
	return
}
```
Same here when we update multiple keys.
that's great, but could you just add a comment so that we remember why that is when debugging later on? Same thing with L991
Looks like adding comments breaks tests? 🤷 ha
Looking...
@gballet, `go test ./... -race` looks to be working fine on my machine. I wonder if this is one of those weird CI things we've seen before, where the wrong code gets pulled.
Do you mind running `go test ./... -race` on your machine, to double-check?
```diff
@@ -1297,6 +1307,7 @@ func (n *LeafNode) GetProofItems(keys keylist) (*ProofElements, []byte, [][]byte
 // Serialize serializes a LeafNode.
 // The format is: <nodeType><stem><bitlist><c1comm><c2comm><children...>
 func (n *LeafNode) Serialize() ([]byte, error) {
+	n.Commit()
```
Now we always make sure the leaf node is committed before serializing it.
```go
func batchCommitLeafNodes(leaves []*LeafNode) {
	minBatchSize := 8
	if len(leaves) < minBatchSize {
		commitLeafNodes(leaves)
		return
	}

	batchSize := len(leaves) / runtime.NumCPU()
	if batchSize < minBatchSize {
		batchSize = minBatchSize
	}

	var wg sync.WaitGroup
	for start := 0; start < len(leaves); start += batchSize {
		end := start + batchSize
		if end > len(leaves) {
			end = len(leaves)
		}
		wg.Add(1)
		go func(leaves []*LeafNode) {
			defer wg.Done()
			commitLeafNodes(leaves)
		}(leaves[start:end])
	}
	wg.Wait()
}
```
OK, so if you remember, this is what we call once we've collected all the new leaf nodes in `(*InternalNode).Commit()`.
The idea in this method is the following:
- If we have fewer than 8 new leaves, avoid spinning up any new goroutine and do the work serially (but still batching the work; you'll see `commitLeafNodes(...)` below).
- If we have more than 8 new leaves, we split the work into groups of at least 8 leaves each.
The point of the minimum is that even if we have 16 cores and 16 leaves, it doesn't make sense to spin up 16 goroutines, since that's a lot of goroutine overhead for the amount of work. This logic makes sure each goroutine does the work for at least 8 leaf nodes, so the overhead of spinning it up is well justified.
That's just a good practice of not blindly splitting work across the number of cores, but making sure each worker has enough to do to justify the scheduling overhead.
Note that 8 is a magic number I hand-waved based on the usual amount of CPU work per leaf. This leans toward staying single-core, and going multi-core (up to the N cores in your machine) only when clearly justified. We can also experiment with this number; the rest of the logic will adapt on its own.
```go
func commitLeafNodes(leaves []*LeafNode) {
	cfg := GetConfig()

	c1c2points := make([]*Point, 2*len(leaves))
	c1c2frs := make([]*Fr, 2*len(leaves))
	for i, n := range leaves {
		// C1.
		var c1poly [NodeWidth]Fr
		count := fillSuffixTreePoly(c1poly[:], n.values[:NodeWidth/2])
		containsEmptyCodeHash := len(c1poly) >= EmptyCodeHashSecondHalfIdx &&
			c1poly[EmptyCodeHashFirstHalfIdx].Equal(&EmptyCodeHashFirstHalfValue) &&
			c1poly[EmptyCodeHashSecondHalfIdx].Equal(&EmptyCodeHashSecondHalfValue)
		if containsEmptyCodeHash {
			// Clear out values of the cached point.
			c1poly[EmptyCodeHashFirstHalfIdx] = FrZero
			c1poly[EmptyCodeHashSecondHalfIdx] = FrZero
			// Calculate the remaining part of c1 and add to the base value.
			partialc1 := cfg.CommitToPoly(c1poly[:], NodeWidth-count-2)
			n.c1 = new(Point)
			n.c1.Add(&EmptyCodeHashPoint, partialc1)
		} else {
			n.c1 = cfg.CommitToPoly(c1poly[:], NodeWidth-count)
		}

		// C2.
		var c2poly [NodeWidth]Fr
		count = fillSuffixTreePoly(c2poly[:], n.values[NodeWidth/2:])
		n.c2 = cfg.CommitToPoly(c2poly[:], NodeWidth-count)

		c1c2points[2*i], c1c2points[2*i+1] = n.c1, n.c2
		c1c2frs[2*i], c1c2frs[2*i+1] = new(Fr), new(Fr)
	}

	toFrMultiple(c1c2frs, c1c2points)

	var poly [NodeWidth]Fr
	poly[0].SetUint64(1)
	for i, nv := range leaves {
		StemFromBytes(&poly[1], nv.stem)
		poly[2] = *c1c2frs[2*i]
		poly[3] = *c1c2frs[2*i+1]

		nv.commitment = cfg.CommitToPoly(poly[:], 252)
	}
}
```
This is what each goroutine (if spun up) runs for its batch. This code isn't new, but a twist/mixture of what we did previously in `NewLeafNode` and the previous `BatchNewLeafNode`.
```go
// ***Insert the key pairs with optimized strategy & methods***
rand = mRand.New(mRand.NewSource(42)) //skipcq: GSC-G404
tree = genRandomTree(rand, treeInitialKeyValCount)
randomKeyValues = genRandomKeyValues(rand, migrationKeyValueCount)

now = time.Now()
// Create LeafNodes in batch mode.
nodeValues := make([]BatchNewLeafNodeData, 0, len(randomKeyValues))
curr := BatchNewLeafNodeData{
	Stem:   randomKeyValues[0].key[:StemSize],
	Values: map[byte][]byte{randomKeyValues[0].key[StemSize]: randomKeyValues[0].value},
}
for _, kv := range randomKeyValues[1:] {
	if bytes.Equal(curr.Stem, kv.key[:StemSize]) {
		curr.Values[kv.key[StemSize]] = kv.value
		continue
	}
	nodeValues = append(nodeValues, curr)
	curr = BatchNewLeafNodeData{
		Stem:   kv.key[:StemSize],
		Values: map[byte][]byte{kv.key[StemSize]: kv.value},
	}
}
// Append last remaining node.
nodeValues = append(nodeValues, curr)

// Create all leaves in batch mode so we can optimize cryptography operations.
newLeaves := BatchNewLeafNode(nodeValues)
if err := tree.(*InternalNode).InsertMigratedLeaves(newLeaves, nil); err != nil {
	t.Fatalf("failed to insert key: %v", err)
}

batchedRoot := tree.Commit().Bytes()
if _, err := tree.(*InternalNode).BatchSerialize(); err != nil {
	t.Fatalf("failed to serialize batched tree: %v", err)
}
batchedDuration += time.Since(now)

if unbatchedRoot != batchedRoot {
	t.Fatalf("expected %x, got %x", unbatchedRoot, batchedRoot)
}
```
All this is gone, since we no longer have a base API vs. a batch API; the fast version *is* the base API.
We can evaluate later whether it makes sense to keep this test/benchmark; for now it can give useful information, as shown in the PR description.
```diff
@@ -1232,7 +1190,7 @@ func genRandomKeyValues(rand *mRand.Rand, count int) []keyValue {
 	return ret
 }
 
-func BenchmarkBatchLeavesInsert(b *testing.B) {
+func BenchmarkNewLeavesInsert(b *testing.B) {
```
Now this benchmark uses the base API, since that should be the fastest way to insert new leaves; so it got simpler.
Before:

```
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkBatchLeavesInsert-32    13    85987042 ns/op    51667848 B/op    114859 allocs/op
PASS
ok      github.com/gballet/go-verkle    2.546s
```
After with 8:

```
goos: linux
goarch: amd64
pkg: github.com/gballet/go-verkle
cpu: AMD Ryzen 9 5950X 16-Core Processor
BenchmarkNewLeavesInsert-32    14    86549506 ns/op    49047049 B/op    87847 allocs/op
PASS
ok      github.com/gballet/go-verkle    2.428s
```
So it doesn't seem to have a great performance impact on my machine. The reduction in allocations, however, is quite welcome.
Tweaking the batch size doesn't seem to have any impact. I will run it on the replay benchmark as soon as I've fixed the bugs in my new conversion code.
This PR removes the new-ish specialized API for inserting new leaves that we created in #343, and does some internal refactors to get equal performance just using the base API (i.e: `Insert(key, value)`).

The high-level idea is to avoid doing CPU-heavy work in `NewLeafNode(...)` and let the `commitment` be nil. We make some code sections aware that this case can happen; it simply means the node is an uncommitted new leaf node. In particular, when we call:

- `(*InternalNode).Commit(...)`: we stop assuming the leaves are prepared. Instead, we detect all uncommitted leaves and use the batch-style creation of a set of uncommitted leaves that I created in #343 (Overlay tree migration explorations), with some twists but the same idea.
- `(*LeafNode).Commit(...)`: we commit that single `LeafNode` using the same batch method but with a single element, so the logic is reduced.

I also refined the way the new-leaf-node batching works by creating hot paths for single-leaf-node creation, and only going full steam with all cores if there's enough work to do. (More about this in the PR comments.)

Note that we aren't doing exactly #314. There's no COW here in leaves, since we're exclusively focusing on optimizing freshly created leaves, which is exactly the case for overlay tree key/value migrations. But this optimization will also be exploited under normal block-execution circumstances, since new/fresh leaves are created there too.

The logic for updating values in existing leaves remains the same as today: diff updating. If we ever want to optimize this case, it might make sense to introduce the COW idea. But that needs more justification, since it introduces a new map and more logical complexity. This PR doesn't introduce extra memory overhead.
Below are the comparisons between the current new-ish batch API for key/value migration and the new version of `Insert(key, value)`.

Synthetic insertion

As a refresher, this was a simulation of a pre-existing tree with X random key/values, measuring how long it takes to insert another Y random key/values.

Below I repeat the output of our current `master` branch. Note that the important number is `batched XXms`, since that shows how fast the new APIs we introduced run:

Below is the same test/benchmark in this PR, i.e: using `Insert(key, value)`, what we called "unbatched" above (i.e: no special APIs used):

Benchmark-style measurement

In the previous PR, I also introduced a more formal benchmark.

Before (using the new-ish batched API), it took ~101ms per op:

After (this PR), using plain `Insert(key, value)` takes ~102ms per op:

TL;DR: calls to `Insert(..., ...)` that create new `LeafNode`s will get faster tree `Commit(...)` performance.

I tested this branch in the replay benchmark and got similar performance. I inspected why, and it's because all block executions only create a few new leaves, so this optimization of batching newly created leaves isn't exploited that much. Probably when importing "heavier" blocks and being post-Byzantium, we could appreciate the difference more. It might still be useful to double-check our current replay benchmark on the reference machine.