feat: Speed up symbol parsing by minimizing allocations #258

Merged
merged 2 commits into from
Dec 4, 2024

Conversation

varungandhi-src
Contributor

@varungandhi-src varungandhi-src commented Jun 25, 2024

NOTE: The diff looks large, but the majority of that is because of schema
changes due to a small comment that I added.

What

Changes the symbol parsing logic to minimize allocations. In particular,
when we only care about validating symbols (e.g. during document
canonicalization when ingesting uploads), there is really ~no need to
allocate any strings at all. Validation and parsing share most of the
underlying code -- the only change is we create "writer" types which
will discard writes (and hence any internal buffer growth) when we're
only in validation mode.
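
The "writer" idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `runeWriter`, `bufferWriter`, `discardWriter`, and `scanIdentifier` are hypothetical, and the real parser handles a much richer grammar.

```go
package main

import (
	"fmt"
	"strings"
)

// runeWriter is a hypothetical interface; the writer types in the PR may differ.
type runeWriter interface {
	WriteRune(r rune)
}

// bufferWriter accumulates runes, for callers that want the parsed output.
type bufferWriter struct{ sb strings.Builder }

func (w *bufferWriter) WriteRune(r rune) { w.sb.WriteRune(r) }

// discardWriter drops every write, so validation never grows a buffer.
type discardWriter struct{}

func (discardWriter) WriteRune(rune) {}

// scanIdentifier consumes a leading run of ASCII letters, digits, and '_',
// forwards each accepted rune to w, and returns the number of bytes consumed.
// The same scanning code serves both parsing and validation.
func scanIdentifier(s string, w runeWriter) int {
	n := 0
	for _, r := range s {
		isIdent := r == '_' || ('a' <= r && r <= 'z') || ('A' <= r && r <= 'Z') || ('0' <= r && r <= '9')
		if !isIdent {
			break
		}
		w.WriteRune(r)
		n++ // every accepted rune is ASCII, so one byte each
	}
	return n
}

func main() {
	var buf bufferWriter
	fmt.Println(scanIdentifier("foo_bar#baz", &buf), buf.sb.String())
	fmt.Println(scanIdentifier("foo_bar#baz", discardWriter{}))
}
```

In this sketch the only difference between the two modes is which writer is passed in, so the scanning logic is not duplicated.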

Why

Ideally, we want to validate all symbols that we enter into the DB
(and we also want to have fast splitting of symbols), so it's valuable
to have the overhead be as low as possible. In the validation case,
we only make minimal heap allocations in the error case (there is a
test which makes sure we don't allocate in the non-error cases).
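
One way such a "no allocations on the success path" check could be written is with `testing.AllocsPerRun` from the standard library. The sketch below is an assumption about the approach, not the test from this PR; `validateASCII` is a hypothetical stand-in for the real validation function.

```go
package main

import (
	"fmt"
	"testing"
)

// validateASCII is a hypothetical stand-in for a validation-only routine.
// It reports whether s contains only ASCII bytes and never allocates.
func validateASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

// sink keeps the compiler from optimizing the call away.
var sink bool

// allocsForValidation returns the average number of heap allocations
// per call to validateASCII on the given input.
func allocsForValidation(input string) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = validateASCII(input)
	})
}

func main() {
	fmt.Println(allocsForValidation("scip-go gomod example v1 pkg/Foo#"))
}
```

In a real test file the result would typically be compared against zero with `t.Fatalf` on mismatch.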

Benchmarks

I ran some benchmarks with sample SCIP indexes located here: (Sourcegraph-internal)
https://drive.google.com/drive/folders/1z62Se7eHaa5T89a16-y7s0Z1qbRY4VCg

Once the indexes are decompressed into dev/sample_indexes, you can run

# Benchmarks
go run ./bindings/go/scip/speedtest
# Compatibility test with old parser
go test ./... -run TestParseCompat -tags asserts

Symbol parse (v1) represents the older symbol parsing logic;
Symbol parse (v2) represents the newer symbol parsing logic.
I also added a validation helper function on top of the newer parser;
that is also benchmarked separately.

Symbol parse (v2) is noticeably slower than validation because of the
allocations needed for a fair comparison between symbol parse (v1) and (v2):
the old parsing logic would always return a new *scip.Symbol,
so it'd be an unfair comparison if we just pre-allocated everything
for benchmarking (v2). It is possible to get symbol parse (v2) to validation-level
speed by pre-allocating arrays of scip.Package and scip.Descriptor
values up-front and passing pointers to elements of those arrays to
successive calls to ParseSymbolUTF8With.
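
The pre-allocation idea can be illustrated with a toy example. Everything here is hypothetical: `descriptor`, `parsed`, and `parseInto` are simplified stand-ins and do not reflect the actual fields of scip.Descriptor or the signature of ParseSymbolUTF8With. The point is only that a caller-owned output value can be reused across calls.

```go
package main

import "fmt"

// descriptor and parsed are hypothetical, simplified stand-ins for the
// real SCIP types.
type descriptor struct{ Name string }

type parsed struct {
	Descriptors []descriptor
}

// parseInto splits s on '/' and stores the pieces in out, reusing the
// capacity of out.Descriptors instead of allocating a new slice per call.
// Substrings share the input's backing memory, so no strings are copied.
func parseInto(s string, out *parsed) {
	out.Descriptors = out.Descriptors[:0]
	start := 0
	for i := 0; i <= len(s); i++ {
		if i == len(s) || s[i] == '/' {
			out.Descriptors = append(out.Descriptors, descriptor{Name: s[start:i]})
			start = i + 1
		}
	}
}

func main() {
	// Allocate once up-front, then pass the same pointer to every call.
	out := parsed{Descriptors: make([]descriptor, 0, 8)}
	for _, s := range []string{"a/b/c", "x/y"} {
		parseInto(s, &out)
		fmt.Println(len(out.Descriptors), out.Descriptors[0].Name)
	}
}
```

The trade-off is that results from an earlier call are overwritten by the next one, so callers must copy anything they want to keep.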

Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/django-1.scip":
+-------------------------------+------------+-------+------------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000       | RATIO | 10000      | RATIO | 100000   | RATIO |
+-------------------------------+------------+-------+------------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 1.074µs/op | -     | 1.059µs/op | -     | 928ns/op | -     |
| Symbol parse (v2) - Speed     | 528ns/op   | 0.49x | 620ns/op   | 0.59x | 620ns/op | 0.67x |
| Symbol validate (v2) - Speed  | 359ns/op   | 0.33x | 421ns/op   | 0.40x | 413ns/op | 0.45x |
| Symbol parse (v1) - Allocs    | 101535B/op | -     | 10153B/op  | -     | 1015B/op | -     |
| Symbol parse (v2) - Allocs    | 41218B/op  | 0.41x | 4121B/op   | 0.41x | 412B/op  | 0.41x |
| Symbol validate (v2) - Allocs | 4B/op      | 0.00x | 0B/op      | 0.00x | 0B/op    | 0.00x |
+-------------------------------+------------+-------+------------+-------+----------+-------+
Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/flink-1.scip":
+-------------------------------+------------+-------+------------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000       | RATIO | 10000      | RATIO | 100000   | RATIO |
+-------------------------------+------------+-------+------------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 1.334µs/op | -     | 1.216µs/op | -     | 962ns/op | -     |
| Symbol parse (v2) - Speed     | 665ns/op   | 0.50x | 686ns/op   | 0.56x | 694ns/op | 0.72x |
| Symbol validate (v2) - Speed  | 397ns/op   | 0.30x | 392ns/op   | 0.32x | 400ns/op | 0.42x |
| Symbol parse (v1) - Allocs    | 107481B/op | -     | 10748B/op  | -     | 1074B/op | -     |
| Symbol parse (v2) - Allocs    | 61517B/op  | 0.57x | 6151B/op   | 0.57x | 615B/op  | 0.57x |
| Symbol validate (v2) - Allocs | 0B/op      | 0.00x | 0B/op      | 0.00x | 0B/op    | 0.00x |
+-------------------------------+------------+-------+------------+-------+----------+-------+
Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/llvm-project-1.scip":
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000      | RATIO | 10000    | RATIO | 100000   | RATIO |
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 941ns/op  | -     | 651ns/op | -     | 516ns/op | -     |
| Symbol parse (v2) - Speed     | 387ns/op  | 0.41x | 442ns/op | 0.68x | 363ns/op | 0.70x |
| Symbol validate (v2) - Speed  | 229ns/op  | 0.24x | 230ns/op | 0.35x | 174ns/op | 0.34x |
| Symbol parse (v1) - Allocs    | 49424B/op | -     | 4942B/op | -     | 494B/op  | -     |
| Symbol parse (v2) - Allocs    | 33525B/op | 0.68x | 3352B/op | 0.68x | 335B/op  | 0.68x |
| Symbol validate (v2) - Allocs | 0B/op     | 0.00x | 0B/op    | 0.00x | 0B/op    | 0.00x |
+-------------------------------+-----------+-------+----------+-------+----------+-------+
Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/rust-1.scip":
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000      | RATIO | 10000    | RATIO | 100000   | RATIO |
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 664ns/op  | -     | 636ns/op | -     | 739ns/op | -     |
| Symbol parse (v2) - Speed     | 450ns/op  | 0.68x | 443ns/op | 0.70x | 498ns/op | 0.67x |
| Symbol validate (v2) - Speed  | 278ns/op  | 0.42x | 288ns/op | 0.45x | 301ns/op | 0.41x |
| Symbol parse (v1) - Allocs    | 70200B/op | -     | 7020B/op | -     | 702B/op  | -     |
| Symbol parse (v2) - Allocs    | 38142B/op | 0.54x | 3814B/op | 0.54x | 381B/op  | 0.54x |
| Symbol validate (v2) - Allocs | 0B/op     | 0.00x | 0B/op    | 0.00x | 0B/op    | 0.00x |
+-------------------------------+-----------+-------+----------+-------+----------+-------+
Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/shopify-api-ruby-1.scip":
+-------------------------------+------------+-------+----------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000       | RATIO | 10000    | RATIO | 100000   | RATIO |
+-------------------------------+------------+-------+----------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 1.024µs/op | -     | 914ns/op | -     | 893ns/op | -     |
| Symbol parse (v2) - Speed     | 5.607µs/op | 5.48x | 349ns/op | 0.38x | 340ns/op | 0.38x | (*)
| Symbol validate (v2) - Speed  | 236ns/op   | 0.23x | 203ns/op | 0.22x | 203ns/op | 0.23x |
| Symbol parse (v1) - Allocs    | 65794B/op  | -     | 6579B/op | -     | 657B/op  | -     |
| Symbol parse (v2) - Allocs    | 30391B/op  | 0.46x | 3039B/op | 0.46x | 303B/op  | 0.46x |
| Symbol validate (v2) - Allocs | 0B/op      | 0.00x | 0B/op    | 0.00x | 0B/op    | 0.00x |
+-------------------------------+------------+-------+----------+-------+----------+-------+
Benchmark for "/Users/varun/Code/scip/dev/sample_indexes/typescript-1.scip":
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| BENCHMARK / ITERATIONS        | 1000      | RATIO | 10000    | RATIO | 100000   | RATIO |
+-------------------------------+-----------+-------+----------+-------+----------+-------+
| Symbol parse (v1) - Speed     | 555ns/op  | -     | 597ns/op | -     | 572ns/op | -     |
| Symbol parse (v2) - Speed     | 450ns/op  | 0.81x | 389ns/op | 0.65x | 385ns/op | 0.67x |
| Symbol validate (v2) - Speed  | 189ns/op  | 0.34x | 225ns/op | 0.38x | 222ns/op | 0.39x |
| Symbol parse (v1) - Allocs    | 62000B/op | -     | 6200B/op | -     | 620B/op  | -     |
| Symbol parse (v2) - Allocs    | 37137B/op | 0.60x | 3713B/op | 0.60x | 371B/op  | 0.60x |
| Symbol validate (v2) - Allocs | 0B/op     | 0.00x | 0B/op    | 0.00x | 0B/op    | 0.00x |
+-------------------------------+-----------+-------+----------+-------+----------+-------+

There is one surprising case where parsing is much slower when only testing against
1000 occurrences with the newer parser, which I've marked with a (*) --
that seems to be somewhat reproducible. However, I haven't spent time on investigating
that because it seems like the speed improvements are present at higher occurrence counts.

Test plan

Added compatibility tests for the old parser vs the new parser and ran
that against a bunch of existing indexes.

@kritzcreek
Contributor

I tried running the benchmarks against chromium-1.scip (to compare parsing the same symbols), but it looks like the parser is failing to parse the symbols:

@@ -175,7 +175,11 @@ func TestUtf8Validation(t *testing.T) {
 		var sym Symbol
 		for i := 0; i < b.N; i++ {
 			occ := allOccurrences[i]
-			_ = parsePartialSymbolV2(occ.Symbol, true, &sym)
+			err = parsePartialSymbolV2(occ.Symbol, true, &sym)
+			if err != nil {
+				panic(fmt.Sprintf("Failed to parse '%s' with %s", occ.Symbol, err))
+				// fmt.Printf("Failed to parse '%s' with %s", occ.Symbol, err)
+			}
 		}
 	}
 	stdUtf8ValidationOnly := func(b *simpleBenchmark) {
=== RUN   TestUtf8Validation
--- FAIL: TestUtf8Validation (17.96s)
panic: Failed to parse 'cxx . todo-pkg todo-version `apps/switches.h:6:9`!' with unrecognized descriptor "do-pkg " [recovered]
	panic: Failed to parse 'cxx . todo-pkg todo-version `apps/switches.h:6:9`!' with unrecognized descriptor "do-pkg "

@varungandhi-src varungandhi-src force-pushed the vg/fast-parse branch 2 times, most recently from 0394303 to 7071041 Compare November 29, 2024 08:41
@varungandhi-src varungandhi-src changed the base branch from main to vg/fast-json November 29, 2024 09:11
@@ -0,0 +1,239 @@
package internal
Contributor Author

We can remove this code after the new symbol parser doesn't show any problems in practice.

package shared

func IsSimpleIdentifierCharacter(c rune) bool {
return c == '_' || c == '+' || c == '-' || c == '$' || ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9')
Contributor Author

Changing the ordering of the comparisons doesn't seem to have any noticeable changes in benchmarks, so leaving this code as-is to match the order in scip.proto.

Base automatically changed from vg/fast-json to main November 30, 2024 08:33
@varungandhi-src varungandhi-src force-pushed the vg/fast-parse branch 4 times, most recently from e4ba021 to 3f8bc54 Compare November 30, 2024 13:18
}

func (x *Package) ID() string {
return fmt.Sprintf("%s %s %s", x.Manager, x.Name, x.Version)
func ValidateSymbolUTF8(symbol beaut.UTF8String) error {
Contributor Author

@varungandhi-src varungandhi-src Nov 30, 2024

Adding this helper function for better readability at call-sites since upload ingestion requires filtering out occurrences/SymbolInformation values with malformed symbols.

"strings"

"github.com/cockroachdb/errors"
Contributor Author

This file is best reviewed in 'Split diff' view.

//
// Unlike ParseSymbol, this skips UTF-8 validation. To customize
// parsing behavior, use ParseSymbolUTF8With.
func ParseSymbolUTF8(symbol beaut.UTF8String) (*Symbol, error) {
Contributor Author

I've decided to add a new function here instead of modifying the signature of the existing ParseSymbol function to avoid gratuitously breaking callers, since it's not hard to maintain back-compat.

if s.current() == r {
s.index++
return nil
func ParseSymbolUTF8With(symbol beaut.UTF8String, options ParseSymbolOptions) error {
Contributor Author

This function takes a ParseSymbolOptions struct so that we can add more options in the future without breaking back compat.

@varungandhi-src varungandhi-src force-pushed the vg/fast-parse branch 8 times, most recently from b9583e7 to 85b9892 Compare December 1, 2024 12:57
@varungandhi-src varungandhi-src force-pushed the vg/fast-parse branch 2 times, most recently from e143e74 to d93bbf0 Compare December 2, 2024 02:17
@varungandhi-src varungandhi-src marked this pull request as ready for review December 2, 2024 02:17
@varungandhi-src varungandhi-src changed the title Speed up ParseSymbol by avoiding allocations feat: Speed up symbol parsing and validation by minimizing allocations Dec 2, 2024
@varungandhi-src varungandhi-src force-pushed the vg/fast-parse branch 2 times, most recently from c269d2b to 57d9817 Compare December 2, 2024 02:49
@varungandhi-src varungandhi-src changed the title feat: Speed up symbol parsing and validation by minimizing allocations feat: Speed up symbol parsing by minimizing allocations Dec 2, 2024
Contributor

@kritzcreek kritzcreek left a comment

Very nice! I think the performance wins are worth the extra complexity in the parser.

I added a couple comments/questions/suggestions, but nothing major.

@@ -1,4 +1,4 @@
-golang 1.20.14
+golang 1.22.0
Contributor

Should we go to 1.23.x (3 at the moment) while we're at it?

Contributor Author

@varungandhi-src varungandhi-src Dec 3, 2024

That's potentially too restrictive since we're providing a library. The Go toolchain typically provides support for ~2 major versions, and a bunch of OSS libraries do the same.

In principle, we could split the code into different modules so that we can aggressively bump the version for the CLI and leave the version bound for the bindings lower, but that would make things more complicated, so not doing that for now.

SymbolString string
byteIndex int
currentRune rune
bytesToNextRune int32
Contributor

This would be cheap to derive from currentRune, is there a particular reason why you're storing it separately?

Contributor Author

We don't expect to be constructing lots of parser objects, so I'm not concerned about potential memory usage. However, I thought it didn't make sense to have an extra branch (even if it's well predicted) in the common case to identify the length from the rune, since we already have the value computed anyways.
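
To make the trade-off in this thread concrete, here is a small sketch of a cursor that caches the byte width of the current rune next to the rune itself. The `cursor` type and its methods are hypothetical and use the standard `unicode/utf8` decoder for brevity; the parser in this PR uses its own decoding routine.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cursor is a hypothetical, reduced form of the parser state discussed above.
type cursor struct {
	s           string
	byteIndex   int
	currentRune rune
	width       int // cached byte length of currentRune
}

// newCursor positions the cursor on the first rune of s.
func newCursor(s string) cursor {
	c := cursor{s: s}
	c.currentRune, c.width = utf8.DecodeRuneInString(s)
	return c
}

// advance moves to the next rune using the cached width, so no branch
// on currentRune is needed to work out how many bytes to skip.
func (c *cursor) advance() {
	c.byteIndex += c.width
	c.currentRune, c.width = utf8.DecodeRuneInString(c.s[c.byteIndex:])
}

func main() {
	c := newCursor("é!")
	// The alternative is to derive the width from the rune on demand.
	fmt.Println(c.width == utf8.RuneLen(c.currentRune))
	c.advance()
	fmt.Println(string(c.currentRune), c.byteIndex)
}
```

Deriving the width with `utf8.RuneLen` would save one field per parser at the cost of a few comparisons on every advance.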


// Pre-condition: string is well-formed UTF-8
// Pre-condition: byteIndex is in bounds
func findRuneAtIndex(s string, byteIndex int) (r rune, bytesRead int32) {
Contributor

@kritzcreek kritzcreek Dec 3, 2024

I was surprised to see this function, with Go having support for string slices. Does manually tracking the byte offset, rather than continuously slicing the input string, have a noticeable performance impact? Otherwise we could be using https://pkg.go.dev/unicode/utf8#DecodeRune

Contributor Author

If you look at the function here, that's much more complicated than what we have as it's trying to also handle the invalid UTF-8 case.

https://sourcegraph.com/github.com/golang/go/-/blob/src/unicode/utf8/utf8.go?L205-243

I suspect it's probably slower given that it's doing more work (we just have 1 indexing operation + 1 comparison on the fastest path), but I have not benchmarked it.
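
For readers following the thread, a decoder that relies on the "well-formed UTF-8" pre-condition might look something like the sketch below. This is an illustration written for this discussion, not the body of findRuneAtIndex; the name `decodeValid` is hypothetical, and no claim is made here about its speed relative to the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeValid decodes the rune that starts at s[i] and returns it with its
// byte length.
// Pre-conditions (assumed, not checked): s is well-formed UTF-8, i is in
// bounds, and i is at the start of a rune.
func decodeValid(s string, i int) (rune, int) {
	b0 := s[i]
	if b0 < 0x80 {
		// Fast path: one indexing operation and one comparison for ASCII.
		return rune(b0), 1
	}
	switch {
	case b0&0xE0 == 0xC0: // 110xxxxx: two-byte sequence
		return rune(b0&0x1F)<<6 | rune(s[i+1]&0x3F), 2
	case b0&0xF0 == 0xE0: // 1110xxxx: three-byte sequence
		return rune(b0&0x0F)<<12 | rune(s[i+1]&0x3F)<<6 | rune(s[i+2]&0x3F), 3
	default: // 11110xxx: four-byte sequence
		return rune(b0&0x07)<<18 | rune(s[i+1]&0x3F)<<12 | rune(s[i+2]&0x3F)<<6 | rune(s[i+3]&0x3F), 4
	}
}

func main() {
	s := "aé€😀"
	for i := 0; i < len(s); {
		r, n := decodeValid(s, i)
		want, wantN := utf8.DecodeRuneInString(s[i:])
		fmt.Println(r == want, n == wantN)
		i += n
	}
}
```

Because the pre-conditions are not checked, calling this on invalid input gives meaningless results or an out-of-bounds panic, which is the price of skipping the error handling that `utf8.DecodeRune` performs.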

errorCaseByteNotFound
)

// TODO: Enable https://github.com/nishanths/exhaustive in CI
Contributor

Intentional TODO for the future, or something you meant to do as part of this PR?

Contributor Author

For the future, not this PR.

occ := allOccurrences[i]
_, err = internal.ParsePartialSymbolV1ToBeDeleted(occ.Symbol, true)
if err != nil {
//panic(fmt.Sprintf("v1: index path: %v: error: %v", path, err))
Contributor

Should we at least collect these errors, and check that both parsers errored on the same symbols?

Contributor Author

That is handled by TestParseCompat, we're deliberately dropping them here. Added a comment explaining that.
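
As a rough idea of what checking that two parsers fail on the same inputs can look like, here is a self-contained sketch. It is not the TestParseCompat implementation; `parseOld`, `parseNew`, and `disagreements` are toy, hypothetical functions that only count spaces.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type parseFunc func(symbol string) error

var errMalformed = errors.New("malformed symbol")

// parseOld and parseNew are toy stand-ins, not the parsers from this PR.
// Both reject inputs with fewer than four spaces, via different code paths.
func parseOld(symbol string) error {
	if strings.Count(symbol, " ") < 4 {
		return errMalformed
	}
	return nil
}

func parseNew(symbol string) error {
	spaces := 0
	for i := 0; i < len(symbol); i++ {
		if symbol[i] == ' ' {
			spaces++
		}
	}
	if spaces < 4 {
		return errMalformed
	}
	return nil
}

// disagreements returns the symbols on which exactly one of a and b fails.
func disagreements(symbols []string, a, b parseFunc) []string {
	var out []string
	for _, s := range symbols {
		if (a(s) == nil) != (b(s) == nil) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	symbols := []string{"scip-go gomod example v1 pkg/Foo#", "not a symbol"}
	fmt.Println(disagreements(symbols, parseOld, parseNew))
}
```

A stricter check would also compare the parsed results on success, not just whether an error occurred.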

Contributor

@kritzcreek kritzcreek left a comment

Speeeeeeed! :)

@varungandhi-src varungandhi-src merged commit 12ff730 into main Dec 4, 2024
6 checks passed
@varungandhi-src varungandhi-src deleted the vg/fast-parse branch December 4, 2024 01:42