Possible to create my own raw marshal/unmarshal? #51

Open
tobowers opened this issue Sep 2, 2019 · 6 comments

@tobowers

tobowers commented Sep 2, 2019

We have a few objects where refmt serialization is the bottleneck in our app... taking up to 1ms to serialize an object.

I'm wondering if it's possible to specify a fast-path for these objects... here it seems to say I might be able to have my own machine:

 For the most esoteric needs, you can fall all the way back to providing a custom MarshalMachine
     (but avoid that; it's a lot of work, and one of these other transform methods should suffic

However, the MarshalMachine interface has unexported type expectations in the fields.

Any other suggestions for keeping refmt around but being able to register my own "fast-path" ?

@warpfork
Member

warpfork commented Sep 3, 2019

Before anything else -- just in case you haven't synced up to master lately, try that first :) There's been a couple of huge performance fixes lately: #49 and #50 are likely to move the needle substantially. Like... really substantially.

@warpfork
Member

warpfork commented Sep 3, 2019

Adding more 'machines' is tricky, unfortunately. Wish it wasn't, but it's not just some unexported fields: the constraints of going fast and being flexible seem to conflict hereabouts.

In order to avoid an allocation for every object encountered, the obj package has this concept of a "slab", where it just allocates a frankly excessive amount of space which contains all the working memory that could possibly be needed to handle an object in various ways:

refmt/obj/marshalSlab.go

Lines 21 to 30 in 3d65705

type marshalSlabRow struct {
    ptrDerefDelegateMarshalMachine
    marshalMachinePrimitive
    marshalMachineWildcard
    marshalMachineMapWildcard
    marshalMachineSliceWildcard
    marshalMachineStructAtlas
    marshalMachineTransform
    marshalMachineUnionKeyed

The purpose of this is that we can then grow a whole slice of those at once and keep reusing them, thus massively amortizing down the number of allocations needed and the GC pressure created. It's basically a userland "stack", and it works pretty well. The downside of this is... there's really no way to keep that property except to mandate that all the machines have a compile-time reserved space in the struct.

And there's no way to do that in golang in an extensible / package-boundary-crossing way.
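
To make the slab idea concrete, here is a minimal sketch of the pattern. It is not refmt's code: the machine types, `slabRow`, `slab`, and the `row` method are all hypothetical names invented for illustration. It only shows why every machine needs compile-time reserved space in the row, and how reusing rows keeps the allocation count low.

```go
package main

import "fmt"

// Hypothetical stand-ins for machines; each carries its own working state.
type primitiveMachine struct{ done bool }
type mapMachine struct{ index int }
type sliceMachine struct{ index int }

// slabRow reserves space for every machine kind at compile time, so one row
// can serve whichever machine a given value turns out to need.
type slabRow struct {
	primitiveMachine
	mapMachine
	sliceMachine
}

// slab is a userland "stack" of rows: grown rarely, reused often.
type slab struct {
	rows   []slabRow
	allocs int // counts how many times the backing slice was grown
}

// row returns a zeroed row for the given depth, growing the backing slice
// only when the depth has never been reached before. Pointers returned
// before a growth refer to the old backing array and must not be kept.
func (s *slab) row(depth int) *slabRow {
	if depth >= len(s.rows) {
		grown := make([]slabRow, (depth+1)*2)
		copy(grown, s.rows)
		s.rows = grown
		s.allocs++
	}
	r := &s.rows[depth]
	*r = slabRow{} // reset state so the row can be reused
	return r
}

func main() {
	var s slab
	// Simulate marshalling 1000 objects that each nest four levels deep.
	for i := 0; i < 1000; i++ {
		for depth := 0; depth < 4; depth++ {
			s.row(depth).mapMachine.index = depth
		}
	}
	fmt.Println("slab growths:", s.allocs)
}
```

A machine defined in another package cannot be added to `slabRow` without editing the struct itself, which is the extensibility problem described above.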

(So in other words... that comment about custom marshal machines might be out of date :/ I wanted to support that, but it was an early idea and I don't think I had realized how it wouldn't play out well with the whole slab concept.)

If there's a cleverer way to do this, it's gonna need a lot of thought.

Or possibly it's time for some sizably different angle of attack: the whole "user-land stack" thing is a pretty all-in design choice in the obj package, and it dictates a lot of constraints.


So, are there other ways to build faster paths and still reuse stuff? Yeah! For that matter, the entire obj package isn't blessed -- there are zero direct reaches between json<->obj or cbor<->obj, and so we can write totally different object<->token mapping systems, and still reuse all the codecs completely, since they only know about tokens.

It's just finding something that's incrementally adoptable and composable that's the tricky bit.

@warpfork
Member

warpfork commented Sep 4, 2019

So about those options for progress that are in non-incremental territory.

There's probably more than a few possible development trajectories, because "non-incremental" kind of opens the floodgates. There are lots, and lots, and lots of different ways one could write object<->token mapping code, and still reuse the codecs for token<->serial mapping.

But there's a few I've looked at, so I'll try to comment on those here. (There's a lot of work being pursued in these directions going on within ipld/go-ipld-prime, btw... but being non-incremental approaches, it'll take a while to show fruit. And I haven't been trying to port subsets of that work back into refmt while it's going on.)

There are two big things that are costly about the way that the refmt/obj package does business:

  • There's lots of reflection going on -- even when atlases are used, they're still just configuration for reflection, rather than a way out of using reflection. Reflection isn't cheap.
  • The whole "step function" pattern, though really elegant, is just really tricky to optimize. The whole 'slab' concept is all to satisfy constraints of the step-func pattern, and though it's now reasonably efficient, it's also still distinctly unwieldy (and difficult to extend, as we saw above)... and also, honestly, just plain isn't pleasant to program for.

So what can we do about either one, or both of those?

  • well, go-ipld-prime is introducing the concept of a Node -- which can be backed by reflectionless access

    • Could make Node implementations backed by atlas-like/reflecty features... or...
    • Or codegen things (!)...
    • Or have "generic" implementations that store anything, but are less directly relatable to regular Go code.
    • Worth noting this is a fairly wildly different direction than refmt/obj for lots of reasons: it's a pretty heavy duty abstraction overall; and the existence of that third mode where it's both reflectionless and also without codegen, but in that mode can only really be accessed with "generic" (and think erasive monomorphization, not nice generics) methods... it really can't be used directly by familiar golang code at all... so this is not at all what refmt does now and it's unclear if it should/would even if it could.
    • This could still work together with the current slab system! There'd pretty much just be one new slab member that knows what state this system would need. Once written, this approach would also be something that could be used on a per-type basis, so it would be fairly incrementally adoptable.
  • Give up on the stepfunc and userland-stack, and just write regular dang code.

    • Means we can't compose an obj<->obj pairing in quite the same way. But that's... perhaps fine.
    • Could do this with similar-to-current reflection, or, with Node idea above.
      • go-ipld-prime is doing the latter.
    • Hard to say exactly how sizable the differences would be in leaning more heavily on a regular stack. Might make it easier to engage other compiler optimizations. Might be totally irrelevant, because realistically the difference between load-effective-address instructions for the native stack and our userland stack... isn't. Only way is to write it (all of it) and see.
  • A third, farthest lunge is to cut through all abstractions and have marshalling and unmarshalling functions that are bound directly to golang structs using source with no reflection, and calling directly the encoders (or even drilling through that abstraction, possibly, and doing raw bytes directly) while keeping the stack associated with the struct. This will be the fastest, almost certainly. It's also a very large amount of work and more or less absurdly unmaintainable unless it's implemented via codegen.

So you can see how there's many options. But none of the choices are trivial. If you wanted to pursue some of these, I'd say "go for it" and try to be helpful, but in a lot of cases there's no nice resting point in the middle of implementing it; one just has to do the whole dang thing. And then see if it got faster or not, because there's very little meaningful testing and benchmarking one can do before having the whole, holistic thing to benchmark as a unit.

To re-summarize what I alluded to at the top and a bit throughout: go-ipld-prime is trying the 'Node' approach, and it's doing a complete alternative to the 'obj' package based on that, while reusing the codecs and token interfaces. I'm also doing the codegen approach over there, but optionally. While that's a very large body of work, some parts of it are seeming close to paying off now. So you might want to keep an eye on how that evolves.

@warpfork
Member

warpfork commented Sep 4, 2019

And one more "P.S." -- I'm not sure how deep and what directions your own investigations into your bottlenecks have gone, but fwiw, I've recently been finding that pprof output files are amazingly useful; especially once benchmarks aren't able to provide precise enough guidance for what to look at next. The time pprofs are good; the mem and alloc ones often even better. The tools for inspecting them have also gotten radically more awesome in the last couple years. Profiling outputs are especially valuable compared to benchmarks for the kind of stuff we experience in refmt, because the performance profile of an operation is intensely data-dependent.
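
For anyone who hasn't captured a profile outside of `go test` before, a minimal sketch using the standard `runtime/pprof` package might look like the following. The `profileCPU` helper is a hypothetical name, and `json.Marshal` is only a placeholder workload standing in for the real serialization call being investigated.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"runtime/pprof"
)

// profileCPU runs work while a CPU profile is being collected and returns
// the raw profile bytes.
func profileCPU(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	// Placeholder workload; substitute the serialization path of interest.
	payload := map[string]int{"a": 1, "b": 2}
	prof, err := profileCPU(func() {
		for i := 0; i < 200000; i++ {
			if _, err := json.Marshal(payload); err != nil {
				panic(err)
			}
		}
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("profile bytes:", len(prof))
}
```

In practice the profile would be written to a file rather than a buffer and then opened with `go tool pprof`; `pprof.WriteHeapProfile` gives the memory-side equivalent.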

I've also started using assembly dumps a lot recently to make sense of what the compiler is actually doing and thus to make sure my microbenchmarks aren't telling exotic lies, etc, and that's turned out to be a lot more relevant than I would've expected. (It's really easy to make a microbenchmark that lies.) The '-gcflags -S' incantation (works on most of the go tools) is an entrypoint to getting that content if you want to try it but haven't before.
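
One common way a microbenchmark lies is that the compiler may inline the function under test and then discard work whose result is never used. A minimal sketch of the usual guard, assuming a hypothetical `sum` function as the thing being measured, is to store each result in a package-level variable:

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level; storing results here keeps the compiler from
// treating the measured call as dead code.
var sink int

// sum is a hypothetical function under test.
func sum(xs []int) int {
	t := 0
	for _, x := range xs {
		t += x
	}
	return t
}

func main() {
	xs := make([]int, 1024)
	for i := range xs {
		xs[i] = i
	}
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sum(xs)
		}
	})
	fmt.Println(res)
}
```

Building this with `go build -gcflags -S` prints the generated assembly, which is one way to check whether the loop body actually survived compilation.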

If you wanna have a quick call sometime to talk more about ways to gather data like this I'd be happy to :) Some of the recent major perf improvements I mentioned earlier were almost a direct result of someone throwing some pprof files at me from non-trivial prod usage, so, yeah... they're precious.

@tobowers
Author

tobowers commented Sep 4, 2019

Thank you so much for your insanely thorough and thoughtful answer (and so quick)! Given that we're working with IPLD cbor objects nearly exclusively it seems like the ipld-prime route is probably the best place for me to look... last time I checked it out it didn't seem like I could just drop it into a production system and expect it to work :).

I've been using the pprof tools a lot (mostly CPU/memory). Skipping obj altogether might be interesting for some fast path things... looks like Node is doing that over in ipld-prime with "Marshal" as opposed to "Encode"

@warpfork
Member

warpfork commented Sep 4, 2019

Yeah, don't quite wanna claim that the go-ipld-prime stuff is drop-in yet, and the profiling effort on that is also so far... minimal. It's getting close to ready, though. And a couple of early benchmarks seem to be indicating it's roughly on par with refmt already, before serious optimization work, so... seems likely there's good things to come there :)
