Possible to create my own raw marshal/unmarshal? #51
Adding more 'machines' is tricky, unfortunately. Wish it weren't, but it's not just a matter of some unexported fields: the constraints of going fast and being flexible genuinely conflict here. To avoid an allocation for every object encountered, the obj package has this concept of a "slab", where it allocates one frankly excessive chunk of space up front, containing all the working memory that could possibly be needed to handle an object in the various ways: Lines 21 to 30 in 3d65705
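To make the slab idea concrete, here's a minimal illustrative sketch (these type names are placeholders, not refmt's actual internals): every machine variant lives as a field in one row struct, and a whole stack of rows is allocated once up front.

```go
package main

import "fmt"

// Placeholder "machines" standing in for the per-kind marshal state.
type mapMachine struct{ keys []string }
type sliceMachine struct{ idx int }

// slabRow bundles every machine variant into one struct; the slab is a
// preallocated stack of rows, one row per nesting level.
type slabRow struct {
	mapM   mapMachine
	sliceM sliceMachine
}

type slab struct {
	rows []slabRow
	top  int
}

// newSlab does the one big allocation; everything after is reuse.
func newSlab(depth int) *slab {
	return &slab{rows: make([]slabRow, depth)}
}

// push hands out working memory for the next nesting level without allocating.
func (s *slab) push() *slabRow {
	r := &s.rows[s.top]
	s.top++
	return r
}

func (s *slab) pop() { s.top-- }

func main() {
	s := newSlab(16)
	r := s.push()
	r.sliceM.idx = 3
	fmt.Println(r.sliceM.idx, s.top)
}
```

This also shows why the trick can't cross package boundaries: every machine variant has to be a field of the row struct at compile time, so a third-party machine has nowhere to live without going through an interface value, and that reintroduces the allocation the slab exists to avoid.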
And there's no way to do that in Go in an extensible, package-boundary-crossing way. (So, in other words... that comment about custom marshal machines might be out of date :/ I wanted to support that, but it was an early idea and I don't think I had realized how poorly it would play out with the whole slab concept.) If there's a cleverer way to do this, it's going to need a lot of thought. Or possibly it's time for some sizably different angle of attack: the whole "user-land stack" thing is a pretty all-in design choice. So, are there other ways to build faster paths and still reuse stuff? Yeah! It's just finding something that's incrementally adoptable and composable that's the tricky bit.
So, about those options for progress that lie in non-incremental territory. There's probably more than a few possible development trajectories, because "non-incremental" kind of opens the floodgates: there are lots, and lots, and lots of different ways one could write object<->token mapping code and still reuse the codecs for token<->serial mapping. But there are a few I've looked at, so I'll try to comment on those here. (There's a lot of work being pursued in these directions within ipld/go-ipld-prime, btw... but being non-incremental approaches, it'll take a while to show fruit. And I haven't been trying to port subsets of that work back into refmt while it's going on.) There are two big things that are costly about the way the obj package currently works.
So what can we do about either one, or both of those?
So you can see there are many options, but none of the choices are trivial. If you wanted to pursue some of these, I'd say "go for it" and try to be helpful, but in a lot of cases there's no nice resting point in the middle of implementing it; one just has to do the whole dang thing, and then see if it got faster or not, because there's very little meaningful testing and benchmarking one can do before having the whole, holistic thing to benchmark as a unit. To re-summarize what I alluded to at the top and a bit throughout: go-ipld-prime is trying the 'Node' approach, and it's doing a complete alternative to the 'obj' package based on that, while reusing the codecs and token interfaces. I'm also doing the codegen approach over there, but optionally. While that's a very large body of work, some parts of it are seeming close to paying off now. So you might want to keep an eye on how that evolves.
And one more "P.S." -- I'm not sure how deep, and in which directions, your own investigations into your bottlenecks have gone, but fwiw, I've recently been finding that pprof output files are amazingly useful, especially once benchmarks aren't able to provide precise enough guidance about what to look at next. The time pprofs are good; the mem and alloc ones are often even better. The tools for inspecting them have also gotten radically more awesome in the last couple of years. Profiling outputs are especially valuable compared to benchmarks for the kind of stuff we experience in refmt, because the performance profile of an operation is intensely data-dependent.

I've also started using assembly dumps a lot recently, to make sense of what the compiler is actually doing and thus to make sure my microbenchmarks aren't telling exotic lies, and that's turned out to be a lot more relevant than I would've expected. (It's really easy to make a microbenchmark that lies.)

If you wanna have a quick call sometime to talk more about ways to gather data like this, I'd be happy to :) Some of the recent major perf improvements I mentioned earlier were almost a direct result of someone throwing some pprof files at me from non-trivial prod usage, so, yeah... they're precious.
Thank you so much for your insanely thorough and thoughtful answer (and so quick)! Given that we're working with IPLD CBOR objects nearly exclusively, it seems like the ipld-prime route is probably the best place for me to look... last time I checked it out, it didn't seem like I could just drop it into a production system and expect it to work :). I've been using the pprof tools a lot (mostly CPU/memory). Skipping obj altogether might be interesting for some fast-path things... it looks like Node is doing that over in ipld-prime, with "Marshal" as opposed to "Encode".
Yeah, I don't quite wanna claim that the go-ipld-prime stuff is drop-in yet, and the profiling effort on that is also, so far... minimal. It's getting close to ready, though. And a couple of early benchmarks seem to indicate it's roughly on par with refmt already, before serious optimization work, so... it seems likely there are good things to come there :)
We have a few objects where refmt serialization is the bottleneck in our app... taking up to 1ms to serialize an object.
I'm wondering if it's possible to specify a fast path for these objects... here it seems to say I might be able to have my own machine:
However, the MarshalMachine interface has unexported type expectations in its fields.
Any other suggestions for keeping refmt around while still being able to register my own "fast path"?