
Potential optimizations using System.Span, System.Memory, and stackalloc #22

Open
lostromb opened this issue May 31, 2018 · 8 comments

@lostromb
Owner

lostromb commented May 31, 2018

Concentus interpreted almost all C pointers as a combination of an array and an integer offset, e.g. "int[] x, int x_ptr". These could potentially all be replaced by the new Span constructs in .NET Core 2.0 for better optimization. Furthermore, most of the original stack array allocations are done on the heap in C#, which also hurts performance (except where they were hand-inlined inside the FFT).
See https://msdn.microsoft.com/en-us/magazine/mt814808.aspx
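For illustration, here is a minimal sketch of the kind of change meant here (the helper name is made up, not actual Concentus code):

```csharp
// The existing convention passes an array plus an integer offset:
static int InnerProduct(int[] x, int x_ptr, int[] y, int y_ptr, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += x[x_ptr + i] * y[y_ptr + i];
    return sum;
}

// The same kernel over spans; callers pass x.AsSpan(x_ptr, len) instead
// of a separate array and offset:
static int InnerProduct(ReadOnlySpan<int> x, ReadOnlySpan<int> y)
{
    int sum = 0;
    for (int i = 0; i < x.Length; i++)
        sum += x[i] * y[i];
    return sum;
}
```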

@prepconcede

prepconcede commented Sep 15, 2021

Hello, sir. How can I help? I'm not really good at low-level stuff, but I'm ready to try where I can.

@lostromb
Owner Author

Uh, yeah, sure. You can start with some of the tight-loop kernels such as here. Replace the (byte[] array, int array_offset, int array_length) function signatures with Span<byte> and then work upwards until the build passes again. Run the parity test console for a while to ensure there are no regressions. There are also places that allocate temporary heap arrays which could be switched to stackalloc, such as here in the NLSF encoder. You will probably have to reference the original C code to determine safe fixed-length array sizes, since you can't do variable-length stackalloc.
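A minimal sketch of the stackalloc part, assuming a fixed bound taken from the C code (the method name is hypothetical, not actual Concentus code):

```csharp
static void QuantizeResidual(ReadOnlySpan<short> input, Span<short> output, int order)
{
    // Stand-in for whatever bound the original C #define guarantees.
    const int MAX_LPC_ORDER = 16;

    // Before: short[] tmp = new short[order];   // heap allocation on every call
    Span<short> tmp = stackalloc short[MAX_LPC_ORDER];
    tmp = tmp.Slice(0, order);                    // only use the part we need

    for (int i = 0; i < order; i++)
        tmp[i] = (short)(input[i] >> 1);          // placeholder work

    tmp.CopyTo(output);
}
```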

@lostromb
Owner Author

lostromb commented Oct 1, 2021

I took a quick shot at this because I was curious whether .NET 6 carries any measurable performance benefit for this application. Porting the obvious places in the kernels, xcorr, VQ, and NLSF yielded a slight (1-2%) benefit. Interestingly, when I ported over the MDCT code to use spans, performance decreased by about 2%. My hunch is that the pointer increments in the C code (such as in kf_bfly5()) run faster when they are interpreted as incrementing an array index variable (Fout_ptr++) rather than doing Fout = Fout.Slice(1). So it seems likely that the hand-inlined code is already quite optimized in some places.
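For illustration, the two loop styles being compared (a trivial stand-in, not the real kf_bfly5 kernel):

```csharp
// Walking the buffer with an index variable, mirroring the C pointer increment.
static void ScaleWithIndex(Span<float> Fout, float scale)
{
    int Fout_ptr = 0;
    for (int i = 0; i < Fout.Length; i++)
    {
        Fout[Fout_ptr] *= scale;
        Fout_ptr++;
    }
}

// Walking the buffer by re-slicing the span on every step.
static void ScaleWithSlice(Span<float> Fout, float scale)
{
    while (!Fout.IsEmpty)
    {
        Fout[0] *= scale;
        Fout = Fout.Slice(1);   // creates a new span view each iteration
    }
}
```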

@AraHaan

AraHaan commented Mar 15, 2022

> Performance-wise, the current build runs about 40-50% as fast as its equivalent libopus build, mostly due to the lack of stack arrays and vectorized intrinsics in managed languages.

How ironic, I was going to file this issue as well, since for C# this is no longer the case. Also, I think Tanner Gooding knows of some intrinsics that could increase perf here as well.
cc @tannergooding (sorry for the ping, Tanner).
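For what it's worth, a rough sketch of the kind of vectorization being suggested, using System.Numerics.Vector<T> on a generic dot product (not tied to any specific Concentus kernel):

```csharp
using System;
using System.Numerics;

static class VectorizedKernels
{
    // Processes Vector<float>.Count lanes per iteration (maps to SSE/AVX
    // where available), then falls back to a scalar tail.
    public static float DotProduct(ReadOnlySpan<float> x, ReadOnlySpan<float> y)
    {
        float sum = 0f;
        int i = 0;
        int lanes = Vector<float>.Count;

        for (; i <= x.Length - lanes; i += lanes)
        {
            var vx = new Vector<float>(x.Slice(i, lanes));
            var vy = new Vector<float>(y.Slice(i, lanes));
            sum += Vector.Dot(vx, vy);
        }

        for (; i < x.Length; i++)   // scalar tail
            sum += x[i] * y[i];

        return sum;
    }
}
```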

@AraHaan

AraHaan commented Mar 15, 2022

> Uh, yeah, sure. You can start with some of the tight-loop kernels such as here. Replace the (byte[] array, int array_offset, int array_length) function signatures with Span<byte> and then work upwards until the build passes again. Run the parity test console for a while to ensure there are no regressions. There are also places that allocate temporary heap arrays which could be switched to stackalloc, such as here in the NLSF encoder. You will probably have to reference the original C code to determine safe fixed-length array sizes, since you can't do variable-length stackalloc.

For variable-length stackalloc I would rather rent a Memory<T>, if that is currently possible.
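A sketch of that renting idea: for variable-length scratch buffers, rent from ArrayPool instead of stackalloc'ing an unbounded size (the rented array can be exposed as Memory<T>/Span<T> via AsMemory/AsSpan):

```csharp
using System;
using System.Buffers;

static class Scratch
{
    public static void With(int length, Action<Span<int>> body)
    {
        int[] rented = ArrayPool<int>.Shared.Rent(length);
        try
        {
            // The pooled array may be longer than requested, so slice it.
            body(rented.AsSpan(0, length));
        }
        finally
        {
            ArrayPool<int>.Shared.Return(rented);
        }
    }
}
```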

@in0finite

Hey @lostromb,

I noticed that the library generates a LOT of garbage when decoding. Even for the smallest chunk (10 ms), it allocates 37 KB of memory.

I took a quick look at the code and spotted some obvious places where Span can be used instead of arrays.

Also, ArrayPool could probably be used in some places.

Is there any plan to do this, or do you accept pull requests?

@lostromb
Owner Author

My current "plan" is to chip away at this branch and overhaul the entire code to use unsafe pointers and unmanaged allocation, which should run a lot faster, have better parity with the original code, make SIMD easier to implement, and have zero allocations. I just haven't had time to work on it for a bit.
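For reference, a very rough sketch of what the unsafe/unmanaged direction might look like (the type and names are illustrative, not taken from the actual branch; requires compiling with /unsafe):

```csharp
using System;
using System.Runtime.InteropServices;

// A small unmanaged buffer wrapper: allocated outside the GC heap,
// so it produces zero managed allocations and never moves.
unsafe struct NativeBuffer : IDisposable
{
    public int* Data;
    public int Length;

    public NativeBuffer(int length)
    {
        Data = (int*)Marshal.AllocHGlobal(length * sizeof(int));
        Length = length;
    }

    public Span<int> AsSpan() => new Span<int>(Data, Length);

    public void Dispose()
    {
        Marshal.FreeHGlobal((IntPtr)Data);
        Data = null;
    }
}
```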

@in0finite

It's good to see that there is some progress.
I don't mind having pointers in the code, but ArrayPools and stackallocs still seem better than unmanaged memory.

Plus, I am using this library on WebAssembly, so compatibility is also a concern for me.
