Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement LZ4 transport compression #33

Merged
merged 19 commits into from
Oct 22, 2022

Conversation

DylanModesitt
Copy link
Contributor

@DylanModesitt DylanModesitt commented Sep 18, 2022

PR for #8. Opening here in case there are interim comments, suggestions, or requirements. I have seen the @athre0z branch and will incorporate as necessary. Plans are to finish remaining work as time permits.

  • Develop a CityHash128 library
  • Test CityHash128 against ch hash values (only been hand tested)
  • Implement new CompressedBlock or alternative refactoring

@DylanModesitt DylanModesitt marked this pull request as draft September 18, 2022 00:04
@DylanModesitt DylanModesitt marked this pull request as ready for review October 2, 2022 16:29
@DylanModesitt
Copy link
Contributor Author

DylanModesitt commented Oct 2, 2022

If time permits, I'd like to solicit some feedback on this implementation.

I can add more tests explicitly around reading/writing compressed blocks if desired. I stuck with @athre0z's sketch for chread and chwrite on the same Block structure instead of adding some more types & doing value dispatch on the compression method or something along those lines -- but happy to refactor as desired.


As an aside, this is just one of the features that I wanted for my own use, but my current plans are to implement the DBInterface next. This will require something akin to #9. clickhouse-driver seems to have a simple implementation of that which I plan to follow. Also think we ought to do #12. After that, I'll beef up the docs.

I also think we should maybe tear out the explicit query building on insert ("INSERT INTO $(table) VALUES") and instead overload execute to take an iter wherein the ddl_query would be passed from the user as "INSERT INTO {{table}} VALUES" which I believe is more safer, more flexible, and more intuitive for users. We can move these discussions to issues but just wanted to jot down some thoughts.

src/tcp/DataBlocks.jl Outdated Show resolved Hide resolved
Copy link
Member

@athre0z athre0z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks a lot for the PR!

I did a first quick pass and the overall implementation looks solid. I left a few comments and will do another reading pass later. It might make sense to move the CityHash128 implementation into a separate project / repository in order to allow other projects to use it as well, but I'd also okay with having it here for the time being.

We can move these discussions to issues but just wanted to jot down some thoughts.

Yes, I think that'd be best for readability!

test/cityhash.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Outdated Show resolved Hide resolved
src/tcp/CityHash128.jl Outdated Show resolved Hide resolved
src/tcp/CityHash128.jl Outdated Show resolved Hide resolved
@DylanModesitt DylanModesitt requested a review from athre0z October 8, 2022 20:54
@DylanModesitt
Copy link
Contributor Author

city_hash_128 performance is now appropriate due to not allocating in fetch64 and friends. Benchmark for a 523001 length string.

Before:

BenchmarkTools.Trial: 260 samples with 1 evaluation.
 Range (min … max):  18.340 ms … 22.842 ms  ┊ GC (min … max): 0.00% … 16.17%
 Time  (median):     18.621 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.274 ms ±  1.365 ms  ┊ GC (mean ± σ):  3.52% ±  6.18%

    ▂██                                                        
  ▂▄███▇▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▄▃▂ ▂
  18.3 ms         Histogram: frequency by time        22.2 ms <

 Memory estimate: 10.48 MiB, allocs estimate: 245180.

After:

BenchmarkTools.Trial: 6759 samples with 1 evaluation.
 Range (min … max):  684.250 μs …   3.343 ms  ┊ GC (min … max): 0.00% … 78.02%
 Time  (median):     717.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   738.717 μs ± 202.423 μs  ┊ GC (mean ± σ):  2.70% ±  7.25%

                ▁▅▇█▇▄▂▁                                         
  ▂▄▃▂▂▂▂▂▂▂▂▂▃▅████████▆▆▅▄▄▃▃▃▃▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▂▂ ▃
  684 μs           Histogram: frequency by time          795 μs <

 Memory estimate: 510.98 KiB, allocs estimate: 5.

The other suggested changes were also implemented (thanks)

@DylanModesitt
Copy link
Contributor Author

@athre0z anything else you would like for this PR?

@athre0z
Copy link
Member

athre0z commented Oct 15, 2022

Thanks for the update and your patience -- my time has unfortunately been rather limited in the past few days and will likely continue to be so for the next two weeks. I'll try to carve out some time tomorrow for a final review pass.

In the mean time: I noticed that the DelimitedFiles package is being used in the tests while not being listed in Project.toml, causing the tests to fail unless manually installed up front. We'll either want to add this as a test-dependency or stop using this package; adding a general dependency just for the tests seems a bit overkill. I don't really have a preference here: feel free to pick what is easier for you.

src/Connect.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Outdated Show resolved Hide resolved
src/tcp/Compression.jl Show resolved Hide resolved
@DylanModesitt
Copy link
Contributor Author

Thanks for the update and your patience -- my time has unfortunately been rather limited in the past few days and will likely continue to be so for the next two weeks. I'll try to carve out some time tomorrow for a final review pass.

No worries -- appreciate your thoughtful review! The comments you've made have been resolved. I will also see if I can squeeze a PR into CodecLz4.jl to fix the GC issue upstream and then later remove it here in a new PR.

In the mean time: I noticed that the DelimitedFiles package is being used in the tests while not being listed in Project.toml, causing the tests to fail unless manually installed up front. We'll either want to add this as a test-dependency or stop using this package; adding a general dependency just for the tests seems a bit overkill. I don't really have a preference here: feel free to pick what is easier for you.

Resolved - unnecessary dep.

Copy link
Member

@athre0z athre0z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thank you for tackling this!

@athre0z athre0z changed the title (WIP) Transport Compression Implement LZ4 transport compression Oct 22, 2022
@athre0z athre0z merged commit acf1583 into JuliaDatabases:master Oct 22, 2022
@athre0z athre0z mentioned this pull request Oct 22, 2022
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants