frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process. #98

Reading 2 input files from a pipe and writing to stdout:
Reading 2 input files from a pipe and no writing:
Reading 2 input files from a pipe and writing output to one pipe:
In the last case frawk even uses 200% CPU when it needs to both read from and write to a pipe opened by the awk process.

Comments
Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.
I assume mawk might not do line buffering in this case. The original code I had is actually decompressing gzipped files and writing out gzipped files via those pipes.
This seems to be one of the main issues outlined in #98
I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.
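For context on what "line buffering on output" costs, here is a minimal standard-library sketch (not frawk's actual writer): a LineWriter issues a write(2) per newline, while a BufWriter batches records until its buffer fills.

```rust
use std::io::{BufWriter, LineWriter, Write};

fn main() -> std::io::Result<()> {
    // Illustrative sink; a pipe or regular file behaves the same way.
    let sink = std::fs::File::create("/dev/null")?;

    // Line-buffered: LineWriter flushes at every newline, so each record
    // below costs a separate write(2) syscall.
    let mut line_buffered = LineWriter::new(sink.try_clone()?);
    for _ in 0..1_000 {
        writeln!(line_buffered, "y")?;
    }

    // Block-buffered: BufWriter only flushes when its internal buffer
    // (8 KiB by default) fills, batching thousands of records per syscall.
    let mut block_buffered = BufWriter::new(sink);
    for _ in 0..1_000 {
        writeln!(block_buffered, "y")?;
    }
    block_buffered.flush()?;
    Ok(())
}
```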
Could CommandReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html
As suggested in #98. This doesn't appear to help performance too much, but it is a good wrapper around the standard library routines.
Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a CommandReader doesn't seem to change the numbers).
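For reference, a minimal sketch of what reading through grep-cli's CommandReader looks like, assuming grep-cli as a dependency; the `cat -` command and the counting loop are just illustrative:

```rust
use std::io::{BufRead, BufReader};
use std::process::Command;

use grep_cli::CommandReader;

fn main() -> std::io::Result<()> {
    // Spawn `cat -` and read its stdout through CommandReader, which also
    // surfaces the child's stderr and exit status as read errors.
    let mut cmd = Command::new("cat");
    cmd.arg("-");
    let rdr = CommandReader::new(&mut cmd)
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e.to_string()))?;

    // Count lines, mirroring the `"cat -" | getline` benchmark below.
    let mut count = 0u64;
    for line in BufReader::new(rdr).lines() {
        line?;
        count += 1;
    }
    println!("{}", count);
    Ok(())
}
```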
It is probably not related to reading from a pipe; rather, it is getline itself that is slow.
Also, now that CommandReader is used, it should be relatively straightforward to handle compressed text files automagically, if requested, by constructing a CommandReader with the correct decompression tool.
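A sketch of that idea, assuming a hypothetical suffix-to-tool mapping (grep-cli also provides a ready-made DecompressionReader that does this dispatch itself):

```rust
use std::io::Read;
use std::process::Command;

use grep_cli::CommandReader;

// Hypothetical helper: pick a decompression command from the file suffix
// and expose the decompressed bytes as a plain `io::Read`.
fn open_auto(path: &str) -> std::io::Result<Box<dyn Read>> {
    let tool: Option<(&str, &[&str])> = if path.ends_with(".gz") {
        Some(("gzip", &["-d", "-c"]))
    } else if path.ends_with(".bz2") {
        Some(("bzip2", &["-d", "-c"]))
    } else if path.ends_with(".xz") {
        Some(("xz", &["-d", "-c"]))
    } else {
        None
    };
    match tool {
        Some((bin, flags)) => {
            let mut cmd = Command::new(bin);
            cmd.args(flags).arg(path);
            let rdr = CommandReader::new(&mut cmd)
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e.to_string()))?;
            Ok(Box::new(rdr))
        }
        // No known suffix: read the file directly.
        None => Ok(Box::new(std::fs::File::open(path)?)),
    }
}
```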
Getline only seems to be slow when reading from a pipe or a redirected file.

# Read and print all input lines.
$ time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null
real 0m8.452s
user 0m8.514s
sys 0m0.289s
$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { print $0 } }' > /dev/null
real 0m7.271s
user 0m7.335s
sys 0m0.277s
# Count number of input lines:
$ time yes | head -n 100000000 | frawk 'END { print NR }'
100000000
real 0m3.919s
user 0m3.968s
sys 0m0.255s
$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { line_count += 1;} print line_count}'
100000000
real 0m2.656s
user 0m2.707s
sys 0m0.233s
$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline < "/dev/stdin") > 0 ) { line_count += 1;} print line_count}'
100000000
real 0m20.239s
user 0m20.232s
sys 0m0.296s
$ time yes | head -n 100000000 | frawk 'BEGIN { while ( ("cat -" | getline) > 0 ) { line_count += 1;} print line_count}'
100000000
real 0m20.051s
user 0m20.069s
sys 0m0.336s
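Those numbers would be consistent with per-call read overhead on the pipe/redirect path. A rough, hypothetical micro-benchmark (not frawk's actual reader) contrasting buffered reads from a child's stdout with pathological one-byte reads:

```rust
use std::io::{BufRead, BufReader, Read};
use std::process::{Command, Stdio};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let n = 1_000_000u64;

    // Buffered: BufReader batches the underlying read(2) calls.
    let mut child = Command::new("yes").stdout(Stdio::piped()).spawn()?;
    let stdout = child.stdout.take().unwrap();
    let start = Instant::now();
    let mut lines = 0u64;
    for line in BufReader::new(stdout).lines() {
        line?;
        lines += 1;
        if lines == n {
            break;
        }
    }
    println!("buffered:   {} lines in {:?}", lines, start.elapsed());
    child.kill().ok();
    child.wait().ok();

    // Unbuffered: one read(2) per byte, the pathological case.
    let mut child = Command::new("yes").stdout(Stdio::piped()).spawn()?;
    let mut stdout = child.stdout.take().unwrap();
    let start = Instant::now();
    let mut buf = [0u8; 1];
    let mut lines = 0u64;
    while lines < n {
        if stdout.read(&mut buf)? == 0 {
            break;
        }
        if buf[0] == b'\n' {
            lines += 1;
        }
    }
    println!("unbuffered: {} lines in {:?}", lines, start.elapsed());
    child.kill().ok();
    child.wait().ok();
    Ok(())
}
```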