Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process. #98

Open
ghuls opened this issue Feb 28, 2023 · 7 comments

Comments

@ghuls
Copy link
Contributor

ghuls commented Feb 28, 2023

frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process.

Reading 2 input files from a pipe and writing to stdout:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


     while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m17,143s
user    0m17,215s
sys     0m0,184s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


     while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m17,341s
user    0m17,395s
sys     0m0,223s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

    while ( (read_cmd1 | getline line1) > 0 ) {
         if ( (read_cmd2 | getline line2) > 0 ) {
             print line1 "\t" line2
         }
     }
}' > /dev/null

real    0m7,453s
user    0m7,511s
sys     0m0,175s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

     while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m20,671s
user    0m20,621s
sys     0m0,091s

Reading 2 input files from a pipe and no writing:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m12,966s
user    0m13,049s
sys     0m0,168s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m13,009s
user    0m13,075s
sys     0m0,181s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m4,260s
user    0m4,298s
sys     0m0,201s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m14,207s
user    0m14,169s
sys     0m0,023s

Reading 2 input files from a pipe and writing output to one pipe:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m38,797s
user    0m28,519s
sys     0m37,103s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m37,999s
user    0m27,634s
sys     0m36,721s


❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m8,415s
user    0m8,442s
sys     0m0,305s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m28,593s
user    0m40,432s
sys     0m15,867s

In the last case frawk uses even 200% CPU when it needs to read and write to a pipe openen by the awk process.

@ezrosent
Copy link
Owner

ezrosent commented Mar 2, 2023

Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.

@ghuls
Copy link
Contributor Author

ghuls commented Mar 2, 2023

I assume mawk might not do line buffering in this case.

The original code I had is actually decompressing gziiped files and writing out gzipped files via those *_cmd commands:
https://github.com/aertslab/single_cell_toolkit/blob/master/barcode_10x_scatac_fastqs.sh

ezrosent added a commit that referenced this issue Mar 21, 2023
This seems to be one of the main issues outlined in #98
@ezrosent
Copy link
Owner

I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.

@ghuls
Copy link
Contributor Author

ghuls commented Jul 25, 2023

Could CommanReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html

ezrosent added a commit that referenced this issue Jul 26, 2023
As suggested in #98. This doesn't appear to help performance too much, but it is
a good wrapper around the standard library routines.
@ezrosent
Copy link
Owner

Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a BufRead doesn't seem to help either, unfortunately).

@ghuls
Copy link
Contributor Author

ghuls commented Jul 26, 2023

Probably it is not related to reading from a pipe, but just getline that is slow.
When reading from a premade file directly (with getline) instead of a piped filehandle, the slowdown is the same.

❯  time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null

real    0m8.219s
user    0m8.313s
sys     0m0.421s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 } }' > /dev/null

real    0m23.011s
user    0m23.014s
sys     0m0.491s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 > "/dev/null" } }'

real    0m26.739s
user    0m26.760s
sys     0m0.512s


❯  time yes | head -n 100000000 | frawk 'BEGIN { write_cmd = "cat > /dev/null"; while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 | write_cmd } }'

real    0m26.507s
user    0m26.519s
sys     0m0.564s

# Create file first.
❯  yes | head -n 100000000 > 100000000.txt

❯  time frawk 'BEGIN { while ( (getline line1 < "100000000.txt") > 0 ) { print line1 } }' > /dev/null

real    0m23.025s
user    0m22.778s
sys     0m0.148s

Also now that CommandReader is used, it should be relatively straightforward to be able to handle compressed text files automagically if requested by constructing a CommandReader with the correct decompression tool.

@ghuls
Copy link
Contributor Author

ghuls commented Sep 20, 2024

Getline only seems to be slow when reading from a pipe or a a redirected file.
It seems to be faster than the normal implicit for loop of awk otherwise.

# Read and print all input lines.
$ time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null

real    0m8.452s
user    0m8.514s
sys     0m0.289s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { print $0 } }' > /dev/null

real    0m7.271s
user    0m7.335s
sys     0m0.277s


# Count number of input lines:
$ time yes | head -n 100000000 | frawk 'END { print NR }'
100000000

real    0m3.919s
user    0m3.968s
sys     0m0.255s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m2.656s
user    0m2.707s
sys     0m0.233s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline < "/dev/stdin") > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m20.239s
user    0m20.232s
sys     0m0.296s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( ("cat -" | getline) > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m20.051s
user    0m20.069s
sys     0m0.336s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants