frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process. #98

ghuls · 2023-02-28T14:55:38Z

frawk is 3 times slower than mawk when reading/writing to piped commands opened by the awk process.

Reading 2 input files from a pipe and writing to stdout:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m17,143s
user    0m17,215s
sys     0m0,184s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m17,341s
user    0m17,395s
sys     0m0,223s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m7,453s
user    0m7,511s
sys     0m0,175s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m20,671s
user    0m20,621s
sys     0m0,091s

Reading 2 input files from a pipe and no writing:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m12,966s
user    0m13,049s
sys     0m0,168s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m13,009s
user    0m13,075s
sys     0m0,181s

❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m4,260s
user    0m4,298s
sys     0m0,201s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";


    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            #print line1 "\t" line2
        }
    }
}' > /dev/null

real    0m14,207s
user    0m14,169s
sys     0m0,023s

Reading 2 input files from a pipe and writing output to one pipe:

❯ time gawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m38,797s
user    0m28,519s
sys     0m37,103s

❯ time gawk -b '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m37,999s
user    0m27,634s
sys     0m36,721s


❯ time mawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m8,415s
user    0m8,442s
sys     0m0,305s

❯ time frawk '
BEGIN {
    read_cmd1 = "yes | head -n 100000000";
    read_cmd2 = "yes | head -n 100000000";
    write_cmd = "cat > /dev/null";

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }
}'

real    0m28,593s
user    0m40,432s
sys     0m15,867s

In the last case, frawk even uses about 200% CPU when it has to both read from and write to pipes opened by the awk process.

ezrosent · 2023-03-02T07:08:49Z

Fascinating! Thanks for filing this issue. I think mawk, frawk, and gawk may all be buffering their file IO a bit differently. I can try to take a look at what mawk is doing and compare it with the Rust standard library.

ghuls · 2023-03-02T09:37:13Z

I assume mawk might not do line buffering in this case.

The original code I had actually decompresses gzipped files and writes out gzipped files via those *_cmd commands:
https://github.com/aertslab/single_cell_toolkit/blob/master/barcode_10x_scatac_fastqs.sh
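The read_cmd/write_cmd pattern described here can be sketched roughly as follows. This is not the linked script: the file names and sample data are made up for illustration, and any POSIX awk is assumed (swap in frawk, mawk or gawk to compare them).

```shell
#!/bin/sh
# Hypothetical sample inputs; the temp paths are assumed to contain no spaces.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' | gzip > "$tmpdir/in1.txt.gz"
printf '1\n2\n3\n' | gzip > "$tmpdir/in2.txt.gz"

awk -v in1="$tmpdir/in1.txt.gz" \
    -v in2="$tmpdir/in2.txt.gz" \
    -v out="$tmpdir/out.txt.gz" '
BEGIN {
    # Decompress both inputs and compress the output via commands
    # opened by the awk process itself.
    read_cmd1 = "gzip -dc " in1;
    read_cmd2 = "gzip -dc " in2;
    write_cmd = "gzip -c > " out;

    while ( (read_cmd1 | getline line1) > 0 ) {
        if ( (read_cmd2 | getline line2) > 0 ) {
            print line1 "\t" line2 | write_cmd
        }
    }

    # Closing the write command waits for gzip to finish writing.
    close(read_cmd1);
    close(read_cmd2);
    close(write_cmd);
}'

# Show the paired lines that ended up in the compressed output.
result=$(gzip -dc "$tmpdir/out.txt.gz")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Every line goes through two input pipes and one output pipe here, so any per-line overhead in getline or in writing to a command is paid once per record.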


ezrosent · 2023-03-21T05:21:05Z

I definitely think that line buffering on output was a big issue in the last benchmark. That's fixed in the latest commit; reading is still slower though.
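To get a feel for what line buffering on an output pipe costs, it can be emulated in any awk by flushing after every print. This is only a sketch of the effect, not of frawk's internals: the `fflush(write_cmd)` call stands in for a line-buffered writer, and `n` and the `wc -l` sink are arbitrary choices made for the example.

```shell
#!/bin/sh
n=200000

# Block-buffered: awk decides when to flush the pipe to the command.
buffered=$(awk -v n="$n" 'BEGIN {
    write_cmd = "wc -l";
    for (i = 0; i < n; i++) {
        print "y" | write_cmd
    }
    close(write_cmd)
}')

# Emulated line buffering: force a flush (and thus a write) after every line.
flushed=$(awk -v n="$n" 'BEGIN {
    write_cmd = "wc -l";
    for (i = 0; i < n; i++) {
        print "y" | write_cmd
        fflush(write_cmd)
    }
    close(write_cmd)
}')

# Both variants deliver the same lines; only the number of writes differs.
echo "buffered: $((buffered + 0)) lines, flushed per line: $((flushed + 0)) lines"
```

Timing the two awk invocations separately (for example with `time` in bash) should show the flushed variant spending noticeably more time in sys, which is the same kind of overhead visible in the gawk and frawk write_cmd timings above.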

ghuls · 2023-07-25T09:12:01Z

Could CommandReader be used for reading from pipes to solve this issue? https://docs.rs/grep-cli/latest/grep_cli/struct.CommandReader.html


ezrosent · 2023-07-26T05:55:43Z

Feel free to try things out on that latest commit: I don't notice any improvement (and wrapping in a BufRead doesn't seem to help either, unfortunately).

ghuls · 2023-07-26T09:21:26Z

It is probably not related to reading from a pipe; getline itself seems to be slow.
When reading directly from a pre-made file (with getline) instead of from a piped filehandle, the slowdown is the same.

❯  time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null

real    0m8.219s
user    0m8.313s
sys     0m0.421s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 } }' > /dev/null

real    0m23.011s
user    0m23.014s
sys     0m0.491s


❯  time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 > "/dev/null" } }'

real    0m26.739s
user    0m26.760s
sys     0m0.512s


❯  time yes | head -n 100000000 | frawk 'BEGIN { write_cmd = "cat > /dev/null"; while ( (getline line1 < "/dev/stdin") > 0 ) { print line1 | write_cmd } }'

real    0m26.507s
user    0m26.519s
sys     0m0.564s

# Create file first.
❯  yes | head -n 100000000 > 100000000.txt

❯  time frawk 'BEGIN { while ( (getline line1 < "100000000.txt") > 0 ) { print line1 } }' > /dev/null

real    0m23.025s
user    0m22.778s
sys     0m0.148s

Also now that CommandReader is used, it should be relatively straightforward to be able to handle compressed text files automagically if requested by constructing a CommandReader with the correct decompression tool.

ghuls · 2024-09-20T12:33:36Z

getline only seems to be slow when reading from a pipe or a redirected file.
Otherwise it seems to be faster than awk's normal implicit input loop.

# Read and print all input lines.
$ time yes | head -n 100000000 | frawk '{ print $0 }' > /dev/null

real    0m8.452s
user    0m8.514s
sys     0m0.289s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { print $0 } }' > /dev/null

real    0m7.271s
user    0m7.335s
sys     0m0.277s


# Count number of input lines:
$ time yes | head -n 100000000 | frawk 'END { print NR }'
100000000

real    0m3.919s
user    0m3.968s
sys     0m0.255s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline ) > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m2.656s
user    0m2.707s
sys     0m0.233s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( (getline < "/dev/stdin") > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m20.239s
user    0m20.232s
sys     0m0.296s

$ time yes | head -n 100000000 | frawk 'BEGIN { while ( ("cat -" | getline) > 0 ) { line_count += 1;} print line_count}'
100000000

real    0m20.051s
user    0m20.069s
sys     0m0.336s
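The three getline variants compared above can be put side by side in one small script, which makes it easier to rerun the comparison against different awk implementations. `AWK` and `N` are knobs introduced for this sketch and are not part of the thread; the script only checks that all variants see the same input and leaves the timing to the caller.

```shell
#!/bin/sh
# AWK: implementation under test (e.g. frawk, mawk, gawk); defaults to awk.
# N:   number of input lines; the thread uses 100000000.
AWK=${AWK:-awk}
N=${N:-100000}

# Same input generator as in the benchmarks above.
gen() { yes | head -n "$N"; }

# 1. Plain getline on the main input (the fast case in frawk).
plain=$(gen | "$AWK" 'BEGIN { while ( (getline) > 0 ) { line_count += 1 } print line_count }')

# 2. getline redirected from a file name (slow in frawk).
redirected=$(gen | "$AWK" 'BEGIN { while ( (getline < "/dev/stdin") > 0 ) { line_count += 1 } print line_count }')

# 3. getline from a command opened by awk (slow in frawk).
piped=$(gen | "$AWK" 'BEGIN { while ( ("cat -" | getline) > 0 ) { line_count += 1 } print line_count }')

echo "plain=$plain redirected=$redirected piped=$piped"
```

Running it as, say, `AWK=frawk N=100000000 sh script.sh` and prefixing each pipeline with `time` (in bash) reproduces the comparison. When there is only a single input stream, feeding it on stdin and using plain getline looks like a possible workaround, judging from the timings above.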

ezrosent added a commit that referenced this issue Mar 21, 2023

Don't line-buffer when writing to commands

3f37438

This seems to be one of the main issues outlined in #98

ezrosent added a commit that referenced this issue Jul 26, 2023

Swap CommandReader in for ChildStdout

3288069

As suggested in #98. This doesn't appear to help performance too much, but it is a good wrapper around the standard library routines.

zamazan4ik mentioned this issue Oct 7, 2023

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #103

Open

ghuls mentioned this issue Sep 24, 2024

getline is much slower when reading from a pipe or redirected file compared with mawk. linux-china/zawk#4

Open

ghuls mentioned this issue Oct 18, 2024

Can the executable size be made smaller #113

Open
