du: our implementation is slower than GNU's #6845

sylvestre · 2024-11-05T22:12:00Z

On firefox source code (lot of files and directories):

# with 0.0.17:

$ hyperfine "/usr/bin/du " "/usr/lib/cargo/bin/coreutils/du" 
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      3.804 s ±  0.051 s    [User: 1.174 s, System: 2.606 s]
  Range (min … max):    3.722 s …  3.884 s    10 runs

Benchmark 2: /usr/lib/cargo/bin/coreutils/du
  Time (mean ± σ):      4.822 s ±  0.056 s    [User: 1.608 s, System: 3.190 s]
  Range (min … max):    4.754 s …  4.921 s    10 runs

Summary
  '/usr/bin/du ' ran
    1.27 ± 0.02 times faster than '/usr/lib/cargo/bin/coreutils/du '

seems that we regressed:

# with 0.0.26:

hyperfine "/usr/bin/du " "/usr/lib/cargo/bin/coreutils/du "
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      3.980 s ±  0.075 s    [User: 1.181 s, System: 2.772 s]
  Range (min … max):    3.887 s …  4.087 s    10 runs

Benchmark 2: /usr/lib/cargo/bin/coreutils/du
  Time (mean ± σ):      6.947 s ±  0.096 s    [User: 2.533 s, System: 5.479 s]
  Range (min … max):    6.800 s …  7.104 s    10 runs

Summary
  '/usr/bin/du ' ran
    1.75 ± 0.04 times faster than '/usr/lib/cargo/bin/coreutils/du '

The text was updated successfully, but these errors were encountered:

sylvestre · 2024-11-06T10:16:10Z

@jesseschalken has been working on it

#6839
#6840
#6841

sylvestre · 2024-11-06T10:25:52Z

With the three PR, we are almost the same level:

$ hyperfine "/usr/bin/du" "/home/sylvestre/dev/debian/coreutils/target/release/coreutils du"
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      4.108 s ±  0.113 s    [User: 1.237 s, System: 2.843 s]
  Range (min … max):    3.885 s …  4.312 s    10 runs

Benchmark 2: /home/sylvestre/dev/debian/coreutils/target/release/coreutils du
  Time (mean ± σ):      4.256 s ±  0.073 s    [User: 1.954 s, System: 3.315 s]
  Range (min … max):    4.100 s …  4.333 s    10 runs

Summary
  '/usr/bin/du' ran
    1.04 ± 0.03 times faster than '/home/sylvestre/dev/debian/coreutils/target/release/coreutils du'

sylvestre · 2024-11-06T10:36:37Z

a profile with the 3 PR https://share.firefox.dev/4fdZpVX

ycd · 2025-01-14T20:51:45Z

Hey! It's been almost 3 years since the last time I contributed to this repository, I wanted to get familiar with the repository again, so I decided to tackle this issue.

I made some modifications, but I'm not sure, adding more parallelism in du is a good idea or not. (See: #4362)

My optimizations/changes was mostly based around this idea:

Introducing a parallel directory traversal with a new thread pool implementation (and re-using the same allocated threads instead of creating new ones on each traversal).
Converting data structures to thread-safe ones using Arc and Mutex
- For example to track seen_inodes, the HashSet<FileInfo> becomes Arc<Mutex<HashSet<FileInfo>>>
Maintaining the filesystem traversal order even when traversing FS in parallel
Respect to the system(s) available parallelism using thread::available_parallelism.

With these changes, I was able to make it faster than the GNU version.

$ hyperfine "/opt/homebrew/bin/gdu" "/Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du"
Benchmark 1: /opt/homebrew/bin/gdu
  Time (mean ± σ):      5.606 s ±  0.163 s    [User: 0.187 s, System: 3.504 s]
  Range (min … max):    5.296 s …  5.941 s    10 runs

Benchmark 2: /Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du
  Time (mean ± σ):      2.379 s ±  0.223 s    [User: 0.549 s, System: 7.324 s]
  Range (min … max):    2.089 s …  2.641 s    10 runs

Summary
  /Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du ran
    2.36 ± 0.23 times faster than /opt/homebrew/bin/gdu

I ran the tests on gecko-dev repository.

Device: Apple M1 Max
GNU Core Utilities Version: stable 9.5

Also a good thing that all the tests are still passing without any modification to them:

...
test test_du::test_du_with_posixly_correct ... ok
test test_du::test_du_threshold ... ok
test test_du::test_du_no_dereference ... ok
test test_du::test_human_size ... ok
test result: ok. 70 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.22s

I haven't fully polished and documented the changes yet, also I wasn't sure of this optimization is something we want so I didn't create a PR, yet.

Also I'm still unsure about the output ordering, in the current implementation we were already printing output in a separate thread, now we are also traversing in multiple threads, the ordering of the output is not deterministic, could it be harmful?

sylvestre · 2025-01-14T22:13:14Z

Impressive!

I think it should be ordered but not 100% sure

Arcterus · 2025-01-21T13:46:56Z

IIRC at minimum the directories given on the command-line need to be done in-order. I believe the actual output other than that is pretty much random (filesystem order basically), but it goes through each directory recursively before moving on to the next one.

If you've got:

src/
    some/
        data/
            foo.txt
        bar.txt
target/
    baz.txt

The output order will be something like:

src
src/some
src/some/data
src/some/data/foo.txt
src/some/bar.tx
target
target/baz.txt

Note that the directories src and target could be swapped for instance depending on what the filesystem decides to give you.

Assuming what I remember here is actually the case, I imagine there are probably some scripts that actually depend on the output drilling down through an entire directory before moving on. Depending on how much memory is acceptable to consume, all the paths could be stored in a list or something and sorted before being displayed, since I'm pretty sure the actual order of the files within a given directory doesn't matter.

ycd · 2025-01-21T15:35:10Z

I believe the actual output other than that is pretty much random (filesystem order basically)

Exactly, whatever the FTS (gnulibfts.c) returns, it does most of the heavy-lifting for du actually!

I imagine there are probably some scripts that actually depend on the output drilling down through an entire directory before moving on

Your reasoning is also right on this, but the output is quite different than that, if I create this:

src/
    some/
        data/
            foo.txt
        bar.txt
target/
    baz.txt

It actually returns this:

$ gdu -a
4       ./target/baz.txt
4       ./target
4       ./src/some/bar.txt
4       ./src/some/data/foo.txt
4       ./src/some/data
8       ./src/some
8       ./src
12      .

I was sure that the output order is deterministic, but I wasn't sure that it had a particular order, so I did a little digging.

It turns out there is an order indeed for the output; it is doing a post-order hierarchy walk and printing in that order. It is also about the efficiency as you said.

So the actual flow goes like this:

First processes all subdirectories recursively and accumulates the size for it's parent dir
Then processes the current directory itself

The reason behind this is; we need the sizes of all subdirectories before we can compute the total size of the parent directory.

In my implementation(also the current implementation), we're already doing a form of post-order traversal since we must process subdirectories recursively before adding their sizes to the parent directory, if we run on single thread it still matches the GNU output 1:1, but we already have another thread for printing in the current implementation, so when I send data there from several different threads, we can't preserve an order by default, so I tried few ways to keep the optimizations while keeping the output in the same format, it become quite cumbersome, either the code becomes a mess or we lose optimizations due to thread contention.

all the paths could be stored in a list or something and sorted before being displayed

I think buffering things before displaying would be a bottleneck in that scenario because we still need to rely on single thread to display all the buffered files, also time to display first byte would not be optimal since there is going to be an initial delay.

Also I'm wondering if this is ordering is really that important, since:

It is not a breaking change, all tests are passing
Probably no one uses the output du without piping it to sort or something.

Another idea that comes to my mind, can this be added as a new feature as a replacement for GNU Parallel for du particularly, instead of using parallel du, then we can have something like a --parallel flag.

sylvestre · 2025-01-21T15:41:26Z

Probably no one uses the output du without piping it to sort or something.

I would not bet on this ;)

we can have something like a --parallel flag.

yeah, i think adding an option would be fine

ycd · 2025-01-21T18:04:16Z

I just realized that, even running parallel du ::: . returns in the post-order traversal order but, it also buffers the file(s) before displaying (similar to @Arcterus's approach).

sylvestre added good first issue For newcomers! U - du I - Performance (Speed) labels Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

du: our implementation is slower than GNU's #6845

du: our implementation is slower than GNU's #6845

sylvestre commented Nov 5, 2024

sylvestre commented Nov 6, 2024

sylvestre commented Nov 6, 2024

sylvestre commented Nov 6, 2024

ycd commented Jan 14, 2025

sylvestre commented Jan 14, 2025

Arcterus commented Jan 21, 2025

ycd commented Jan 21, 2025

sylvestre commented Jan 21, 2025

ycd commented Jan 21, 2025 •

edited

Loading

du: our implementation is slower than GNU's #6845

du: our implementation is slower than GNU's #6845

Comments

sylvestre commented Nov 5, 2024

sylvestre commented Nov 6, 2024

sylvestre commented Nov 6, 2024

sylvestre commented Nov 6, 2024

ycd commented Jan 14, 2025

sylvestre commented Jan 14, 2025

Arcterus commented Jan 21, 2025

ycd commented Jan 21, 2025

sylvestre commented Jan 21, 2025

ycd commented Jan 21, 2025 • edited Loading

ycd commented Jan 21, 2025 •

edited

Loading