Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

du: our implementation is slower than GNU's #6845

Open
sylvestre opened this issue Nov 5, 2024 · 9 comments
Open

du: our implementation is slower than GNU's #6845

sylvestre opened this issue Nov 5, 2024 · 9 comments

Comments

@sylvestre
Copy link
Contributor

On firefox source code (lot of files and directories):

# with 0.0.17:

$ hyperfine "/usr/bin/du " "/usr/lib/cargo/bin/coreutils/du" 
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      3.804 s ±  0.051 s    [User: 1.174 s, System: 2.606 s]
  Range (min … max):    3.722 s …  3.884 s    10 runs

Benchmark 2: /usr/lib/cargo/bin/coreutils/du
  Time (mean ± σ):      4.822 s ±  0.056 s    [User: 1.608 s, System: 3.190 s]
  Range (min … max):    4.754 s …  4.921 s    10 runs

Summary
  '/usr/bin/du ' ran
    1.27 ± 0.02 times faster than '/usr/lib/cargo/bin/coreutils/du '

seems that we regressed:

# with 0.0.26:

hyperfine "/usr/bin/du " "/usr/lib/cargo/bin/coreutils/du "
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      3.980 s ±  0.075 s    [User: 1.181 s, System: 2.772 s]
  Range (min … max):    3.887 s …  4.087 s    10 runs

Benchmark 2: /usr/lib/cargo/bin/coreutils/du
  Time (mean ± σ):      6.947 s ±  0.096 s    [User: 2.533 s, System: 5.479 s]
  Range (min … max):    6.800 s …  7.104 s    10 runs

Summary
  '/usr/bin/du ' ran
    1.75 ± 0.04 times faster than '/usr/lib/cargo/bin/coreutils/du '
@sylvestre
Copy link
Contributor Author

@jesseschalken has been working on it

#6839
#6840
#6841

@sylvestre
Copy link
Contributor Author

With the three PR, we are almost the same level:

$ hyperfine "/usr/bin/du" "/home/sylvestre/dev/debian/coreutils/target/release/coreutils du"
Benchmark 1: /usr/bin/du
  Time (mean ± σ):      4.108 s ±  0.113 s    [User: 1.237 s, System: 2.843 s]
  Range (min … max):    3.885 s …  4.312 s    10 runs

Benchmark 2: /home/sylvestre/dev/debian/coreutils/target/release/coreutils du
  Time (mean ± σ):      4.256 s ±  0.073 s    [User: 1.954 s, System: 3.315 s]
  Range (min … max):    4.100 s …  4.333 s    10 runs

Summary
  '/usr/bin/du' ran
    1.04 ± 0.03 times faster than '/home/sylvestre/dev/debian/coreutils/target/release/coreutils du'

@sylvestre
Copy link
Contributor Author

a profile with the 3 PR https://share.firefox.dev/4fdZpVX

@ycd
Copy link
Contributor

ycd commented Jan 14, 2025

Hey! It's been almost 3 years since the last time I contributed to this repository, I wanted to get familiar with the repository again, so I decided to tackle this issue.

I made some modifications, but I'm not sure, adding more parallelism in du is a good idea or not. (See: #4362)

My optimizations/changes was mostly based around this idea:

  • Introducing a parallel directory traversal with a new thread pool implementation (and re-using the same allocated threads instead of creating new ones on each traversal).
  • Converting data structures to thread-safe ones using Arc and Mutex
    • For example to track seen_inodes, the HashSet<FileInfo> becomes Arc<Mutex<HashSet<FileInfo>>>
  • Maintaining the filesystem traversal order even when traversing FS in parallel
  • Respect to the system(s) available parallelism using thread::available_parallelism.

With these changes, I was able to make it faster than the GNU version.

$ hyperfine "/opt/homebrew/bin/gdu" "/Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du"
Benchmark 1: /opt/homebrew/bin/gdu
  Time (mean ± σ):      5.606 s ±  0.163 s    [User: 0.187 s, System: 3.504 s]
  Range (min … max):    5.296 s …  5.941 s    10 runs

Benchmark 2: /Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du
  Time (mean ± σ):      2.379 s ±  0.223 s    [User: 0.549 s, System: 7.324 s]
  Range (min … max):    2.089 s …  2.641 s    10 runs

Summary
  /Users/yagizdegirmenci/personal/coreutils/target/release/coreutils du ran
    2.36 ± 0.23 times faster than /opt/homebrew/bin/gdu

I ran the tests on gecko-dev repository.

Device: Apple M1 Max
GNU Core Utilities Version: stable 9.5

Also a good thing that all the tests are still passing without any modification to them:

...
test test_du::test_du_with_posixly_correct ... ok
test test_du::test_du_threshold ... ok
test test_du::test_du_no_dereference ... ok
test test_du::test_human_size ... ok
test result: ok. 70 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.22s

I haven't fully polished and documented the changes yet, also I wasn't sure of this optimization is something we want so I didn't create a PR, yet.

Also I'm still unsure about the output ordering, in the current implementation we were already printing output in a separate thread, now we are also traversing in multiple threads, the ordering of the output is not deterministic, could it be harmful?

@sylvestre
Copy link
Contributor Author

Impressive!

I think it should be ordered but not 100% sure

@Arcterus
Copy link
Collaborator

IIRC at minimum the directories given on the command-line need to be done in-order. I believe the actual output other than that is pretty much random (filesystem order basically), but it goes through each directory recursively before moving on to the next one.

If you've got:

src/
    some/
        data/
            foo.txt
        bar.txt
target/
    baz.txt

The output order will be something like:

src
src/some
src/some/data
src/some/data/foo.txt
src/some/bar.tx
target
target/baz.txt

Note that the directories src and target could be swapped for instance depending on what the filesystem decides to give you.

Assuming what I remember here is actually the case, I imagine there are probably some scripts that actually depend on the output drilling down through an entire directory before moving on. Depending on how much memory is acceptable to consume, all the paths could be stored in a list or something and sorted before being displayed, since I'm pretty sure the actual order of the files within a given directory doesn't matter.

@ycd
Copy link
Contributor

ycd commented Jan 21, 2025

I believe the actual output other than that is pretty much random (filesystem order basically)

Exactly, whatever the FTS (gnulibfts.c) returns, it does most of the heavy-lifting for du actually!

I imagine there are probably some scripts that actually depend on the output drilling down through an entire directory before moving on

Your reasoning is also right on this, but the output is quite different than that, if I create this:

src/
    some/
        data/
            foo.txt
        bar.txt
target/
    baz.txt

It actually returns this:

$ gdu -a
4       ./target/baz.txt
4       ./target
4       ./src/some/bar.txt
4       ./src/some/data/foo.txt
4       ./src/some/data
8       ./src/some
8       ./src
12      .

I was sure that the output order is deterministic, but I wasn't sure that it had a particular order, so I did a little digging.

It turns out there is an order indeed for the output; it is doing a post-order hierarchy walk and printing in that order. It is also about the efficiency as you said.

So the actual flow goes like this:

  • First processes all subdirectories recursively and accumulates the size for it's parent dir
  • Then processes the current directory itself

The reason behind this is; we need the sizes of all subdirectories before we can compute the total size of the parent directory.

In my implementation(also the current implementation), we're already doing a form of post-order traversal since we must process subdirectories recursively before adding their sizes to the parent directory, if we run on single thread it still matches the GNU output 1:1, but we already have another thread for printing in the current implementation, so when I send data there from several different threads, we can't preserve an order by default, so I tried few ways to keep the optimizations while keeping the output in the same format, it become quite cumbersome, either the code becomes a mess or we lose optimizations due to thread contention.

all the paths could be stored in a list or something and sorted before being displayed

I think buffering things before displaying would be a bottleneck in that scenario because we still need to rely on single thread to display all the buffered files, also time to display first byte would not be optimal since there is going to be an initial delay.


Also I'm wondering if this is ordering is really that important, since:

  • It is not a breaking change, all tests are passing
  • Probably no one uses the output du without piping it to sort or something.

Another idea that comes to my mind, can this be added as a new feature as a replacement for GNU Parallel for du particularly, instead of using parallel du, then we can have something like a --parallel flag.

@sylvestre
Copy link
Contributor Author

Probably no one uses the output du without piping it to sort or something.

I would not bet on this ;)

we can have something like a --parallel flag.

yeah, i think adding an option would be fine

@ycd
Copy link
Contributor

ycd commented Jan 21, 2025

I just realized that, even running parallel du ::: . returns in the post-order traversal order but, it also buffers the file(s) before displaying (similar to @Arcterus's approach).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants