
Switch live rootfs from squashfs to EROFS #1852

Open · jlebon opened this issue Dec 17, 2024 · 43 comments

@jlebon (Member) commented Dec 17, 2024

Currently, we ship a rootfs.img CPIO (both as a separate artifact and as part of the live ISO) which contains the rootfs as a squashfs image.

Let's switch it over to use EROFS instead. Since EROFS is already in use by composefs, this reduces the number of read-only filesystem image formats we have to care about.

This work would happen in osbuild since that's where our live ISO is now being built.
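For illustration, a rough sketch of the intended change (all paths here are hypothetical placeholders, not the actual osbuild stage definitions; the squashfs zstd settings mirror what's mentioned later in this thread):

$ # today: pack the rootfs tree as squashfs, then wrap it in the rootfs.img CPIO
$ mksquashfs rootfs/ root.squashfs -comp zstd -Xcompression-level 15 -noappend
$ # proposed: produce an EROFS image instead; the CPIO wrapping step stays the same
$ mkfs.erofs -zlzma,level=6 root.erofs rootfs/
$ echo root.erofs | cpio -o -H newc > rootfs.img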

@AdamWill commented Dec 19, 2024

@jlebon asked me to drop a link to @Conan-Kudo's work on Fedora Kiwi-built lives here, so see https://pagure.io/fedora-kiwi-descriptions/pull-request/105. We also already have an EROFS image in the main compose, one of the FEX images for Asahi: https://pagure.io/fedora-kiwi-descriptions/blob/rawhide/f/teams/asahi.xml#_6.

@Conan-Kudo

Note that erofs is nowhere near as capable or performant as squashfs for compression right now, and erofs image builds take a lot longer than squashfs builds to reach roughly equivalent storage sizes (almost 3x the time).

@hsiangkao commented Dec 20, 2024

> Note that erofs is nowhere near as capable or performant as squashfs for compression right now, and erofs image builds take a lot longer than squashfs builds to reach roughly equivalent storage sizes (almost 3x the time).

By the way, could you test -zlzma,6 -C1048576 -Eall-fragments (note: without -Ededupe) to compare with squashfs xz -b 1m (since that seems to be the current setting in kiwi)?
This combination is already multi-threaded and, except for BCJ, it's already comparable to squashfs.

I wonder what the image sizes and build speed would be with this combination.
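Side by side, the comparison being suggested would look roughly like this (a sketch; rootfs/ stands in for the kiwi image root, and root.squashfs/root.erofs are placeholder output names):

$ # current kiwi setting: squashfs with xz and 1 MiB blocks
$ time mksquashfs rootfs/ root.squashfs -comp xz -b 1M -noappend
$ # suggested erofs equivalent: multi-threaded LZMA, 1 MiB clusters, all-fragments, no dedupe
$ time mkfs.erofs -zlzma,6 -C1048576 -Eall-fragments root.erofs rootfs/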

@Conan-Kudo

I've just added a commit to test it in my pull request. Let's see how it goes.

@hsiangkao

> I've just added a commit to test it in my pull request. Let's see how it goes.

I know that -zlzma,6 -C131072 -Eall-fragments,dedupe has been tried, but I guess -C1048576 -Eall-fragments may produce results similar to -C131072 -Eall-fragments,dedupe, and it's already multi-threaded now.

@Conan-Kudo

Unfortunately it looks like this one is taking way too long.

@hsiangkao

> Unfortunately it looks like this one is taking way too long.

It seems it finished?

@Conan-Kudo

No, it timed out: https://artifacts.dev.testing-farm.io/277af8a0-bf57-4499-98c1-e90531d0b43d/

@hsiangkao commented Dec 20, 2024

> No, it timed out: https://artifacts.dev.testing-farm.io/277af8a0-bf57-4499-98c1-e90531d0b43d/

OK, I didn't know how to parse the raw result. By the way, what's the current squashfs build time? Do you have a log I could check too?
Also, I'm not sure if I should bother you with more combinations, but I still wonder whether -zlzma,level=6,dictsize=524288 -C524288 -Eall-fragments could finish in time, and what the image sizes would be. By default, the dictionary size is 8x the -C cluster size, which can slow things down (squashfs uses its block size as the dictionary size, so its dict size matches the block size set).
Also, could I have a way to test locally too?

@Conan-Kudo commented Dec 20, 2024

From a recent job with the current settings: https://artifacts.dev.testing-farm.io/18600fee-4f88-4c4e-940b-97c98960f752/

It took 15 minutes based on this log.

To test it locally, you can do so on any Fedora 41+ system:

$ sudo dnf install kiwi-cli kiwi-systemdeps git
$ git clone --branch reapply-erofs-live https://pagure.io/fedora-kiwi-descriptions.git
$ cd fedora-kiwi-descriptions
$ sudo ./kiwi-build --output-dir=$PWD/tmpoutput --image-type=iso --image-profile=KDE-Desktop-Live --image-release=0 --debug

If you want to test with squashfs, just switch to the rawhide branch.

@hsiangkao

> ... but I still wonder whether -zlzma,level=6,dictsize=524288 -C524288 -Eall-fragments could finish in time, and what the image sizes would be.

Because I'm afraid -C1048576 will still time out, trying -C524288 might be better; I'd also like to know the impact of the dictionary size.

> From a recent job with the current settings: https://artifacts.dev.testing-farm.io/18600fee-4f88-4c4e-940b-97c98960f752/
>
> It took 15 minutes based on this log.

Ok.

> To test it locally, you can do so on any Fedora 41+ system:
>
> $ sudo dnf install kiwi-cli kiwi-systemdeps git
> $ git clone --branch reapply-erofs-live https://pagure.io/fedora-kiwi-descriptions.git
> $ cd fedora-kiwi-descriptions
> $ sudo ./kiwi-build --output-dir=$PWD/tmpoutput --image-type=iso --image-profile=KDE-Desktop-Live --image-release=0 --debug
>
> If you want to test with squashfs, just switch to the rawhide branch.

Let me try, thanks.

@hsiangkao

> > To test it locally, you can do so on any Fedora 41+ system:
> >
> > $ sudo dnf install kiwi-cli kiwi-systemdeps git
> > $ git clone --branch reapply-erofs-live https://pagure.io/fedora-kiwi-descriptions.git
> > $ cd fedora-kiwi-descriptions
> > $ sudo ./kiwi-build --output-dir=$PWD/tmpoutput --image-type=iso --image-profile=KDE-Desktop-Live --image-release=0 --debug
> >
> > If you want to test with squashfs, just switch to the rawhide branch.
>
> Let me try, thanks.

By the way, can it work in a container (e.g. Docker) or a VM?

@Conan-Kudo commented Dec 20, 2024

In a VM, yes; in a Docker-style container environment, no.

@hsiangkao commented Dec 20, 2024

> From a recent job with the current settings: https://artifacts.dev.testing-farm.io/18600fee-4f88-4c4e-940b-97c98960f752/
>
> It took 15 minutes based on this log.

Another question: how many CPUs did this job use? I couldn't find any hint in the log.

[ DEBUG   ]: 13:47:43 | Looking for mkfs.erofs in /root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
[ DEBUG   ]: 13:47:43 | EXEC: [mkfs.erofs -Eall-fragments -C524288 -z lzma,level=6,dictsize=524288 /var/tmp/kiwi_bthrnl2n /root/fedora-kiwi-descriptions/tmpoutput-build/build/image-root/]
[ DEBUG   ]: 14:34:42 | Creating directory /root/fedora-kiwi-descriptions/tmpoutput-build/live-media.gcmhugj0/LiveOS

I've tried this configuration on a virtual cloud server with an Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz (32 cores), and the result is:

2.9G(3097276416) Dec 20 14:35 Fedora.x86_64-Rawhide.iso

I'm trying -C1048576 -z lzma,level=6,dictsize=1048576 now, but it shouldn't be significantly smaller.

It seems the main bottleneck is that although the main pass of -Eall-fragments is multi-threaded, the preprocessing that moves fragments into the special inode is still single-threaded for now, and it takes nearly 20 minutes in my test environment.

I don't think build performance will improve with the latest mkfs soon; I have to work out fully multi-threaded fragments and dedupe first and try again.

@Conan-Kudo

Well, it took over an hour and a half on my Framework 16, which has an AMD Ryzen 9 7940HS (16 cores).

@hsiangkao commented Dec 21, 2024

> Well, it took over an hour and a half on my Framework 16, which has an AMD Ryzen 9 7940HS (16 cores).

I've fixed a bug that could cause slow image builds when there is a lot of incompressible data, and the time dropped to 20 minutes in my test environment (Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 32 cores):

-Eall-fragments -C524288 -z lzma,level=6,dictsize=524288 :

[ INFO    ]: 06:45:32 | Packing system into dracut live ISO type: dmsquash
[ DEBUG   ]: 06:45:32 | Looking for mkfs.erofs in /root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
[ DEBUG   ]: 06:45:32 | EXEC: [mkfs.erofs -Eall-fragments -C524288 -z lzma,level=6,dictsize=524288 /var/tmp/kiwi_daobb3zm /root/fedora-kiwi-descriptions/tmpoutput-build/build/image-root/]
[ DEBUG   ]: 07:04:40 | Creating directory /root/fedora-kiwi-descriptions/tmpoutput-build/live-media.jcb_9tq9/LiveOS
[ INFO    ]: 07:04:44 | Creating live ISO image

The resulting image size is:

2.9G(3096842240) Dec 21 07:05 Fedora.x86_64-Rawhide.iso

The mkfs version can be checked out as:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experiemental

(see https://lore.kernel.org/r/[email protected]); it needs ./configure --enable-multithreading.
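Spelled out, the from-source build would be roughly as follows (a sketch, assuming the usual autotools bootstrap for a fresh git checkout):

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experiemental
$ cd erofs-utils
$ ./autogen.sh        # assumption: needed to generate configure from a git checkout
$ ./configure --enable-multithreading
$ make
$ mkfs/mkfs.erofs --version   # the freshly built binary lives under mkfs/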

If you're interested, you could try it out too. There are still some tricks to reduce the time even further, even without multi-threaded fragments and dedupe; I will address that this weekend.

@hsiangkao

@Conan-Kudo, how do I check whether the produced data is correct?
Should tmpoutput-build/build/image-root be the same as the data inside LiveOS/squashfs.img?
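(For instance, one way to check might be a loop mount plus a recursive diff; a sketch with placeholder paths based on the kiwi logs above. Note that kiwi keeps the squashfs.img filename even for an EROFS image:)

$ sudo mount -t erofs -o loop LiveOS/squashfs.img mnt/
$ sudo diff -r mnt/ tmpoutput-build/build/image-root/ && echo identical
$ sudo umount mnt/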

@Conan-Kudo

Yes.

@hsiangkao

> Yes.

But I found that one file (usr/lib/sysimage/rpm/rpmdb.sqlite-shm) is different in tmpoutput-build/build/image-root; the other files are the same:

b54a9455555aaa64b81c40eaaee9d805  mnt/usr/lib/sysimage/rpm/rpmdb.sqlite-shm < --- erofs one
b7c14ec6110fa820ca6b65f5aec85911  tmpoutput-build/build/image-root/usr/lib/sysimage/rpm/rpmdb.sqlite-shm

And the timestamp of rpmdb.sqlite-shm is the same as that of Fedora.x86_64-Rawhide.iso (Dec 21 07:05), which seems impossible unless it was updated:

-rw-r--r--. 1 root root 120836096 Dec 21 06:44 rpmdb.sqlite
-rw-r--r--. 1 root root     32768 Dec 21 07:05 rpmdb.sqlite-shm
-rw-r--r--. 1 root root         0 Dec 21 06:44 rpmdb.sqlite-wal

Could this file have changed?

@Conan-Kudo

The -shm and -wal files are cache files that change after each access of the rpmdb. Since in-tree rpm commands are used after the erofs image is created, those files wind up changing.

@Conan-Kudo

To put it simply, it's nothing to worry about. 😄

@hsiangkao

> ...
>
> The mkfs version can be checked out as: git clone git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experiemental (see https://lore.kernel.org/r/[email protected]); it needs ./configure --enable-multithreading.
>
> If you're interested, you could try it out too. There are still some tricks to reduce the time even further, even without multi-threaded fragments and dedupe; I will address that this weekend.

Hi @Conan-Kudo! With the latest experimental branch (git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git -b experiemental) and -Eall-fragments,^fragdedupe -C524288 -z lzma,level=6,dictsize=524288, I can produce the following results. Does this already meet the kiwi build requirements?

[ DEBUG   ]: 17:18:24 | Looking for mkfs.erofs in /root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
[ DEBUG   ]: 17:18:24 | EXEC: [mkfs.erofs -Eall-fragments,^fragdedupe -C524288 -z lzma,level=6,dictsize=524288 /var/tmp/kiwi_e8e5e0q7 /root/fedora-kiwi-descriptions/tmpoutput-build/build/image-root/]
[ DEBUG   ]: 17:20:41 | Creating directory /root/fedora-kiwi-descriptions/tmpoutput-build/live-media.dqgml94k/LiveOS
[ INFO    ]: 17:20:45 | Creating live ISO image

The resulting image size is:

2969956352(2.8G) Dec 23 17:21 Fedora.x86_64-Rawhide.iso

If it can be tested on your side (if possible, many thanks! note that --enable-multithreading should be enabled) and it already looks good to you, I wonder whether I should release a quick-fix erofs-utils version for these use cases anyway.

@Conan-Kudo

I've verified the improvements with the suggested options using HEAD of the experimental branch. Could you please make a release with that so that we can get it into Fedora and RHEL?

@hsiangkao

> I've verified the improvements with the suggested options using HEAD of the experimental branch. Could you please make a release with that so that we can get it into Fedora and RHEL?

OK, as long as it's useful, I will try to do more tests and release this week.
Later, after -Ededupe is multi-threaded, I guess the images may be even smaller than they are now...

@Conan-Kudo

That would be amazing. Making it even smaller would be great.

@hsiangkao

> I've verified the improvements with the suggested options using HEAD of the experimental branch. Could you please make a release with that so that we can get it into Fedora and RHEL?

I released erofs-utils 1.8.4 yesterday; it seems it has already been shipped to Fedora 40 and 41.
Could you take another try with the new option -E^fragdedupe?

I also introduced another option, -Efragdedupe=inode (https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git/commit/?id=06875b3f2182eab24b81083dfde542f778b201cc), which seems to save several MiBs.
I've applied it in the Debian package; if it's really useful, Fedora could apply it too.
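Composed with the flags discussed earlier in this thread, that would look like (a sketch; root.erofs and rootfs/ are placeholders):

$ mkfs.erofs -Eall-fragments,fragdedupe=inode -C524288 -zlzma,level=6,dictsize=524288 root.erofs rootfs/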

@Conan-Kudo

I'm backporting it to the Fedora package too.

@jlebon (Member, Author) commented Jan 7, 2025

@hsiangkao Thanks! Those improvements look great!

Ideally, this change would happen across the board with the other Fedora live ISOs, though I'm not sure whether it constitutes a System-Wide Change (for which the deadline has already passed). @Conan-Kudo were you thinking of submitting something for that? If so, we can be co-owners to own the FCOS part of it. Otherwise, we might end up submitting a Change just scoped to FCOS.

@Conan-Kudo

I'm planning on making a Self-Contained Change for this. Now that I'm satisfied, I can start writing one up.

@dustymabe (Member)

@Conan-Kudo Are you interested in some Co-Owners?

@Conan-Kudo

Sure.

@Conan-Kudo

Here's what I've got so far: https://fedoraproject.org/wiki/Changes/EROFSforLiveMedia

@Conan-Kudo

Someone interested in being part of the FCOS side of this can add themselves as co-owners and add their own relevant bits to the Change document.

@Conan-Kudo

FYI @supakeen to add osbuild relevant stuff to the change

@hsiangkao commented Jan 8, 2025

> @hsiangkao Thanks! Those improvements look great!

Another thing to mention is that currently fsck.erofs --extract doesn't support a fragment cache, so it will extract slowly (I suggest using mount and cp for extraction for now); I will find time to complete it later. But if anyone is interested, help on this would be much appreciated too, since so many other things are in progress on my side.
Also, -C# and -zlzma,dictsize=# impact memory usage; if the smaller sizes work almost the same, it'd be better to use the smaller numbers.
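The mount-and-copy workaround, spelled out (a sketch; root.erofs and the target directory are placeholders):

$ # fsck.erofs --extract is slow on fragment-packed images for now, so mount and copy instead
$ sudo mount -t erofs -o loop root.erofs mnt/
$ sudo cp -a mnt/. extracted/
$ sudo umount mnt/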

@supakeen commented Jan 8, 2025

> FYI @supakeen to add osbuild relevant stuff to the change

Thank you. The relevant people are @dustymabe, @jlebon, and @ravanelli, who are all in this thread.

If CoreOS wants to switch to EROFS, as far as I'm aware we have all the bits and bobs available in osbuild to do so, though perhaps some more options need to be piped through.

Perhaps @bcl is interested too, since they're working on a similar change for RHEL 10: osbuild/images#1117

jlebon added the meeting (topics for meetings) label on Jan 8, 2025
@bcl commented Jan 8, 2025

> If CoreOS wants to switch to EROFS, as far as I'm aware we have all the bits and bobs available in osbuild to do so, though perhaps some more options need to be piped through.

osbuild should be all ready to go with the next release (v138); images still needs osbuild/images@870e45f from osbuild/images#1117, but that could be split out from the RHEL 10 changes (I don't want to switch RHEL 10 until the boot.iso has been built with it for a bit).

@yasminvalim (Contributor)

FYI: During the FCOS community meeting, @jlebon brought up this issue for our awareness, and we discussed the proposal and the benefits of this change. You can read more about it in the meeting logs.

@Conan-Kudo

> > @hsiangkao Thanks! Those improvements look great!
>
> Another thing to mention is that currently fsck.erofs --extract doesn't support a fragment cache, so it will extract slowly (I suggest using mount and cp for extraction for now); I will find time to complete it later. But if anyone is interested, help on this would be much appreciated too, since so many other things are in progress on my side. Also, -C# and -zlzma,dictsize=# impact memory usage; if the smaller sizes work almost the same, it'd be better to use the smaller numbers.

I can hold off on adjusting Calamares to use the erofs extract method until after you've implemented things to speed it up.

@jlebon (Member, Author) commented Jan 8, 2025

Was playing around with this using our FCOS live rootfs content. Some results:

| Command | mksquashfs | mkfs.erofs | mkfs.erofs -Eall-fragments,fragdedupe=inode -C 1048576 |
|---|---|---|---|
| Time | 22s | 1m30s | 1m6s |
| Size | 939M | 1.1G | 977M |
| Extraction time | 3.5s | 14s | 6m |

So with the new options, the sizes are definitely comparable. It's still about 3x slower than mksquashfs (1m6s vs. 22s), but in absolute terms 1m is not at all something I'm worried about for our use case.

Some additional notes:

  • The mkfs.erofs commands above used -zlzma,level=6. I tried to use -zzstd,level=15 to make it equivalent to what we were using with squashfs for comparison, but it seems like zstd support is experimental currently (mkfs.erofs outputs a warning) and it ends up taking longer and being larger anyway.
  • As mentioned above by @hsiangkao, extracting the EROFS image using the new options is incredibly slow as you can see in the table. This mostly doesn't matter, except that...
  • ... unlike unsquashfs, there is currently no way to extract just a single file from the image. Some users of our root squashfs currently need this in an unprivileged context where they can't mount. They could extract the whole image for now, but that's a lot more expensive if we want to use -Eall-fragments (and even if that becomes faster, they'd still need to pay the storage cost); see the sketch after this list.
  • Looks like there's a bug in mkfs.erofs. It hard crashes instead of cleanly erroring out if there's a file for which it doesn't have permissions to read (e.g. /etc/gshadow has mode 000).
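For the unprivileged single-file case, one possible stopgap (a sketch, not something discussed above): erofs-utils can also build a FUSE frontend, erofsfuse, which mounts an image without root privileges, assuming FUSE is available:

$ # assumes erofs-utils was configured with --enable-fuse
$ erofsfuse root.erofs mnt/
$ cp mnt/etc/os-release .    # copy out just the one file
$ fusermount -u mnt/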

jlebon removed the meeting (topics for meetings) label on Jan 8, 2025
@hsiangkao

> Was playing around with this using our FCOS live rootfs content. Some results:
>
> […]
>
> So with the new options, the sizes are definitely comparable. It's still about 3x slower than mksquashfs (1m6s vs. 22s), but in absolute terms 1m is not at all something I'm worried about for our use case.

Could you share a way for me to reproduce that too? I will look into it.

> • Looks like there's a bug in mkfs.erofs. It hard crashes instead of cleanly erroring out if there's a file for which it doesn't have permissions to read (e.g. /etc/gshadow has mode 000).

Will check, thanks for reporting.

@jlebon (Member, Author) commented Jan 10, 2025

> > Was playing around with this using our FCOS live rootfs content. Some results:
> >
> > […]
> >
> > So with the new options, the sizes are definitely comparable. It's still about 3x slower than mksquashfs (1m6s vs. 22s), but in absolute terms 1m is not at all something I'm worried about for our use case.
>
> Could you share a way for me to reproduce that too? I will look into it.

You can download the live rootfs from https://fedoraproject.org/coreos/download?stream=stable#download_section, unpack it using cpio -id, and then you'll find the root squashfs in there. I unsquashfs'ed it, and then played with that rootfs tree.
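Roughly, those steps spelled out (a sketch; the artifact filename varies per release, and root.squashfs is an assumed member name inside the CPIO):

$ # first download fedora-coreos-<version>-live-rootfs.x86_64.img from the page above
$ cpio -idv < fedora-coreos-<version>-live-rootfs.x86_64.img
$ unsquashfs root.squashfs    # unpacks into squashfs-root/
$ mkfs.erofs -Eall-fragments,fragdedupe=inode -C1048576 -zlzma,level=6 root.erofs squashfs-root/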

@hsiangkao commented Jan 21, 2025

> > > Was playing around with this using our FCOS live rootfs content. Some results: […]
> >
> > Could you share a way for me to reproduce that too? I will look into it.
>
> You can download the live rootfs from https://fedoraproject.org/coreos/download?stream=stable#download_section, unpack it using cpio -id, and then you'll find the root squashfs in there. I unsquashfs'ed it, and then played with that rootfs tree.

Sigh... I've fixed an issue that caused unexpectedly larger image sizes, and committed it to the experiemental branch for testing.

| Command | mksquashfs -comp zstd | mkfs.erofs -zzstd,15 | mksquashfs -comp xz | mkfs.erofs -zlzma,6 |
|---|---|---|---|---|
| Size (bytes) | 732811264 | 747646976 | 701059072 | 687501312 |

mksquashfs command line: -comp xz -b 131072 -noappend or -comp xz -b 131072 -Xcompression-level 15 -noappend
mkfs.erofs command line: -C131072 -Eall-fragments,fragdedupe=inode

Anyway, just an update.
