Note: after upgrading from ZFS 0.8.4 to ZFS 2.1.4 (plus kernel 5.15.0-52-generic), the ZFS performance issues noted below have gone away. There appears to be almost no performance penalty from using ZFS+mergerfs.
These are my notes about the performance of this setup (and some experiments with autotier, a mergerfs competitor).
Performance is measured with this fio command. It tests sequential writes with a 1MB block size, imitating backup writes or large file copies (HD TV or movies).
fio --name=fiotest --filename=/mnt/samsung/zfscache/file123 --size=16Gb --rw=write --bs=1M --direct=1 --numjobs=8 --ioengine=libaio --iodepth=8 --group_reporting --runtime=60 --startdelay=60
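To make it easier to see what each flag contributes, the same invocation can be assembled from named parameters. This is only a sketch in Python: `FIO_PARAMS` and `build_fio_argv` are names made up for these notes, the flag comments are my reading of the fio options, and nothing here actually runs fio.

```python
# Hypothetical helper: assemble the fio command line used in these notes.
FIO_PARAMS = {
    "name": "fiotest",
    "filename": "/mnt/samsung/zfscache/file123",  # test file on the filesystem under test
    "size": "16Gb",        # as written in the notes; fio lays out a 16384MiB file
    "rw": "write",         # sequential writes
    "bs": "1M",            # block size of each write
    "direct": "1",         # request non-buffered I/O
    "numjobs": "8",        # 8 worker processes
    "ioengine": "libaio",  # Linux native asynchronous I/O
    "iodepth": "8",        # up to 8 I/Os in flight per job
    "runtime": "60",       # stop after 60 seconds
    "startdelay": "60",    # wait 60 seconds before starting
}


def build_fio_argv(params, flags=("group_reporting",)):
    """Return the argv list for fio; value-less flags are appended at the end."""
    argv = ["fio"]
    argv += [f"--{key}={value}" for key, value in params.items()]
    argv += [f"--{flag}" for flag in flags]
    return argv


argv = build_fio_argv(FIO_PARAMS)
max_in_flight = int(FIO_PARAMS["numjobs"]) * int(FIO_PARAMS["iodepth"])
print(" ".join(argv))
print(f"up to {max_in_flight} writes of {FIO_PARAMS['bs']} in flight at once")
```

Only the `--filename` value changes between the tests in these notes, so swapping that one entry reproduces the other runs.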
root@nas:/home/gfm# lsscsi
[0:0:0:0] disk ATA WDC WD180EDGZ-11 0A85 /dev/sdc
[0:0:1:0] disk ATA WDC WD180EDGZ-11 0A85 /dev/sdd
[0:0:2:0] disk ATA WDC WD80EFZX-68U 0A83 /dev/sde
[1:0:0:0] disk QEMU QEMU HARDDISK 2.5+ /dev/sda
[1:0:0:1] disk QEMU QEMU HARDDISK 2.5+ /dev/sdb
[3:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.5+ /dev/sr0
[N:0:1:1] disk Samsung SSD 950 PRO 256GB__1 /dev/nvme0n1
root@nas:/home/gfm# df -h
Filesystem Size Used Avail Use% Mounted on
udev 1.9G 0 1.9G 0% /dev
tmpfs 390M 2.8M 387M 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 60G 9.8G 48G 18% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
mergerfs 231G 1.0M 231G 1% /mnt/cached
mergerfs 24T 121G 24T 1% /mnt/slow-storage
/dev/sda2 2.0G 107M 1.7G 6% /boot
/dev/sda1 1.1G 5.3M 1.1G 1% /boot/efi
/dev/sdc 17T 51G 17T 1% /mnt/disk2
/dev/sdc 17T 51G 17T 1% /mnt/snapraid-content/disk2
cache 231G 1.0M 231G 1% /cache
/dev/sde 7.3T 71G 7.3T 1% /mnt/snapraid-content/disk1
/dev/sde 7.3T 71G 7.3T 1% /mnt/disk1
/dev/loop1 64M 64M 0 100% /snap/core20/1634
/dev/loop3 47M 47M 0 100% /snap/snapd/16292
/dev/loop2 48M 48M 0 100% /snap/snapd/17336
/dev/loop0 64M 64M 0 100% /snap/core20/1623
/dev/loop4 68M 68M 0 100% /snap/lxd/22753
/dev/sdd1 17T 117G 17T 1% /mnt/parity1
tmpfs 390M 0 390M 0% /run/user/1000
root@nas:/home/gfm# zpool status
pool: cache
state: ONLINE
scan: resilvered 12.0M in 0 days 00:00:00 with 0 errors on Thu Nov 3 23:34:42 2022
config:
NAME STATE READ WRITE CKSUM
cache ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
nvme0n1 ONLINE 0 0 0
errors: No known data errors
Update: 11/05/22 - After the upgrade from ZFS 0.8.4 to ZFS 2.1.4, the performance issues below no longer exist.
NOTE: ZFS uses memory caching (the ARC) and other mechanisms that may 'inflate' results.
I have tested this same nvme0n1 disk on its own, and max writes are around 900MB/s (if you search for benchmarks of Samsung 950 PRO 256GB drives like mine, you will find the same results).
Jobs: 8 (f=8): [W(8)][100.0%][w=344MiB/s][w=344 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=17709: Thu Nov 3 23:58:41 2022
write: IOPS=1240, BW=1240MiB/s (1300MB/s)(72.7GiB/60017msec); 0 zone resets
slat (usec): min=82, max=103024, avg=6447.15, stdev=11814.99
clat (usec): min=2, max=548443, avg=45148.97, stdev=80640.79
lat (usec): min=1015, max=617530, avg=51596.52, stdev=91759.09
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 7], 10.00th=[ 8], 20.00th=[ 8],
| 30.00th=[ 8], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 9],
| 70.00th=[ 16], 80.00th=[ 64], 90.00th=[ 159], 95.00th=[ 234],
| 99.00th=[ 380], 99.50th=[ 418], 99.90th=[ 472], 99.95th=[ 498],
| 99.99th=[ 542]
bw ( MiB/s): min= 111, max= 6468, per=99.93%, avg=1239.24, stdev=193.05, samples=960
iops : min= 107, max= 6468, avg=1238.93, stdev=193.07, samples=960
lat (usec) : 4=0.01%, 10=0.01%, 1000=0.01%
lat (msec) : 2=0.24%, 4=1.84%, 10=63.16%, 20=6.46%, 50=6.25%
lat (msec) : 100=6.65%, 250=11.16%, 500=4.17%, 750=0.05%
cpu : usr=0.70%, sys=2.84%, ctx=671054, majf=0, minf=92
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,74429,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=1240MiB/s (1300MB/s), 1240MiB/s-1240MiB/s (1300MB/s-1300MB/s), io=72.7GiB (78.0GB), run=60017-60017msec
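fio prints each bandwidth twice, once in MiB/s and once in MB/s, which is easy to misread when comparing against the ~900MB/s figure for the raw drive. A minimal sketch of the conversion, plus a cross-check of the average against the total data written (the function names are my own, not part of fio):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def mib_s_to_mb_s(mib_per_s):
    """Convert MiB/s (2**20 bytes) to MB/s (10**6 bytes)."""
    return mib_per_s * MIB / 1e6


def average_mib_s(total_gib, runtime_ms):
    """Average bandwidth in MiB/s from total data written and run time."""
    return total_gib * GIB / MIB / (runtime_ms / 1000)


# Values copied from the run above: io=72.7GiB, run=60017msec, bw=1240MiB/s.
print(round(average_mib_s(72.7, 60017)))  # close to the reported 1240 MiB/s
print(round(mib_s_to_mb_s(1240)))         # 1300, matching fio's "(1300MB/s)"
```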
# dpkg -l | grep zfs
ii libzfs4linux 2.1.4-0ubuntu0.1 amd64 OpenZFS filesystem library for Linux - general support
ii zfs-zed 2.1.4-0ubuntu0.1 amd64 OpenZFS Event Daemon
ii zfsutils-linux 2.1.4-0ubuntu0.1 amd64 command-line tools to manage OpenZFS filesystems
# fio --name=fiotest --filename=/mnt/cached/speed --size=16Gb --rw=write --bs=1M --direct=1 --numjobs=8 --ioengine=libaio --iodepth=8 --group_reporting --runtime=60 --startdelay=60
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 8 (f=0): [f(8)][100.0%][w=707MiB/s][w=707 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=93346: Sat Nov 5 01:56:46 2022
write: IOPS=944, BW=944MiB/s (990MB/s)(55.3GiB/60039msec); 0 zone resets
slat (usec): min=12, max=81965, avg=8457.02, stdev=8863.16
clat (usec): min=177, max=349728, avg=59322.78, stdev=55613.28
lat (usec): min=198, max=392259, avg=67780.92, stdev=63148.08
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 9], 10.00th=[ 14], 20.00th=[ 18],
| 30.00th=[ 23], 40.00th=[ 31], 50.00th=[ 40], 60.00th=[ 51],
| 70.00th=[ 65], 80.00th=[ 97], 90.00th=[ 142], 95.00th=[ 184],
| 99.00th=[ 245], 99.50th=[ 266], 99.90th=[ 305], 99.95th=[ 326],
| 99.99th=[ 338]
bw ( KiB/s): min=198656, max=5643635, per=99.82%, avg=964905.73, stdev=88839.34, samples=952
iops : min= 194, max= 5511, avg=942.20, stdev=86.75, samples=952
lat (usec) : 250=0.01%, 500=0.15%, 750=0.20%, 1000=0.18%
lat (msec) : 2=0.39%, 4=0.67%, 10=5.83%, 20=18.35%, 50=34.41%
lat (msec) : 100=20.43%, 250=18.56%, 500=0.81%
cpu : usr=0.97%, sys=0.64%, ctx=106614, majf=0, minf=107
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,56678,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=944MiB/s (990MB/s), 944MiB/s-944MiB/s (990MB/s-990MB/s), io=55.3GiB (59.4GB), run=60039-60039msec
Performance: 990MB/s on the RAID1 ZFS mirror of dual NVMe, measured through mergerfs. Goal achieved.
These HDDs provide about 170MB/s max write speeds. The older 8TB drive may give 140MB/s.
The mergerfs /etc/fstab entry for this mount looks like:
/mnt/disk* /mnt/slow-storage fuse.mergerfs defaults,nonempty,allow_other,use_ino,category.create=eplus,cache.files=off,moveonenospc=true,dropcacheonclose=true,minfreespace=300G,fsname=mergerfs 0 0
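The option string is long, so here is a small sketch that splits the entry into its six fstab fields and lists the mergerfs options one per line. `parse_fstab_entry` is a helper invented for these notes; it only does string splitting and knows nothing about mergerfs itself.

```python
# The fstab entry from the notes, unchanged.
FSTAB_LINE = (
    "/mnt/disk* /mnt/slow-storage fuse.mergerfs "
    "defaults,nonempty,allow_other,use_ino,category.create=eplus,cache.files=off,"
    "moveonenospc=true,dropcacheonclose=true,minfreespace=300G,fsname=mergerfs 0 0"
)


def parse_fstab_entry(line):
    """Split one fstab line into its fields and turn the options into a dict."""
    source, mountpoint, fstype, options, dump, passno = line.split()
    parsed = {}
    for option in options.split(","):
        key, _, value = option.partition("=")
        parsed[key] = value if value else True  # bare flags become True
    return {
        "source": source,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": parsed,
    }


entry = parse_fstab_entry(FSTAB_LINE)
print(entry["source"], "->", entry["mountpoint"])
for key, value in entry["options"].items():
    print(f"  {key:20} {value}")
```

Laid out this way, it is easier to spot the settings that differ from the cache mount, such as `category.create` (the policy mergerfs uses to pick a branch for new files) and `minfreespace`.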
results
fiotest: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=8
...
fio-3.16
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 8 (f=3): [f(8)][100.0%][w=241MiB/s][w=241 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=23616: Fri Nov 4 00:03:59 2022
write: IOPS=184, BW=185MiB/s (194MB/s)(10.8GiB/60076msec); 0 zone resets
slat (usec): min=19, max=728722, avg=43252.77, stdev=36616.24
clat (msec): min=23, max=1335, avg=302.65, stdev=93.08
lat (msec): min=23, max=1376, avg=345.90, stdev=99.29
clat percentiles (msec):
| 1.00th=[ 165], 5.00th=[ 205], 10.00th=[ 224], 20.00th=[ 245],
| 30.00th=[ 262], 40.00th=[ 275], 50.00th=[ 288], 60.00th=[ 305],
| 70.00th=[ 321], 80.00th=[ 347], 90.00th=[ 397], 95.00th=[ 435],
| 99.00th=[ 542], 99.50th=[ 835], 99.90th=[ 1284], 99.95th=[ 1318],
| 99.99th=[ 1334]
bw ( KiB/s): min=51200, max=296960, per=100.00%, avg=189680.43, stdev=4517.78, samples=952
iops : min= 50, max= 290, avg=184.69, stdev= 4.43, samples=952
lat (msec) : 50=0.17%, 100=0.18%, 250=22.55%, 500=75.15%, 750=1.44%
lat (msec) : 1000=0.17%, 2000=0.33%
cpu : usr=0.12%, sys=0.14%, ctx=22235, majf=0, minf=92
IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.5%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,11087,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=10.8GiB (11.6GB), run=60076-60076msec
TL;DR performance penalty on mergerfs
WRITE: bw=374MiB/s (392MB/s), 374MiB/s-374MiB/s (392MB/s-392MB/s), io=21.9GiB (23.6GB),
vs. without mergerfs (pure zpool)
WRITE: bw=1240MiB/s (1300MB/s), 1240MiB/s-1240MiB/s (1300MB/s-1300MB/s), io=72.7GiB (78.0GB), run=60017-60017msec
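For reference, the size of that penalty can be worked out from the two summary lines. A minimal sketch (`penalty_percent` is my own helper name; keep in mind that the pure zpool figure may be inflated by ZFS caching, as noted earlier):

```python
def penalty_percent(baseline, measured):
    """Throughput lost relative to the baseline, as a percentage."""
    return (1 - measured / baseline) * 100


# MiB/s values from the two WRITE summary lines above (ZFS 0.8.4).
pure_zpool = 1240
through_mergerfs = 374
print(f"{penalty_percent(pure_zpool, through_mergerfs):.1f}% slower through mergerfs")
```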
The mergerfs /etc/fstab entry is:
/cache /mnt/cached fuse.mergerfs nonempty,allow_other,use_ino,cache.files=off,category.create=lfs,moveonenospc=true,dropcacheonclose=true,minfreespace=4G,fsname=mergerfs 0 0
results
Jobs: 8 (f=3): [f(2),W(2),f(1),W(1),f(2)][25.7%][w=557MiB/s][w=556 IOPS][eta 05m:49s]
fiotest: (groupid=0, jobs=8): err= 0: pid=25212: Fri Nov 4 00:08:15 2022
write: IOPS=373, BW=374MiB/s (392MB/s)(21.9GiB/60084msec); 0 zone resets
slat (usec): min=10, max=1081.4k, avg=21349.93, stdev=56883.73
clat (usec): min=4, max=2673.2k, avg=149732.07, stdev=290258.09
lat (usec): min=287, max=2795.8k, avg=171082.91, stdev=321499.99
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 11], 10.00th=[ 16], 20.00th=[ 24],
| 30.00th=[ 33], 40.00th=[ 43], 50.00th=[ 54], 60.00th=[ 69],
| 70.00th=[ 100], 80.00th=[ 176], 90.00th=[ 347], 95.00th=[ 642],
| 99.00th=[ 1636], 99.50th=[ 2005], 99.90th=[ 2400], 99.95th=[ 2500],
| 99.99th=[ 2668]
bw ( KiB/s): min=16374, max=2783526, per=100.00%, avg=392628.80, stdev=58773.48, samples=931
iops : min= 14, max= 2717, avg=382.85, stdev=57.39, samples=931
lat (usec) : 10=0.01%, 500=0.35%, 750=0.11%, 1000=0.10%
lat (msec) : 2=0.36%, 4=0.80%, 10=2.36%, 20=11.42%, 50=31.95%
lat (msec) : 100=22.79%, 250=15.45%, 500=7.80%, 750=2.37%, 1000=1.41%
lat (msec) : 2000=2.22%, >=2000=0.49%
cpu : usr=0.27%, sys=0.23%, ctx=30547, majf=0, minf=95
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,22467,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=374MiB/s (392MB/s), 374MiB/s-374MiB/s (392MB/s-392MB/s), io=21.9GiB (23.6GB), run=60084-60084msec
zpool destroy cache
mkfs.btrfs -f -L cachebtrfs /dev/nvme0n1
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
Label: cachebtrfs
UUID: 53afb172-2ac8-43be-98e0-d749217bf129
Node size: 16384
Sector size: 4096
Filesystem size: 238.47GiB
Block group profiles:
Data: single 8.00MiB
Metadata: single 8.00MiB
System: single 4.00MiB
SSD detected: yes
Incompat features: extref, skinny-metadata
Checksum: crc32c
Number of devices: 1
Devices:
ID SIZE PATH
1 238.47GiB /dev/nvme0n1
root@nas:/home/gfm# mkdir /cache
root@nas:/home/gfm# mount /dev/nvme0n1 /cache
/dev/nvme0n1 on /cache type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
Let's test.
root@nas:/home/gfm# mkfs.btrfs -f -L cached-mirror -m raid1 -d raid1 /dev/nvme0n1 /dev/sdb
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Label: cached-mirror
UUID: 0c4241e9-e4ea-41b6-9dab-a3cc4b936edb
Node size: 16384
Sector size: 4096
Filesystem size: 476.96GiB
Block group profiles:
Data: RAID1 1.00GiB
Metadata: RAID1 1.00GiB
System: RAID1 8.00MiB
SSD detected: yes
Incompat features: extref, skinny-metadata
Checksum: crc32c
Number of devices: 2
Devices:
ID SIZE PATH
1 238.47GiB /dev/nvme0n1
2 238.49GiB /dev/sdb
root@nas:/home/gfm# mount /dev/nvme0n1 /cache/
fio-3.16
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 7 (f=7): [W(1),_(1),W(6)][75.7%][eta 00m:44s]
fiotest: (groupid=0, jobs=8): err= 0: pid=14619: Fri Nov 4 01:48:09 2022
write: IOPS=438, BW=438MiB/s (459MB/s)(32.5GiB/76061msec); 0 zone resets
slat (usec): min=28, max=44590k, avg=7783.12, stdev=487469.23
clat (usec): min=329, max=44735k, avg=134311.57, stdev=1290330.14
lat (usec): min=532, max=44735k, avg=142095.86, stdev=1378714.43
clat percentiles (usec):
| 1.00th=[ 644], 5.00th=[ 13304], 10.00th=[ 17957],
| 20.00th=[ 29492], 30.00th=[ 43254], 40.00th=[ 53740],
| 50.00th=[ 67634], 60.00th=[ 83362], 70.00th=[ 101188],
| 80.00th=[ 122160], 90.00th=[ 181404], 95.00th=[ 231736],
| 99.00th=[ 383779], 99.50th=[ 463471], 99.90th=[17112761],
| 99.95th=[17112761], 99.99th=[17112761]
bw ( KiB/s): min=163819, max=2094826, per=100.00%, avg=739932.10, stdev=58373.76, samples=727
iops : min= 159, max= 2045, avg=721.74, stdev=56.99, samples=727
lat (usec) : 500=0.05%, 750=1.04%, 1000=0.50%
lat (msec) : 2=0.62%, 4=0.34%, 10=0.54%, 20=9.06%, 50=24.23%
lat (msec) : 100=33.28%, 250=26.57%, 500=3.46%, 750=0.17%, 1000=0.01%
lat (msec) : >=2000=0.15%
cpu : usr=0.43%, sys=0.45%, ctx=46384, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,33328,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=438MiB/s (459MB/s), 438MiB/s-438MiB/s (459MB/s-459MB/s), io=32.5GiB (34.9GB), run=76061-76061msec
BTRFS RAID1 performance is really poor: 459MB/s, about half of the ~946MB/s the single drive manages. This isn't even with mergerfs enabled.
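Now with mergerfs on top. The result below is 311MB/s, roughly a third lower than the 459MB/s without mergerfs, which is a bigger drop than the ~15% penalty from the earlier mergerfs test on btrfs. Outcome: RAID1 on btrfs is probably not a good idea; it loses about 50% of raw performance before mergerfs even comes into play.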
Run status group 0 (all jobs):
WRITE: bw=296MiB/s (311MB/s), 296MiB/s-296MiB/s (311MB/s-311MB/s), io=17.4GiB (18.7GB), run=60172-60172msec
root@nas:/home/gfm# sgdisk -Z /dev/nvme0n1
Creating new GPT entries in memory.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
root@nas:/home/gfm# sgdisk -Z /dev/sdb
Creating new GPT entries in memory.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
root@nas:/home/gfm# mdadm --create /dev/md/cache /dev/nvme0n1 /dev/sdb --level=1 --raid-devices=2
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md/cache started.
root@nas:/home/gfm# mdadm --detail /dev/md/cache
/dev/md/cache:
Version : 1.2
Creation Time : Fri Nov 4 02:02:11 2022
Raid Level : raid1
Array Size : 249926976 (238.35 GiB 255.93 GB)
Used Dev Size : 249926976 (238.35 GiB 255.93 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Nov 4 02:02:38 2022
State : clean, resyncing
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Resync Status : 2% complete
Name : nas:cache (local to host nas)
UUID : dda209ab:ace57985:25895a5b:f3d95068
Events : 4
Number Major Minor RaidDevice State
0 259 0 0 active sync /dev/nvme0n1
1 8 16 1 active sync /dev/sdb
root@nas:/home/gfm# mkfs.ext4 /dev/md/cache
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 62481744 4k blocks and 15622144 inodes
Filesystem UUID: 9bda5776-f50e-40fa-a826-8b2424de3f07
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
root@nas:/home/gfm# mount /dev/md/cache /cache/
After waiting for the resync to complete, here is the I/O test. Maybe ZFS is simply better at RAID1, without the performance impact.
Run status group 0 (all jobs):
WRITE: bw=478MiB/s (502MB/s), 478MiB/s-478MiB/s (502MB/s-502MB/s), io=28.1GiB (30.1GB), run=60069-60069msec
Disk stats (read/write):
md127: ios=0/228818, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/230087, aggrmerge=0/27, aggrticks=0/6898163, aggrin_queue=6647102, aggrutil=94.28%
nvme0n1: ios=0/230087, merge=0/27, ticks=0/13717574, in_queue=13258980, util=54.05%
sdb: ios=0/230087, merge=0/27, ticks=0/78753, in_queue=35224, util=94.28%
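The per-device `util` figures hint at which mirror member is pacing the array. A rough sketch that pulls them out of the disk stats lines (the regular expression is written for the format shown above only, and `util` is just a coarse indicator, not proof of a bottleneck):

```python
import re

# The two member-device lines from the fio disk stats above.
DISK_STATS = """\
nvme0n1: ios=0/230087, merge=0/27, ticks=0/13717574, in_queue=13258980, util=54.05%
sdb: ios=0/230087, merge=0/27, ticks=0/78753, in_queue=35224, util=94.28%
"""

LINE_RE = re.compile(r"^\s*(?P<dev>\S+): .*util=(?P<util>[\d.]+)%", re.MULTILINE)


def utilisation(text):
    """Map each device name to its reported utilisation percentage."""
    return {m["dev"]: float(m["util"]) for m in LINE_RE.finditer(text)}


util = utilisation(DISK_STATS)
busiest = max(util, key=util.get)
print(util, "busiest:", busiest)
```

Both members received the same number of writes, as expected for RAID1, but sdb was busy about 94% of the time against about 54% for nvme0n1.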
https://raid.wiki.kernel.org/index.php/Write-intent_bitmap
https://louwrentius.com/the-impact-of-the-mdadm-bitmap-on-raid-performance.html
The write-intent bitmap may be hurting write performance. Let's disable it:
mdadm /dev/md127 --grow --bitmap=none
mdadm --detail /dev/md/cache
# mount
/dev/md127 on /cache type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)
# fio results
Run status group 0 (all jobs):
WRITE: bw=540MiB/s (567MB/s), 540MiB/s-540MiB/s (567MB/s-567MB/s), io=31.7GiB (34.0GB), run=60032-60032msec
Interestingly, performance starts at peak speed. Then CPU utilization jumps to 100% and performance drops.
WRITE: bw=568MiB/s (596MB/s), 568MiB/s-568MiB/s (596MB/s-596MB/s), io=33.3GiB (35.8GB), run=60034-60034msec
Tried re-enabling the bitmap with a larger chunk size, but it didn't help performance.
mdadm --grow --bitmap=internal --bitmap-chunk=131072 /dev/md127
Run status group 0 (all jobs):
WRITE: bw=329MiB/s (345MB/s), 329MiB/s-329MiB/s (345MB/s-345MB/s), io=19.4GiB (20.8GB), run=60263-60263msec
Kill mdadm array
mdadm -S /dev/md127
mdadm --zero-superblock /dev/sdb /dev/nvme0n1
As expected, ~900 MB/s writes. Matches observations in unraid trial for the same hardware.
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 4 (f=0): [_(2),f(3),_(1),f(1),_(1)][100.0%][w=894MiB/s][w=893 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=53864: Fri Nov 4 00:44:23 2022
write: IOPS=901, BW=902MiB/s (946MB/s)(52.9GiB/60059msec); 0 zone resets
slat (usec): min=434, max=202119, avg=1705.52, stdev=5436.46
clat (msec): min=3, max=263, avg=69.19, stdev=34.71
lat (msec): min=3, max=277, avg=70.90, stdev=35.30
clat percentiles (msec):
| 1.00th=[ 14], 5.00th=[ 20], 10.00th=[ 24], 20.00th=[ 32],
| 30.00th=[ 50], 40.00th=[ 61], 50.00th=[ 70], 60.00th=[ 78],
| 70.00th=[ 87], 80.00th=[ 102], 90.00th=[ 111], 95.00th=[ 126],
| 99.00th=[ 161], 99.50th=[ 174], 99.90th=[ 207], 99.95th=[ 222],
| 99.99th=[ 243]
bw ( KiB/s): min=442249, max=2527361, per=99.97%, avg=923157.66, stdev=47906.96, samples=960
iops : min= 431, max= 2467, avg=901.15, stdev=46.79, samples=960
lat (msec) : 4=0.01%, 10=0.04%, 20=6.16%, 50=23.95%, 100=49.08%
lat (msec) : 250=20.76%, 500=0.01%
cpu : usr=0.30%, sys=5.05%, ctx=59733, majf=0, minf=88
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,54162,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=902MiB/s (946MB/s), 902MiB/s-902MiB/s (946MB/s-946MB/s), io=52.9GiB (56.8GB), run=60059-60059msec
TL;DR Surprising results without ZFS: the mergerfs performance penalty is only ~15%!
root@nas:/mnt# mount /mnt/cached/
root@nas:/mnt# df -h /mnt/cached/
Filesystem Size Used Avail Use% Mounted on
mergerfs 239G 17G 222G 7% /mnt/cached
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 3 (f=3): [_(3),f(1),_(2),f(2)][100.0%][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=55377: Fri Nov 4 00:48:28 2022
write: IOPS=770, BW=771MiB/s (808MB/s)(45.2GiB/60022msec); 0 zone resets
slat (usec): min=16, max=80166, avg=10360.79, stdev=5295.21
clat (msec): min=2, max=203, avg=72.59, stdev=13.58
lat (msec): min=2, max=219, avg=82.95, stdev=14.65
clat percentiles (msec):
| 1.00th=[ 40], 5.00th=[ 61], 10.00th=[ 63], 20.00th=[ 69],
| 30.00th=[ 70], 40.00th=[ 70], 50.00th=[ 71], 60.00th=[ 72],
| 70.00th=[ 73], 80.00th=[ 74], 90.00th=[ 83], 95.00th=[ 96],
| 99.00th=[ 132], 99.50th=[ 144], 99.90th=[ 165], 99.95th=[ 171],
| 99.99th=[ 190]
bw ( KiB/s): min=571253, max=913408, per=99.87%, avg=788216.07, stdev=7550.71, samples=960
iops : min= 557, max= 892, avg=769.35, stdev= 7.39, samples=960
lat (msec) : 4=0.01%, 10=0.09%, 20=0.11%, 50=1.72%, 100=93.88%
lat (msec) : 250=4.20%
cpu : usr=0.28%, sys=1.29%, ctx=89094, majf=0, minf=89
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,46262,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=771MiB/s (808MB/s), 771MiB/s-771MiB/s (808MB/s-808MB/s), io=45.2GiB (48.5GB), run=60022-60022msec
TL;DR Worst performance of all (50% of mergerfs performance). Unmaintained project by 45Drives.
Unrelated to mergerfs and just for fun. https://github.com/45Drives/autotier appears to be an abandoned project (I base this on the lack of response from the owner on open issues and the lack of updates since 2021). Still, it is another FUSE solution that seems to natively integrate the "move files between storage tiers for me" idea.
Let's kick the tires on it with my setup. I expect poor performance here: 45Drives/autotier#38
The filesystem is mounted manually with these options:
autotierfs /mnt/autotier -o allow_other,default_permissions
Its configuration:
# cat /etc/autotier.conf
# autotier config
[Global] # global settings
Log Level = 1 # 0 = none, 1 = normal, 2 = debug
Tier Period = 1000 # number of seconds between file move batches
Copy Buffer Size = 1 MiB # size of buffer for moving files between tiers
[Tier 1] # tier name (can be anything)
Path = /cache # full path to tier storage pool
Quota = 20 % # absolute or % usage to keep tier under
# Quota format: x ( % | [K..T][i]B )
# Example: Quota = 5.3 TiB
[Tier 2]
Path = /mnt/slow-storage
Quota = 100 %
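To get a feel for what these quotas mean in absolute terms, here is a sketch that reads the settings with Python's `configparser` and applies the percentages to the tier sizes from the `df -h` output earlier. It assumes the percentage is taken against the total size that `df` reports for each tier; autotier itself may compute it differently. `quota_fraction` and `TIER_SIZE_GIB` are names invented for this example.

```python
import configparser

# Settings copied from /etc/autotier.conf above, with the comments removed.
AUTOTIER_CONF = """\
[Global]
Log Level = 1
Tier Period = 1000
Copy Buffer Size = 1 MiB
[Tier 1]
Path = /cache
Quota = 20 %
[Tier 2]
Path = /mnt/slow-storage
Quota = 100 %
"""

# Sizes from the df -h output earlier in these notes (231G and 24T).
TIER_SIZE_GIB = {"/cache": 231, "/mnt/slow-storage": 24 * 1024}


def quota_fraction(text):
    """Turn a percentage quota such as '20 %' into 0.2."""
    value = text.strip()
    if not value.endswith("%"):
        raise ValueError("only percentage quotas are handled in this sketch")
    return float(value[:-1]) / 100


# interpolation=None stops configparser from treating '%' as special.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(AUTOTIER_CONF)

budgets_gib = {}
for section in parser.sections():
    if section.startswith("Tier"):
        path = parser[section]["Path"]
        fraction = quota_fraction(parser[section]["Quota"])
        budgets_gib[section] = fraction * TIER_SIZE_GIB[path]

tier_period_min = parser.getint("Global", "Tier Period") / 60
print(budgets_gib)
print(f"tier period: {tier_period_min:.1f} minutes")
```

Under that assumption the 16GiB fio test file is smaller than the Tier 1 budget, and the tier period is much longer than the 60-second test run.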
Results: poor, as expected (the results below were taken with ZFS)
Starting 8 processes
Jobs: 8 (f=8): [W(2),f(3),W(2),f(1)][15.2%][w=215MiB/s][w=215 IOPS][eta 11m:16s]
fiotest: (groupid=0, jobs=8): err= 0: pid=43270: Fri Nov 4 00:17:35 2022
write: IOPS=183, BW=184MiB/s (193MB/s)(10.8GiB/60112msec); 0 zone resets
slat (usec): min=101, max=854743, avg=43446.05, stdev=34704.34
clat (msec): min=23, max=1337, avg=304.13, stdev=85.28
lat (msec): min=23, max=1341, avg=347.57, stdev=92.01
clat percentiles (msec):
| 1.00th=[ 171], 5.00th=[ 211], 10.00th=[ 228], 20.00th=[ 249],
| 30.00th=[ 266], 40.00th=[ 279], 50.00th=[ 296], 60.00th=[ 313],
| 70.00th=[ 330], 80.00th=[ 351], 90.00th=[ 384], 95.00th=[ 418],
| 99.00th=[ 493], 99.50th=[ 919], 99.90th=[ 1217], 99.95th=[ 1250],
| 99.99th=[ 1301]
bw ( KiB/s): min=67571, max=274432, per=100.00%, avg=189065.93, stdev=4192.32, samples=952
iops : min= 65, max= 268, avg=184.20, stdev= 4.11, samples=952
lat (msec) : 50=0.01%, 100=0.09%, 250=20.19%, 500=78.80%, 750=0.41%
lat (msec) : 1000=0.06%, 2000=0.44%
cpu : usr=0.08%, sys=0.34%, ctx=34828, majf=0, minf=98
IO depths : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.5%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,11051,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=10.8GiB (11.6GB), run=60112-60112msec
Let's use btrfs for this test. Same hardware, but this time I made btrfs a RAID1 and re-mounted with the same options. autotier did not work with the btrfs filesystem.
mkfs.btrfs -f -L cachebtrfs -m raid1 -d raid1 /dev/sdb /dev/nvme0n1
Debugging btrfs:
dmesg | grep BTRFS | egrep 'error|warning|failed'
root@nas:/mnt# btrfs fi df /cache/
Data, RAID1: total=33.00GiB, used=31.78GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=17.23MiB
GlobalReserve, single: total=17.12MiB, used=0.00B
Autotier on BTRFS did not work; the process kept hanging. Let's use an ext4 filesystem instead.
root@nas:/home/gfm# umount /cache/
umount: /cache/: target is busy.
root@nas:/home/gfm# ps aux | grep autotier
root 9511 6.0 0.2 832460 11180 ? Ssl 01:33 0:13 autotierfs /mnt/autotier -o allow_other,default_permissions
root 10949 0.0 0.0 6432 724 pts/0 S+ 01:37 0:00 grep --color=auto autotier
root@nas:/home/gfm# kill -9 9511
root@nas:/home/gfm# umount /cache/
root@nas:/home/gfm# rm /var/lib/autotier/5685251811202329732/
adhoc.socket conflicts.log db/
root@nas:/home/gfm# rm /var/lib/autotier/5685251811202329732/adhoc.socket
root@nas:/home/gfm# mkfs -t ext4 /dev/nvme0n1
mke2fs 1.45.5 (07-Jan-2020)
/dev/nvme0n1 contains a btrfs file system labelled 'testme'
Proceed anyway? (y,N) y
Discarding device blocks: done
Creating filesystem with 62514774 4k blocks and 15630336 inodes
Filesystem UUID: ce2eed9e-8e10-4e0c-ab06-d11f17eefe2d
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks):
done
Writing superblocks and filesystem accounting information: done
fio-3.16
Starting 8 processes
fiotest: Laying out IO file (1 file / 16384MiB)
Jobs: 8 (f=8): [W(8)][100.0%][w=640MiB/s][w=639 IOPS][eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=12306: Fri Nov 4 01:41:48 2022
write: IOPS=657, BW=658MiB/s (689MB/s)(38.5GiB/60030msec); 0 zone resets
slat (usec): min=45, max=276076, avg=12158.02, stdev=19153.35
clat (usec): min=828, max=562034, avg=85134.29, stdev=60544.23
lat (usec): min=1052, max=573771, avg=97292.87, stdev=64876.67
clat percentiles (msec):
| 1.00th=[ 40], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 55],
| 30.00th=[ 58], 40.00th=[ 62], 50.00th=[ 65], 60.00th=[ 68],
| 70.00th=[ 73], 80.00th=[ 87], 90.00th=[ 155], 95.00th=[ 213],
| 99.00th=[ 355], 99.50th=[ 447], 99.90th=[ 535], 99.95th=[ 542],
| 99.99th=[ 558]
bw ( KiB/s): min=94154, max=1044480, per=99.88%, avg=672475.66, stdev=22655.00, samples=960
iops : min= 90, max= 1020, avg=656.37, stdev=22.13, samples=960
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.06%, 50=9.76%
lat (msec) : 100=70.81%, 250=17.09%, 500=1.93%, 750=0.32%
cpu : usr=0.36%, sys=0.87%, ctx=121217, majf=0, minf=92
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,39471,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=658MiB/s (689MB/s), 658MiB/s-658MiB/s (689MB/s-689MB/s), io=38.5GiB (41.4GB), run=60030-60030msec
This is about 2/3 of the drive's raw performance; mergerfs is still much better. The only benefit of autotier would be its automatic promotion of files between tiers based on age and usage.
I'm a little uneasy about placing a dependency on autotier given that it doesn't seem to be maintained. IMO, mergerfs + btrfs is the winning combination.
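To put all the runs side by side, here is a small sketch that expresses each measured write bandwidth as a percentage of the single-drive result of 902MiB/s. The labels are my own shorthand for the tests in these notes, and treating the ~900MB/s run as the raw baseline is an assumption; the numbers are the `bw=` values from the fio summary lines.

```python
BASELINE_MIB_S = 902  # single NVMe drive without mergerfs (assumed raw baseline)

# MiB/s from the fio WRITE summary lines in these notes.
RESULTS_MIB_S = {
    "ZFS 0.8.4 mirror, no mergerfs (cache-inflated)": 1240,
    "ZFS 2.1.4 mirror + mergerfs": 944,
    "single drive, no mergerfs": 902,
    "single drive + mergerfs": 771,
    "autotier, ext4 cache tier": 658,
    "mdadm RAID1, internal bitmap": 478,
    "btrfs RAID1, no mergerfs": 438,
    "ZFS 0.8.4 mirror + mergerfs": 374,
    "btrfs RAID1 + mergerfs": 296,
    "HDD pool via mergerfs": 185,
    "autotier, ZFS cache tier": 184,
}


def percent_of_baseline(mib_s):
    """Bandwidth relative to the single-drive baseline, in percent."""
    return mib_s / BASELINE_MIB_S * 100


ranked = sorted(RESULTS_MIB_S.items(), key=lambda item: item[1], reverse=True)
for label, mib_s in ranked:
    print(f"{label:48} {mib_s:5d} MiB/s {percent_of_baseline(mib_s):5.0f}%")
```

The runs were single 60-second passes on a shared machine, so small differences between rows should not be read too closely.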