digging into BeeGFS striping
I did some work today figuring out how BeeGFS actually writes its data to disk. I shudder to think that we’d actually use this knowledge; but I still found it interesting, so I want to share.
First, I created a simple striped file in the rcops allocation.
[root@boss2 rcops]# beegfs-ctl --createfile testfile --numtargets=2 --storagepoolid=2
Operation succeeded.
This file will stripe across two targets (chosen by BeeGFS at random) and is using the default 1M chunksize for the rcops storage pool. You can see this with beegfs-ctl --getentryinfo.
[root@boss2 rcops]# beegfs-ctl --getentryinfo /mnt/beegfs/rcops/testfile --verbose
EntryID: 9-5F7E8E87-1
Metadata buddy group: 1
Current primary metadata node: bmds1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 2; actual: 2
+ Storage targets:
  + 826 @ boss1 [ID: 1]
  + 834 @ boss2 [ID: 2]
Chunk path: uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1
Dentry path: 50/4/0-5BEDEB51-1/
I write an easily-recognized dataset to the file: 1M of A, then 1M of B, and so on.
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("A"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("B"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("C"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("D"*(1024*1024))' >>testfile
This gives me a 4M file, precisely 1024*1024*4=4194304 bytes.
[root@boss2 rcops]# du --bytes --apparent-size testfile
4194304 testfile
Those two chunk files, as identified by beegfs-ctl --getentryinfo, are at /data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 and /data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1. (boss106/rcops doesn’t have a storage directory; we removed it as part of an experiment to see how difficult that would be, and I guess we never put it back.)
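As a quick sanity check before poking at them: with a 1M chunksize, two targets, and plain RAID0 round-robin striping, each target’s chunk file should end up holding 2M of our 4M file. A small sketch of that arithmetic (the function is mine, not anything BeeGFS ships):

# Back-of-the-envelope check, assuming plain RAID0 round-robin striping.
def per_target_bytes(file_size, chunksize=1024 * 1024, ntargets=2):
    sizes = [0] * ntargets
    offset = 0
    while offset < file_size:
        stripe = offset // chunksize                    # which 1M stripe this is
        sizes[stripe % ntargets] += min(chunksize, file_size - offset)
        offset += chunksize
    return sizes

print(per_target_bytes(4 * 1024 * 1024))   # [2097152, 2097152] -- 2M per target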
The boss1 target, 826, is first in the list, so that’s where the file starts.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
AAAAA
If we skip 1M (1024*1024 bytes), we see that that’s where the file changes to C.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024))) count=5 status=none
CCCCC
And we can see that that’s precisely where it starts by stepping back a little.
[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024)-2)) count=5 status=none
AACCC
Cool. So we’ve found the end of the first chunk (made of A) and the start of the third chunk (made of C). That means the second and fourth chunks are over in 834. Which they are.
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
BBBBB
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 skip=$(((1024*1024-2))) status=none
BBDDD
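The round-robin arithmetic behind all of this is simple enough to write down. Here’s a rough sketch of the mapping from a logical file offset to a target and an offset within that target’s chunk file, assuming the targets are used in the order --getentryinfo lists them (the function is my own, not a BeeGFS API):

CHUNKSIZE = 1024 * 1024                      # the rcops pool default
TARGETS = ["826 (boss1)", "834 (boss2)"]     # from --getentryinfo, in order

def locate(offset, chunksize=CHUNKSIZE, targets=TARGETS):
    """Map a logical file offset to (target, offset within that target's chunk file)."""
    stripe = offset // chunksize                  # which 1M stripe the byte falls in
    target = targets[stripe % len(targets)]       # stripes rotate round-robin across targets
    chunk_offset = (stripe // len(targets)) * chunksize + offset % chunksize
    return target, chunk_offset

# The boundaries we poked at with dd:
for off in (0, 1024 * 1024, 2 * 1024 * 1024, 3 * 1024 * 1024):
    print(off, locate(off))
# Expect: A starts on 826 at 0, B on 834 at 0, C on 826 at 1M, D on 834 at 1M.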
So, in theory, if we wanted to bypass BeeGFS and re-construct files from their chunks, we could do that. It sounds like a nightmare, but we could do it. In a worst-case scenario.
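For the record, here’s a minimal sketch of what that reconstruction could look like for a plain RAID0 file like this one; the chunksize and target order come from --getentryinfo, and everything else (the function, the paths in the commented-out call) is just illustration:

import contextlib

def reassemble(chunk_paths, out_path, chunksize=1024 * 1024):
    """Interleave chunksize-sized pieces from each target's chunk file, in target order."""
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(p, "rb")) for p in chunk_paths]
        with open(out_path, "wb") as out:
            while True:
                pieces = [f.read(chunksize) for f in files]
                if not any(pieces):          # every chunk file exhausted
                    break
                for piece in pieces:         # one stripe round, written in target order
                    out.write(piece)

# For our test file, target 826 (boss106) comes first, then 834 (boss207):
# reassemble([
#     "/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1",
#     "/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1",
# ], "testfile.reconstructed")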
It’s this kind of transparency and inspectability that still makes me really like BeeGFS, despite everything we’ve been through with it.