digging into BeeGFS striping

I did some work today figuring out how BeeGFS actually writes its data to disk. I shudder to think that we'd actually use this knowledge, but I still found it interesting, so I want to share.

First, I created a simple striped file in the rcops allocation.

[root@boss2 rcops]# beegfs-ctl --createfile testfile --numtargets=2 --storagepoolid=2
Operation succeeded.

This file will stripe across two targets (chosen by BeeGFS at random) and is using the default 1M chunksize for the rcops storage pool. You can see this with beegfs-ctl --getentryinfo.

[root@boss2 rcops]# beegfs-ctl --getentryinfo /mnt/beegfs/rcops/testfile --verbose
EntryID: 9-5F7E8E87-1
Metadata buddy group: 1
Current primary metadata node: bmds1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 1M
+ Number of storage targets: desired: 2; actual: 2
+ Storage targets:
  + 826 @ boss1 [ID: 1]
  + 834 @ boss2 [ID: 2]
Chunk path: uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1
Dentry path: 50/4/0-5BEDEB51-1/

I write an easily recognized dataset to the file: 1M of A, then 1M of B, and so on.

[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("A"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("B"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("C"*(1024*1024))' >>testfile
[root@boss2 rcops]# python -c 'import sys; sys.stdout.write("D"*(1024*1024))' >>testfile

This gives me a 4M file, precisely 1024*1024*4=4194304 bytes.

[root@boss2 rcops]# du --bytes --apparent-size testfile
4194304     testfile

Those two chunk files, as identified by beegfs-ctl --getentryinfo, are at /data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 (target 826 on boss1) and /data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 (target 834 on boss2). (boss106/rcops doesn't have a storage directory as part of an experiment to see how difficult it would be to remove one; I guess we never put it back.) The boss1 target, 826, is first in the stripe list, so that's where the file starts.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
AAAAA

If we skip ahead 1M (1024*1024 bytes), we see that's where the chunk file changes to C.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024))) count=5 status=none
CCCCC

And we can confirm that's precisely where it starts by stepping back a couple of bytes.

[root@boss1 ~]# dd if=/data/boss106/rcops/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 skip=$(((1024 * 1024)-2)) count=5 status=none
AACCC

Cool. So we’ve found the end of the first chunk (made of A) and the start of the third chunk (made of C). That means the second and fourth chunks are over on target 834. Which they are.

[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 status=none
BBBBB
[root@boss2 rcops]# dd if=/data/boss207/rcops/storage/chunks/uF4240/5BED/E/0-5BEDEB51-1/9-5F7E8E87-1 bs=1 count=5 skip=$(((1024*1024-2))) status=none
BBDDD
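
To make the pattern explicit: with this RAID0 stripe, chunk n of the file goes to target n mod 2 (in the order getentryinfo lists them), and each target just concatenates its chunks into a single chunk file. Here's a quick Python sketch of that arithmetic, purely for illustration, using the chunksize and target IDs from above:

# Round-robin RAID0 chunk arithmetic, as observed above (illustrative only).
CHUNKSIZE = 1024 * 1024
TARGETS = [826, 834]  # stripe order from beegfs-ctl --getentryinfo

def locate(offset, chunksize=CHUNKSIZE, targets=TARGETS):
    # Map a logical file offset to (target, offset within that target's chunk file).
    chunk_index = offset // chunksize              # which 1M chunk of the file this byte is in
    target = targets[chunk_index % len(targets)]   # chunks go round-robin across targets
    stripe_row = chunk_index // len(targets)       # chunks this target already holds before this one
    return target, stripe_row * chunksize + offset % chunksize

for letter, offset in zip("ABCD", range(0, 4 * CHUNKSIZE, CHUNKSIZE)):
    target, chunk_offset = locate(offset)
    print(letter, "-> target", target, "at chunk-file offset", chunk_offset)

That prints A and C landing on 826 at offsets 0 and 1048576, and B and D landing on 834 at the same offsets, which is exactly what the dd reads show.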

So, in theory, if we wanted to bypass BeeGFS and reconstruct files from their chunks, we could do that. It sounds like a nightmare, but we could do it. In a worst-case scenario.
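
If it came to that, the recipe would be roughly: read the chunksize and the stripe order of targets from beegfs-ctl --getentryinfo, find each target's chunk file, and interleave them chunksize bytes at a time. A rough sketch, with made-up placeholder paths standing in for the real chunk files:

# Rough sketch of reassembling a striped file from its chunk files (illustrative only).
# chunk_paths must be listed in stripe order; these paths are placeholders, not the real ones.
CHUNKSIZE = 1024 * 1024

def reassemble(chunk_paths, out_path, chunksize=CHUNKSIZE):
    chunks = [open(p, "rb") for p in chunk_paths]
    with open(out_path, "wb") as out:
        done = False
        while not done:
            for f in chunks:                    # round-robin across targets
                piece = f.read(chunksize)
                out.write(piece)
                if len(piece) < chunksize:      # short read: we've hit the end of the file
                    done = True
                    break
    for f in chunks:
        f.close()

reassemble(["/tmp/chunk-from-826", "/tmp/chunk-from-834"], "/tmp/testfile.rebuilt")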

It’s this kind of transparency and inspectability that still makes me really like BeeGFS, despite everything we’ve been through with it.