How GlusterFS Distribution Works
The defining feature of any scale-out system is its ability to distribute work
or data among many servers. Accordingly, people in the distributed-system
community have developed many powerful techniques to perform such distribution,
but those techniques often remain little known or understood even among other
members of the file system and database communities that benefit. This
confusion is represented even in the name of the GlusterFS component that
performs distribution - DHT, which stands for Distributed Hash Table but is not
actually a DHT as that term is most commonly used or defined. The way
GlusterFS’s DHT works is based on a few basic principles:
All operations are driven by clients, which are all equal. There are no
special nodes with special knowledge of where files are or should be.Directories exist on all subvolumes (bricks or lower-level aggregations of
bricks); files exist on only one.Files are assigned to subvolumes based on consistent hashing, and even
more specifically a form of consistent hashing exemplified by Amazon’s
Dynamo.
The result of all this is that users are presented with a set of files that is
the union of the files present on all subvolumes. The following sections
describe how this “uniting” process actually works.
Layouts
The conceptual basis of Dynamo-style consistent hashing is of numbers around a
circle, like a clock. First, the circle is divided into segments and those
segments are assigned to bricks. (For the sake of simplicity we’ll use
“bricks” hereafter even though they might actually be replicated/striped
subvolumes.) Several factors guide this assignment.
Assignments are done separately for each directory.
Historically, segments have all been the same size. However, this can lead
to smaller bricks becoming full while plenty of space remains on larger
ones. If the cluster.weighted-rebalance option is set, segments sizes
will be proportional to brick sizes.Assignments need not include all bricks in the volume. If the
cluster.subvols-per-directory option is set, only that many bricks will
receive assignments for that directory.
However these assignments are done, they collectively become what we call a
layout for a directory. This layout is then stored using extended
attributes, with each brick’s copy of that extended attribute on that directory
consisting of four 32-bit fields.
A version, which might be DHT_HASH_TYPE_DM to represent an assignment as
described above, or DHT_HASH_TYPE_DM_USER to represent an assignment made
manually by the user (or external script).A “commit hash” which will be described later.
The first number in the assigned range (segment).
The last number in the assigned range.
For example, the extended attributes representing a weighted assignment between
three bricks, one twice as big as the others, might look like this.
Brick A (the large one): DHT_HASH_TYPE_DM 1234 0 0x7ffffff
Brick B: DHT_HASH_TYPE_DM 1234 0x80000000 0xbfffffff
Brick C: DHT_HASH_TYPE_DM 1234 0xc0000000 0xffffffff
Placing Files
To place a file in a directory, we first need a layout for that directory - as
described above. Next, we calculate a hash for the file. To minimize
collisions either between files in the same directory with different names or
between files in different directories with the same name, this hash is
generated using both the (containing) directory’s unique GFID and the file’s
name. This hash is then matched to one of the layout assignments, to yield
what we call a hashed location. For example, consider the layout shown
above. The hash 0xabad1dea is between 0x80000000 and 0xbfffffff, so the
corresponding file’s hashed location would be on Brick B. A second file with a
hash of 0xfaceb00c would be assigned to Brick C by the same reasoning.
Looking Up Files
Because layout assignments might change, especially as bricks are added or
removed, finding a file involves more than calculating its hashed location and
looking there. That is in fact the first step, and works most of the time -
i.e. the file is found where we expected it to be - but there are a few more
steps when that’s not the case. Historically, the next step has been to look
for the file everywhere - i.e. to broadcast our lookup request to all
subvolumes. If the file isn’t found that way, it doesn’t exist. At this
point, an open that requires the file’s presence will fail, or a create/mkdir
that requires its absence will be allowed to continue.
Regardless of whether a file is found at its hashed location or elsewhere, we
now know its cached location. As the name implies, this is stored within DHT
to satisfy future lookups. If it’s not the same as the hashed location, we
also take an extra step. This step is the creation of a linkfile, which is a
special stub left at the hashed location pointing to the cached
location. Therefore, if a client naively looks for a file at its hashed
location and finds a linkfile instead, it can use that linkfile to look up the
file where it really is instead of needing to inquire everywhere.
Rebalancing
As bricks are added or removed, or files are renamed, many files can end up
somewhere other than at their hashed locations. When this happens, the volumes
need to be rebalanced. This process consists of two parts.
Calculate new layouts, according to the current set of bricks (and possibly
their characteristics). We call this the “fix-layout” phase.Migrate any “misplaced” files to their correct (hashed) locations, and
clean up any linkfiles which are no longer necessary. We call this the
“migrate-data” phase.
Usually, these two phases are done together. (In fact, the code for them is
somewhat intermingled.) However, the migrate-data phase can involve a lot of
I/O and be very disruptive, so users can do just the fix-layout phase and defer
migrate-data until a more convenient time. This allows new files to be placed
on new bricks, even though old files might still be in the “wrong” place.
When calculating a new layout to replace an old one, DHT specifically tries to
maximize overlap of the assigned ranges, thus minimizing data movement. This
difference can be very large. For example, consider the case where our example
layout from earlier is updated to add a new double-sided brick. Here’s a very
inefficient way to do that.
Brick A (the large one): 0x00000000 to 0x55555555
Brick B: 0x55555556 to 0x7fffffff
Brick C: 0x80000000 to 0xaaaaaaaa
Brick D (the new one): 0xaaaaaaab to 0xffffffff
This would cause files in the following ranges to be migrated:
0x55555556 to 0x7fffffff (from A to B)
0x80000000 to 0xaaaaaaaa (from B to C)
0xaaaaaaab to 0xbfffffff (from B to D)
0xc0000000 to 0xffffffff (from C to D)
As an historical note, this is exactly what we used to do, and in this case it
would have meant moving 7/12 of all files in the volume. Now let’s consider a
new layout that’s optimized to maximize overlap with the old one.
Brick A: 0x00000000 to 0x55555555
Brick D: 0x55555556 to 0xaaaaaaaa <- optimized insertion point
Brick B: 0xaaaaaaab to 0xd5555554
Brick C: 0xd5555555 to 0xffffffff
In this case we only need to move 5/12 of all files. In a volume with millions
or even billions of files, reducing data movement by 1/6 of all files is a
pretty big improvement. In the future, DHT might use “virtual node IDs” or
multiple hash rings to make rebalancing even more efficient.
Rename Optimizations
With the file-lookup mechanisms we already have in place, it’s not necessary to
move a file from one brick to another when it’s renamed - even across
directories. It will still be found, albeit a little less efficiently. The
first client to look for it after the rename will add a linkfile, which every
other client will follow from then on. Also, every client that has found the
file once will continue to find it based on its cached location, without any
network traffic at all. Because the extra lookup cost is small, and the
movement cost might be very large, DHT renames the file “in place” on its
current brick instead (taking advantage of the fact that directories exist
everywhere).
This optimization is further extended to handle cases where renames are very
common. For example, rsync and similar tools often use a “write new then
rename” idiom in which a file “xxx” is actually written as “.xxx.1234” and then
moved into place only after its contents have been fully written. To make this
process more efficient, DHT uses a regular expression to separate the permanent
part of a file’s name (in this case “xxx”) from what is likely to be a
temporary part (the leading “.” and trailing “.1234”). That way, after the
file is renamed it will be in its correct hashed location - which it wouldn’t
be otherwise if “xxx” and “.xxx.1234” hash differently - and no linkfiles or
broadcast lookups will be necessary.
In fact, there are two regular expressions available for this purpose -
cluster.rsync-hash-regex and cluster.extra-hash-regex. As its name
implies, rsync-hash-regex defaults to the pattern that regex uses, while
extra-hash-regex can be set by the user to support a second tool using the
same temporary-file idiom.
Commit Hashes
A very recent addition to DHT’s algorithmic arsenal is intended to reduce the
number of “broadcast” lookups the it issues. If a volume is completely in
balance, then no file could exist anywhere but at its hashed location.
Therefore, if we’ve already looked there and not found it, then looking
elsewhere would be pointless (and wasteful). The commit hash mechanism is
used to detect this case. A commit hash is assigned to a volume, and
separately to each directory, and then updated according to the following
rules.
The volume commit hash is changed whenever actions are taken that might
cause layout assignments across all directories to become invalid - i.e.
bricks being added, removed, or replaced.The directory commit hash is changed whenever actions are taken that might
cause files to be “misplaced” - e.g. when they’re renamed.The directory commit hash is set to the volume commit hash when the
directory is created, and whenever the directory is fully rebalanced so that
all files are at their hashed locations.
In other words, whenever either the volume or directory commit hash is changed
that creates a mismatch. In that case we revert to the “pessimistic”
broadcast-lookup method described earlier. However, if the two hashes match
then we can with skip the broadcast lookup and return a result immediately.
This has been observed to cause a 3x performance improvement in workloads that
involve creating many small files across many bricks.