Adventures in GlusterFS
What is GlusterFS?
GlusterFS is a distributed network filesystem. It can be attached to a server for storage like any other filesystem, but because it works over network sockets it can be configured in a variety of ways to distribute files across a number of disparate servers simultaneously, in a fashion which is transparent to users.
It can be deployed to replicate the same data across multiple servers (sort of like RAID mirroring), or bring together the free space of multiple servers into a single large volume (sort of like RAID striping), or distribute data across a WAN using "geo-replication".
A Gluster filesystem is called a Volume. Servers participating in the Gluster Volume are called Bricks. Additional Bricks can be added to the Volume later if needed.
A server then mounts the Gluster Volume, and the Gluster daemon `glusterd` manages the data across the Bricks in whatever configuration you chose when the Volume was created.
Sharing Disk Space Across Servers
Everything was going along just fine until one day a server ran out of space... I wrote to the hosting company and asked for a bigger disk, but I was told that the particular server configuration we have chosen cannot be changed. I sighed. Then I realized there was a whole bunch of free space on the server right next to it, sharing a Gbps LAN connection. The easiest solution which has worked for *nix systems administrators for decades is simply to install NFS and mount an exported directory from the server with lots of free disk space. But the application I'm using which needs the space also needs it to all be available in a single directory, not a subdirectory. I chose GlusterFS to solve this conundrum.
In the default configuration, `gluster volume create ...` creates a new DISTRIBUTED volume, meaning that the files stored on the volume are spread across all of the bricks. There is no redundancy, but all free space on both filesystems within the Gluster volume is available. Each file lives on exactly one Brick, so losing a Brick's storage means losing every file stored there, which for most purposes ruins the Volume. It is important to understand this: without another source of the data, such as external backups, it is in a precarious position. The advantage is that if additional space is needed, it can be added on-the-fly to the Volume, and clients mounting that Volume will have the new space available immediately.
The following outlines one potential configuration of a Gluster Volume, and the issues I encountered while bringing the Volume online:
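As a rough sketch of what adding space on-the-fly involves, the helper below prints (but does not run) the commands I would expect to use: join each new server to the pool, add its Brick, and rebalance so existing files can spread onto it. The function name `gluster_grow_cmds`, the volume name `myvolume` and the host `server3` are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the commands to grow a distributed volume.
# Usage: gluster_grow_cmds VOLUME BRICK...   (BRICK is host:/path)
gluster_grow_cmds() {
    if [ "$#" -lt 2 ]; then
        echo "usage: gluster_grow_cmds VOLUME BRICK..." >&2
        return 1
    fi
    volume="$1"; shift
    # Each new server must join the trusted pool before its brick can be added.
    for brick in "$@"; do
        echo "gluster peer probe ${brick%%:*}"
    done
    echo "gluster volume add-brick $volume $*"
    # Rebalancing lets existing files spread onto the new brick.
    echo "gluster volume rebalance $volume start"
}

# Example with made-up names:
gluster_grow_cmds myvolume server3:/srv/gluster-vol
```

Printing the commands first makes it easy to review them before pasting them into a root shell on one of the existing servers.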
1. Install GlusterFS Packages
The servers I configured GlusterFS on for distributed Volume storage are running Ubuntu 12.04 LTS (codenamed "precise"), so I obtained the Ubuntu PPA for GlusterFS from here: https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-3.6
(The current latest version, 3.7, does not have Ubuntu packages available for 12.04 so I decided to use a previous version)
Much of the GlusterFS documentation revolves around Fedora / Red Hat Enterprise Linux (RHEL) because Red Hat is the central authority for Gluster development. That said, you should follow installation instructions for your Linux distribution as appropriate.
Install the server package ("glusterfs-server" in Ubuntu) on all servers you wish to participate in the storage.
Install the client ("glusterfs-client") on all systems where the Gluster Volume is to be used to store data (This might be the same servers which are serving the Gluster Volume, or maybe not).
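On Ubuntu, the steps above could look roughly like the sketch below. It only prints the commands, so they can be reviewed before being run as root. The PPA name `ppa:gluster/glusterfs-3.6` is derived from the Launchpad URL above. The function name `gluster_install_cmds` is made up for this example, and `add-apt-repository` is assumed to be installed already (on 12.04 it comes from the python-software-properties package).

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the install commands for one role.
# Usage: gluster_install_cmds server|client
gluster_install_cmds() {
    case "$1" in
        server|client) ;;
        *) echo "usage: gluster_install_cmds server|client" >&2; return 1 ;;
    esac
    echo "add-apt-repository ppa:gluster/glusterfs-3.6"
    echo "apt-get update"
    echo "apt-get install glusterfs-$1"
}

# Review the commands for a storage server:
gluster_install_cmds server
```

A machine that both stores and mounts the Volume would need the output for `server` and for `client`.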
2. Start The Gluster Service
With the packages installed, ensure each server you want to participate in the Volume storage has the gluster daemon started.
RHEL: `service glusterd status`
Ubuntu: `service glusterfs-server status`
3. Create A Directory For Bricks To Hold Their Data
Each server participating in the storage of a Gluster Volume (each "Brick") needs a place on its own filesystem to hold this data. These directories must be specified when creating the Volume in the next step, so you may wish to use the same directory structure on each Brick to make things easier.
`mkdir -p /srv/gluster-vol` (or whatever location works for you)
3. Create A New Gluster Volume
If you require data redundancy, you should create a new "replica" Volume. If you require a Volume to be distributed/replicated across a WAN, read up on geo-replication which is beyond the scope of this blog post.
I needed to harness the available free space across two servers without the need for data redundancy, so I did not specify a type, causing the `gluster volume create ...` command to create a new Distributed Volume.
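To make the difference between the two Volume types concrete, here is a small sketch that prints the create and start commands for either case. It assumes the servers have already been joined into one trusted pool with `gluster peer probe`. The helper name `gluster_create_cmds`, the volume name `replicavol` and the hosts `server1`/`server2` are placeholders of mine, not anything Gluster provides.

```shell
#!/bin/sh
# Hypothetical helper: print the commands to create and start a volume.
# Usage: gluster_create_cmds VOLUME REPLICA_COUNT BRICK...
# A REPLICA_COUNT of 1 means "no type given", i.e. a plain Distributed Volume.
gluster_create_cmds() {
    if [ "$#" -lt 3 ]; then
        echo "usage: gluster_create_cmds VOLUME REPLICA_COUNT BRICK..." >&2
        return 1
    fi
    volume="$1"; replica="$2"; shift 2
    if [ "$replica" -gt 1 ]; then
        # Replica sets need the brick count to be a multiple of the replica count.
        if [ $(( $# % replica )) -ne 0 ]; then
            echo "brick count ($#) is not a multiple of replica count ($replica)" >&2
            return 1
        fi
        echo "gluster volume create $volume replica $replica $*"
    else
        echo "gluster volume create $volume $*"
    fi
    echo "gluster volume start $volume"
}

# Distributed (what I used) versus replicated, with placeholder hosts:
gluster_create_cmds mygluster 1 server1:/srv/gluster-vol server2:/srv/gluster-vol
gluster_create_cmds replicavol 2 server1:/srv/gluster-vol server2:/srv/gluster-vol
```

The only difference in the generated command is the `replica 2` clause; everything else about the Brick list stays the same.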
3a. Ensure Your Environment Is Prepared; Use Consistent Hostnames
You should ensure that you have consistent hostnames available on all Brick servers so that each server refers to the others by a hostname which has the same IP address everywhere.
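One way to check this ahead of time is sketched below: look each Brick hostname up in a hosts file and complain if it is missing or maps to a loopback (127.x) address. This only reads the file it is given, so it assumes name resolution on your servers comes from `/etc/hosts` rather than DNS. The function names and the hostnames `server1`/`server2` are invented for the example.

```shell
#!/bin/sh
# hosts_lookup FILE NAME: print the first address mapped to NAME in a hosts-format FILE.
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }   # strip comments; awk re-splits the fields afterwards
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# brick_host_ok FILE NAME: succeed only if NAME maps to a non-loopback address.
brick_host_ok() {
    addr=$(hosts_lookup "$1" "$2")
    case "$addr" in
        "")    echo "$2: not found in $1"; return 1 ;;
        127.*) echo "$2: resolves to loopback $addr"; return 1 ;;
        *)     echo "$2: $addr" ;;
    esac
}

# Run on every server and compare the output; the names here are placeholders.
for host in server1 server2; do
    brick_host_ok /etc/hosts "$host" || true
done
```

If every server prints the same non-loopback address for every Brick hostname, the names are safe to use when creating the Volume.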
At first I tried referring to each server by its LAN IP address. The Volume was successfully created on each Brick, but it wouldn't start (come online). Reviewing the logs I saw
Commit of operation 'Volume Start' failed on localhost
which baffled me. I never once asked for a Brick to be used with a hostname of "localhost", nor did I ever specify a "127.x.x.x" address, so how could this happen?
My two servers have LAN IP addresses 10.0.0.1 (server "alpha") and 10.0.0.2 (server "beta"), and my storage space is at /srv/gluster-vol on each, so I issued this command to create the Volume:
`gluster volume create mygluster 10.0.0.1:/srv/gluster-vol 10.0.0.2:/srv/gluster-vol`
Everything seemed to create fine and there was no mention of `localhost` anywhere. Issuing a `status` command on the Gluster Volume tipped me off to the issue. On server beta:
    root@beta:~# gluster volume status mygluster
    Status of volume: mygluster
    Gluster process                                 Port    Online  Pid
    ------------------------------------------------------------------------------
    Brick 10.0.0.1:/srv/gluster-vol                 49153   Y       127581
    Brick 10.0.0.2:/srv/gluster-vol                 49153   Y       121695
    NFS Server on localhost                         2049    Y       127593
    NFS Server on alpha                             N/A     N       N/A
Wait, what?? I'm not planning to mount this volume using NFS, but who said "alpha"? It turns out I had the following configuration on each server's `/etc/hosts` file:
On alpha:

    127.0.0.1 localhost alpha
    10.0.0.2 beta

On beta:

    127.0.0.1 localhost beta
    10.0.0.1 alpha
So it seems that beta decided to look up a hostname for IP address 10.0.0.1 which is shared with the other Bricks. This hostname is "alpha". When alpha looked that hostname up to obtain an IP address it obtained 127.0.0.1! Okay, so I fixed up the hosts files:
On alpha:

    127.0.0.1 localhost alpha
    10.0.0.1 gluster1
    10.0.0.2 beta gluster2

On beta:

    127.0.0.1 localhost beta
    10.0.0.1 alpha gluster1
    10.0.0.2 gluster2
Now the volume can be created using hostnames which will resolve to the same IP address everywhere they are used ("gluster1" and "gluster2"). So I recreated the volume using these:
`gluster volume create mygluster gluster1:/srv/gluster-vol gluster2:/srv/gluster-vol`
But still it would not start...
3b. Firewall Rules For GlusterFS
Being a security conscious systems administrator, I have firewall rules on servers to throw away dangerous or malicious packets, packets appearing to come from wrong IP addresses on wrong interfaces, and I limit the available network connections to necessary services for necessary clients.
Reviewing the logs further I discovered a number of "Connection refused" messages while attempting to contact a server's own LAN IP address. Server alpha got "Connection refused" attempting to connect to 10.0.0.1:24007. How could this be? Each server seemed to have firewall rules permissive enough for GlusterFS to function, which requires TCP/24007 and TCP/49152-49252. The firewall rules on each server specified the following, in plain English:
Allow all connections on interface "lo" if the source IP address is 127.*
Allow all connections on interface "eth0" if the source IP address is 10.*
Deny all other connections
I was taught years ago that these are sensible restrictions. A localhost network packet with IP address 127.0.0.1 should never appear on interface eth0; if it does, it should be discarded. Likewise, the box should never accept any packets to/from the 10.x.x.x network on the localhost interface. Surely such packets have been maliciously placed there to affect some undesired functionality...
But that's what Gluster did, as shown by tcpdump:
    root@alpha:~# tcpdump -n -i lo net 10
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
    14:07:50.094031 IP 10.0.0.1.24007 > 10.0.0.1.1023: Flags [.], ack 4166831461, win 256, options [nop,nop,TS val 3289703104 ecr 3289700600], length 0
    14:07:50.094060 IP 10.0.0.1.1023 > 10.0.0.1.24007: Flags [.], ack 1, win 256, options [nop,nop,TS val 3289703104 ecr 3289700592], length 0

There is no good reason this should ever happen, yet there it is staring me in the face. Two adjustments to the firewall rules fix this in a strict way, allowing this strange behaviour without letting too much other strangeness pass through. In plain English:
Allow packets from 10.0.0.1 port tcp/24007 received on interface "lo"
Allow packets to 10.0.0.1 port tcp/24007 sent on interface "lo"
And the same two rules added to beta, using its IP 10.0.0.2 instead.
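For readers who use iptables, the two plain-English rules might translate to something like the sketch below. This is my own reading rather than a copy of a real ruleset: it assumes iptables is the firewall tool, treats "port tcp/24007" as the destination port, and assumes reply packets are already let through by an existing ESTABLISHED,RELATED rule. The helper name `gluster_lo_rules` is made up, and the commands are only printed so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print iptables rules allowing glusterd traffic that uses
# the server's own LAN address on the loopback interface.
# Usage: gluster_lo_rules LAN_IP
gluster_lo_rules() {
    if [ -z "$1" ]; then
        echo "usage: gluster_lo_rules LAN_IP" >&2
        return 1
    fi
    # Packets from the server's own LAN address received on lo, glusterd port.
    echo "iptables -I INPUT -i lo -s $1 -p tcp --dport 24007 -j ACCEPT"
    # Packets to the server's own LAN address sent on lo, glusterd port.
    echo "iptables -I OUTPUT -o lo -d $1 -p tcp --dport 24007 -j ACCEPT"
}

gluster_lo_rules 10.0.0.1   # for alpha
gluster_lo_rules 10.0.0.2   # for beta
```

`-I` inserts each rule at the top of its chain, which keeps it ahead of the final "deny everything else" rule.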
4. Mount The Volume
Once your Gluster Volume is started, and you are able to get some successful happy output from the status command
`gluster volume status mygluster`
it is ready to be used! For clients without GlusterFS support, such as non-Linux *nix or *BSD systems with NFS support, Gluster Volumes may be mounted as type "nfs". However if your Linux system has the glusterfs-client package installed, you should mount it as type "glusterfs" using FUSE.
10.0.0.1:/mygluster /path/to/mountpoint glusterfs defaults 0 0
Since I am using a distributed volume, there is no need to specify backup volume servers; if any of the Volume servers ("Bricks") goes offline, my distributed Volume will be incomplete no matter which server the client contacts.
However it is worth noting that if you are using a "replica"-type Volume where data is mirrored across multiple volume servers, you may wish to use the mount option "backup-volfile-servers":
10.0.0.1:/replicavol /path/to/mountpoint glusterfs defaults,backup-volfile-servers=10.0.0.2:10.0.0.3:10.0.0.4 0 0
In this way, if the primary server goes offline, the replicated data will remain available to the client, which would then connect to a backup-volfile-server in order to retain the availability of the data.
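As a sketch of how these `/etc/fstab` entries are put together, the helper below builds the line from its parts. It assumes the backup servers are given as bare hostnames or addresses joined with colons, which is the form the GlusterFS mount helper accepts as far as I know; a comma would end the option because fstab options are comma-separated. The function name `gluster_fstab_line` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/fstab line for a GlusterFS (FUSE) mount.
# Usage: gluster_fstab_line SERVER VOLUME MOUNTPOINT [BACKUP_SERVER...]
gluster_fstab_line() {
    if [ "$#" -lt 3 ]; then
        echo "usage: gluster_fstab_line SERVER VOLUME MOUNTPOINT [BACKUP_SERVER...]" >&2
        return 1
    fi
    server="$1"; volume="$2"; mountpoint="$3"; shift 3
    opts="defaults"
    if [ "$#" -gt 0 ]; then
        # Join the remaining arguments with ":" (the first character of IFS).
        backups=$(IFS=:; echo "$*")
        opts="$opts,backup-volfile-servers=$backups"
    fi
    echo "$server:/$volume $mountpoint glusterfs $opts 0 0"
}

# Distributed volume, no backups:
gluster_fstab_line 10.0.0.1 mygluster /path/to/mountpoint
# Replicated volume with three backup volfile servers:
gluster_fstab_line 10.0.0.1 replicavol /path/to/mountpoint 10.0.0.2 10.0.0.3 10.0.0.4
```

For a one-off test before touching fstab, `mount -t glusterfs 10.0.0.1:/mygluster /path/to/mountpoint` does the same job by hand.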
In this case I have used GlusterFS to provide some additional disk space, as a distributed filesystem across servers on a LAN, because one server was dangerously low on space. Its large data files have been placed on the Gluster Volume, and between both servers a lot more space is now available for use.
In a replicated configuration ("replica"), Gluster can provide data redundancy and distribution, so that if one server fails, the data on the Gluster Volume remains available for use, and when that server comes back online it will be synchronized to the state of the others, with the changes made during its downtime being updated before the Volume becomes available. We have investigated this usage to provide a High Availability web server cluster, where newly uploaded files to a web site are immediately available across all web servers in the cluster. This configuration will be detailed in a future blog post.