Highly Available Static Fileservers – Heartbeat, Nginx, Lsyncd, and RR-DNS

by Benjamin

Introduction

After being tasked with finding a possible replacement for our dying image server (running out of disk space, no real replication, high load, and so on…), I have spent quite some time studying possible solutions. We have come up with a solution that is fast, scalable, domain independent, highly available, and redundant: exactly the kind of qualities you are looking for when designing an architecture for a company that runs multiple high-traffic projects and has these qualities as requirements. The technologies used for the creation of these static file clusters are all open source and should be available for most relatively recent *nix distributions. The technologies used in this setup are, in no particular order: nginx, lsyncd, heartbeat, and RR-DNS.

Although the focus of this post is heavily geared towards the distribution of images, it should be clear that the information in this document can be used to distribute and load balance any static files across any number of clusters. Please also note that I have attempted to explain everything in great detail, which unfortunately means that the descriptions and terminology used in this post are aimed at technical people with an understanding of *nix system administration.

A cluster in this post is a pair of servers, configured in a specific way to provide all of the above requirements. I will start off by explaining the file structure used in our image cluster; it will soon become clear that this is central to the distribution of files across clusters. I will then describe the cluster in great detail. For convenience, this post is composed of two main sections or levels with multiple subsections. The main sections, in order, are: “Image Storage Structure” and “Cluster Structure”.

Image Storage Structure

The image storage structure is composed of a user id and an md5 hash of the filename, plus the file’s extension. The image filename thus becomes uid-md5.ext. The md5 hash of the image filename is responsible for multiple facets of the distribution: the first 3 hexadecimal digits of the hash decide the image’s parent folder structure and ensure an even distribution, as will become clear later. I would like to illustrate the file/folder structure on the filesystem with the following schema:

[Figure: Image File/Folder Structure]

As we can see from the above schema, the first 3 hex digits of the hash decide where in the folder structure the image will be stored. Also note that this folder structure requires the images to be stored per domain at the root level, for example “/srv/images/domain1.com/image/” and “/srv/images/domain2.com/image/”.
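
To make the mapping concrete, here is a minimal shell sketch that derives the storage path for a file. The nesting of the first three hexadecimal digits into separate directory levels, the example domain, and the variable names are assumptions based on the schema above, not a literal copy of our production code:

#!/bin/sh
# Sketch: derive the storage path from a user id and an original filename.
# Assumption: each of the first three hex digits of the md5 hash of the
# filename becomes one directory level under the per-domain image root.
uid="123456"
original="photo.jpg"
ext="${original##*.}"
hash=`printf '%s' "$original" | md5sum | cut -d' ' -f1`
d1=`echo "$hash" | cut -c1`
d2=`echo "$hash" | cut -c2`
d3=`echo "$hash" | cut -c3`
echo "/srv/images/domain1.com/image/$d1/$d2/$d3/$uid-$hash.$ext"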

Cluster Structure

Rather than considering each server individually, the maximum granularity will be confined to a cluster; the rationale for this is that, for the system to work as intended, machines can only be added in pairs. The reason we can only add machines in pairs is redundancy and performance, which will be discussed later in this post. We will also discuss the individual server internals, as it is important to understand how the cluster works.

DNS Level

The DNS level plays an important part in distributing the load evenly across the machines; we make use of standard DNS Round-Robin load balancing. The domains for the image servers are formed in the following way:

*.image.domain.tld

The ‘*’ here depicts a hexadecimal digit from 0 to f. As mentioned before, servers will only be added in pairs; this is done to ensure that everything is available at all times, and it has the side effect of being a performance optimization at practically no cost. The following list contains the current mapping of IPs to domains:

  • 0.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • 1.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • 2.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • 3.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • 4.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • 5.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • 6.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • 7.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • 8.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • 9.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • a.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • b.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • c.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • d.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)
  • e.image.domain.tld -> (xxx.xxx.xxx.25, xxx.xxx.xxx.26)
  • f.image.domain.tld -> (xxx.xxx.xxx.39, xxx.xxx.xxx.63)

As we can see from the above list, anything in the DNS address range between 0 and f will always resolve to 2 of these image servers (physical servers, not clusters).
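
Assuming the application picks the bucket from the first hexadecimal digit of the md5 hash (an assumption on my part; any scheme that spreads requests evenly over 0–f works), selecting the host and verifying the round-robin records could look like this:

# Sketch: choose the image host from the first hex digit of the md5 hash
hash=`printf '%s' "photo.jpg" | md5sum | cut -d' ' -f1`
bucket=`echo "$hash" | cut -c1`
host="$bucket.image.domain.tld"

# Each bucket resolves to both physical servers of its pair; repeated
# queries should show the two A records rotating (round-robin)
dig +short "$host" A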

Server Level

Another important part to discuss and expand on is the server level: what services are running on the server, how these services are configured, and what their responsibilities are. This subsection will be split into the following parts: Relevant Processes Launched on Boot, Kernel Optimizations, High-Availability, Replication Mechanisms, and Webserver Configuration.

Relevant Processes Launched on Boot

On boot there are several important processes that are started. The processes discussed in this section are inherent to the cluster setup and to the correct operation of the cluster. On each of the servers in the clusters, we have the following base processes to monitor:

  • Heartbeat – responsible for high-availability and failover,
  • Lsyncd – responsible for filesystem replication,
  • Nginx – responsible for serving the images and for providing a DAV interface for filesystem manipulation.
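
A quick sanity check that all three are up after boot (just a sketch; exact process names can differ per distribution and packaging):

# Check that heartbeat, lsyncd, and nginx are all running
ps -e | egrep 'heartbeat|lsyncd|nginx'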

Kernel Optimizations

For the proper operation of the cluster’s servers, it is required to make certain changes to the way the kernel operates, in /etc/sysctl.conf and /etc/security/limits.conf.

In /etc/sysctl.conf we have made multiple changes for various reasons; some are changes not normally necessary on *nix systems, but essential to the operation of our high-speed file server. The contents of sysctl.conf are listed below for completeness:

net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.rp_filter = 1
fs.inotify.max_user_watches = 65536
fs.inotify.max_queued_events = 65536
net.ipv4.conf.default.promote_secondaries = 1
net.ipv4.conf.all.promote_secondaries = 1
net.ipv4.ip_nonlocal_bind = 1
net.netfilter.nf_conntrack_max = 500000
net.nf_conntrack_max = 500000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.core.netdev_max_backlog = 100000
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1800
net.core.wmem_max = 8388608
net.core.rmem_max = 8388608
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 87380 8388608
net.core.somaxconn = 2048
kernel.pid_max = 65536

A few remarks on some of these values, as they are important to support our running services:
fs.inotify.max_user_watches = 65536 and fs.inotify.max_queued_events = 65536; inotify is the underlying service Lsyncd uses to keep the filesystems between servers in sync. The reason these values are higher than the defaults is to make sure the event/watch queue is always large enough. More info on inotify can be found here: http://en.wikipedia.org/wiki/Inotify.

We have also enlarged the conntrack table to 500000 entries, as iptables would otherwise start to drop connections when the conntrack table is full. The default maximum number of connections stored in the conntrack table is 65536, which we noticed was close to the limit, and in some cases not enough to handle peak traffic.

Whilst tcp_tw_reuse and tcp_tw_recycle were originally enabled, we have decided to disable them after consulting the following conversation: http://www.mail-archive.com/varnish-misc@projects.linpro.no/msg02899.html.

In /etc/security/limits.conf we have added only one setting, which is the following:

*       soft    nofile  75000
*       hard    nofile  75000

As this setting limits the number of file descriptors (open files) available on the system, it is important that it is changed to a value higher than the Linux kernel default of 1024.
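
After editing these files, the sysctl values can be applied and spot-checked as follows (a sketch; note that limits.conf is only read on login, so the file descriptor limit has to be verified in a fresh session):

# Reload the settings from /etc/sysctl.conf
sysctl -p

# Spot-check a few of the values we rely on
sysctl fs.inotify.max_user_watches net.core.somaxconn

# Compare the live conntrack count against the new maximum (conntrack module must be loaded)
cat /proc/sys/net/netfilter/nf_conntrack_count

# Verify the file descriptor limit in a new login session
ulimit -n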

High-Availability

As a high-availability solution, heartbeat was chosen. It was chosen because of its simplicity to set up and because we didn’t need any complex operations. The intent of the high-availability solution is, in our case, to move the public and virtual IPs if one server in the cluster goes down. As each server in a cluster contains exactly the same data, it is trivial to let our webserver listen on all interfaces and IPs (we limit access through the firewall) and move the IPs automatically when something goes wrong. This means that both the virtual IP and the external IP are defined in heartbeat as resources, and will only be available when heartbeat is running.

The configuration file for heartbeat (/etc/ha.d/ha.cf) contains the following:

# Logfile
logfile    /var/log/ha-log

# Log facility in syslog
logfacility    local0

# Time between heartbeats(Here 2 seconds)
keepalive 2

# Time it takes to declare the other node unavailable and to switch its resources to this node
deadtime 30

# "late heartbeat" warnings are issued after this amount of seconds
warntime 10

# Time to wait on heartbeat startup before deciding a resource is not alive
initdead 120

# UDP port for unicast communication
udpport    670

# Heartbeat communication to the other server through unicasts
ucast eth0 10.0.1.26

# Automatically failback the node when resource comes back online
auto_failback on

# Node definition
node node-25
node node-26

We also had to define the resources in /etc/ha.d/haresources, which contains the following:

node-25 10.0.0.25
#IPaddr_gw::"IP address"/"netmask bits"/"interface"/"broadcast"/"gateway"
node-25 IPaddr_gw::xxx.xxx.xxx.25/24/eth1/xxx.xxx.xxx.255/xxx.xxx.xxx.1
node-26 10.0.0.26
node-26 IPaddr_gw::xxx.xxx.xxx.26/24/eth1/xxx.xxx.xxx.255/xxx.xxx.xxx.1

The simple virtual IPs (.25 and .26) are easily configured; the external one, however, needed a custom function to bring the routing up correctly after the external IP is brought up. This was necessary since heartbeat doesn’t come with that specific routing resource by default. We altered the IPaddr resource script and created our own in /etc/ha.d/resource.d/IPaddr_gw; this file contains:

#!/bin/sh
#
#
# Description:    wrapper of OCF RA IPaddr2, based on original heartbeat RA.
#        See OCF RA IPaddr2 for more information.
#
# Author:    Xun Sun
# Support:      linux-ha@lists.linux-ha.org
# License:      GNU General Public License (GPL)
# Copyright:    (C) 2005 International Business Machines
#
#    This script manages IP alias IP addresses
#
#    It can add an IP alias, or remove one.
#
#    usage: $0 ip-address[/netmaskbits[/interface[:label][/broadcast][/gateway]]] \
#        {start|stop|status|monitor}
#
#    The "start" arg adds an IP alias.
#
#    Surprisingly, the "stop" arg removes one.    :-)
#
unset LANG; export LANG
LC_ALL=C
export LC_ALL

. /etc/ha.d/resource.d//hto-mapfuncs

# We need to split the argument into pieces that IPaddr OCF RA can
# recognize, sed is preferred over Bash-specific builtin functions
# for portability.

usage() {
  echo "usage: $0 ip-address[/netmaskbits[/interface[:label][/broadcast][/gateway]]] $LEGAL_ACTIONS"
}

if [ $# != 2 ]; then
    usage
    exit 1
fi

BASEIP=`echo $1 | sed 's%/.*%%'`
OCF_RESKEY_ip=$BASEIP; export OCF_RESKEY_ip
GATEWAY="" 

str=`echo $1 | sed 's%^'$BASEIP'/*%%'`
if [ ! -z "$str" ]; then
    NETMASK=`echo $str | sed 's%/.*%%'`
    OCF_RESKEY_cidr_netmask=$NETMASK; export OCF_RESKEY_cidr_netmask

    str=`echo $str | sed 's%^'$NETMASK'/*%%'`
    NIC=`echo $str | sed 's%/.*%%'`

    case $NIC in
    [0-9]*)    BROADCAST=$NIC
        OCF_RESKEY_broadcast=$BROADCAST; export OCF_RESKEY_broadcast
        NIC=
        ;;
    "")    ;;
    *)    BROADCAST=`echo $str | sed -e 's%^'$NIC'/*%%' -e 's%/.*%%'`
        echo $BROADCAST
        OCF_RESKEY_nic=$NIC; export OCF_RESKEY_nic
        OCF_RESKEY_broadcast=$BROADCAST; export OCF_RESKEY_broadcast
        ;;
    esac

    if [ -n "$BROADCAST" ]; then
    GATEWAY=`echo $str | cut -d/ -f3`
    fi

fi

OCF_TYPE=IPaddr2
OCF_RESKEY_lvs_support=1
OCF_RESOURCE_INSTANCE=${OCF_TYPE}_$BASEIP
export OCF_TYPE OCF_RESOURCE_INSTANCE OCF_RESKEY_lvs_support

ra_execocf $2

if [[ -n $GATEWAY && "$2" == "start" ]]; then
    ip route replace default via $GATEWAY
fi

With the important bit of the above being:

if [[ -n $GATEWAY && "$2" == "start" ]]; then
    ip route replace default via $GATEWAY
fi

In essence, this allows you to specify an extra “gateway” parameter in the haresources file for this specific resource, so that the interface is brought up correctly with the right route and default gateway.
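
A simple way to verify the failover behaviour (a sketch; init script paths and service names vary per distribution) is to stop heartbeat on one node and watch the addresses and the default route move to the other:

# On node-25: stop heartbeat so node-26 takes over its resources
/etc/init.d/heartbeat stop

# On node-26: after deadtime (30 seconds here) the virtual and external IPs
# should have been added, and the default route replaced by IPaddr_gw
ip addr show
ip route show | grep default

With auto_failback set to on in ha.cf, starting heartbeat again on node-25 will move its resources back automatically.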

Replication Mechanisms

There are multiple replication mechanisms in place between servers. Both lsyncd and nginx (to a degree) are responsible for replicating content across the servers in each cluster.

The primary replication mechanism we use is lsyncd, which makes use of the inotify facilities provided by the kernel. Inotify registers filesystem events, which lsyncd collects and replays to the other server in the cluster by invoking rsync. Both filesystems need to be exact copies of each other before lsyncd is started, to avoid unexpected data loss once lsyncd is running on both servers. As lsyncd keeps the filesystems in sync with each other, it is very important to note that a delete/put/move/copy will always be replayed to the other node; this means that deletes need to be done cautiously.

The primary configuration file for lsyncd is /etc/lsyncd.conf, which contains the following simple configuration:

settings = {
   logfile    = "/var/log/lsyncd.log",
   statusFile = "/var/run/lsyncd.status",
   nodaemon   = true,
}
sync{default.rsync, source="/srv/images", target="10.0.1.26::images-domain"}

This lsyncd configuration corresponds to the rsyncd configuration on the other node, in /etc/rsyncd.conf:

gid = users
use chroot = false
transfer logging = true
log format = %h %o %f %l %b
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid

[images-domain]
    hosts allow = 10.0.1.25
    path = /srv/images
    comment = images test
    uid = wwwrun
    read only = false
    write only = true

It is important to note here that only nodes in the same cluster are allowed to communicate with each other over rsync; this avoids unnecessary mistakes. We also have to concede that lsyncd will not instantly replay filesystem changes over rsync: the current maximum delay is 10 seconds. There thus had to be a backup mechanism to ensure filesystem consistency and availability at all times. Therefore nginx will check whether it has a file available on its local hard drive and, if the file is not found, proxy the request to the other node in the cluster (on a different port than 80 to avoid recursion). This should ensure that whatever files are currently on either node of the cluster are always available on both nodes of the cluster. More information about the nginx configuration can be found in the next section.
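
To illustrate the interplay, here is a sketch (the bucket, directory nesting, and filename are hypothetical): a file PUT to node-25 over the internal DAV port is replayed by lsyncd within roughly ten seconds, but can be served from node-26 immediately thanks to the proxy fallback configured in the next section.

# Upload to node-25 through the DAV interface (internal network, port 8080);
# the target directory is assumed to already exist
curl -T photo.jpg http://10.0.0.25:8080/a/b/c/123456-abcdef.jpg

# A GET that lands on node-26 before lsyncd has replayed the change is
# proxied back to node-25, and proxy_store keeps a local copy of the response
# (the round-robin may of course hand out either node)
curl -o /tmp/check.jpg http://a.image.domain.tld/a/b/c/123456-abcdef.jpg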

Webserver Configuration

Our webserver of choice for serving these static files is nginx. It is currently configured to serve static files on port 80 of the external interface (this is the same for all domains; not the same configuration but the same port) and has extra ports for the internal replication and DAV methods. A good guideline is that, in general, 1 extra domain on the cluster means 1 extra port in use on each server.

Since we have a very specific use for this webserver, i.e. serving static files, we have used the following compilation parameters:

nginx: nginx version: nginx/1.0.2
nginx: built by gcc 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux)
nginx: configure arguments: --with-http_secure_link_module --with-http_dav_module \
--without-http_ssi_module --without-http_auth_basic_module --without-http_autoindex_module \
--without-http_geo_module --without-http_map_module --without-http_referer_module \
--without-http_fastcgi_module --without-http_limit_zone_module --without-http_empty_gif_module \
--without-http_upstream_ip_hash_module --without-mail_pop3_module --without-mail_imap_module \
--without-mail_smtp_module --with-pcre=/usr/local/src/pcre-8.12

As we can see from the compilation parameters, many of the default modules have been left out on purpose for performance reasons. The minimal setup only needs 3 additions to the core: the PCRE library, the secure link (download) module, and the DAV module.

The main nginx configuration file can be found at /usr/local/nginx/conf/nginx.conf:

user wwwrun users;
worker_processes  100;

error_log  logs/error.log;

events {
    worker_connections  32768;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    tcp_nopush     on;
    keepalive_timeout  65;

    include sites-enabled/*;
}

This configuration file specifies the number of worker processes that are forked at startup, the maximum number of worker_connections, and the configuration directory to include, which specifies the servers. Hereafter you can find the nginx configuration for 1 domain, as other domains are merely more of the same. The server configuration (port 80) can be found in /usr/local/nginx/conf/sites-enabled/cl1.image.domain.tld:

server {
    listen       *:80;
    server_name  *.image.domain.tld;

    access_log  /var/log/nginx/image.domain.tld/access.log;
    error_log   /var/log/nginx/image.domain.tld/error.log;

    location / {
        root   /srv/images/domain/image;

        proxy_store         on;
        proxy_store_access  user:rw group:r all:r;
        proxy_temp_path     /tmp/image;

        if (!-f $request_filename) {
            proxy_pass  http://10.0.0.26:8080;
        }

        index  index.html index.htm;
        expires max;
        add_header Cache-Control public;
    }

    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   html;
    }
}

Important to note here is that the storage path for the logs, /var/log/nginx/image.domain.tld/, is actually a tmpfs location in memory; its contents are processed and truncated every 5 minutes. This is done to avoid excessive log writes to disk and to eliminate that possible bottleneck. The tmpfs mount parameters are:

mount -t tmpfs -o size=256m tmpfs /var/log/nginx/image.domain.tld
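
To make this survive a reboot and to handle the 5-minute processing, something along these lines can be used (a sketch; the log-processing script name is hypothetical and not part of our published setup):

# /etc/fstab entry so the log directory is tmpfs-backed after a reboot
tmpfs  /var/log/nginx/image.domain.tld  tmpfs  size=256m  0 0

# crontab entry: process and truncate the in-memory logs every 5 minutes
*/5 * * * *  /usr/local/bin/process-image-logs.sh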

As far as request flow goes, as described before, if the file is not found on disk the request will be proxied to the other node, and if the file is found there, the requested image will be stored on the local node as well. The configuration file containing the proxy and DAV configuration can be found below (/usr/local/nginx/conf/sites-enabled/cl1.image.domain.tld-proxy):

server {
    listen       10.0.0.25:8080;

    location / {
        root   /srv/images/domain/image;

        dav_methods PUT DELETE MOVE;

        limit_except GET {
            allow 10.0.0.0/24;
            deny  all;
        }

        index  index.html index.htm;
        expires max;
        add_header Cache-Control public;
    }
}

The above describes the DAV access methods and an interface to the local filesystem that the other server in the cluster can proxy to.
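
The DAV methods can be exercised directly from any host on the allowed internal network; for example (a sketch, with hypothetical paths), renaming and then removing a file would look like this. Keep in mind that lsyncd replays the resulting filesystem changes to the other node, so these operations propagate:

# Rename a file through the DAV interface (MOVE requires a Destination header)
curl -X MOVE -H "Destination: http://10.0.0.25:8080/a/b/c/654321-abcdef.jpg" \
     http://10.0.0.25:8080/a/b/c/123456-abcdef.jpg

# Remove the file again
curl -X DELETE http://10.0.0.25:8080/a/b/c/654321-abcdef.jpg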

If you managed to read the document up to this point, I should congratulate you: you now have in-depth knowledge of how to configure static file clusters. As an extra, the following schema describes the typical request flow in our image cluster:

[Figure: Request Flow Diagram]

The above schema gives a visual representation of how a request is typically processed.

That was my rather long blog post for the day; I hope it can help out some people seeking to do the same thing as we did. We currently have these clusters running in production and they perform very well; I hope they can do the same for your projects.

4 Responses to Highly Available Static Fileservers – Heartbeat, Nginx, Lsyncd, and RR-DNS

  1. zgldh says:

    If an image (named 123456-abcdef.jpg) is not on node-25 but on node-26,
    will this image be stored in the path ‘/tmp/image/123456-abcdef.jpg’ on node-25 after the proxy?

  2. Benjamin says:

    Hi zgldh,

    If the image is proxied, it will be stored in tmp during the proxy; after the proxy it will be stored in the exact same location as on node-26.

    Cheers,

    Benjamin

  3. zgldh says:

    I see.
    Thank you for your article and response.
    Good luck.
    zgldh

  4. Heaven says:

    Thanks for spending time on the computer (writing) so others don’t have to.
