I have worked for 10 years in companies that live or die by the uptime and reliability of their web sites. In that time a lot of things have changed. The tools for building feature-rich web applications are better than ever. Frameworks such as Rails and Catalyst make web development much, much less painful than it once was. The Model/View/Controller design used by these excellent frameworks has even influenced traditional client-side GUI application development practices, as anyone who has used Windows Presentation Foundation can see.
Despite better tools, web startups still run into performance trouble. With a single machine running everything, a highly dynamic site can’t handle many users. Adding a second machine dedicated to the database gives a little relief, but not much. Over a period of a few months the server goes from locking up every other week, to every week, to every few days, until finally it’s hard to go 24 hours without downtime. Owners should take action at this point, but most squirm like a lobster in a slowly heating pot. They keep rebooting, or perhaps try a faster server or more memory. Many never realize how dramatically performance problems reduce their chance of big success.
Downtime kills web business. If you are down, or even slow, those who would have become your customers will go elsewhere. They will find satisfaction at a competitor and they won’t be back.
If this sounds like your current situation, you need to do something about it before it’s too late.
If you can get your hands on faster hardware within 48 hours, do it. Don’t wait. You must act immediately to stop the bleeding.
The next step is to put a solution in place that enables redundancy and capacity planning. This not only lets you keep sleeping the next time your server crashes at 4am, it also makes it easy to predict when you will need more power, so you can order new servers well in advance instead of settling for whatever is available right now.
This solution rests on three components: Load Balancing, Centralized Storage, and Monitoring.
Load Balancing
Load balancing spreads incoming requests over multiple servers. I have used several different methods. Each method has advantages and drawbacks.
Round Robin DNS
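When a browser goes to your domain, it asks its name server to look up the IP address for that name and then connects to it. A name can have more than one IP address. In that case the browser connects to one of them, which spreads visitors across your servers. The drawback is that DNS knows nothing about server health: if one machine dies, some visitors keep being sent to it until you change the records and cached answers expire.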
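To make the idea concrete, here is a small simulation rather than a real DNS lookup. It assumes the name server rotates the order of its address records on each query and that every client simply connects to the first address it receives, which is a common arrangement but not the only one. The function names and the addresses (taken from the range reserved for documentation) are made up for illustration.

```python
from collections import Counter
from itertools import islice


def rotated_answers(addresses):
    """Yield the address list rotated by one position per query, forever."""
    if not addresses:
        raise ValueError("need at least one address")
    offset = 0
    while True:
        # Each answer starts one position further along than the last one.
        yield addresses[offset:] + addresses[:offset]
        offset = (offset + 1) % len(addresses)


def simulate_first_choice(addresses, queries):
    """Count how often each address would be a client's first choice."""
    counts = Counter()
    for answer in islice(rotated_answers(addresses), queries):
        # Assumption: the client connects to the first address in the answer.
        counts[answer[0]] += 1
    return counts
```

Calling `simulate_first_choice(["192.0.2.10", "192.0.2.11", "192.0.2.12"], 9)` gives each address three of the nine connections. Real traffic is lumpier because resolvers cache answers for different lengths of time.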
Reverse NAT
Reverse NAT, such as that found in pf, can redirect incoming connections to a pool of web servers. Using OpenBSD it is easy to set up a redundant pair of systems. The drawback to this layer 3 method is that your web servers can’t see the true IP address of the client. Everything comes from the reverse NAT device, so IP address based ACLs and GeoIP features are broken.
Reverse Proxy
The reverse proxy is an excellent solution because it works at the application (HTTP) layer. When the reverse proxy accepts an incoming connection, it waits for the request to come in and then initiates its own request on behalf of the client. It can add headers to its request, so the back end web servers can know the real IP address of the client. Perlbal, Apache, and BIG-IP make excellent reverse proxies. I will write about each of these in more detail in a future post.
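As a rough sketch of the header step only, the function below shows how a proxy might record the client address before passing a request on. It assumes the widely used `X-Forwarded-For` header, which the text above does not name. The function name is hypothetical, and this is not how any of the proxies listed above is implemented.

```python
def add_forwarding_headers(request_headers, client_ip):
    """Return a copy of the headers with the client address recorded.

    Assumption: the back end reads the conventional X-Forwarded-For header.
    """
    headers = dict(request_headers)  # leave the caller's dict untouched
    existing = None
    for name in list(headers):
        # Header names are case-insensitive, so match on the lowered form.
        if name.lower() == "x-forwarded-for":
            existing = headers.pop(name)
    if existing:
        # An earlier proxy already added an entry: append ours to the chain.
        headers["X-Forwarded-For"] = f"{existing}, {client_ip}"
    else:
        headers["X-Forwarded-For"] = client_ip
    return headers
```

The back end servers should trust this header only on requests that arrive from the proxy itself, since any client can send one.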
Centralized Storage
Depending on what features your site has, you may already have centralized storage. If your site stores all state and dynamic content in a SQL database then you’ll be fine. If there is anything stored on the file system it will have to be moved into the database, or MogileFS, which I will discuss in another post.
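Moving files off the local file system can be as simple as keeping them in a table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for whatever shared SQL database all of your web servers connect to. The table layout and function names are assumptions for illustration, and `INSERT OR REPLACE` is SQLite syntax that other databases spell differently.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) files table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def save_file(conn, name, data):
    """Store the bytes under the given name, replacing any earlier version."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
            (name, data),
        )


def load_file(conn, name):
    """Return the stored bytes, or None if nothing is stored under that name."""
    row = conn.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])
```

Because every web server reads and writes the same table, it no longer matters which machine handled the upload.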
Monitoring
One important aspect of operations is real time monitoring for problems. You need to know immediately if your site is down so you can take action. You also need to have resource usage history for troubleshooting and capacity planning.
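The simplest form of real time monitoring is a script that requests a page and reports whether it answered. This is a minimal sketch using only the Python standard library. The function name, the result layout, and the `opener` parameter (there only so the check can be exercised without a network) are my own assumptions, not part of any monitoring product.

```python
import time
import urllib.error
import urllib.request


def check_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch the URL once and report whether the site answered successfully."""
    started = time.monotonic()
    status, error = None, None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        status, error = exc.code, str(exc)
    except OSError as exc:
        # Connection refused, DNS failure, timeout and similar problems.
        error = str(exc)
    return {
        "up": status is not None and 200 <= status < 400,
        "status": status,
        "seconds": time.monotonic() - started,
        "error": error,
    }
```

Run from cron every minute on a machine outside your hosting network, a check like this can page you when `up` is false, and the `seconds` values can be logged to build a response time history.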
Nagios
Nagios will monitor your services for uptime, but it lacks performance graphs. It has some neat network maps, including a cool looking but useless 3D map.
Cacti
A better monitoring solution is Cacti. I recommend it over Nagios. There is excellent support for monitoring nearly every service you can think of. The graphs are very readable.
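The same usage history is what makes capacity planning possible. As an illustration, the function below fits a straight line through daily usage samples and estimates how many days remain before usage reaches a limit. It assumes growth is roughly linear, which real traffic often is not, and the function name and sample format are hypothetical.

```python
def days_until_capacity(samples, capacity):
    """Estimate days after the last sample until usage reaches capacity.

    samples is a list of (day, usage) pairs. Returns None when usage is
    flat or falling, because a straight line then never reaches capacity.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / n
    mean_use = sum(use for _, use in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")
    # Ordinary least-squares slope and intercept.
    slope = sum(
        (day - mean_day) * (use - mean_use) for day, use in samples
    ) / spread
    if slope <= 0:
        return None
    intercept = mean_use - slope * mean_day
    day_full = (capacity - intercept) / slope
    last_day = max(day for day, _ in samples)
    return max(0.0, day_full - last_day)
```

For example, with CPU usage of 40, 45, 50 and 55 percent on days 0 through 3, `days_until_capacity([(0, 40), (1, 45), (2, 50), (3, 55)], 100)` returns 9.0, which is your deadline for having the next server racked.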
Throw in some smart power and you’ll have a very reliable and easy to manage site.