Building for Failure

Software architecture

While most of us aren’t building websites for banks, medical centers, or the NCSA, it’s still important to focus on and be aware of the possible failure points of our applications. Doing so gives our products an attention to detail that will be appreciated by our user base and will evolve into real, tangible profits.

Building an application to handle various failure modes is simply another aspect of mindful, detail-oriented application design. Building for failure does not necessarily mean that the application must recover gracefully from all possible failures; rather, it means it should degrade gracefully in the face of uncertainty. A classic example of this is the “over capacity” error page sometimes presented to the (un)happy masses on Twitter. The error page includes an endearing image of an exhausted cartoon whale being airlifted by a squadron of Twitter birds. Ironically, this failure image has become a positive icon of the modern web world. It’s been seen on clothing and dishware, and it has even surfaced as a tattoo. The Twitter fail whale, concocted by a Twitter designer as a graceful way of conveying an internal technical failure to the end user, has its own fan club.

There are very few concrete techniques for building an application that handles failure gracefully. There is no fail_gracefully gem to be installed. Rather, building for failure is a philosophy, a school of thought much like agile development. That being said, there are some good rules of thumb to keep in mind whenever working on your codebase, such as “fail fast” and “never fail silently.” We’ll be discussing each of these in the following solutions.

AntiPattern: Continual Catastrophe

A well-seasoned technical manager once said that the best developers started as systems administrators and that the best systems administrators started as developers. There are many reasons we believe this to be true, not the least of which is that systems administrators are trained with a healthy dose of paranoia.

Let’s start by looking at this script:

cd /data/tmp/
rm -rf *

At first glance, you might not see an issue with this snippet. Clearly, it’s there to remove temporary files from the system, a feature likely added to increase the system’s reliability by keeping the storage array from overfilling.

This snippet kept me awake for nights. What would happen if someone renamed the data directory to application? The failure of the cd command would be ignored, rm would run in whatever directory the script started in, and everything in the root user’s home directory would be destroyed overnight. I fixed that script, but what other time bombs lay in wait for me the next morning?

bash, being designed for paranoid people such as myself, comes with set -e and the powerful && operator to help address these kinds of ignored failures. You need to apply the same sort of techniques to your Rails code as well.

Solution: Fail Fast

The “fail fast” philosophy is applicable both in application code and in utility code such as rake tasks and other support scripts. It most clearly manifests itself as sanity checks placed toward the top of the execution stack.

Whoa, There, Cowboy!

For example, consider the following class method on a Portfolio model, where each portfolio has a collection of photo files:

class Portfolio < ActiveRecord::Base
  def self.close_all!
    all.each do |portfolio|
      unless portfolio.photos.empty?
        raise "Can't close a portfolio with photos."
      end
      portfolio.close!
    end
  end
end

The Portfolio.close_all! method closes all portfolios, which has the side effect of deleting all the photo files for each portfolio. Consider what happens when a user with 100 portfolios clicks the Close All button. If the 51st portfolio still has files in it, the user is left with 50 open portfolios and a general sense of confusion.

Even though the following version is less performant than the preceding, it produces a much more consistent end user experience:

class Portfolio < ActiveRecord::Base
  def self.close_all!
    all.each do |portfolio|
      unless portfolio.photos.empty?
        raise "Some of the portfolios have photos."
      end
    end

    all.each do |portfolio|
      portfolio.close!
    end
  end
end

In the version above, you ensure that all the portfolios are empty before closing any of them. This avoids the half-closed scenario described earlier. It still leaves room for race conditions if the user uploads more photos while the method is running, but you can alleviate this via database-level locking.
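The check-everything-then-act idea can be tried outside Rails. The following is a minimal plain-Ruby sketch, not the model code itself: the Portfolio struct, its photos and closed fields, and the standalone close_all! helper are stand-ins assumed for illustration.

```ruby
# Stand-in for the ActiveRecord model; only what the sketch needs.
Portfolio = Struct.new(:name, :photos, :closed) do
  def close!
    self.closed = true
  end
end

# Hypothetical helper: verify every portfolio first, then close them all.
def close_all!(portfolios)
  blocked = portfolios.reject { |portfolio| portfolio.photos.empty? }
  unless blocked.empty?
    raise "Some of the portfolios have photos: #{blocked.map(&:name).join(', ')}"
  end

  portfolios.each(&:close!)
end

portfolios = [
  Portfolio.new("travel", [], false),
  Portfolio.new("family", ["beach.jpg"], false),
  Portfolio.new("work", [], false)
]

begin
  close_all!(portfolios)
rescue RuntimeError => e
  puts e.message
end

# No portfolio was closed, because the check ran before any close! call.
puts portfolios.count(&:closed)
```

Because the failure is raised before the first close! call, the caller never sees a partially closed set.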

Improve the User Experience

While raising an exception prevents the inconsistency outlined in the above scenario, it doesn’t present a very good user experience, as it allows the user to close all portfolios and simply presents a 500 error screen when that action fails. You can address this with some judicious extraction and double-checking at the Controller and View layers.

The first thing to do is to extract the sanity check into another class method on Portfolio:

class Portfolio < ActiveRecord::Base
  def self.can_close_all?
    ! all.any? { |portfolio| portfolio.photos.any? }
  end

  def self.close_all!
    raise "Can't close a portfolio with photos." unless can_close_all?

    all.each do |portfolio|
      portfolio.close!
    end
  end
end

Like most other method extraction refactorings, this has the added benefits of making the class easier to test and increasing the readability of the code in general. Now you can make use of this predicate class method in your views like this:

<% if Portfolio.can_close_all? %>
<%= link_to "Close All", close_all_portfolios_url, :method => :post %>
<% end %>

And, as a third check, you can add a before_filter to your controller action:

class PortfoliosController < ApplicationController
  before_filter :ensure_can_close_all_portfolios, :only => :close_all

  def close_all
    Portfolio.close_all!
    redirect_to portfolios_url,
                :notice => "All of your portfolios have been closed."
  end

  private

  def ensure_can_close_all_portfolios
    unless Portfolio.can_close_all?
      redirect_to portfolios_url,
                  :flash => { :error => "Some of your portfolios have photos!" }
    end
  end
end

While the business logic is properly extracted into a single method (can_close_all?), the repeated checks above might seem a bit redundant. However, situations in which you need to guard against irreversible actions call for this layered approach.


A further benefit of pushing sanity checks toward the top of your execution stack is readability. It’s easy to see, in the following example, what sanity checks are being run:

def fire_all_weapons
  weapons.each { |weapon| weapon.fire! }
end

If the sanity checks are scattered throughout the execution stack, the purpose is obscured, hindering the readability and maintainability of the codebase.
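As a rough illustration of top-loaded sanity checks, here is a plain-Ruby sketch. Every name in it (Battleship, NotReady, and the three ensure_ guard methods) is hypothetical and chosen only for the example.

```ruby
# Hypothetical example: the class and guard names are invented for illustration.
class Battleship
  class NotReady < StandardError; end

  attr_reader :fired

  def initialize(authorized:, target_locked:, weapons:)
    @authorized = authorized
    @target_locked = target_locked
    @weapons = weapons
    @fired = []
  end

  def fire_all_weapons
    # Every sanity check runs first, before any irreversible work.
    ensure_authorized!
    ensure_target_locked!
    ensure_armed!

    @weapons.each { |weapon| @fired << weapon }
  end

  private

  def ensure_authorized!
    raise NotReady, "not authorized to fire" unless @authorized
  end

  def ensure_target_locked!
    raise NotReady, "no target locked" unless @target_locked
  end

  def ensure_armed!
    raise NotReady, "no weapons available" if @weapons.empty?
  end
end

ship = Battleship.new(authorized: true, target_locked: false, weapons: %w[laser torpedo])

begin
  ship.fire_all_weapons
rescue Battleship::NotReady => e
  puts "Aborted: #{e.message}"
end
```

A reader can tell what must hold before anything fires by reading the first three lines of the method.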

Consistency Breeds Trust

As is the case with all the other solutions in this article, using the “fail fast” philosophy helps ensure that your application will behave consistently in the face of adversity. This, in turn, is one of the pillars of producing a user base that trusts you and your application.

AntiPattern: Inaudible Failures

An important part of the motivation behind building for failure is the end user’s experience. Functionality that is error prone or inconsistent leaves the user not trusting your software. Oftentimes, we’ve seen code samples that look something like the following:

class Ticket < ActiveRecord::Base
  def self.bulk_change_owner(user)
    all.each do |ticket|
      ticket.owner = user
      ticket.save
    end
  end
end

The purpose of this code is fairly clear: Ticket.bulk_change_owner(user) loops through each ticket, assigning the user to it and saving the modified record.

You can modify the Ticket model by adding a validation to ensure that each ticket’s owner is also a member of that ticket’s project:

class Ticket < ActiveRecord::Base
  validate :owner_must_belong_to_project

  def owner_must_belong_to_project
    unless project.users.include?(owner)
      errors.add(:owner, "must belong to this ticket's project.")
    end
  end

  ...
end

It’s important to always keep the end user in mind when working on absolutely any part of an application.

Consider, for example, what happens when a user attempts to assign another user to a bunch of tickets across multiple projects. Ticket.bulk_change_owner works fine for any ticket whose project has the user being assigned as a member and silently swallows validation errors for all other tickets. The end results are an inconsistent and buggy experience for the users and an unhappy customer for you.

Solution: Never Fail Quietly

The precise culprit in the bulk_change_owner method in the preceding section is the inappropriate use of save instead of save!. However, the underlying issue is the silent swallowing of errors in general. This can be caused by ignoring negative return values, as in the code above, or by accidentally swallowing unexpected exceptions through overzealous use of the rescue statement.
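The first cause, an ignored return value, can be reproduced without Rails. The sketch below uses a hypothetical FakeTicket class whose save and save! methods only imitate the return-false versus raise behavior of their ActiveRecord namesakes; it is an illustration, not ActiveRecord itself.

```ruby
# Hypothetical stand-in for a validated model.
class FakeTicket
  class Invalid < StandardError; end

  attr_accessor :owner
  attr_reader :saved_owner

  def initialize(allowed_owners)
    @allowed_owners = allowed_owners
  end

  def valid?
    @allowed_owners.include?(owner)
  end

  # Like save: reports failure through the return value.
  def save
    return false unless valid?

    @saved_owner = owner
    true
  end

  # Like save!: reports failure by raising.
  def save!
    save || raise(Invalid, "#{owner} must belong to this ticket's project.")
  end
end

tickets = [FakeTicket.new(%w[ann bob]), FakeTicket.new(%w[ann])]

# Ignoring the return value: the second ticket is quietly left unsaved.
tickets.each do |ticket|
  ticket.owner = "bob"
  ticket.save
end
silent_failures = tickets.count { |ticket| ticket.saved_owner.nil? }

# Using the bang variant: the same problem surfaces immediately.
loud_failure =
  begin
    tickets.each do |ticket|
      ticket.owner = "bob"
      ticket.save!
    end
    nil
  rescue FakeTicket::Invalid => e
    e.message
  end

puts "Silent failures: #{silent_failures}"
puts "Loud failure: #{loud_failure}"
```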

Embrace the Bang!

In the code above, Ticket.bulk_change_owner should have been originally written to use the save! method—which raises an exception when a validation error occurs—instead of save. Here’s the same code as before, this time using save!:

class Ticket < ActiveRecord::Base
  def self.bulk_change_owner(user)
    all.each do |ticket|
      ticket.owner = user
      ticket.save!
    end
  end
end

Now, when the exception happens, the user will be made aware of the issue (you’ll see how later in this solution). There is still the issue of having updated half of the tickets before encountering the exception. To alleviate that, you can wrap the method in a transaction, as in the following example:

class Ticket < ActiveRecord::Base
  def self.bulk_change_owner(user)
    transaction do
      all.each do |ticket|
        ticket.owner = user
        ticket.save!
      end
    end
  end
end

Make It Pretty

In the preceding section, the user is made aware of the fact that a problem occurred, and you no longer have the problem of inconsistent data. However, showing a 500 page isn’t the best way to communicate with your public.

One quick way of producing a better user experience is to make use of the Rails rescue_from method, which you can leverage to display custom error pages to users when certain exceptions occur. While you could add your own exception for the Ticket.bulk_change_owner method, you’ll keep it simple for now and just rescue any ActiveRecord::RecordInvalid exception that finds its way to the end user:

class ApplicationController < ActionController::Base
  rescue_from ActiveRecord::RecordInvalid, :with => :show_errors
end

Never Rescue nil

A mistake we commonly see in the wild involves developers accidentally hiding unexpected exceptions through incorrect use of the rescue statement. This always involves using a bare rescue statement without listing the exact exceptions the developer is interested in catching.

Consider this snippet of code, which calls the Order#place! method:

order_number = order.place! rescue nil
if order_number.nil?
  flash[:error] = "Unable to reach Fulfillment House." +
                  " Please try again."
end

The Order#place! method contacts the fulfillment house in order to have it ship the product. It also returns the fulfillment house’s internal order number. The code makes use of an inline rescue statement to force the returned order number to nil if an exception was raised while placing the order. It then checks for nil in order to show the user a friendly message to ask them to try again.

Let’s take a look at the implementation for the Order#place! method:

class Order < ActiveRecord::Base
  def place!
    fh_order = send_to_fulfillment_house!
    self.fulfillment_house_order_number = fh_order.number
    save!
    return fh_order.number
  end
end

Here, the Order#place! method is calling the send_to_fulfillment_house! method, which is where the earlier example expected the exception to originate. Unfortunately, the place! method also calls save!, and there lies the rub.

The inline rescue nil on the call to place! not only swallows any network errors that occurred during the send_to_fulfillment_house! call, it also hides any validation errors raised by the save! call. To make matters worse, the flash message instructs the user to attempt to place the order again, which means the fulfillment house will end up sending multiple products to the user because of a simple validation error on your end.

The root issue is using a blanket rescue statement without qualifying which exceptions to catch. The correct solution is to collect the exceptions you want to catch in a constant and rescue those explicitly.
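A minimal sketch of that approach in plain Ruby follows. The constant name FULFILLMENT_EXCEPTIONS, the particular exception classes in it, the ValidationFailed class, and the place_order wrapper are all assumptions made for the example; the right list depends on the network client actually in use.

```ruby
require "timeout"
require "socket"

# Assumed list of network failures that justify a "please try again" message.
FULFILLMENT_EXCEPTIONS = [Timeout::Error, SocketError, Errno::ECONNREFUSED].freeze

# Hypothetical stand-in for a validation failure raised while saving.
class ValidationFailed < StandardError; end

# Hypothetical wrapper: the block stands in for the call that places the order.
def place_order
  yield
rescue *FULFILLMENT_EXCEPTIONS
  nil # only the listed network errors are turned into "no order number"
end

def failing_save
  raise ValidationFailed, "order is invalid"
end

# A bare inline rescue hides the validation failure completely.
swallowed = failing_save rescue nil

# The explicit list converts network errors into nil...
network_result = place_order { raise Timeout::Error, "fulfillment house timed out" }

# ...but lets anything unexpected propagate to the caller.
validation_error =
  begin
    place_order { failing_save }
  rescue ValidationFailed => e
    e
  end

puts swallowed.inspect
puts network_result.inspect
puts validation_error.class
```

With the explicit list, a validation failure can no longer be mistaken for a network outage, so the user is never told to retry an order that already reached the fulfillment house.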

Big Brother Is Watching

Producing a consistent and trustworthy user experience is just one benefit of writing code that fails loudly. The other critical benefit has to do with instrumentation. Consider the before_create callback on the Tweet model shown here:

class Tweet < ActiveRecord::Base
  before_create :send_tweet

  def send_tweet
    twitter_client.update(body)
  rescue *TWITTER_EXCEPTIONS => e
    HoptoadNotifier.notify e
    errors.add_to_base("Could not contact Twitter.")
  end
end

Here, you let the user know that you had issues contacting Twitter (a very rare situation, indeed) by setting a validation error. In addition, you record that fact via the Hoptoad service to ensure that the development team is aware of any connectivity issues or of general downtime with the external service.
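The same rescue, notify, and inform pattern can be exercised without Rails or a real error service. In this sketch, FakeNotifier is a hypothetical in-memory stand-in for an error reporting client, and Message, REMOTE_EXCEPTIONS, and the lambda client are likewise invented for illustration.

```ruby
require "timeout"

# Assumed list of failures that mean "the remote service is unreachable".
REMOTE_EXCEPTIONS = [Timeout::Error, Errno::ECONNREFUSED].freeze

# Hypothetical in-memory stand-in for an error reporting service.
module FakeNotifier
  def self.reports
    @reports ||= []
  end

  def self.notify(exception)
    reports << exception
  end
end

class Message
  attr_reader :errors

  def initialize(client)
    @client = client
    @errors = []
  end

  # Returns true on success. On a known remote failure, it tells both the
  # monitoring system and the user, then returns false.
  def deliver(body)
    @client.call(body)
    true
  rescue *REMOTE_EXCEPTIONS => e
    FakeNotifier.notify(e)
    @errors << "Could not contact the remote service."
    false
  end
end

failing_client = ->(_body) { raise Errno::ECONNREFUSED }
message = Message.new(failing_client)
delivered = message.deliver("hello")

puts delivered
puts message.errors.inspect
puts FakeNotifier.reports.length
```

The failure reaches two audiences at once: the user sees a readable error, and the team gets a record of the underlying exception.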


In the examples in this article, we’ve used Hoptoad, the most popular error logging service for Rails applications. However, there are other services and plugins, such as exception_notification, Exceptional, and New Relic RPM.

The Takeaway

You should never ignore exceptions and negative return values. Instead, you should bubble them up to both the end user and to a monitoring system. Doing so ensures that your user’s experience remains consistent, which, as we’ve said before, is key to building a relationship of trust between users and your application. In addition, it removes your team’s blindfold and keeps you aware of the errors your users experience.

Rails™ AntiPatterns: Best Practice Ruby on Rails™ Refactoring
By: Chad Pytel; Tammer Saleh