March 18, 2016

Safe bash scripting: ‘set -e’ is not enough!

Filed under: Rants,Technical — Tags: , — James Bunton @ 10:53 pm

I know, I know, shell scripting is a bad way to write software. I’ve always had a bit of a soft spot for bash though. Sure there are plenty of gotchas in bash scripting, but isn’t that the case for every programming language?

I like bash scripts

Sometimes you just want to start processes, do some piping, filtering and file redirection. For simple tasks like this a language like Python just gets in the way and makes the code more verbose.

For example the following simple bash script takes stdout from a Java app and conditionally passes it as a command line parameter to a Python application. It does this in 5 lines. No other language makes this so easy! :)

output="$(java -jar SomeApp.jar | grep somevalue)"
if [ -n "$output" ]; then
    ./ "$output"

My recent problem


set -e

function init_database {
    curl -o data.sql.gz --fail --retry 3
    gunzip data.sql.gz
    psql mydb < data.sql

if ! init_database; then
    echo "Failed to init database!"
    exit 1

echo "Do more things..."

I have a function which downloads some SQL, decompresses it and loads it into the database. If there's an error at any point I want to detect this and handle it. Specifically I expect that in the snippet above if there is an downloading the file, then init_database will stop and I will see the message "Failed to init database", followed by the script exiting. That's what the set -e option does right?

Not so! If you try these here's what actually happens:

  • curl --fail returns a non-zero exit code, perhaps because of a networking error.
  • gunzip is executed and decompresses as much as it can.
  • psql is executed and inserts whatever data is available. It returns 0 because the file exists and it connected to the DB.
  • init_database returns 0 because the last command it executed returned 0
  • The script continues and prints "Do more things..."
  • Probably it fails much later in an annoying and/or surprising way.

Why why why!?

set -e corresponds to the bash shell option errexit. The errexit feature seems to have been hacked into the language without much up-front design. I guess this is because of the 40-something year history of the Unix shell. Bash itself dates back to 1989.

Internally when you enable the errexit option bash sets a global variable exit_immediately_on_error. Now that's a revealing choice of variable name. The shell doesn't exit immediately on all errors, even with the option enabled. If it did then all sorts of things would be impossible. In the example below the expression in the if statement is sometimes going to exit with an error. This should not cause the script to exit, merely to skip over the contents of the if block.

if [ -f somefile ]; then
    gzip somfile

This behaviour is described in the documentation for the set builtin.

The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command’s return status is being inverted with !.

This works because whenever bash starts executing the conditional in an if statement it disables the errexit feature. It leaves this disabled until the conditional has been evaluated. That means the entire init_database function is executed as if we had never asked for errexit! What's more, even if you try to set -e inside your function bash has a special case to ignore this, otherwise your shell would exit if your conditional returned false. This is the reason that the error handling did not work in the the init_database earlier.

So how do I safely execute a function and check its result?

Glad you asked! Here's a simple example you can run yourself. I've tested this with Bash 4.3.42, but I believe the behaviour should be the same on older versions.

See the inline comments in the script for explanations of why each example fails or succeeds.

set -e

function do_work {
    echo Working

    false # fail to do something

    echo "We should never get here, there should be no more work!"

echo "Start1"
if ! do_work; then
    echo "Failed work, this is what we want."
echo "End1"
# bash disables errexit in the function because it is executed as
# "part of the test in an if statement"

echo "Start2"
do_work || echo "Failed work, this is what we want."
echo "End2"
# bash disables errexit in the function because it is executed as
# "part of any command executed in a && or ||"

echo "Start3"
set +e
if [ "$?" -ne 0 ]; then
    echo "Failed work, this is what we want."
set -e
echo "End3"
# This obviously doesn't work because I've explicitly disabled errexit.
# If I left errexit enabled then when do_work fails the shell would
# exit immediately, which means my if statement would not be executed.

echo "Start4"
do_work | cat
if [ "${PIPESTATUS[0]}" -ne 0 ]; then
    echo "Failed work, this is what we want."
echo "End4"
# Success! This example leaves errexit enabled for the function, however
# the exit status is still available to be checked in PIPESTATUS array.
# Note that this won't work if enable the non-default option: set -o pipefail

Other issues to be aware of with bash scripting

Just for the record here's a few other common problems I see in shell scripting:

  • Variables must be quoted or the tokeniser will split on spaces within the variable.
  • People tend to write spaghetti code instead of using functions.
  • Functions cannot have named parameters.
  • It's not practical to build new data structures, however the builtin arrays and maps work well enough.
  • Reading unset variables gives an empty string. Use set -u.
  • Almost all variables you set are global to the script. Remember to use local foo=1 in your functions!
  • Variables set inside subshells cannot be shared with the rest of the script. This includes while loops, but not for loops.
  • If a command fails by returning non-zero status execution will cheerfully continue. Use set -e.
  • Errors from commands in a pipeline are ignored, except for the command at the end. This will succeed: false | true. You can use set -o pipefail

My recommendations on safe bash scripting

I think it would be a mistake to be too afraid of using bash just because of these problems. There are no other languages which come close to the simplicity of shell scripting when it comes to executing a series of commands in order.

If you follow these rules then you'll be ok:

  • Always use set -eu at the top of your scripts.
  • Always put quotes around everything!
  • In a pipeline don't use commands which may fail, except as the last command.
  • Don't use backticks, instead use: x="$(echo foo)".
  • Use functions for any script more than a few lines long. This is better than using comments to describe the intended behaviour of a series of commands.
  • Use local variables in your functions whenever possible.
  • Don't call functions from within an if conditional.
  • Consider writing your script in another language like Python. Particularly if your script is doing lots of complex tests and conditionals, rather than simply being a series of commands executed in sequence.


warchinal says:

Would a short fix to your problem above (i.e. wanting init_database to return a non-zero status if the curl fails) be the following?

function init_database {
    curl -o data.sql.gz --fail --retry 3
    curl_status="$(echo $?)"
    gunzip data.sql.gz
    psql mydb < data.sql
    [ curl_status -eq 0 ]

That could allow you to explicitly care about the status of a certain command(s) in your function, and have the function's return value reflect that status.

James Bunton says:

Hi Will!

Yes something like that would a simple fix to this specific problem. Actually I’d check the curl exit status and immediately return from after the curl without running gunzip or psql.

However I wanted a general fix. If gunzip or psql fails, or if one of the other functions that is called from an if conditional fails I want all of these to work. I don’t want to have to check the exit code of every command, that’s what set -e is supposed to do for me! :)

warchinal says:

Hey James :)

Thanks for the quick reply. Given that this was obviously a solution to a real world problem (in fact I think I recognize the code) I’m glad to hear you’ve made a more sustainable solution in favor of the quick fix.

Leave a Reply

Your email address will not be published. Required fields are marked *