When Your Deploy Script Fails at 3 AM: A Field Guide to Emergency Fixes

I got the page at 3:17 AM on a Tuesday.

Production was down. The deploy script had hung halfway through a database migration. Half the tables were migrated. Half were not. The health check endpoint was returning 500s. Customers could not log in.

I SSHed into the production box. The deploy script was still running. Or it looked like it was running. The process existed but nothing was happening. No CPU usage. No disk I/O. Just sitting there.

I have been doing this for over 20 years. This exact scenario has woken me up more times than I can count.

The problem was not the script. The script had worked fine for 6 months. The problem was that something in the environment had changed and the script did not know how to handle it. The database connection pool was exhausted. The migration was waiting for a lock that would never release. And the script had no timeout configured.

I killed the process. Rolled back manually. Brought production back up. Spent the next 2 hours adding proper timeout handling and connection pool limits to the deploy script.

That was 8 years ago. I still check for timeout configurations first when a deploy hangs.

The Five Failure Patterns I See Every Time

Most deployment failures fall into five categories.

I have debugged hundreds of broken deploys. Maybe thousands. The failure modes repeat constantly.

Permission and Credential Failures

This is the most common problem.

Your script works locally because your user has permission to write to /var/www. The deploy user in production does not. The script fails with a cryptic error about file creation.

Or the AWS credentials expired. Or the SSH key changed. Or the Docker registry token is invalid.

These failures are hard to debug remotely. The error messages are generic. "Permission denied" could mean file permissions or credential expiry or IAM policy restrictions.

I always check credentials first now. Before anything else. Is the API token still valid? Does the service account have the right IAM role? Can the deploy user actually write to the target directory?
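A preflight for the last of those checks can be a throwaway write probe. This is a sketch, not my production script: the /tmp/deploy-target path is a stand-in for your real deploy directory.

```shell
#!/usr/bin/env bash
# Sketch: verify the deploy user can actually write to the target
# before doing anything destructive. TARGET is an illustrative path.
set -euo pipefail

TARGET=/tmp/deploy-target        # substitute your real deploy directory
mkdir -p "$TARGET"

# Probe with a throwaway file instead of discovering the problem
# halfway through the deploy.
probe="$TARGET/.write-probe.$$"
if touch "$probe" 2>/dev/null; then
  rm -f "$probe"
  echo "write check passed: $TARGET as $(id -un)"
else
  echo "ERROR: cannot write to $TARGET as $(id -un)" >&2
  exit 1
fi

# Cloud credentials expire too; a real script might also run e.g.:
#   aws sts get-caller-identity >/dev/null || exit 1
```

The same pattern works for any credential: make the cheapest possible authenticated call up front and fail loudly before the deploy starts.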

Network and Timeout Issues

Your script pulls a Docker image. Works fine locally. Times out in production.

Why? The production server is behind a firewall. Docker Hub is rate-limiting your IP. The DNS entry for your private registry has not propagated yet.

Network failures look like application failures. The script hangs. No error message. Just silence.

I have learned to add timeouts to everything. Curl gets a timeout. Docker pull gets a timeout. Database connections get a timeout. SSH commands get a timeout.

If it touches the network it can hang forever. Plan accordingly.
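The pattern looks like this. The registry URL is a placeholder; curl's --connect-timeout/--max-time flags and the coreutils timeout wrapper are real, and timeout exits with status 124 when it kills the command.

```shell
#!/usr/bin/env bash
# Timeout sketch for anything that touches the network.
set -euo pipefail

# curl: bound the connect phase and the whole transfer separately
#   curl --connect-timeout 5 --max-time 60 https://registry.example.com/v2/

# Anything else: wrap it in `timeout`, which kills the command after
# the limit. Here a 10-second sleep stands in for a hung docker pull.
if timeout 2 sleep 10; then
  echo "finished in time"
else
  echo "timed out with status $?"   # 124 means the limit was hit
fi
```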

State and Idempotency Problems

Your script assumes a clean slate.

It creates a database. But the database already exists from a previous failed deploy. The script exits with an error. "Database already exists."

Or it tries to delete a temporary directory that is not there. Or it assumes a configuration file does not exist yet.

Scripts fail when reality does not match assumptions.

I make my scripts idempotent now. Check if the resource exists before creating it. Delete only if the file is there. Handle both cases gracefully.

The script should work correctly whether you run it once or ten times.
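A minimal sketch of the idea, using throwaway /tmp paths. Each helper succeeds whether or not the work was already done, so running the script twice changes nothing.

```shell
#!/usr/bin/env bash
# Idempotent helpers: safe to run once or ten times.
set -euo pipefail

ensure_dir() {
  mkdir -p "$1"                         # no error if it already exists
}

remove_if_present() {
  [ -e "$1" ] && rm -rf "$1" || true    # no error if it is already gone
}

ensure_dir /tmp/deploy-demo
ensure_dir /tmp/deploy-demo             # second run: still fine
remove_if_present /tmp/deploy-demo/stale
remove_if_present /tmp/deploy-demo/stale
echo "idempotent helpers ran twice without error"
```

Many tools have an equivalent built in: mkdir -p, CREATE TABLE IF NOT EXISTS, rsync. Prefer those over hand-rolled existence checks where they exist.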

Dependency Version Mismatches

It works on your laptop. Breaks in production.

You are running Node 18. Production is still on Node 16. Your script uses a feature from 18.

Or your local Bash is 5.1. Production macOS has Bash 3.2. The associative array syntax fails.

Or you have a newer version of curl. The command flags you use do not exist in the older version.

Version mismatches are invisible until they blow up.

I pin versions now. Specify exact versions in package.json. Document minimum bash version. Check the Node version before the script runs.

If a dependency matters specify exactly which version you need.
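A version gate can be a few lines at the top of the script. This is a sketch: the required Node major version is an example, and require_node_major is my own illustrative helper, not a standard tool.

```shell
#!/usr/bin/env bash
# Sketch: refuse to run if the Node runtime is older than required.
set -euo pipefail

require_node_major() {
  local want=$1 have
  if ! command -v node >/dev/null; then
    echo "ERROR: node not found on PATH" >&2; return 1
  fi
  have=$(node --version)       # e.g. v18.19.0
  have=${have#v}               # strip the leading v
  have=${have%%.*}             # keep the major version only
  if (( have < want )); then
    echo "ERROR: need Node >= $want, found $have" >&2; return 1
  fi
  echo "node major version ok: $have"
}

# require_node_major 18
```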

Resource Exhaustion

Disk is full. Out of memory. File descriptor limit reached. Connection pool exhausted.

The script does not check. It assumes resources are infinite. They are not.

I have seen deploys fail because /tmp was full. Because the system ran out of memory during npm install. Because too many log files were open.

Resource limits are real. Check them. Fail early with a clear message if you are about to exceed them.

Add a disk space check at the start of your script. Monitor memory usage during heavy operations. Close file handles when you are done with them.
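The disk check is a few lines. The 500 MB floor and the /tmp mount point here are arbitrary examples; df -Pk gives POSIX-stable output with the available space in column four.

```shell
#!/usr/bin/env bash
# Sketch: fail early if free disk is below a floor.
set -euo pipefail

need_mb=500
free_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
free_mb=$(( free_kb / 1024 ))

if (( free_mb < need_mb )); then
  echo "ERROR: only ${free_mb}MB free on /tmp, need ${need_mb}MB" >&2
  exit 1
fi
echo "disk check passed: ${free_mb}MB free on /tmp"
```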

What Your Error Messages Are Not Telling You

Generic errors hide the real problem.

"Command failed with exit code 1" tells you nothing. What command. Why did it fail. What was the actual error.

I see this constantly. Scripts that swallow useful error messages and replace them with useless summaries.

Silent Failures in Chained Commands

You chain three commands with pipes. The first one fails. The script continues anyway. You only see the error from the last command.

This is the default behavior in Bash. It is terrible for deploy scripts.

I use set -o pipefail now. If any command in a pipe fails the whole thing fails. No silent failures.
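You can see the difference in four lines. A pipeline's exit status is normally the last command's, so the failing producer is invisible until pipefail is enabled.

```shell
#!/usr/bin/env bash
# Demonstration: default pipeline behavior versus pipefail.

false | cat
echo "without pipefail: exit status $?"   # 0, the failure is swallowed

set -o pipefail
false | cat
echo "with pipefail: exit status $?"      # 1, the failure surfaces
```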

Exit Code 0 Does Not Mean Success

Some commands return 0 even when they fail.

Curl returns 0 if it successfully made an HTTP request. Does not matter if the server returned a 500 error. The request succeeded, so the exit code is 0.

I add curl flags to fail on HTTP errors. I check the response. I do not trust exit codes alone.
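One way to do that, sketched here with a placeholder URL. curl's -w '%{http_code}' and --fail flags are real; capturing the code instead of relying on --fail alone lets you log the status before failing.

```shell
#!/usr/bin/env bash
# Sketch: capture the HTTP status explicitly instead of trusting
# curl's exit code.
set -euo pipefail

check_http() {
  local url=$1 code
  code=$(curl --silent --max-time 10 -o /dev/null -w '%{http_code}' "$url")
  if [[ $code != 2* ]]; then
    echo "ERROR: $url returned HTTP $code" >&2
    return 1
  fi
  echo "OK: $url returned HTTP $code"
}

# check_http "https://app.example.com/health"
```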

Logs Written to Places You Are Not Looking

The script logs to /var/log/deploy.log. You are checking /tmp/deploy.log. You see nothing. You assume the script produced no output.

The logs exist. You are just looking in the wrong place.

I standardize log locations now. Always write to the same path. Document where logs go. Make it easy to find them during an emergency.

The Debugging Checklist I Actually Use

When a deploy fails I follow the same process every time.

This checklist has saved me hours of random guessing.

First 30 Seconds: What Changed

The script worked yesterday. It does not work today. What changed?

Did someone update a dependency? Did the server get patched? Did a certificate expire? Did we hit a rate limit we never hit before?

Most failures are caused by environmental changes. Find the change and you find the problem.

I check recent commits. I check server updates. I check for certificate expiry dates. I check for new firewall rules.

Something changed. Always.

Verify Assumptions

The script assumes the database is running. Is it actually running?

The script assumes it can write to /var/www. Can it actually write there?

The script assumes network connectivity. Is the network actually working?

I verify every assumption. Manually. SSH into the box. Try to connect to the database. Try to write a test file. Try to curl the API endpoint.

Assumptions are where bugs hide.

Add Verbose Logging Without Breaking Things Further

I add set -x to see what commands are running.

But set -x for the entire script is too noisy. I add it around the section that is failing.

I add echo statements before critical operations. "About to connect to database." "About to pull Docker image." "About to restart service."

Logging shows me exactly where the script stops.

Isolate the Exact Line

The script is 300 lines long. Which line is failing?

I comment out sections. I run parts of the script manually. I narrow it down.

Once I know the exact command that fails I can debug that command in isolation.

Reproduce the Failure in a Safe Environment

I create a staging server that matches production. I run the deploy script there. I watch it fail in the same way.

Now I can experiment. Change timeouts. Add retries. Fix permissions. See if it works.

I never test fixes in production. Too risky. Staging first. Always.

Common Script Patterns That Fail Under Pressure

Certain coding patterns look fine but break in production.

I have seen these patterns fail so many times I avoid them automatically now.

Relative Paths

Your script uses ./config/deploy.yml. Works fine when you run it from the project root. Fails when cron runs it from /root.

Relative paths depend on the current working directory. The current working directory is not guaranteed.

I use absolute paths now. Or I set the working directory explicitly at the start of the script. cd to the known location. Then use relative paths from there.
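The two-line version of that fix, at the very top of the script:

```shell
#!/usr/bin/env bash
# Pin the working directory to the script's own location so relative
# paths resolve the same under cron, CI, and manual runs.
set -euo pipefail

cd "$(dirname "$0")"
echo "working directory pinned to: $PWD"

# From here, ./config/deploy.yml always means the same file.
```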

Unquoted Variables

You have a variable called FILE_PATH. Its value contains a space. You use it unquoted in a command. Bash splits it into two arguments. The command fails.

I quote all variable expansions now. "$FILE_PATH" not $FILE_PATH. Every time.

This prevents so many stupid bugs.
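Here is the bug in miniature. The unquoted expansion is word-split on the space and becomes two arguments; the quoted one stays whole.

```shell
#!/usr/bin/env bash
# Demonstration: unquoted versus quoted variable expansion.

FILE_PATH="/tmp/deploy demo/app.conf"

count_args() { echo $#; }

count_args $FILE_PATH     # word-split: prints 2
count_args "$FILE_PATH"   # quoted: prints 1
```

shellcheck flags every unquoted expansion like this automatically; running it in CI catches the whole class of bug.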

Missing Error Handling

You assume the command will succeed. You do not check the exit code. The command fails silently. The script continues. Everything breaks later in confusing ways.

I use set -e now. Exit immediately if any command fails. Or I check exit codes explicitly.

No silent failures. If something breaks I want to know immediately.

Race Conditions in Parallel Execution

You run multiple background processes. They all write to the same log file. The log output is corrupted. Or they all try to create the same directory. One succeeds. The others fail.

Parallel execution is fast. It is also full of race conditions.

I serialize critical sections. Use file locks. Write to separate log files. Make sure only one process touches shared resources at a time.
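The standard tool for the file-lock approach is flock(1) from util-linux, so this sketch is Linux-specific. The lock file path is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: serialize the critical section so only one deploy process
# touches shared state at a time.
set -euo pipefail

LOCKFILE=/tmp/deploy-demo.lock

(
  # -n: fail immediately instead of queueing behind another deploy
  flock -n 9 || { echo "another deploy holds the lock" >&2; exit 1; }
  echo "lock acquired: migrating, flipping symlinks, restarting"
  # The lock is released when the subshell closes file descriptor 9.
) 9>"$LOCKFILE"

echo "critical section complete"
```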

Assumptions About Filesystem State

Your script deletes /tmp/deploy-temp at the start. But the directory does not exist. The rm command fails. The script exits.

Or the script expects a file to exist. It does not. cat fails. Script exits.

I check before acting. If the directory exists then delete it. If the file exists then process it. Handle both cases.

Tools and Techniques That Actually Help

Some debugging techniques are genuinely useful.

Others sound good but do not help in practice. Here is what actually works.

Using Trap for Cleanup and Debugging

I use trap to run cleanup code when the script exits. Even if it exits due to an error.

trap "rm -rf /tmp/deploy-temp" EXIT

Now the temp directory gets cleaned up no matter what happens.

I also use trap for debugging. trap 'echo "Failed at line $LINENO"' ERR. When a command fails I see exactly which line.
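Both traps fit together like this, sketched against a throwaway temp directory. The -E flag is what makes the ERR trap fire inside functions too.

```shell
#!/usr/bin/env bash
# Cleanup on any exit, plus a pointer to the failing line on error.
set -Eeuo pipefail

trap 'rm -rf /tmp/deploy-demo-tmp' EXIT
trap 'echo "Failed at line $LINENO" >&2' ERR

mkdir -p /tmp/deploy-demo-tmp
echo "working in /tmp/deploy-demo-tmp"
# Any command that fails from here on reports its line number, and the
# temp directory is removed on any exit, success or failure.
```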

Strategic Use of Set -x

set -x shows every command before it runs. Useful for debugging. Incredibly noisy for a full deploy script.

I turn it on for specific sections. set -x before the problematic code. set +x after. I get debugging output where I need it without drowning in noise.

Logging to Separate Files for Each Stage

One big log file gets messy. Hard to find the relevant part during an emergency.

I write each stage to its own log file. build.log. test.log. deploy.log. migrate.log.

When the migration fails I look at migrate.log. Not the full deploy log.

Separation makes debugging faster.
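One way to get per-stage logs without losing live output is a small wrapper around tee. The paths and stage bodies here are examples.

```shell
#!/usr/bin/env bash
# Sketch: one log file per stage under a per-run directory.
set -euo pipefail

LOG_DIR="/tmp/deploy-logs/run-$$"
mkdir -p "$LOG_DIR"

run_stage() {
  local name=$1; shift
  echo "=== $name ==="
  # Show output live and keep a per-stage copy at the same time.
  "$@" 2>&1 | tee "$LOG_DIR/$name.log"
}

run_stage build   echo "compiling application"
run_stage migrate echo "applying migrations"

echo "logs in: $LOG_DIR"
```

With pipefail enabled, a failing stage still fails the wrapper, so the logging never hides an error.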

Health Checks and Smoke Tests After Deployment

The deploy script finishes. Exit code 0. But the application is broken.

I run health checks after every deploy. curl the health endpoint. Check the HTTP status. Make sure the app actually responds.

I run smoke tests. Log in. Create a test record. Read it back. Verify core functionality works.

Exit code 0 means the script ran. Health checks mean the application works. Big difference.
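A post-deploy health gate can be a short retry loop. Everything here is illustrative: the URL, attempt count, sleep interval, and the trigger_rollback hook in the comment. curl's --fail flag is real and makes HTTP 4xx/5xx responses exit non-zero.

```shell
#!/usr/bin/env bash
# Sketch: wait for the health endpoint to come up, fail if it never does.
set -euo pipefail

wait_healthy() {
  local url=$1 attempts=${2:-10} i
  for (( i = 1; i <= attempts; i++ )); do
    if curl --fail --silent --max-time 5 "$url" >/dev/null; then
      echo "health check passed on attempt $i"
      return 0
    fi
    sleep 3
  done
  echo "health check failed after $attempts attempts" >&2
  return 1
}

# wait_healthy "https://app.example.com/health" || trigger_rollback
```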

Rollback Strategies That Do Not Require Manual Intervention

Deploys fail. Rollback should be automatic.

I keep the previous version around. If health checks fail after deploy I automatically switch back to the old version. Restart the service. No manual intervention needed.

I have symlinks for current and previous. Deploy creates a new directory. Updates the symlink. If it fails it points the symlink back.

Blue-green deployment. Feature flags. Canary releases. All variations on the same idea. Make rollback fast and automatic.
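The symlink scheme in miniature, with throwaway paths. ln -sfn replaces an existing symlink instead of descending into it, so deploy and rollback are the same one-line operation.

```shell
#!/usr/bin/env bash
# Sketch: per-release directories with a `current` symlink.
set -euo pipefail

APP=/tmp/deploy-demo-app
mkdir -p "$APP/releases/v1" "$APP/releases/v2"

ln -sfn "$APP/releases/v1" "$APP/current"   # v1 is live

ln -sfn "$APP/releases/v2" "$APP/current"   # deploy v2: flip the link

ln -sfn "$APP/releases/v1" "$APP/current"   # rollback: flip it back

echo "live release: $(readlink "$APP/current")"
```

For a strictly atomic flip on Linux, create the new link under a temp name and rename it over the old one with GNU mv -T; ln -sfn has a tiny unlink-then-link window.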

Environment Differences That Bite You

Development and production are different. The differences cause bugs.

I have been bitten by every one of these at least once.

Local Dev Versus CI Versus Production

Your laptop has 32GB of RAM. CI has 4GB. Production has 8GB. npm install runs out of memory in CI. Not locally. Not in production. Just CI.

Or your laptop has fast SSD. CI has slow network storage. File operations that take seconds locally take minutes in CI.

Resource differences cause timing differences. Timing differences cause race conditions. Race conditions cause intermittent failures.

I test in an environment that matches production. Not just locally.

Bash Version Differences

macOS ships with Bash 3.2. Most Linux systems have Bash 4 or 5. The syntax is different.

Associative arrays do not exist in Bash 3.2. The &>> redirect syntax does not work. The ** globstar pattern does not work.

I write scripts that work in Bash 3.2. Lowest common denominator. Or I explicitly require Bash 4 and check the version at the start.
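The explicit-requirement route is a guard at the top of the file. This check itself parses under Bash 3.2, so the script can refuse to run before any Bash 4 syntax is ever reached.

```shell
#!/usr/bin/env bash
# Sketch: declare the Bash you need instead of failing later with a
# cryptic syntax error.
if (( BASH_VERSINFO[0] < 4 )); then
  echo "ERROR: this script needs bash >= 4, found $BASH_VERSION" >&2
  exit 1
fi
echo "bash $BASH_VERSION is new enough"
```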

Environment Variables That Exist Locally But Not in Cron

You have PATH set in your .bashrc. When you run the script manually it finds the commands. When cron runs the script it does not load .bashrc. PATH is minimal. Commands are not found.

Cron runs with a stripped-down environment. PATH is a bare default like /usr/bin:/bin. Nothing from your .bashrc or .profile gets loaded. Your custom variables do not exist.

I set environment variables explicitly in cron jobs. Or I use absolute paths for all commands.
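You can reproduce cron's environment before cron does it to you, using env -i to strip everything and hand back only what cron would provide. The PATH shown mirrors a typical cron default; adjust it to match your crontab.

```shell
#!/usr/bin/env bash
# Sketch: run a command under a cron-like minimal environment.
set -euo pipefail

env -i PATH=/usr/bin:/bin HOME=/tmp SHELL=/bin/sh \
  /bin/sh -c 'echo "cron-like PATH: $PATH"'
```

If the script survives this, it will survive cron.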

File Permissions in Docker Versus Bare Metal

You run the script as yourself locally. UID 1000. You create files. They are owned by UID 1000.

In Docker the script often runs as root by default. UID 0. Files are owned by root. The application runs as a non-root user. Cannot read the files. Permission denied.

I set file ownership explicitly after creating files. chown to the correct user. Or I run the deploy script as the same user the application uses.

Timezone and Locale Issues

Your script parses a date string. Works fine in US timezone. Fails in Europe. The date format is different.

Or your script sorts files. Locale settings change sort order. Different results in different environments.

I set LC_ALL=C to get consistent sorting. I use UTC for all dates. I do not trust locale-dependent behavior.

Yes really. I have debugged timezone bugs at 3 AM. They are not fun.
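The fix is two exports at the top of the script. In the C locale all uppercase letters sort before lowercase, byte by byte, in every environment.

```shell
#!/usr/bin/env bash
# Pin locale and timezone so sorting and timestamps are identical
# everywhere.
set -euo pipefail
export LC_ALL=C    # byte-wise sort order, no locale surprises
export TZ=UTC      # every timestamp in UTC

printf 'b\nA\na\nB\n' | sort | tr '\n' ' '; echo   # prints: A B a b
date -u +%Y-%m-%dT%H:%M:%SZ
```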

When to Fix the Script Versus When to Fix the Environment

Sometimes the script is fine. The environment is broken.

Sometimes the environment is fine. The script makes bad assumptions.

Figuring out which is which saves time.

Technical Debt in Deployment Automation

Your deploy script has 400 lines of workarounds. Half of them are for problems that no longer exist. The other half are fixing issues that should be fixed in the environment instead.

This is technical debt. It accumulates over time. Eventually the script becomes unmaintainable.

I periodically audit deploy scripts. Remove workarounds that are no longer needed. Fix root causes instead of patching around them.

If the script is working around a permissions problem fix the permissions. Do not add more code to the script.

Making Scripts More Defensive Versus Fixing Root Causes

Your script checks if the directory exists before creating it. Good defensive coding.

But why might the directory already exist? Is a previous deploy leaving garbage around? That is the root cause. Fix that instead.

Defensive coding is good. But it can hide problems. If you are constantly defending against the same issue maybe the issue should be fixed permanently.

The Cost of Overly Clever Scripts

I have written very clever deploy scripts. They handled every edge case. They recovered from every error. They were 800 lines of beautiful error handling.

They were also impossible to debug. When something went wrong I could not figure out which of the 47 error handlers had activated. The complexity made debugging harder not easier.

I write simple scripts now. If something can fail I let it fail loudly. I do not try to handle every edge case. I make the happy path obvious and the failure path obvious.

Clever scripts are a liability.

When to Rewrite Versus When to Patch

Your deploy script is 5 years old. It has been patched 30 times. It still mostly works but it is fragile.

Do you rewrite it or keep patching?

I rewrite when the script is harder to understand than to recreate. If I cannot explain what the script does in 5 minutes it is too complex. Time to start over.

I patch when the script is still understandable. A bug is a bug. Fix it. Move on.

Building Scripts That Fail Gracefully

The goal is not to prevent all failures. Failures will happen.

The goal is to make failures debuggable.

Designing for Debuggability

I write scripts that make debugging easy.

Clear error messages. Logs in predictable locations. Exit codes that mean something. State that can be inspected.

When the script fails at 3 AM I should be able to figure out what went wrong in under 5 minutes.

I add logging before every major operation. I save intermediate state to files. I write a summary at the end showing what succeeded and what failed.

Progressive Deployment with Checkpoints

I break deploys into stages. Each stage has a checkpoint.

Build the application. Checkpoint. Run tests. Checkpoint. Deploy to staging. Checkpoint. Deploy to production. Checkpoint.

If the deploy fails I know exactly which stage failed. I can restart from the last checkpoint. I do not have to redo everything.

Meaningful Exit Codes and Error Messages

Exit code 1 means "something went wrong." Not helpful.

I use specific exit codes. 1 for build failure. 2 for test failure. 3 for deploy failure. 4 for rollback failure.

The exit code tells me which stage failed.

I write error messages that say what went wrong and what to do about it. "Database connection failed. Check DATABASE_URL environment variable and verify the database is running."

Not just "Error: connection failed."

State Validation at Each Stage

After each stage I validate that the system is in the expected state.

After build I check that the binary exists. After deploy I check that the new version is running. After migration I check that the tables exist.

If the state is wrong I fail immediately. I do not proceed to the next stage and compound the problem.

Making It Easy to Resume from Failure

The deploy gets halfway through and fails. I fix the problem. Now I want to resume from where it failed.

I design scripts to be resumable. Each stage checks if its work is already done. If the binary is already built skip the build stage. If the database is already migrated skip the migration.

This makes retries fast. I do not have to redo successful stages. I only redo what failed.
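Checkpoints and resumability are the same mechanism: a marker file per stage. This sketch uses throwaway /tmp paths and echo as a stand-in for the real stage work.

```shell
#!/usr/bin/env bash
# Sketch: checkpointed, resumable stages. A stage whose marker file
# exists is skipped, so a re-run only redoes what failed.
set -euo pipefail

CKPT=/tmp/deploy-demo-ckpt
rm -rf "$CKPT" && mkdir -p "$CKPT"   # fresh run for this demo

stage() {
  local name=$1; shift
  if [ -f "$CKPT/$name.done" ]; then
    echo "skip $name (already done)"
    return 0
  fi
  echo "run  $name"
  "$@"                            # the stage's real work
  touch "$CKPT/$name.done"        # checkpoint only after success
}

stage build   echo "  building"
stage migrate echo "  migrating"
stage build   echo "  building"   # repeated: prints "skip build"
```

The key detail is that the marker is written only after the stage succeeds, so a stage that dies halfway runs again on the next attempt.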

You Will Always Be Surprised

Here is the thing about deploy scripts. They will always fail in ways you did not anticipate.

I have been doing this for over 20 years. I still get surprised. New failure modes. New race conditions. New environmental quirks.

The goal is not to write a perfect script that never fails. That is impossible.

The goal is to make debugging fast. When the script fails at 3 AM can you figure out what went wrong in under 10 minutes? Can you fix it and get production back up?

I design for debuggability. Clear logs. Obvious errors. Simple logic. Easy rollback.

I practice failure scenarios. I intentionally break things in staging. I time how long it takes to diagnose and fix. I improve my process.

The best deployment engineers are not the ones who prevent all failures. They are the ones who recover quickly when failures happen.

And failures will happen. Your pager will go off. Your script will hang. Your deploy will break production.

What matters is how fast you fix it.

Keep your scripts simple. Log everything important. Validate your assumptions. Test your rollback procedure.

And get some sleep before the next 3 AM page comes in.