From CI Nightmares to Reliable Builds: A Developer’s Guide to Retry Mechanisms
The 3 AM Build Failure
It was 3 AM when Sarah’s phone buzzed with a notification: “CI Pipeline Failed”. Again. The production hotfix that needed to go out first thing in the morning was stuck because the build couldn’t download dependencies. A flaky network connection had caused curl to time out while pulling a package. The entire 45-minute build would need to be restarted from scratch, pushing their critical fix dangerously close to the morning deadline.
Sound familiar? We’ve all been there. But what if I told you there’s a better way?
The Real Cost of Flaky Builds
Let’s look at a typical scenario in a modern development team:
- Average build time: 45 minutes
- Builds per day: 50
- Network-related failures: 20% of builds
- Developer time spent managing failed builds: 2 hours/day
That’s potentially 450 minutes of wasted build time every day (20% of 50 builds is 10 failures, each burning a 45-minute build) plus 2 hours of developer time, just because of transient network issues. For a team of 10 developers, this could cost upwards of $100,000 annually in lost productivity.
Building Resilience: The Retry Pattern
1. Docker Build Resilience
Before, our Dockerfile looked like this:
```dockerfile
FROM node:16-alpine
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
```
One network hiccup and the entire build fails. Here’s how we made it resilient:
```dockerfile
FROM node:16-alpine

# Add a retry mechanism for curl: up to 5 attempts, failing the build
# only after the final attempt
RUN for i in 1 2 3 4 5; do \
      curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash && break; \
      echo "Retry attempt $i failed"; \
      if [ "$i" -eq 5 ]; then exit 1; fi; \
      sleep 10; \
    done

# The same pattern for apt-get installations (on a Debian-based image
# such as node:16; Alpine images would use apk instead)
RUN apt-get update \
    && for i in 1 2 3 4 5; do \
         apt-get install -y --no-install-recommends some-package && break; \
         echo "Retry attempt $i failed"; \
         if [ "$i" -eq 5 ]; then exit 1; fi; \
         sleep 10; \
       done
```
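If your base image ships a reasonably recent curl, you can often skip the hand-rolled loop and lean on curl’s built-in retry flags instead. A minimal sketch (assuming curl 7.71+ for `--retry-all-errors`; older versions still support `--retry` and `--retry-delay`):

```bash
# Let curl retry transient failures itself: 5 attempts, 10 seconds apart
curl --retry 5 --retry-delay 10 --retry-all-errors -o- \
  https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
```

One caveat when piping straight into bash: a retry after partial output can get messy, so downloading to a file first and executing it afterwards is the safer variant.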
2. Package Manager Resilience
Replace:
```bash
bundle install
```
With:
```bash
bundle install --retry 3 --jobs 4 \
  || (echo "Bundle install failed, cleaning and retrying..." \
      && rm -rf vendor/bundle \
      && bundle install --retry 3 --jobs 4)
```
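npm has similar built-in knobs. A hedged sketch using npm’s documented fetch-retry settings (the exact values are just a starting point):

```bash
# Increase npm's own network retries before installing
npm config set fetch-retries 5               # number of retry attempts (default is 2)
npm config set fetch-retry-mintimeout 20000  # minimum backoff in ms
npm config set fetch-retry-maxtimeout 120000 # maximum backoff in ms
npm ci
```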
3. CI Pipeline Resilience
Here’s a real GitHub Actions workflow that saved our team countless hours:
```yaml
name: Resilient CI Pipeline

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Cache dependencies
        uses: actions/cache@v2
        with:
          path: |
            ~/.npm
            ~/.bundle
          # Key the cache on the OS and your dependency lockfiles
          key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/Gemfile.lock') }}

      - name: Install dependencies with retry
        uses: nick-invision/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          retry_wait_seconds: 30
          command: |
            npm ci
            bundle install --retry 3 --jobs 4

      - name: Build Docker image with retry
        uses: nick-invision/retry@v2
        with:
          timeout_minutes: 15
          max_attempts: 3
          command: |
            docker build -t myapp:${{ github.sha }} .
```
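If you would rather not depend on a third-party action, roughly the same behaviour can be approximated in a plain `run:` step. A minimal sketch (shell only, no action inputs):

```bash
# Retry npm ci up to 3 times with a 30-second pause between attempts
for attempt in 1 2 3; do
  npm ci && break
  echo "npm ci failed (attempt $attempt)"
  if [ "$attempt" -eq 3 ]; then exit 1; fi
  sleep 30
done
```

The dedicated retry action still buys you a per-attempt timeout, which a bare loop does not.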
The Results: A Success Story
After implementing these retry mechanisms:
- Build failure rate dropped from 20% to 2%
- Developer interruptions reduced by 85%
- Estimated annual savings: $85,000
- Team morale significantly improved (no more 3 AM wake-up calls!)
Best Practices We Learned
1. **Layer Your Retry Strategies**

```bash
#!/bin/bash
# build.sh - Our final build script
set -eo pipefail

# Function to retry commands
retry() {
  local max_attempts="$1"
  local delay="$2"
  local command="${@:3}"
  local attempt=1

  until $command; do
    if (( attempt == max_attempts )); then
      echo "Command failed after $max_attempts attempts"
      return 1
    fi
    echo "Attempt $attempt failed! Retrying in ${delay}s..."
    sleep $delay
    ((attempt++))
  done
}

# Application of the retry strategy
retry 3 10 bundle install --retry 3 --jobs 4
retry 3 30 docker build .
retry 3 10 npm ci
```
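A fixed delay works, but exponential backoff with a little jitter is gentler on a struggling registry or mirror, and it stops parallel jobs from retrying in lockstep. A sketch that could sit alongside the `retry()` helper above (the function name and delays are illustrative):

```bash
# Retry with exponential backoff plus 0-5 seconds of random jitter
retry_backoff() {
  local max_attempts="$1"; shift
  local attempt=1
  local delay=5
  until "$@"; do
    if (( attempt == max_attempts )); then
      echo "Command failed after $max_attempts attempts"
      return 1
    fi
    local jitter=$(( RANDOM % 6 ))
    echo "Attempt $attempt failed! Retrying in $(( delay + jitter ))s..."
    sleep $(( delay + jitter ))
    delay=$(( delay * 2 ))
    ((attempt++))
  done
}

# Example: up to 5 attempts, delays of roughly 5s, 10s, 20s, 40s
retry_backoff 5 bundle install --retry 3 --jobs 4
```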
2. **Smart Caching Strategy**
```yaml
# .github/workflows/ci.yml
- name: Cache Docker layers
  uses: actions/cache@v2
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-buildx-

- name: Build with cache
  uses: docker/build-push-action@v2
  with:
    context: .
    push: false
    cache-from: type=local,src=/tmp/.buildx-cache
    cache-to: type=local,dest=/tmp/.buildx-cache-new

# Swap in the fresh cache so it does not grow without bound between runs
- name: Move cache
  run: |
    rm -rf /tmp/.buildx-cache
    mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```
3. **Monitor and Alert**

```ruby
# lib/build_monitor.rb
class BuildMonitor
  def self.track_retry(stage, attempt)
    Prometheus::Client.registry.counter(
      :build_retries_total,
      labels: [:stage]
    ).increment(labels: { stage: stage })

    if attempt > 1
      Slack.notify(
        channel: '#builds',
        text: "⚠️ #{stage} needed #{attempt} attempts to succeed"
      )
    end
  end
end
```
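The same signal can also be emitted from plain shell, which is handy inside build.sh itself. A minimal sketch assuming a Prometheus Pushgateway is reachable via a `PUSHGATEWAY_URL` environment variable (the variable and the metric name are illustrative):

```bash
# Push the number of attempts a build stage needed to a Prometheus Pushgateway.
# The stage becomes part of the grouping key, so each stage keeps its own series.
report_retry() {
  local stage="$1"
  local attempts="$2"
  cat <<EOF | curl --silent --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/ci_build/stage/${stage}"
# TYPE ci_build_retry_attempts gauge
ci_build_retry_attempts ${attempts}
EOF
}

# Example: record that bundle install needed 2 attempts
report_retry bundle_install 2
```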
Real-world Testing Strategy
Here’s how we test our retry mechanisms:
```ruby
# spec/build_resilience_spec.rb
RSpec.describe "Build Resilience" do
  it "handles network failures gracefully" do
    # Simulate two network failures followed by a successful build
    call_count = 0
    allow(Docker).to receive(:build) do
      call_count += 1
      raise Excon::Error::Socket if call_count <= 2
      true
    end

    expect {
      BuildProcess.new.run
    }.not_to raise_error

    expect(Docker).to have_received(:build).exactly(3).times
  end
end
```
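The shell-level `retry()` helper deserves a test of its own. A self-contained sketch, assuming the function has been factored out into a sourceable `retry.sh` (a hypothetical file) so the test does not run the whole build:

```bash
#!/bin/bash
# test_retry.sh - verify that retry() survives two failures and then succeeds
source ./retry.sh   # hypothetical library containing the retry() function

counter_file="$(mktemp)"
echo 0 > "$counter_file"

# A command that fails on its first two calls and succeeds on the third
flaky_command() {
  local count=$(( $(cat "$counter_file") + 1 ))
  echo "$count" > "$counter_file"
  [ "$count" -ge 3 ]
}

if retry 5 1 flaky_command; then
  echo "PASS: retry succeeded after $(cat "$counter_file") attempts"
else
  echo "FAIL: retry gave up too early"
  exit 1
fi
```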
Monitoring Your Retry Success
```ruby
class RetryMetrics
  def self.record_retry(operation, attempt, success)
    # Assumes the caller stashed the start time in Thread.current before retrying
    elapsed = Time.now - Thread.current[:retry_start_time]

    Prometheus::Client.registry.histogram(
      :retry_duration_seconds,
      labels: [:operation, :success]
    ).observe({
      operation: operation,
      success: success
    }, elapsed)

    if success
      Rails.logger.info(
        "Operation #{operation} succeeded after #{attempt} attempts"
      )
    end
  end
end
```
Conclusion
By implementing these retry mechanisms, we transformed our CI pipeline from a source of frustration into a reliable, efficient system. The key takeaways:
- Always plan for network failures
- Layer your retry strategies
- Monitor and learn from retry patterns
- Cache aggressively
- Keep your team happy with fewer 3 AM alerts
Remember Sarah from our opening story? She now sleeps peacefully through the night, knowing that transient failures won’t derail her team’s deployments. Her team’s next challenge? Finding new things to discuss at stand-up now that “the build is broken” is no longer a daily topic!