From CI Nightmares to Reliable Builds: A Developer’s Guide to Retry Mechanisms
The 3 AM Build Failure
It was 3 AM when Sarah’s phone buzzed with a notification: “CI Pipeline Failed”. Again. The production hotfix that needed to go out first thing in the morning was stuck because the build couldn’t download dependencies. A flaky network connection had caused curl to time out while pulling a package. The entire 45-minute build would need to be restarted from scratch, pushing their critical fix dangerously close to the morning deadline.
Sound familiar? We’ve all been there. But what if I told you there’s a better way?
The Real Cost of Flaky Builds
Let’s look at a typical scenario in a modern development team:
- Average build time: 45 minutes
- Builds per day: 50
- Network-related failures: 20% of builds
- Developer time spent managing failed builds: 2 hours/day
That’s potentially 450 minutes of wasted build time every day (20% of 50 builds is 10 failures, each burning a 45-minute build) plus 2 hours of developer time, just because of transient network issues. For a team of 10 developers, this could cost upwards of $100,000 annually in lost productivity.
Building Resilience: The Retry Pattern
1. Docker Build Resilience
Before, our Dockerfile looked like this:
```dockerfile
FROM node:16-alpine
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
```
One network hiccup and the entire build fails. Here’s how we made it resilient:
```dockerfile
FROM node:16-alpine

# Add a retry mechanism for curl: up to 5 attempts, failing the build
# only after the final attempt
RUN for i in 1 2 3 4 5; do \
      curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash && break; \
      echo "Retry attempt $i failed"; \
      if [ "$i" -eq 5 ]; then exit 1; fi; \
      sleep 10; \
    done

# The same pattern for apt-get installations (on a Debian-based image
# such as node:16; Alpine images would use apk instead)
RUN apt-get update \
    && for i in 1 2 3 4 5; do \
         apt-get install -y --no-install-recommends some-package && break; \
         echo "Retry attempt $i failed"; \
         if [ "$i" -eq 5 ]; then exit 1; fi; \
         sleep 10; \
       done
```
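If your base image ships a reasonably recent curl, you can often skip the hand-rolled loop and lean on curl’s built-in retry flags instead. A minimal sketch (assuming curl 7.71+ for `--retry-all-errors`; older versions still support `--retry` and `--retry-delay`):

```bash
# Let curl retry transient failures itself: 5 attempts, 10 seconds apart
curl --retry 5 --retry-delay 10 --retry-all-errors -o- \
  https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
```

One caveat when piping straight into bash: a retry after partial output can get messy, so downloading to a file first and executing it afterwards is the safer variant.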
2. Package Manager Resilience
Replace:
```bash
bundle install
```
With:
```bash
bundle install --retry 3 --jobs 4 \
  || (echo "Bundle install failed, cleaning and retrying..." \
      && rm -rf vendor/bundle \
      && bundle install --retry 3 --jobs 4)
```
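npm has similar built-in knobs. A hedged sketch using npm’s documented fetch-retry settings (the exact values are just a starting point):

```bash
# Increase npm's own network retries before installing
npm config set fetch-retries 5               # number of retry attempts (default is 2)
npm config set fetch-retry-mintimeout 20000  # minimum backoff in ms
npm config set fetch-retry-maxtimeout 120000 # maximum backoff in ms
npm ci
```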
3. CI Pipeline Resilience
Here’s a real GitHub Actions workflow that saved our team countless hours:
```yaml
name: Resilient CI Pipeline

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Cache dependencies
        uses: actions/cache@v2
        with:
          path: |
            ~/.npm
            ~/.bundle
          # Key the cache on the OS and your dependency lockfiles
          key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/Gemfile.lock') }}

      - name: Install dependencies with retry
        uses: nick-invision/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          retry_wait_seconds: 30
          command: |
            npm ci
            bundle install --retry 3 --jobs 4

      - name: Build Docker image with retry
        uses: nick-invision/retry@v2
        with:
          timeout_minutes: 15
          max_attempts: 3
          command: |
            docker build -t myapp:${{ github.sha }} .
```
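If you would rather not depend on a third-party action, roughly the same behaviour can be approximated in a plain `run:` step. A minimal sketch (shell only, no action inputs):

```bash
# Retry npm ci up to 3 times with a 30-second pause between attempts
for attempt in 1 2 3; do
  npm ci && break
  echo "npm ci failed (attempt $attempt)"
  if [ "$attempt" -eq 3 ]; then exit 1; fi
  sleep 30
done
```

The dedicated retry action still buys you a per-attempt timeout, which a bare loop does not.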
The Results: A Success Story
After implementing these retry mechanisms:
- Build failure rate dropped from 20% to 2%
- Developer interruptions reduced by 85%
- Estimated annual savings: $85,000
- Team morale significantly improved (no more 3 AM wake-up calls!)
Best Practices We Learned
1. **Layer Your Retry Strategies**

```bash
#!/bin/bash
# build.sh - Our final build script
set -eo pipefail

# Function to retry commands
retry() {
  local max_attempts="$1"
  local delay="$2"
  local command="${@:3}"
  local attempt=1

  until $command; do
    if (( attempt == max_attempts )); then
      echo "Command failed after $max_attempts attempts"
      return 1
    fi
    echo "Attempt $attempt failed! Retrying in ${delay}s..."
    sleep $delay
    ((attempt++))
  done
}

# Application of the retry strategy
retry 3 10 bundle install --retry 3 --jobs 4
retry 3 30 docker build .
retry 3 10 npm ci
```
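A fixed delay works, but exponential backoff with a little jitter is gentler on a struggling registry or mirror, and it stops parallel jobs from retrying in lockstep. A sketch that could sit alongside the `retry()` helper above (the function name and delays are illustrative):

```bash
# Retry with exponential backoff plus 0-5 seconds of random jitter
retry_backoff() {
  local max_attempts="$1"; shift
  local attempt=1
  local delay=5
  until "$@"; do
    if (( attempt == max_attempts )); then
      echo "Command failed after $max_attempts attempts"
      return 1
    fi
    local jitter=$(( RANDOM % 6 ))
    echo "Attempt $attempt failed! Retrying in $(( delay + jitter ))s..."
    sleep $(( delay + jitter ))
    delay=$(( delay * 2 ))
    ((attempt++))
  done
}

# Example: up to 5 attempts, delays of roughly 5s, 10s, 20s, 40s
retry_backoff 5 bundle install --retry 3 --jobs 4
```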
2. **Smart Caching Strategy**
```yaml
# .github/workflows/ci.yml
- name: Cache Docker layers
  uses: actions/cache@v2
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-buildx-

- name: Build with cache
  uses: docker/build-push-action@v2
  with:
    context: .
    push: false
    cache-from: type=local,src=/tmp/.buildx-cache
    cache-to: type=local,dest=/tmp/.buildx-cache-new

# Swap in the fresh cache so it does not grow without bound between runs
- name: Move cache
  run: |
    rm -rf /tmp/.buildx-cache
    mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```
3. **Monitor and Alert**

```ruby
# lib/build_monitor.rb
class BuildMonitor
  def self.track_retry(stage, attempt)
    Prometheus::Client.registry.counter(
      :build_retries_total,
      labels: [:stage]
    ).increment(labels: { stage: stage })

    if attempt > 1
      Slack.notify(
        channel: '#builds',
        text: "⚠️ #{stage} needed #{attempt} attempts to succeed"
      )
    end
  end
end
```
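The same signal can also be emitted from plain shell, which is handy inside build.sh itself. A minimal sketch assuming a Prometheus Pushgateway is reachable via a `PUSHGATEWAY_URL` environment variable (the variable and the metric name are illustrative):

```bash
# Push the number of attempts a build stage needed to a Prometheus Pushgateway.
# The stage becomes part of the grouping key, so each stage keeps its own series.
report_retry() {
  local stage="$1"
  local attempts="$2"
  cat <<EOF | curl --silent --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/ci_build/stage/${stage}"
# TYPE ci_build_retry_attempts gauge
ci_build_retry_attempts ${attempts}
EOF
}

# Example: record that bundle install needed 2 attempts
report_retry bundle_install 2
```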
Real-world Testing Strategy
Here’s how we test our retry mechanisms:
```ruby
# spec/build_resilience_spec.rb
RSpec.describe "Build Resilience" do
  it "handles network failures gracefully" do
    # Simulate two network failures followed by a successful build
    call_count = 0
    allow(Docker).to receive(:build) do
      call_count += 1
      raise Excon::Error::Socket if call_count <= 2
      true
    end

    expect {
      BuildProcess.new.run
    }.not_to raise_error

    expect(Docker).to have_received(:build).exactly(3).times
  end
end
```
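The shell-level `retry()` helper deserves a test of its own. A self-contained sketch, assuming the function has been factored out into a sourceable `retry.sh` (a hypothetical file) so the test does not run the whole build:

```bash
#!/bin/bash
# test_retry.sh - verify that retry() survives two failures and then succeeds
source ./retry.sh   # hypothetical library containing the retry() function

counter_file="$(mktemp)"
echo 0 > "$counter_file"

# A command that fails on its first two calls and succeeds on the third
flaky_command() {
  local count=$(( $(cat "$counter_file") + 1 ))
  echo "$count" > "$counter_file"
  [ "$count" -ge 3 ]
}

if retry 5 1 flaky_command; then
  echo "PASS: retry succeeded after $(cat "$counter_file") attempts"
else
  echo "FAIL: retry gave up too early"
  exit 1
fi
```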
Monitoring Your Retry Success
```ruby
class RetryMetrics
  def self.record_retry(operation, attempt, success)
    # Assumes the caller stashed the start time in Thread.current before retrying
    elapsed = Time.now - Thread.current[:retry_start_time]

    Prometheus::Client.registry.histogram(
      :retry_duration_seconds,
      labels: [:operation, :success]
    ).observe({
      operation: operation,
      success: success
    }, elapsed)

    if success
      Rails.logger.info(
        "Operation #{operation} succeeded after #{attempt} attempts"
      )
    end
  end
end
```
Conclusion
By implementing these retry mechanisms, we transformed our CI pipeline from a source of frustration into a reliable, efficient system. The key takeaways:
- Always plan for network failures
- Layer your retry strategies
- Monitor and learn from retry patterns
- Cache aggressively
- Keep your team happy with fewer 3 AM alerts
Remember Sarah from our opening story? She now sleeps peacefully through the night, knowing that transient failures won’t derail her team’s deployments. Her team’s next challenge? Finding new things to discuss at stand-up now that “the build is broken” is no longer a daily topic!