Deploy Pipeline Fix: When Your Status Light Lies

April 12, 2026|infrastructure|FIX

Deploy Pipeline Fix: When Your Status Light Lies

Every GH Actions deploy for sports-dashboard had been marked failed for weeks — 15+ consecutive red runs, false Discord ROLLBACK alerts, all while production ran fine. Root cause: the verify loop filtered by coolify.applicationId=UUID but that label is an integer ID, so every poll returned empty. The rollback was a silent no-op on top of the broken polling.

False Failed Deploys

15+

consecutive

Real Failures

New Deploy Time

4m37s

was timing out at 10m

Fix Size

1 file

-40 lines

The Symptom

Every GH Actions run for Deploy Sports Dashboard was red. 15+ consecutive runs marked failed. Discord was firing ROLLBACK alerts on every push. But the site kept working, new features kept appearing in production, and nobody noticed anything was actually broken.

The Bug

The deploy workflow polled for containers using a label filter:

LABEL="coolify.applicationId=$APP_UUID"
docker ps --filter "label=$LABEL"

Coolify labels containers with applicationId=55 (an integer internal ID), not the UUID. The UUID lives in coolify.name. Every one of the 30 polls returned empty, every deploy timed out at 5 minutes, every "rollback" fired.

The Cover-Up

The rollback logic was even more broken than the polling. It did this:

PATCH /applications/$UUID  { docker_registry_image_tag: $PREV_SHA }
GET /deploy?uuid=$UUID&force=true

Then polled the health URL for HTTP 200.

Turns out deploy?force=true ignores the config PATCH and pulls :latest anyway. So the rollback just re-deployed the same new container and got HTTP 200 from it. The workflow reported "Rollback successful" while doing nothing. Production was actually running the latest commit the whole time.

How We Found It

Manually inspecting a running container:

docker inspect <container> | grep applicationId
"coolify.applicationId": "55"

Then checking the rollback on a "failed" run — the container creation time was before the rollback timestamp. No new container was ever created by the rollback. The whole verify-rollback scheme was theater.

The Fix

One workflow file, ~40 lines simpler. Instead of filtering by the broken label:

Record the current container ID before triggering deploy
Trigger Coolify deploy
Poll by container name prefix (name=^<UUID>-), watch for a new container ID
Crash-detect the new container honestly
Verify HTTP 200 once it's up
No rollback — real failures fail loudly with a Discord alert

First real run: 4m37s, verified green. Coolify swapped to a new container on poll 3 (24 seconds), health check passed on poll 4.

The Lesson

Status lights that lie are worse than no status lights. For weeks, the workflow was telling us deploys failed when they succeeded, which meant we'd have no way to distinguish a real failure. The deeper bug — that the rollback itself was a silent no-op — was hidden underneath the cosmetic bug of the broken polling.

When you see a system failing in an "impossible" way (everything broken but site works fine), that's the signal the verification layer is broken, not the thing being verified.

Added to infra docs: do not filter Coolify containers by applicationId=<UUID>. It's an integer, the UUID is in coolify.name, and the cleanest match is docker ps --filter name=^<UUID>-.