Most Retros Are a Waste of Everyone's Time
I've sat through probably 50 incident retrospectives. Maybe 8 of them produced something useful. The rest followed the same pattern: everyone sits in a room, someone reads a timeline, people nod, someone says "we need better communication," and nothing changes.
If your incident retrospective process looks like that, you're just doing theater.
The whole point of a retro is to make the next incident less painful. If you walk out without concrete, assigned action items with deadlines, you've wasted an hour of your team's time, and you'll have the exact same conversation after the next incident.
What a Good Incident Retrospective Process Looks Like
I stole our retro format from a DevOps team I worked with in 2023. It works. Here's the structure:
Timeline first. Build a detailed, timestamped timeline before the meeting. Not during it. The person running the retro should gather Slack messages, alert logs, and deployment records ahead of time. When we show up, the timeline is already on a shared doc. This saves 20 minutes of people arguing about when things happened.
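If you want to script the gathering step, here's a rough sketch of pulling the incident channel's Slack history into a paste-ready timeline. It assumes the slack_sdk library and a bot token in an environment variable; the channel ID and time window are placeholders, not our real setup:

```python
# Minimal sketch: pull Slack messages from the incident channel into a
# timestamped timeline before the retro. Assumes slack_sdk is installed
# and SLACK_BOT_TOKEN is set; channel ID and window are placeholders.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

INCIDENT_CHANNEL = "C0123456789"   # hypothetical incident channel ID
WINDOW_START = "1719990000"        # incident start (Unix timestamp, as Slack expects)
WINDOW_END = "1720005000"          # incident end

resp = client.conversations_history(
    channel=INCIDENT_CHANNEL,
    oldest=WINDOW_START,
    latest=WINDOW_END,
    inclusive=True,
    limit=200,
)

# Oldest-first, one line per message, ready to paste into the shared doc.
for msg in sorted(resp["messages"], key=lambda m: float(m["ts"])):
    when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC  {msg.get('user', 'bot')}: {msg.get('text', '')}")
```

Alert logs and deployment records can be appended the same way. The point is simply that the timeline exists before anyone walks into the room.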
Five questions, answered honestly.
- What triggered the incident?
- How did we detect it, and how long did detection take?
- What slowed down the response?
- What fixed it?
- What would have prevented it entirely?
That last question is the one most teams skip. They focus on response and ignore prevention. But the best retros I've run produced monitoring improvements that caught future issues before they became incidents.
The Detection Gap Is Where the Money Goes
Here's the number that matters most in any retro: time to detection. How long was the problem happening before anyone knew about it?
We tracked this across 30 marketing funnel incidents last year. The average time to detection was 4.7 hours. That's 4.7 hours of broken funnels, wasted ad spend, and lost conversions before someone realized there was a problem.
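Tracking this doesn't need anything fancy. Here's a sketch of the kind of calculation we run, assuming a simple incident log in a CSV; the column names and file are illustrative, not our actual tracking setup:

```python
# Sketch: compute time-to-detection stats from a simple incident log.
# Assumes a CSV with hypothetical columns: incident_id, started_at,
# detected_at, detection_method (ISO 8601 timestamps).
import csv
from datetime import datetime
from statistics import mean

def hours_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

with open("incidents.csv", newline="") as f:
    rows = list(csv.DictReader(f))

ttd = [hours_between(r["started_at"], r["detected_at"]) for r in rows]
human_detected = [r for r in rows if r["detection_method"] == "human"]

print(f"incidents: {len(rows)}")
print(f"average time to detection: {mean(ttd):.1f} hours")
print(f"detected by a human noticing: {len(human_detected) / len(rows):.0%}")
```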
In 60% of those cases, the detection method was "someone noticed the numbers looked weird." Not an alert. Not a monitoring tool. A human, looking at a dashboard, hours after the fact.
Your incident retrospective process should always ask: could we have detected this faster? And if the answer is yes (it almost always is), the action item should be to set up monitoring that catches that specific failure mode. We use FunnelLeaks for funnel-specific monitoring, Pingdom for uptime, and custom alerts in our analytics platform for traffic anomalies.
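For the custom traffic-anomaly alerts, the idea is simple: compare today's number to a trailing baseline and alert on a big drop. Here's a minimal sketch of that pattern; the 30% threshold, the 14-day median baseline, and the webhook URL are assumptions you'd tune to your own funnel:

```python
# Sketch: a custom traffic-anomaly check. Compare today's conversions to a
# trailing baseline and post to a webhook if the drop exceeds a threshold.
# The data source, 30% threshold, and webhook URL are assumptions.
from statistics import median

import requests

ALERT_WEBHOOK = "https://hooks.example.com/alerts"   # hypothetical
DROP_THRESHOLD = 0.30                                # alert on a 30%+ drop

def check_conversions(today: int, last_14_days: list[int]) -> None:
    baseline = median(last_14_days)
    if baseline == 0:
        return  # nothing to compare against yet
    drop = (baseline - today) / baseline
    if drop >= DROP_THRESHOLD:
        requests.post(ALERT_WEBHOOK, json={
            "text": f"Conversions down {drop:.0%} vs 14-day median ({today} vs {baseline:.0f})."
        }, timeout=10)

# Wire the counts in from your analytics export; these numbers are made up.
check_conversions(today=41, last_14_days=[88, 92, 79, 85, 90, 87, 84, 91, 86, 83, 89, 95, 82, 88])
```

A median baseline is deliberately boring: it shrugs off one weird day without triggering, which keeps the alert trustworthy.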
Blame Culture Kills Honest Retros
If people are scared of being blamed, they won't tell you what actually happened. They'll sanitize the story. They'll leave out the part where they pushed a config change without testing it, or where they saw the alert and assumed it was a false positive.
I've made it a rule on our team: retros are blameless. We talk about systems, not people. Instead of "Sarah didn't check the alert," we say "the alert went to a channel that doesn't have clear ownership." Same root cause, totally different energy.
The teams that run the best incident retrospective process are the ones where people feel safe saying "I screwed up, and here's what we should change so it doesn't happen again." That only works if leadership models that behavior first.
Turn Every Retro Into a Monitoring Improvement
Here's my rule of thumb: every incident retro should produce at least one new monitoring check or alert. If you had an incident and your monitoring didn't catch it, that's a gap. Fill the gap.
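To make that concrete: say the retro finding was "the lead form was returning 500s for six hours and nobody noticed." The gap-filling check submits a synthetic lead on a schedule and fails loudly if it doesn't go through. This is an illustrative sketch, with a made-up endpoint and payload:

```python
# Sketch: a check born from a retro finding. Suppose the incident was a lead
# form silently returning 500s; the gap-filling check submits a synthetic
# lead on a schedule and alerts if it fails. URL and payload are hypothetical.
import sys

import requests

FORM_ENDPOINT = "https://www.example.com/api/leads"   # hypothetical
TEST_PAYLOAD = {"email": "synthetic-check@example.com", "source": "monitoring"}

def check_lead_form() -> bool:
    try:
        resp = requests.post(FORM_ENDPOINT, json=TEST_PAYLOAD, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    if not check_lead_form():
        # Exit non-zero so the scheduler (cron, CI, etc.) flags the failure.
        sys.exit("lead form check failed")
```

Run it from cron or CI every few minutes, filter the synthetic leads out of your reporting, and that particular failure mode never gets six hours of silence again.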
Over time, this compounds. After 10 retros, you've got 10+ new monitoring checks running. Your detection time drops. Your incidents get shorter. Your team gets less stressed because they know the system will catch things.
If you're not turning your retros into monitoring improvements, you're just documenting your failures without learning from them. Set up the alerts. Automate the checks. And if you need a monitoring tool built for marketing funnels specifically, FunnelLeaks is where we'd point you. We built it because generic DevOps monitoring tools don't understand the marketing-specific failure modes that actually cost you money.
