The maintenance work nobody puts on the roadmap
Dependency updates, expiring certs, and quiet capacity creep never make the roadmap — until they cause the outage. Here is the boring upkeep that keeps software alive.
Roadmaps are full of features. They almost never contain the work that actually keeps software running: the upgrades, renewals, and quiet upkeep that prevent the outage rather than respond to it.
That work is invisible right up until the moment it isn’t — and then it’s an incident, a postmortem, and a scramble. The boring move is to do it on a schedule, so it never becomes a story.
The things that break while you’re not looking
A surprising amount of downtime comes not from new code, but from time passing:
- An expiring certificate. TLS certs, signing certs, API credentials — they all have expiry dates, and they all expire on a weekend.
- A dependency that went stale. The longer you wait to upgrade, the bigger and scarier the jump, until you’re three major versions behind and afraid to touch it.
- A disk filling slowly. Logs, uploads, a table that only ever grows. Capacity problems arrive gradually, then all at once.
- A platform deprecation. Your cloud provider is sunsetting a runtime, an API version, an instance type. The email arrived months ago. Nobody owned it.
- A backup that silently stopped. The job has been failing for six weeks. You find out the day you need to restore.
None of these are interesting. All of them are predictable. That’s exactly why they’re worth automating away.
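The silently failing backup is a good first candidate, because the check is only a comparison of two timestamps. Here is a minimal sketch, assuming you can get the time of the last successful backup from your backup tool or job scheduler. The function name and the two-day threshold are illustrative choices, not a recommendation for any particular system.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_success: datetime, now: datetime,
                    max_age: timedelta = timedelta(days=2)) -> bool:
    """Return True when the newest successful backup is older than max_age.

    max_age defaults to two days, which is an assumed threshold; pick one
    that matches how often your backup job is supposed to run.
    """
    return now - last_success > max_age


# Example: a job that last succeeded six weeks ago should raise a flag.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(backup_is_stale(now - timedelta(weeks=6), now))
print(backup_is_stale(now - timedelta(hours=12), now))
```

Run on a schedule and wired to an alert, a check like this turns "failing for six weeks" into "failing for two days".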
Small and often beats big and scary
The logic that favors small, frequent deploys applies to maintenance too. A dependency upgrade you do weekly is a five-minute, low-risk diff. The same upgrade deferred for a year is a multi-day migration with a real chance of breaking production.
So we keep it small and continuous:
- Automated dependency PRs, reviewed and merged on a regular cadence — not batched up for an annual “upgrade sprint” that never quite happens.
- A patch window that’s routine and boring, so security fixes land in days, not quarters.
- Calendar reminders for every expiry — certs, domains, credentials — with enough lead time to renew calmly.
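The expiry reminders can start as a plain inventory and a date comparison. The sketch below assumes a hand-maintained mapping of names to expiry dates. Every entry in it is made up for illustration, and the 30-day lead time is an assumption you would tune per item.

```python
from datetime import date, timedelta

# Hypothetical inventory; names and dates are invented for this example.
EXPIRIES = {
    "tls:www.example.com": date(2025, 3, 14),
    "domain:example.com": date(2025, 11, 2),
    "api-key:payments": date(2025, 2, 20),
}


def due_for_renewal(expiries, today, lead_time=timedelta(days=30)):
    """Return (name, days_left) pairs for items expiring within lead_time.

    Results are sorted soonest first. Items that have already expired
    show up with a negative days_left, so they sort to the top.
    """
    due = [
        (name, (expires - today).days)
        for name, expires in expiries.items()
        if expires - today <= lead_time
    ]
    return sorted(due, key=lambda item: item[1])


for name, days_left in due_for_renewal(EXPIRIES, today=date(2025, 2, 15)):
    print(f"{name}: {days_left} days left")
```

Whether the output goes to a calendar, a chat channel, or a ticket matters less than keeping the inventory complete and running the check regularly.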
Make the invisible visible
The reason this work falls off roadmaps is that it has no champion and no signal. Fix the signal:
- A short dependency and platform health check that runs on a schedule and reports what’s drifting out of support.
- Capacity alerts that fire with headroom, at 70% disk rather than 98%, so you’re adding capacity calmly instead of fighting a fire.
- A restore drill on a schedule. Actually restore the backup into a scratch environment. A backup is only real once you’ve restored it.
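The capacity alert can be sketched with nothing but the standard library. The 70% warning level comes from the point above. The 90% critical level and the function names are assumptions for illustration, and in practice your monitoring system would probably do this job rather than a script.

```python
import shutil


def disk_used_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def capacity_status(used_fraction, warn_at=0.70, critical_at=0.90):
    """Classify disk usage. warn_at leaves headroom to act calmly;
    critical_at is an assumed second threshold for urgent action."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    print(f"{fraction:.0%} used: {capacity_status(fraction)}")
```

Keeping the threshold logic separate from the measurement makes it easy to test, and to reuse for database size or an object storage quota.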
Someone has to own it
The deepest reason maintenance gets skipped is ownership. Features have product managers. Upkeep has nobody, so it becomes everybody’s someday.
In our maintenance and support engagements it’s explicit: keeping the lights on is a named deliverable with a named owner, not the thing we get to after the fun work. That covers dependency updates, cert renewals, capacity, backups, and a tested runbook for when something does break.
The goal of all of this is profoundly unglamorous: a system that just keeps working, where the most exciting thing that happens this quarter is a feature you chose to ship — not a 3am page about a cert that expired while everyone was looking the other way.