The 3 a.m. phone call
It always starts the same way. A controller faults. The line stops. Someone calls the on-call engineer. The engineer drives in. Then the real work begins: which version of the program was running on that PLC? Is the file on the laptop in the cabinet the one that was commissioned, or the one with the tweak from Tuesday's troubleshoot? Is there even a current backup?
For most plants, the answer to that last question is "sort of." A shared drive somewhere holds a backup. Whether it matches the controller that just died is anyone's guess. That uncertainty is where the 14-hour outage comes from — not the hardware swap, but the forensics that have to happen before anyone can safely flash a replacement.
This guide is the playbook we wish every plant had before the 3 a.m. call. It assumes you don't have unlimited budget, you can't replace every controller with a hot-standby pair, and you have to defend every dollar to a CFO who doesn't think about Allen-Bradley parts for a living.
RPO and RTO, translated for the plant floor
Two acronyms borrowed from IT do most of the heavy lifting here. RPO (Recovery Point Objective) is the most data you can afford to lose, measured in time. It is a target, and the backup interval has to be shorter than it. If the last backup was 24 hours ago and the controller dies now, your actual recovery point is 24 hours, whatever the target says: you've lost a day of program changes, tag adjustments and recipe tweaks. RTO (Recovery Time Objective) is how long you can afford to be down before the line is producing again.
Both numbers should be set by the cost of downtime, not by what feels comfortable. A Tier-1 line that loses $18,000 per hour can't tolerate the same RTO as a Tier-3 utility skid that mostly idles.
| Tier | Downtime cost | RPO target | RTO target | Recommended posture |
|---|---|---|---|---|
| Tier-1 | > $10k/hr | < 1 hour | < 15 min | Hot standby or cold spare + event-triggered backup |
| Tier-2 | $1k–$10k/hr | < 8 hours | < 1 hour | Cold spare + scheduled signed backup |
| Tier-3 | < $1k/hr | < 24 hours | < 4 hours | Daily signed backup, shared spare pool |
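The tiering rule in the table can be sketched as a small lookup. This is a minimal illustration, not part of any product: the names `TierTargets` and `classify` are hypothetical, and placing exactly $1k/hr and exactly $10k/hr in Tier-2 is an assumption read off the table's band boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierTargets:
    name: str
    rpo_hours: float    # backups must be newer than this
    rto_minutes: float  # production must resume within this


# Targets copied from the table; the class and constant names are illustrative.
TIER_1 = TierTargets("Tier-1", rpo_hours=1, rto_minutes=15)
TIER_2 = TierTargets("Tier-2", rpo_hours=8, rto_minutes=60)
TIER_3 = TierTargets("Tier-3", rpo_hours=24, rto_minutes=240)


def classify(downtime_cost_per_hour: float) -> TierTargets:
    """Map an hourly downtime cost in dollars to a tier and its targets."""
    if downtime_cost_per_hour > 10_000:
        return TIER_1
    if downtime_cost_per_hour >= 1_000:
        return TIER_2
    return TIER_3


if __name__ == "__main__":
    targets = classify(18_000)
    print(targets.name, targets.rpo_hours, targets.rto_minutes)
```

Keeping the thresholds in one place makes it easier to defend the numbers: change the downtime cost estimate and the RPO and RTO targets follow from it rather than from what feels comfortable.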
A worked example: the 24-line packaging plant
Picture a packaging plant in the US Midwest. 24 lines. Mix of Allen-Bradley CompactLogix and Siemens S7-1500. Two shifts, six days a week. Average gross margin per produced case is $1.40. A representative Tier-1 line runs 380 cases per hour, so call it $530/hour of margin lost when the line is down. Three of the 24 lines reach that figure and are this plant's Tier-1 lines; the rest are Tier-2 or Tier-3. (The plant ranks its lines against one another, so its dollar cut-offs sit well below the bands in the table above.)
The plant's old posture: an engineer manually pulled program backups to a shared drive whenever they remembered. An audit found backups were on average 47 days stale. When a CompactLogix faulted in March, restoring took 11 hours, 9 of them spent reconciling the shared-drive copy against the last known commissioning ZIP and a screenshot someone took during a previous outage.
New posture, using PLC backup software configured per tier:
- 3 Tier-1 lines — event-triggered signed backup, cold spare CompactLogix staged, runbook in cabinet.
- 14 Tier-2 lines — backup every shift, drift alert to maintenance on mismatch.
- 7 Tier-3 lines — daily backup, shared spare pool.
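Whatever the backup cadence, the check that matters is whether the newest backup for an asset is still inside its RPO. A minimal sketch of that check follows, using the RPO targets from the tier table earlier in this guide. The function names are hypothetical, and where the last-backup timestamp comes from (a backup tool, a file's modification time) is left as an assumption.

```python
from datetime import datetime, timedelta

# RPO targets per tier, taken from the tier table earlier in this guide.
RPO_BY_TIER = {
    "Tier-1": timedelta(hours=1),
    "Tier-2": timedelta(hours=8),
    "Tier-3": timedelta(hours=24),
}


def backup_age(last_backup: datetime, now: datetime) -> timedelta:
    """How old the newest backup is at the moment of the check."""
    return now - last_backup


def within_rpo(tier: str, last_backup: datetime, now: datetime) -> bool:
    """True if the newest backup is younger than the tier's RPO target."""
    return backup_age(last_backup, now) < RPO_BY_TIER[tier]


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0)
    # A backup from six hours ago on a Tier-2 line is inside its RPO.
    print(within_rpo("Tier-2", now - timedelta(hours=6), now))
    # A 47-day-old backup misses even the loosest target.
    print(within_rpo("Tier-3", now - timedelta(days=47), now))
```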
Six months in, the plant tested the restore on a Tier-1 line. The engineer pulled the spare controller, restored the latest signed backup, validated it against the baseline, and had the line producing in 7 minutes. The same event under the old posture, an 11-hour restore at about $530/hour, would have cost roughly $5,800 in lost margin and a full shift of engineering time.
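The arithmetic behind these figures is simple enough to keep in a script and rerun whenever throughput or margin changes. The sketch below uses only the numbers from this example (380 cases per hour, $1.40 margin per case, an 11-hour restore versus a 7-minute one). The function name `lost_margin` is illustrative, and the result covers lost margin only, not engineering time.

```python
CASES_PER_HOUR = 380    # representative Tier-1 line in the example
MARGIN_PER_CASE = 1.40  # average gross margin per produced case, in dollars


def lost_margin(downtime_hours: float,
                cases_per_hour: float = CASES_PER_HOUR,
                margin_per_case: float = MARGIN_PER_CASE) -> float:
    """Gross margin lost while the line is down, in dollars."""
    return downtime_hours * cases_per_hour * margin_per_case


if __name__ == "__main__":
    per_hour = lost_margin(1)           # about $532, rounded to $530 in the text
    old_posture = lost_margin(11)       # the 11-hour restore
    new_posture = lost_margin(7 / 60)   # the 7-minute drill
    print(f"per hour:    ${per_hour:,.0f}")
    print(f"old posture: ${old_posture:,.0f}")
    print(f"new posture: ${new_posture:,.0f}")
```

The unrounded 11-hour figure comes out slightly above the rounded $5,800 quoted in the text because the text rounds the hourly rate down to $530 first.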
The one-page cabinet runbook
Every controlled cabinet should have a laminated, one-page runbook taped inside the door. It is the thing the on-call engineer reads at 3 a.m. when their hands are cold and the line is down. It does not assume the engineer wrote the program, knows the network topology, or has the vendor IDE installed.
| Step | Action | Detail |
|---|---|---|
| 1 | Confirm controller failure | Fault LED + comms timeout > 30 s |
| 2 | Lockout / tagout per cabinet sequence | Door card · steps 1–4 |
| 3 | Swap in staged spare from labeled shelf | Shelf C-07 · matched firmware |
| 4 | Open VEM, search line ID | Click Restore latest signed backup |
| 5 | Verify signature before re-enabling line | ed25519 · auto-logged |
| 6 | Close event, VEM captures attribution | Operator + version stamped |

| RPO target | RTO target | Spare on shelf | Firmware | Last drill | Result |
|---|---|---|---|---|---|
| ≤ 4 h | ≤ 30 min | C-07 | v33.011 | 2026-04-18 | RTO 21 min · pass |
Illustrative example of a VEM-generated cabinet runbook — content is representative, not from a specific customer site.
- Confirm the controller has actually failed (fault LED + comms timeout).
- Power down per the lockout sequence on the cabinet door.
- Swap in the staged spare controller from the labeled shelf.
- Open VEM, search line ID, click Restore latest signed backup.
- Verify signature against the running controller before re-enabling the line.
- Log the event — VEM auto-captures who restored and from which version.
Testing the plan before you need it
An untested backup is folklore. The only way to know your RTO target is real is to restore from backup on a deliberate cadence — and to treat each restore as a small drill, with a stopwatch and a written debrief. Tier-1 lines should be drilled quarterly. The drill costs one planned hour of downtime; the alternative is finding out at 3 a.m. that the backup file is corrupt.
| Date | Asset | Vendor | RTO actual | Target | Operator | Result |
|---|---|---|---|---|---|---|
| 2026-06-14 | Line 03 · Filler | Rockwell | 21 min | ≤ 30 min | m.lindberg | pass |
| 2026-06-12 | Mixer · Station 4 | Beckhoff | 18 min | ≤ 30 min | j.okafor | pass |
| 2026-05-30 | Aeration · WWTP-N | Schneider | 34 min | ≤ 30 min | p.kovac | warn |
| 2026-05-17 | Press · Line 12 | Siemens | 12 min | ≤ 20 min | a.tanaka | pass |
| 2026-05-02 | Conveyor · Line 09 | Rockwell | — | ≤ 30 min | s.müller | fail |
| 2026-04-18 | Robot cell · 2B | Beckhoff | 9 min | ≤ 15 min | m.lindberg | pass |
Illustrative example of a VEM restore-drill log — sample data, not a record from any specific site.
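A drill log like this is easy to score mechanically. The sketch below is one way to do it, under stated assumptions: the pass / warn / fail rule is inferred from the sample rows (met the target, restored but over target, restore not completed) and is not a documented VEM rule; the 92-day limit is a stand-in for "quarterly"; and both function names are hypothetical.

```python
from datetime import date
from typing import Optional


def drill_result(rto_actual_min: Optional[float], rto_target_min: float) -> str:
    """Score one drill. None means the restore was never completed."""
    if rto_actual_min is None:
        return "fail"
    if rto_actual_min <= rto_target_min:
        return "pass"
    return "warn"  # restored, but slower than the target


def drill_overdue(last_drill: date, today: date, max_interval_days: int = 92) -> bool:
    """True if more than roughly one quarter has passed since the last drill."""
    return (today - last_drill).days > max_interval_days


if __name__ == "__main__":
    # (asset, actual minutes or None, target minutes) from the sample log above.
    sample = [
        ("Line 03 · Filler", 21, 30),
        ("Aeration · WWTP-N", 34, 30),
        ("Conveyor · Line 09", None, 30),
    ]
    for asset, actual, target in sample:
        print(asset, drill_result(actual, target))
```

A "warn" is worth a written debrief just as much as a "fail": it means the backup was good but the procedure, the spare, or the access to tooling cost more time than the target allows.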
Change control is the other half of disaster recovery
Most "disasters" aren't fried controllers; they're unauthorised changes that nobody documented. A technician tweaks a setpoint during a Tuesday troubleshoot, never writes it down, and the line starts behaving oddly two weeks later. Without versioned, attributed history, the root cause is invisible.
This is where dedicated PLC backup software earns its keep beyond pure DR — every change is captured, attributed and reversible. A drift between the running controller and the approved baseline becomes a notification, not a forensic project.
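The core of drift detection can be sketched in a few lines: keep a digest of each approved program, and compare it with a digest of what the controller is running now. This is a simplified illustration, not a description of how VEM or any other product works. Comparing SHA-256 digests is an assumption, the function names are hypothetical, and exporting the running program from a controller is vendor-specific and not shown.

```python
import hashlib
from typing import Dict, List


def digest(program_bytes: bytes) -> str:
    """SHA-256 hex digest of an exported program file."""
    return hashlib.sha256(program_bytes).hexdigest()


def detect_drift(baseline: Dict[str, str], running: Dict[str, bytes]) -> List[str]:
    """Return the asset IDs whose running program no longer matches the baseline.

    baseline maps asset ID -> digest of the approved program.
    running  maps asset ID -> bytes exported from the controller (export not shown).
    An asset with no recorded baseline is also reported as drifted.
    """
    drifted = []
    for asset_id, program in running.items():
        if baseline.get(asset_id) != digest(program):
            drifted.append(asset_id)
    return sorted(drifted)


if __name__ == "__main__":
    approved = {"line-03": digest(b"program v1"), "line-09": digest(b"program v1")}
    current = {"line-03": b"program v1", "line-09": b"program v1 + Tuesday tweak"}
    for asset in detect_drift(approved, current):
        print(f"drift detected on {asset}: notify maintenance")
```

A digest only says that something changed. Knowing who changed it and being able to roll it back still depends on the versioned, attributed history described above.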
Walk through your own DR posture with us
We don't send gated PDFs. If the playbook above is useful, the next step is a working session — tier your assets, review your current backup setup, and see where VEM fits.