Frequently asked questions

What's a realistic RTO for a downed PLC?

With a current signed backup and a spare controller staged on the shelf, a trained technician can be back in production in 5–15 minutes. Without a current backup, 4–14 hours is typical — most of which is spent finding the right project file.

Do we need a hot-standby controller for every line?

No. Hot-standby is justified for Tier-1 lines where downtime cost exceeds the controller cost in under a shift. For Tier-2 and Tier-3, a cold spare plus a verified backup is usually the right balance.
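The break-even rule in that answer can be written as a small check. This is a minimal sketch, not a costing model: the function name, the 8-hour shift length, and the $40,000 standby price are all assumptions chosen for illustration.

```python
def hot_standby_pays_back_within_shift(
    standby_cost_usd: float,
    downtime_cost_per_hour_usd: float,
    shift_hours: float = 8.0,  # assumption: one shift is 8 hours
) -> bool:
    """Return True if downtime losses exceed the standby cost in under one shift."""
    if downtime_cost_per_hour_usd <= 0:
        return False
    hours_to_break_even = standby_cost_usd / downtime_cost_per_hour_usd
    return hours_to_break_even < shift_hours


# Hypothetical $40,000 standby pair on a line losing $18,000 per hour:
print(hot_standby_pays_back_within_shift(40_000, 18_000))  # True (about 2.2 hours)
```

The same hypothetical standby pair on a line losing $1,000 per hour would need 40 hours of downtime to pay back, which is why a cold spare is usually the better fit there.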

How often should we test the restore?

Quarterly for Tier-1 lines, semi-annually for Tier-2, annually for Tier-3. An untested backup is a hope, not a plan.
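One way to keep that cadence honest is to compute the next due date from the last drill. The sketch below assumes quarterly, semi-annual and annual map to 91, 182 and 365 days; the names are illustrative and not part of any product.

```python
from datetime import date, timedelta

# Assumed day counts for the cadence described above.
DRILL_INTERVAL_DAYS = {"Tier-1": 91, "Tier-2": 182, "Tier-3": 365}


def next_drill_due(tier: str, last_drill: date) -> date:
    """Date by which the next restore drill should have happened."""
    return last_drill + timedelta(days=DRILL_INTERVAL_DAYS[tier])


def is_overdue(tier: str, last_drill: date, today: date) -> bool:
    """True if the restore drill for this tier is past due."""
    return today > next_drill_due(tier, last_drill)


print(next_drill_due("Tier-1", date(2026, 4, 18)))  # 2026-07-18
```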

Where should backups live?

On-prem for low-latency restore, replicated off-site for disaster scenarios. The on-prem copy is what restores a controller in 5 minutes; the off-site copy is what survives a fire.

PLC Disaster Recovery Playbook — RPO/RTO for the Plant Floor

The 3 a.m. phone call

It always starts the same way. A controller faults. The line stops. Someone calls the on-call engineer. The engineer drives in. Then the real work begins: which version of the program was running on that PLC? Is the file on the laptop in the cabinet the one that was commissioned, or the one with the tweak from Tuesday's troubleshoot? Is there even a current backup?

For most plants, the answer to that last question is "sort of." A shared drive somewhere holds a backup. Whether it matches the controller that just died is anyone's guess. That uncertainty is where the 14-hour outage comes from — not the hardware swap, but the forensics that have to happen before anyone can safely flash a replacement.

This guide is the playbook we wish every plant had before the 3 a.m. call. It assumes you don't have unlimited budget, you can't replace every controller with a hot-standby pair, and you have to defend every dollar to a CFO who doesn't think about Allen-Bradley parts for a living.

RPO and RTO, translated for the plant floor

Two acronyms borrowed from IT do most of the heavy lifting here. RPO (Recovery Point Objective) is the most data you can afford to lose, measured in time. If the last backup was 24 hours ago and the controller dies now, you have lost a day of program changes, tag adjustments and recipe tweaks; an RPO of 24 hours means you have decided that loss is tolerable. RTO (Recovery Time Objective) is how long you can afford to be down before the line is producing again.

Both numbers should be set by the cost of downtime, not by what feels comfortable. A Tier-1 line that loses $18,000 per hour can't tolerate the same RTO as a Tier-3 utility skid that mostly idles.

Tier	Downtime cost	RPO target	RTO target	Recommended posture
Tier-1	> $10k/hr	< 1 hour	< 15 min	Hot standby or cold spare + event-triggered backup
Tier-2	$1k–$10k/hr	< 8 hours	< 1 hour	Cold spare + scheduled signed backup
Tier-3	< $1k/hr	< 24 hours	< 4 hours	Daily signed backup, shared spare pool
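The table above translates directly into a lookup. This is a minimal sketch: the class and function names are illustrative, and boundary values ($1k and $10k exactly) are assumed to fall into Tier-2, following the table's "> $10k" and "< $1k" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierTargets:
    name: str
    rpo_minutes: int  # upper bound from the table
    rto_minutes: int  # upper bound from the table


TIER_1 = TierTargets("Tier-1", rpo_minutes=60, rto_minutes=15)
TIER_2 = TierTargets("Tier-2", rpo_minutes=8 * 60, rto_minutes=60)
TIER_3 = TierTargets("Tier-3", rpo_minutes=24 * 60, rto_minutes=4 * 60)


def classify(downtime_cost_per_hour_usd: float) -> TierTargets:
    """Map an hourly downtime cost to the tier bands in the table."""
    if downtime_cost_per_hour_usd > 10_000:
        return TIER_1
    if downtime_cost_per_hour_usd >= 1_000:
        return TIER_2
    return TIER_3


print(classify(18_000).name)  # Tier-1
```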

A worked example: the 24-line packaging plant

Picture a packaging plant in the US Midwest: 24 lines, a mix of Allen-Bradley CompactLogix and Siemens S7-1500 controllers, two shifts, six days a week. Average gross margin per produced case is $1.40. A representative Tier-1 line runs 380 cases per hour, so every hour it is down loses about $530 of margin (380 × $1.40 ≈ $532). Three of the 24 lines run at or above that rate and are treated as Tier-1; the rest are Tier-2 or Tier-3.

The plant's old posture: an engineer manually pulled program backups to a shared drive whenever they remembered. An audit found the backups were on average 47 days stale. When a CompactLogix faulted in March, the restore took 11 hours, 9 of them spent reconciling the shared-drive copy against the last known commissioning ZIP and a screenshot someone took during a previous outage.
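A staleness audit like the one described here amounts to comparing each backup's age against the RPO for its line. The sketch below assumes you can obtain a last-backup timestamp per asset; the function name and the sample assets are hypothetical.

```python
from datetime import datetime, timedelta


def stale_backups(
    last_backup_at: dict[str, datetime],
    rpo: timedelta,
    now: datetime,
) -> dict[str, timedelta]:
    """Return the age of every backup that is older than the RPO."""
    return {
        asset: now - taken_at
        for asset, taken_at in last_backup_at.items()
        if now - taken_at > rpo
    }


now = datetime(2026, 6, 14, 6, 0)
inventory = {
    "Line 03": now - timedelta(days=47),  # hypothetical stale backup
    "Line 07": now - timedelta(hours=2),  # hypothetical fresh backup
}
print(stale_backups(inventory, rpo=timedelta(hours=8), now=now))
```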

New posture, using PLC backup software configured per tier:

3 Tier-1 lines — event-triggered signed backup, cold spare CompactLogix staged, runbook in cabinet.
14 Tier-2 lines — backup every shift, drift alert to maintenance on mismatch.
7 Tier-3 lines — daily backup, shared spare pool.

Six months in, the plant tested the restore on a Tier-1 line. The engineer pulled the spare controller, restored the latest signed backup, validated it against the baseline, and had the line producing in 7 minutes. The same event under the old posture, at 11 hours and about $530/hour, would have cost roughly $5,800 in lost margin and a full shift of engineering time.
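The arithmetic behind those figures is short enough to check by hand or in a few lines. This sketch uses the unrounded rate (380 cases × $1.40), so the 11-hour figure comes out slightly above the rounded $5,800 quoted above; the function name is illustrative.

```python
CASES_PER_HOUR = 380        # representative Tier-1 line, from the example
MARGIN_PER_CASE_USD = 1.40  # average gross margin per case, from the example


def lost_margin_usd(downtime_hours: float) -> float:
    """Margin lost while the line is down for the given number of hours."""
    return downtime_hours * CASES_PER_HOUR * MARGIN_PER_CASE_USD


old_posture = lost_margin_usd(11)      # 11-hour restore
new_posture = lost_margin_usd(7 / 60)  # 7-minute restore
print(round(old_posture), round(new_posture))  # 5852 62
```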

The one-page cabinet runbook

Every control cabinet should have a laminated, one-page runbook taped inside the door. It is the thing the on-call engineer reads at 3 a.m. when their hands are cold and the line is down. It does not assume the engineer wrote the program, knows the network topology, or has the vendor IDE installed.

VEM cabinet runbook · rev 04 · 2026-06-12

Line 07 · CompactLogix L33ER
Plant A · Filler bay 2 · Cabinet E-04

1. Confirm controller failure (fault LED + comms timeout > 30 s)
2. Lockout / tagout per cabinet sequence (door card, steps 1–4)
3. Swap in staged spare from labeled shelf (shelf C-07, matched firmware)
4. Open VEM, search line ID, click Restore latest signed backup
5. Verify signature before re-enabling line (ed25519, auto-logged)
6. Close event; VEM captures attribution (operator + version stamped)

Targets
RPO target: ≤ 4 h
RTO target: ≤ 30 min
Spare on shelf: C-07
Firmware: v33.011
Last drill: 2026-04-18
Result: RTO 21 min · pass

Restore
Latest signed backup ready · v143 · 06:14 today

Print laminated · tape inside cabinet door · review quarterly

Illustrative example of a VEM-generated cabinet runbook — content is representative, not from a specific customer site.

Confirm the controller has actually failed (fault LED + comms timeout).
Power down per the lockout sequence on the cabinet door.
Swap in the staged spare controller from the labeled shelf.
Open VEM, search line ID, click Restore latest signed backup.
Verify signature against the running controller before re-enabling the line.
Log the event — VEM auto-captures who restored and from which version.
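The verify-before-enable step can be pictured as a gate that refuses to proceed when the backup does not match its recorded tag. The runbook example names ed25519; the Python standard library has no ed25519 implementation, so this sketch swaps in HMAC-SHA256 purely to show the shape of the check. It is not how VEM verifies signatures, and every name in it is hypothetical.

```python
import hashlib
import hmac


def tag_backup(backup_bytes: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag for a backup (stand-in for a real signature)."""
    return hmac.new(key, backup_bytes, hashlib.sha256).hexdigest()


def backup_is_authentic(backup_bytes: bytes, key: bytes, expected_tag: str) -> bool:
    """Constant-time comparison of the recomputed tag against the stored one."""
    return hmac.compare_digest(tag_backup(backup_bytes, key), expected_tag)


def may_enable_line(backup_bytes: bytes, key: bytes, expected_tag: str) -> bool:
    """Gate for the runbook's verify step: only enable if the backup checks out."""
    return backup_is_authentic(backup_bytes, key, expected_tag)


key = b"demo-key"                  # assumption: key handling is out of scope here
program = b"line-07 program image"  # placeholder bytes, not a real project file
stored_tag = tag_backup(program, key)
print(may_enable_line(program, key, stored_tag))                # True
print(may_enable_line(program + b" tweaked", key, stored_tag))  # False
```

An asymmetric scheme such as ed25519 has the advantage that the cabinet laptop only needs the public key, so a compromised laptop cannot forge a valid backup.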

Testing the plan before you need it

An untested backup is folklore. The only way to know your RTO target is real is to restore from backup on a deliberate cadence — and to treat each restore as a small drill, with a stopwatch and a written debrief. Tier-1 lines should be drilled quarterly. The drill costs one planned hour of downtime; the alternative is finding out at 3 a.m. that the backup file is corrupt.

VEM restore drill log · Q2 2026

4 / 6 passed · exported · audit-ready

Date	Asset	Vendor	RTO actual	Target	Operator	Result
2026-06-14	Line 03 · Filler	Rockwell	21 min	≤ 30 min	m.lindberg	pass
2026-06-12	Mixer · Station 4	Beckhoff	18 min	≤ 30 min	j.okafor	pass
2026-05-30	Aeration · WWTP-N	Schneider	34 min	≤ 30 min	p.kovac	warn
2026-05-17	Press · Line 12	Siemens	12 min	≤ 20 min	a.tanaka	pass
2026-05-02	Conveyor · Line 09	Rockwell	—	≤ 30 min	s.müller	fail
2026-04-18	Robot cell · 2B	Beckhoff	9 min	≤ 15 min	m.lindberg	pass

Generated by VEM · signed · immutable · downloadable as CSV/PDF

Illustrative example of a VEM restore-drill log — sample data, not a record from any specific site.
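The pass, warn and fail results in the sample log follow from comparing the measured RTO against the target. The sketch below reproduces that grading under two assumptions: a missing RTO (the dash in the log) means the restore did not complete, and any completed restore that misses its target is a warning. The function name is illustrative.

```python
from typing import Optional


def drill_result(rto_actual_min: Optional[int], rto_target_min: int) -> str:
    """Grade one restore drill against its RTO target."""
    if rto_actual_min is None:
        return "fail"  # assumption: no measured RTO means the restore did not complete
    if rto_actual_min <= rto_target_min:
        return "pass"
    return "warn"      # completed, but slower than the target


# (asset, RTO actual in minutes, RTO target in minutes) from the sample log
drills = [
    ("Line 03 · Filler", 21, 30),
    ("Mixer · Station 4", 18, 30),
    ("Aeration · WWTP-N", 34, 30),
    ("Press · Line 12", 12, 20),
    ("Conveyor · Line 09", None, 30),
    ("Robot cell · 2B", 9, 15),
]
results = [drill_result(actual, target) for _, actual, target in drills]
print(f"{results.count('pass')} / {len(results)} passed")  # 4 / 6 passed
```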

Change control is the other half of disaster recovery

Most "disasters" aren't fried controllers; they're unauthorised changes that nobody documented. A technician tweaks a setpoint during a Tuesday troubleshoot, never writes it down, and the line starts behaving oddly two weeks later. Without versioned, attributed history, the root cause is invisible.

This is where dedicated PLC backup software earns its keep beyond pure DR — every change is captured, attributed and reversible. A drift between the running controller and the approved baseline becomes a notification, not a forensic project.
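At its simplest, drift detection is a comparison between the approved baseline and what the controller is running now. The sketch below compares two dictionaries of tag values; real controller programs carry far more than setpoints, and the tag names here are hypothetical.

```python
from typing import Optional


def find_drift(
    baseline: dict[str, float],
    running: dict[str, float],
) -> dict[str, tuple[Optional[float], Optional[float]]]:
    """Return {tag: (baseline_value, running_value)} for every tag that differs.

    A tag missing on one side shows up with None on that side.
    """
    drift = {}
    for tag in sorted(baseline.keys() | running.keys()):
        approved = baseline.get(tag)
        actual = running.get(tag)
        if approved != actual:
            drift[tag] = (approved, actual)
    return drift


baseline = {"Fill_Setpoint": 500.0, "Conveyor_Speed": 1.2}
running = {"Fill_Setpoint": 505.0, "Conveyor_Speed": 1.2}
print(find_drift(baseline, running))  # {'Fill_Setpoint': (500.0, 505.0)}
```

An empty result means the controller matches the baseline; anything else is the notification the maintenance team should see before the line starts misbehaving.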

Walk through your own DR posture with us

We don't send gated PDFs. If the playbook above is useful, the next step is a working session — tier your assets, review your current backup setup, and see where VEM fits.

