DR drills: how often to run them and how
The four levels of DR drill (tabletop, walkthrough, partial failover, full failover), the cadences recommended by NIST and ISO 22301, and an operational checklist.
TL;DR
A DR plan that has never been tested should be assumed not to work. There are four levels of drill (tabletop, walkthrough, partial failover, full failover), and mature practice combines them in an annual calendar. NIST SP 800-34 and ISO 22301 recommend at least one full drill per year; in sectors subject to NIS2, that recommendation becomes a de facto requirement.
The four drill levels
1. Tabletop drill (annual)
A 2-hour meeting where a facilitator presents a scenario ("ransomware encrypted the file server at 14:00") and participants describe what they would do. No system is touched. The cost is very low and the value is high: it surfaces outdated runbooks and gaps in responsibility.
2. Technical walkthrough (semi-annual)
A sysadmin walks through the DR runbook against a single non-production workload. The goal is to verify that credentials work, the documented commands run, and external dependencies (DNS, KMS, SaaS) are reachable.
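The reachability part of a walkthrough can be scripted so it runs the same way every time. The following is a minimal sketch, assuming each external dependency can be described as a host and a TCP port; the function names and the example entries in the usage note are hypothetical and not part of any specific product.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dependencies(
    deps: dict[str, tuple[str, int]], timeout: float = 3.0
) -> dict[str, bool]:
    """Map each dependency name to whether it resolves and accepts a TCP connection."""
    return {
        name: can_resolve(host) and can_connect(host, port, timeout)
        for name, (host, port) in deps.items()
    }
```

A walkthrough script could call `check_dependencies({"dns": ("10.0.0.53", 53), "kms": ("kms.example.internal", 443)})` (placeholder addresses) and record any entry that comes back `False` as a deviation from the runbook. A successful TCP connection only shows that the port is open; it does not prove that the credentials or the service behind it work.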
3. Partial failover (quarterly)
A real failover of a single workload (e.g. the file server) to the DR site, without rerouting production traffic. It measures the actual RTO of that system and the integrity of the restored data. Sefthy includes this level in every plan.
4. Full failover (annual)
A failover of the entire stack in a test environment. This is the drill that produces the company's real RTO number. Allow half a day of preparation plus 4-6 hours of execution, and schedule it outside business hours.
Recommended cadence
Reference standards:
- NIST SP 800-34: annual full drills + additional drills when critical systems change.
- ISO 22301:2019: exercises at planned intervals, at least annually.
- NIS2 Article 21: periodic drills, without specifying frequency.
In practice, an effective rotation is:
- 1 tabletop per year;
- 2 walkthroughs per year;
- 4 partial failovers per year (one per critical workload in rotation);
- 1 full failover per year.
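The rotation above can be turned into a yearly calendar with a few lines of code. This is only a sketch: the month assigned to each drill is an arbitrary assumption, and `build_drill_calendar` is a hypothetical helper name.

```python
from itertools import cycle


def build_drill_calendar(critical_workloads: list[str]) -> dict[int, list[str]]:
    """Return {month: [drills]} for one year of the rotation described above."""
    if not critical_workloads:
        raise ValueError("at least one critical workload is required")

    calendar: dict[int, list[str]] = {month: [] for month in range(1, 13)}
    workloads = cycle(critical_workloads)

    # One tabletop per year (month chosen arbitrarily for this sketch).
    calendar[1].append("tabletop")

    # Two technical walkthroughs, six months apart.
    for month in (3, 9):
        calendar[month].append("technical walkthrough")

    # Four partial failovers, one per quarter, rotating through the workloads.
    for month in (2, 5, 8, 11):
        calendar[month].append(f"partial failover: {next(workloads)}")

    # One full failover per year.
    calendar[10].append("full failover")
    return calendar
```

With fewer than four critical workloads the rotation wraps around, so some workloads are drilled more than once a year; with more than four, the remaining workloads would have to carry over to the following year.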
What to measure in each drill
Five minimum metrics:
- real RTO per workload (timed);
- real RPO (age of the most recent data present in the recovered system);
- data integrity (hashes, counters, test transactions);
- client behaviour (reconnect time, errors);
- deviations from the runbook (unclear steps, missing dependencies).
Document the results in a standard report. It becomes the evidence auditors look for.
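The first three metrics reduce to simple arithmetic on timestamps and a hash comparison. The sketch below assumes three timestamps are recorded during the drill; the class and function names are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimings:
    outage_declared: datetime        # when the simulated incident started
    service_restored: datetime       # when the workload was usable again at the DR site
    last_record_recovered: datetime  # timestamp of the newest data in the recovered system


def measured_rto(t: DrillTimings) -> timedelta:
    """Real RTO: time from the declared outage to the restored service."""
    return t.service_restored - t.outage_declared


def measured_rpo(t: DrillTimings) -> timedelta:
    """Real RPO: how much data, measured in time, was lost before the outage."""
    return t.outage_declared - t.last_record_recovered


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def integrity_ok(source: bytes, restored: bytes) -> bool:
    """True if the restored copy hashes to the same value as the source."""
    return sha256_of(source) == sha256_of(restored)
```

For example, an outage declared at 14:00, service restored at 14:47 and a newest recovered record from 13:45 give a measured RTO of 47 minutes and a measured RPO of 15 minutes. For large files, the hash would normally be computed in chunks rather than on a single in-memory byte string.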
Common mistakes
- running drills in production without a fallback: a pointless risk. Use an isolated test environment.
- always drilling at the same time of day: you only find the problems that occur at that time. Vary it.
- drilling without backup staff: the senior engineer who knows everything will be on holiday on the day of the real incident.
- not closing corrective actions: open a ticket for each gap and track it to completion.
Sefthy DR Simulation
The DR Simulation feature lets you run failover drills without disturbing production: the recovered VM runs in an isolated VLAN and gets a test IP, and admins verify it with standard tools. Times and logs can be exported to PDF for audit.
FAQ
How long does a full failover for an average company take?
For a 50-100 employee company with 8-15 servers: 4-6 hours execution + 2-3 hours prep and debrief.
Can I drill in production?
Only partial failovers in genuinely isolated environments. Never full failovers. The risk of extending an outage is not worth the saved effort.
Do I need external vendors for drills?
For tabletop and walkthrough drills, no. For a full failover it helps to have an external consultant the first time: they become the reference point for subsequent audits.
For RTO calculation after a drill, see How to calculate RTO. For the snapshot vs continuous replication distinction in drills, read Snapshot vs continuous replication.