Data Backup and Disaster Recovery Best Practices for Enterprise Data Infrastructure
Enterprise downtime costs between $5,600 and $9,000 per minute depending on industry. The difference between a two-hour recovery and a two-day recovery is almost always the quality of planning done before the incident, not the quality of the response during it.
Backup and disaster recovery planning for data infrastructure is routinely treated as an IT afterthought. It gets done once, filed somewhere, and revisited when there’s an incident, or when an auditor asks for it. That approach is acceptable for recovering a file server. It’s not acceptable for recovering a data warehouse, a set of operational pipelines, or an analytics platform that business processes depend on in real time.
This guide addresses backup and DR specifically for enterprise data infrastructure: cloud data warehouses, data lakes, operational databases behind ERP and CRM systems, pipeline configurations, and ML model artifacts. It also covers the regulatory requirements that now apply to DR planning in regulated industries.
Key Takeaways
- RTO and RPO must be defined per system, not organisation-wide; the cost of tighter objectives rises steeply
- Immutable backups are the primary defense against ransomware that targets backup data; attacks on backups rose 75% in 2024
- A DR plan you haven’t tested is not a DR plan; only 23% of enterprises have tested theirs in the past 12 months
Core Concepts: RTO and RPO
Every backup and DR conversation starts with two definitions. If these aren’t agreed and documented per system, the DR plan is incomplete.
Recovery Time Objective (RTO) is the maximum acceptable downtime after a failure: the time between an incident and the system being back in operation. An RTO of four hours means the business can tolerate up to four hours of outage. An RTO of 15 minutes means the system must be restored within 15 minutes of a failure.
Recovery Point Objective (RPO) is the maximum acceptable data loss, measured as time. An RPO of one hour means the business can tolerate losing up to one hour of data. An RPO of zero means the system must recover with no data loss, which in practice requires continuous (typically synchronous) replication.
Define RTO and RPO by system and business criticality. An ERP system handling live orders has different requirements than a quarterly reporting database. A real-time operational dashboard has different requirements than a weekly analytics report. Generic, organisation-wide RTO/RPO targets don’t work: they either over-invest in protection for low-criticality systems or under-protect critical ones.
Steep cost of tighter objectives. Cost does not scale linearly with the objective. As a rough guide, moving from a four-hour RTO to a one-hour RTO can double infrastructure cost, and a 15-minute RTO may cost five to 10 times more than a four-hour RTO. Define requirements from business impact, not from an aspirational desire for zero downtime.
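One way to keep per-system objectives honest is to hold them in a small, machine-readable register rather than in a document. The sketch below is illustrative only: the system names and targets are hypothetical, and `RecoveryObjective`, `meets_objective`, and `missing_objectives` are names invented for this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    system: str
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured as time


# Hypothetical register: systems and targets are examples, not recommendations.
OBJECTIVES = [
    RecoveryObjective("erp-orders-db", rto=timedelta(minutes=15), rpo=timedelta(0)),
    RecoveryObjective("ops-dashboard", rto=timedelta(hours=1), rpo=timedelta(minutes=15)),
    RecoveryObjective("quarterly-reporting-db", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
]


def meets_objective(obj: RecoveryObjective, downtime: timedelta, data_loss: timedelta) -> bool:
    """True if an incident (or a test) stayed within both the RTO and the RPO."""
    return downtime <= obj.rto and data_loss <= obj.rpo


def missing_objectives(inventory: list[str], objectives: list[RecoveryObjective]) -> list[str]:
    """Systems in the inventory that have no documented RTO/RPO."""
    defined = {o.system for o in objectives}
    return sorted(set(inventory) - defined)
```

Running `missing_objectives` against the system inventory is a quick way to find systems that are silently inheriting a generic target.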
The 3-2-1 Rule
The 3-2-1 backup rule is the minimum viable backup posture for any organisation:
- Three copies of data (the original plus two backups)
- Two different storage media (cloud object storage + a different storage tier, or on-premise + cloud)
- One offsite copy (geographically separate from the primary)
This rule protects against the most common failure modes: hardware failure, site-level disaster, and accidental deletion. If your current backup strategy doesn’t meet 3-2-1, it’s insufficient.
The 3-2-1-1-0 extension. Modern backup standards have extended the rule to account for ransomware:
- 3 copies, 2 media, 1 offsite, plus 1 immutable copy (cannot be altered or deleted), and 0 errors on the last backup test
The immutable copy requirement has become essential as ransomware attacks increasingly target backup data. The zero-error requirement enforces backup testing: an untested backup is not a backup.
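The rule is simple enough to check mechanically against a backup inventory. The following is a minimal sketch, assuming each copy is described by a hypothetical `Copy` record; the field names and the `check_3_2_1_1_0` helper are inventions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    location: str    # e.g. "primary-dc" or "eu-west-1" (illustrative values)
    medium: str      # e.g. "block-storage", "object-storage", "tape"
    offsite: bool    # geographically separate from the primary
    immutable: bool  # cannot be altered or deleted during retention


def check_3_2_1_1_0(copies: list[Copy], last_restore_test_errors: int) -> dict[str, bool]:
    """Return pass/fail for each element of the 3-2-1-1-0 rule.

    `copies` includes the original data as one of the copies.
    """
    return {
        "3 copies": len(copies) >= 3,
        "2 media": len({c.medium for c in copies}) >= 2,
        "1 offsite": any(c.offsite for c in copies),
        "1 immutable": any(c.immutable for c in copies),
        "0 errors": last_restore_test_errors == 0,
    }
```

A result with any `False` value identifies exactly which part of the rule the current posture fails.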
Backup Types
Different backup types suit different systems and recovery requirements.
Full backup. A complete copy of all data. Highest storage and time requirement. Simplest and fastest to restore from, because there are no dependencies on prior backups. Suitable for smaller datasets or systems with low change frequency.
Incremental backup. Only the data that changed since the last backup (full or incremental) is copied. Minimal storage and time per backup cycle. Slower to restore: recovery requires the last full backup plus every incremental since. Used for high-volume systems where full backups are impractical.
Differential backup. All changes since the last full backup. More storage than incremental but faster to restore (only two backups needed: last full plus last differential). A practical middle ground for medium-volume systems.
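The restore-time difference between incremental and differential backups comes down to how many backups must be applied. A minimal sketch of that logic follows, assuming a backup history that uses either incrementals or differentials after each full backup, not a mix; `Backup` and `restore_chain` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    id: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[Backup]:
    """Backups to apply, in order, to restore to the most recent state.

    `history` is ordered oldest first.
    """
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    start = full_positions[-1]
    since_full = history[start + 1:]

    differentials = [b for b in since_full if b.kind == "differential"]
    if differentials:
        # A differential holds all changes since the full: only the latest is needed.
        return [history[start], differentials[-1]]
    # Incrementals each hold changes since the previous backup: all are needed.
    return [history[start]] + [b for b in since_full if b.kind == "incremental"]
```

With incrementals the chain grows with every backup cycle, and a single damaged incremental breaks everything after it; with differentials the chain is always two backups long.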
Continuous Data Protection (CDP). Real-time replication of every write to a secondary location. Near-zero RPO. Higher infrastructure cost. Used for systems where data loss of even a few minutes is unacceptable, such as core financial transaction systems and real-time operational databases.
The right type depends on the RPO requirement. CDP for zero-RPO systems. Full or differential for daily-RPO systems. Incremental for high-volume systems where daily full backups are impractical.
Immutable Backups: The Ransomware Defense
Ransomware attacks targeting backup infrastructure increased 75% in 2024. Attackers understand that an organisation with accessible, restorable backups can recover without paying a ransom, so they encrypt or delete the backups first.
What immutability means. An immutable backup cannot be altered or deleted after it has been written, for a defined retention period. Even a user with administrative credentials cannot delete an immutable backup before its retention period expires. This breaks the ransomware attack chain.
Why it’s the only reliable defense. Ransomware that gains domain admin access can delete traditional backups. It cannot delete immutable backups stored with write-once-read-many (WORM) policies. The backup exists regardless of what happens to the primary environment.
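Cloud immutability options. AWS S3 Object Lock provides WORM storage at the object level, with two retention modes: compliance mode (retention cannot be shortened or removed by any user, including the account root user) and governance mode (users with specific permissions can bypass it). Azure Blob Storage immutability policies work at the container or blob-version level; a time-based retention policy can be left unlocked for testing or locked, after which it cannot be deleted and its retention period can only be extended. For ransomware protection, use compliance mode on S3 or a locked policy on Azure.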
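As an illustration of the S3 option, the sketch below writes a backup object with a compliance-mode retention date. It assumes a boto3 S3 client is passed in and that the bucket was created with Object Lock enabled; the helper names `object_lock_params` and `upload_immutable`, and the bucket and key in the usage note, are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def object_lock_params(bucket, key, body, retain_days, now=None):
    """Build put_object arguments that apply a compliance-mode retention period."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # COMPLIANCE: retention cannot be shortened or removed until the date passes.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


def upload_immutable(s3_client, bucket, key, body, retain_days, now=None):
    """Upload one backup object with Object Lock retention applied.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    params = object_lock_params(bucket, key, body, retain_days, now)
    return s3_client.put_object(**params)
```

A call might look like `upload_immutable(boto3.client("s3"), "example-backups", "warehouse/2025-01-01.dump", data, retain_days=30)`. Choose the retention period carefully: in compliance mode a mistake cannot be undone until the date expires.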
Air-gapped backups. For the most critical data, an air-gapped backup, physically or logically isolated from the network, provides the strongest guarantee. Cloud vault tiers with no internet-accessible endpoint provide logical air-gapping at lower cost than physical isolation.
A mid-sized logistics company ran 40 virtual machines and a data warehouse supporting 300 users. Ransomware encrypted the primary environment and the traditional backup files. Recovery from bare metal took 11 days. Business continuity cost exceeded $2.4 million. Post-incident analysis showed that S3 Object Lock, a $400/month addition to their existing cloud backup spend, would have reduced recovery time to four hours. The immutable copy would have been untouched.
Data Infrastructure-Specific Backup Considerations
General IT backup guides focus on file servers and databases. Enterprise data infrastructure has additional components that require specific approaches.
Cloud data warehouses (Snowflake, BigQuery, Redshift). These platforms provide built-in redundancy and backup features. Snowflake’s Time Travel allows point-in-time recovery with a default retention of one day, configurable up to 90 days for permanent objects on Enterprise Edition and above. Fail-safe provides an additional seven days of recovery for permanent tables after Time Travel expires, but only Snowflake support can restore from it. BigQuery provides time travel (seven days by default), user-created table snapshots, and cross-region dataset replication. These built-in features are the first layer of protection: understand their limits and supplement where required.
Data lakes. Object storage (S3, Azure Data Lake, GCS) provides high durability by default, but durability is not backup. Enable versioning on critical data lake buckets to retain previous versions of objects. Configure lifecycle policies to transition older versions to lower-cost storage tiers. Cross-region replication for critical data provides geographic redundancy.
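For S3-based data lakes, versioning and a lifecycle rule for older versions can be set with two API calls. This is a minimal sketch, assuming a boto3 S3 client is passed in; the 30-day and 365-day thresholds, the rule ID, and the helper names are illustrative assumptions, not recommendations.

```python
def versioning_and_lifecycle(bucket, transition_after_days=30, expire_after_days=365):
    """Build arguments for put_bucket_versioning and put_bucket_lifecycle_configuration."""
    if expire_after_days <= transition_after_days:
        raise ValueError("old versions must expire after they transition")
    versioning = {
        "Bucket": bucket,
        "VersioningConfiguration": {"Status": "Enabled"},
    }
    lifecycle = {
        "Bucket": bucket,
        "LifecycleConfiguration": {
            "Rules": [
                {
                    "ID": "tier-and-expire-old-versions",  # hypothetical rule name
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # empty prefix: apply to the whole bucket
                    # Move superseded versions to a cheaper storage class...
                    "NoncurrentVersionTransitions": [
                        {"NoncurrentDays": transition_after_days, "StorageClass": "GLACIER"}
                    ],
                    # ...and delete them once they fall outside the retention window.
                    "NoncurrentVersionExpiration": {"NoncurrentDays": expire_after_days},
                }
            ]
        },
    }
    return versioning, lifecycle


def protect_bucket(s3_client, bucket):
    """Enable versioning, then apply the lifecycle rule for noncurrent versions."""
    versioning, lifecycle = versioning_and_lifecycle(bucket)
    s3_client.put_bucket_versioning(**versioning)
    s3_client.put_bucket_lifecycle_configuration(**lifecycle)
```

The expiration window is effectively the recovery window for overwritten or deleted objects, so it should be derived from the retention requirement rather than from storage cost alone.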
Operational databases (ERP/CRM). For databases backing ERP and CRM systems, continuous replication using write-ahead log (WAL) streaming achieves near-zero RPO. Point-in-time recovery within the backup retention window covers most failure scenarios. Test recovery to a specific timestamp, not just to the latest backup.
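Point-in-time recovery works by restoring the most recent base backup taken before the target timestamp and replaying the log forward from there. The sketch below models only that selection logic, not a real database restore; `plan_pitr` and its parameters are hypothetical names for this example.

```python
from datetime import datetime, timedelta


def plan_pitr(base_backups: list[datetime],
              wal_archived_through: datetime,
              target: datetime) -> tuple[datetime, timedelta]:
    """Pick the base backup to restore and the span of log to replay.

    Returns (base backup timestamp, replay window up to the target).
    """
    candidates = [b for b in base_backups if b <= target]
    if not candidates:
        raise ValueError("target is earlier than the oldest retained base backup")
    if target > wal_archived_through:
        raise ValueError("target is later than the last archived log segment")
    base = max(candidates)  # latest base backup at or before the target
    return base, target - base
```

The two error cases are the ones a timestamp-based recovery test is meant to expose: a retention window shorter than expected, and log archiving that has silently fallen behind.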
Pipeline configurations and code. Infrastructure-as-code and pipeline definitions must be in version control (Git). This is not optional. If the Airflow DAGs, dbt models, Terraform configurations, and CI/CD pipelines exist only on a server, a server failure destroys the ability to rebuild the environment. Git is the backup for code. For hosted pipeline configurations (Fivetran, Airbyte), export and version-control configurations programmatically.
ML model artifacts. Production ML models require a model registry (MLflow, SageMaker Model Registry, Vertex AI Model Registry) that versions and stores all deployed models. Rolling back a model after a data quality incident or a drift failure requires the ability to restore a prior version. Treat model artifacts as data: back them up with the same rigour.
Disaster Recovery for Data Infrastructure
Backup is not disaster recovery. Backup stores copies of data. DR is the plan and infrastructure for restoring operations from those copies when the primary environment fails.
Multi-region deployment for critical analytics systems. Active-active or active-passive multi-region architectures provide geographic redundancy. For cloud data warehouses, cross-region replication keeps a secondary environment ready to serve if the primary region is unavailable. The cost is the replication overhead and the secondary environment. Weigh that against the RTO requirement.
DR runbooks. For every critical system, a DR runbook documents: the system components, the backup locations, the exact recovery steps (including commands), the expected recovery time, and the contact list for escalation. Runbooks must be stored outside the primary environment: if the system is unavailable, the runbook must still be accessible.
Failover testing. A DR plan that has never been tested has unknown reliability. Test before you need it. The steps that look correct on paper regularly fail in practice: the secondary database is out of date, the authentication configuration doesn’t match, the networking rules weren’t replicated. These failures are found in tests rather than in incidents, but only if you test.
Business process validation in recovery exercises. Technical failover is necessary but not sufficient. After technical recovery, validate that the business processes that depend on the system actually function correctly: a finance user runs the period-end report and it produces correct numbers; an operations team member creates and processes an order end-to-end. Technical recovery without functional validation is incomplete.
Regulatory Requirements
Several major regulatory frameworks now include explicit requirements for backup and disaster recovery. These are not guidelines; they are compliance obligations.
DORA (Digital Operational Resilience Act). EU regulation effective January 2025 for financial services firms. DORA requires documented ICT incident response and recovery plans, annual ICT DR testing (full and partial), and reporting obligations for major ICT incidents. Organisations subject to DORA must be able to demonstrate tested DR capability to regulators.
GDPR. Backup copies that contain personal data are subject to GDPR. Retention limits apply: you cannot hold personal data in backups indefinitely. Deletion requests (right to erasure) must be fulfilled, including in backup copies, which requires either backup systems that support selective deletion or a clear legal basis for retention. GDPR also requires that backup data be protected with appropriate security measures.
SOX (Sarbanes-Oxley). Financial data must be recoverable for audit purposes. SOX requires documented, tested recovery procedures for financial systems. Audit evidence includes documentation of backup procedures, retention policies, and test results.
Industry-specific requirements. Healthcare (HIPAA) requires ePHI availability and recovery documentation. Payment card processing (PCI DSS) requires backup and recovery procedures for cardholder data environments. Financial services outside the EU face equivalent national requirements in most jurisdictions.
Tobias, Head of Infrastructure at a fintech firm operating in Germany and the Netherlands, faced his first DORA audit six months after the regulation took effect. The technical backup procedures were solid. What the auditors flagged: no documented annual test of the DR plan, no formal RTO/RPO targets per system, and no formal incident response plan that included data system recovery. Three months of documentation and testing work later, the firm passed a follow-up audit. The cost of that remediation work exceeded the cost of having done it right initially.
Testing Your DR Plan
Only 23% of enterprises have tested their DR plan in the past 12 months. The other 77% are relying on a plan of unknown reliability.
How often to test. Annually as a minimum for all systems. Quarterly for critical systems (those with RTOs under four hours or RPOs under one hour). Also after any significant infrastructure change, such as new data warehouse provisioning, pipeline architecture changes, or a major cloud migration.
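The schedule above can be expressed as a small rule that a monitoring job evaluates per system. This is a minimal sketch under stated assumptions: "quarterly" is approximated as 91 days and "annually" as 365, and `dr_test_interval` and `dr_test_due` are hypothetical names.

```python
from datetime import date, timedelta


def dr_test_interval(rto: timedelta, rpo: timedelta) -> timedelta:
    """Quarterly for critical systems (RTO < 4h or RPO < 1h), otherwise annually."""
    critical = rto < timedelta(hours=4) or rpo < timedelta(hours=1)
    return timedelta(days=91) if critical else timedelta(days=365)


def dr_test_due(last_test: date, rto: timedelta, rpo: timedelta,
                today: date, changed_since_last_test: bool = False) -> bool:
    """True if the system needs a DR test now."""
    if changed_since_last_test:
        # Any significant infrastructure change invalidates the previous test.
        return True
    return today - last_test >= dr_test_interval(rto, rpo)
```

Feeding the result into an existing alerting channel keeps overdue tests visible instead of leaving them to calendar reminders.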
Test types, from least to most rigorous.
- Tabletop exercise. Stakeholders walk through a failure scenario verbally: who does what, in what order, using which procedures. No technical execution. Takes two to four hours. Finds gaps in runbooks and escalation procedures.
- Partial failover. Restore a subset of systems (one database, one pipeline) to a DR environment and validate functionality. Higher confidence, without disrupting production.
- Full failover. Complete cutover of a production workload to the DR environment. Highest confidence. Requires a maintenance window and careful planning for systems with active users.
What commonly fails in DR tests.
- Backup data is older than expected (replication lag or backup job failures weren’t monitored)
- Authentication credentials in the DR environment are expired or wrong
- Network configuration differences prevent applications from connecting
- Runbook steps are out of date with current infrastructure
- Recovery time exceeds the RTO target, and the team didn’t know until the test
Document every test: the scenario, the systems tested, the recovery time achieved, the data point recovered to, and every gap identified. Track gap remediation to closure.
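A structured test record makes it easy to compare what was achieved against the targets and to see which gaps remain open. The sketch below is one possible shape; `DrTestRecord` and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class DrTestRecord:
    scenario: str
    systems: list[str]
    rto_target: timedelta
    rpo_target: timedelta
    recovery_time: timedelta  # time actually taken to restore service
    data_loss: timedelta      # age of the data point recovered to
    gaps: dict[str, bool] = field(default_factory=dict)  # gap description -> closed?

    def findings(self) -> list[str]:
        """Missed objectives and unresolved gaps from this test."""
        out = []
        if self.recovery_time > self.rto_target:
            out.append(f"RTO missed by {self.recovery_time - self.rto_target}")
        if self.data_loss > self.rpo_target:
            out.append(f"RPO missed by {self.data_loss - self.rpo_target}")
        out.extend(f"Open gap: {gap}" for gap, closed in self.gaps.items() if not closed)
        return out
```

An empty `findings()` list is the evidence an auditor is looking for; a non-empty one is the remediation backlog.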
AI-Enhanced Backup in 2026
Machine learning is changing how backup and DR systems operate.
Anomaly detection before backup. AI systems now analyse data patterns before backup runs to detect signs of corruption, ransomware encryption, or data exfiltration. Backing up corrupted or maliciously modified data contaminates the recovery point. Detection before backup preserves a clean recovery point.
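Commercial tools do not publish their detection models, so the sketch below substitutes a much simpler, well-known heuristic: encrypted data looks statistically random, so a sudden rise in byte-level Shannon entropy across files that are normally text or structured data is a warning sign. The 7.5 bits-per-byte threshold and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Flag data whose entropy is close to that of random bytes.

    The threshold is an assumed starting point, not a tuned value.
    """
    return shannon_entropy(data) >= threshold
```

Compressed files and media also score high, so a single flagged file proves nothing. The useful signal is a change against the baseline, for example the share of high-entropy files in a dataset jumping between two backup runs.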
Automated recovery orchestration. Orchestration platforms increasingly use AI to sequence recovery steps, manage dependencies between systems, and adjust the recovery order based on current environment state. This reduces the manual coordination overhead during DR events, and reduces the risk of human error in a high-stress situation.
Intelligent data classification. AI-driven classification identifies which data is critical, sensitive, or redundant, informing backup frequency, retention, and protection tier decisions. Classification that once required manual cataloguing can now be done automatically and updated continuously.
FAQ
How often should we back up our data warehouse? Define backup frequency based on the RPO for each system. For a production data warehouse that business decisions depend on daily, daily full or incremental backups combined with continuous transaction log backup provide an RPO of minutes. For less critical systems, daily or even weekly backups may be sufficient. Most cloud data warehouses include built-in point-in-time recovery; understand how far back those built-in features allow recovery before supplementing with external backups.
What’s the difference between high availability and disaster recovery? High availability (HA) keeps a system running through component failures by using redundancy: a redundant server, a database replica, a load balancer. HA is designed to prevent downtime from individual component failures. Disaster recovery is designed to restore operations after a significant failure, such as a site-level outage, a major ransomware attack, or a catastrophic data loss event. Both are needed; they are not substitutes for each other.
Does cloud storage count as backup? Not by itself: storing data in S3 without versioning, lifecycle policies, or replication is not a backup strategy. Cloud object storage with versioning and cross-region replication meets most backup requirements. Configure object versioning, define retention policies, test restore procedures, and ensure immutability for critical data. Then it qualifies as backup.
How should we protect data pipelines, not just the data itself? Store all pipeline code, configurations, and infrastructure definitions in version control (Git). Infrastructure-as-code means the entire environment can be rebuilt from the Git repository. CI/CD pipelines should be version-controlled as well. For hosted pipeline tools (Fivetran, Airbyte, dbt Cloud), export configurations programmatically and commit to version control. A pipeline that can’t be rebuilt quickly is as much a DR risk as a database that can’t be restored.
What are the DORA requirements for data backup? DORA requires EU financial services entities to have documented ICT incident response and recovery plans, tested annually at a minimum. The testing requirement includes realistic DR scenarios for critical systems. Evidence of testing must be available to regulators. DORA also requires documented RTO and RPO targets per system and a business impact analysis that justifies those targets.
How do we prioritise which systems to protect first? Conduct a business impact analysis. For each system, assess: what is the cost of one hour of downtime? Four hours? 24 hours? What data loss is tolerable? Which downstream systems depend on this system? Systems with the highest hourly impact cost and the tightest tolerance for data loss should have the most rigorous protection, the tightest RTO/RPO targets, and the most frequent testing.
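The outcome of a business impact analysis can be reduced to a simple ranking. This is a minimal sketch, assuming the three inputs named above have been estimated for each system; the `Impact` record, the `prioritise` helper, and the ordering of tie-breakers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    system: str
    hourly_downtime_cost: float       # estimated cost of one hour of outage
    tolerable_data_loss_hours: float  # how much data loss the business accepts
    downstream_dependents: int        # systems that depend on this one


def prioritise(impacts: list[Impact]) -> list[Impact]:
    """Order systems from most to least urgent to protect.

    Highest hourly cost first, then tightest data-loss tolerance,
    then the largest number of downstream dependents.
    """
    return sorted(
        impacts,
        key=lambda i: (-i.hourly_downtime_cost,
                       i.tolerable_data_loss_hours,
                       -i.downstream_dependents),
    )
```

The ranking is only as good as the estimates behind it, so the cost figures should come from the business owners of each system rather than from the infrastructure team.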
Conclusion
A backup plan you haven’t tested is not a backup plan. A DR runbook that exists only as a document in a shared drive, not as a set of validated, executable procedures, has unknown reliability when you need it most.
The investment required to build and test a solid backup and DR programme for data infrastructure is a fraction of the cost of a single significant recovery incident. Define RTO and RPO per system, implement the 3-2-1-1-0 backup standard with immutable copies for critical data, test quarterly, and keep the runbooks current.
Netodin designs data infrastructure with backup and disaster recovery requirements built in from the architecture stage, not retrofitted after an incident. To discuss DR requirements for your data platform, explore Netodin’s big data infrastructure services or contact the team directly.