IT Disaster Recovery Plans

Explore top LinkedIn content from expert professionals.

  • When Azure launched, region pairs were the go-to for disaster recovery, modeled on the multi-datacenter strategies enterprises already ran. Since then, Availability Zones, Safe Deployment Practices, and region-of-choice replication have fundamentally changed the resilience equation. In my latest post, I break down four resilience patterns, from in-region HA with AZs to multi-region active/active, and explain when each one makes sense based on your workload's RTO, RPO, and cost constraints. If you're still defaulting to paired regions without evaluating whether AZ-based designs can simplify your architecture, this is worth a read. https://lnkd.in/gmPkfKWD

  • Alex Lanin

    U.S. Energy Grid & AI Infrastructure | Independent Research & Investment Analysis | AI Grid Insider

    Three strikes. Three AWS data centers. Two availability zones down — the redundancy model AWS built to survive any single failure, gone in one attack. AWS confirmed it in their own health dashboard: attacks “disrupted power delivery to infrastructure.” Not servers — power. A drone doesn’t need to hit the building. The substation outside is enough.

    This isn’t a war story. It’s a structural vulnerability hiding in plain sight across the entire industry. 200+ data centers across the Middle East. Yemen, Iraq, Iran, the Red Sea — regions where drones are already a standard tool of pressure. Drones are getting cheaper. Data centers are getting more expensive. That asymmetry is only going to widen.

    The industry has no answer — because no one ever asked the question. Tier III/IV, BICSI, Uptime Institute, EN 50600: not one standard contains the word “drone.” They were written for a world where threats arrive on foot.

    The solutions exist — they’re standard practice in military infrastructure:
    → Underground cable entries — you can’t hit what you can’t see
    → 3 independent power feeds from different directions — one strike doesn’t take the site down
    → BESS (battery energy storage) — keeps the facility alive while power is restored
    → Hardened substations — reinforced concrete instead of an open yard
    → Anti-drone EW systems (Dedrone, Aaronia) — jam GPS guidance up to 3 km out

    Cost: from $200K. The cost of two AZs down: orders of magnitude higher. AWS was the first confirmed case. The precedent is set. Which data centers are next depends on who prepares first.

    Have you already seen drone defense requirements appear in data center RFPs or site specs?

    Photo credit: Wikimedia Commons (AWS us-west-2, Oregon). A typical hyperscale data center campus: open substations, exposed power infrastructure, no standoff defense. In geopolitically stable regions, not a concern. In conflict zones, a potential single point of failure.

    #DataCenters #CriticalInfrastructure #PhysicalSecurity #EnergyResilience #AWS

  • The recent news of an AWS data center in the Middle East going down because of the war made me relive an experience from decades ago! I once helped build what we proudly called a best-in-class disaster recovery architecture. We did everything right — on paper.
    ✔️ Business Impact Analysis done
    ✔️ RTO & RPO agreed with stakeholders
    ✔️ Sophisticated tools deployed
    ✔️ DR site fully provisioned
    We were confident. Almost too confident. And then came the day that tested everything.
    A dual power supply failure hit our primary data center. Within minutes, 300+ servers went down abruptly. What followed was worse than downtime: critical application databases got corrupted, and then the DR site got corrupted too! Real-time transactions came to a complete standstill. With every passing hour, we lost millions of dollars in revenue. In that moment, all our architecture diagrams, tools, and planning meant one thing: NOTHING — because the system didn’t recover.
    What this experience taught me:
    1) Testing isn’t real until it’s brutal. Table-top simulations give comfort; full-scale failover drills expose truth. Test like it’s already failing:
    - Simulate real load
    - Introduce chaos scenarios
    - Assume components will fail unexpectedly
    2) DR is not a technology problem — it’s a systems problem. We focused heavily on tools and underestimated dependencies. Ensure:
    - End-to-end recovery (infra + app + data integrity)
    - Isolation between primary and DR (to avoid cascade failures)
    - Backup validation, not just backup completion (see the sketch after this post)
    3) Communication is your real recovery engine. In a crisis, confusion spreads faster than outages. Build:
    - Clear SOPs for business continuity
    - Pre-defined escalation paths
    - Regular cross-team drills (not just IT — include business teams)
    4) Leadership presence changes outcomes. War rooms are intense; fatigue, panic, and noise creep in. As a tech leader:
    - Your presence brings calm
    - Your clarity drives prioritization
    - Your energy keeps teams going
    Sometimes leadership is less about answers and more about stability.
    5) Assume your DR will fail — and design for that. This was the hardest lesson. Build layers:
    - Immutable backups
    - Offline recovery options
    - “Last resort” recovery playbooks
    Because resilience is not about one backup plan. It’s about what happens when that backup plan fails...
    Have you ever seen a #DR plan fail in real life? How often do you run full-scale disaster recovery drills? What’s the one thing most organizations still get wrong about resilience? Curious to hear real experiences — those are always more valuable than frameworks.
    #DR #disasterrecovery #drill #test #BCP #leadership #technology #resilience
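    A minimal sketch of the “backup validation, not just backup completion” point above: restore each backup to a scratch location and verify content hashes against a manifest captured at backup time. The paths, manifest format, and hash value here are hypothetical illustrations, not the author’s tooling:

    ```python
    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Stream a file through SHA-256 so large backups don't exhaust memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def validate_restore(manifest: dict[str, str], restore_dir: Path) -> list[str]:
        """Compare a restored backup against the manifest recorded at backup time.

        `manifest` maps relative file paths to the SHA-256 captured when the
        backup was written. An empty return value means the restore is verified,
        not merely "completed".
        """
        failures = []
        for rel_path, expected in manifest.items():
            restored = restore_dir / rel_path
            if not restored.exists():
                failures.append(f"missing after restore: {rel_path}")
            elif sha256(restored) != expected:
                failures.append(f"corrupted after restore: {rel_path}")
        return failures

    # Drill: restore last night's backup into a scratch area, then fail loudly
    # if anything differs from what was actually backed up.
    manifest = {"db/orders.dump": "<sha256 recorded at backup time>"}  # illustrative
    for problem in validate_restore(manifest, Path("/tmp/dr-restore-test")):
        print("DRILL FAILURE:", problem)
    ```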

  • Nishant Kumar

    Data Engineer @ IBM | AWS · Spark · Kafka · PySpark · Airflow | RAG · LLMs · GenAI | Event-Driven Data Platforms | 110K DE Community

    Disaster Recovery is one of the most misunderstood concepts in data and cloud engineering. I see the same confusion again and again — even in experienced teams. DR is not what most people think it is. The usual misconceptions:
    • “Multi-AZ is DR”
    • “S3 is already durable, so no DR needed”
    • “Snowflake Time Travel is enough”
    Let’s clear this up once and for all.

    First, one simple truth: High Availability (HA) ≠ Disaster Recovery (DR).
    • HA keeps your system running during small failures
    • DR brings your system back after big disasters
    If an entire cloud region goes down, HA won’t save you. Only DR will.

    DR is always about 2 questions:
    ➛ RPO (Recovery Point Objective): how much data loss is acceptable?
    ➛ RTO (Recovery Time Objective): how long can the system be down?
    Lower RPO + lower RTO = higher cost. There is no “free” DR.

    Now, how DR actually looks in a real data platform. Here’s a practical, end-to-end strategy:
    S3 (raw data lake)
    • Cross-region replication (a boto3 sketch follows this post)
    • Your source of truth must always survive
    RDS (transactional systems)
    • Multi-AZ for availability
    • Cross-region read replica for DR
    Redshift (analytics warehouse)
    • Automated snapshots
    • Cross-region snapshot copy
    • Restore when needed
    Snowflake (modern analytics)
    • Time Travel for human errors
    • Cross-region database replication for real DR
    Different layers. Different strategies. Same goal: business continuity.

    The biggest DR mistake I see: trying to give everything zero RPO and zero RTO. That’s not architecture. That’s overspending. Good DR design is about classifying data by criticality, not panic-replicating everything.

    One line to remember forever: design for High Availability to survive failures, and Disaster Recovery to survive disasters. If you’re working on cloud, data engineering, or system design, understanding this will instantly level you up.

    Follow for more 👋
    #DataEngineering #CloudArchitecture #DisasterRecovery #SystemDesign #AWS #Snowflake #DataModernization
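    To make the S3 layer concrete, here is how cross-region replication can be enabled with boto3. The bucket names, role ARN, and account ID are placeholders, and the destination bucket must already exist (versioned) in the DR region; treat this as a hedged sketch, not the author’s setup:

    ```python
    import boto3

    s3 = boto3.client("s3")

    SOURCE_BUCKET = "raw-data-lake"                              # hypothetical source
    DEST_BUCKET_ARN = "arn:aws:s3:::raw-data-lake-dr"            # hypothetical DR bucket
    REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-repl"  # hypothetical IAM role

    # Replication requires versioning on both source and destination buckets.
    s3.put_bucket_versioning(
        Bucket=SOURCE_BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Replicate every new object into the DR-region bucket.
    s3.put_bucket_replication(
        Bucket=SOURCE_BUCKET,
        ReplicationConfiguration={
            "Role": REPLICATION_ROLE,
            "Rules": [
                {
                    "ID": "dr-replication",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},  # empty filter = replicate the whole bucket
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": DEST_BUCKET_ARN},
                }
            ],
        },
    )
    ```

    Note that replication only covers objects written after the rule is in place; pre-existing data needs a one-off copy (for example, S3 Batch Replication).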

  • Wias Issa

    CEO at Ubiq | Board Director | Former Mandiant, Symantec

    The detailed incident report from AWS is now public, and it’s well worth a read (link in comments). Here’s a distilled summary of what went wrong, and what tech leaders should take away.

    What happened:
    1️⃣ A race condition in the DNS management system serving DynamoDB in US-EAST-1 led to endpoint resolution failures.
    2️⃣ That core database-service failure cascaded: new EC2 launches failed because the subsystem managing host leases depends on DynamoDB, and network components suffered health-check failures that rippled across load balancers.
    3️⃣ The impact was global. Apps and critical services relying on AWS saw outages, degraded performance, or intermittent failures.

    Why this matters:
    1️⃣ Concentration risk: even for a hyperscale provider like AWS, a failure in one region and one service (DynamoDB DNS) can cascade globally, turning a “cloud issue” into a business continuity event.
    2️⃣ Complex interdependencies: the issue wasn’t just database DNS; it propagated into compute, networking, automation, and customer-facing systems. We often design for failure at one layer but underestimate coupling across layers.
    3️⃣ Recovery complexity = resilience risk: recovery isn’t just restarting services; it’s clearing backlogs, restoring state, and ensuring downstream systems don’t remain impaired.

    My perspective/takeaways:
    1️⃣ Design for worst-case provider failure: not just “an AZ down,” but “core service in region down” and the ripple effects.
    2️⃣ Visibility and dependency mapping matter: know what services your stack depends on and how managed-service failures might cascade (a toy dependency-walk sketch follows this post).
    3️⃣ Recovery orchestration is as vital as fault tolerance: plan for backlog recovery, state cleanup, and cross-team communication.
    4️⃣ Cloud-vendor resilience is not infinite, and shared failure domains persist even in hyperscale clouds. Plan for multi-region or cross-provider fallback and clear internal recovery roles.
    5️⃣ Executive mindset and risk alignment: for C-suites, this is a reminder that infrastructure risk is business risk. Discuss cloud-failure modes at the board table, not just application risk.

    What this isn’t about: this isn’t about blaming AWS. The lesson is that even the largest provider can experience a systemic failure, and we can all learn from these experiences. And... it’s always DNS 😉
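    To make the dependency-mapping takeaway concrete, here is a minimal, self-contained sketch: model your stack as a dependency graph and walk it to estimate the blast radius of a managed-service failure. The service names and edges are invented for illustration:

    ```python
    from collections import deque

    # Hypothetical dependency map: service -> things it depends on.
    # In practice, generate this from IaC and observability data.
    DEPENDS_ON = {
        "checkout-api": ["dynamodb", "payments-svc"],
        "payments-svc": ["rds", "sqs"],
        "search-api":   ["opensearch"],
        "image-svc":    ["s3", "lambda"],
        "lambda":       ["dynamodb"],  # hidden coupling via a managed control plane
    }

    def blast_radius(failed: str) -> set[str]:
        """Return every service transitively impacted when `failed` goes down."""
        # Invert the edges: who depends on what.
        dependents: dict[str, list[str]] = {}
        for svc, deps in DEPENDS_ON.items():
            for dep in deps:
                dependents.setdefault(dep, []).append(svc)
        impacted: set[str] = set()
        queue = deque([failed])
        while queue:
            node = queue.popleft()
            for svc in dependents.get(node, []):
                if svc not in impacted:
                    impacted.add(svc)
                    queue.append(svc)
        return impacted

    # One regional service failing impairs three workloads, one of them only
    # through an indirect path: the kind of coupling that surprised teams
    # during the US-EAST-1 event.
    print(blast_radius("dynamodb"))  # {'checkout-api', 'lambda', 'image-svc'}
    ```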

  • Lalit Chandra Trivedi

    Railway Consultant || Ex GM Railways (Secy to Government of India’s grade) || Chairman Rail Division India (IMechE) || Empaneled Arbitrator - DFCC and IRCON || IEM at MSTC and Uranium Corp of India

    Navigating the Aftermath: Managing an AI-Powered Railway Post-Cyber Attack

    As artificial intelligence (AI) becomes the backbone of modern railway systems — optimizing routes, predicting maintenance, and enhancing safety — cyber threats have grown exponentially. A single attack can paralyze operations, disrupt schedules, and compromise passenger safety. Over the past five years, cyber incidents targeting railways have surged by over 220%, with cases like remote hijacking via radio frequencies in Poland (2023) and ticketing disruptions in Ukraine (2025) serving as stark reminders. Here’s a practical framework for managing an AI-driven railway system after a cyber attack.

    1️⃣ Immediate Containment — Isolate and Assess
    Once an intrusion is detected, the first step is to contain it. In AI-managed railways, this means isolating compromised systems — dispatch algorithms, predictive maintenance modules, or signaling networks — from the rest.
    • Activate a rapid response team: bring together cybersecurity experts, AI engineers, and railway operations specialists to identify attack vectors — whether phishing, ransomware, or signaling manipulation.
    • Eradicate the threat: reset credentials, patch vulnerabilities, and enforce multi-factor authentication (MFA). For AI systems, encrypt models during storage and transmission to prevent theft or tampering.
    The 2023 Polish incident, where 20 trains were halted via radio interference, proved how swift isolation minimizes damage.

    2️⃣ Recovery & Restoration — Rebuild with Resilience
    Containment alone isn’t enough; recovery demands validating both physical assets and AI model integrity.
    • System integrity checks: apply frameworks such as NIST CSF 2.0 to verify that automated safety functions are uncompromised before resuming operations.
    • Data recovery: restore from secure, encrypted backups; implement zero-trust access policies.
    • Business continuity: test disaster-recovery plans regularly, ensuring seamless switchovers to manual operations when required.
    Post-incident analysis should be mandatory — review logs, trace root causes, and update security policies, as seen in U.S. freight rail guidelines.

    3️⃣ Long-Term Prevention — Fortify the Future
    True resilience lies in learning from the breach and preventing recurrence.
    • Secure by design: embed cybersecurity throughout the AI lifecycle, from data collection to deployment.
    • Continuous monitoring: use AI itself for real-time threat detection and anomaly analysis, ensuring human oversight in decision loops.
    • Collaborate & comply: follow rail-specific cybersecurity standards and share threat intelligence across the ecosystem.

    AI can be both the target and the shield — its predictive power can detect attacks faster than humans ever could, provided its training data and parameters remain uncompromised.

    #CyberSecurity #AIRailway #InfrastructureManagement #Resilience #RailSafety #AIinTransport #CriticalInfrastructure

  • Alexander Abharian

    Scaling businesses on AWS | Reliable, efficient & secure cloud infrastructures | Founder & CEO of IT-Magic - AWS Advanced Consulting Partner | AWS Retail Competency

    Multi-AZ keeps your app online. It does not keep your business alive when firefighters cut the power.

    On March 1, AWS shared an incident in the UAE. Objects hit a data center. There were sparks. A fire. The fire department cut power to protect people. Recovery was measured in hours. Cloud is still physical:
    • Power
    • Fire
    • Access
    • Connectivity
    • Human safety decisions

    The problem starts earlier. Teams stop at Multi-Availability Zone and call it disaster recovery. Multi-AZ is availability inside one Region. Disaster recovery is a copy of the workload that can run somewhere else. If one AZ is down for hours, Multi-AZ helps only when:
    • You are deployed across AZs in reality
    • Your databases and external services are too

    If your critical path runs in one Region, you should consider disaster recovery in another Region. Business-first disaster recovery starts with two numbers:
    • RTO: how long can we be down?
    • RPO: how much data can we lose?

    Then you choose the model:
    • Backup and restore
    • Pilot light
    • Warm standby
    • Active/active

    For me, a minimum viable multi-Region setup looks like:
    • Backups or replication to a second Region
    • IaC and CI/CD that can deploy there without heroics
    • A tested failover path with DNS or routing, plus a clear runbook (see the DNS-failover sketch after this post)
    • Disaster recovery tests on a real cadence; quarterly already beats “never”

    Multi-AZ keeps you safe from a broken rack. Disaster recovery keeps you in business when a whole building is dark. If your primary Region is degraded for a few hours, do you still sell, or do you wait and watch logs refresh?

    If you want to review your AWS DR plan from a business angle, let’s talk.

    #AWS #DisasterRecovery #BusinessContinuity #CloudArchitecture
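    One way to implement the “tested failover path with DNS” item is Route 53 failover routing: a primary record gated by a health check, and a secondary record pointing at the standby Region. A minimal boto3 sketch; the domain, IPs, zone ID, and health-check ID are placeholder assumptions:

    ```python
    import boto3

    route53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"                  # hypothetical zone
    HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # probes the primary Region

    def upsert_failover_record(set_id, failover, ip, health_check_id=None):
        """Create or update one half of a PRIMARY/SECONDARY failover pair."""
        record = {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": failover,               # "PRIMARY" or "SECONDARY"
            "TTL": 60,                          # short TTL so failover propagates quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    # Route 53 answers with the primary while its health check passes,
    # and flips to the DR Region automatically when it fails.
    upsert_failover_record("primary", "PRIMARY", "203.0.113.10", HEALTH_CHECK_ID)
    upsert_failover_record("dr", "SECONDARY", "198.51.100.10")
    ```

    DNS only redirects traffic; the runbook still matters, because the standby Region must already be able to serve what arrives.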

  • Akhil Mishra

    Tech Lawyer for Fintech, SaaS & IT | Contracts, Compliance & Strategy to Keep You 3 Steps Ahead | Book a Call Today

    Every freelancer in the IT industry has gone through this. They work with international clients and then suffer from the issues caused by different time zones. You’re building sites in the morning. Taking client calls at midnight. Replying to “urgent” messages during lunch. All while pretending this is normal.
    But you’re not being flexible. You’re being available. And they’re not the same thing.
    The fix is clarity, not hustle. Structure, not burnout. There are a few basic things you can do for next time:
    1/ Set your hours like a business. Not “when I’m free.” Not “when they need me.” Your hours. In your time zone. Write it. Share it. Stick to it. Example: “I work Mon–Fri, 9am–5pm IST. Replies within 24 hours during this window.”
    2/ Put it in the contract. Not a vague email. A real clause. For example: “Freelancer’s working hours are 9am–5pm IST. Communication outside these hours may be delayed. For emergencies, phone contact is allowed - only for critical issues.”
    3/ Use tools that do the talking. Calendly. Auto-responders. These save you from typing “Sorry I missed this” 20 times a week. Let software protect your sleep.
    4/ Say it before they assume it. Time difference? Mention it. In-person work? Mention it. You’re not ignoring them - you’re just offline.
    5/ Keep receipts. Confirm availability by email. Screenshot the agreement. So when the drama hits, you have the proof.
    This is how you stay respected in your field. Boundaries don’t push clients away. They build trust. So protect your time, or someone else will take it.
    ---
    ✍ Tell me below: What’s one boundary you wish you had set earlier in your freelance career?

  • Sanjay Chandra

    The Databricks + Fabric guy on LinkedIn · Helping data engineers think in production, not just in tutorials · LinkedIn Top Voice ’24 & ’25

    When I first learned Azure Data Factory (ADF), I focused on getting pipelines to run. What I didn’t realise? The real challenge is making sure they don’t break quietly at 2 AM. Here are 12 error handling techniques in ADF:

    1. Try-Catch-Finally Pattern
    This classic structure lets you execute primary activities (Try), define actions upon failure (Catch), and specify cleanup tasks that run regardless of the outcome (Finally), ensuring a robust pipeline.

    2. Activity Output
    Access the result of a preceding activity to make decisions. Use the expression @activity('ActivityName').Output to retrieve its JSON output, which is crucial for custom validation and conditional logic.

    3. Activity Error Details
    When an activity fails, capture the specific error details. The expression @activity('ActivityName').Error provides the error code and message, which is essential for precise logging within a Catch block.

    4. Retry Policy
    Automatically re-run a failed activity. You can configure the retry count and the interval between attempts, making your pipeline resilient to transient issues like temporary network failures or database locks (see the snippet after this post).

    5. Timeouts
    Set a maximum run duration for an activity. If it exceeds this time, ADF marks it as "TimedOut" and fails it, preventing a single long-running task from stalling the entire pipeline.

    6. Validation Activity
    Proactively check for a condition before proceeding. This activity can verify if a file exists or if a query returns a specific value, failing early if the prerequisite is not met.

    7. Fault Tolerance (Copy Activity)
    In a Copy Activity, configure settings to skip or log incompatible rows (e.g., data type mismatches) instead of failing the entire operation. This is essential for handling inconsistent source data.

    8. Custom Error Logging
    Use a Script or Stored Procedure activity within a Catch block to write detailed error information to a log table. Capture helpful diagnostics like @pipeline().RunId for auditing purposes.

    9. Alerting and Notifications
    Integrate your data factory with Azure Monitor to create alerts. You can configure action groups to automatically send email or Teams notifications when a pipeline fails, enabling prompt incident response.

    10. Global Parameters
    Define environment-specific settings, like connection strings or logging levels, as global parameters. This allows you to manage configurations centrally, reducing errors when deploying across different environments (Dev, QA, Prod).

    11. Modular Error Handling
    Use an "Execute Pipeline" activity to encapsulate logic. A child pipeline can have its own error handling and pass its final status (success or failure) back to the parent pipeline.

    12. Data Consistency Verification
    After a copy activity, use a Lookup or Script activity to query source and sink row counts. This practice validates data integrity and helps you catch silent data loss failures.
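    As a concrete companion to techniques 1, 4, and 8, this is roughly what a retry policy plus a failure-path (Catch) dependency look like in ADF pipeline JSON. The activity names, stored procedure, and property values are invented for illustration, and the snippet is trimmed (no linked services or datasets), so treat it as a sketch rather than a deployable pipeline:

    ```json
    {
      "activities": [
        {
          "name": "CopySourceToStaging",
          "type": "Copy",
          "policy": {
            "retry": 3,
            "retryIntervalInSeconds": 60,
            "timeout": "0.01:00:00"
          },
          "typeProperties": {
            "source": { "type": "DelimitedTextSource" },
            "sink": { "type": "AzureSqlSink" }
          }
        },
        {
          "name": "LogCopyFailure",
          "type": "SqlServerStoredProcedure",
          "dependsOn": [
            {
              "activity": "CopySourceToStaging",
              "dependencyConditions": [ "Failed" ]
            }
          ],
          "typeProperties": {
            "storedProcedureName": "dbo.LogPipelineError",
            "storedProcedureParameters": {
              "RunId": { "value": "@pipeline().RunId", "type": "String" },
              "ErrorMessage": {
                "value": "@activity('CopySourceToStaging').Error.message",
                "type": "String"
              }
            }
          }
        }
      ]
    }
    ```

    The "Failed" dependency condition is what makes LogCopyFailure the Catch branch; retries run first, so the log activity fires only once the retries are exhausted.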

  • Lately I’ve been noticing a recurring claim that ransomware is now directly targeting operational technology (OT). However, our data tells a different story. In most cases where an OT process is disrupted by ransomware, it’s collateral damage from IT assets being hit. That usually happens in one of two ways:
    • The OT depends on the affected IT assets and can no longer function
    • Engineers proactively shut down or disconnect the OT environment out of caution, perhaps because they don’t trust the architecture or controls in place
    What we don’t see are ransomware attacks that intentionally target OT assets. We’d know. We developed a technique that does exactly that, and we haven’t seen anything like it in the wild. In fact, we rarely even see ransomware make its way into OT environments.
    In the first image below, check the “Data Encrypted for Impact” node on the right. None of those impacts come from Category 2 attacks. Just one strand flows from a single Type 1c attack: a ransomware event that hit the SCADA system at a meat processing facility in Spain. That’s the only known case in our dataset, and we don’t even know whether the OT was deliberately targeted or hit by coincidence. Even ransomware that spills from IT into OT (Type 1b) makes up just 8% of the total.
    The second image gives a broader view of adversary types by category. You’ll notice there are no criminal actors involved in Category 2 attacks at all.
    So when someone says ransomware is hitting OT, it’s worth asking if they mean OT is being *targeted*, or just *impacted*. Those are not the same problem, and they don’t require the same response. Funny how certain hyperbolic narratives fall apart once you’re honest about the data.
    ---
    Visuals below 👇 TTP to Impact Overview Sankey Chart
    #OTsecurity #ICSsecurity
