Post-Incident Report – November 10, 2025

SUPERIOR and QUETICO servers


Key takeaways

  • Impact was limited to the SUPERIOR and QUETICO servers. The rest of WPCloud infrastructure remained fully operational.
  • Root cause was a faulty drive connector, not a physical disk failure. No data loss occurred.
  • Emergency failover came online quickly, but the reserve's capacity could not sustain this cluster's now much larger peak workload.
  • Production was restored after the connector was corrected and arrays resynchronised.
  • Failover hardware for this cluster is now an exact duplicate of production. A company-wide failover upgrade and process review is underway.

Summary of what happened

On November 10, 2025, a storage event on the cluster powering the SUPERIOR and QUETICO servers caused service disruption. Initial symptoms resembled a two-disk failure. To minimise downtime during peak business hours, we failed over to our reserve environment at 13:29 ET. The reserve accepted traffic but reached capacity limits under peak load, causing degraded performance.

A hands-on hardware intervention later identified a connector fault that made two drives appear absent. After correcting the connector and resynchronising, production passed integrity checks. We migrated traffic back starting at 18:00 ET and confirmed stable service at 18:40 ET. We kept the incident open for extended monitoring and closed it at 22:28 ET.

No client data was lost. Data remained intact on the production drives, on the failover, and in redundant off-site backups.


Who was impacted and for how long

Only sites hosted on the SUPERIOR and QUETICO servers were affected. This was a very small subset of hosted sites. All other WPCloud servers and clusters remained available.

  • Primary disruption window: 2025-11-10 12:22 ET to ~18:40 ET
  • Extended verification and monitoring: to 22:28 ET

Detailed timeline of events (ET, absolute)

2025-11-10 12:20:00 ET
High I/O wait spike observed on the cluster.
Action. Sysadmin team begins triage.

2025-11-10 12:21:00 ET
Cluster becomes unresponsive. Failure alerts.
Action. Soft reboot. Post-reboot, two of four array drives not detected.
Action. Hard reboot. Drives still not detected.
Action. Second hard reboot. Same result.
Assessment. Symptoms indicate dual-disk loss, exceeding RAID fault tolerance.
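
The report does not name the tooling used during triage. Purely as an illustration, on a Linux host a quick check for array member drives the kernel no longer sees might look like the sketch below; the device names are placeholders, not the actual SUPERIOR/QUETICO layout.

    #!/usr/bin/env python3
    """Minimal sketch: flag expected array member drives the kernel cannot see.

    Illustration only. Assumes a Linux host; the device names below are
    placeholders, not the actual SUPERIOR/QUETICO drive layout.
    """
    import os

    # Placeholder: the four drives expected in the production array.
    EXPECTED_DRIVES = {"sda", "sdb", "sdc", "sdd"}

    def missing_drives():
        """Return expected drives that are absent from /sys/block."""
        present = set(os.listdir("/sys/block"))
        return EXPECTED_DRIVES - present

    if __name__ == "__main__":
        missing = missing_drives()
        if missing:
            print("WARNING: drives not detected: " + ", ".join(sorted(missing)))
        else:
            print("All expected drives are visible to the kernel.")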

2025-11-10 12:30 – 13:29 ET
Decision. Treat as a dual-disk failure. Expect drive replacement or rebuild, then resync from the failover back to production off-peak.
Rationale. Minimise daytime downtime. Fail over now. Plan failback later.
Action. Begin IP migration to reserve failover.

2025-11-10 13:29:00 ET
Failover reserve online and serving traffic.
Status. Sites initially load, then performance degrades as reserve nears saturation.

2025-11-10 13:42:00 ET
Failover shows sustained high I/O wait with CPU and RAM fully utilised.
Action. Allocate remaining compute to the failover to stabilise it under emergency load.
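
I/O wait is the share of CPU time spent idle while waiting on disk I/O. As a minimal illustration (the report does not state which monitoring stack raised these alerts), the aggregate figure can be sampled from /proc/stat on a Linux host:

    #!/usr/bin/env python3
    """Minimal sketch: sample the aggregate I/O-wait percentage on a Linux host.

    Illustration only; not the monitoring stack referenced in the report.
    """
    import time

    def cpu_times():
        """Aggregate CPU time counters from the first line of /proc/stat."""
        with open("/proc/stat") as f:
            fields = f.readline().split()
        return [int(x) for x in fields[1:]]

    def iowait_percent(interval=5.0):
        """Percentage of CPU time spent in iowait over the sampling interval."""
        before = cpu_times()
        time.sleep(interval)
        after = cpu_times()
        deltas = [b - a for a, b in zip(before, after)]
        total = sum(deltas)
        # Field order in /proc/stat: user nice system idle iowait irq softirq ...
        return 100.0 * deltas[4] / total if total else 0.0

    if __name__ == "__main__":
        print("iowait over the last 5s: %.1f%%" % iowait_percent())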

2025-11-10 14:16:00 ET
Only slight improvement.
Action. Continued tuning to reduce resource consumption.
Planning. If stabilised, schedule failback to production off-peak, for example 04:00 ET.

2025-11-10 15:00:00 ET
Conclusion. Reserve failover cannot meet SUPERIOR and QUETICO peak demand despite tuning.
Decision. Pivot to production recovery. Assume drive replacement and approximately two-hour resync.

2025-11-10 15:20:00 ET
Action. Engage data-centre technicians and request a repair ETA. Continue incremental optimisations on the failover.

2025-11-10 15:52:00 ET
Client update. Failover capacity limits acknowledged. Production hardware intervention pending. Awaiting ETA.

2025-11-10 16:20:00 ET
Technicians confirm the intervention will begin as soon as possible.

2025-11-10 16:25:00 ET
Client update. Intervention imminent.

2025-11-10 16:30:00 ET
Hardware intervention begins on production cluster.

2025-11-10 16:59:00 ET
Client update. Replacement and testing in progress.
Assurance. No data loss expected due to real-time replication and off-site backups.

2025-11-10 17:21:00 ET
Client update. Work continues. Final testing underway.

2025-11-10 17:25:00 ET
Finding. Connector fault identified. Drive replacement not required.
Action. Reconnect drives. Reboot. Start array resync.
Status. Production host healthy. Short validation passed.
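
The report does not name the storage stack. Assuming Linux software RAID (md) purely for illustration, the resync started here can be watched through /proc/mdstat:

    #!/usr/bin/env python3
    """Minimal sketch: print in-progress RAID resync/recovery lines.

    Illustration only. Assumes Linux software RAID (md); the report does not
    confirm which RAID implementation the cluster uses.
    """
    import re

    def resync_status(path="/proc/mdstat"):
        """Return any /proc/mdstat lines describing an in-progress resync or recovery."""
        with open(path) as f:
            lines = f.readlines()
        # Progress lines look like: "[==>.......]  resync = 12.3% (...) finish=87.6min"
        return [line.strip() for line in lines if re.search(r"resync|recovery", line)]

    if __name__ == "__main__":
        status = resync_status()
        if status:
            for line in status:
                print(line)
        else:
            print("No resync or recovery in progress.")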

2025-11-10 18:00:00 ET
Client update. IP migration from the reserve back to production begins.

2025-11-10 18:30:00 ET
Client update. Production cluster up.

2025-11-10 18:35:00 ET
All service IPs visible on production. Sites resolving correctly.

2025-11-10 18:40:00 ET
Final confirmation. Production up. Sites up. No data loss.

2025-11-10 19:00:00 ET
Client update. Manual checks of every site on SUPERIOR and QUETICO in progress.

2025-11-10 21:21:00 ET
CEO message posted. Restoration confirmed. Actions and commitments outlined.

2025-11-10 22:22:00 ET
Stability confirmation following extended monitoring.

2025-11-10 22:28:00 ET
Incident resolved and closed.


What we did to resolve it

  1. Investigated immediately at the first sign of high I/O wait and unresponsiveness.
  2. Executed a soft reboot, then hard reboots, confirming that the missing-drive symptoms persisted.
  3. To minimise peak-time downtime, migrated IPs and traffic to the failover reserve at 13:29 ET.
  4. Allocated all available compute to the failover and tuned services to improve performance.
  5. Engaged on-site technicians for hands-on production repair.
  6. Identified connector fault as the root cause, reconnected drives, rebooted, and resynchronised arrays.
  7. Migrated IPs and traffic back to production starting 18:00 ET. Confirmed full restoration at 18:40 ET.
  8. Performed manual verification of every site on the two servers and extended monitoring before closing (a scripted version of this kind of check is sketched below).
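
The verification in step 8 was performed by hand. Purely as an illustration of how such a pass could be scripted, a minimal sketch follows; the hostnames are placeholders rather than real client sites, and this is not WPCloud tooling.

    #!/usr/bin/env python3
    """Minimal sketch: confirm that a list of sites responds after failback.

    Illustration only; the post-incident checks described above were manual.
    The hostnames are placeholders, not real WPCloud client sites.
    """
    from urllib.error import HTTPError, URLError
    from urllib.request import urlopen

    SITES = [
        "https://example.com",
        "https://example.org",
    ]

    def check(url, timeout=10.0):
        """Return (ok, detail); any 2xx/3xx response counts as healthy."""
        try:
            with urlopen(url, timeout=timeout) as resp:
                return True, "HTTP %d" % resp.status
        except HTTPError as exc:
            return False, "HTTP %d" % exc.code
        except URLError as exc:
            return False, str(exc.reason)

    if __name__ == "__main__":
        for site in SITES:
            ok, detail = check(site)
            print("%s %s (%s)" % ("OK  " if ok else "FAIL", site, detail))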

Why failover performance degraded

  • The reserve failover for this cluster was built in March 2025; failovers of the same specification remain more than sufficient for our other clusters.
  • The SUPERIOR workload had grown significantly since then. Under peak Monday load, the reserve reached compute and I/O ceilings.
  • Sites initially loaded but later experienced slow responses and intermittent errors until production was restored.

What we are doing to prevent a repeat

  • Immediate hardware upgrade for this cluster. The failover reserve attached to SUPERIOR is now an exact duplicate of production. In stress tests, peak utilisation reached about 55 percent, leaving ample headroom.
  • Two-week failover replacement program. All failovers across WPCloud will be replaced or expanded to meet or exceed production specifications.
  • Process improvements. We are adding time-boxed decision checkpoints. If a stabilisation path does not yield measurable improvement within a set window, we escalate or pivot to the alternate path sooner (a simplified sketch of this checkpoint follows this list).
  • Communication cadence. During any active incident, we will post updates every 15 minutes. Even if progress is unchanged, updates will confirm that work is ongoing.
  • Training and drills. Additional response drills and cross-team simulations will reinforce decision timing and handoffs.
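
As a simplified sketch of the time-boxed checkpoint idea in the process-improvements item above: the metric, window length, and improvement threshold below are placeholder assumptions, not the values used internally.

    #!/usr/bin/env python3
    """Minimal sketch of a time-boxed decision checkpoint.

    Illustration of the process change described above, not WPCloud tooling.
    The metric, window, and threshold are placeholder assumptions.
    """
    import time

    def timeboxed_checkpoint(metric, window_seconds=900, min_improvement=10.0):
        """Return 'continue' if the metric improves enough within the window, else 'pivot'.

        `metric` is a callable returning the current severity (lower is better),
        for example a sampled I/O-wait percentage.
        """
        start = metric()
        time.sleep(window_seconds)
        end = metric()
        return "continue" if (start - end) >= min_improvement else "pivot"

    if __name__ == "__main__":
        # Placeholder metric and a short window, just to show the call shape.
        decision = timeboxed_checkpoint(metric=lambda: 0.0, window_seconds=1, min_improvement=1.0)
        print("Checkpoint decision:", decision)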

A message from our CEO

I was involved in this incident personally from the moment it began. I trust my team and stand by the decisions they made with the information available at the time. As investigations progressed, it became clear that some choices could have been made differently. In a mission-critical environment, minutes matter and decisions must be made quickly. In this case, choices made to avoid extended downtime contributed to a longer disruption. We have learned from this and we are acting on it.

I started WPCloud more than 13 years ago and I still work here every day. I built this company one server at a time through long days and long nights. I see that same dedication in every person on our team. On the day of this incident, our system administrators, our support staff, our operations and data-centre partners worked without breaks until every site was back online. I am proud of that effort.

This was the toughest day in our history. Within an hour of recovery we began implementing changes. Hardware has been upgraded, processes are being refined, and new safeguards are already in place. Over our history, WPCloud’s uptime has consistently exceeded 99.9 percent. That is not an accident. It reflects planning, discipline, and care.

Over the next seven days I will introduce a simple channel where I can communicate directly with you about improvements we are making and the day-to-day work that keeps WPCloud running. There is a lot our clients do not see behind the scenes. I am looking forward to sharing more of that work with you.

I am sincerely sorry for the disruption this incident caused to your work and your clients.

Steve Wilton
CEO, WPCloud


Confidential to intended recipients. Please do not forward without written permission.

Updated on November 12, 2025