Railway GCP Account Suspension Incident Report

🚅

This report reflects what we know at time of publication and may be updated pending Google Cloud's internal review.

Railway experienced a platform-wide service disruption after Google Cloud incorrectly placed our account in a suspended status. This caused a temporary loss of service for all GCP-hosted infrastructure, which supports our dashboard, our API, and parts of our network infrastructure. As cached network routes expired, the outage extended beyond GCP to affect all Railway workloads.

Below, we walk through what happened, how we responded, and what we're doing to prevent a similar incident in the future.

On May 19, 2026 between 22:20 UTC and approximately 06:14 UTC on May 20 (~8 hours), Railway experienced a platform-wide outage after Google Cloud suspended services on our production account. This took our API, control plane and databases offline, along with compute infrastructure hosted on Google Cloud.

Users immediately experienced 503 errors on the dashboard and API, including "no healthy upstream" and "unconditional drop overload" messages, and were unable to log in. All workloads hosted on Google Cloud compute were taken offline.

Workloads on our own Railway Metal and AWS burst-cloud environments remained up. However, Railway's edge proxies rely on a Google Cloud-hosted control plane API to populate their routing tables, which caused the outage to cascade beyond Google Cloud. As the route caches expired, these other workloads became unreachable and began returning 404 errors, because routes to active instances could no longer be resolved through the network control plane. At peak impact, all Railway workloads across all regions were unreachable.

As we recovered our Google Cloud environment, builds and deployments were blocked platform-wide while we restored the individual services. Once all of our infrastructure was restored, a significant backlog of queued deploys was drained gradually to avoid overwhelming the platform. In parallel, GitHub began rate-limiting Railway's OAuth and webhook integrations, temporarily blocking logins and builds. The volume of these calls had increased because our caches were cleared during the Google Cloud outage. As a side effect, terms-of-service acceptance records were also reset, prompting users to re-accept on their next visit to the dashboard.

We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage, and detail below what happened, how we recovered, and the changes we are making to prevent this from happening again.

May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-call engineers, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
May 19, 22:35 UTC - Cached network routes began expiring; workloads on Railway Metal and AWS began returning 404 errors as the networking could no longer resolve routes.
May 19, 23:09 UTC - First persistent disk comes back online.
May 19, 23:54 UTC - All persistent disks restored to ready state. Network still down.
May 20, 00:39 UTC - Disks confirmed ready. Recovery blocked on Google Cloud networking restoration.
May 20, 01:30 UTC - Compute instances began recovering.
May 20, 01:38 UTC - Edge traffic being served again. Networking restored.
May 20, 01:57 UTC - Orchestration and build infrastructure restored. Deploys temporarily paused to prevent overwhelming systems as queued work attempted to execute simultaneously.
May 20, 02:04 UTC - Compute hosts being brought back online incrementally.
May 20, 02:47 UTC - GitHub began rate-limiting Railway's OAuth and webhook integrations; some users unable to log in, builds blocked.
May 20, 02:55 UTC - Dashboard accessible again.
May 20, 03:59 UTC - Deployments beginning to process again across all tiers.
May 20, 04:00 UTC - API, dashboard, and OAuth endpoints confirmed operational. Remaining workloads continuing to restore.
May 20, 06:14 UTC - Incident moved to monitoring.
May 20, 07:58 UTC - Incident is resolved.

At 22:20 UTC on May 19, Google Cloud incorrectly placed Railway’s production account into a suspended status as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.

This suspended status disabled our GCP-hosted infrastructure, which supports the Railway dashboard, the API, and parts of our network infrastructure, along with additional burst-compute infrastructure hosted on Google Cloud.

Railway's control plane is a set of core dependencies that serve the dashboard, process builds and deployments, and populate the routing tables used by our edge. The impact was immediate for all workloads on Google Cloud.

Railway's edge proxies maintain a cache of routing tables from the network control plane, which is hosted within Google Cloud. While that cache held, workloads on Railway Metal and AWS continued to serve traffic. Once the cache expired, the edge could no longer resolve routes to active instances, and workloads across all regions, including Metal and AWS, began returning 404 errors. The network outage therefore cascaded beyond Google Cloud into these environments, even though the workloads themselves remained online.
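To illustrate the failure mode described above, here is a minimal sketch of an edge route cache with a fixed time-to-live. It is not Railway's implementation: the class names, the 900-second TTL, the hostname, and the upstream address are all hypothetical, and the control plane is modeled as a plain function that raises when it is unreachable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class ControlPlaneUnavailable(Exception):
    """Raised when the (hypothetical) control plane API cannot be reached."""


@dataclass
class CachedRoute:
    upstream: str
    fetched_at: float


class EdgeRouteCache:
    """Caches host -> upstream routes for a fixed TTL (assumed behavior)."""

    def __init__(
        self,
        fetch_route: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float],
    ) -> None:
        self._fetch_route = fetch_route
        self._ttl = ttl_seconds
        self._clock = clock
        self._routes: Dict[str, CachedRoute] = {}

    def resolve(self, host: str) -> Optional[str]:
        """Return the upstream for host, or None, which the edge serves as a 404."""
        now = self._clock()
        cached = self._routes.get(host)
        if cached is not None and now - cached.fetched_at < self._ttl:
            return cached.upstream  # cache still valid: no control plane call
        try:
            upstream = self._fetch_route(host)
        except ControlPlaneUnavailable:
            # Entry expired and cannot be refreshed: the route is lost.
            self._routes.pop(host, None)
            return None
        self._routes[host] = CachedRoute(upstream, now)
        return upstream


def simulate_outage() -> List[Optional[str]]:
    """Resolve one host before, during, and after the cache TTL elapses."""
    now = [0.0]
    control_plane_up = [True]

    def fetch(host: str) -> str:
        if not control_plane_up[0]:
            raise ControlPlaneUnavailable(host)
        return "metal-host-1:8080"  # hypothetical upstream outside Google Cloud

    cache = EdgeRouteCache(fetch, ttl_seconds=900, clock=lambda: now[0])
    results = [cache.resolve("app.example.com")]  # populates the cache
    control_plane_up[0] = False                   # control plane goes offline
    now[0] = 600.0
    results.append(cache.resolve("app.example.com"))  # served from cache
    now[0] = 901.0
    results.append(cache.resolve("app.example.com"))  # expired: unresolvable
    return results
```

In this sketch the workload itself never goes down. Only the ability to look it up does, which mirrors why healthy workloads on Metal and AWS became unreachable.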

Railway's infrastructure is designed for high availability. Our databases run across multiple availability zones, and our network uses redundant connections between AWS, GCP, and Railway Metal. However, restoring account access did not restore these individual services. Persistent disks, compute instances, and networking all required separate recovery, which extended the outage by several hours. Disks were restored to a ready state by 23:54 UTC, but core networking and edge routing did not fully restore until approximately 01:30 UTC on May 20. (We are awaiting confirmation of whether this delay and the associated errors were on Google’s side.)

As networking was restored, recovery of Railway core services and validation of end-user workloads proceeded layer by layer. To avoid overwhelming our build systems, we temporarily paused deploys and then gradually allowed them to resume. In parallel with our core system recovery, GitHub began rate-limiting Railway's OAuth and webhook integrations because of the volume and burstiness of the retried requests, temporarily blocking user logins and builds.
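The gradual resumption of deploys can be pictured as draining a backlog in small batches with a pause between them. The sketch below is an illustration under stated assumptions, not Railway's scheduler: `drain_in_batches`, the batch size, and the two callbacks are hypothetical names introduced here.

```python
from collections import deque
from typing import Callable, Deque


def drain_in_batches(
    backlog: Deque[str],
    batch_size: int,
    start_deploy: Callable[[str], None],
    wait_between_batches: Callable[[], None],
) -> int:
    """Start queued deploys in FIFO order, batch_size at a time.

    Returns the number of batches that were started.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = 0
    while backlog:
        # Start at most batch_size deploys, oldest first.
        for _ in range(min(batch_size, len(backlog))):
            start_deploy(backlog.popleft())
        batches += 1
        if backlog:
            # Pause only when more work remains, so the platform can absorb
            # the previous batch before the next one begins.
            wait_between_batches()
    return batches
```

In practice `wait_between_batches` might sleep for a fixed interval or block until build capacity frees up. It is left as a callback here so the pacing policy stays separate from the queue logic.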

By approximately 04:00 UTC on May 20, the API, dashboard, and OAuth endpoints were confirmed operational, with remaining workloads continuing to restore.

Railway’s network control plane is designed for resilience. It is a multi-AZ, multi-zone control plane that can tolerate the loss of multiple machines and components while still functioning with zero user impact. This was tested both in staging and with live traffic prior to its rollout a few months ago.

We have invested in resiliency as a result of prior incidents, and those investments helped us deal with the impact of this one. One example of these lessons paying off was Railway being able to gracefully recover user GitHub installations without triggering secondary rate limits.
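A common general technique for avoiding synchronized retry bursts against a rate-limited API is exponential backoff with jitter. The report does not say which mechanism Railway uses, so the sketch below is only an illustration: the function name, the base delay, and the cap are assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one delay per retry attempt using "full jitter" backoff.

    The delay for attempt n is drawn uniformly from
    [0, min(cap_seconds, base_seconds * 2**n)].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Because each client draws its own random delay, retries from many clients spread out over time instead of arriving together, which is the burst pattern that tends to trip rate limits.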

However, many have asked across multiple forums: how could Railway have a single dependency that would affect all customer workloads?

Railway’s network is a mesh ring, built from high-availability fiber interconnects between Metal, GCP, and AWS. However, in this ring there was still a hard dependency: workload discoverability was tied to the network control plane API, which was hosted on machines running in Google Cloud. This meant that although the mesh continued to operate for an hour, once the route cache expired the mesh could not re-populate the routing tables.

We are immediately working on removing this dependency, making this a true mesh. This means that if any of the interconnects go out, there is always a path between the clouds.

As a result of this, we will be extending the high-availability database shards across AWS and Metal. In the future, should all instances in a particular cloud disappear instantly, database quorum will keep everything running and immediately fail over any workloads that are no longer running.
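To see why spreading replicas across clouds matters, consider a simple majority quorum. The sketch below assumes a majority-based replication scheme, which the report does not specify. The function name and the replica counts are hypothetical.

```python
from typing import Dict, Set


def has_quorum(replicas_per_cloud: Dict[str, int], failed_clouds: Set[str]) -> bool:
    """Return True if surviving replicas still form a strict majority."""
    total = sum(replicas_per_cloud.values())
    surviving = sum(
        count
        for cloud, count in replicas_per_cloud.items()
        if cloud not in failed_clouds
    )
    # A majority quorum needs more than half of all configured replicas.
    return surviving >= total // 2 + 1


# All replicas in one cloud: losing that cloud loses quorum.
single_cloud = has_quorum({"gcp": 3}, {"gcp"})

# One replica per cloud: losing any single cloud leaves 2 of 3.
spread = has_quorum({"gcp": 1, "aws": 1, "metal": 1}, {"gcp"})

# Uneven spread: a cloud holding a majority is still a single point of failure.
uneven = has_quorum({"gcp": 3, "aws": 1, "metal": 1}, {"gcp"})
```

Here `single_cloud` and `uneven` are `False` while `spread` is `True`. Under this assumption, no single cloud may hold a majority of replicas if the loss of one provider is to be survivable.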

Finally, we are planning to remove Google Cloud services from our data plane’s hot path and keep them only for secondary/failover use. This is happening in parallel with implementing a new architecture for our data plane (which enables connectivity to hosts) and our control plane (which powers the dashboard you use to access and manage Railway). These architecture upgrades will ensure that our core services, especially user-facing components, are not dependent on any one vendor or platform.

Railway owns our vendor choices, and we ultimately own this one. Your customers don't care whether the failure was Google or Railway; they see your product. Your uptime is our responsibility, and we'll keep delivering on it.