Game Backend Instability
Incident Report for SNOW

First of all, I want to apologize for the outage that affected all our users on the morning of the 11th of February 2016. I take great pride in the stability and availability of our platform, and outages like this are completely unacceptable.

What Happened?

One of our caching servers stopped responding to requests. As our application is currently set up, it depends entirely on being able to establish a connection to every one of these servers. As a result, the application started returning 500 Internal Server Error responses to almost all requests.

[Graph: Spike in response time]

The outage also caused a significant spike in response time for the few requests that did make it through.

What Will We Do?

To make sure this never happens again, we will change how our application interfaces with the cache layer, so that it no longer depends on the cache completely and only uses it as a performance enhancer, as we initially intended. We will also move from Memcache to Redis, to get the benefits of a more modern cache and to make our caching cluster more resilient to the failure of a single instance.
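
For readers wondering what "cache as a performance enhancer" looks like in practice, here is a minimal sketch in Python. It is illustrative only and not our production code: the names cached_fetch, InMemoryCache, DownCache and load_profile are hypothetical, and the two cache classes are simple stand-ins rather than a real Memcache or Redis client. The idea is that any cache failure is treated as a cache miss, so the request falls through to the primary data source instead of failing with a 500.

```python
from typing import Callable, Optional, Protocol


class Cache(Protocol):
    """Minimal interface assumed for a cache client (hypothetical)."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryCache:
    """Stand-in for a healthy cache node."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class DownCache:
    """Stand-in for a cache node that has stopped responding."""

    def get(self, key):
        raise ConnectionError("cache unreachable")

    def set(self, key, value):
        raise ConnectionError("cache unreachable")


def cached_fetch(cache: Cache, key: str, load: Callable[[str], str]) -> str:
    """Return the value for key, using the cache only as an optimization."""
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except (ConnectionError, TimeoutError):
        # A broken cache is treated like a miss, not like a failed request.
        pass

    # Fall back to the authoritative (slower) source.
    value = load(key)

    try:
        cache.set(key, value)
    except (ConnectionError, TimeoutError):
        # Failing to populate the cache must not fail the request either.
        pass
    return value


def load_profile(player_id: str) -> str:
    """Hypothetical slow lookup, e.g. a database query."""
    return f"profile:{player_id}"
```

With this shape, cached_fetch(DownCache(), "42", load_profile) still returns the profile, just more slowly, whereas a hard dependency on the cache would have surfaced the connection error to the user. A real client would also need short connection and read timeouts so that an unresponsive node fails fast instead of stalling every request.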

Posted over 1 year ago. Feb 11, 2016 - 12:54 CET

Resolved
We're now confident the issue is resolved, and all systems are behaving normally.
Posted over 1 year ago. Feb 11, 2016 - 12:07 CET
Monitoring
We believe the issue is fixed. We'll keep monitoring the service and dig into the root cause of this outage.
Posted over 1 year ago. Feb 11, 2016 - 10:54 CET
Identified
We've identified the issue, and are putting a fix in place.
Posted over 1 year ago. Feb 11, 2016 - 10:52 CET
Investigating
We're currently seeing high error rates and periodic outages with our backend. We're investigating.
Posted over 1 year ago. Feb 11, 2016 - 10:44 CET