Elevated Errors on SNOW Backend
Incident Report for SNOW

Earlier today we had a short outage because one of our cache servers started reporting errors when receiving data.

What Happened?

This issue is related to the one we had on the morning of the 11th of February 2016. Back then I promised that we'd change our caching solution from Memcache to Redis. We finished the migration to Redis a few days after the first outage. Today's outage was caused by a misconfiguration in our Redis cluster: stale cache keys were not being pruned automatically.

Today's outage was luckily very small and short compared to the one we had back in February. This is partly because we had seen a similar outage before, but also because we were able to automatically fail over to a standby server while we recovered our main cluster.

We noticed the issue because we saw elevated error rates.

[Figure: Elevated Error Rate]

As the cluster came back up, we saw a short spike in response time as the cache was being populated once again.

[Figure: Response Time Spike]

What Have We Done?

We have fixed the configuration issue and added a check to our release procedure that verifies this config rule is set correctly whenever we deploy updates. This should prevent the issue from recurring.
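
To illustrate what such a release check might look like, here is a minimal sketch in Python. It is not our actual check. It assumes the relevant rule is Redis's `maxmemory` / `maxmemory-policy` pair, which are real Redis directives, but the report above does not say which setting was involved. The function names are hypothetical, and a missing `maxmemory-policy` is treated as `noeviction` for the purposes of this sketch.

```python
# Hypothetical pre-deploy check. The directive names are real Redis settings,
# but which rule is actually verified at release time is an assumption.

# Policies that let Redis remove keys once the memory limit is reached.
EVICTING_POLICIES = {
    "allkeys-lru",
    "volatile-lru",
    "allkeys-random",
    "volatile-random",
    "volatile-ttl",
}


def parse_redis_conf(text):
    """Return a dict of directive -> value from redis.conf-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value  # a later directive overrides an earlier one
    return settings


def check_eviction_config(text):
    """Return a list of problems; an empty list means the check passes."""
    settings = parse_redis_conf(text)
    problems = []

    # A missing or zero maxmemory means no memory limit is configured.
    if settings.get("maxmemory", "0") in ("", "0"):
        problems.append("maxmemory is unset or 0 (no memory limit)")

    # Assumption: an absent policy is treated as 'noeviction'.
    policy = settings.get("maxmemory-policy", "noeviction")
    if policy not in EVICTING_POLICIES:
        problems.append(
            "maxmemory-policy is %r, not an evicting policy known to this check"
            % policy
        )
    return problems
```

A release script could call `check_eviction_config` on the config file it is about to ship and abort the deploy if the returned list is not empty.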

Posted about 1 year ago. May 02, 2016 - 16:38 CEST

Resolved
This incident has been resolved.
Posted about 1 year ago. May 02, 2016 - 16:00 CEST
Monitoring
The server is back up, and we're monitoring its stability. There should no longer be any impact on players.
Posted about 1 year ago. May 02, 2016 - 15:25 CEST
Identified
We're currently seeing elevated error rates on our backend due to a cache server failing. We are restarting the server and expect it to be back up shortly.
Posted about 1 year ago. May 02, 2016 - 15:17 CEST