Earlier today we had a short outage because one of our cache servers started reporting errors when receiving data.
This issue is related to the one we had on the morning of the 11th of February 2016. Back then I promised that we'd change our caching solution from Memcache to Redis, and we finished the migration to Redis a few days after that first outage. Today's outage was caused by a misconfiguration in our Redis cluster: we weren't automatically pruning stale cache keys.
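The post doesn't name the exact setting involved, so as an illustration only: one common way to make sure stale keys get pruned is to write every cache entry with a TTL so Redis can expire it on its own. The sketch below uses the redis-py client; the host, key names, and TTL are assumptions, not our actual configuration.

```python
# Illustrative sketch, not our production code: cache writes that always
# carry a TTL so stale keys expire automatically instead of piling up.
import redis

cache = redis.Redis(host="localhost", port=6379)

def cache_set(key, value, ttl_seconds=3600):
    # ex= attaches an expiry; Redis prunes the key once it goes stale.
    cache.set(key, value, ex=ttl_seconds)

def cache_get(key):
    return cache.get(key)
```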
Today's outage was luckily much smaller and shorter than the one we had back in February. This is partly because we had seen a similar outage before, but also because we were able to fail over automatically to a standby server while we recovered our main cluster.
We noticed the issue when we saw elevated error rates.
As the cluster came back up, we saw a short spike in response times while the cache was repopulated.
We have fixed the configuration issue and added a check to our release procedure that verifies this config rule is set correctly whenever we deploy updates. This should prevent this particular issue from happening again.
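For a rough idea of what such a pre-deploy check can look like, here is a minimal sketch that verifies the cluster's eviction policy before a release goes out. The setting checked (maxmemory-policy) and the expected value are assumptions for illustration; our actual check covers the specific rule that caused the outage.

```python
# Illustrative sketch of a release-procedure check: fail the deploy if the
# Redis eviction policy is not what we expect. Values here are assumed.
import sys
import redis

EXPECTED_POLICY = "allkeys-lru"  # assumed value, for illustration only

def check_eviction_policy(host="localhost", port=6379):
    r = redis.Redis(host=host, port=port)
    policy = r.config_get("maxmemory-policy").get("maxmemory-policy")
    if policy != EXPECTED_POLICY:
        print(f"maxmemory-policy is {policy!r}, expected {EXPECTED_POLICY!r}")
        sys.exit(1)
    print("eviction policy OK")

if __name__ == "__main__":
    check_eviction_policy()
```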