First of all, I want to apologize for the outage that affected all our users on the morning of the 11th of February 2016. I take great pride in the stability and availability of our platform, and outages like this are completely unacceptable.
One of our caching servers stopped responding to requests. As currently set up, our application relies entirely on being able to establish a connection to every one of these servers. As a result, it started returning 500 Internal Server Error responses to almost all requests.
The failure also caused a significant spike in response time for the few requests that made it through.
To make sure this never happens again, we will change how our application interfaces with the cache layer, so that it no longer depends on the cache completely but only uses it as a performance enhancer, as we initially intended. We will also move from Memcache to Redis, to get the benefit of more modern caching and to make our caching cluster more resilient against the failure of a single instance.
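To illustrate what "only a performance enhancer" means in practice, here is a minimal sketch in Python. It is not our production code: the `Cache` interface, the `CacheUnavailable` error, the `get_with_fallback` helper, and the two stand-in cache classes are all hypothetical names made up for this example. The idea is simply that a cache failure should cost us speed, never a 500 response.

```python
from typing import Callable, Dict, Optional, Protocol


class CacheUnavailable(Exception):
    """Hypothetical error: raised when a cache server cannot be reached."""


class Cache(Protocol):
    """Hypothetical minimal cache interface assumed for this sketch."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


def get_with_fallback(cache: Cache, key: str, load: Callable[[], str]) -> str:
    """Return the value for `key`, treating the cache as optional.

    `load` fetches the value from the source of truth (for example the
    database). Any cache failure is swallowed so the request still succeeds.
    """
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Cache is down: skip it entirely and serve from the source of truth.
        return load()

    value = load()
    try:
        cache.set(key, value)
    except CacheUnavailable:
        # Failing to store the value only costs performance on later requests.
        pass
    return value


class DictCache:
    """In-memory stand-in for a healthy cache server."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class DeadCache:
    """Stand-in for a cache server that has stopped responding."""

    def get(self, key: str) -> Optional[str]:
        raise CacheUnavailable(key)

    def set(self, key: str, value: str) -> None:
        raise CacheUnavailable(key)
```

With a helper like this, calling `get_with_fallback(DeadCache(), "user:1", load_user)` would still return whatever the hypothetical `load_user` function produces, just without the speed benefit of the cache. A real implementation would also want short connection timeouts, so that an unresponsive server fails fast instead of slowing every request down.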