Summary

We experienced an outage across all stores last night, starting at 5:26pm CST and ending at 7:36pm for 99% of stores.

Most stores started experiencing errors around 5:26pm and returned to normal by 7:36pm; Blue = successful visits; Orange = error visits

What caused the outage

At Cratejoy, we’ve built our system around redundancy. We’ve supported many merchants through features on Shark Tank, CNN, Good Morning America, and other high-profile promotions.

Last night we saw 5x more traffic than we’ve ever seen from any promotion, due to highly successful marketing campaigns. Despite being prepared for large spikes, this was much larger than anything we had seen before or had been prepared to handle.

Our site reliability engineering team was able to isolate the problem and return nearly all storefronts to a healthy state. We designed techniques on the fly to mitigate the problem and restore service to normal, and we can apply those techniques much faster the next time this happens.

What we’re doing to improve

We get as excited about these promotions as you do and want to make sure we can meet these critical high-traffic moments. Making the 2017 holiday season a blow-out for all merchants is critical to our mission. We don’t take failures like this lightly.

Last night’s incident exposed more work that we need to do to be fully prepared for the holiday season. The good news: much of this work was already underway, and we expect to have a smooth holiday season with fast storefronts. We’re working through changes that will let us handle much more traffic than we saw last night, for a faster, more reliable Cratejoy for all merchants.

Posted Nov 09, 2017 - 17:00 CST

Resolved

Storefront error rates and latency have been back in the normal range for nearly 4 hours now, so we're closing this incident out.

Posted Nov 09, 2017 - 00:02 CST

Update

At 5:36 PM CST, we received a large surge of traffic that overwhelmed our servers. Our automatic scaling rules reacted too slowly to accommodate it.

We attempted to manually intervene and it took some time to identify and isolate the incoming traffic.

We have put provisions in place so that, going forward, we will be able to identify and isolate large traffic spikes much more quickly and prevent this problem from occurring again.

We are still seeing heavy traffic and are prioritizing storefronts while we monitor. Access to the Merchant Dashboard may be degraded for now.

Posted Nov 08, 2017 - 22:08 CST

Update

We're seeing recovery of errors across most storefronts, and page response times getting back to normal.

Posted Nov 08, 2017 - 19:38 CST

Monitoring

We've managed to correct the store scaling, and error rates are dropping but still elevated.

Posted Nov 08, 2017 - 18:40 CST

Investigating

One of our scaling groups does not appear to be expanding as fast as intended. We're looking into it.

Posted Nov 08, 2017 - 17:52 CST