Increased error rate and latency in some areas of marketplace and merchant portal
Incident Report for Cratejoy
Postmortem

Between 1:09 and 2:21 PM CDT, we saw an increased error rate in functions of both the Merchant Portal and the Marketplace.

The Merchant dashboard was hardest hit, with 10% of requests affected, while the Marketplace saw around 4%. The Merchant Portal uses the search service heavily, including in the auto-suggest global search bar.

This disruption was caused by a processing anomaly on one node of our primary search cluster. The service has built-in failover for when a node fails, but this problem did not disable the node. Instead, it drove the node's usage to its maximum, drastically lowering its throughput. As a result, if your request happened to be routed to that node, you saw an extended wait, an error, or a timeout.
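
To illustrate the failure mode described above: a check that only asks whether a node is alive keeps a slow but responsive node in rotation. The following is a minimal sketch, in Python, of a health check that also considers latency and errors. The names NodeHealth and pick_nodes and every threshold are hypothetical, chosen for illustration, and are not taken from the actual search stack or the cloud provider's failover logic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class NodeHealth:
    """Rolling view of one node's recent requests (hypothetical helper)."""
    window: int = 100               # number of recent requests to remember
    max_bad_fraction: float = 0.2   # assumed limit: unhealthy above 20% bad requests
    slow_ms: float = 1000.0         # assumed budget: slower than this counts as bad
    samples: deque = field(default_factory=deque)

    def record(self, latency_ms: float, ok: bool) -> None:
        # A sample is "bad" when the request failed or exceeded the latency budget.
        self.samples.append((not ok) or latency_ms > self.slow_ms)
        if len(self.samples) > self.window:
            self.samples.popleft()

    def is_healthy(self) -> bool:
        # With too little data, assume healthy rather than eject prematurely.
        if len(self.samples) < 10:
            return True
        return sum(self.samples) / len(self.samples) <= self.max_bad_fraction


def pick_nodes(health_by_node: dict) -> list:
    """Return nodes eligible for routing; fall back to all nodes if none qualify."""
    healthy = [name for name, h in health_by_node.items() if h.is_healthy()]
    return healthy or list(health_by_node)
```

In this sketch, a node that answers every request but takes several seconds to do so is dropped from the eligible list once enough slow samples accumulate, which a liveness-only check would not do. The fallback to all nodes is a design choice to avoid routing to nothing when every node looks degraded.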

We’ve already reached out to our cloud provider about the failover problem.
We are expanding our capacity in our search cluster so that we have extra room for the unexpected going forward.

Posted Jul 06, 2020 - 16:01 CDT

Resolved
Systems have been stable and performing well since the last update. We'll attach a postmortem to this event for those interested.
Posted Jul 06, 2020 - 15:49 CDT
Monitoring
We've seen recovery in the search systems, which are now responding with normal error rates and improved response times. We're continuing to work on the problematic system.
Posted Jul 06, 2020 - 14:36 CDT
Investigating
We're having an issue with a new search cluster that's causing some increased load times and occasional errors on these services. We're looking into a solution or mitigation.

Storefronts and the API are not affected.
Posted Jul 06, 2020 - 13:28 CDT
This incident affected: Merchant Portal and Marketplace.