Nucleus Slowdown
Incident Report for Nucleus
Resolved
On Sunday (July 21 at 11:57 EDT) we were alerted that the Nucleus primary server was responding slowly. Initial investigation pointed to the improved search index we had shipped the week before.

In testing the improved search index prior to release, the load increase had not caused any significant issues. But on Sunday, the high volume of requests from users, combined with the longer time it took to serve the index, meant that requests were stacking up faster than the server could respond. This created a self-reinforcing feedback loop: because requests were taking longer, more requests stacked up, and this in turn caused requests to take even longer.
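The loop described above can be illustrated with a small simulation. This is only a sketch: the arrival rate, service time, and the assumption that each queued request slows the server by a fixed fraction are illustrative values, not measurements from the incident.

```python
def simulate_backlog(arrivals_per_sec, base_service_sec, slowdown_per_queued, seconds):
    """Return the number of queued requests at the end of each simulated second."""
    queued = 0.0
    history = []
    for _ in range(seconds):
        # Assumption: every queued request adds a fixed fraction of contention,
        # so the time to serve one request grows with the backlog.
        service_sec = base_service_sec * (1 + slowdown_per_queued * queued)
        capacity = 1.0 / service_sec  # requests the server can finish this second
        completed = min(queued + arrivals_per_sec, capacity)
        queued = queued + arrivals_per_sec - completed
        history.append(queued)
    return history


# Fast responses: the server keeps up and no backlog forms.
healthy = simulate_backlog(5, 0.10, 0.1, 10)

# Slower responses at the same request rate: the backlog grows, and grows faster each second.
overloaded = simulate_backlog(5, 0.25, 0.1, 10)
```

In the second case the server starts only slightly short of the incoming rate, but because each queued request makes the next one slower, the shortfall widens every second instead of staying constant.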

At 12:05 EDT we deployed a change so that search index responses are cached locally in the browser for a time, reducing the number of requests reaching the server. The load on the server began to decline almost immediately.
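This report does not describe how the browser caching was implemented. One standard way to let a browser reuse a response for a limited time is the HTTP Cache-Control response header; the sketch below builds such a header. The function name and the 300-second lifetime are assumptions made for illustration, not the values used in the fix.

```python
def search_index_cache_headers(max_age_seconds):
    """Build response headers that let the browser reuse a search index response.

    Hypothetical helper: the name and the lifetime passed in are illustrative.
    """
    if max_age_seconds < 0:
        raise ValueError("max_age_seconds must not be negative")
    # "private" restricts storage to the user's own browser cache;
    # "max-age" is how many seconds the stored copy may be reused.
    return {"Cache-Control": f"private, max-age={max_age_seconds}"}


headers = search_index_cache_headers(300)
```

With a header like this, repeat visits within the lifetime are served from the browser's cache and never reach the server, which is why load can drop quickly once the change is live.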

It took until 12:22 EDT for most of the existing requests to clear and the server to become reasonably responsive again.

By 12:41 EDT, 44 minutes after the issue started, the load on the server was back to normal levels.

On Monday (July 22 at 3:36 PM) we deployed another change that greatly reduces the time it takes to create search indexes by consolidating the database queries used to gather information on tags, speakers, and scripture attached to media items. This change is especially noticeable for sites with hundreds or thousands of media items. For example, the search index was only taking about 1-3 seconds to load on small sermon engines, but sometimes over 40 seconds on the largest sermon engines.

With Monday's update, we now eager load this data for the media items, collapsing thousands of unnecessary queries into just three queries for the entire set of media items. As a result, the load time for the search index on the largest sermon engines is now only about 1-2 seconds, and small and mid-size sermon engines are down to less than half a second (~400ms or less!) total response time.
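The difference between per-item queries and eager loading can be sketched with an in-memory SQLite database. The table names, columns, and helper functions below are hypothetical and are not the Nucleus schema or code; the point is only the query count, which goes from three per media item to three in total.

```python
import sqlite3

ASSOCIATIONS = ("tags", "speakers", "scriptures")  # hypothetical table names


def make_demo_db(item_count):
    """Create a throwaway database with one row per association per media item."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE media_items (id INTEGER PRIMARY KEY, title TEXT)")
    for table in ASSOCIATIONS:
        conn.execute(f"CREATE TABLE {table} (media_item_id INTEGER, name TEXT)")
    for item_id in range(1, item_count + 1):
        conn.execute("INSERT INTO media_items VALUES (?, ?)", (item_id, f"Item {item_id}"))
        for table in ASSOCIATIONS:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (item_id, f"{table}-{item_id}"))
    return conn


class CountingConnection:
    """Thin wrapper that counts how many SELECTs are issued."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def fetch(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def index_per_item(db, item_ids):
    """Per-item style: three lookups for every media item."""
    index = {}
    for item_id in item_ids:
        index[item_id] = {}
        for table in ASSOCIATIONS:
            rows = db.fetch(
                f"SELECT name FROM {table} WHERE media_item_id = ? ORDER BY name",
                (item_id,),
            )
            index[item_id][table] = [row[0] for row in rows]
    return index


def index_eager(db, item_ids):
    """Eager style: one lookup per association for the whole set, grouped in memory."""
    index = {item_id: {table: [] for table in ASSOCIATIONS} for item_id in item_ids}
    placeholders = ",".join("?" for _ in item_ids)
    for table in ASSOCIATIONS:
        rows = db.fetch(
            f"SELECT media_item_id, name FROM {table} "
            f"WHERE media_item_id IN ({placeholders}) ORDER BY name",
            tuple(item_ids),
        )
        for media_item_id, name in rows:
            index[media_item_id][table].append(name)
    return index


conn = make_demo_db(200)
item_ids = list(range(1, 201))
slow = CountingConnection(conn)
fast = CountingConnection(conn)
same_result = index_per_item(slow, item_ids) == index_eager(fast, item_ids)
```

Both functions build the same index, but for 200 media items the per-item version issues 600 queries while the eager version issues 3, and the gap widens linearly as the number of media items grows.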

Moving forward, we're continuing to look for ways to increase the amount of caching we do and to optimize the data included in the search indexes, reducing load on the server even further.
Posted Jul 21, 2019 - 11:57 UTC