outage 26/05/2017

(Tasos Katsoulas) #1

On Friday, 26 May 2017, the ReMo portal experienced problems during the creation of the poll for the Council elections. The Reps Module Owner (Ioana) made us aware of the issue through our public Telegram channel.

Specifically, admin users were not able to add the nominees for the range poll. Repeated attempts to create new polls and add nominees flooded users with emails. We quickly identified that the issue was located in our infrastructure. After a few hours of investigation we determined that the site could not handle the load, mostly because the poll for the Council elections had to create multiple action items for all the Rep users.

  • To remedy this, we deleted the action items so that Reps could proceed with the voting. As a result, Reps no longer saw a voting notification when navigating to the Reps portal. This loss of information was compensated for by the Reps Council's multi-channel awareness campaign for the Council elections.
  • Immediately after that, we scaled our gunicorn web workers to three workers per node to increase concurrency and stability.
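
For readers unfamiliar with gunicorn, the worker change in the second step can be sketched as a gunicorn configuration file, which is plain Python. Only the worker count of three comes from this report; the bind address and timeout are assumed placeholder values, and this is not the actual ReMo configuration.

```python
# gunicorn.conf.py
# Illustrative sketch only, not the real ReMo deployment config.

# Three worker processes per node, as described in the step above.
workers = 3

# Assumed placeholder values (not taken from the incident report).
bind = "0.0.0.0:8000"  # address and port the workers listen on
timeout = 30           # seconds before an unresponsive worker is restarted
```

Such a file would be passed to gunicorn with the `-c` option; the same worker count can also be set directly on the command line with `--workers 3`.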

The site is now stable and we are continuing to monitor its performance.

As a side effect of the investigation, we also identified that the login issue reported by a few users was due to a cache misconfiguration. We resolved this issue by updating the cache entries in our infrastructure.

Our key takeaway from this is that we now have a better understanding of scaling gunicorn on container-based infrastructure, and this experience will greatly help us with the upcoming migration to the ParSys infrastructure. Another insight we gained is that we need to catch up on our monitoring debt and get New Relic up and running on the ParSys infrastructure.