(Almost) every time our servers crashed

Hack the North
11 min read · Jul 23, 2020


Written by: Kevin Pei

#0. 😊 Humble Beginnings

Hello, Hackers.

Zero score and six years ago, a group of young, budding students at the University of Waterloo sought to reinvent the hackathon. They dared to question the status quo, vowing to build an event that would:

  1. Include students from countries around the world
  2. Bring in inspirational founders and figures from across the industry
  3. Never, ever serve the all-too-common pile of dough (pizza) seen throughout the country

He who shall not be named

With this innovative mindset came an innovative technology stack, one built to last the ages with powerful permissions handling and the flexibility to store anything a hackathon organizer could ever ask for. Hacker applications? Of course. Organizer reimbursements? Yup! Sponsor data? Most definitely. Simply put, the founders built their own version of Google Forms.

But with great power (and little time) came no maintainability, a patchwork system of shell scripts and 5-line escape hatches destined to haunt ̶o̶r̶g̶a̶n̶i̶z̶e̶r̶s̶ unsung heroes for generations to come. These are the stories of exhilarating adventure, preposterous bugs, and tireless nights. These are the stories of our infrastructure.

#1. 🤤 Insatiable Permissions

Our HackerAPI backend is built around a “meta API” of sorts that allows us to define arbitrary data “pipelines” for our data to flow through. Each pipeline consists of:

  • Claims: An entry of data into the pipeline
  • Fields: The kinds of data stored in each claim
  • Stages: The progression of a claim through the pipeline

To give an example, our “hacker applications pipeline” is defined like so:

  • Claims: Representing an application by a Hacker
  • Fields: The questions we ask the hacker within each application
  • Stages: Is the application submitted? Accepted? Checked into the event?
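
For concreteness, here is a rough sketch of how these pieces could fit together as data models. The names and shapes below are our own illustration, not the actual HackerAPI schema:

# pipeline_models.py -- illustrative only; names and field shapes are hypothetical
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Stage:
    name: str  # e.g. "submitted", "accepted", "checked_in"

@dataclass
class Field:
    name: str        # e.g. "What school do you attend?"
    field_type: str  # e.g. "string", "number", "file"

@dataclass
class Claim:
    stage: Stage             # where this entry currently sits in the pipeline
    answers: Dict[str, str]  # field name -> submitted value

@dataclass
class Pipeline:
    name: str                # e.g. "hacker_applications"
    fields: List[Field]
    stages: List[Stage]
    claims: List[Claim] = field(default_factory=list)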

While this structure allows us to essentially store anything we’d like in our database, it also results in some incredibly complex permissions logic that requires numerous database lookups to complete. In particular, each pipeline, stage, and field has its own set of permissions in increasing levels of specificity.

Consider this example:

  • Users can create a claim in the “hacker application pipeline.” This corresponds to a hacker applying to our event.
  • Users cannot edit their claim after it has been moved to the “submitted” stage.
  • Users can generally read the associated fields on their claim, since they contain their answers to the application questions, but there are also fields they cannot read, such as their application review score.

This notably led to the famous “n+1” problem, where we would repeatedly load new (and sometimes the same) fields, pipelines, and stages from the database for each claim we ran a permission check on. Worse, the usual fix of eager loading didn’t work for us, as these permission checks were scattered throughout our codebase and often occurred through several levels of indirection:

# file1.py
def getClaimWithUserId(userid):
    # ... load the claim belonging to this user from the database ...
    return claim

# file2.py
from file1 import getClaimWithUserId

def somethingElse():
    claim1 = getClaimWithUserId(2)
    claim2 = getClaimWithUserId(3)
    return [claim1, claim2]

# file3.py
from file2 import somethingElse

def someEndpoint():
    # this triggers the permissions check, as we are outputting the
    # claims to the client
    return withPermissions(somethingElse())

This was a significant bottleneck, causing endpoints with multiple claims (such as search results) to take upwards of 30 seconds per request.

However, exciting things are happening in 2020. We are rewriting our backend from a Python-based REST API to a TypeScript-based GraphQL API. With this has come a new database ORM called Prisma. Prisma has built-in query batching that allows it to intelligently combine queries from different files at runtime.

In short, this means that when our application decides to load 40 different fields at once, Prisma will transparently combine these 40 database queries into one quick, efficient query before sending it off to the database. So far, we have seen performance gains of over 10x — it is said that performance engineers hate this one simple trick :)
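
To show the flavour of what batching buys us (this is just a sketch of the idea, not Prisma’s actual API), imagine collecting every field lookup that is about to happen and resolving them all with a single query:

# batching_sketch.py -- the idea behind query batching, not Prisma's API
class FieldBatcher:
    def __init__(self, db):
        self.db = db          # hypothetical database handle
        self.pending = set()  # field ids we know we'll need
        self.cache = {}       # field id -> loaded row

    def want(self, field_id):
        # Register a lookup without hitting the database yet.
        if field_id not in self.cache:
            self.pending.add(field_id)

    def get(self, field_id):
        # The first real read flushes every pending lookup in one query,
        # instead of issuing one query per field (the "n+1" pattern).
        if self.pending:
            rows = self.db.load_fields(list(self.pending))  # one query; hypothetical helper
            self.cache.update({row.id: row for row in rows})
            self.pending.clear()
        return self.cache[field_id]

Prisma does the equivalent of this transparently at runtime, without us having to restructure the call sites.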

#2. 😨 Drowning in Apps

Every now and then, I tell someone that Hack the North uses Kubernetes and I always get “why in the world?!” in response. For the uninitiated, Kubernetes is an enterprise-scale container orchestration framework developed by Google. In other words, it uses a lot of servers to deploy and host a lot of applications.

Throughout its history, Hack the North has progressively accumulated a surprising number of internal tools and apps that it needs to self-host. To be specific, as of July 2020, Hack the North hosts in one way or another:

  1. The main hackthenorth.com website
  2. The day-of attendee dashboard
  3. The volunteer shift dashboard
  4. The sponsor dashboard
  5. The internal team dashboard
  6. The old internal team dashboard
  7. Metabase, our business intelligence and data analytics tool
  8. The old API backend
  9. The new API backend
  10. The judging tool
  11. The mentorship slack bot
  12. Elasticsearch, which includes two main nodes, an instance of Kibana, and a beat for APM ingestion
  13. The hardware checkout tool
  14. The hacker check-in tool
  15. The volunteer QR scanning tool
  16. The hacker application dashboard
  17. Plausible, our analytics tool
  18. Outline, our documentation tool
  19. Bitwarden RS, our password management tool
  20. Buildkite, our continuous integration and deployment pipeline

As you can probably imagine, as our list of internal tools and apps grew it became increasingly untenable to manage individual instances on AWS. Worse, it was horribly expensive and cost-inefficient as we’d deploy individual instances for each thing we wanted to host. With that in mind, we made a difficult but necessary switch to Kubernetes on Google Kubernetes Engine. Beyond simpler resource management and cost efficiency, we gained several additional benefits from Kubernetes:

  • Instantaneous and automatic scaling for high-demand resources

What used to be a scary and precarious process of resizing individual AWS instances became an automatic process that could be configured at the click of a button

  • Simple authentication walls for internal resources

S.S. Octopus allowed us to quickly and easily deploy internal tools such as Kibana behind a Google authentication wall that only organizers can access

  • Safe, reproducible deployments written in configuration

Kubernetes is widely known to be mildly confusing at first due to the myriad of concepts and configuration options one must understand to use it effectively. However, writing everything down as configuration is actually an advantage: it allows us to codify much of the infrastructure knowledge that was previously simply “remembered” by the relevant team member.

#3. 🙈 No-Face meets Kubernetes

Imagine you check your server performance to find this:

[06:13] all the endpoints have been about 10x slower since july 27

[06:13] for spook reasons I don’t understand

[06:17] so far I have confirmed:

- database load (in CPU, memory, capacity, etc.) has not meaningfully changed

- request numbers have actually gone down since the app rush

- There aren’t any new database calls being made

- Resource restrictions preceded July 27 (those happened on July 22), so I don’t think it’s that

- Database call time has increased about two fold

- Weird idle wait time has increased about 7 fold

See: [Kibana Link]

[06:18] What I most likely suspect right now is this change: [Github link]. It’s the only thing that would have affected a large number of endpoints

[06:19] But I don’t reasonably see why that would slow anything down

You spend the next month debugging, beyond confused about how this happened when the only thing that was released that day was a completely unrelated feature with no theoretical impact on performance. Eventually, your team convinces you to try reverting it, but, alas, it doesn’t work:

Reverting the change didn’t do anything

This leaves you with a one-line change that has yet to be reverted, but that one line change, again, has absolutely no relation to performance and certainly no relation to the performance of every endpoint on the server. What could be the cause of the slowdown? How does a backend server just get 10x slower with no reason to be found in version control nor chat logs? How!?! No clue, but it’s been a whole month, and you need to figure this out before the event runs.

Emergency slack meeting

I’ll let the chat logs explain the rest:

Other organizer:

[00:41] Thoughts on downgrading to only 1 instance

[00:42] And testing individual endpoints to maybe further narrow what exactly is so expensive?

Me:

[00:42] I know for a fact

[00:42] it’s all endpoints that ping the DB

Other organizer:

[00:43] Yeah because we have run sync db calls LMAO

[00:43] There’s no way we can become async bcuz it requires library changes and we can’t do that without blowing way too much off

Me:

[00:44] Sure

[00:44] but these exact endpoints

[00:44] were 10x faster

[00:44] before 2:00AM SGT July 28

[00:44] Even the schedule endpoint

[00:44] which only makes 2–3 db calls

[00:44] went from 200ms response to 2s

[00:45] according to kibana the actual queries didn’t get much slower

[00:45] just odd waiting time

….

Codirector:

[18:08] Following up on this [emergency meeting schedule request], can we throw something onto the calendar to make sure we actually do this?

… (next day)

Another organizer:

[22:31] https://www.postgresql.org/docs/current/runtime-config-logging.html#GUC-LOG-LOCK-WAITS

And, to the best of my memory, the contents of that call:

Everyone: 2 hours in, sitting there with no idea why this is happening

Me: We’ve looked at the code changes, we’ve tried reverting them, we’ve looked at our server resource allocations, we’ve looked at our database throughput, what else could there be?

Other organizer: Wait, why does Kubernetes (our hosting platform) say our instances have a limit of 0.1 CPU?

Me: …what?

Other organizer: Yeah, look [shares screen]

Me: Wait, but our config files say there’s no hard limit on the CPU

Other organizer: OHHHH, remember when I changed the CPU to 0.1 a month and a half ago to test?

Me: …

Other organizer: …

MFW

It turns out that when you don’t specify a CPU limit in your new Kubernetes config, it keeps the old CPU limit. The old limit had been set to 0.1 for debugging, and nobody knew any better. It wasn’t in our code, it wasn’t in our saved config files, and it certainly wasn’t in our Slack messages.

The next day it all went back to normal!
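
In hindsight, comparing what the cluster was actually running against what our config files said would have caught this in minutes. As a rough sketch (the namespace name here is hypothetical), the official kubernetes Python client can print the effective limits on every pod:

# check_limits.py -- list the CPU limits pods are *actually* running with
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig credentials
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("backend").items:
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        print(pod.metadata.name, container.name, "cpu limit:", limits.get("cpu"))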

#4. 📧 You have Mail!

Of all the reasons a server can crash, email delivery didn’t seem like one of them. After all, sending out a newsletter to 10000000000 recipients usually meant throwing it at Mailgun (our mail delivery service) and calling it a day. But this is Hack the North, the place of innovation… and overdone complexity.

#Bino”mail”

Past organizer commented on Apr 8, 2017

The way that a server processes emails (Background Information):

Stage 1: greylisting. The server validates the domain that each email is coming from by forcing it to send follow-up requests. Takes 2–5 minutes for 1 email to go through

Stage 2: whitelisting. After greylisting (the first few emails go through successfully), we are whitelisted until there is a period of inactivity of X minutes. (After the period of inactivity, we have to go through the greylisting process again). While we are in the whitelist stage, all of our emails go through, provided we don’t exceed Y emails per minute (e.g. ~100/minute for UW).

Based on this, our goals are:

Send a few emails while in greylisting stage

Send a much higher number of emails in whitelisting stage, keeping in mind:

a) don’t exceed Y emails per minute

b) stretch white-listing period so that we aren’t back to greylisting stage

We care about extending our whitelist period even when we are sending a large batch, because they usually require a response from the receiver, and then a follow-up email from us (e.g. 1. acceptance email -> pls RSVP, 2. thanks for rsvp-ing email)

— — — — — — — — — — — — — — — — —

Sending out a large batch of emails (e.g. acceptances, rejections)

Gmail doesn’t care, so we don’t need to offset there.

For every other domain: Schedule to be sent as a binomial distribution, push to queue

This meant we created our own email scheduler, notably one that used the immense powers of numpy and a recurring cron job to individually send out emails in an approximately normal-like distribution over time.
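
As a rough illustration of the approach (not our actual scheduler; the window size, rate cap, and queue handling below are made up), spreading a batch of sends across time buckets with numpy might look something like this:

# schedule_sends.py -- sketch of spreading a mail batch over time with numpy
import numpy as np

def schedule_sends(recipients, window_minutes=60, per_minute_cap=100):
    # Draw a minute offset for each recipient from a binomial distribution,
    # which clusters sends around the middle of the window (roughly bell-shaped).
    offsets = np.random.binomial(n=window_minutes, p=0.5, size=len(recipients))

    schedule = {}
    for recipient, offset in zip(recipients, offsets):
        schedule.setdefault(int(offset), []).append(recipient)

    # Respect the per-minute cap by pushing overflow into the next minute.
    minute = 0
    while schedule and minute <= max(schedule):
        overflow = schedule.get(minute, [])[per_minute_cap:]
        if overflow:
            schedule[minute] = schedule[minute][:per_minute_cap]
            schedule.setdefault(minute + 1, []).extend(overflow)
        minute += 1

    return schedule  # a recurring cron job would then send each minute's batch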

However, this also meant that we kept a list of pending emails in our database and had to update them one by one as they got sent out. Back in 2017, our email list wasn’t large enough for this to be a performance problem. In 2019, however, it turned out that running 1,000 sequential database calls often starved the server of resources and caused everything to come crashing down. It was particularly difficult to track this down because our marketing emails often coincided with feature launches (e.g. applications open!) that put additional load on the server anyway. But we eventually caught on, and the backend was patched to batch database calls in a much more performant manner.
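
The fix boils down to replacing a thousand round trips with one. A minimal sketch of the idea, assuming a Postgres connection through psycopg2 (the table and column names are hypothetical):

# mark_sent.py -- mark a whole batch of emails as sent in a single query
import psycopg2

conn = psycopg2.connect("dbname=hackerapi")  # hypothetical connection string
sent_ids = [101, 102, 103]  # ids of the emails we just handed to Mailgun

with conn, conn.cursor() as cur:
    # One UPDATE over the whole batch instead of one UPDATE per email.
    cur.execute(
        "UPDATE pending_emails SET status = 'sent' WHERE id = ANY(%s)",
        (sent_ids,),
    )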

#So many events

A second issue arose from us attempting to ingest massive numbers of events from Mailgun. For example, Mailgun would call our webhook every time an email was delivered, not delivered, sent to spam, not sent to spam, etc. Since delivery usually happened shortly after we made the API calls to send the mail, this meant we were effectively DDoS’ing ourselves every time we sent out a newsletter. Worse, there wasn’t an easy way to batch the database insertion queries, as each event came in as a separate request.

In the 5 or so years since Hack the North started, our mail events table had collected nearly 600,000 rows! In fact, it was literally half of all the data we had stored in our database.

Our database before and after we deleted our mail events table

This was clearly unsustainable, but the solution turned out to be rather simple: we already internally hosted an instance of Elasticsearch for performance monitoring purposes, so why not use it to archive our mail events too? Elasticsearch is an incredibly fast and powerful JSON store built to ingest large amounts of data, and our performance monitoring suite was already feeding it a gigabyte or so of data per week. So, we threw our mail events at it too and our server load went away :)
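
A minimal sketch of what that archiving looks like, assuming the official elasticsearch Python client (the index name, event shape, and URL are hypothetical):

# archive_mail_events.py -- bulk-index Mailgun webhook events into Elasticsearch
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # our internally hosted cluster; URL hypothetical

def archive_events(events):
    # One bulk request per batch of webhook payloads, instead of one
    # Postgres INSERT per event.
    actions = ({"_index": "mail-events", "_source": event} for event in events)
    helpers.bulk(es, actions)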

#5. Fin

There was a beautiful time one score ago when every computer booted up with this magnificent, confidence-inspiring piano ballad that went something like dun, dun dun dun, dun dun. This beautiful cacophony would then be followed by a peaceful photo of rolling green hills against a vivid blue sky.

Millions of people would start their day like this, imbued with the energy, strength, and stability that this jewel brought them. Little did they know just how much of a disaster this jewel was under the hood, how, despite its brilliance, it was duct-taped together by unsung heroes of tireless nights.

And while our servers have no such startup sound nor rolling green hills (they run on Linux), they do have the same sense of duct-taped-together brilliance that somehow makes it all work — built by 5 generations of unsung heroes who worked tireless nights to help bring the event to fruition.

This was the story of our infrastructure.

Hack the North is scheduled for January 15–17, 2021 🎉

Sign up for our mailing list at hackthenorth.com to hear the latest from Hack the North! ⚙️

