Earthquake Recap: What Worked, What Didn't

Safecast was formed in reaction to the Fukushima Nuclear Power Plant Disaster, which was caused by the magnitude 9.0 Tōhoku earthquake and tsunami of March 11, 2011. Specifically, our project was begun because of the lack of publicly available data regarding the accident’s impacts and the condition of the nuclear reactors themselves in the days and weeks following the start of the disaster. In the last 5+ years we’ve put in a lot of work creating tangible results, not the least of which is collecting and publishing almost 60 million background radiation data points and deploying a realtime, always-on static sensor network. We’ve created the largest open radiation dataset to ever exist and put it entirely in the public domain. Researchers are now able to study radiation backgrounds in ways that were not possible before, and the public now has a vetted, trustworthy, independent source of information.
We’ve done this for many reasons. We want to help study the environmental consequences of the 2011 disaster; we want to help ensure that the information vacuum and the trust and communication issues of 2011 are not repeated the next time something similar takes place; and we want to help people become familiar with normal background radiation levels around the world. The 7.3 earthquake that struck off Fukushima yesterday was exactly what we’ve been preparing for, and luckily it proved to be an excellent test run for us. For everyone’s sake, we’re thankful it wasn’t another full scale emergency.
With yesterday’s events fresh in our minds, and in the interest of transparency, we thought it might be useful to look a little closer at how things played out on our end – what worked, what didn’t, and what we can improve on for next time. Much of our work has been preparation for similar hypothetical situations, so we’re fortunate to have an actual dry run to see how things work in practice.
The short version is that almost everything worked as expected or better, and the bits that failed, failed in a predictable way along failure lines that we’d anticipated. The failures were localized issues isolated from core processes, and were easy to correct as soon as we assessed the situation. By and large, we consider this all to be an overwhelming success as well as an excellent learning opportunity.
And very importantly, all of this work is pointless if no one knows about it or learns about it too long after the fact. We’re humbled and flattered by how fast the global community came together to help signal boost our reports and spread the word of our efforts. Getting the word out is arguably one of the most difficult aspects of this kind of work, and in that regard yesterday played out perfectly from our point of view.
So let’s look at some specifics. While our mobile data, which makes up the majority of our dataset, helps paint the picture of radiation levels around the world and allows mapping of areas previously uncharted, it’s a time snapshot and isn’t very helpful for monitoring a quickly developing situation like yesterday’s. That’s why we’ve been building static sensors, called “Pointcast,” which constantly report from one specific location and can alert us (and everyone else) within seconds of radiation levels changing. We purposely built the static sensor system on a separate server so that if, in the case of an emergency, it was suddenly overwhelmed with traffic it wouldn’t bring the entire Safecast system down. And that’s exactly what happened: the realtime server was overwhelmed, and the rest of the system stayed up.
Here’s a look at the requests we saw to the realtime server yesterday:
Requests to realtime.safecast
And here’s a look at the response time from yesterday:
Response time on realtime.safecast
As you can see, things got quite interesting there for a short period. For almost one hour the realtime subdomain was timing out, and for another hour after that it was experiencing intermittent outages. This was restricted to the web-facing server; all back-end functionality, including the API and the other servers, remained online throughout. The most obvious culprit was in fact low memory on the Apache server which runs our realtime system, and once we were able to access the server, upgrade the settings and allocations, and restart it, things balanced out quickly.
And here’s another look at the response time historically, at peak yesterday, and once we got things in order again:
It’s worth reiterating that we saw spikes in traffic across the entire Safecast platform, but thanks to the isolation of the realtime server, its going down didn’t impact anything else. Here’s a chart of the requests we saw to the main Safecast map:
Tilemap Traffic Spike
Now let’s get into the real guts of how this all played out. Safecast is an almost entirely volunteer organization with people spread out across the entire world working on different aspects of the project. When this earthquake hit it was midday in the US (as the PST time markers show on the images above), but it was very early in the morning in Japan and very late at night in Europe. Some people were woken up by the quake, others by frantic calls and notifications, but everyone mobilized and came together right away. It was wonderful to see in action. While Azby, Jam, Pieter, and I were able to jump in immediately and start directing the public, the press and volunteers to where they were needed most, Edouard, Kalin, Marc, and Rob dove into the servers and managed the technical stress wonderfully.
Here’s a general timeline:

  • The earthquake hit and the news began reporting it.
  • Within minutes most of the team was in our Slack channel monitoring the situation.
  • We confirmed all sensors/systems were online.
  • We reached out to friends and community and asked for help spreading the word.
  • Our links were posted and reposted thousands of times on Twitter and Facebook. Some of our tweets reached several million impressions in a matter of minutes.
  • We started hearing reports of our realtime web data not loading for people, and experienced some issues on our end as well.
  • We checked our backend and all sensors were still online, our API still functioning and logging data, and we were able to determine the problem was limited to the realtime subdomain.
  • We decided to restart Apache. Within a minute, the whole machine went down as the memory became exhausted.
  • The machine started responding again, as all the Apache processes that were paralyzing it died one after the other.
  • A reboot gave us a cleaner slate to work with, but only for a short while, as the requests added up quickly and brought the system down again.
  • All the while we were still seeing data coming in from the sensors behind the scenes and were able to convey this data out on our blog and social media streams – so while the front end visualizations of the static sensors were not working, the overall system still was.
  • We changed the Apache configuration to more conservative settings, limiting the number of simultaneous connections, to avoid bringing the machine down again.
  • We monitored memory, CPU and requests per second for 30 minutes to ensure all was well and then decided to be less conservative with the Apache config in order to face the higher number of requests.
  • We rolled out new settings (max 100 simultaneous workers), and we all saw that they held up.
  • This kept things online and allowed us to take some deep breaths and enjoy the news that none of our sensors were showing any increases in radiation and that overall we seemed to have dodged a bullet with this earthquake.
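
For readers curious about the reasoning behind capping Apache workers, here is a minimal sketch of the kind of back-of-the-envelope calculation involved. It assumes each worker uses a roughly fixed amount of memory and that some memory must be held back for the operating system and other services. The function name and every number below are hypothetical illustrations, not measurements from our server. In Apache 2.4 a cap like this is typically set with the MaxRequestWorkers directive, though this post doesn't go into which directive or process model we used.

```python
def max_workers(total_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Return the largest worker count whose combined memory fits in
    what is left after reserving memory for the OS and other services."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_mb - reserved_mb
    # Never return a negative count if the reservation exceeds total memory.
    return max(usable_mb // per_worker_mb, 0)


# Illustrative numbers only (hypothetical machine, not the Safecast server).
print(max_workers(total_mb=4096, reserved_mb=1024, per_worker_mb=30))  # prints 102
```

If the configured limit is higher than this estimate, a traffic surge can spawn enough workers to exhaust memory, which matches the failure pattern in the timeline above. Starting with a lower cap and raising it while watching memory is the cautious route.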

Had something gone wrong at one of the nuclear power plants, and radiation started leaking, we’re confident we would have known it within moments of the release, confirmed it quickly, and been able to convey that information to the public seconds later. Our system is independent of government or nuclear industry monitoring systems, arguably the only one of its kind, and nothing like it existed in 2011. As a reminder of how things played out then, it took days for the general public to receive confirmation of radiation leaks, and decisions about determining the impacts and final evacuations took weeks. In March 2011, many people were at risk of being exposed to radiation for days before basic information was made available to them, and even then the official statements were vague and generalized. Having specific, independent, confirmable information in seconds this time around is an improvement that can’t be overstated. It’s massive. And it worked.
It was an eventful few hours to say the least, and how this played out helped us identify some weak links in our process chain and devise corrections for them. We’ve already begun moving the realtime server to another host that offers load-based scaling, so that when demand is low so are the allocated resources, and as demand goes up so do the resources. This should prevent overloads and crashes, but can also get costly depending on how long the surge lasts, and can become a pricey target for attack. Our volunteer team has proven again and again to be exceptional under pressure, though we also recognize how useful having full-time staff on call can be in times like this. We’ve been fortunate to have the support of the Shuttleworth Foundation and the Knight Foundation over the years, enabling us to get infrastructure in place, and ongoing, recurring donations from the public (that’s you, thanks!) help us on a daily basis keep things online. As we look ahead a year, 5 years, 10 years down the line, we know we’ll need to make some bigger investments to ensure these systems continue to function as needed, when needed. And we continue to look for partners to join with us on this mission.