The pressure of Black Friday traffic at many companies and of product launches at Apple is legendary. Companies and fortunes are made or lost on how well they weather major spikes in interest, expected or not. Imagine making all the preparations for massive product-launch traffic and still having a major system failure…

On Tuesday November 15, 2016 I was a panelist at the Techstars Seattle Startup Week CTO Panel on War Stories and Best Practices. We were asked to “tell the story about the single most extreme, tech-related, business-impacting event you’ve experienced in your career, and what resulting advice would you pass along to an existing or future CTO?” This is an excellent question (thanks for the invitation, Jeff!) and I’d like to share my story and 7 crucial practices for resilience at scale with you now.

On June 15, 2010 my team launched the Apple Store app for iPhone, which instantly became the fastest app to reach a million downloads. The app’s marquee feature allowed customers to reserve a new iPhone 4 without having to enter their AT&T account information.

In previous launches, the API Apple used to communicate with AT&T had longer than expected latency, causing requests to pile up and eventually time out. For this launch, we used high estimates of traffic and latency to size our server fleet. We deployed additional instances and configured each with 32 threads to give ourselves much more headroom.
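
For a sense of the sizing arithmetic, a calculation like this boils down to Little's Law: the number of requests in flight equals the arrival rate times the latency. Here is a minimal sketch of that math in Python; the traffic, latency, and headroom numbers are illustrative placeholders, not the actual launch figures.

```python
import math

# Back-of-the-envelope fleet sizing via Little's Law.
# All numbers are illustrative placeholders, not the real launch figures.
PEAK_RPS = 5000            # assumed worst-case reservation requests per second
UPSTREAM_LATENCY_S = 2.0   # assumed worst-case round trip to the carrier API
THREADS_PER_INSTANCE = 32  # request-servicing threads configured per instance
HEADROOM = 2.0             # safety factor on top of the estimate

# Little's Law: requests in flight = arrival rate x time each request is in flight.
in_flight = PEAK_RPS * UPSTREAM_LATENCY_S

instances = math.ceil(HEADROOM * in_flight / THREADS_PER_INSTANCE)
print(f"Requests in flight at peak: {in_flight:.0f}")
print(f"Instances to provision:     {instances}")
```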

Despite these preparations, we were still seeing reservations slow to a crawl and time out. Our marquee feature was almost unusable.

We always ran a “War Room” for launches, and we had our own people walking through the reservation flow, so we had firsthand knowledge of the problem with the customer experience. Within minutes of go-live, we were getting inquiries from people up the management chain asking what the problem was and when it would be resolved. Directors appeared in the War Room asking pointed questions and formulating responses to business owners. We were on the verge of being overtaken by the avalanche. We needed to find and fix the root cause—quickly! Here’s what we did…

We engaged the team that owned the AT&T interface to start rapid troubleshooting. The upstream system was showing heavy load but was coping with it. However, things were so bad for us that we could not access the administrative API endpoints on our own service instances, which made diagnosis very difficult.

We still suspected we were tying up request-servicing threads waiting for responses from the upstream service, but until we could access our internal metrics and configuration data through the admin API, we were hesitant to make infrastructure changes based on a hypothesis alone.
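
To illustrate the failure mode we suspected, here is a toy Python sketch, not our production code and with assumed numbers: a small pool of request-servicing threads, each blocked on a slow upstream call, so a burst of requests queues up and the later ones appear to hang.

```python
import time
from concurrent.futures import ThreadPoolExecutor

UPSTREAM_LATENCY_S = 2.0  # assumed worst-case carrier round trip
WORKER_THREADS = 8        # a small request-servicing pool
INCOMING_REQUESTS = 64    # an illustrative burst of reservation requests

def handle_reservation(_: int) -> None:
    time.sleep(UPSTREAM_LATENCY_S)  # stand-in for the blocking upstream call

start = time.monotonic()
with ThreadPoolExecutor(max_workers=WORKER_THREADS) as pool:
    list(pool.map(handle_reservation, range(INCOMING_REQUESTS)))
elapsed = time.monotonic() - start

print(f"{INCOMING_REQUESTS} requests through {WORKER_THREADS} threads: {elapsed:.0f}s")
# 64 requests / 8 threads * 2s each is roughly 16s; the same burst through 32
# threads clears in about 4s. Requests at the back of the queue look hung and
# time out even though each one, once started, takes only 2s.
```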

I teamed up with one of our core platform engineers and we sequestered ourselves from the group to gain access to the admin interface. Once we could focus on just that one aspect, we rapidly identified the problem.

Based on the service statistics, such as request rates and instance counts, I concluded that we were only using 8 out of the 32 threads each instance was configured for. At the same time, the engineer discovered that the load balancer for the application was routing all the requests to the admin port instead of the main port, and sure enough the admin port was configured for only 8 threads.
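
The arithmetic behind that conclusion is just Little's Law run in reverse: busy threads per instance equal the per-instance request rate times the average latency. A sketch with illustrative numbers, not the actual launch statistics:

```python
# Little's Law in reverse: busy threads = per-instance request rate x latency.
# The numbers are illustrative, not the actual launch statistics.
observed_rps_per_instance = 4.0  # from service stats (assumed value)
observed_latency_s = 2.0         # from service stats (assumed value)
configured_threads = 32

busy_threads = observed_rps_per_instance * observed_latency_s
print(f"Threads busy per instance: {busy_threads:.0f} of {configured_threads}")
# Roughly 8 of 32: something was capping concurrency at 8 per instance,
# which matched the 8-thread configuration of the admin port.
```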

As soon as we routed traffic to the correct port, the system was able to handle the load easily.

A few points of advice from this tale:

  1. Manage risk — Prevent where you can, mitigate elsewhere. We did not have an alternate implementation of this marquee feature, so we designed it to handle the worst-case latency. When we cleared the configuration problem, we had plenty of headroom.
  2. Own your dependencies — Strong relationships with upstream teams. Manage via metrics and SLAs. This is actually an Amazon principle of ownership; read John Rossman’s excellent book The Amazon Way: 14 Leadership Principles Behind the World’s Most Disruptive Company.
  3. Design for serviceability — Make it possible to inspect (and, where appropriate, modify) the actual state of live instances, including configuration, environment, and statistics.
  4. Don’t be stuck — Production support can be scary, especially when the stakes are so high. It’s a trainable skill to recognize when you are getting stuck and engage others.
  5. Truly understand your platform — Keep your experts close. Knowledge about how the system really works can save you.
  6. Demand data — We could have just added instances, but without the real data and math, we could have added too few and spent a bunch more time figuring out why that still was not enough.
  7. Load-test — We sized our fleet based on the load we had measured a single instance could handle, which was a good start. But we had not placed enough test load on the final production fleet to detect the misconfiguration before we went live (a minimal sketch of such a test follows this list).
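
Here is the minimal load-test sketch promised in point 7. The endpoint, concurrency, and request counts are hypothetical; a real test would ramp load gradually, go through the same load balancer customers will hit, and watch server-side metrics as well as client-side latency.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET_URL = "https://reservations.example.internal/health"  # hypothetical endpoint
CONCURRENCY = 200  # illustrative
REQUESTS = 2000    # illustrative

def timed_request(_: int) -> float:
    start = time.monotonic()
    with urlopen(TARGET_URL, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(timed_request, range(REQUESTS)))

print(f"p50={statistics.median(latencies):.2f}s  "
      f"p95={statistics.quantiles(latencies, n=20)[18]:.2f}s")
# If p95 climbs as concurrency rises while the instances report idle threads,
# suspect something in front of them, such as a misrouted load balancer.
```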

Oh, and don’t allow your admin API endpoints to serve requests for your main API! We would have discovered the problem much earlier if we had not made this error.
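
One way to enforce that separation is to bind admin and customer traffic to completely different listeners and have the admin listener refuse anything that is not a known admin path, so a misrouted load balancer fails loudly instead of limping along. A minimal sketch, with made-up ports and paths rather than our actual stack:

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAIN_PORT = 8080   # what the load balancer should target (hypothetical)
ADMIN_PORT = 8081  # internal only: stats and config inspection (hypothetical)

class MainHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Customer-facing traffic is served only on the main port.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"customer traffic\n")

class AdminHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve only known admin paths; anything else is a routing mistake,
        # so fail fast with 421 Misdirected Request instead of absorbing load.
        if self.path in ("/admin/stats", "/admin/config"):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"admin data\n")
        else:
            self.send_error(421)

if __name__ == "__main__":
    admin = ThreadingHTTPServer(("0.0.0.0", ADMIN_PORT), AdminHandler)
    threading.Thread(target=admin.serve_forever, daemon=True).start()
    ThreadingHTTPServer(("0.0.0.0", MAIN_PORT), MainHandler).serve_forever()
```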

As we head into the peak season for retail, it’s not too late to plan ahead for resilience; architecture and design adjustments may have to wait, but smart risk management and, if necessary, issue handling can make all the difference. Surf that avalanche!

Have you surfed your own avalanche, or do you see one coming? I’d love to hear your hopes, fears, challenges and successes. If you’re not sure where your avalanche may come from, share your thoughts in the comments below, or reach out to me.