Lessons Learned From Our October Outage

Mid-October, we had a serious outage. Here are some of the lessons we learned about developing in the bitcoin space, and software in general.

Reliability is King

This past year, we’ve learned that one of the most important factors for a great payment experience is reliability. Regardless of the discounts we offer, if people feel like they might be surprised or embarrassed at the register, they won’t pay with Fold.

In retrospect, this is obvious. If you’ve ever had a credit card declined, you know how embarrassing that can be. Do you ask the cashier to run it again? There are people waiting behind you!

That’s why this outage hurt so badly: it interrupted what should be a simple, consistent payment experience. It’s difficult to recommend a bitcoin payment app to a friend if you aren’t sure it will work.

Startups on Startups

In case you missed the original writeup, our issue was the result of a blockchain API we relied on. Because we only used one notifications provider, delays on their end kept our customers from buying their morning coffee.

We knew that something like that could happen. Building a startup on top of a startup is always a risky proposition. In our case, we had already built our company on top of bitcoin, a startup if there’s ever been one, and relying on others compounded our risk.

We originally took this approach because we were assured by the provider that uptime was their first priority. And while that may have been true, it wasn’t enough. When uptime of a service is core to your customer experience, you have to take matters into your own hands.

Spread the Risk

On the other hand, we aren’t ready to become blockchain analysis experts. Accomplishing what notification services like BlockCypher do is more than just running nodes. It also means extensive monitoring of the network to provide statistics like transaction confidence, which is important if you accept 0-conf transactions.

So if you shouldn’t trust a single provider, but you don’t have the time to be an expert, what do you do?

Spread the risk across providers. We rebuilt our system to accept backup notifications from other services. We haven’t needed the backups yet (BlockCypher is pretty great), but if there ever is an outage, our system will use a different provider.
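
To make the idea concrete, here is a minimal sketch of accepting notifications from more than one provider and acting on each transaction only once, whichever provider reports it first. It is not our production code: the NotificationRouter class, the provider names "primary" and "backup", and the transaction IDs are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationRouter:
    """Accepts payment notifications from several providers and acts on
    each transaction once, whichever provider reports it first.
    (Hypothetical class, for illustration only.)"""

    # Maps transaction ID -> name of the provider that reported it first.
    seen: dict = field(default_factory=dict)

    def handle(self, provider: str, txid: str) -> bool:
        """Return True if this is the first notification for txid."""
        if txid in self.seen:
            return False  # duplicate from a slower or backup provider
        self.seen[txid] = provider
        return True


router = NotificationRouter()
results = [
    router.handle("primary", "tx-1"),  # first report, processed
    router.handle("backup", "tx-1"),   # duplicate, ignored
    router.handle("backup", "tx-2"),   # primary is late, backup wins
]
print(results)
```

The point of the design is that no provider is special: if the primary is delayed, the backup's notification simply arrives first and the payment goes through.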

Take My Money

If you’re using a service that isn’t charging you money, well, you might not be using that service for long.

Startup founders love to save money. But if you build your company around another service, when things go to hell, you want to be a customer, not a “user”. Customers get support. Customers can vote with their dollars. Free users can be easily dropped the next time the company decides to pivot.

And if you are building a startup on a startup, wouldn’t you rather know that the service you’re building on has a sustainable business model?

Better Monitoring

Lastly, better monitoring would’ve made all the difference.

It was hours before we got the first support request, and longer before we diagnosed the issue. Monitoring user payments could’ve shaved 6 hours off the incident response time.

But monitoring the notification service’s average response time would’ve predicted this issue, and let us prepare in advance. Reviewing the logs later showed that notifications had been erratically delayed for a while; this was just the first time it led to a service interruption.
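
As a rough sketch of what that kind of monitoring could look like, the snippet below keeps a rolling average of notification delays and flags when the average crosses a threshold. The DelayMonitor class, the window size, the 30-second threshold, and the sample delays are all assumptions made up for this example, not values from our system.

```python
from collections import deque
from statistics import mean


class DelayMonitor:
    """Tracks a rolling average of notification delays, in seconds.
    (Hypothetical class; window and threshold are illustrative.)"""

    def __init__(self, window: int = 5, threshold_s: float = 30.0):
        self.delays = deque(maxlen=window)  # keeps only the newest samples
        self.threshold_s = threshold_s

    def record(self, delay_s: float) -> bool:
        """Record one delay; return True if the rolling average
        now exceeds the threshold."""
        self.delays.append(delay_s)
        return mean(self.delays) > self.threshold_s


monitor = DelayMonitor(window=3, threshold_s=30.0)
# Made-up delays: two normal notifications, then two slow ones.
alerts = [monitor.record(d) for d in [5.0, 8.0, 90.0, 120.0]]
print(alerts)
```

In practice the delay would be measured between a transaction appearing on the network and the provider's notification arriving, and an alert would page someone rather than print. A rolling average is only one option; erratic delays like the ones we saw might show up more clearly in a high percentile.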

We’d love to hear what you think! Discuss with us on ZapChain.