Hello Folks, Wassup!

This is the first post in the distributed systems & scalability blog series that I’ve started to discuss the intricacies of the production systems of large-scale distributed services such as YouTube, Heroku, Netflix & so on. Arguably, every service running online today that deals with heavy traffic is distributed in nature.

I’ve already written a bunch of articles on real-world architectures; you can find all of them here.

Like my other posts, this one got pretty lengthy at over 2,500 words, so I’ve decided to split it into parts to make it less scary & overwhelming to read.

Affiliate Disclaimer: A few of the resources stated in this article contain affiliate links. That means if you find these resources helpful & worth spending your money on, and you buy a subscription, a course or a book, I get a small cut without you paying anything extra.

I recommend these resources because I think the content they offer is pretty good & will assist you big time in upskilling yourself, enabling you to soar in your career.

With that being said, let’s get on with it.

Heroku – Adding Client Rate Throttling to the Platform API

Heroku engineering, in their blog post, shared an insight into how they’ve implemented rate-throttling on their client. In computing, throttling means controlling the rate of events occurring in a system; this is important to keep the servers from being overwhelmed when subjected to a traffic load beyond their processing capacity.

Rate-limiting is a similar technique implemented on the backend of the system; as the name implies, it puts a limit on the rate at which the API servers process requests within a stipulated time, to ensure service uptime.

If a certain client sends too many requests, it starts receiving an error response due to the rate-limiting implemented on the server. But just limiting the rate at which the API processes client requests doesn’t make the system entirely efficient.

Implementing rate-limiting on the backend doesn’t stop the clients from sending requests. In this scenario, bandwidth is continually consumed, and the rate-limiting logic has to run continually on the backend, consuming additional compute resources.
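To make the idea concrete, here’s a minimal sketch of one common server-side rate-limiting strategy, the token bucket. This is an illustration of the general technique, not any particular service’s implementation:

```python
import time

class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens/sec."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # process the request
        return False      # reject, e.g. respond with HTTP 429
```

A server keeps one bucket per client (keyed by API token or IP) and rejects requests whenever the client’s bucket is empty.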

We need to implement rate-throttling on the client to enable it to reduce the rate at which it sends the requests to the backend as & when it starts receiving the error.

Rate-throttling & rate-limiting implemented together make an efficient API. Another way to deal with a deluge of client requests is to implement Webhooks. If you need to understand Webhooks & wish to gain a comprehensive insight into the fundamentals of web application & software architecture, you can check out my Web Application & Software Architecture 101 course here.

There are several reasons to throttle & limit requests to an API:

  • The resources are finite; there is a limit to the processing power & bandwidth capacity of the service. If clients keep sending requests, the rate-limiting logic has to run continually on the backend, consuming resources.
  • Rate-limiting adds a layer of security to the service to avert it from being bombarded by bot requests, web-scraping & DDoS attacks.
  • Several SaaS (Software as a Service) businesses base their pricing plans on the number of requests a client makes within a stipulated time. This makes rate-limiting vital for them.

So, this was a rudimentary introduction to rate-throttling & rate-limiting; if you need to further understand the strategies & techniques, this Google Cloud resource is a good read on it. Now let’s talk about how Heroku implemented it.


The Heroku API uses the Generic Cell Rate Algorithm (GCRA) as the rate-limiting strategy on its API. The service returns an HTTP 429 response when client requests hit the rate limit.
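The Generic Cell Rate Algorithm can be sketched in a few lines: it tracks a single “theoretical arrival time” (TAT) per client and rejects requests that arrive too far ahead of it. This is a simplified illustration of the algorithm, not Heroku’s actual code:

```python
import time

class GCRA:
    """Generic Cell Rate Algorithm: admits on average one request per
    `period` seconds, tolerating bursts of up to `burst` back-to-back requests."""

    def __init__(self, period, burst=1):
        self.period = period                   # emission interval T
        self.tolerance = period * (burst - 1)  # burst tolerance tau
        self.tat = 0.0                         # theoretical arrival time

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.tat - now > self.tolerance:
            return False                       # over the limit -> HTTP 429
        self.tat = max(self.tat, now) + self.period
        return True
```

Unlike a token bucket, GCRA needs only one stored value per client, which makes it cheap to keep in a shared store like Redis.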

From the Mozilla developer docs:
HTTP 429
The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time (“rate limiting”).
A Retry-After header might be included in this response, indicating how long to wait before making a new request.
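A well-behaved client can honor that header directly. Here’s a minimal sketch of such a client using only the Python standard library (the function names are my own, and this handles only the delay-in-seconds form of Retry-After, not the HTTP-date form):

```python
import time
import urllib.error
import urllib.request

def retry_after_seconds(headers, default=1.0):
    """Parse the Retry-After header (delay-seconds form), falling back to
    `default` when the header is missing or malformed."""
    try:
        return max(0.0, float(headers.get("Retry-After")))
    except (TypeError, ValueError):
        return default

def get_with_throttling(url, max_attempts=5):
    """GET a URL, sleeping for the server-suggested delay on each HTTP 429."""
    for _ in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(retry_after_seconds(err.headers))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```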

They needed efficient rate-throttling logic on their client to tackle the HTTP 429 error; retrying immediately every time a request failed would certainly have DDoSed the API.

They started with different tests to write an efficient rate-throttling strategy, but checking the effectiveness of the different strategies was far from simple. That led them to write a simulator in addition to their tests. The simulator mimicked the API’s behaviour, which enabled them to converge on a satisfactory throttling strategy.
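As a rough illustration of what such a simulator might look like (this is purely my guess at the shape of the thing, not Heroku’s code): run many clients against a capacity-limited server in discrete time steps, plug in a throttling strategy, and count how many requests get throttled.

```python
import random

def simulate(strategy, clients=10, steps=1000, server_capacity=5):
    """Count throttled requests when `clients` clients share a server that can
    process `server_capacity` requests per step. `strategy(current_sleep)`
    returns how many steps a throttled client should sleep before retrying."""
    sleeps = [0.0] * clients   # steps each client still waits before retrying
    throttled = 0
    for _ in range(steps):
        wanting = [i for i in range(clients) if sleeps[i] <= 0]
        random.shuffle(wanting)                  # server picks winners at random
        served = set(wanting[:server_capacity])
        for i in wanting:
            if i not in served:
                throttled += 1                   # simulated HTTP 429
                sleeps[i] = strategy(sleeps[i])
        for i in range(clients):
            if i not in wanting:
                sleeps[i] -= 1                   # sleeping clients wait it out
    return throttled
```

Comparing the throttled-request counts (and sleep times) produced by different strategies is what lets you quantify which one is better rather than eyeballing it.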

Now the next step was to integrate that strategy with the application code and deploy the updates to production, but the algorithm they came up with wasn’t that maintainable. It would have been hard for a new engineer to tweak things if the creator of the algorithm wasn’t around.

If you wish to understand the code deployment workflow, application monitoring, how nodes work together in a cluster, how cloud deploys our service across the globe & more, you can check out my platform-agnostic cloud computing fundamentals course here.

Quoting the programmer who wrote the algorithm: “I could explain the approach I had taken to build an algorithm, but I had no way to quantify the ‘goodness’ of my algorithm. That’s when I decided to throw it all away and start from first principles.”

The new quantifying goals of the rate-throttling algorithm were:

  • Minimum average retry rate: this meant having a retry rate with the minimum number of HTTP 429 responses from the server. This would also cut down unnecessary bandwidth consumption, and the backend would consume fewer resources running the rate-limiting logic.
  • Minimum maximum sleep time: this meant minimizing the wait time of the client before it retries a request; no consumer of the service should wait longer than absolutely necessary.
    At the same time, the system should ensure that throttling the requests doesn’t leave the API under-consumed; if the API can handle 500K requests a day, the clients should be capable of consuming that quota every single day.
  • Minimize variance of request count between clients: every client should be treated the same by the server; there should be no exceptions.
  • Minimize time to clear a large request capacity: as the state of the system changes, the clients should adapt. If new load is introduced to the backend with the arrival of new clients, the rate-throttling algorithm of all the clients should adapt.

Finally, the Heroku team ended up using the exponential backoff algorithm to implement the rate throttling on their clients.

As per Google Cloud, exponential backoff is a standard error-handling approach for network applications. As per AWS, it not only increases the reliability of the application but also reduces operational costs for the developer.
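In its simplest form, exponential backoff doubles the wait after each consecutive failure, usually with random jitter added so that throttled clients don’t all retry in lockstep. Here’s a sketch of the general “full jitter” variant (an illustration of the technique, not Heroku’s exact implementation):

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Yield 'full jitter' exponential backoff delays: for attempt n,
    a random wait in [0, min(cap, base * 2**n)] seconds."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))

# A client would sleep for each delay in turn between failed attempts,
# giving the overloaded server progressively more room to recover.
```

The jitter matters: without it, every client that got throttled at the same moment would retry at the same moment too, recreating the very spike that caused the 429s.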

Check out The Good Parts of AWS: Cutting Through the Clutter course on Educative. This is not a typical AWS reference course; you won’t find most of the knowledge that’s shared here in the AWS docs. The goal here is to help you realize which AWS features you’d be foolish not to use, features that have passed the test of time by being at the backbone of most things on the Internet. It’s written by a former Amazon engineer with 15 years of experience working on AWS.

If you wish to delve into the details, here is the Heroku post link.
If you want to have a look at the code, here you go.

Recommended reads:
Designing scalable rate-limiting algorithms
Building a distributed rate limiter that scales horizontally

Zero to Software Architect 🙂

Zero to Software/Application Architect is a learning track of four courses that I am writing with the aim of educating you, step by step, on the domain of software architecture & distributed system design. The learning track takes you right from having no knowledge of the domain to making you a pro at designing large-scale distributed systems like YouTube, Netflix, Google Stadia & so on. Check it out.

Short-Term & Long-Term Solutions When Your Service Fails To Scale

Here is an interesting post by the 8th Light tech team listing out the short-term & long-term solutions for when our service is hit by unexpected traffic and sadly fails to scale. It might be the database or the application server that is on fire, or maybe the disk has maxed out as a result of the data deluge. How do we respond to such a situation? Have a read.

Become a Pro In DevOps Starting Right From The Basics 🙂

KodeKloud is a platform that trains you right from the basics – on the fundamentals of DevOps. You can go through the fundamental courses on technologies such as Kubernetes, Docker, Ansible & many more & can also take the certification preparation courses on the same. The platform offers interactive hands-on training in real environments right in your browser. They have already trained over 180,000 students on DevOps. Check it out.

Alright, folks! This is pretty much it for the first edition of distributed systems & scalability. In the next one, I’ll be discussing data exchange formats in high-performance applications. You can subscribe to my newsletter to stay notified of new content published on the blog.



Here is my LinkedIn profile in case you want to say Hello!

Until then… Cheers!