API Rate Limiting
Feb 13, 2022
What is rate-limiting?
- Dropping incoming requests when they exceed the rate or capacity you can handle
Why implement rate-limiting?
- To avoid getting hammered by clients
- To avoid system outages
How do you know if congestion is building up?
Look at:
- Average response time and the p50/p90 response-time percentiles (a minimal sketch for reading these off follows this list)
- Age of messages in the queue
- Count of messages in the dead letter queue
- Request throughput, node memory, CPU, etc.
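As a rough illustration, p50/p90 can be read off a window of recent response times. The sketch below is minimal and illustrative (the class name, window size, and sort-on-read are all assumptions); in practice a metrics library such as Micrometer or the Prometheus client does this for you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch: track recent response times and read off p50/p90.
// Sorting raw samples on every read is fine for illustration only.
public class LatencyWindow {
    private final List<Long> samplesMs = new ArrayList<>();
    private final int maxSamples;

    public LatencyWindow(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    public synchronized void record(long responseTimeMs) {
        if (samplesMs.size() == maxSamples) {
            samplesMs.remove(0); // drop the oldest sample: sliding window
        }
        samplesMs.add(responseTimeMs);
    }

    // p = 50 for p50, 90 for p90, and so on.
    public synchronized long percentile(double p) {
        if (samplesMs.isEmpty()) return 0;
        List<Long> sorted = new ArrayList<>(samplesMs);
        Collections.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}
```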
What happens if I don't rate-limit incoming requests?
Due to misbehaving clients or a genuine traffic surge, you may see:
- Out-of-memory (OOM) exceptions
- Delayed Responses / Higher latencies
- Resource exhaustion
- Cascading failures, possibly causing a system-wide outage (all machines/nodes eventually die)
Algorithms for Rate limiting:
- Sliding window
- Timer wheel / hierarchical timer wheel
- Leaky bucket
- Token bucket (sketched after this list)
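Of these, the token bucket is the easiest to sketch: tokens accrue at a fixed refill rate up to a capacity that caps burst size, and each request either consumes a token or is rejected. The class and parameter names below are illustrative, not any specific library's API.

```java
// Minimal token bucket sketch: capacity caps bursts, refillPerSecond
// sets the sustained rate. Names and structure are illustrative.
public class TokenBucket {
    private final long capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;          // start full: allow an initial burst
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0; // consume one token for this request
            return true;   // accept
        }
        return false;      // bucket empty: rate-limit this request
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = now;
    }
}
```

A request handler would call tryAcquire() on each request and answer HTTP 429 (Too Many Requests) when it returns false.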
Safeguards to put in place before Rate limiting:
- Consider using gRPC, which is built on HTTP/2 and avoids HTTP-level head-of-line blocking
- Consider handling many requests asynchronously (multiplexing)
- Use an efficient binary serialization library such as Kryo to save bandwidth
- Watch the number of client connections (keep HTTP connections open for a certain window to avoid the overhead of creating/destroying connections)
- Push/pull hybrid model: a normal user's post is fanned out to followers (push), while a celebrity's post is pulled by users on read
- Graceful degradation: Stop/suspend functionality that is not critical for your service during high load
- See whether the circuit breaker, timeout, retry, and bulkhead patterns help (a minimal circuit-breaker sketch follows this list)
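To make the last point concrete, here is a minimal circuit-breaker sketch: after a run of consecutive failures it opens and fails fast, then allows a probe call once a cool-off window has passed. The thresholds and names are assumptions; libraries such as Resilience4j ship hardened versions of this pattern.

```java
// Minimal circuit-breaker sketch. After failureThreshold consecutive
// failures the breaker opens and rejects calls immediately; once
// openMillis have passed, a single probe call is allowed through.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized boolean allowRequest() {
        if (consecutiveFailures < failureThreshold) {
            return true; // closed: pass traffic through normally
        }
        // open: fail fast until the cool-off elapses, then allow a probe
        return System.currentTimeMillis() - openedAt >= openMillis;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0; // close the breaker again
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis(); // (re)open the breaker
        }
    }
}
```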
Summary:
- Use a distributed service to decide whether a request should be accepted
- Tricks to use: request batching/collapsing (combine multiple identical requests into one), client-side rate limiting with exponential back-off (sketched after this list), and a global/distributed cache layer to avoid repeated computation
- Use a rate-limiting algorithm such as the token bucket (sketched above); a hierarchical timer wheel can be used to schedule its time-based bookkeeping efficiently
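Client-side rate limiting with exponential back-off might look like the sketch below: each retry waits roughly twice as long as the previous one, with random jitter so retrying clients don't stampede in lockstep. The attempt count, base delay, and helper names are illustrative.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Minimal exponential back-off sketch: the delay doubles on each retry
// and jitter spreads simultaneous retries apart. Parameters are illustrative.
public class Backoff {
    public static <T> T callWithBackoff(Callable<T> call, int maxAttempts,
                                        long baseDelayMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // out of attempts: surface the failure
                }
                long delay = baseDelayMs << attempt; // e.g. 100ms, 200ms, 400ms, ...
                Thread.sleep(delay + ThreadLocalRandom.current().nextLong(delay));
            }
        }
    }
}
```

A caller would wrap its request, e.g. Backoff.callWithBackoff(() -> httpGet(url), 5, 100), where httpGet is a hypothetical request function.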