Studying web apps that operate at scale teaches us a lot. The complex architectures of companies like Netflix have not only enabled them to serve content to millions of users, but also to improve their users’ experience and increase engagement.
For example, in a small company, a mechanism like push notifications can be part of the main codebase and operate alongside the rest of the application on the same machine.
But at Netflix, multiple servers spread across the globe are dedicated to push notifications. I have attached a link to a talk that covers this architecture in detail towards the end of the newsletter.
While these case studies are interesting, most people still don’t work at such a huge scale. In many cases, it’s more useful to learn how companies moved from their old, mainstream ways of serving their app to modern solutions that improve their users’ experience.
Let’s take the example of Twitter!
10 years back, around 2012–13, Twitter was growing exponentially and serving 150 million users.
They were handling around 6000 tweets/second from these users. I would say this is still manageable; we have databases with high enough throughput and write speeds to handle it.
But the issue shows up when users read tweets. Every user who opens the Twitter home page has to be served a combination of tweets from everyone they follow, and that comes out to 300k requests/second!!
That’s an insane amount of reads to directly do from the DB!
And what does each of these requests look like? Lemme give you some perspective. Imagine a person who follows 5000 users. They open the Twitter site. The backend hits the database with a query like:
Get all the tweets, ordered by time, that belong to any of these 5000 users
Even if I limit the query to a few tweets, this is still going to put a heavy load on the database.
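To make that concrete, here’s a minimal sketch of what such a “pull on read” query might look like, assuming a relational store with hypothetical `tweets` and `follows` tables (the schema is mine, purely for illustration). Multiply one query like this by 300k requests/second and the problem becomes obvious.

```python
# A rough sketch of the naive "pull on read" approach, using sqlite3 as a
# stand-in for whatever relational store is behind the app. The `tweets` and
# `follows` tables and their columns are made up for illustration.
import sqlite3

def fetch_home_timeline(db: sqlite3.Connection, user_id: int, limit: int = 50):
    # One query per page load, joining across everyone the user follows.
    return db.execute(
        """
        SELECT t.id, t.author_id, t.body, t.created_at
        FROM tweets AS t
        JOIN follows AS f ON f.followee_id = t.author_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()
```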
How Twitter solved their scaling issue 🚀
Twitter has two timelines:
- User timeline: This is the collection of all your tweets. This is fetched from the disk.
- Home timeline: This is the collection of tweets from the people you follow, basically your homepage.
When designing a database, the simplest DB would be one that just appends every write to the end of a file and reads from it whenever required. Nothing is faster than simply appending to a file. But querying this kind of DB takes longer and longer as the database file grows.
To reduce the query time, we add indices. But adding indices means writes take longer, since you have to update the index before writing to the database. Since reads vastly outnumber writes, though, this is a fair trade-off.
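This is the classic log-plus-index trade-off. A toy version (nothing to do with Twitter’s actual storage engine) might look like this: writes are a single append, and the in-memory index turns reads into a single seek at the cost of a little bookkeeping on every write.

```python
# A toy append-only log with an in-memory index (illustrative only). Each
# write is one append; the index maps a key to the byte offset of its latest
# record so reads become a single seek instead of a full file scan.
import json

class TinyLog:
    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, int] = {}   # key -> byte offset of latest record

    def put(self, key: str, value: dict) -> None:
        record = (json.dumps({"key": key, "value": value}) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()             # append-only: always the end of the file
            f.write(record)
        self.index[key] = offset          # the extra bookkeeping every write pays for

    def get(self, key: str) -> dict | None:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            return json.loads(f.readline())["value"]
```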
Similarly, Twitter’s reads far outnumber its writes. So they came up with a system that serves users’ home timelines better: they precomputed the home timeline of every user and stored it in Redis clusters. As simple as that!
Now, whenever a user tweets, the tweet is inserted into the timeline queue of each of their followers. So if you have 5000 followers, your tweet results in 5000 writes!! Plus one write to the database itself 😅
This sounds like a crazy idea at first, but it makes sense when you think about it. You can now serve millions of users instantly without hitting the disk, which reduces latency dramatically. This process is called fan-out!
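Here’s a minimal sketch of fan-out on write with the redis-py client. The `timeline:{id}` key format and the `get_follower_ids` helper are assumptions of mine, not Twitter’s actual code; the real pipeline adds queues, batching, and replication on top of this.

```python
# A minimal fan-out-on-write sketch using redis-py. Key names and the
# get_follower_ids() helper are placeholders, not Twitter's real system.
import redis

r = redis.Redis()

def get_follower_ids(author_id: int) -> list[int]:
    """Placeholder for the lookup against the social-graph store."""
    raise NotImplementedError

def fan_out_tweet(author_id: int, tweet_id: int) -> None:
    # The single durable write of the tweet itself happens elsewhere; here we
    # do the N cache writes, one per follower, newest tweet first.
    pipe = r.pipeline()
    for follower_id in get_follower_ids(author_id):
        pipe.lpush(f"timeline:{follower_id}", tweet_id)
    pipe.execute()
```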
So how does the Twitter architecture look so far?
- A user visits the homepage of Twitter.
- Twitter looks for the home timeline of the user in one of the Redis clusters.
- Once it is found, it shows it to the user as it is.
- When the user sends a tweet, that tweet is replicated across all the timelines of the user’s followers.
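The read path in this sketch is then just a cache lookup against the same (made-up) `timeline:{id}` keys, with no database involved:

```python
def read_home_timeline(r: redis.Redis, user_id: int, count: int = 50) -> list[int]:
    # Return the newest `count` tweet IDs straight from the precomputed list.
    return [int(tweet_id) for tweet_id in r.lrange(f"timeline:{user_id}", 0, count - 1)]
```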
Some additional details:
- Twitter maintains a graph database that contains all the data regarding who follows whom. When the fan-out occurs, the fan-out service queries this graph DB to determine where to push the tweet.
- This fan-out is replicated across 3 different machines in a data center, i.e. each user’s timeline is stored on 3 separate machines. This way, if one of the machines fails, the others can still serve the traffic.
- The tweets themselves are not stored in the clusters; only the tweet IDs are. The full tweets are fetched while delivering the timeline to the user.
- If you are inactive on Twitter for more than 30 days, your timeline won’t be kept in the Redis cluster. For such users, the timeline is reconstructed from disk when they return and request their home timeline.
There’s a limit to the number of tweets stored per user’s home timeline: only around 800 tweets are kept for each user at a time. This is a design decision they made to keep memory usage under control!
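Folding these details into the earlier sketch: cap each list at roughly 800 entries during fan-out, hydrate the cached IDs into full tweets on read, and rebuild from disk when the cache entry is missing. The `load_tweets` and `rebuild_timeline_from_disk` helpers below are placeholders I’m assuming, not real Twitter services.

```python
TIMELINE_CAP = 800  # the reported per-user cap that keeps memory usage bounded

def fan_out_tweet_capped(r: redis.Redis, author_id: int, tweet_id: int) -> None:
    pipe = r.pipeline()
    for follower_id in get_follower_ids(author_id):
        key = f"timeline:{follower_id}"
        pipe.lpush(key, tweet_id)
        pipe.ltrim(key, 0, TIMELINE_CAP - 1)   # drop everything past ~800 entries
    pipe.execute()

def read_home_timeline_hydrated(r: redis.Redis, user_id: int, count: int = 50):
    ids = r.lrange(f"timeline:{user_id}", 0, count - 1)
    if not ids:
        # Inactive (>30 days) users have no cached timeline: rebuild from disk.
        return rebuild_timeline_from_disk(user_id, count)    # placeholder
    return load_tweets([int(i) for i in ids])                # IDs -> full tweets (placeholder)
```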
Celebrities get special treatment everywhere 💃🏻🕺🏻
The above optimization works for a regular user. But take the example of Lady Gaga, who back then had about 31 million followers.
When Lady Gaga tweets, the fan-out process happens for 31 million users!! And it has to be replicated 3 times!!
The result is that some users might see Lady Gaga’s tweet only 5 minutes after she posted it, because their timelines haven’t been updated yet.
This brings out an undesirable effect: a user who can already see Lady Gaga’s tweet replies to it, and that reply, processed in parallel by the fan-out, lands in the timeline of another user who hasn’t yet received Miss Gaga’s original tweet! Not sure if we can call them headless tweets 🤔
To avoid this, Twitter came up with a hybrid approach. The approach goes something like this:
- If a person with too many followers tweets something, fan-out doesn’t happen for them.
- The tweets of such users are merged into the timeline only when someone requests their home page.
Simple solution, again!
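A hedged sketch of that hybrid read path: the Redis list still covers regular accounts, and tweets from high-follower accounts the user follows are pulled and merged in at request time. The threshold and the lookup helpers below are my own guesses, not Twitter’s published numbers.

```python
FANOUT_THRESHOLD = 1_000_000   # hypothetical cutoff for "too many followers"

def read_hybrid_timeline(r: redis.Redis, user_id: int, count: int = 50):
    # Precomputed part: tweets from regular accounts, fanned out at write time.
    cached = load_tweets([int(i) for i in r.lrange(f"timeline:{user_id}", 0, count - 1)])

    # Pulled part: recent tweets from followed celebrities, fetched at read time.
    celebrity_ids = [
        followee for followee in get_followee_ids(user_id)        # placeholder lookup
        if follower_count(followee) > FANOUT_THRESHOLD            # placeholder lookup
    ]
    pulled = recent_tweets_by_authors(celebrity_ids, count)       # placeholder query

    # Merge both sources by timestamp and keep the newest `count` tweets.
    merged = sorted(cached + pulled, key=lambda t: t["created_at"], reverse=True)
    return merged[:count]
```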
Wrapping up
Studying Twitter’s entire architecture would be very difficult, especially the one powering the giant company right now.
But taking notes from how they solved problems of scale in the past can help you make decisions at your workplace as well. Even if something sounds as ridiculous as storing one tweet in thousands of places, if it works and doesn’t have harmful side effects, it’s a good solution!
Reference:
I write about Software Engineering and how to scale your applications on my weekly newsletter. To get such stories directly in your inbox, subscribe to it! 😃
If you like my content and want to support me to keep me going, consider buying me a coffee ☕️ ☕️