Twitter’s Migration to Google Cloud – An Architectural Insight
This is my second write-up on the infrastructure @Twitter. In the first, I delved into the database technologies leveraged by the social platform to store petabytes of data generated every single day.
Twitter needs no introduction. It is one of the most popular social networks on the globe today, with hundreds of millions of Tweets sent across the platform every single day.
Recently, it moved a part of its workload to Google Cloud. What made them make this shift? What were the primary reasons behind the migration, considering they already had an efficient on-prem infrastructure handling millions of users, and had been running it well for quite some time?
Let’s take a plunge.
Why Move to the Cloud?
Businesses today prefer running their services on the cloud for obvious reasons. They either build services from the ground up with the intent of running them on the cloud, want to move their legacy systems to the cloud, or are already in the process of migrating.
They prefer to shed the additional weight of infrastructure management; after all, managing a scalable, highly available infrastructure is far from trivial.
Cloud is the new workload deployment norm. Evernote, having started with a self-managed on-prem setup, eventually migrated its entire workload to Google Cloud, enabling its developers to focus on bringing new features to market and on customer needs, as opposed to worrying about keeping the machines up.
Migration to the cloud helps with faster capacity provisioning, improved security & disaster recovery. Engineering teams can leverage state-of-the-art cloud technology offerings & cloud-native features like elasticity, availability, scalability & a geographically spread-out infrastructure.
Now, let’s go through the different phases of the whole migration process.
The Research Phase
First up, go through this write-up on why businesses use the cloud and how it differs architecturally from traditional computing. It will give you a good insight into why a business should deploy its workload on the cloud as opposed to an on-prem setup.
One can imagine the effort that goes into managing on-prem infrastructure at Twitter's scale: hundreds of millions of Tweets sent, over a trillion events processed, hundreds of petabytes of data ingested & tens of thousands of jobs run over a dozen clusters every single day. Scalability & high availability are crucial.
Finally, Twitter took the call to migrate its services to the cloud, a move that would eventually help them a great deal in increasing productivity & growing the business.
The engineering team at Twitter started a rigorous cloud evaluation process, weighing the various options available in the industry that could handle something at the scale of Twitter.
They evaluated offerings across several categories: real-time processing, key-value stores, message queues, object stores, databases, general compute & batch compute.
They weighed whether they should re-architect their existing services for the cloud, replace their tech with the cloud vendor's offerings, or just lift & shift the workload from on-prem to the cloud.
They considered several cloud providers & ran functional tests, micro & macro benchmarks, and scale tests on the cloud to get the lay of the land & validate whether a given cloud platform fit their needs.
Micro-benchmarking gave them insight into the individual resources they would come to depend on, for instance, the different cloud storage classes.
Macro-benchmarks & scale testing helped them gauge end-to-end performance.
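The idea behind a micro-benchmark is to isolate and time one resource at a time. Here is a minimal sketch of that principle in Python, using a local sequential file read as a stand-in for whatever storage tier is being measured (the function name and sizes are illustrative, not Twitter's actual test harness):

```python
import os
import tempfile
import time

def measure_read_throughput(size_mb: int = 64) -> float:
    """Write a test file of size_mb megabytes, then time a
    sequential read of it and return throughput in MB/s."""
    chunk = os.urandom(1024 * 1024)  # 1 MB of random bytes
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for _ in range(size_mb):
            f.write(chunk)
        path = f.name
    try:
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1024 * 1024):  # read back in 1 MB chunks
                pass
        elapsed = time.perf_counter() - start
        return size_mb / elapsed
    finally:
        os.remove(path)

print(f"Sequential read: {measure_read_throughput():.0f} MB/s")
```

A real storage-class benchmark would point the same read loop at the remote store and repeat it across object sizes and concurrency levels, but the isolate-one-variable shape stays the same.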
All the tests & iterations took them over a year. Finally, they picked Google Cloud; Google's infrastructure suited their requirements pretty well. The platform provided flexibility in both storage & compute along with a high-speed network.
The Migration Phase
For the migration, a plan, pipeline, timeline & financial projections were prepared with the sole motive of avoiding any sort of disruption to the existing platform functionality.
To cut risk, a divide & conquer approach was chosen: move small parts of the entire workload to the cloud, split things up & learn as they go.
Twitter picked its cold storage data & Hadoop clusters to move to Google Cloud, simply because if anything went south, it would have minimal immediate direct impact on the running services. The risks were comparatively lower.
Twitter moved approx. 300 Petabytes of data to Google Cloud Storage.
To read more about the persistence technologies & Hadoop clusters used at Twitter, do read What databases does Twitter use?
Twitter started using Hadoop to take MySQL backups, but over time the use cases for the tech grew manifold.
Today, Hadoop at Twitter is used to ingest real-time data generated by end users on the platform, to run back-end production jobs & to manage infrequently accessed data.
Hadoop is also used for social graph analysis, API analytics, user engagement prediction, recommendations, ad analytics etc.
When migrating the Hadoop clusters to the cloud, the cold storage clusters were picked while the production Hadoop clusters remained on-prem, for the lower-risk reasons stated above. Also, cold storage stood to benefit the most from the cloud platform's offerings.
CPU utilization in the cold storage cluster was also low in comparison to the other clusters.
The diagram below shows the Hadoop clusters in Twitter's on-prem infrastructure.
Moving Data to Google Cloud
Data is continually produced on the platform & its volume keeps growing with time.
So, picking a chunk of data & physically shipping it to the cloud on hard drives was not a feasible solution. To transfer the data, the engineering team needed to establish a performant network connection between Twitter & Google Cloud.
They used an 800 Gbps redundant direct peering network connection between the two platforms to transfer the data.
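To get a feel for why a connection of that size was needed, here is a back-of-the-envelope calculation for moving roughly 300 PB over an 800 Gbps link, assuming (unrealistically) full, sustained line rate with zero protocol overhead:

```python
# ~300 PB expressed in bits (decimal petabytes: 10**15 bytes)
data_bits = 300e15 * 8

# 800 Gbps aggregate link capacity
link_bps = 800e9

seconds = data_bits / link_bps
days = seconds / 86400
print(f"~{days:.0f} days at theoretical line rate")
```

In practice the transfer shares the link with other traffic and never sustains line rate, so the real migration takes longer; the point is that even at this scale of connectivity, moving hundreds of petabytes is a weeks-to-months affair, not a weekend job.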
In every data centre, dedicated copy clusters were set up with the sole aim of transferring data from Twitter over to Google Cloud.
To process the data on-prem, Twitter uses Hive & Presto queries, which compile into DAG (Directed Acyclic Graph) jobs that run over the clusters.
On the cloud, the same tasks are run using managed services like Cloud Dataflow, Cloud Dataproc & BigQuery.
Google Cloud Dataflow is a fully-managed serverless real-time streaming & batch data processing offering.
Cloud Dataproc is a fully managed cloud service for running Apache Spark & Hadoop clusters in a cost-effective way, providing a comprehensive platform for data processing, analytics & machine learning. BigQuery is a serverless, scalable data warehouse with an in-memory business intelligence engine.
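The notion of a query compiling into a DAG, where each stage can only run once the stages it depends on have finished, can be sketched in a few lines. This is a toy illustration of the concept, not Hive's or Presto's actual planner; the stage names are made up to resemble a scan → filter → join → aggregate pipeline:

```python
from graphlib import TopologicalSorter

# Toy query plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_tweets": set(),
    "scan_users": set(),
    "filter_recent": {"scan_tweets"},
    "join_on_user": {"filter_recent", "scan_users"},
    "aggregate_counts": {"join_on_user"},
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

A real scheduler would additionally run independent stages (here, the two scans) in parallel and retry failed ones, but the dependency-ordering core is exactly this topological sort.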
As you can see in the diagram above, all the data generated by users at Twitter still flows through the platform's on-prem infrastructure & is then replicated over to Google Cloud.
The new architecture averts the need for a dedicated Hadoop cold storage cluster.
Since compute & storage for the Hadoop workload are separate in the new architecture, Twitter has been able to save tens of thousands of underutilized provisioned CPU cores.
For more information on the Hadoop implementation on Google Cloud, do go through the videos below from Google Cloud's YouTube channel.
How @TwitterHadoop Chose Google Cloud (Cloud Next’)
Visualizing Cloud Bigtable Access Patterns at Twitter for Optimizing Analytics (Cloud Next’)
More On the Blog
That's pretty much it on Twitter's migration to Google Cloud. If you liked the article, do share it with your folks.
Follow 8bitmen on social media. You can also subscribe to browser notifications to stay notified of new content on the blog.
I’ll see you in the next write-up.
If you are a developer who finds it hard to cope with the constant changes in technology, is sick & tired of it & is looking for ways to jump off that endless upskilling treadmill while staying relevant & hireable:
You might want to check out my course, DEVELOPER’S ROADMAP TO EXCELLENCE AND FINANCIAL INDEPENDENCE, where I share with you the roadmap and techniques that I follow to keep my sanity in this ever-changing world of software development without killing myself. In it, you’ll find actionable advice and critical points that will enable you to make informed career decisions and accelerate your career at MACH speed.
- Distributed Systems and Scalability Feed #2
- Application Hosting – How PolyHaven Manages 5 Million Page Views and 80TB Traffic a Month for < $400
- The Technology Upskill Mess For Application Developers
- Low Code: The Only Article You Will Ever Need
- No Code: A Profound Insight For Citizen and Professional Developers