YouTube Database – How Does It Store So Many Videos Without Running Out Of Storage Space?
YouTube is the second most popular website on the planet after Google. As of May 2019, more than 500 hours of video content is uploaded to the platform every single minute.
With over 2 billion users, the video-sharing platform generates billions of views, with over 1 billion hours of video watched every single day. Boy! These are just incredible numbers.
This write-up is an insight into the databases used at YouTube and the backend data infrastructure that enables the video platform to store such an insane amount of data, as well as scale with billions of users.
So, here it goes…
YouTube started its journey in 2005. As this venture capital-funded technology startup gained traction, it was acquired by Google in November 2006 for US$1.65 billion.
Before they were acquired by Google, the team consisted of:
2 system admins
2 scalability software architects
2 feature developers
2 network engineers
2. Backend Infrastructure
MySQL is the primary database, powered by Vitess, a database clustering system for the horizontal scaling of MySQL. Memcache is used for caching & Zookeeper for node coordination.
Popular videos are served from the CDN, while less frequently watched videos are fetched from the database.
Every video, at the time of the upload, is given a unique identifier and is processed by a batch job that runs several automated processes such as generating thumbnails, metadata, video transcripts, encoding, setting the monetization status and so on.
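As a rough illustration, the upload flow above might be sketched like this in Python. The ID scheme and the stage names are assumptions made for the example; YouTube's actual pipeline and identifier format are internal:

```python
import secrets

# Hypothetical stage names; the real batch-job stages are internal to YouTube.
PIPELINE = ["generate_thumbnails", "extract_metadata", "generate_transcript",
            "transcode", "set_monetization_status"]

def new_video_id() -> str:
    """Generate an 11-character URL-safe identifier, similar in shape to
    YouTube's public video IDs (the real ID scheme is not public)."""
    return secrets.token_urlsafe(8)[:11]

def process_upload(video_id: str) -> list[str]:
    # Each stage would run as an automated batch job; here we just
    # record which jobs would be kicked off for the upload.
    return [f"{stage}:{video_id}" for stage in PIPELINE]
```

The point is simply that the unique ID is assigned once at upload time and then threaded through every downstream processing job.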
VP9 & H.264/MPEG-4 AVC (Advanced Video Coding) codecs are used for video compression; they encode videos in HD & 4K quality at half the bandwidth required by other codecs.
Dynamic Adaptive Streaming over HTTP protocol is used for the streaming of videos. It’s an adaptive bitrate streaming technique that enables high-quality streaming of videos over the web from conventional HTTP web servers. Via this technique, the content is made available to the viewer at different bit rates. YouTube client automatically adapts the video rendering as per the internet connection speed of the viewer thus cutting down the buffering as much as possible.
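A minimal sketch of how an adaptive-bitrate client could pick a rendition from a bitrate ladder, in the spirit of DASH. The ladder values and the 0.8 headroom factor are illustrative assumptions, not YouTube's actual numbers:

```python
# Illustrative bitrate ladder (kbit/s); actual ladders differ per video.
LADDER = [("144p", 100), ("360p", 700), ("720p", 2500), ("1080p", 5000)]

def pick_rendition(measured_kbps: float, headroom: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within a fraction
    (headroom) of the measured throughput, as a DASH client would."""
    usable = measured_kbps * headroom
    best = LADDER[0]  # fall back to the lowest rendition
    for name, kbps in LADDER:
        if kbps <= usable:
            best = (name, kbps)
    return best[0]
```

A real player re-runs this decision per segment, which is what lets the stream adapt mid-playback as the viewer's connection speed changes.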
I’ve discussed the video transcoding process at YouTube in a separate article – How does YouTube serve high-quality videos with low latency. Check it out.
So, this was a quick insight into the backend tech of the platform. MySQL is the primary database used. Now, let's understand why the engineering team at YouTube felt the need to write Vitess. What were the issues they faced with the original MySQL setup that made them implement an additional framework on top of it?
Zero to Software/Application Architect learning track is a series of four courses that I am writing with an aim to educate you, step by step, on the domain of software architecture & distributed system design. The learning track takes you right from having no knowledge in it to making you a pro in designing large scale distributed systems like YouTube, Netflix, Google Stadia & so on. Check it out.
3. The Need For Vitess
The website started with a single database instance. As it gained traction, to meet the increasing QPS (Queries Per Second) demand, the developers had to horizontally scale the relational database.
3.1 Master-Slave Replication
Replicas were added alongside the master database instance. Read requests were routed to both the master and the replicas in parallel to cut down the load on the master. Adding replicas removed the read bottleneck, increased read throughput & added durability to the system.
The master node handled the write traffic, while both the master and the replica nodes handled the read traffic.
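The read/write routing described above can be sketched as a toy router; the node names and the round-robin read policy are assumptions for illustration only:

```python
import itertools

class ReplicatedDB:
    """Toy request router: writes always go to the master, while reads
    are spread round-robin across the master and the replicas."""
    def __init__(self, master: str, replicas: list[str]):
        self.master = master
        self._read_nodes = itertools.cycle([master, *replicas])

    def route_write(self) -> str:
        return self.master  # only the master accepts writes

    def route_read(self) -> str:
        return next(self._read_nodes)  # reads hit master & replicas
```

The design choice here is the classic one: write throughput is still capped by the single master, but read throughput scales with every replica added.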
Though in this scenario, there was a possibility of getting stale data from a read replica: if a request fetched data from a replica before the master's latest update had been applied to it, the viewer would receive stale information.
At this point, the data between the master and the replica would be inconsistent. The inconsistency in this scenario would be a difference in the view count of a particular video between the master & the replica.
Well, that is completely alright! A viewer won't mind a slight inconsistency in the view count. Right? It's more important that the video gets rendered in their browser.
The data between the master and the replicas would eventually become consistent.
If you are unfamiliar with concepts like eventual consistency, strong consistency & the CAP theorem, or want to understand how data consistency works in a distributed environment, what database you should choose when developing your own app & much more, do check out my Web Application Architecture & Software Architecture 101 course here.
So, the engineers were happy, and so were the viewers. Things went smoothly with the introduction of the replicas.
The site continued gaining popularity and the QPS continued to rise. The master-slave replication strategy now struggled to keep up with the rise in the traffic on the website.
3.2 Sharding
The next strategy was to shard the database. Sharding is one of the ways of scaling a relational database, besides master-slave replication, master-master replication, federation & de-normalization.
Sharding a database is not a trivial process. It increases the system complexity by a significant amount and makes the management harder.
Regardless, the database had to be sharded to meet the increase in QPS. After the developers sharded the database, the data got spread across multiple machines. This increased the write throughput of the system. Now instead of just the single master instance handling the writes, write operations could be done on multiple sharded machines.
Also, for every machine, separate replicas were created for redundancy & throughput.
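Hash-based shard routing of the kind described can be sketched as follows. The shard count and the use of MD5 are illustrative assumptions, not YouTube's actual scheme:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # illustrative count

def shard_for(key: str) -> str:
    """Map a sharding key (e.g. a user or video ID) to a shard via a
    stable hash. A real setup also needs routing metadata and a story
    for resharding when the shard count changes."""
    digest = hashlib.md5(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Because each key deterministically lands on one shard, writes for different keys spread across machines, which is exactly what lifted the single-master write ceiling.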
The platform continued to blow up in popularity, a large amount of data was continually added to the database by the content creators.
To avoid data loss or the service being unavailable due to machine failures or other unforeseen events, it was time to add disaster-management features to the system.
3.3 Disaster Management
Disaster management means having contingencies in place to survive power outages, natural disasters like earthquakes, fires etc. It entails having redundancies in place and the user data backed up in data centers located in different geographical zones across the world. Losing user data or the service being unavailable wasn’t an option.
Having several data centers across the world also helped YouTube reduce the latency of the system as user requests were routed to the nearest data center as opposed to being routed to the origin server located in a different continent.
You can imagine by now how complex the infrastructure had become.
Often, unoptimized full-table-scan queries took down the entire database. The database had to be protected from such bad queries. Also, all the servers needed to be tracked to ensure an efficient service.
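One simple form of such protection is a rewriter that caps unbounded SELECTs with a row limit. This is only a sketch in the spirit of the query protection Vitess provides (its real rewriter is far richer); the default limit here is an assumed value:

```python
def guard_query(sql: str, default_limit: int = 10000) -> str:
    """Append a row limit to SELECTs that don't specify one, so a single
    unbounded full-table scan can't monopolize the database."""
    stmt = sql.strip().rstrip(";")
    if stmt.lower().startswith("select") and "limit" not in stmt.lower():
        stmt += f" LIMIT {default_limit}"
    return stmt
```

Capping rows at the proxy layer means one careless query degrades gracefully instead of taking the whole database down.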
Developers needed a system in place that would abstract the complexities of the system, enable them to address the scalability challenges & manage the system with minimal effort. This led to the development of Vitess.
To educate yourself on software architecture from the right resources, master the art of designing large-scale distributed systems that scale to millions of users & understand what tech companies really look for in a candidate during their system design interviews, read my blog post on mastering system design for your interviews or web startup.
4. Vitess – A Database Clustering System For Horizontal Scaling Of MySQL
Vitess is a database clustering system that runs on top of MySQL and enables it to scale horizontally. It has built-in sharding features that enable developers to scale their database without adding any sharding logic to the application. Something along the lines of what a NoSQL database does.
Vitess architecture (image source)
Vitess also automatically handles failovers and backups. It administers the servers, improves database performance by intelligently rewriting resource-intensive queries & implements caching. Besides YouTube, the framework is also used by other big guns in the industry, such as GitHub, Slack, Square, New Relic & so on.
Vitess really stands out when you cannot let go of your relational database because you need support for ACID transactions & strong consistency & at the same time you also want to scale your relational database on the fly like a NoSQL database.
For an understanding of the pros & cons of relational & NoSQL databases, what type of database fits best for your use case, and whether NoSQL databases are more performant than relational ones, check out my Web Application Architecture & Software Architecture 101 course here.
At YouTube, every MySQL connection had an overhead of 2MB. There was a computational cost associated with every connection, and additional RAM had to be added as the number of connections grew.
Vitess helped them manage these connections at low resource costs via its connection pooling feature built on Go programming language’s concurrency support. It uses Zookeeper to manage the cluster & to keep it up to date.
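A minimal sketch of connection pooling, written in Python for illustration (Vitess's pool is built on Go's goroutines, not Python). The pool size and the connection factory are illustrative assumptions:

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: a bounded set of connections is created
    up front and reused, capping the per-connection memory overhead
    instead of paying it for every client request."""
    def __init__(self, factory, size: int = 4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # pre-create all connections

    def acquire(self):
        return self._pool.get()   # blocks when all connections are in use

    def release(self, conn):
        self._pool.put(conn)      # hand the connection back for reuse
```

With a 2MB cost per MySQL connection, bounding the pool at, say, a few hundred connections bounds the memory bill no matter how many clients show up.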
5. Deployment On Cloud
Vitess is cloud-native and suits cloud deployments well, as capacity is added to the database incrementally, just like it happens in the cloud. It can run as a Kubernetes-aware cloud-native distributed database.
At YouTube, Vitess runs in a containerized environment with Kubernetes as the container orchestration tool.
In today’s computing era, every large-scale service runs on cloud in a distributed environment. Running a service on the cloud has numerous upsides.
6. Google Cloud Platform
Google Cloud Platform is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube.
Every large-scale online service has a polyglot persistence architecture, as one data model, be it relational or NoSQL, isn't equipped to handle all the use cases of a service.
During my research for this article, I couldn’t find the list of specific Google Cloud databases that YouTube uses, but I am pretty sure it would be leveraging GCP’s unique offerings such as Google Cloud Spanner, Cloud SQL, Cloud Datastore, Memorystore & many more to run different features of the service.
YouTube achieves low-latency, low-cost content delivery using Google's global network. It leverages globally distributed edge points of presence (PoPs) to enable its clients to fetch data a lot faster as opposed to fetching it from the origin server.
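Latency-based PoP selection can be sketched as picking the edge location with the lowest measured round-trip time. The PoP names and RTT figures below are made up for the example, and real edge routing also factors in anycast and load:

```python
# Hypothetical measured round-trip times (ms) from one client to edge PoPs.
POP_RTT_MS = {"us-east": 12.0, "eu-west": 95.0, "asia-south": 180.0}

def nearest_pop(rtt_ms: dict[str, float]) -> str:
    """Pick the point of presence with the lowest round-trip time; this is
    the intuition behind routing a viewer to the nearest edge location."""
    return min(rtt_ms, key=rtt_ms.get)
```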
So, up to this point, I've talked about the databases, frameworks & tech used at YouTube. Time to talk about storage.
How does YouTube store such an insane amount of data (500 hours of video content uploaded every single minute)?
7. Data Storage – How Does YouTube Store Such An Insane Amount Of Data?
The videos are stored on hard drives in warehouse-scale Google data centers. The data is managed by the Google File System & BigTable.
GFS (Google File System) is a distributed file system developed by Google to manage large-scale data in a distributed environment.
BigTable is a low latency distributed data storage system built on Google File System to deal with petabyte-scale data spread over thousands of machines. It’s used by over 60 Google products.
To understand how workloads are deployed in the cloud across multiple availability zones, how petabyte-scale data is stored & stuff, in detail; check out my Cloud Computing 101 – Master the Fundamentals course.
So, the videos are stored on hard drives. Relationships, metadata, user preferences, profile information, account settings, the data required to fetch a video from storage & so on go into MySQL.
7.1 Plug & Play Commodity Servers
Google data centers have homogeneous hardware, & the software is built in-house to manage clusters of thousands of individual servers.
The servers deployed to augment the storage capacity of a data center are commodity servers, also known as commercial off-the-shelf servers. These are inexpensive, widely available servers that can be bought in large numbers & replaced or configured alongside the same hardware in the data center with minimal cost & effort.
As the demand for additional storage rises, new commodity servers are plugged into the system.
Commodity servers are typically replaced as opposed to being repaired. Since they are not custom-built, their use enables the business to cut down infrastructure costs by quite an extent compared to running custom-built servers.
7.2 Storage Disks Designed For Data Centers
YouTube needs more than a petabyte of new storage every single day. Spinning hard disk drives are the primary storage medium due to their low costs and reliability.
SSDs (Solid State Drives) are more performant than spinning disks as they are semiconductor-based, but large-scale use of SSDs is not economical. They are pretty expensive and also tend to gradually lose data over time. That makes them not so suitable for archival data storage.
Also, Google is working on a new line of disks that are designed for large-scale data centers.
There are five key metrics to judge the quality of hardware built for data storage:
- The hardware should have the ability to support a high rate of input-output operations per second.
- It should meet the security standards laid out by the organization.
- It should have a higher storage capacity as opposed to the regular storage hardware.
- The hardware acquisition cost, power cost and the maintenance overheads should be acceptable.
- The disks should be reliable and have consistent latency.
Read – Best resources to learn software architecture & system design
I’ve put together a list of resources (online courses + books) that I believe are super helpful in building a solid foundation in software architecture & designing large scale distributed systems like Facebook, YouTube, Gmail, Uber & so on.
Subscribe to the newsletter to stay notified of the new posts.
If you liked the article, share it with your folks. You can follow scaleyourapp.com on social media, links below, to stay notified of the new content published. I am Shivang, you can read about me here!
More On The Blog
- Distributed Systems, Scalability & System Design #1 – Heroku Client Rate Throttling
- Zero to Software/Application Architect – Learning Track
- Java Full Stack Developer – The Complete Roadmap – Part 2 – Let’s Talk
- Java Full Stack Developer – The Complete Roadmap – Part 1 – Let’s Talk
- Best Handpicked Resources To Learn Software Architecture, Distributed Systems & System Design