This write-up is a comprehensive insight into the persistence layer of Facebook and the first of the distributed systems write-ups. It takes a deep dive into questions like what database does Facebook use? Is it a NoSQL or a SQL DB? Is it just one or multiple databases? Or a much more complex polyglot database architecture? How is Facebook’s backend handling billions of data fetch operations every single day?

So, without further ado.
Let’s get started.

Distributed Systems
For a complete list of similar articles on distributed systems and real-world architectures here you go


1. What Database Does Facebook Use?

If you need a quick answer here it is:

MySQL is the primary database used by Facebook for storing all the social data. They started with the InnoDB MySQL database engine and then wrote MyRocksDB, which was eventually used as the MySQL Database engine.

Memcache sits in front of MySQL as a cache.

To manage BigData Facebook leverages Apache Hadoop, HBase, Hive, Apache Thrift and PrestoDB. All these are used for data ingestion, warehousing and running analytics.

Apache Cassandra is used for the inbox search

Beringei & Gorilla, high-performance time-series storage engines are used for infrastructure monitoring

LogDevice, a distributed data store is used for storing logs

Now let’s delve into the specifics.


1.1 PolyGlot Persistence Architecture

Facebook, today, is not a monolithic architecture. It might have been at one point in time like LinkedIn was but definitely not today.

The social network as a whole consists of several different loosely coupled components plugged in together like Lego blocks. For instance, photo sharing, messenger, social graph, user post, etc. are all different loosely coupled microservices running in conjunction with each other. And every microservice has a separate persistence layer to keep things easy to manage.

scaleyourapp.com Microservices architecture

Polyglot persistence architecture has several upsides. Different databases with different data models can be leveraged to implement different use cases. The system is more highly available, and easy to scale.


1.2 What is a Polyglot Persistence System?

Put in simple words, Polyglot Persistence means using different databases having unique features and data models for implementing different use cases of the system.

For instance, Cassandra, Memcache would serve different persistence requirements in contrast to a traditional MySQL DB.

If we have ACID requirements like for a financial transaction, MySQL would fit best. On the other hand, when we need fast data access, we would pick Memcache or when we are okay with data denormalization and it being eventually consistent but need a fast highly available database, a NoSQL solution would fit best.

Polyglot persistence

Web Application and Software Architecture 101
Master the Fundamentals Of Web Architecture and Large Scale Systems

> Master the concepts involved in designing the architecture of a web application.
> Learn to pick the right architecture and the technology stack for a use case. 
> Stand out amongst your peers with a clear understanding of software architecture.

Check out the course here

Moving on…

So, we know that no technology is a silver bullet. For this reason, Facebook leverages different persistence technologies to fulfill its different persistence requirements.


1.3 Does Facebook Use A Relational Database System?

MySQL

This is the primary database that Facebook uses with different engines. Different engines? I’ll get to that.

Facebook leverages a social graph to track and manage all the user events on the portal such as who liked whose post. Mutual friends. Which of your friends already ate at the restaurant you are visiting and so on. And this social graph is powered by MySQL.

Initially, the Facebook engineering team started with MySQL InnoDB engine to persist social data. Data took too much storage space despite being compressed. More space usage meant more hardware requirements which naturally spiked the data storage costs.

scaleyourapp.com Fb Inno db data storage chart
Image source: Facebook

 

InnoDB MySQL Storage Engine

InnoDB is the default MySQL storage engine that provides high reliability and performance.

InnoDB Architecture

scaleyourapp.com InnoDB Architecture
Image source: InnoDB

MyRocks MySQL Storage Engine Written by Facebook

To deal with the space issues. The engineering team at Facebook wrote a new MySQL database engine MyRocks which reduced the space usage by 50% also helped improve the write efficiency.

Over time Facebook migrated its user-facing database engine from InnoDB to MyRocks. The migration didn’t have much difficulty since just the DB engines changed and the core tech MySQL was the same.

scaleyourapp.com Facebook MyRocks DB
Image source: Facebook

After the migration, for a long while, the engineering team ran data consistency verification checks to check if everything went smooth.

Several benchmark processes were run to evaluate the DB performance and the results stated that MyRocks instance size turned out to be 3.5 times smaller than the InnoDB instance uncompressed and 2 times smaller than InnoDB instance compressed.

WebScale SQL

MySQL is the most popular persistence technology ever and is naturally deployed by big guns.

WebScale SQL is a collaboration amongst engineers from several different companies such as Google, Twitter, LinkedIn, Alibaba running MySQL in production at scale to build and enhance MySQL features that are required to run in large-scale production environments.

Facebook has one of the largest MySQL deployments in the world. And it shares a common WebScale SQL development codebase with the other companies.

The engineering team is Facebook is preparing to move its production tested versions of table, user and compression statistics into WebScaleSQL.


2. RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

Initially, Facebook wrote an embeddable persistent key-value store for fast storage called RocksDB. Which being a key-value store had some advantages over MySQL. RocksDB was inspired by LevelDB a datastore written by Google.

It was sure fast but did not support replication or an SQL layer and the Facebook engineering team wanted those MySQL features in RocksDB. Eventually, they built MyRocks, an open-source project that had RocksDB as a MySQL storage engine.

RocksDB fits best when we need to store multiple terabytes of data in one single database.

Some of the typical use cases for RocksDB:

1. Implementing a message queue that supports a large number of inserts & deletes.
2. Spam detection where you require fast access to your dataset.
3. A graph search query that needs to scan a data set in real-time.


3. Memcache – Distributed Memory Caching System

Memcache is being used at Facebook right from the start. It sits in front of the database, acts as a cache, and intercepts all the data requests bolting towards the database.

Memcache helps reduce the request latency by a large extent eventually providing a smooth user experience. It also powers the Facebook social graph having trillions of objects and connections, growing every moment.

Memcache is a distributed memory caching system, used by big guns in the industry such as Google Cloud.

Facebook Caching Model

When a user updates the value of an object, the new value is written to the database and the old value is deleted from Memcache. The next time user requests that object, the updated value is fetched from the database and written to Memcache. Now after this for every request, the value is served from Memcache until it is modified.

This flow appears pretty solid until the database and the cache are deployed in a distributed environment. Now eventual consistency comes into effect.

The instances of an app are geographically distributed. When one instance of a distributed database is updated, say in Asia, it takes a while for the changes to cascade to all of the instances of the database running globally.

Now right at a point when the value of an object is updated in Asia, a person requesting that object in America will receive the old value from the cache.

This is typically a tradeoff between high availability and data consistency.

To learn more about strong and eventual consistency, CAP theorem and caching strategies, check out my web architecture course here.


4. How Does Facebook Manage Big Data?

4.1 Apache Hadoop

Well, this shouldn’t come as a surprise, Facebook has an insane amount of data that grows every moment. And they have an infrastructure in place to manage such an ocean of data.

Read this article on data ingestion to understand why it is super important for businesses to manage & make sense of large amounts of data?

Apache Hadoop is the ideal open-source utility to manage big data & Facebook uses it for running analytics, distributed storage & for storing MySQL database backups.

Besides Hadoop, there are also other tools like Apache Hive, HBase, Apache Thrift that are used for data processing

Facebook has open-sourced the exact versions of Hadoop which they run in production. They have possibly the biggest implementation of the Hadoop cluster in the world. Processing approx. 2 petabyte of data per day in multiple clusters at different data centres.

Facebook messages use a distributed database called Apache HBase to stream data to Hadoop clusters.
Another use case is collecting user activity logs in real-time in Hadoop clusters.


4.2 Apache HBase Deployment at Facebook

HBase is an open-source, distributed database, non-relational in nature, inspired by Google’s BigTable. It is written in Java.

Facebook Messaging Component originally used HBase, running on top of HDFS. It was chosen by the engineering team due to the high write throughput & low latency it provided. The other features, it being a distributed project, included horizontal scalability, strong consistency & high availability. Now the messenger service uses RocksDB to store user messages.

HBase is also used in production by other services such as the internal monitoring system, search indexing, streaming data analysis & data scraping.

scaleyourapp.com HBase At Facebook
Image source: Facebook

Migration Of Messenger Storage From HBase To RocksDB

The migration of the messenger service database from HBase to RocksDB enabled Facebook to leverage flash memory to serve messages to its users as opposed to serving messages from the spinning hard disks. Also, the replication topology of MySQL is more compatible with the way Facebook data centers operate in production. This enabled the service to be more available and have better disaster recovery capabilities.


4.3 Apache Cassandra – A Distributed Wide-Column Store

Apache Cassandra is a distributed wide-column store built in house at Facebook for the Inbox search system. Cassandra was written to manage structured data & scale to a very large size across multiple servers with no single point of failure.

The project runs on top of an infrastructure of hundreds of nodes spread across many data centres. Cassandra is built to maintain a persistent state in case of node failures. Being distributed features like scalability, high performance, high availability are inherent.


4.4 Apache Hive – Data Warehousing, Query & Analytics

Apache Hive is a data warehousing software project built on top of Hadoop for running data query & analytics.

At Facebook, it is used to run data analytics on petabytes of data. The analysis is used to gain an insight into the user behaviour, it helps in writing new component & services, & understanding user behaviour for Facebook Ad Network.

Hive inside Facebook is used to convert SQL queries to a sequence of map-reduce jobs that are then executed on Hadoop. Writing programmable interfaces & implementations of common data formats & types, to store metadata etc.


5. Presto DB – A High Performing Distributed Relational Database

PrestoDB is an open-source performant, distributed RDBMS primarily written for running SQL queries against massive amounts, like petabytes of data. It’s an SQL query engine for running analytics queries. A single Presto query can combine data from multiple sources, enabling analytics across the organization’s system.

The DB has a rich set of capabilities enabling data engineers, scientists, business analysts process Tera to Petabytes of data.

Facebook uses PrestoDB to process data via a massive batch pipeline workload in their Hive warehouse. It also helps in running custom analytics with low latency & high throughput. The project has also been adopted by other big guns such as Netflix, Walmart, Comcast etc.

The below diagram shows the system architecture of Presto

scaleyourapp.com PrestoDB Architecture
Image source: Facebook

The client sends SQL query to the Presto co-ordinator. The coordinator then parses, analyzes & plans the query execution. The scheduler assigns the work to the nodes located closest to the data & monitors the processes.

The data is then pulled back by the client at the output stage for results. The entire system is written in Java for speed. Also, it makes it really easy to integrate with the rest of the data infrastructure as they too are written in Java.

Presto connectors are also written to connect with different data sources.


6. Beringei: A High-Performance Time Series Storage Engine

Beringei is a time series storage engine & a component of the monitoring infrastructure at Facebook. The monitoring infrastructure helps in detecting the issues & anomalies as they arise in real-time.

Facebook uses the storage engine to store system measurements such as product stats like how many messages are sent per minute, the service stats, for instance, the rate of queries hitting the cache vs the MySQL database. Also, the system stats like the CPU, memory & network usage.

All the data goes into the time series storage engine & is available on dashboards for further analysis. Ideally in the industry Grafana is used to create custom dashboards for running analytics.


7. Gorilla: An In-Memory Time Series Database

Gorilla is Facebook’s in-memory time series database primarily used in the monitoring & analytics infrastructure. It is intelligent enough to handle failures ranging from a single node to entire regions with little to no operational overhead. The below figure shows how gorilla fits in the analytics infrastructure.

Since deployment Gorilla has almost doubled in size twice in the 18-month period without much operational effort which shows the system is pretty scalable. It acts as a write-through cache for monitoring data gathered across all of Facebook’s systems. Gorilla reduced Facebook’s production query latency by over 70x when compared with the previous stats.


8. LogDevice: A Distributed DataStore For Logs

Logs are the primary way to track the bugs occurring in production, they help in understanding the context & writing a fix. No system can run without logs. And a system of the size of Facebook where so many components are plugged in together generates a crazy amount of logs.

To store & manage all these logs Facebook uses a distributed data store for logs called LogDevice.

It’s a scalable and fault-tolerant distributed log system. In comparison to a file system which stores data as files, LogDevice stores data as logs. The logs are record-oriented, append-only & trimmable. The project has been written from ground up to serve multiple types of logs with high reliability & efficiency at scale.

The kind of workloads supported by LogDevice are event logging, stream processing, ML training pipelines, transaction logging, replicated state machines etc.


9. Conclusion

Folks, this is pretty much it. I did quite a bit of research on the persistence layer of Facebook & I think I’ve covered all the primary databases deployed at Facebook.

Also, this article will be continually updated as Facebook’s systems evolve. I will also cover the architecture & flows of different components of Facebook such as the messenger, news feed etc. in separate articles.

Recommended Read: Master System Design For Your Interviews Or Your Web Startup


Handpicked Resources to Learn Software Architecture and Large Scale Distributed Systems Design
I’ve put together a list of resources (online courses + books) that I believe are super helpful in building a solid foundation in software architecture and designing large-scale distributed systems like Facebook, YouTube, Gmail, Uber, and so on.  Check it out.


Subscribe to the newsletter to stay notified of the new posts.

Your subscription could not be saved. Please try again.
Please check your inbox to confirm the subscription. Double-check in your spam. Please help me out if it’s there, move it to your inbox.

Subscribe to my newsletter

Get new content on real-world engineering challenges and more in your inbox every month by subscribing to my newsletter.



If you liked the article, share it on the web. You can follow scaleyourapp.com on social media, links below, to stay notified of the new content published. I am Shivang, you can read about me here!