Distributed Data Processing 101 – The Only Guide You’ll Ever Need
This write-up is an in-depth look at distributed data processing. It covers the frequently asked questions about it: What is it? How is it different from centralized data processing? What are its pros & cons? What are the various approaches & architectures involved? What are the popular technologies & frameworks used in the industry for processing massive amounts of data across several nodes running in a cluster?
So, without any further ado, let’s get on with it.
Before delving into the distributed part of it, let’s have a quick look at what data processing is.
1. What is Data Processing?
Data processing is ingesting massive amounts of data into the system from several different sources such as IoT devices, social platforms, satellites, wireless networks, software logs etc. & running business logic/algorithms on it to extract meaningful information.
Running algorithms on the data & extracting information from it is also known as Data Analytics.
Data analytics helps businesses use the information extracted from raw, unstructured & semi-structured data at terabyte & petabyte scale to create better products, understand what their customers want, understand their usage patterns, & subsequently evolve their service or product.
There are different stages & system architectures involved in the entire data processing process. I’ll cover all that in this write-up.
But before that, let’s look into what distributed data processing is.
2. What is Distributed Data Processing? How Is It Different from Centralized Data Processing?
Distributed data processing is distributing a massive amount of data across several different nodes running in a cluster for processing.
All the nodes execute the tasks allotted to them in parallel & work in conjunction with each other, connected by a network. The entire set-up is scalable & highly available.
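To make the idea of spreading data across nodes a bit more concrete, here is a minimal sketch in Python. It assumes a simple hash-based scheme; the node names & the helpers `node_for_key` & `partition_records` are hypothetical, & real frameworks ship their own partitioners.

```python
import hashlib
from collections import defaultdict

def node_for_key(key, nodes):
    """Pick a node for a key using a stable hash (an assumed, simple scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def partition_records(records, nodes):
    """Group (key, value) records by the node that should process them."""
    assignment = defaultdict(list)
    for key, value in records:
        assignment[node_for_key(key, nodes)].append((key, value))
    return dict(assignment)

nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
records = [("user-1", 10), ("user-2", 20), ("user-1", 5), ("user-3", 7)]
assignment = partition_records(records, nodes)

for node, items in sorted(assignment.items()):
    print(node, items)
```

Because the hash is stable, all records with the same key land on the same node, which is what lets each node work on its share independently.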
Why Process Data in a Distributed Environment? What Are the Upsides?
Processing data in a distributed environment helps accomplish the task in significantly less time than on a centralized data processing system, because the task is shared by a number of resources/machines & executed in parallel instead of being run sequentially, one job after another in a queue.
Since the data is processed in less time, it is cost-effective for businesses & helps them move fast.
Data is made redundant & replicated across the cluster to avoid any sort of data loss.
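The split, process in parallel & merge pattern behind the speed-up can be sketched roughly as below. This is only an illustration under stated assumptions: threads on one machine stand in for the machines of a cluster, & `split_into_chunks`, `process_chunk` & `process_in_parallel` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks roughly equal contiguous parts."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(chunk):
    """Stand-in for the per-node business logic: a sum of squares."""
    return sum(x * x for x in chunk)

def process_in_parallel(items, n_workers=4):
    """Hand one chunk to each worker, then merge the partial results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)

data = list(range(1, 101))
total = process_in_parallel(data)
print(total)  # same value as sum(x * x for x in data)
```

Threads in CPython will not actually speed up CPU-bound work like this; the point is only the shape of the computation, where each worker handles its own chunk & the partial results are combined at the end.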
3. How Does Distributed Data Processing Work?
In a distributed data processing system a massive amount of data flows into the system from several different sources. This process of data flow is known as data ingestion.
Once the data streams in, different layers in the system architecture break down the entire processing into several different parts.
Let’s have a quick look at what they are:
Data Collection & Preparation Layer
This layer takes care of collecting data from different external sources & preparing it to be processed by the system.
When the data streams in, it has no standard structure. It is raw, unstructured or semi-structured in nature.
It may be a blob of text, audio, video, an image, tax return forms, insurance forms, medical bills etc.
The task of the data preparation layer is to convert the data into a consistent standard format & to classify it as per the business logic so that it can be processed by the system.
The layer is designed to do all this automatically, without any sort of human intervention.
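As a rough sketch of what converting data into a consistent standard format can look like, the Python snippet below normalizes three differently shaped inputs into one record shape. The target fields (`source`, `user`, `text`) & the `normalize` helper are assumptions made for illustration only.

```python
import json

def normalize(raw):
    """Convert a raw input (JSON string, comma-separated line or dict)
    into one standard record shape: {"source", "user", "text"}."""
    if isinstance(raw, dict):
        record = raw
    elif raw.lstrip().startswith("{"):
        record = json.loads(raw)
    else:
        # Assumed line layout: source,user,free text
        source, user, text = raw.split(",", 2)
        record = {"source": source, "user": user, "text": text}
    return {
        "source": str(record.get("source", "unknown")).strip().lower(),
        "user": str(record.get("user", "")).strip(),
        "text": str(record.get("text", "")).strip(),
    }

raw_inputs = [
    '{"source": "Twitter", "user": "alice", "text": "Great match!"}',
    "logs, bob , disk usage at 91%",
    {"source": "IoT", "user": "sensor-7", "text": " 21.5C "},
]
prepared = [normalize(raw) for raw in raw_inputs]

for record in prepared:
    print(record)
```

Once every record has the same shape, the downstream layers can classify & process it without caring where it came from.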
Data Security Layer
Moving data is vulnerable to security breaches. The role of the data security layer is to ensure that the data in transit is secure by monitoring it throughout & applying security protocols such as encryption.
Data Storage Layer
Once the data streams in, it has to be persisted. There are different approaches to doing this.
If the analytics is run on streaming data in real-time, in-memory distributed caches are used to store & manage the data.
On the other hand, if the data is processed in the traditional way, like batch processing, distributed databases built for handling big data are used to store it.
Data Processing Layer
This layer contains the logic which is the real deal; it is responsible for processing the data.
The layer runs business logic on the data to extract meaningful information from it. Machine learning & predictive, descriptive & decision modelling are primarily used for this.
Data Visualization Layer
All the information extracted is sent to the data visualization layer, which typically contains browser-based dashboards that display the information in the form of graphs, charts, infographics etc.
Kibana is one good example of a data visualization tool that is pretty popular in the industry.
4. What Are the Types of Distributed Data Processing?
There are primarily two types of it: batch processing & real-time streaming data processing.
Batch Processing
Batch processing is the traditional data processing technique where data is collected into chunks (batches) & each batch is processed as a unit. The processing is either scheduled for a certain time of day, happens at regular intervals, or is triggered ad hoc, but it is not real-time.
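A batch job can be sketched in a few lines of Python. This is a toy illustration, not how any particular framework does it; `batches` & `run_batch_job` are hypothetical helpers & the per-batch logic is just a sum.

```python
from itertools import islice

def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def run_batch_job(records, batch_size=3):
    """Process the records batch by batch & return one total per batch."""
    return [sum(batch) for batch in batches(records, batch_size)]

batch_totals = run_batch_job([5, 1, 4, 2, 8, 3, 7], batch_size=3)
print(batch_totals)  # [10, 13, 7]
```

In a real system the call to `run_batch_job` would be kicked off by a scheduler at a fixed time or interval rather than run inline like this.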
Real-time Streaming Data Processing
In this type of data processing, data is processed in real-time as it streams in. Analytics is run on the data to get insights from it.
A good use case of this is getting insights from sports data. As the game goes on, the data ingested from social media & other sources is analyzed in real-time to figure out the viewers’ sentiments, players’ stats, predictions etc.
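To contrast this with batching, here is a small Python sketch where every incoming event updates the result right away. The word lists & the `classify` & `running_sentiment` helpers are made up for illustration; real sentiment analysis is far more involved.

```python
from collections import Counter

# Assumed toy word lists, for illustration only.
POSITIVE = {"great", "goal", "win"}
NEGATIVE = {"bad", "miss", "foul"}

def classify(text):
    """Very naive sentiment: look for known positive or negative words."""
    words = set(text.lower().split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

def running_sentiment(events):
    """Yield the updated sentiment counts after each incoming event."""
    counts = Counter()
    for text in events:
        counts[classify(text)] += 1
        yield dict(counts)

stream = iter(["Great goal", "bad foul", "half time", "what a win"])
snapshots = list(running_sentiment(stream))
print(snapshots[-1])
```

The generator never waits for the whole data set; a fresh snapshot of the insights is available after every single event.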
Up next, let’s talk about the technologies involved in both the data processing types.
5. What Are the Technologies Involved in Distributed Big Data Processing?
MapReduce – Apache Hadoop
MapReduce is a programming model for managing distributed data processing across several different machines in a cluster: distributing tasks to the machines, running work in parallel, & managing all the communication & data transfer between different parts of the system.
The Map part of the programming model involves transforming the input into key-value pairs, which the framework then groups & sorts by key. The Reduce part involves summarizing the values grouped under each key.
The most popular open source implementation of the MapReduce programming model is Apache Hadoop.
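The classic way to get a feel for the model is a word count. The sketch below runs on a single machine in plain Python & does not use the Hadoop API; the function names are my own. Map emits key-value pairs, a shuffle step groups them by key, & Reduce summarizes each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: summarize the values collected for one key."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))

counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls run on different machines & the shuffle moves data over the network, but the three steps stay the same.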
Apache Spark
Apache Spark is an open-source cluster computing framework. It provides high performance for both batch & real-time stream processing.
It can work with diverse data sources & facilitates parallel execution of work in a cluster.
Spark needs a cluster manager & a distributed storage system to run on a cluster. It ships with its own standalone cluster manager & can also run on managers such as Hadoop YARN. The cluster manager allocates resources & coordinates the nodes running together in the cluster, whereas the distributed storage holds the big data.
Spark integrates with distributed data stores like Cassandra, HDFS, MapR File System, Amazon S3 etc.
Apache Storm
Apache Storm is a distributed stream processing framework. In the industry, it is primarily used for processing massive amounts of streaming data.
It has several different use cases such as real-time analytics, machine learning, distributed remote procedure calls etc.
Apache Kafka
Apache Kafka is an open-source distributed stream processing & messaging platform. It’s written in Java & Scala & was originally developed at LinkedIn.
The storage layer of Kafka is a distributed, scalable pub/sub message queue. It lets applications read & write streams of data like a messaging system.
Kafka is used in the industry to build real-time features such as notification platforms, & for managing streams of massive amounts of data, monitoring website activity & metrics, messaging & log aggregation.
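The pub/sub idea can be sketched with a toy in-memory topic in Python. To be clear, this is not the Kafka client API; the `Topic` class & its methods are hypothetical & only illustrate an append-only log that several consumers read at their own pace.

```python
class Topic:
    """Toy in-memory topic: an append-only log plus one read offset per consumer.
    This is NOT the Kafka API, only an illustration of the idea."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def publish(self, message):
        """Append a message & return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next unread messages for this consumer & advance its offset."""
        start = self._offsets.get(consumer, 0)
        messages = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(messages)
        return messages

topic = Topic()
for event in ["page_view", "click", "signup"]:
    topic.publish(event)

first = topic.poll("analytics", max_messages=2)
second = topic.poll("analytics")
other = topic.poll("notifications")
print(first, second, other)
```

Each consumer keeps its own position in the log, so the `notifications` consumer still sees all three events even though `analytics` has already read them.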
Hadoop is preferred for batch processing of data whereas Spark, Kafka & Storm are preferred for processing real-time streaming data.
6. What Are the Architectures Involved in Distributed Data Processing?
There are two popular architectures involved in distributed big data processing: Lambda & Kappa.
Lambda Architecture
Lambda is a distributed data processing architecture which leverages both the batch & the real-time streaming data processing approaches to tackle the latency issues arising out of the batch processing approach.
It joins the results from both the approaches before presenting them to the end user.
Batch processing does take time, considering the massive amount of data businesses have today, but the accuracy of the approach is high & the results are comprehensive.
On the contrary, real-time streaming data processing provides quick access to insights. In this scenario, the analytics is run over a small portion of the data, so the results are not as accurate & comprehensive as those of the batch approach.
Lambda architecture makes the most of the two approaches. The architecture typically has three layers: the Batch layer, the Speed layer & the Serving layer.
The Batch layer deals with the results acquired via batch processing the data. The Speed layer gets data from the real-time streaming data processing, & the Serving layer combines the results obtained from both the Batch & the Speed layers.
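The three Lambda layers can be sketched in Python as below. This is a deliberately tiny illustration; the sample events & the `count_by_page` & `serving_layer` helpers are assumptions, not part of any framework.

```python
def count_by_page(events):
    """Count page views; used here by both the batch & the speed layer."""
    counts = {}
    for page in events:
        counts[page] = counts.get(page, 0) + 1
    return counts

def serving_layer(batch_view, speed_view):
    """Merge the precomputed batch view with the recent real-time view."""
    merged = dict(batch_view)
    for page, count in speed_view.items():
        merged[page] = merged.get(page, 0) + count
    return merged

# Assumed sample data: events already covered by the last batch run,
# & events that arrived after it.
historical_events = ["home", "home", "pricing"]
recent_events = ["home", "docs"]

batch_view = count_by_page(historical_events)   # Batch layer
speed_view = count_by_page(recent_events)       # Speed layer
result = serving_layer(batch_view, speed_view)  # Serving layer
print(result)  # {'home': 3, 'pricing': 1, 'docs': 1}
```

The batch view is complete but stale, the speed view is fresh but partial, & the serving layer stitches the two together for the end user.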
Kappa Architecture
In the Kappa architecture, all the data flows through a single data streaming pipeline, as opposed to the Lambda architecture which has different data streaming layers that converge into one.
The architecture flows the data of both real-time & batch processing through a single streaming pipeline, reducing complexity since there are no separate layers to manage for processing data.
Kappa contains only two layers: Speed, which is the stream processing layer, & Serving, which is the final layer.
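A rough way to picture Kappa in Python: there is one event log & one piece of processing logic, & a batch-style result is obtained simply by replaying the log through that same logic. The names `stream_processor` & `run_pipeline` are hypothetical.

```python
def stream_processor(view, event):
    """The single piece of processing logic: update a count per page."""
    view[event] = view.get(event, 0) + 1
    return view

def run_pipeline(log, from_offset=0, view=None):
    """Feed events from the log, starting at from_offset, through the processor."""
    view = {} if view is None else view
    for event in log[from_offset:]:
        view = stream_processor(view, event)
    return view

log = ["home", "pricing", "home"]
live_view = run_pipeline(log)  # serving view built from the stream so far

log.append("docs")  # a new event arrives
live_view = run_pipeline(log, from_offset=3, view=live_view)

replayed_view = run_pipeline(log)  # batch-style result: replay the whole log
print(live_view == replayed_view)  # True
```

Since the same code handles both the live events & the full replay, there is only one pipeline to build & maintain.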
Kappa is not a replacement for Lambda. Both the architectures have their use cases.
Kappa is preferred if the batch and the streaming analytics results are fairly identical in a system. Lambda is preferred if they are not.
Both the architectures can be implemented using the distributed data processing technologies I’ve talked about above.
7. What Are the Pros & Cons of Distributed Data Processing?
This section covers the pros & cons of distributed data processing, though I don’t consider the cons as cons. Rather, they are the trade-offs of working with scalable distributed systems.
Let’s look into them.
Pros
Distributed data processing facilitates scalability, high availability, fault tolerance, replication & redundancy, which are typically not available in centralized data processing systems.
Parallel distribution of work facilitates faster execution.
Enforcing security, authentication & authorization workflows becomes easier as the system is more loosely coupled.
Cons
Setting up & working with a distributed system is complex. Well, that’s expected with so many nodes working in conjunction with each other & maintaining a consistent shared state.
The management of distributed systems is complex. Since the machines are distributed, it entails additional network latency which engineering teams have to deal with.
Strong consistency of data is hard to maintain when everything is so distributed.
I am Shivang, the author of this writeup.