Select Page

Comparing Hadoop and SQL

If you have been familiar to working with conventional SQL-databases, hearing someone talk about Hadoop can sound like a crazy mess. In this blog, we are going to try and simplify three of the many differences between conventional SQL sources and Hadoop. Which hopefully will provide context as to where each model is used.

Comparing Hadoop and SQL

Schema on Write vs. Schema on Read

The first difference we want to point-out is schema on-write vs. schema on-read. When we transfer data from SQL-database A to SQL-database B, and before we write the data to database B we need to have some information in hand. For instance, how to adapt the data from database A to fit the structure of database B.  We have to know what the structure of database B is.

Furthermore, we have to make sure that the data being transmitted meets the data types the SQL database is expecting. Database B will reject the data and spit out errors if we try to load something that does not meet its expectation. This is called Schema on-write.

On the other hand, Hadoop has a schema on-read approach. So, when data is written on the HDFS or the Hadoop Distributed File System, it’s just brought in without setting any gate-keeping rules. Then, we apply rules to the code that reads the data when we want to read the data. This code reads the data rather than pre-configuring their structure ahead of time. This concept of schema on-read vs. schema on-write has fundamental implications on how the data is stored in SQL vs. Hadoop, which leads us to our second difference.

How data is stored

In SQL, the data is stored in a logical form with inter-related tables and defined columns. In Hadoop, the data is a compressed file of either data (any type) or text. But, the moment the data enters into Hadoop, the data or file is replicated across multiple-nodes in the HDFS. For example, let’s assume that we’re loading Instagram data and we have a large Hadoop distributed file system cluster of 500 servers. Your Instagram data may be replicated across 50 of them, along with all the other profiles of Instagram users.

Hadoop keeps track of where all the copies of your profile are. This might seem like a waste of space, but it’s actually the secret to the massive-scalability magic in Hadoop. Which leads us to the third difference.

Big data in Hadoop vs. in SQL

A big data solution like Hadoop is architected for an unlimited number of servers. Back to our previous example, and if you’re searching Instagram data, and you want to see all the sentence that include the word “happy”; Hadoop will distribute the calculation of the search request across all the 50 servers where your profile data is replicated in the HDFS.

But, instead of each server conducting the exact same search, Hadoop will divide the workload, so that each server works on just a portion of your Instagram history. As each node finishes its assigned portion of history, a reducer program on the same node cluster adds up all the tallies and produces a consolidated list.

So what happens if any of the 50 servers break-down or has some sort of an issue while processing the request, should the response hold up because of the incomplete data set? The answer to this question is one of the main differences between SQL and Hadoop.

Now Hadoop would say no, and it would deliver an immediate response and finally it would have a consistent-answer.

SQL would say no, before releasing anything we must have consistency across all the nodes. This is called the two-phase commit. Now neither approach is wrong or right, as both approaches play an important role based on the data-type being used.

The Hadoop consistency methodology is far more realistic for reading constantly updating streams of unstructured data across thousands of servers. While the two-phase commit SQL databases methodology is well suited for rolling up and managing transactions and ensure we get the right answer.

Hadoop is so flexible that Facebook  created a program called “Hive” that allowed their team members who didn’t know Hadoop java code to write queries in standard SQL and mimic SQL behavior on-demand.  Today, some of the leading tools for analysis and data management are now beginning to natively write MapReduce programs and prepackaged schemas on-read so that organizations don’t have to hire expensive data-experts to get value from this powerful architecture.

To dive deeper into this subject, please download the following StoneFly white paper on Big Data & Hadoop.


The Spear Phishing Survival Guide

The Spear Phishing Survival Guide

Spear phishing stands as the favored gateway for ransomware delivery and infiltrating corporate networks. Shockingly, 36% of data breaches in 2022 involved phishing, with 25% utilizing email as the ransomware attack vector. Guarding against cyber threats and...

Understanding Detection and Response: EDR vs MDR vs XDR vs NDR

Understanding Detection and Response: EDR vs MDR vs XDR vs NDR

In a digitally transformed landscape fraught with ever-evolving cyber threats, the acronyms EDR (Endpoint Detection and Response), XDR (Extended Detection and Response), MDR (Managed Detection and Response), and NDR (Network Detection and Response) have become...

Trigona Ransomware: What is it and How to Defend Against it

Trigona Ransomware: What is it and How to Defend Against it

In an ever-evolving digital landscape, the specter of ransomware looms large, and Trigona stands as a significant player in the realm of cyber threats. This blog delves into the multifaceted world of Trigona ransomware, unraveling its origins, unique characteristics,...

Lockbit Ransomware: Inside the Cyberthreat and Defense Strategies

Lockbit Ransomware: Inside the Cyberthreat and Defense Strategies

In the constantly evolving arena of cybersecurity, the digital landscape is fraught with adversaries lurking in the shadows, ready to exploit vulnerabilities and disrupt the operations of organizations. Among these threats, LockBit ransomware has emerged as a...

What Defending Against Ransomware-as-a-Service (RaaS) Entails

What Defending Against Ransomware-as-a-Service (RaaS) Entails

Ransomware has evolved, becoming a thriving business model for cybercriminals. Ransomware-as-a-Service (RaaS) exemplifies this transformation—a lethal alliance between the creators and distributors of ransomware. It’s no longer a threat relegated to tech...

You May Also Like

WordPress PopUp Plugin

Subscribe To Our Newsletter

Join our mailing list to receive the latest news, updates, and promotions from StoneFly.

Please Confirm your subscription from the email