If you are used to working with conventional SQL databases, hearing someone talk about Hadoop can sound like a confusing mess. In this blog, we are going to simplify three of the many differences between conventional SQL databases and Hadoop, which will hopefully provide context as to where each model is used.
Schema on Write vs. Schema on Read
The first difference we want to point out is schema-on-write vs. schema-on-read. When we transfer data from SQL database A to SQL database B, we need to have some information in hand before we write the data to database B. For instance, we have to know what the structure of database B is, so we can adapt the data from database A to fit it.
Furthermore, we have to make sure that the data being transmitted matches the data types the SQL database is expecting. Database B will reject the data and spit out errors if we try to load something that does not meet its expectations. This is called schema-on-write.
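To make the idea concrete, here is a minimal sketch of schema-on-write in plain Python: the schema is enforced before any row reaches storage, just as database B would do. The table name, columns, and types are all hypothetical examples, not part of any real database API.

```python
# Schema-on-write sketch: validate every row BEFORE it is stored.
# The schema and column names below are hypothetical.

SCHEMA = {"user_id": int, "username": str, "followers": int}

def write_row(table, row):
    """Reject the row up front if it doesn't match the schema."""
    for column, expected_type in SCHEMA.items():
        if column not in row:
            raise ValueError(f"missing column: {column}")
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)  # only well-formed rows ever reach storage

table = []
write_row(table, {"user_id": 1, "username": "alice", "followers": 120})

try:
    # "two" is a string where an int is expected, so the write is rejected.
    write_row(table, {"user_id": "two", "username": "bob", "followers": 5})
except TypeError as e:
    print("rejected:", e)
```

The key point is that the bad row never enters the table; the gate-keeping happens at write time.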
Hadoop, on the other hand, takes a schema-on-read approach. When data is written to HDFS, the Hadoop Distributed File System, it is simply brought in without any gate-keeping rules. The structure is applied later, by the code that reads the data, rather than being pre-configured ahead of time. This distinction between schema-on-read and schema-on-write has fundamental implications for how the data is stored in SQL vs. Hadoop, which leads us to our second difference.
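The flip side can be sketched the same way: with schema-on-read, the raw lines are stored exactly as they arrive, and the structure is imposed only by the code that reads them. The file format and field names here are hypothetical.

```python
# Schema-on-read sketch: store raw lines as-is, apply structure at read time.
# The comma-separated format and field names are hypothetical.

raw_lines = [
    "1,alice,120",          # well-formed
    "2,bob,not_a_number",   # a SQL database would have rejected this on write
]

def read_profiles(lines):
    """Apply the schema at read time; skip records that don't fit."""
    for line in lines:
        user_id, username, followers = line.split(",")
        try:
            yield {"user_id": int(user_id),
                   "username": username,
                   "followers": int(followers)}
        except ValueError:
            continue  # malformed data surfaces when read, not when written

print(list(read_profiles(raw_lines)))
```

Notice that the malformed line was stored without complaint; it only becomes a problem (and is skipped) when the reading code tries to interpret it.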
How data is stored
In SQL, the data is stored in a logical form, with inter-related tables and defined columns. In Hadoop, the data is stored as files, text or any other type, often compressed. The moment data enters Hadoop, the file is replicated across multiple nodes in HDFS. For example, let's assume that we're loading Instagram data and we have a large HDFS cluster of 500 servers. Your Instagram data may be replicated across 50 of them, along with all the other profiles of Instagram users.
Hadoop keeps track of where all the copies of your profile are. This might seem like a waste of space, but it's actually the secret to Hadoop's massive scalability, which leads us to the third difference.
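A toy sketch of that bookkeeping: each block of a file is copied to several nodes, and a central record (playing the role of the HDFS namenode) tracks where the copies live. The replication factor of 3 mirrors the HDFS default; the round-robin placement and all names here are a hypothetical simplification, not how HDFS actually chooses nodes.

```python
# Toy HDFS-style replication: each block is copied to several nodes,
# and a central map records where every replica lives.
import itertools

REPLICATION_FACTOR = 3  # the HDFS default

def place_blocks(blocks, nodes):
    """Assign each block to REPLICATION_FACTOR nodes, round-robin.
    (Assumes there are at least REPLICATION_FACTOR nodes, so the
    replicas of a block land on distinct machines.)"""
    node_cycle = itertools.cycle(nodes)
    block_map = {}
    for block in blocks:
        block_map[block] = [next(node_cycle) for _ in range(REPLICATION_FACTOR)]
    return block_map

nodes = [f"node-{i}" for i in range(10)]
block_map = place_blocks(["blk_0001", "blk_0002"], nodes)
print(block_map)  # the "namenode" view: block -> list of replica locations
```

If any one node is lost, the map still points to two other copies of every block it held, which is exactly why the apparent waste of space buys fault tolerance and scalability.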
Big data in Hadoop vs. in SQL
A big data solution like Hadoop is architected for a virtually unlimited number of servers. Going back to our previous example, if you're searching Instagram data and want to see all the sentences that include the word “happy,” Hadoop will distribute the search across all 50 servers where your profile data is replicated in HDFS.
But instead of each server conducting the exact same search, Hadoop divides the workload so that each server works on just a portion of your Instagram history. As each node finishes its assigned portion, a reducer program in the cluster adds up all the tallies and produces a consolidated list.
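The divide-and-consolidate pattern just described is the map/reduce idea, and it can be sketched in a few lines. In real Hadoop this runs as distributed Java jobs across many machines; here two plain Python function calls stand in for two "nodes," and the post data is invented for illustration.

```python
# Map/reduce sketch: each "node" tallies words in its slice of the data,
# then a reducer consolidates the per-node tallies.
from collections import Counter

posts = [
    "so happy today",
    "happy happy joy",
    "rainy monday",
    "feeling happy again",
]

def mapper(slice_of_posts):
    """Each node counts words in its own portion of the history."""
    counts = Counter()
    for post in slice_of_posts:
        counts.update(post.split())
    return counts

def reducer(partials):
    """Consolidate the per-node tallies into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Divide the workload: two "nodes", each handling half the posts.
partial_counts = [mapper(posts[:2]), mapper(posts[2:])]
total = reducer(partial_counts)
print(total["happy"])  # → 4
```

Each mapper only ever sees its own slice, so the work scales out simply by adding more nodes; the reducer's job stays the same size either way.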
So what happens if one of the 50 servers breaks down or hits some sort of issue while processing the request? Should the response be held up because the data set is incomplete? The answer to this question is one of the main differences between SQL and Hadoop.
Hadoop would say no: it delivers an immediate response from the data that is available, and arrives at a consistent answer eventually.
SQL takes the opposite view: before releasing anything, we must have consistency across all the nodes. This is called a two-phase commit. Neither approach is right or wrong; both play an important role depending on the type of data being used.
The Hadoop consistency methodology is far more realistic for reading constantly updating streams of unstructured data across thousands of servers, while the two-phase commit of SQL databases is well suited for rolling up and managing transactions, where we must be sure we get the right answer.
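A minimal sketch of the two-phase commit mentioned above: a coordinator first asks every node to vote on whether it can commit, and only if all vote "yes" does it tell them to actually commit. The node model below is hypothetical and greatly simplified (real implementations also handle logging, timeouts, and recovery).

```python
# Two-phase commit sketch: all nodes commit, or none do.
# The Node class is a hypothetical stand-in for a database participant.

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.committed = False

    def prepare(self):
        # Phase 1: vote on whether this node can commit. A broken node votes no.
        return self.healthy

    def commit(self):
        # Phase 2: make the change durable on this node.
        self.committed = True

def two_phase_commit(nodes):
    # Phase 1: gather votes from every participant.
    if not all(node.prepare() for node in nodes):
        return False  # any "no" vote aborts the whole transaction
    # Phase 2: everyone voted yes, so everyone commits.
    for node in nodes:
        node.commit()
    return True

print(two_phase_commit([Node("a"), Node("b"), Node("c")]))        # → True
print(two_phase_commit([Node("a"), Node("b", healthy=False)]))    # → False
```

The contrast with Hadoop's approach is visible in the abort path: one unhealthy node holds up the entire answer, which is exactly the guarantee you want for transactions and exactly the bottleneck you don't want for scanning unstructured data on thousands of servers.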
Hadoop is so flexible that Facebook created a program called “Hive” that allowed team members who didn't know Hadoop's Java code to write queries in standard SQL and mimic SQL behavior on demand. Today, some of the leading tools for analysis and data management are beginning to natively generate MapReduce programs and prepackage schemas-on-read, so that organizations don't have to hire expensive data experts to get value from this powerful architecture.
To dive deeper into this subject, please download the following StoneFly white paper on Big Data & Hadoop.