Zen of data modelling in Hadoop
The Zen of Python is a well-known set of tongue-in-cheek guidelines for writing Python code. If you haven’t read it, I would highly recommend reading it here: Zen of Python.
There’s an “Easter egg” in the Python REPL. Type the following import into the Python REPL and it will print out the Zen of Python.
import this
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let’s do more of those!
I have been pondering how you would model your data in Hadoop. There are no clear guidelines or hard-and-fast rules along the lines of: if it is an OLTP database, use 3NF; if you are building a warehouse, use the Kimball methodology. I had the same dilemma when SSAS Tabular came around in SQL Server 2012. The conclusion I reached then was that the answer lay somewhere between the two extremes: a fully normalized (>= 3NF) model and a flat-wide-deep denormalized one.
The Hadoop ecosystem also throws additional complexities into the mix: which file format to choose (Avro, Parquet, ORC, SequenceFile and so on), which compression works best (Snappy, bzip2, etc.), access patterns (is it a scan, or more like a search?), redundancy, data distribution, partitioning, distributed/parallel processing and so on. Some would give the golden answer: “it depends”. However, I think there are some basic guidelines we can follow (while, at the same time, not disagreeing that it depends on the use case).
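To make those choices a little more concrete, here is a minimal PySpark sketch. The paths and column names are made up for illustration, and the Parquet/Snappy/partitioning combination is just one reasonable choice, not the answer:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("modelling-sketch").getOrCreate()

    # Hypothetical raw input; in practice this could be logs, CDC extracts, etc.
    events = spark.read.json("/raw/events")

    # Columnar Parquet suits scan-heavy access; Snappy trades some
    # compression ratio for speed; partitioning by date lets date-filtered
    # queries prune whole directories instead of reading everything.
    (events
        .write
        .mode("overwrite")
        .option("compression", "snappy")
        .partitionBy("event_date")
        .parquet("/warehouse/events"))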
So, in the same vein as the Zen of Python, I would like to present my take on the Zen of data modelling in Hadoop. In future posts, hopefully, I will expand further on each one.
So here it is:
Zen of Data Modelling in Hadoop
Scans are better than joins
Distributed is better than centralised
Tabular view is better than nested
Redundancy is better than lookups
Compact is better than sparse
End user queries matter
No model is appropriate for all use cases, so schema changes should be graceful
Multiple views of the data are okay
Access to all data is not always needed; filtering at the source is great
Processing is better done closer to data
Chunk data into large blocks
ETL is not the end, so don’t get hung up on it; ETL exists to serve data
Good enough is better than none at all; Better is acceptable even if it is not Best
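To illustrate a couple of these, here is a hypothetical PySpark sketch of “Scans are better than joins” and “Redundancy is better than lookups”: pay the join cost once at ETL time, accept the redundancy, and let end-user queries become plain scans over one wide table. The table names and join key below are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zen-sketch").getOrCreate()

    orders = spark.read.parquet("/warehouse/orders")        # assumed fact table
    customers = spark.read.parquet("/warehouse/customers")  # assumed dimension

    # Join once during ETL, duplicating customer attributes onto each order...
    orders_wide = orders.join(customers, on="customer_id", how="left")

    # ...so downstream queries scan one wide table, with no lookup or join
    # needed at read time.
    orders_wide.write.mode("overwrite").parquet("/warehouse/orders_wide")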
So there you have it. As mentioned previously, the model will depend on the situation, but let’s hope these serve as a helpful starting point. They come with a full disclaimer: use at your own risk, and always subject to change.
