HBase Overview#
Why HBase#
RDBMS get exponentially slow as the data becomes large
Expects data to be highly structured, i.e. ability to fit in a well-defined schema
Any change in schema might require a downtime
For sparse datasets, too much of overhead of maintaining NULL values
HBase Data Model#
HBase Commands#
Create: Creates a new table identified by ‘table1’ and Column Family identified by ‘colf’.
Put: Inserts a new record into the table with row identified by ‘row..’
Scan: returns the data stored in table
Get: Returns the records matching the row identifier provided in the table
Help: Get a list of commands
HBase and HDFS#
HDFS | HBase |
---|---|
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of the HDFS. |
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables. |
It provides high latency batch processing; no concept of batch processing. | It provides low latency access to single rows from billions of records (Random access). |
It provides only sequential access of data. | HBase internally uses Hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups. |
HBase and RDBMS#
HBase | RDBMS |
---|---|
HBase is schema-less, it doesn't have the concept of fixed columns schema; defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of tables. |
It is built for wide tables. HBase is horizontally scalable. | It is thin and built for small tables. Hard to scale. |
No transactions are there in HBase. | RDBMS is transactional. |
It has de-normalized data. | It will have normalized data. |
It is good for semi-structured as well as structured data. | It is good for structured data. |