HBase:
- Wide-column store based on Apache Hadoop and on concepts of BigTable.
- Apache HBase is a NoSQL key/value store which runs on top of HDFS
- Unlike Hive, HBase serves operations in real time against its own store rather than running them as MapReduce jobs
- Data in HBase is organized into tables, and tables are further split into column families.
- A column family groups the columns of a schema that are stored together
- Each key/value pair is stored as a cell
- Each key consists of the row-key, column family, column, and timestamp
- A row in HBase is a grouping of key/value mappings identified by the row-key.
- It scales horizontally
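- Versioning is available: multiple timestamped versions of a cell can be kept (3 by default in older releases; newer releases default to 1)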
- Supports four primary operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns or column versions from the table.
- A schema consists of tables and column families
- Queries are written against HBase's own API rather than SQL; Apache Phoenix adds SQL-style operations on top of HBase
- ZooKeeper coordinates the cluster components (Master Server, Region Servers, etc.)
- The Master Server monitors all Region Servers and handles metadata changes and maintenance
- In CAP terms, HBase is generally classified as CP (Consistency and Partition Tolerance)
- Optimized for reads, with a single write master
- Supports range-based scans over ordered row keys, which remain usable as the cluster scales horizontally
- Does not support secondary indexes natively, but they can be emulated with a trigger on "put" that keeps the index up to date
- HBase coprocessors support simple aggregations out of the box: SUM, MIN, MAX, AVG, STD. Other aggregations can be built by defining Java classes to perform the aggregation
- Good for real-time analytics and massive data processing
- User: Facebook
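
The data model described above (cells keyed by row-key, column family, column, and timestamp, with put, get, scan, and delete) can be sketched with a small in-memory toy. This is only an illustration in plain Python, not the HBase client API: the class name `ToyTable`, its method signatures, the logical timestamps, and the `max_versions=3` default are all assumptions made for this example.

```python
import bisect
import itertools


class ToyTable:
    """In-memory stand-in for one HBase-style table (hypothetical, not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> {(family, qualifier): [(timestamp, value), ...] newest first}
        self._rows = {}
        # Row keys kept sorted so that range scans are cheap.
        self._sorted_keys = []
        # Logical clock standing in for real timestamps.
        self._clock = itertools.count(1)

    def put(self, row, family, qualifier, value):
        """Add a new version of a cell, dropping versions beyond max_versions."""
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault((family, qualifier), [])
        versions.insert(0, (next(self._clock), value))
        del versions[self.max_versions:]

    def get(self, row):
        """Return the newest value of every cell in one row."""
        cells = self._rows.get(row, {})
        return {column: versions[0][1] for column, versions in cells.items()}

    def versions(self, row, family, qualifier):
        """Return the retained (timestamp, value) pairs of one cell, newest first."""
        return list(self._rows.get(row, {}).get((family, qualifier), []))

    def scan(self, start_row, stop_row):
        """Yield (row, newest cells) for start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for row in self._sorted_keys[lo:hi]:
            yield row, self.get(row)

    def delete(self, row):
        """Remove a whole row (the toy omits column- and version-level deletes)."""
        if self._rows.pop(row, None) is not None:
            self._sorted_keys.remove(row)


table = ToyTable()
for name in ["a", "b", "c", "d"]:
    table.put("user#001", "info", "name", name)   # four writes to the same cell
table.put("user#002", "info", "name", "x")

print(table.versions("user#001", "info", "name"))  # only the newest three remain
print(list(table.scan("user#001", "user#002")))    # stop row is exclusive
```

Because row keys are kept in sorted order, a range scan is just a slice of the key list, which is the property the ordered scan bullet relies on.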
Cassandra:
- Wide-column store based on ideas of BigTable and Amazon's Dynamo
- Cassandra has a decentralized architecture. Any node can perform any operation. It provides AP (Availability, Partition Tolerance) from the CAP theorem.
- Cassandra has excellent single-row read performance
- Cassandra does not support range-based row scans
- Cassandra is well suited for supporting single-row queries, or selecting multiple rows based on a column-value index.
- If data is stored in columns to support range scans, the practical limit on row size in Cassandra is tens of megabytes
- Rows larger than that cause problems with compaction overhead and time.
- Cassandra supports secondary indexes on column families where the column name is known, but not on dynamic columns
- Aggregations are not supported by the Cassandra nodes; the client must perform them
- When an aggregation spans multiple rows, random partitioning makes it very difficult. In that case, use Storm or Hadoop for the aggregation
- User: Twitter
- Good for log file processing
- Symmetric architecture makes it relatively easy to create and scale large clusters
- SQL-like Cassandra Query Language eases developers' transition from RDBMS
- Allows you to tune for performance or consistency or a balance of both
- A community edition of the management GUI is available
- Good documentation (provided by DataStax)
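
The tuning between performance and consistency mentioned above is commonly reasoned about with the overlap rule R + W > N: if the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N), every read set overlaps the latest write set. The sketch below only does that arithmetic in plain Python and does not talk to a cluster. The function names and the replication factor of 3 are assumptions for illustration; ONE, QUORUM, and ALL are Cassandra consistency levels.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


RF = 3  # assumed replication factor, chosen only for this example

# (write replicas, read replicas) for a few write/read consistency-level pairs
settings = {
    "ONE/ONE": (1, 1),
    "QUORUM/QUORUM": (quorum(RF), quorum(RF)),
    "ALL/ONE": (RF, 1),
}

for name, (w, r) in settings.items():
    print(name, reads_see_latest_write(RF, w, r))
```

Under these assumptions, ONE/ONE touches the fewest replicas but gives no overlap guarantee, while QUORUM/QUORUM and ALL/ONE satisfy the rule at the cost of contacting more replicas per operation.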