Blog Archives

Stinger.next for Hive

1/21/2015

Stinger.next Ref: HN

10 ways to query Hadoop with SQL

1/11/2015

Great link: http://www.infoworld.com/article/2683729/hadoop/10-ways-to-query-hadoop-with-sql.html

APACHE PHOENIX

Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Apache Phoenix: Its developers call it a "SQL skin for HBase" -- a way to query HBase with SQL-like commands via an embeddable JDBC driver built for high performance and read/write operations. Consider it an almost no-brainer for those making use of HBase, thanks to it being open source, aggressively developed, and outfitted with useful features like bulk data loading.
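
As a rough illustration of the client-embedded JDBC driver described above, the sketch below assembles a Phoenix JDBC connection URL and two SQL statements in Python. The URL layout (jdbc:phoenix:<ZooKeeper quorum>:<port>:<root node>) and the UPSERT keyword follow Phoenix conventions; the host names, the table web_stat, and the helper phoenix_jdbc_url are hypothetical, and nothing here opens a real connection.

```python
from typing import Optional


def phoenix_jdbc_url(zk_quorum: str,
                     port: Optional[int] = None,
                     root_node: Optional[str] = None) -> str:
    """Build a Phoenix JDBC URL (hypothetical helper, not part of Phoenix)."""
    # The URL segments are positional, so a root node needs a port before it.
    if root_node is not None and port is None:
        raise ValueError("a root node can only be given together with a port")
    parts = ["jdbc:phoenix", zk_quorum]
    if port is not None:
        parts.append(str(port))
    if root_node is not None:
        parts.append(root_node)
    return ":".join(parts)


# Phoenix writes rows with UPSERT rather than INSERT.
# The table and column names below are made up for the example.
UPSERT_SQL = "UPSERT INTO web_stat (host, hits) VALUES (?, ?)"
QUERY_SQL = "SELECT host, SUM(hits) FROM web_stat GROUP BY host"

if __name__ == "__main__":
    print(phoenix_jdbc_url("zk1,zk2,zk3", 2181, "/hbase"))
    print(UPSERT_SQL)
    print(QUERY_SQL)
```

A JDBC-capable client would pass a URL of this shape to the Phoenix driver and then run statements like the two above through ordinary JDBC calls.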

APACHE DRILL - SQL TOOL

1/11/2015

Drill uses ANSI SQL and queries data in place, which cuts down on data preparation work such as ETL.
It also uses a JSON-like data model, so dynamic changes in the data model are easily accommodated.
It can work with Hive tables and Hive UDFs.

Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. Apache Drill is built from the ground up to provide low latency queries natively on such rapidly evolving multi-structured datasets at scale.

Apache Drill provides direct queries on self-describing and semi-structured data in files (such as JSON, Parquet) and HBase tables without needing to define and maintain schemas in a centralized store such as Hive metastore. This means that users can explore live data on their own as it arrives versus spending weeks or months on data preparation, modeling, ETL and subsequent schema management.
Drill provides a JSON-like internal data model to represent and process data.
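
The idea of querying self-describing data without a predeclared schema can be sketched in a few lines of Python. This is only an illustration of schema-on-read, not how Drill is implemented: the sample records, the field names, and the select and get_path helpers are all made up for the example.

```python
import json

# Self-describing records: each JSON document carries its own field names,
# and later records may add or omit fields (hypothetical sample data).
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "device": {"os": "android"}}',
    '{"user": "cy", "device": {"os": "ios"}}',
]


def get_path(record, path):
    """Follow a dotted path such as 'device.os'; return None if it is absent."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select(lines, columns, where=None):
    """Project columns from JSON lines, discovering fields record by record."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if where is None or where(record):
            rows.append(tuple(get_path(record, col) for col in columns))
    return rows


if __name__ == "__main__":
    for row in select(RAW_LINES, ["user", "device.os"]):
        print(row)
```

No table definition is declared anywhere: a missing field simply comes back as None, which is the kind of flexibility a JSON-like data model allows when the data changes shape.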

Thrift SERVER & HIVE

1/10/2015

Hive has an optional component known as HiveServer or HiveThrift that allows access to Hive over a single port. Thrift is a software framework for scalable cross-language services development. See http://thrift.apache.org/ for more details. Thrift allows clients written in languages including Java, C++, Ruby, and many others to access Hive remotely and programmatically.

HIVE AND IMPALA DIFFERENCE

1/6/2015

Impala:
Impala is integrated from the ground up as part of the Hadoop ecosystem and leverages the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

Designed to complement MapReduce which specializes in large-scale batch processing, Impala is an independent processing framework optimized for interactive queries. With Impala, analysts and data scientists now have the ability to perform real-time, “speed of thought” analytics on data stored in Hadoop via SQL or through Business Intelligence (BI) tools. The result is that large-scale data processing and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

Impala's SQL syntax follows the SQL-92 standard and includes many industry extensions in areas such as built-in functions.

HiveQL Features not Available in Impala:

Non-scalar data types such as maps, arrays, structs.
Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
XML and JSON functions.
Certain aggregate functions from HiveQL: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Impala Aggregate Functions and analytic functions listed in Analytic Functions.
Sampling.
Lateral views.
Multiple DISTINCT clauses per query, although Impala includes some workarounds for this limitation.

User-defined functions (UDFs) are supported starting in Impala 1.2. See User-Defined Functions (UDFs) for full details on Impala UDFs.

HIVEQL STATEMENTS NOT SUPPORTED BY IMPALA

Impala does not currently support these HiveQL statements:

ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
DESCRIBE COLUMN
DESCRIBE DATABASE
EXPORT TABLE
IMPORT TABLE
SHOW TABLE EXTENDED
SHOW INDEXES
SHOW COLUMNS

Semantic Differences Between Impala and HiveQL Features
Impala utilizes the Apache Sentry (incubating) authorization framework, which provides fine-grained role-based access control to protect data against unauthorized access or tampering.

The semantics of Impala SQL statements vary from HiveQL in some cases where they use similar SQL statement and clause names:

Impala uses different syntax and names for query hints, [SHUFFLE] and [NOSHUFFLE] rather than MapJoin or StreamJoin. See Joins for the Impala details.
Impala does not expose MapReduce specific features of SORT BY, DISTRIBUTE BY, or CLUSTER BY.
Impala does not require queries to include a FROM clause.

Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected casting behavior.
- Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
- Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
- Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string; see TIMESTAMP Data Type for details.
See Data Types for full details on implicit and explicit casting for all types, and Impala Type Conversion Functions for details about the CAST() function.
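
The casting rules above can be summarized as a small decision function. The sketch below is a simplified model written from the three rules listed here, not Impala code; the type ordering in NUMERIC_ORDER and the function name cast_is_implicit are assumptions made for illustration, and types outside this short list (such as DECIMAL) are not modeled.

```python
# Numeric types from narrowest to widest, following the examples in the text.
# The exact ordering is an assumption; DECIMAL and other types are left out.
NUMERIC_ORDER = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE"]


def cast_is_implicit(src: str, dst: str) -> bool:
    """Return True if this simplified model says no CAST() is needed."""
    src, dst = src.upper(), dst.upper()
    if src == dst:
        return True
    if src in NUMERIC_ORDER and dst in NUMERIC_ORDER:
        # Only widening conversions happen implicitly; narrowing needs CAST().
        return NUMERIC_ORDER.index(src) < NUMERIC_ORDER.index(dst)
    if src == "STRING" and dst == "TIMESTAMP":
        # Strings in a supported literal format convert to TIMESTAMP implicitly.
        return True
    # Everything else, including string <-> numeric or Boolean, needs CAST().
    return False


if __name__ == "__main__":
    for pair in [("SMALLINT", "BIGINT"), ("DOUBLE", "FLOAT"), ("STRING", "INT")]:
        print(pair, "implicit" if cast_is_implicit(*pair) else "needs CAST()")
```

In query terms, a comparison between a string column and a number would need something like CAST(col AS INT) under these rules, while a SMALLINT column can be compared with a BIGINT directly.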

APACHE SENTRY FOR Regulatory Compliance AND ACCESS CONTROL

1/6/2015

Apache Sentry (incubating) is a unified authorization mechanism that lets you store sensitive data in Hadoop. Sentry provides fine-grained authorization and role-based access control, all through a single system.

An enterprise data hub, powered by Hadoop, is a single, low-cost platform where organizations can efficiently and securely store, process, analyze, govern, archive, and serve any and all of their enterprise data. The enterprise data hub provides access through BI tools, SQL, and similar interfaces.

Main features are:

Improved Regulatory Compliance – Helps meet data compliance requirements such as HIPAA, SOX, and PCI.
Role-Based Administration – Database administrators can meet key role-based access control (RBAC) requirements and define what users and applications can do with data within a server, database, table, view, and search index.
Data Classification – Fine-grained control lets sensitive and non-sensitive data in the same dataset be governed separately.
Expanded User Base – Operations staff can control access through central administration, based on roles and departments.
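
To make the role-based model concrete, here is a minimal sketch of an RBAC check in Python. It mirrors the general shape of role-based authorization (groups are granted roles, and roles carry privileges on a database or table), but it is not Sentry's API; the Privilege class, the role and group names, and the is_allowed function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    action: str        # e.g. "SELECT", "INSERT", or "ALL"
    database: str      # "*" matches any database
    table: str = "*"   # "*" matches any table in the database

    def permits(self, action: str, database: str, table: str) -> bool:
        return (self.action in ("ALL", action)
                and self.database in ("*", database)
                and self.table in ("*", table))


# Hypothetical policy: roles hold privileges, groups are granted roles.
ROLE_PRIVILEGES = {
    "analyst": {Privilege("SELECT", "sales")},
    "etl": {Privilege("ALL", "staging"), Privilege("INSERT", "sales", "orders")},
}
GROUP_ROLES = {
    "finance": {"analyst"},
    "data_eng": {"analyst", "etl"},
}


def is_allowed(groups, action, database, table):
    """Check whether any role granted to the user's groups permits the action."""
    for group in groups:
        for role in GROUP_ROLES.get(group, ()):
            for privilege in ROLE_PRIVILEGES.get(role, ()):
                if privilege.permits(action, database, table):
                    return True
    return False


if __name__ == "__main__":
    print(is_allowed(["finance"], "SELECT", "sales", "orders"))
    print(is_allowed(["finance"], "INSERT", "sales", "orders"))
```

Because access is decided by roles rather than by individual users, changing what a department may do means editing one role instead of touching every account.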

January 06th, 2015

1/6/2015
