HIVE AND IMPALA DIFFERENCE

1/6/2015

Impala:
Impala is integrated from the ground up as part of the Hadoop ecosystem and leverages the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack.

Designed to complement MapReduce which specializes in large-scale batch processing, Impala is an independent processing framework optimized for interactive queries. With Impala, analysts and data scientists now have the ability to perform real-time, “speed of thought” analytics on data stored in Hadoop via SQL or through Business Intelligence (BI) tools. The result is that large-scale data processing and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

Impala's SQL syntax follows the SQL-92 standard and includes many industry extensions in areas such as built-in functions.

HiveQL Features not Available in Impala:

Non-scalar data types such as maps, arrays, structs.
Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
XML and JSON functions.
Certain aggregate functions from HiveQL: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Impala Aggregate Functions and analytic functions listed in Analytic Functions.
Sampling.
Lateral views.
Multiple DISTINCT clauses per query, although Impala includes some workarounds for this limitation.
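
As a rough illustration of the multiple-DISTINCT limitation in the list above, the sketch below shows two workarounds: the approximate NDV() aggregate, and one DISTINCT per subquery combined with a cross join. The table web_logs and its columns are hypothetical and exist only for this example.

```sql
-- Hypothetical table: web_logs(user_id BIGINT, page STRING)

-- Not accepted by Impala releases that have this limitation:
-- two DISTINCT aggregates on different columns in one query.
-- SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT page) FROM web_logs;

-- Workaround 1: NDV() returns an approximate distinct count and
-- may appear several times in the same query.
SELECT NDV(user_id) AS approx_users,
       NDV(page)    AS approx_pages
FROM web_logs;

-- Workaround 2: exact counts, one DISTINCT per subquery,
-- combined into a single row with a cross join.
SELECT u.users, p.pages
FROM (SELECT COUNT(DISTINCT user_id) AS users FROM web_logs) u
CROSS JOIN (SELECT COUNT(DISTINCT page) AS pages FROM web_logs) p;
```

NDV() trades accuracy for a single scan of the table, while the cross-join form gives exact results but reads the table once per subquery.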

User-defined functions (UDFs) are supported starting in Impala 1.2. See User-Defined Functions (UDFs) for full details on Impala UDFs.
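
A minimal sketch of registering and calling a native UDF might look like the following. The function name, the HDFS path, and the symbol name are all hypothetical; they stand in for whatever shared library you have actually built and uploaded.

```sql
-- Hypothetical: a C++ UDF compiled into libudfsample.so and copied to HDFS.
-- LOCATION points at the library; SYMBOL names the entry point inside it.
CREATE FUNCTION my_lower(STRING) RETURNS STRING
LOCATION '/user/hive/udfs/libudfsample.so'
SYMBOL='MyLower';

-- Once registered, the UDF is called like a built-in function.
SELECT my_lower('HELLO');
```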

HIVEQL STATEMENTS NOT SUPPORTED BY IMPALA

Impala does not currently support these HiveQL statements:

ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
DESCRIBE COLUMN
DESCRIBE DATABASE
EXPORT TABLE
IMPORT TABLE
SHOW TABLE EXTENDED
SHOW INDEXES
SHOW COLUMNS
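
To make the ANALYZE TABLE substitution concrete, here is a short sketch using a hypothetical table named sales. The statements shown are standard Impala statements; only the table name is invented.

```sql
-- Hypothetical table name: sales
-- Hive:   ANALYZE TABLE sales COMPUTE STATISTICS;
-- Impala equivalent:
COMPUTE STATS sales;

-- Inspect the statistics that were gathered.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;

-- DESCRIBE lists column names and types, which covers the most
-- common use of the unsupported SHOW COLUMNS statement.
DESCRIBE sales;
```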

Semantic Differences Between Impala and HiveQL Features
Impala uses the Apache Sentry (incubating) authorization framework, which provides fine-grained, role-based access control to protect data against unauthorized access or tampering.

The semantics of some Impala SQL statements differ from HiveQL, even where the two use similar statement and clause names:

Impala uses different syntax and names for query hints, [SHUFFLE] and [NOSHUFFLE] rather than MapJoin or StreamJoin. See Joins for the Impala details.
Impala does not expose MapReduce-specific features such as SORT BY, DISTRIBUTE BY, or CLUSTER BY.
Impala does not require queries to include a FROM clause.
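
Two of the points above can be sketched briefly: queries without a FROM clause, and Impala's square-bracket join hints. The tables orders and customers are hypothetical, and only the [SHUFFLE] hint is shown.

```sql
-- No FROM clause is needed for constant expressions or function calls.
SELECT 1 + 1;
SELECT now();

-- Join hint in Impala's square-bracket syntax, placed right after JOIN.
-- (Hive would instead use a comment-style hint such as /*+ MAPJOIN(c) */.)
-- orders and customers are hypothetical tables.
SELECT o.order_id, c.name
FROM orders o
JOIN [SHUFFLE] customers c
  ON o.customer_id = c.customer_id;
```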

Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected casting behavior.
- Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
- Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
- Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string; see TIMESTAMP Data Type for details.
See Data Types for full details on implicit and explicit casting for all types, and Impala Type Conversion Functions for details about the CAST() function.
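
The casting rules above can be illustrated with a small sketch. The table readings and its columns are hypothetical; the behavior noted in the comments restates the rules from the list rather than verified output.

```sql
-- Hypothetical table:
--   readings(id INT, small_val SMALLINT, raw_text STRING, big_val DOUBLE)

-- String to number: no implicit cast, so CAST() is required.
SELECT CAST(raw_text AS INT) + 1 FROM readings;

-- Widening numeric conversion is implicit: the SMALLINT column is
-- promoted to match the BIGINT operand.
SELECT small_val + CAST(1000000 AS BIGINT) FROM readings;

-- Narrowing conversions (DOUBLE to FLOAT, INT to TINYINT) need CAST().
-- Values that do not fit the smaller type may lose information.
SELECT CAST(big_val AS FLOAT), CAST(id AS TINYINT) FROM readings;

-- String to TIMESTAMP: the explicit form always states the intent...
SELECT CAST('2015-01-06 10:30:00' AS TIMESTAMP);
-- ...and a string literal in a supported format should also be accepted
-- where a TIMESTAMP argument is expected, via the implicit cast.
SELECT year('2015-01-06 10:30:00');
```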
