Blog Posts

Data Segregation Based on Usage

1/15/2019

0 Comments

For the best data insight and visibility, it is important to separate source data based on the following factors:

Types of data: The kinds of data and their arrival frequency (batch, real-time, or near-real-time mode)
Usage: What the data can be used for
End user: Who will use the data, and for what purpose
Long-term and short-term usage: How long the data needs to be used
Maximum usability: Whether there is a limit on the use of the data
Expiry date: Whether there is a point in time at which the data expires
Based on the above points, separate the data:
Keep frequently used data in faster-access storage or a DB.
Store less frequently accessed or unused data in a different storage location.
Take a current snapshot or backup of the data before purging it from the fast-access DB.
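The rules above can be sketched in a few lines of Python. This is only an illustration: the `Dataset` fields, the `hot_threshold` of 100 reads per month, and the tier labels `hot`, `cold`, and `purge` are assumptions made for the example, not values from any product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    name: str
    reads_per_month: int          # how often the data is accessed
    expires_on: Optional[date]    # None means the data has no expiry date


def choose_tier(ds: Dataset, today: date, hot_threshold: int = 100) -> str:
    """Return 'purge', 'hot', or 'cold' for a dataset (hypothetical labels)."""
    if ds.expires_on is not None and ds.expires_on <= today:
        return "purge"   # expired: snapshot or back up first, then purge
    if ds.reads_per_month >= hot_threshold:
        return "hot"     # frequently used: keep in fast-access storage or DB
    return "cold"        # rarely used: move to a different storage location


today = date(2019, 1, 15)
datasets = [
    Dataset("clickstream", 5000, None),
    Dataset("2015_audit_logs", 2, None),
    Dataset("trial_export", 40, date(2018, 12, 31)),
]
plan = {ds.name: choose_tier(ds, today) for ds in datasets}
print(plan)
```

In practice the threshold would come from real access statistics, and the "purge" branch would trigger the snapshot step before anything is deleted.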


Naming Best Practices: AWS S3 & Azure Storage Container

7/23/2018

0 Comments

<processing Location> - <storage area> - <data container>
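A minimal sketch of building a name from this pattern is shown below. The function name `storage_name` and the sample values are hypothetical. The checks reflect the rules both services share for bucket and container names: 3 to 63 characters, lowercase letters, digits, and hyphens. The parts are joined with a plain hyphen, since spaces are not allowed in either service.

```python
import re


def storage_name(processing_location: str, storage_area: str, data_container: str) -> str:
    """Join the three parts into one lowercase, hyphen-separated name."""
    parts = [processing_location, storage_area, data_container]
    # Lowercase each part and turn any run of other characters into one hyphen.
    cleaned = [re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts]
    if any(not part for part in cleaned):
        raise ValueError("every part needs at least one letter or digit")
    name = "-".join(cleaned)
    if not 3 <= len(name) <= 63:
        raise ValueError(f"name must be 3-63 characters, got {len(name)}")
    return name


example = storage_name("US East", "Landing", "Customer_Uploads")
print(example)
```

S3 bucket names must also be globally unique, so a real scheme would usually add an organization prefix to the pattern.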


Extra Layer Security for Data Lake/Data Layer

7/23/2018

0 Comments

Separate the data landing zone and the data processing zone by placing each behind its own VPN. This reduces exposure to security threats from the outside world. The landing zone can also serve as both the data upload zone and the data download zone.
Data upload: Where customers upload their data
Data download: Where customers download the processed data


EMR vs. EC2 Cloudera or HDP

9/26/2017

0 Comments

REDSHIFT SPECTRUM

5/17/2017

0 Comments

Redshift Spectrum

Runs queries against exabytes of data directly in S3
Reads data directly from the S3 data lake, which eliminates loading and transferring the data
Scales query processing and uses as many nodes as the query requires
Directly queries data in S3 in open formats, including CSV, TSV, Parquet, SequenceFile, and RCFile
Lets you read data stored in Redshift together with data kept where it already lives in S3

Advantages

Pay per query with no upfront cost; charges are based on resource utilization
Reads directly from S3, with no need to load and process data before querying it
Supports multiple data formats
Easy to scale and easy to manage

Configuration:

You just need to register your Amazon Athena data catalog or Hive Metastore as an external schema.
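As a rough illustration, the registration is a single `CREATE EXTERNAL SCHEMA` statement run in Redshift. The Python helper below only builds that statement as text. The helper name, the schema and database names, and the IAM role ARN are placeholders invented for this sketch, and the string formatting is not safe for untrusted input.

```python
def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build the statement that exposes a data catalog database to Redshift."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


# Placeholder values; replace with your own schema, database, and role.
sql = external_schema_sql(
    schema="spectrum",
    catalog_db="sales_lake",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
print(sql)
```

Once the schema exists, external tables defined in it can be queried and joined with regular Redshift tables.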


HIPAA Compliance in Big Data Using Cloudera

1/25/2017

0 Comments

HIPAA requirements:

Physical safeguards for EPHI (electronic protected health information) used as a source for Hadoop, including the hardware, software, and equipment that hold health information
Technical safeguards implemented for access to EPHI data
Administrative safeguards for EPHI data, such as access control (the ability or the means necessary to read, write, modify, or communicate data/information)
Unique user identification and traceability
Data security and code management
Encryption and decryption

Source: https://www.hhs.gov/hipaa/for-professionals/faq/security-rule

Cloudera:
Cloudera Navigator and Key Trustee address the HIPAA requirements.
Navigator Encrypt is an integrated part of Cloudera. It provides a transparent data encryption (TDE) security layer for any Linux application, without changing the data or the application.
Encryption follows the FIPS 140-2 and NIST standards for validated solutions and uses the AES standard.
Keys are strongly protected by several layers of cryptography and stored separately from the encrypted files. Process-based access control for encrypted files lets only authorized systems use the decryption key, so access to the encrypted files can be prevented even after a hack or a physical compromise.
Key Trustee is a universal key manager used with Navigator Encrypt. It manages all cryptographic keys, certificates, configuration files, and any other “opaque object”, securing the most sensitive data and adding a secured layer on top of existing security for authorized access to data in the cloud.
Built-in security for the Hadoop cluster comes from Kerberos, and Sentry provides role-based permissions and access control for data moving in and out of Hadoop.

Cloudera Navigator's lineage and audit features provide data traceability, which is a requirement for HIPAA. Unauthorized visibility is controlled through data tokenization, masking, and encryption.
Source: Cloudera.com


LAMBDA PRICING EXAMPLE

9/25/2016

36 Comments

As described in my previous post, AWS Lambda is priced according to memory usage. Here are sample calculations (please refer to the pricing link below).

If you allocated 512MB of memory to your function, executed it 4 million times in one month, and it ran for 1 second each time, your charges would be calculated as follows:
Monthly compute charges
The monthly compute price is $0.00001667 per GB-s and the free tier provides 400,000 GB-s.
Total compute (seconds) = 4M * 1s = 4,000,000 seconds
Total compute (GB-s) = 4,000,000 * 512MB/1024 = 2,000,000 GB-s
Total compute – Free tier compute = Monthly billable compute GB-s
2,000,000 GB-s – 400,000 free tier GB-s = 1,600,000 GB-s
Monthly compute charges = 1,600,000 * $0.00001667 = $26.67

Monthly request charges
The monthly request price is $0.20 per 1 million requests and the free tier provides 1M requests per month.
Total requests – Free tier requests = Monthly billable requests
4M requests – 1M free tier requests = 3M Monthly billable requests
Monthly request charges = 3M * $0.2/M = $0.60

Total monthly charges
Total charges = Compute charges + Request charges = $26.67 + $0.60 = $27.27 per month
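The same arithmetic can be written as a short Python function, which makes it easy to try other memory sizes and call counts. The prices and free-tier limits are the ones quoted in this post and may not match current AWS pricing. The function and constant names are made up for the sketch.

```python
PRICE_PER_GB_SECOND = 0.00001667    # USD, figure quoted in this post
PRICE_PER_MILLION_REQUESTS = 0.20   # USD, figure quoted in this post
FREE_GB_SECONDS = 400_000           # monthly free tier, compute
FREE_REQUESTS = 1_000_000           # monthly free tier, requests


def monthly_lambda_cost(memory_mb: int, invocations: int, seconds_per_call: float) -> dict:
    """Return compute, request, and total charges in USD for one month."""
    total_gb_s = invocations * seconds_per_call * memory_mb / 1024
    billable_gb_s = max(0.0, total_gb_s - FREE_GB_SECONDS)
    compute = round(billable_gb_s * PRICE_PER_GB_SECOND, 2)

    billable_requests = max(0, invocations - FREE_REQUESTS)
    requests = round(billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS, 2)

    return {"compute": compute, "requests": requests, "total": round(compute + requests, 2)}


cost = monthly_lambda_cost(memory_mb=512, invocations=4_000_000, seconds_per_call=1)
print(cost)
```

With the inputs from the example above, this reproduces $26.67 for compute, $0.60 for requests, and $27.27 in total.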

https://aws.amazon.com/lambda/pricing/


LAMBDA PRICING

9/25/2016

0 Comments

Requests:
The first 1 million requests are free.
Above 1 million requests, the price is $0.20 per 1 million requests ($0.0000002 per request).

Duration:
Duration is calculated from the time your code begins executing until it returns or otherwise terminates, rounded up to the nearest 100 ms. The price depends on the amount of memory you allocate to your function. You are charged $0.00001667 for every GB-second used.
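To see how the rounding affects a bill, here is a small sketch in Python. It assumes the 100 ms billing increment described above and whole-millisecond inputs. The function names are hypothetical.

```python
def billed_ms(duration_ms: int, increment_ms: int = 100) -> int:
    """Round a duration up to the next billing increment."""
    # Ceiling division with integers avoids floating-point surprises.
    return -(-duration_ms // increment_ms) * increment_ms


def gb_seconds(duration_ms: int, memory_mb: int) -> float:
    """GB-seconds charged for one invocation at the given memory size."""
    return billed_ms(duration_ms) / 1000 * memory_mb / 1024


print(billed_ms(101))         # a 101 ms run is billed as 200 ms
print(gb_seconds(250, 512))   # 250 ms at 512 MB is billed as 300 ms
```

A function that finishes in 101 ms therefore costs the same as one that runs for 200 ms, which is worth keeping in mind when tuning short functions.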

Free tier:
The Lambda free tier includes 1M free requests per month and 400,000 GB-seconds of compute time per month. The Lambda free tier does not automatically expire at the end of your 12-month AWS Free Tier term, but is available to both existing and new AWS customers indefinitely. (As per AWS) https://aws.amazon.com/lambda/pricing/


LAMBDA IN AWS

9/25/2016

0 Comments

Lambda: AWS Lambda runs your code on high-availability compute infrastructure (you just need to provide the code).
It handles the following activities:

Administration of the compute resources, such as automatic provisioning and scaling, operating system maintenance, code and security patching, and code monitoring and logging.

Event trigger management:

Table updates in Amazon DynamoDB
Modifications of objects in Amazon S3 buckets
Messages arriving from Amazon Kinesis streams
API calls logged by AWS CloudTrail
Events from custom mobile applications
Events from web services
Events from web applications
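As an illustration of the S3 trigger in the list above, a Python Lambda function is just a handler that receives the event and a context object. The sketch below reads the bucket name and object key from each record. The sample event is hand-written and trimmed down for a local dry run, so it is not real AWS output, and the bucket and key are made up.

```python
def handler(event, context):
    """Entry point that Lambda calls with the triggering event."""
    changed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        changed.append(f"{bucket}/{key}")
    return {"changed_objects": changed}


# Local dry run with a minimal, hand-written event.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "uploads/a.csv"}}}
    ]
}
result = handler(sample_event, context=None)
print(result)
```

The other event sources deliver differently shaped events, so each trigger type needs its own parsing logic.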


Hadoop Utility

8/1/2016

0 Comments

Difference Between HBase, Cassandra, and MongoDB

7/31/2016

1 Comment

HBase:
Main characteristics:

Strong consistency
Built on top of Hadoop HDFS
CP in CAP terms (consistency and partition tolerance)

Suitable for:

Read-optimized workloads
Row-range-based scans
Strict consistency
Fast reads and writes with scalability

Not suitable for:

Analytics and transactional workloads
Applications that need full table scans

Cassandra:

Key characteristics:

High availability
AP in CAP terms (availability and partition tolerance)
No SPOF (single point of failure): all nodes are the same in Cassandra
Data is automatically replicated to multiple nodes for fault tolerance.
Replication across multiple data centers is supported.
Failed nodes can be replaced with no downtime.
Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Suitable for:

Simple setup and maintenance
Fast random reads and writes
Workloads that do not need multiple secondary indexes

Not suitable for:

Secondary indexes
Relational data
Transactional operations (rollback, commit)
Dynamic queries or searches on column data
Low latency

MongoDB:
Key characteristics:

Schema-free: schemas can change as applications evolve
Index: full index support for high performance
High availability: replication and failover
Auto-sharding for easy scalability: data is shared across multiple nodes for better-optimized operation. (Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. Source: Wiki)
Rich document-based queries for easy readability
Master-slave model
CP in CAP terms (consistency and partition tolerance)

Suitable for:

Semi-structured content
Replacement of an RDBMS for web applications
Real-time analytics, high-volume logging, and caching

Not suitable for:

Highly transactional systems
Systems that require foreign keys


Cassandra & HBase

7/31/2016

0 Comments

HBase:

Wide-column store based on Apache Hadoop and on the concepts of BigTable.
Apache HBase is a NoSQL key/value store that runs on top of HDFS.
Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs.
HBase is partitioned into tables, and tables are further split into column families.
A column family keeps all of its columns together in the schema.
Each key/value pair is stored as a cell.
Each key consists of the row key, column family, column, and timestamp.
A row in HBase is a grouping of key/value mappings identified by the row key.
It scales horizontally.
Versioning is available: 3 versions
Supports four operations: put to add or update rows, scan to retrieve a range of cells, get to return cells for a specified row, and delete to remove rows, columns, or column versions from the table.
The schema consists of tables and column families.
Operations use custom queries; with Phoenix, SQL-style operations are possible.
ZooKeeper coordinates the master server, region servers, etc.
The master server monitors all region servers and handles all metadata changes and maintenance.
In CAP terms it provides CP (consistency and partition tolerance).
Optimized for reads; single write master
Range-based scans return rows in key order and can still be used as the cluster scales horizontally.
Does not support secondary indexes natively, but this can be achieved with a trigger on put that keeps the secondary index up to date.

HBase coprocessors support simple aggregations out of the box: SUM, MIN, MAX, AVG, STD. Other aggregations can be built by defining Java classes to perform the aggregation.
Good for real-time analytics and massive data processing
User: Facebook
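The data model described in the list above (cells keyed by row key, column family, column, and timestamp, with a capped number of versions and the four operations) can be imitated with a small in-memory Python class. This is only a toy to make the model concrete. It is not the HBase client API, and the class and method names are invented for this sketch.

```python
import time
from collections import defaultdict


class ToyHBaseTable:
    """In-memory imitation of the HBase data model (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row_key, family, column) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, family, column, value, ts=None):
        """Add or update a cell, keeping only the newest versions."""
        ts = time.time() if ts is None else ts
        versions = self.cells[(row, family, column)]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row):
        """Return the newest value of every cell in one row."""
        return {(f, c): v[0][1] for (r, f, c), v in self.cells.items() if r == row and v}

    def scan(self, start_row, stop_row):
        """Yield rows in key order for start_row <= row < stop_row."""
        rows = sorted({r for (r, _, _) in self.cells if start_row <= r < stop_row})
        for r in rows:
            yield r, self.get(r)

    def delete(self, row):
        """Remove every cell that belongs to the row."""
        for key in [k for k in self.cells if k[0] == row]:
            del self.cells[key]


table = ToyHBaseTable()
for i in range(5):
    table.put("user1", "info", "name", f"v{i}", ts=i)   # only 3 versions survive
table.put("user2", "info", "name", "bob", ts=1)
print(list(table.scan("user1", "user3")))
```

Because rows are visited in sorted key order, the row key design decides which range scans are cheap, which is the main modeling decision in HBase.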

Cassandra:

Wide-column store based on the ideas of BigTable and Dynamo
Cassandra has a decentralized architecture. Any node can perform any operation. It provides AP (availability, partition tolerance) from the CAP theorem.
Cassandra has excellent single-row read performance.
Cassandra does not support range-based row scans.
Cassandra is well suited for single-row queries, or for selecting multiple rows based on a column-value index.
The practical limit on row size in Cassandra is tens of megabytes when data is stored in columns to support range scans.
Rows larger than that cause problems with compaction overhead and time.
Cassandra supports secondary indexes on column families where the column name is known, not on dynamic columns.
Aggregations are not supported by the Cassandra nodes; the client must perform them.
When an aggregation spans multiple rows, random partitioning makes it very difficult. In this case, use Storm or Hadoop for the aggregation.
User: Twitter
Good for log file processing
Symmetric architecture makes it relatively easy to create and scale large clusters
SQL-like Cassandra Query Language eases developers' transition from RDBMS
Allows you to tune for performance or consistency or a balance of both
Community edition of management GUI available
Good documentation (provided by DataStax)
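The tuning between performance and consistency mentioned above usually comes down to how many replicas must answer a read or a write. The sketch below uses the standard quorum rule, which is not stated in this post: a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor. The function name is hypothetical.

```python
def reads_see_latest_write(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """Quorum rule: R + W > N means every read overlaps the latest write."""
    return read_replicas + write_replicas > replication_factor


# Replication factor 3 with quorum (2 of 3) reads and writes: consistent.
print(reads_see_latest_write(3, write_replicas=2, read_replicas=2))
# Replication factor 3 with single-replica reads and writes: faster, but not guaranteed.
print(reads_see_latest_write(3, write_replicas=1, read_replicas=1))
```

Lower replica counts give lower latency and higher availability, which is the AP side of the trade-off.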


Lambda Architecture

7/26/2016

0 Comments

The Lambda Architecture can be used for batch processing, real-time processing, or a combination of both.

It works in three modes:

1. Batch mode: Uses MapReduce; provides semi-aggregated analytics up to near real time
2. Real-time mode: Stream processing (Spark); provides real-time analytics
3. Mixed mode: Integrates both the stream and the batch outputs; the mixed-mode serving layer may pull data from a NoSQL DB
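A tiny Python sketch can show how the mixed mode answers a query: one view is computed from data the batch job has already processed, another from events that arrived since, and the serving step merges them. Plain in-memory lists stand in for MapReduce and Spark here, and all names and sample events are made up for the illustration.

```python
from collections import Counter


def build_view(events):
    """Count page hits; used here for both the batch and the stream layer."""
    return Counter(event["page"] for event in events)


def serve(batch_view, realtime_view):
    """Mixed mode: merge the two views at query time."""
    return batch_view + realtime_view


history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]   # already batch-processed
recent = [{"page": "/home"}, {"page": "/docs"}]                           # arrived since the last batch run

merged = serve(build_view(history), build_view(recent))
print(dict(merged))
```

When the next batch run absorbs the recent events, the real-time view is reset, so each event is counted only once.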

Image reference from MapR


Cloudera & Hortonworks

7/14/2016

0 Comments

Cloudera and Hortonworks: The Similarities

Cloudera as well as Hortonworks are both built upon the same core of Apache Hadoop. As such, they have more similarities than differences.

Both offer enterprise-ready Hadoop distributions. The distributions have stood the test of time as well as consumers, ensuring security and stability. Besides, they provide paid training and services to familiarize the newcomers treading the path of Big Data and Analytics.
Both have established communities that actively participate and help with the problems faced as well as demonstrations needed.
Both distributions have master-slave architecture.
Both have a shared-nothing computing framework.
Both support MapReduce as well as YARN.

Cloudera vs. Hortonworks: The Differences

That being said, the differences are the ones that play a deciding role in choosing one vendor over the other. Broadly, Cloudera and Hortonworks differ in the following aspects:

Cloudera has announced that its long-term goal is to become an “enterprise data hub,” thus diminishing the need for a data warehouse. Hortonworks, on the other hand, remains firmly a provider of a Hadoop distro, and has partnered with data warehousing company Teradata.
While Cloudera CDH can be run on Windows Server, HDP is available as a native component on Windows Server. A Windows-based Hadoop cluster can be deployed on Windows Azure through the HDInsight service.
Cloudera has a proprietary management software Cloudera Manager, SQL query handling interface Impala, as well as Cloudera Search for easy and real-time access of products. Hortonworks has no proprietary software, uses Ambari for management and Stinger for handling queries, and Apache Solr for searches of data.
Cloudera has a commercial license, while Hortonworks has an open-source license. Cloudera also allows the use of its open-source projects free of cost, but the package doesn’t include the management suite Cloudera Manager or any other proprietary software.
Cloudera has a free 60-day trial; Hortonworks is completely free.

Cloudera is the oldest player in the market, with more than 350 customers. But Hortonworks is fast catching up and has made more innovations in the Hadoop ecosystem in the recent past. Cloudera has several enterprise software products overlaid on its open-source distributions to aid consumers, whereas Hortonworks strives to provide a framework consisting only of open-source projects.

* May contain third party collection information
**Information collected from best practices articles


In-Memory DB for Faster Processing

6/30/2016

0 Comments

ACTIONABLE INTELLIGENCE - DATA IN MOTION - DATA AT REST

4/28/2016

0 Comments

So far, most analytics have been based on data at rest. But analytics become more accurate when they are also associated with data in motion (live data). This is possible with the Hortonworks Data Platform, which provides actionable analytics using both live data (data in motion) and data at rest.

Actionable intelligence: How IoT data gets processed.

Hortonworks DataFlow for data in motion and Hortonworks Data Platform for data at rest together create actionable intelligence.


December 15th, 2015

12/15/2015

0 Comments

Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real-time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available. An ideal solution for the Internet of Any Thing (IoAT), HDF enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from the very edge of your network all the way to the core data center.


Data Lake

10/16/2015

0 Comments

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
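The idea of a flat store with identifiers and metadata tags can be sketched in Python as follows. The class, its methods, and the sample tags are invented for this example; a real data lake would keep the payloads in distributed storage and the tags in a catalog.

```python
import uuid


class ToyDataLake:
    """Flat store: every element gets a unique id and a set of metadata tags."""

    def __init__(self):
        self.objects = {}    # object id -> raw payload, kept in native format
        self.metadata = {}   # object id -> set of tags

    def put(self, payload: bytes, tags) -> str:
        object_id = str(uuid.uuid4())
        self.objects[object_id] = payload
        self.metadata[object_id] = set(tags)
        return object_id

    def query(self, required_tags) -> list:
        """Return ids of all elements that carry every required tag."""
        wanted = set(required_tags)
        return [oid for oid, tags in self.metadata.items() if wanted <= tags]


lake = ToyDataLake()
first = lake.put(b"region,amount\nEU,10\n", {"sales", "2015", "csv"})
second = lake.put(b'{"region": "US", "amount": 7}', {"sales", "2014", "json"})
print(lake.query({"sales", "2015"}))
```

The query narrows the lake down to the relevant elements, and only that smaller set is then read and analyzed.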


EDW & DATA LAKE

10/16/2015

0 Comments

BIG DATA + Data warehouse feedback loop

7/13/2015

0 Comments

Compare NoSQL

7/2/2015

0 Comments

NoSQL DB

7/2/2015

0 Comments

Splice Machine

Best suited: Splice Machine can be used on top of Hadoop with HBase. It provides all the features of an RDBMS, and OLAP and OLTP reporting can be done easily. Processing is much faster than with a traditional NoSQL DB.

Example: Real-time data processing, OLTP and OLAP reporting

HBase

Best used: When the data volume is huge and MapReduce is used for the processing

Example: Search engines, analysis of huge data sets

Cassandra

Best used: Huge data storage, real-time processing and analysis with Spark

Example: Real-time log or feed analysis

DynamoDB

Best used: Fast responses and heavy query loads

Used: Best for scalability, latency control, and fast queries

CouchDB

Best used: For collections of occasionally changing data on which predefined queries are run. Best for versioning.

Example: CRM, CMS

MongoDB

Best used: For dynamic queries, and for defining indexes instead of writing map/reduce functions. Best for big DBs.

Example: Bulk storage

Couchbase

Best used: Applications with low-latency data access, high availability, and high concurrency

Example: Web applications with high concurrency, games

Neo4j

Best used: For graph-style, interconnected data

Example: Route search, network topology, maps, etc.


WHAT IS A DATA LAKE AND HOW TO DESIGN ONE

6/8/2015

0 Comments

Data lake:
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
A traditional hierarchical data warehouse stores data in files and folders, whereas a data lake uses a flat architecture to store the data. When a data element is stored in the data lake, it is assigned a unique identifier and its information is stored as metadata. This information can easily be queried for the respective requirement.

The underlying infrastructure used for data storage is basically Hadoop technology. Data is usually loaded from different sources into the data lake using ELT tools (Sqoop, command line, scripts, Talend, Pentaho, etc.). The load is very fast because it runs in parallel and no schema check happens while loading data; the schema is checked only on read.
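The difference between checking the schema at load time and at read time can be shown with a short Python sketch. The load step stores the raw text untouched, and the types are applied only when the data is read. The function names, the sample data, and the schema are all hypothetical.

```python
import csv
import io

raw_store = []   # stands in for files landed in the data lake


def load(raw_text: str) -> None:
    """Load step: store the text as-is, with no schema check."""
    raw_store.append(raw_text)


def read(schema: dict) -> list:
    """Read step: apply the schema (column name -> type) while parsing."""
    rows = []
    for raw_text in raw_store:
        for record in csv.DictReader(io.StringIO(raw_text)):
            rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


load("id,amount\n1,9.50\n2,12.00\n")
rows = read({"id": int, "amount": float})
print(rows)
```

A malformed value would only raise an error inside `read`, which is why loading stays fast and why the same raw data can later be read with a different schema.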

"Data lake" is a marketing term used for a large set of data.

Splice Machine example: Using Splice Machine as an operational DB (photo courtesy of Splice Machine)


Live Analysis using Spark & Hadoop USING MOBILE APPS

5/21/2015

0 Comments

Implement insert, update, and delete in Hive with full ACID support

4/25/2015

0 Comments

Hive 0.14 allows CRUD (create, read, update, delete) operations.
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
DELETE FROM tablename [WHERE expression]
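To make the syntax concrete, the Python snippet below assembles one statement of each kind for a sample table. In Hive 0.14, UPDATE and DELETE work only on tables that are stored as ORC, bucketed, and marked transactional, so the CREATE TABLE statement includes those clauses. The table name, columns, values, and bucket count are made up, and the snippet only builds the HiveQL text; it does not connect to Hive.

```python
def acid_statements(table: str = "customer") -> list:
    """Return sample HiveQL statements for an ACID-enabled table."""
    return [
        # ACID tables must be bucketed, stored as ORC, and flagged transactional.
        f"CREATE TABLE {table} (id INT, name STRING, city STRING) "
        f"CLUSTERED BY (id) INTO 4 BUCKETS STORED AS ORC "
        f"TBLPROPERTIES ('transactional'='true')",
        f"INSERT INTO TABLE {table} VALUES (1, 'Asha', 'Pune'), (2, 'Ben', 'Leeds')",
        f"UPDATE {table} SET city = 'London' WHERE id = 2",
        f"DELETE FROM {table} WHERE id = 1",
    ]


statements = acid_statements()
for statement in statements:
    print(statement + ";")
```

The UPDATE changes `city` rather than `id` because the bucketing column cannot be updated.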

