Getting Started with Amazon RedShift

For many years, the data warehouse was confined to the biggest players in the industry. Mid-sized and smaller companies found it very difficult even to consider a data warehouse, for reasons such as cost, technology and infrastructure. A data warehouse was considered a luxury, but that is no longer so. Amazon has changed the game.

Servers with high performance, high scale and high availability have moved to the cloud. This is the period in which data, including high-volume data, is moving to the cloud for easy storage, cost effectiveness, high-scale computing and data crunching. In line with this, Amazon Web Services offers a petabyte-scale data warehouse on the cloud called Amazon RedShift. It was announced at the AWS re:Invent event in 2012.

Brief Intro About RedShift

  • Petabyte-scale data warehouse
  • Columnar Based
  • MPP – Massively Parallel Processing Architecture
  • Advanced Compression
  • Easy to Scale up or down
  • Connectivity using popular BI and ETL tools like MicroStrategy and JasperSoft
  • Currently available only in the US East (Virginia) data center
  • Can be provisioned inside a VPC
  • An XLarge cluster can have 1 node (single-node cluster) or 2 to 16 nodes (multi-node cluster)
  • An 8XLarge cluster can have 2 to 16 nodes (multi-node cluster)
  • A leader node is provisioned automatically, in addition to the regular nodes, as the entry point for connections and operations
  • The leader node is not charged and does not appear on the bill
  • Multi-AZ option is not available right now
  • All nodes are placed in a single AZ, but the AZ is selected by Amazon
  • Can be connected with existing AWS data components like Data Pipeline, Elastic MapReduce, DynamoDB, S3 etc.

Components under the Management Console for RedShift

  • Clusters
  • SnapShots
  • Security Groups
  • Parameter Groups
  • Subnet Groups
  • Reserved Nodes
  • Events

How Powerful is RedShift Cluster

Amazon RedShift offers 2 node types for provisioning a cluster. The Extra Large (XL) node comes with 2 TB of compressed storage, and a cluster can have from 1 to 32 nodes. The Eight Extra Large (8XL) node comes with 16 TB of compressed storage, and a cluster can have from 2 all the way up to 100 nodes.

High Storage Extra Large (XL) DW Node:

  • CPU : 2 Virtual Cores – Intel Xeon E5
  • ECU : 4.4
  • Memory : 15 GB
  • Storage : 3 HDD with 2TB of local attached storage
  • Network : Moderate
  • Disk I/O : Moderate
  • API : dw.hs1.xlarge

High Storage Eight Extra Large (8XL) DW Node : 

  • CPU : 16 Virtual Cores – Intel Xeon E5
  • ECU : 35
  • Memory : 120 GB
  • Storage : 24 HDD with 16 TB of local attached storage
  • Network : 10 Gigabit Ethernet with support for placement groups
  • Disk I/O : Very High
  • API : dw.hs1.8xlarge

Cluster with Extra Large Node (XL)

An Extra Large node cluster can be a single-node cluster or a multi-node cluster of 2 to 16 nodes. Each node is equipped with the resources listed above: 15 GB of memory, 2 TB of locally attached storage, and so on.

Cluster with Eight Extra Large Node (8XL)

An 8XL cluster is always a multi-node cluster, ranging from 2 to 100 nodes. Each node is equipped with the resources listed above: 120 GB of memory, 16 TB of locally attached storage, and so on.

PS: The existing management console only shows node counts from a minimum of 2 to a maximum of 16. You can actually go beyond 16, all the way up to 100. Amazon has done this deliberately to avoid over-provisioning. You need to fill in a form or contact Amazon to increase your limit.
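To get a feel for what these node counts add up to, here is a minimal Python sketch that multiplies the per-node figures from the two spec lists above. The names `NodeSpec`, `NODE_SPECS` and `cluster_capacity` are made up for this illustration, the node limits are the 1 to 32 (XL) and 2 to 100 (8XL) ranges quoted earlier, and the leader node is not counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources, copied from the spec lists in this post."""
    api_name: str
    virtual_cores: int
    memory_gb: int
    storage_tb: int
    min_nodes: int
    max_nodes: int


# Limits are the hard ranges quoted in the post; the console default
# of 16 nodes is a soft limit and is not modelled here.
NODE_SPECS = {
    "XL": NodeSpec("dw.hs1.xlarge", 2, 15, 2, 1, 32),
    "8XL": NodeSpec("dw.hs1.8xlarge", 16, 120, 16, 2, 100),
}


def cluster_capacity(node_size: str, node_count: int) -> dict:
    """Return the aggregate cores, memory and storage of a cluster."""
    spec = NODE_SPECS[node_size]
    if not spec.min_nodes <= node_count <= spec.max_nodes:
        raise ValueError(
            f"{node_size} clusters take {spec.min_nodes}-{spec.max_nodes} nodes"
        )
    return {
        "node_type": spec.api_name,
        "virtual_cores": spec.virtual_cores * node_count,
        "memory_gb": spec.memory_gb * node_count,
        "storage_tb": spec.storage_tb * node_count,
    }


if __name__ == "__main__":
    print(cluster_capacity("XL", 16))
    print(cluster_capacity("8XL", 100))
```

A 100 node 8XL cluster works out to 1600 TB of storage, which is where the petabyte-scale claim comes from.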

Costing

As stated above, Amazon RedShift is currently offered only in US East 1 (the Virginia data center), so the pricing below applies to that region only.

Node Size                             Cost Per Hour Per Node
XL Node (2 TB storage per node)       $0.850
8XL Node (16 TB storage per node)     $6.800

Reserved Instance Costing

Since data warehouses typically run for years together, it is both sensible and cost effective to make a long-term commitment with Amazon Reserved Instances and get a cheaper running cost for the clusters.

1 Year Reserved Instance Pricing

                                                      XL Node                            8XL Node
Upfront Investment                                    $2500                              $20000
Discounted Hourly Running Cost                        $0.215                             $1.720
Regular Hourly Running Cost                           $0.850                             $6.800
Total Yearly Running Cost with Reserved Capacity      $4383.40                           $35067.20
                                                      ($2500 + 365 x 24 x $0.215)        ($20000 + 365 x 24 x $1.720)
Total Yearly Running Cost without Reserved Capacity   $7446                              $59568
                                                      (365 x 24 x $0.850)                (365 x 24 x $6.800)
Annual Savings                                        $3062.60                           $24500.80

3 Year Reserved Instance Pricing

                                                      XL Node                            8XL Node
Upfront Investment                                    $3000                              $24000
Discounted Hourly Running Cost                        $0.114                             $0.912
Regular Hourly Running Cost                           $0.850                             $6.800
Total Yearly Running Cost with Reserved Capacity      $1998.64                           $15989.12
                                                      ($3000/3) + (365 x 24 x $0.114)    ($24000/3) + (365 x 24 x $0.912)
Total Yearly Running Cost without Reserved Capacity   $7446                              $59568
                                                      (365 x 24 x $0.850)                (365 x 24 x $6.800)
Annual Savings                                        $5447.36                           $43578.88
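
The arithmetic behind these figures is easy to reproduce. The following Python sketch uses the per-node rates listed above and spreads the upfront payment evenly over the term; the function names are made up for this illustration, and the rates are the US East prices quoted in this post, not a live price list.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, the same basis used in the pricing figures


def on_demand_yearly(regular_rate: float) -> float:
    """Yearly cost of one node at the regular hourly rate."""
    return round(HOURS_PER_YEAR * regular_rate, 2)


def reserved_yearly(upfront: float, term_years: int, reserved_rate: float) -> float:
    """Yearly cost of one reserved node: upfront share plus discounted hours."""
    return round(upfront / term_years + HOURS_PER_YEAR * reserved_rate, 2)


def annual_savings(upfront: float, term_years: int,
                   reserved_rate: float, regular_rate: float) -> float:
    """Difference between the on-demand and the reserved yearly cost."""
    saved = on_demand_yearly(regular_rate) - reserved_yearly(
        upfront, term_years, reserved_rate
    )
    return round(saved, 2)


if __name__ == "__main__":
    # (label, upfront, term in years, reserved rate, regular rate)
    offers = [
        ("XL, 1 year", 2500, 1, 0.215, 0.850),
        ("8XL, 1 year", 20000, 1, 1.720, 6.800),
        ("XL, 3 years", 3000, 3, 0.114, 0.850),
        ("8XL, 3 years", 24000, 3, 0.912, 6.800),
    ]
    for label, upfront, term, reserved, regular in offers:
        print(label,
              reserved_yearly(upfront, term, reserved),
              on_demand_yearly(regular),
              annual_savings(upfront, term, reserved, regular))
```

To cost a whole cluster, multiply the per-node result by the number of nodes; the leader node is not billed.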

Scaling in RedShift

To begin with, you can start with an XL single node (supports up to 2 TB) and grow out to an XL multi-node cluster of up to 16 nodes (2 TB per node, so up to 32 TB). On reaching a bottleneck in performance or storage, you can scale out to 8XL instances at any time. All the scaling patterns, i.e. changes in node count and node size, are possible as shown in the diagram.

While scaling out or scaling in, there is a short period during which the cluster is in read-only mode. Once the new cluster with the new infra settings is ready, it returns to its normal operational state. Essentially, a copy operation is performed during the transition from the old cluster configuration to the new one.
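
If you would rather script a resize than click through the console, the sketch below shows one way to do it in Python. It assumes the boto3 SDK, which this post does not otherwise use, and its Redshift `modify_cluster` call; `build_resize_request` and `resize_cluster` are hypothetical helper names, and the client is passed in so the request can be inspected without touching AWS.

```python
def build_resize_request(cluster_id: str, node_type: str, node_count: int) -> dict:
    """Build the keyword arguments for a Redshift modify_cluster call."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    request = {"ClusterIdentifier": cluster_id, "NodeType": node_type}
    if node_count == 1:
        # A single-node cluster takes no NumberOfNodes parameter.
        request["ClusterType"] = "single-node"
    else:
        request["ClusterType"] = "multi-node"
        request["NumberOfNodes"] = node_count
    return request


def resize_cluster(client, cluster_id: str, node_type: str, node_count: int):
    """Request the resize; the cluster is read-only while data is copied."""
    return client.modify_cluster(
        **build_resize_request(cluster_id, node_type, node_count)
    )


if __name__ == "__main__":
    # "my-dw-cluster" is a placeholder identifier.
    print(build_resize_request("my-dw-cluster", "dw.hs1.8xlarge", 4))
    # With real credentials (assumption: boto3 is installed and configured):
    #   import boto3
    #   client = boto3.client("redshift", region_name="us-east-1")
    #   resize_cluster(client, "my-dw-cluster", "dw.hs1.8xlarge", 4)
```

The node type strings are the API names listed earlier in this post; check the current documentation before using them, since the available node types may have changed.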

Back Up and Storage

RedShift supports both manual and automated (scheduled) backups of the node clusters. The backup is placed in Amazon S3 (Simple Storage Service) and can be restored to a cluster at any time. With object expiry rules in place for S3, you can schedule archival or migration to Amazon Glacier after a few days, say 30 days. With this, the entire setup of RedShift, S3 and Glacier becomes fully automated and self-serviced.
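
A manual backup can also be triggered from a script. The sketch below again assumes the boto3 SDK and its Redshift `create_cluster_snapshot` call; the naming scheme and the helper names `snapshot_name` and `take_manual_snapshot` are only illustrative.

```python
from datetime import datetime, timezone


def snapshot_name(cluster_id: str, now: datetime) -> str:
    """Derive a timestamped snapshot identifier (hypothetical naming scheme)."""
    return f"{cluster_id}-manual-{now.strftime('%Y%m%d-%H%M%S')}"


def take_manual_snapshot(client, cluster_id: str, now: datetime = None) -> str:
    """Ask Redshift for a manual snapshot and return the identifier used."""
    if now is None:
        now = datetime.now(timezone.utc)
    name = snapshot_name(cluster_id, now)
    client.create_cluster_snapshot(
        SnapshotIdentifier=name,
        ClusterIdentifier=cluster_id,
    )
    return name


if __name__ == "__main__":
    # "my-dw-cluster" is a placeholder identifier.
    print(snapshot_name("my-dw-cluster", datetime(2013, 3, 1, 12, 0, 0)))
    # With real credentials (assumption: boto3 is installed and configured):
    #   import boto3
    #   client = boto3.client("redshift", region_name="us-east-1")
    #   take_manual_snapshot(client, "my-dw-cluster")
```

Passing the client in as an argument keeps the function easy to try out with a stand-in object before pointing it at a real cluster.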

Connectivity

You can connect to RedShift via standard JDBC/ODBC drivers and tools like SQL Workbench/J. Amazon has good documentation on how to do that here.
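
For tools such as SQL Workbench/J you will need a JDBC URL for the cluster. The small Python helper below builds one, assuming the PostgreSQL-style URL format and RedShift's default port of 5439; the endpoint and database name in the example are placeholders, not a real cluster.

```python
def jdbc_url(endpoint: str, database: str, port: int = 5439) -> str:
    """Build a PostgreSQL-style JDBC URL for a cluster endpoint."""
    if not endpoint or not database:
        raise ValueError("endpoint and database are required")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:postgresql://{endpoint}:{port}/{database}"


if __name__ == "__main__":
    # Placeholder endpoint: copy the real one from the cluster's page in the console.
    print(jdbc_url("examplecluster.abc123.us-east-1.redshift.amazonaws.com", "dev"))
```

The exact URL prefix depends on the driver you install, so treat the format above as an assumption and confirm it against the driver's documentation.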

Apart from the regular connectivity, Amazon RedShift has partners providing tool support, such as Actuate, Birst, Jaspersoft, MicroStrategy, Pentaho, Pervasive, Tableau, Attunity, Informatica and Talend. The complete list of partners can be found here.

Update

Everything written above is also covered in a SlideShare presentation by Dr. Matt Wood of AWS.
