[AWS Certificate] - Amazon Redshift

Amazon Redshift - Overview

 

  • OLAP database (Data warehousing solution) based on PostgreSQL
  • OLAP = Online Analytical Processing
  • Can query petabytes of structured and semi-structured data across your data warehouse and your data lake using standard SQL
  • Up to 10x better performance than other data warehouses
  • Columnar storage of data (instead of row based)
  • Massively Parallel Processing (MPP) query execution, highly available
  • Has a SQL interface for performing the queries
  • BI tools such as AWS Quicksight or Tableau integrate with it


Redshift Architecture

 

 

  • Massively parallel columnar database, runs within a VPC
  • Single leader node and multiple compute nodes
  • You can connect to Redshift from any application that supports JDBC or ODBC drivers for PostgreSQL
  • Clients query the leader node through a SQL endpoint
  • A job is distributed across the compute nodes
  • Compute nodes partition the job into slices
  • The leader node then aggregates the results and returns them to the client

Redshift node types

 

  • Dense compute nodes (DC2)
    • For compute-intensive DW workloads with local SSD storage
  • Dense storage nodes (DS2)
    • For large DWs, uses hard disk drives (HDDs)
  • RA3 nodes with managed storage
    • For large DWs, uses large local SSDs
    • Recommended over DS2
    • Automatically offloads data to S3 if the node's local storage fills up
    • Compute and managed storage are billed independently

Loading data into Redshift

  • Typically, data from OLTP systems is loaded into Redshift for analytics and BI purposes
    • Data from OLTP systems can be loaded into S3 and data from S3 can then be loaded into Redshift
    • Data from Kinesis Firehose can also be loaded in the same way
  • COPY command
    • Loads data from files stored in S3 into Redshift
    • Data is stored locally in the Redshift cluster (persistent storage = cost)
    • DynamoDB table data and EMR data can also be loaded using COPY command
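
For instance, a hedged sketch of a COPY from a DynamoDB table (table names, role ARN, and READRATIO value below are illustrative only):

copy favoritemovies from 'dynamodb://ProductCatalog'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;  -- share of the DynamoDB table's provisioned read capacity to consume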

 

Loading data from S3 with COPY command

 

copy users from 's3://my_bucket/tickit/allusers_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::0123456789:role/MyRedshiftRole'
delimiter '|' region 'us-west-2'

  • Create an IAM Role
  • Create your Redshift cluster
  • Attach the IAM role to the cluster
  • The cluster can then temporarily assume the IAM role on your behalf
  • Load data from S3 using COPY command

 


Querying external data with Redshift

 

  • Two ways
    • Redshift Spectrum
    • Redshift Federated Query

Redshift Spectrum

 

  • Query exabytes of data from S3 without loading it into Redshift
  • Must have a Redshift cluster available to start the query
  • The query is then submitted to thousands of Redshift Spectrum nodes
  • External table structure/schemas can be created in external data catalog like Athena / Glue / Apache Hive metastore (EMR)
  • These external tables are read-only (insert / update / delete operations not possible)
  • Redshift cluster and S3 bucket must be in the same region 
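
A minimal sketch (schema name, role ARN, bucket path, and columns are hypothetical): register an external schema backed by a data catalog, then define a read-only external table over S3:

create external schema spectrum_schema
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::0123456789012:role/MySpectrumRole'
create external database if not exists;

create external table spectrum_schema.sales (
    sale_id   integer,
    amount    decimal(10,2),
    sale_date date)
row format delimited fields terminated by '|'
stored as textfile
location 's3://my-bucket/sales/';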

Redshift Federated Query

 

  • Query and analyze data across different DBs, DWs, and data lakes
  • Currently works with Redshift, PostgreSQL on RDS, Aurora PostgreSQL, and S3
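
A hedged sketch of setting one up against RDS PostgreSQL (endpoint, database, role, and secret ARNs are placeholders); the resulting external schema can then be joined with local Redshift tables in ordinary SQL:

create external schema rds_pg
from postgres
database 'mydb' schema 'public'
uri 'my-rds-instance.abc123.us-west-2.rds.amazonaws.com' port 5432
iam_role 'arn:aws:iam::0123456789012:role/MyFederatedRole'
secret_arn 'arn:aws:secretsmanager:us-west-2:0123456789012:secret:my-rds-secret';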


Star Schema

 

  • Typical way of organizing data in a data warehouse
    • Two types of table - fact table and dimension tables
    • A star pattern consists of a fact table and multiple dimension tables
    • Fact table has foreign key relationships with multiple dimension tables
    • Dimension tables are generally small (fewer records, but often with many fields)
  • Recommended to store:
    • smaller dimension tables as local Redshift tables
    • larger fact tables as external Spectrum tables
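
For illustration (hypothetical tables), a small dimension table kept local and a fact table keyed to it:

create table dim_date (            -- small dimension table, stored locally
    date_id       integer primary key,  -- informational, not enforced
    calendar_date date,
    month_name    varchar(20));

create table fact_sales (          -- fact table; could live in Spectrum if very large
    sale_id bigint,
    date_id integer references dim_date(date_id),  -- informational FK (see next section)
    amount  decimal(10,2));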

(Figure: Example of a Star Schema)

Redshift Key Constraints

 

  • Redshift does not support indexes (clustered or non-clustered)
  • Has a concept of sort keys
    • data is stored on disk in a sorted order by sort key
  • Primary key, unique key, and foreign key constraints are not enforced (they are informational only)
  • The query optimizer uses these constraints to generate more efficient query plans
  • You can enforce PK/FK relationships through your application
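
A quick illustration (hypothetical table) of constraints being informational:

create table users_demo (
    user_id int primary key);       -- declared, but not enforced

insert into users_demo values (1);
insert into users_demo values (1);  -- succeeds: Redshift allows the duplicate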

 

Redshift Table Design

 

  • Key factors for efficient table design
    • Data distribution - how data is distributed across nodes
    • Sort strategies - how data is sorted in the tables
    • Compression - to reduce storage and I/O needs


Redshift Data Distribution

 

  • Data is stored in columns (as opposed to rows in a typical OLTP database)
  • Data corresponding to a given field (across rows) is stored together and can be queried easily
  • Data distribution refers to how data is distributed across nodes in the clusters
  • When creating a Redshift table, you specify a distribution style
  • EVEN - Spreads data evenly across all nodes in the cluster (default option, decent performance)
  • ALL - Table data is put on each node (good for smaller dimension tables, or for frequently used tables that are heavily joined)
  • KEY - Rows with the same DISTKEY column value are placed on the same node (good for known table relationships)
  • AUTO - Initially assigns ALL to a small table, then changes to EVEN as the table grows
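
A sketch of how a style is declared at table creation (table and column names hypothetical):

create table orders (
    order_id    bigint,
    customer_id integer,
    amount      decimal(10,2))
diststyle key
distkey (customer_id);   -- co-locates rows with the same customer_id

create table dim_region (
    region_id   integer,
    region_name varchar(50))
diststyle all;           -- full copy of the table on every node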

Redshift Sort Styles

 

  • Single-column sort key (e.g. Dept)
  • Compound sort key
    • more than one column as sort key
    • e.g. Dept + Location
    • Hierarchical (order of the column in the sort key is important)
  • Interleaved sort key
    • gives equal weight to each column (or subset of columns) in the sort key
    • In effect, you can have multiple sort key combinations
    • e.g. Dept + Location, Location + Dept
  • Must be defined at the table creation time
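
For example (hypothetical table), the two multi-column variants are declared like this:

create table events_compound (
    dept       varchar(20),
    location   varchar(20),
    event_time timestamp)
compound sortkey (dept, location);     -- column order matters

create table events_interleaved (
    dept       varchar(20),
    location   varchar(20),
    event_time timestamp)
interleaved sortkey (dept, location);  -- equal weight to each column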


VACUUM operation in the tables

  • As you delete rows from or add rows to a sorted table, performance may deteriorate over time
  • VACUUM operation re-sorts rows in one or all DB tables
  • And reclaims space from table rows marked for deletion
  • Redshift automatically sorts data and runs VACUUM DELETE in the background
  • You need not run it manually, but you can if required
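
If you do run it manually, the common variants look like this (table name hypothetical):

vacuum full sales;         -- re-sorts rows and reclaims deleted space (the default)
vacuum sort only sales;    -- re-sorts without reclaiming space
vacuum delete only sales;  -- reclaims space without re-sorting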

 


Redshift Compression

 

  • Column-level operation to reduce size of stored data
  • Reduces disk I/O and improves query performance
  • Compression type = encoding
  • Each column can be separately compressed with different encodings (manually / automatically)
  • COPY command applies compression by default
  • RAW = No compression (default for sort keys)
  • LZO = Very high compression with good performance (default encoding)
  • Encoding cannot be changed after table creation
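
A sketch of per-column encodings at table creation (names and encoding choices are illustrative):

create table sales_enc (
    sale_id integer encode raw,     -- sort key columns typically stay RAW
    amount  decimal(10,2) encode az64,
    notes   varchar(255) encode lzo)
sortkey (sale_id);

analyze compression sales_enc;      -- suggests encodings from a sample of rows (table must contain data)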

Redshift Workload Management (WLM)

 

  • Helps you prioritize workloads
  • Can prevent long-running queries from impacting short-running ones
  • Two modes - Automatic WLM and Manual WLM
  • Automatic WLM supports queue priorities
  • SQA (Short Query Acceleration)
    • prioritizes selected short-running queries without creating a dedicated queue
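
You can inspect the effective queue configuration from SQL; a sketch (user-defined queues start at service class 6):

select service_class, name, num_query_tasks
from stv_wlm_service_class_config
where service_class >= 6;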

Modifying the WLM configuration

 

  • Switching between Manual and Auto WLM modes requires a cluster reboot
  • WLM uses parameter groups for its config
  • WLM configuration properties are either dynamic or static
  • Dynamic property changes do not require cluster reboot
  • Static property changes require cluster reboot

 

Redshift Concurrency Scaling

  • Automatically scales cluster capacity to support increase in concurrent reads
  • Can scale to a virtually unlimited # of queries
  • It's a dynamic parameter in the WLM queue config
  • To enable, set Concurrency Scaling mode = auto for the given WLM queue
  • Can continue R/W during concurrency scaling
  • You get one hour of free concurrency scaling credits per day

Scaling in Redshift

  • Elastic Resize
    • Add/Remove nodes to/from an existing cluster in minutes (4 to 8 mins)
    • It's a manual process; the cluster is unavailable during the resize operation
  • Classic resize
    • Change node type, the number of nodes, or both 
    • Copies data to a new cluster
    • Source cluster will be in read-only mode during resize op
  • Snapshot, restore, and classic resize
    • Copy the cluster and resize the copied cluster
    • Source cluster remains available throughout
    • Manually copy the delta to the new cluster.

Redshift Backup and Restore

 

  • Redshift maintains at least three copies of your data - the original, replica on the compute nodes, and a backup in S3
  • Snapshots are point-in-time backups of a cluster, stored internally in S3 
  • Snapshots are incremental (only what has changed is saved)
  • You can restore a snapshot into a new cluster
  • Automated - every 8 hours, every 5 GB of data change, or on a schedule; you set the retention period
  • Manual - snapshot is retained until you delete it
  • Can configure Redshift to automatically copy snapshots (automated or manual) to another AWS Region

 


Copying Redshift snapshots to another Region

 

  • For unencrypted snapshots
    • configure cross-region snapshots
  • For encrypted snapshots
    • Configure cross-region snapshots and additionally specify a snapshot copy grant
    • snapshot copy grant requires a name and a KMS key and gets created in the destination region


Redshift Multi-AZ Deployment

 

  • Multi-AZ Deployments are not supported
  • Alternative way:
    • Run Redshift clusters in multiple AZs by loading same data into different AZs
    • Restore a cluster snapshot to a different AZ
    • Use Spectrum to read S3 data from Redshift clusters spread across AZs


Redshift Availability and Durability

 

  • Drive failure
    • Redshift cluster remains available (with decline in query performance)
    • Transparently uses a replica stored on other drives within the node
    • Single node clusters do not support data replication (must restore from snapshot on S3)
  • Node failure
    • Automatically detects and replaces a failed node
    • Cluster remains unavailable until node is replaced
  • AZ outage
    • Cluster remains unavailable until AZ outage resolves, data is preserved
    • Can restore from snapshot to another AZ within region (frequently accessed data is restored first)

 

Redshift Security - Encryption

 

  • Encryption at rest - uses hardware-accelerated AES-256 encryption
  • Can use your own KMS keys or HSMs (hardware security modules)
  • Encryption in transit - uses SSL connections
  • API requests must be signed using SigV4 process
  • Spectrum supports S3's Server-Side Encryption (SSE)
  • Can use client-side encryption using CMK before sending data to S3

 


Redshift Security - IAM & Network

 

  • Access to Redshift resources is controlled at four levels
    • Cluster management - using IAM policies
    • Cluster connectivity - using cluster security groups / VPC
    • Database access - using DB user accounts/groups and permissions
      • CREATE USER / GROUP statements
      • GRANT / REVOKE statements
    • IAM Authentication (temporary credentials and SSO)
  • Compute nodes can only be accessed via leader node (not directly)
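
A sketch of the database-access layer (user, group, and table names hypothetical):

create user analyst_user password 'Str0ngPassw0rd1';
create group analysts with user analyst_user;
grant select on table sales to group analysts;
revoke all on table sales from public;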

 

Enhanced VPC Routing in Redshift

 

  • To manage data traffic between your Redshift cluster and other resources
  • All COPY / UNLOAD traffic can go through VPC 
  • If disabled, all traffic is routed through the internet (including traffic within AWS)
  • S3 bucket within the same region - use VPC endpoints (S3 endpoints)
  • S3 bucket in a different region (or another service within the AWS network) - use a NAT gateway
  • To connect to AWS services outside your VPC - use an Internet Gateway (IGW)

 


Redshift Spectrum with Enhanced VPC Routing

 

  • Redshift Spectrum doesn't use enhanced VPC routing by default
  • Additional config is required
  • Modify your cluster's IAM role to allow traffic to the S3 bucket from Redshift Spectrum
  • Modify the policy attached to the S3 bucket
    • Spectrum can't access S3 buckets with policies that restrict access to only specific VPC endpoints
    • Use a bucket policy that restricts access to only specific principals (users / accounts)
  • Configure your VPC to allow your cluster to access AWS Glue / Athena

 


Redshift Monitoring

 

  • Integrates with CloudWatch
  • CloudWatch metrics
    • Accessible within Redshift console (or in CloudWatch)
    • Monitor CPU utilization, latency, throughput, etc.
  • Query and Load performance
    • Not published as CloudWatch metrics
    • Displayed only in the Amazon Redshift console
    • Cluster performance, query history, DB performance, concurrency scaling data, etc.
  • Redshift Advisor
    • offers recommendations for cost optimization and performance tuning

Database audit logging in Redshift

 

  • Audit logs (stored in S3 buckets, must be enabled)
    • Connection log - connections / disconnections / auth attempts
    • User logs - changes to user info
    • User activity log - logs each query
  • STL / SVL tables
    • Similar to audit logs
    • Available by default on every node in the cluster
    • STL_CONNECTION_LOG logs connections / disconnections / auth attempts
  • CloudTrail - captures all Redshift API calls as events
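
For example, a hedged query over the connection log (column choice assumes the standard STL_CONNECTION_LOG layout):

select event, recordtime, username, remotehost
from stl_connection_log
order by recordtime desc
limit 20;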

Redshift Pricing

 

  • You pay only for what you use
    • Compute node hours
    • Backup storage - manual and automated
    • Data transfer (except data transfer between Redshift and S3 within the same region)
    • Spectrum data scanned - amount of S3 data scanned by your query
    • Managed Storage (per GB-month rate, only for RA3 node types)
    • Concurrency Scaling - per-second on-demand rate (in excess of free credits)
  • No charges for Spectrum when you're not running queries