본문 바로가기

AWS Database/AWS RDS & Aurora

[AWS Certificate]-Amazon RDS Replication & DR

Multi-AZ Deployments in RDS

  • For high availability, data durability and fault-tolerance (not used for scaling)
  • Offers SYNC replication to standby instance in another AZ over low latency links
  • Performs automatic failover to standby instance in another AZ in case of planned or unplanned outage
  • Uses DNS routing to point to the new master (no need to update connection strings)
  • Failover time (RTO) are typically 60-120 seconds (minimal downtime)
  • Backups are taken from standby instead of primary to ensure performance level during backup activity
  • Recommended for production use cases
  • To force a failover or simulate AZ-failure, reboot the master instance and choose Reboot with failover

RDS Read Replicas

  • Read-only copies of master(primary) DB instance
  • Up to 5 Read Replicas
  • Within AZ, Cross AZ or Cross Region
  • Replication is ASYNC, so reads are eventually consistent
  • Applications must update the connection string to leverage read replicas

RDS Read Replicas

  • Boost DB performance and durability
  • Useful for scaling of read-heavy workloads
  • Can be promoted to primary (complements Multi-AZ)
  • To create a replica, you must enable automatic backups with at least one day retention period
  • Replica can be Multi-AZ (= a replica with its own standby instance)

Multi-AZ Replicas in RDS


RDS Read Replicas as Multi-AZ

  • Supported for MySQL / MariaDB / PostgreSQL / Oracle
  • Works as a DR target. When promoted to primary, it works as Multi-AZ
  • There's added network cost when data goes from one AZ to another

RDS Read Replicas - Use Case

  • You have a production database that is taking on normal load
  • You want to run a reporting application to run some analytics
  • You create a Read Replica to run the new workload there
  • The production application is unaffected
  • Read replicas are used for SELECT (=read) only kind of statements (not INSERT, UPDATE, DELETE)

Promoting a Read Replica to a Standalone DB Instance

  • Promoted instance is rebooted and becomes an independent DB instance (separate from its source)
  • Will no longer work as a replica. Does not affect other replicas of the original DB instance
  • You cannot promote a replica to standalone instance while a backup is running

Promoting a Read Replica to a Standalone DB Instance - Use cases

  • Use as a DR strategy
  • Avoid performance penalty of DDL operations (like rebuilding indexes)
    • perform DDL ops on a read replica and promote it to a standalone instance. Then point your app to this new instance.
  • Sharding (splitting a large DB into multiple smaller DBs)

--- Demo ---

 


Enabling writes on a read replica

  • For MySQL / MariaDB read replica, set the parameter read_only = 0 for the read replica to make it writable
  • You can then perform DDL operations on the read replica as needed without affecting the source DB
  • Actions taken on the read replica don't affect the performance of the source DB instance
  • You can then promote the replica to a standalone DB

RDS Read Replica Capabilities

  • Can create multiple read replicas in q uick succession
  • Can use DB snapshot to perform PITR of a Read Replica
  • Can create a replica from an existing replica
    • reduces replication load from the master DB instance
    • second-tier replica can have higher replication lag

Demo


Cross-Region Read Replicas in RDS

  • Supported for MariaDB, MySQL, Oracle, and PostgreSQL
  • Not supported for SQL Server
  • Advantages
    • Enahanced DR capability
    • Scale read operations closer to the end-users
  • Limitations
    • Higher replica lag times
    • AWS does not guarantee more than five cross-region read replica instances

RDS replicas with an external database

  • Replication b/w an external DB and an RDS replica
  • Supported for MySQL / MariaDB engines 
  • Two ways
    • Binlog replication
    • GTID based Replication


RDS Disaster Recovery Strategies

  • To ensure business continuity despite unexpected failures/events
  • Multi-AZ is not enough (it can't protect from logical DB corruption, malicious attacks etc.)
  • Key metrics for DR plan - RTO and RPO
  • RDS PITR offers RPO of 5 minutes (typically)
  • RTO (Recovery time objective)
    • How long it takes you to recover after a disaster
    • Expressed in hours
  • RPO (Recovery point objective)
    • How much data you could lose due to a disaster
    • Expressed in hours (e.g. RPO of 1 hour means you could lose an hour worth of data)

Comparing RDS DR Strategies

  RTO RPO Cost Scope
Automated backups Good Better Low Single Region
Manual snapshots Better Good Medium Cross-Region
Read replicas Best Best High Cross-Region
  • Replica lag - the amount of time that the replica is behind the source DB
  • Replica lag can impact your recovery
  • Failover to an RDS read replica is a manual process (not automated)
  • A good DR plan should include a combination of backups, replicas and Multi-AZ/Multi-region deployment

Troubleshooting high replica lag

  • Asynchronous logical replication typically results in replica lag
  • You can monitor ReplicaLag metrics in CloudWatch
  • ReplicaLag metric reports Seconds_Behind_Master values
  • Replication deplays can happen due to:
    • Long-running queries on the primary instance (slow query log can help)
    • Insufficient instance class size or storage
    • Parallel queries executed on the primary instance

Troubleshooting replication errors

Recommendations:

  • Size the replica to match the source DB (storage size and DB instance class)
  • Use compatible DB parameter group settings for source DB and replica
  • Ex.max_allowed_packet for read replica must same as that of the source DB instance
  • Monitor the Replication State field of the replica instance
  • If Replication State = Error, then see error details in the Replication Error field
  • Use RDS event notifications to get alerts on such replica issues
  • Writing to tables on a read replica
    • Set read_only=0 to make read replica writable
    • Use only for maintenance tasks (like creating indexes only on replica)
    • If you write to tables on read replica, it might make it incompatible with source DB and break the replication
    • So set read_only=1 immediately after completing mainetance tasks
  • Replication is only supported with transactional storage engines like InnoDB. Using engines like MyISAM will cause replication errors
  • Using unsafe nondeterministic queries such as SYSDATE() can b reak replication
  • You can either skip replication errors (if its not a major one) or delete and recreate the replica

Troubleshooting MySQL read replica issues

  • Errors or data inconsistencies b/w source instance and replica
    • Can happen due to binlog events or InnoDB redo logs aren't flushed during a replica or source instance failure
    • Must manually delete and recreate the replica
  • Preventive recommendations:
    • sync_binlog=1
    • innodb_flush_log_at_trx_commit=1
    • innodb_support_xa=1
  • These settings might reduce performance (so test before moving to production)

Performance hit on new read replicas

  • RDS snapshots are EBS snapshots stored in S3
  • When you spin up a new replica, its EBS volume loads lazily in the background
  • This results in first-touch penalty (when you query any data, it takes longer to retrieve it for the first time)
  • Suggestions:
    • If DB is small, run "SELECT * FROM <table>" query on each table on the replica
    • Initiate a full table scan with VACUUM ANALYZE (in PostgreSQL)
  • Another reason could be an empty buffer pool (cache for table and index data)

Scaling in RDS

  • Vertical Scaling (Scaling up)
    • Single-AZ instance will be unavailable during scaling op
    • Multi-AZ setup offers minimal downtime during scaling op-standby DB gets upgraded first and then primary will failover to the upgraded instance

  • Horizontal Scaling (Scaling out)
    • Useful for read-heavy workloads
    • Use read-replicas
    • Replicas also act as a DR target

Sharding in RDS

  • Sharding = horizontal partitioning
  • Split and distribute data across multiple DBs (called shards)
  • Mapping / routing logic maintained at application tier
  • Offers additional fault tolerance (since no single point of failure)
  • If any shard goes through failover, other shards are not impacted