[AWS Certificate]-Amazon Neptune

Amazon Neptune - Overview

  • Fully managed graph database service (non-relational)
  • Relationships are first-class citizens
  • Can quickly navigate relationships and retrieve complex relations between highly connected datasets
  • Can query billions of relationships with millisecond latency
  • ACID compliant with immediate consistency
  • Supports transaction semantics for highly concurrent OLTP workloads (ACID transactions)
  • Supported graph query languages - Apache TinkerPop Gremlin and RDF/SPARQL
  • Supports up to 15 low-latency read replicas (Multi-AZ)
  • Use cases:
    • Social graph / Knowledge graph
    • Fraud detection
    • Real-time big data mining
    • Customer interests and recommendations (Recommendation engines)

Graph Database

  • Models relationships between data
    • e.g. Subject / predicate / object / graph (quad)
    • Joe likes pizza
    • Sarah is friends with Joe
    • Sarah likes pizza too
    • Joe is a student and lives in London
    • Lets you ask questions like "identify Londoners who like pizza" or "identify friends of Londoners who like pizza"
  • Uses nodes (vertices) and edges (actions) to describe the data and relationships between them
  • DB stores - person / action / object (and a graph ID or edge ID)
  • Can filter or discover data based on the strength, weight, or quality of relationships
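The subject/predicate/object model above can be sketched as a tiny in-memory triple store. This is purely illustrative (not a Neptune API); it answers the "Londoners who like pizza" question from the examples:

```python
# Minimal in-memory triple store illustrating the subject/predicate/object model.
triples = [
    ("Joe", "likes", "pizza"),
    ("Sarah", "is_friends_with", "Joe"),
    ("Sarah", "likes", "pizza"),
    ("Joe", "is_a", "student"),
    ("Joe", "lives_in", "London"),
]

def subjects(predicate, obj):
    """Return all subjects that have the given predicate/object pair."""
    return {s for s, p, o in triples if p == predicate and o == obj}

# "Identify Londoners who like pizza" = intersection of two relationship lookups
londoners = subjects("lives_in", "London")
pizza_fans = subjects("likes", "pizza")
print(londoners & pizza_fans)  # {'Joe'}
```

A real graph query language (Gremlin or SPARQL) expresses the same traversal declaratively instead of via set intersection.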

Graph query languages

  • Neptune supports two popular modeling frameworks - Apache TinkerPop and RDF/SPARQL
  • TinkerPop uses Gremlin traversal language
  • RDF (W3C standard) uses SPARQL
  • SPARQL is great for multiple data sources, has large variety of datasets available
  • We can use Gremlin or SPARQL to load data into Neptune and then to query it
  • You can store both Gremlin and SPARQL graph data on the same Neptune cluster
  • It gets stored separately on the cluster
  • Graph data inserted using one query language can only be queried with that query language (and not with the other)

Neptune Architecture

 

  • 6 copies of your data across 3 AZs (distributed design)
    • Lock-free optimistic algorithm (quorum model)
    • 4 copies out of 6 needed for writes (4/6 write quorum - data is considered durable once at least 4 of 6 copies acknowledge the write)
    • 3 copies out of 6 needed for reads (3/6 read quorum)
    • Self-healing with peer-to-peer replication
    • Storage is striped across hundreds of volumes
  • One Neptune Instance takes writes (master)
  • Compute nodes on replicas do not need to write/replicate (= improved read performance)
  • Log-structured distributed storage layer - passes incremental log records from the compute layer to the storage layer (= faster)
  • Master + up to 15 Read Replicas serve reads
  • Data is continuously backed up to S3 in real time, using storage nodes (compute node performance is unaffected)
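The 4/6 write and 3/6 read quorum rules above can be sketched in a few lines (illustrative helpers only, not Neptune internals). Note that the two quorums are chosen so they always overlap, so any read quorum sees the latest durable write:

```python
COPIES = 6          # 6 copies of the data across 3 AZs
WRITE_QUORUM = 4    # a write is durable once 4 of 6 copies acknowledge it
READ_QUORUM = 3     # a read needs agreement from 3 of 6 copies

def write_durable(acks: int) -> bool:
    """A write succeeds when enough copies acknowledged it."""
    return acks >= WRITE_QUORUM

def read_served(available: int) -> bool:
    """A read succeeds when enough copies are reachable."""
    return available >= READ_QUORUM

# Losing an entire AZ (2 copies) still leaves 4 reachable copies,
# so both writes and reads keep working:
print(write_durable(6 - 2), read_served(6 - 2))  # True True

# Overlap guarantee: any read quorum intersects any write quorum.
print(WRITE_QUORUM + READ_QUORUM > COPIES)  # True
```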

Neptune cluster

 

  • Loader endpoint - to load the data into Neptune (say, from S3)
    • e.g. https://<cluster_endpoint>:8182/loader
  • Gremlin endpoint - for Gremlin queries
    • e.g. https://<cluster_endpoint>:8182/gremlin
  • SPARQL endpoint - for SPARQL queries
    • e.g. https://<cluster_endpoint>:8182/sparql
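The three endpoints share the same host and port (8182) and differ only in the path, so they can be built with one small helper (the cluster hostname below is a placeholder):

```python
def neptune_url(cluster_endpoint: str, path: str, port: int = 8182) -> str:
    """Build a Neptune HTTPS URL for the loader, Gremlin, or SPARQL path."""
    return f"https://{cluster_endpoint}:{port}/{path}"

# Placeholder cluster endpoint for illustration
host = "my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com"
print(neptune_url(host, "loader"))
print(neptune_url(host, "gremlin"))
print(neptune_url(host, "sparql"))
```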

Demo


 

Bulk loading data into Neptune

 

  • Use the loader endpoint (HTTP POST to the loader endpoint)
    • e.g.
      curl -X POST -H 'Content-Type: application/json' \
      https://<cluster_endpoint>:8182/loader -d \
      '{
          "source": "s3://bucket_name/key_name",
          ...
       }'
  • S3 data can be accessed using an S3 VPC endpoint (allows access to S3 resources from your VPC)
  • Neptune cluster must assume an IAM role with S3 read access
  • S3 VPC endpoint can be created using the VPC management console
  • S3 bucket must be in the same region as the Neptune cluster
  • Load data formats
    • csv (for Gremlin), ntriples / nquads / rdfxml / turtle (for SPARQL)
  • All files must be UTF-8 encoded
  • Multiple files can be loaded in a single job
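The same loader request can be built in Python. The payload fields shown (source, format, iamRoleArn, region, failOnError) follow the Neptune loader API; the bucket name, role ARN, region, and cluster hostname are placeholders:

```python
import json
import urllib.request

# Loader request payload; bucket, role ARN, and region are placeholder values.
payload = {
    "source": "s3://bucket_name/key_name",
    "format": "csv",  # csv for Gremlin data; ntriples/nquads/rdfxml/turtle for SPARQL
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    "failOnError": "TRUE",
}

def loader_request(cluster_endpoint: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP POST for the loader endpoint."""
    return urllib.request.Request(
        f"https://{cluster_endpoint}:8182/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = loader_request("my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com")
print(req.get_method(), req.full_url)
```

Sending the request would require network access to the cluster's VPC plus the S3 VPC endpoint and IAM role described above, so the sketch stops at building it.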

Demo


Neptune Replication

  • Up to 15 read replicas
  • ASYNC replication
  • Replicas share the same underlying storage layer
  • Replication lag is typically tens of milliseconds
  • Minimal performance impact on the primary due to replication process
  • Replicas double up as failover targets (standby instance is not needed)

Neptune High Availability

  • Failovers occur automatically
  • A replica is automatically promoted to be the new primary during DR
  • Neptune flips the CNAME of the DB instance to point to the replica and promotes it
  • Failover to a replica typically completes in 30-120 seconds (minimal downtime)
  • Creating a new instance post-failover takes about 15 minutes
  • Failover to a new instance happens on a best-effort basis and can take longer

Neptune Backup and Restore

 

  • Supports automatic backup
  • Continuously backs up your data to S3 for PITR (max retention period of 35 days)
  • Latest restorable time for a PITR can be up to 5 minutes in the past (RPO = 5 minutes)
  • The first backup is a full backup; subsequent backups are incremental
  • Take manual snapshots to retain beyond 35 days
  • Backup process does not impact cluster performance

 

 


Neptune Backup and Restore

 

  • Can only restore to a new cluster
  • Can restore an unencrypted snapshot to an encrypted cluster (but not the other way round)
  • To restore a cluster from an encrypted snapshot, you must have access to the KMS key
  • Can only share manual snapshots (automated snapshots can be copied, and the copy shared)
  • Can't share a snapshot encrypted using the account's default KMS key
  • Snapshots can be shared across accounts, but only within the same region


Neptune Scaling

  • Vertical scaling (scale up / down) - by resizing instances
  • Horizontal scaling (scale out / in) - by adding / removing up to 15 read replicas
  • Automatic storage scaling - 10GB to 64TB (no manual intervention needed)


Database Cloning in Neptune

  • Different from creating read replicas - clones support both reads and writes
  • Different from replicating a cluster - clones use same storage layer as the source cluster
  • Requires only minimal additional storage
  • Quick and cost-effective
  • Only within region (can be in different VPC)
  • Can be created from existing clones
  • Uses a copy-on-write protocol
    • both source and clone share the same data initially
    • data that changes is then copied at the time it changes, either on the source or on the clone (i.e. stored separately from the shared data)
    • delta of writes after cloning is not shared
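The copy-on-write protocol can be sketched as pages shared with the source until one side writes (purely illustrative, not Neptune's storage format):

```python
class CowClone:
    """Clone that shares pages with its source until a page is written."""

    def __init__(self, source_pages: dict):
        self.source = source_pages  # shared storage layer (not copied)
        self.delta = {}             # pages copied only after a write

    def read(self, page_id):
        # Prefer the clone's own copy; fall back to the shared page.
        return self.delta.get(page_id, self.source[page_id])

    def write(self, page_id, data):
        # Copy-on-write: only the changed page consumes extra storage.
        self.delta[page_id] = data

source = {0: "alpha", 1: "beta"}
clone = CowClone(source)
clone.write(1, "beta-modified")
print(clone.read(0), clone.read(1))  # alpha beta-modified
print(source[1])                     # beta (source unchanged)
```

This is why a clone needs only minimal additional storage: unchanged pages are never duplicated.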

Neptune Security - IAM

  • Uses IAM for authentication and authorization to manage Neptune resources
  • Supports IAM Authentication (with AWS SigV4)
  • Use temporary credentials obtained by assuming an IAM role:
    • Create an IAM role
    • Setup trust relationship
    • Retrieve temp creds
    • Sign the requests using the creds
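The final signing step uses AWS SigV4. The sketch below shows only the SigV4 signing-key derivation (a documented chain of HMACs over date, region, and service); the secret key is a placeholder, and "neptune-db" is the service name Neptune uses for IAM auth:

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key via the HMAC chain: date -> region -> service."""
    def sign(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)  # YYYYMMDD
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

# Placeholder secret key; in practice this comes from the assumed role's temp creds.
key = sigv4_signing_key("wJalrXUtnFEMI/EXAMPLEKEY", "20240101", "us-east-1", "neptune-db")
print(len(key))  # 32 (SHA-256 digest)
```

The full SigV4 flow also builds a canonical request and string-to-sign; in practice an AWS SDK or signing library handles all of this for you.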

Neptune Security - Encryption & Network

 

  • Encryption in transit - using SSL/TLS
    • Cluster parameter neptune_enforce_ssl = 1 (default)
  • Encryption at rest - with AES-256 using KMS
    • encrypts data, automated backups, snapshots, and replicas in the same cluster
  • Neptune clusters are VPC-only (use private subnets)
  • Clients can run on EC2 in public subnets within VPC
  • Can connect to your on-premises IT infra via VPN
  • Use security groups to control access

Neptune Monitoring

 

  • Integrated with CloudWatch
  • can use audit log files by enabling the DB cluster parameter neptune_enable_audit_log
  • must restart the DB cluster after enabling audit logs
  • audit log files are rotated when they exceed 100MB (not configurable)
  • audit logs are not stored in sequential order (can be ordered using the timestamp value of each record)
  • audit log data can be published (exported) to a CloudWatch Logs log group by enabling Log exports for your cluster
  • API calls logged with CloudTrail

 

Query Queuing in Neptune

 

  • Max 8192 queries can be queued up per Neptune instance
  • Queries beyond 8192 result in a ThrottlingException
  • Use the CloudWatch metric MainRequestQueuePendingRequests to get the number of queued queries (5-minute granularity)
  • Get the acceptedQueryCount value using the Query Status API
    • For Gremlin, acceptedQueryCount = current count of queries queued
    • For SPARQL, acceptedQueryCount = all queries accepted since the server started
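A small sketch of interpreting a query-status response against the 8192-per-instance queue limit (the response shape below is a simplified illustrative example, not a verbatim Neptune payload):

```python
import json

MAX_QUEUE = 8192  # per-instance queue limit; beyond this -> ThrottlingException

# Simplified example of a Gremlin query-status response (illustrative values).
status_response = json.loads('{"acceptedQueryCount": 9, "runningQueryCount": 2}')

# For Gremlin, acceptedQueryCount reflects queries currently queued/accepted.
queued = status_response["acceptedQueryCount"]
print(f"{queued}/{MAX_QUEUE} queue slots used")
```

Remember the Gremlin/SPARQL difference above: for SPARQL the same field is a cumulative counter since server start, so it cannot be compared against the queue limit this way.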

 


Neptune Service Errors

 

  • Graph engine errors
    • Errors related to cluster endpoints are returned as HTTP error codes
    • Query errors - QueryLimitException / MemoryLimitExceededException / TooManyRequestsException etc.
    • IAM Auth errors - Missing Auth / Missing token / Invalid Signature / Missing headers / Incorrect Policy etc.
  • API errors
    • HTTP errors related to APIs (CLI / SDK)
    • InternalFailure / AccessDeniedException / MalformedQueryString / ServiceUnavailable etc
  • Loader Error
    • LOAD_NOT_STARTED / LOAD_FAILED / LOAD_S3_READ_ERROR / LOAD_DATA_DEADLOCK etc

SPARQL federated query

 

  • Query across multiple Neptune clusters, or external data sources that support the SPARQL protocol, and aggregate the results
  • Supports only read operations
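A federated query uses the SPARQL 1.1 SERVICE keyword to pull in results from a remote endpoint. Shown here as a Python string; the remote endpoint URL and the example.org terms are placeholders:

```python
# SPARQL federated query held as a Python string. The SERVICE clause sends the
# inner pattern to a remote SPARQL endpoint (placeholder URL) and joins the results.
federated_query = """
SELECT ?person ?name WHERE {
  ?person <http://example.org/likes> <http://example.org/pizza> .
  SERVICE <https://other-cluster:8182/sparql> {
    ?person <http://example.org/name> ?name .
  }
}
"""
print("SERVICE" in federated_query)  # True
```

Since federation supports only reads, the outer query must be a SELECT/ASK/CONSTRUCT-style read, never an update.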


Neptune Streams

 

  • Capture changes to your graph (change logs)
  • Similar to DynamoDB streams
  • Can be processed with Lambda (use Neptune Streams API)
  • SPARQL
    • https://<cluster_endpoint>:8182/sparql/stream
  • Gremlin
    • https://<cluster_endpoint>:8182/gremlin/stream
  • Only GET method is allowed
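Reading a stream is a plain GET with paging parameters; iteratorType (e.g. TRIM_HORIZON, LATEST, AT_SEQUENCE_NUMBER) and limit follow the Neptune Streams API, while the cluster hostname below is a placeholder:

```python
from urllib.parse import urlencode

def stream_url(cluster_endpoint: str, engine: str = "gremlin",
               iterator_type: str = "TRIM_HORIZON", limit: int = 100) -> str:
    """Build the GET URL for reading change records from a Neptune stream."""
    params = urlencode({"iteratorType": iterator_type, "limit": limit})
    return f"https://{cluster_endpoint}:8182/{engine}/stream?{params}"

# Placeholder cluster endpoint; engine is "gremlin" or "sparql"
print(stream_url("my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com"))
```

Since only GET is allowed on the stream endpoints, a consumer (e.g. a Lambda function) polls this URL and advances its iterator as it processes records.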

Use Cases

  • Amazon ES Integration
    • To perform full-text search queries on Neptune data
    • Uses Streams + federated queries
    • Supported for both Gremlin and SPARQL
  • Neptune-to-Neptune Replication

 

Neptune Pricing

 

  • You only pay for what you use
  • On-demand instances - per hour pricing
  • IOPS - per million IO requests
    • Every DB page read operation = one IO
    • Each page is 16KB in Neptune
    • Write IOs are counted in 4KB units
  • DB Storage - per GB per month
  • Backups (automated and manual) - per GB per month
  • Data transfer - per GB
  • Neptune Workbench - per instance hour
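The IO accounting rules above (one IO per 16KB page read; writes billed in 4KB units) reduce to simple arithmetic. A quick sketch, showing only the counting rule, not prices:

```python
PAGE_SIZE_KB = 16   # each 16KB DB page read = one read IO
WRITE_UNIT_KB = 4   # write IOs are counted in 4KB units

def read_ios(pages_read: int) -> int:
    """One IO per page read."""
    return pages_read

def write_ios(kb_written: int) -> int:
    """Writes are billed per 4KB unit, rounded up."""
    return -(-kb_written // WRITE_UNIT_KB)  # ceiling division

print(read_ios(10))   # 10 IOs for ten 16KB pages
print(write_ios(9))   # 3 IOs: 9KB rounds up to three 4KB units
```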