Data Lake Integration

Expand your RPC node's capabilities by connecting it to a data lake for complete historical ledger access.

Integration Overview

RPC version 23.0 introduces data lake integration for the getLedgers endpoint, enabling access to historical ledgers outside your node's local retention period (typically 7 days). All other RPC endpoints still operate within the HISTORY_RETENTION_WINDOW configured on your node.

The process involves setting up a ledger data lake and then configuring your RPC node to use it.

1. Accessing a Data Lake

You have two options for using a data lake:

  • Public Data Lake: The simplest option is to use a publicly available data lake. For example, a Stellar ledger data lake is available through the AWS Open Data program at s3://aws-public-blockchain/v1.1/stellar/ledgers/pubnet (see the example command after this list).
  • Self-Hosted Data Lake: This method gives you more control over data integrity, availability, and access, but requires you to create and manage your own data lake. The Galexie tool can help you deploy a data lake on either AWS S3 or Google Cloud Storage (GCS). For detailed instructions, refer to the Galexie Admin Guide.
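
To confirm that the public data lake is reachable from your environment, you can list its contents with the AWS CLI. This is an optional sanity check; it assumes the bucket permits anonymous reads, hence the --no-sign-request flag.

# List the top of the public Stellar ledger data lake (no AWS credentials needed)
aws s3 ls --no-sign-request s3://aws-public-blockchain/v1.1/stellar/ledgers/pubnet/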

2. Configuring RPC for Data Lake Integration

Prerequisite

Before you begin, configure your RPC node with cloud provider credentials and ensure it has read permissions for the data lake bucket.
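
How you supply credentials depends on the provider. As a sketch, assuming the RPC process uses the standard Google Cloud and AWS SDK credential chains, environment variables like the following would work (paths and key values are placeholders):

# GCS: point the client libraries at a service account key with read access
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

# S3: standard AWS environment variables (an instance role or shared
# credentials file works as well)
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"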

Configuration Steps

Update your RPC node's configuration file with the following settings:

  1. Specify Storage Path: Define the storage backend (GCS or S3) and provide the full path to the bucket (e.g., my-bucket/path/to/data).
  2. Enable the Feature Flag: Set SERVE_LEDGERS_FROM_DATASTORE to true.
  3. Configure Retention Window: The HISTORY_RETENTION_WINDOW must be set to a non-zero value, even though older data is fetched from the datastore. Ledgers within this retention window are served from the RPC's local storage, while older ledgers are retrieved from the data lake. A minimum value of 1 is sufficient, as shown in the snippet after this list.
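
For example, the retention window is set alongside the other top-level options in the RPC configuration file; the value below is simply the smallest non-zero setting:

# Serve only the most recent ledger from local storage; fetch everything
# older from the data lake
HISTORY_RETENTION_WINDOW = 1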

Configuration Examples

Below are examples for configuring GCS and S3 backends.

A. GCS Configuration Example

# External datastore configuration for GCS
[datastore_config]
type = "GCS"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"

[datastore_config.schema]
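# These values must match the layout the data lake was written with (for a
# Galexie-created data lake, use the same schema values Galexie was configured with)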
ledgers_per_file = 1
files_per_partition = 64000

# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

B. S3 Configuration Example

# External datastore configuration for S3
[datastore_config]
type = "S3"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"
region = "your_s3_region" # e.g., "us-east-1"

[datastore_config.schema]
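# As with GCS, these schema values must match how the data lake was written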
ledgers_per_file = 1
files_per_partition = 64000

# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

3. Verifying the Setup

After configuring your RPC node, you can verify that the integration is working by making a getLedgers request for a ledger sequence number older than your node's standard retention window. The RPC should successfully return the ledger data from the data lake.

Example Request:

curl -X POST https://<rpc-host>/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": "1",
    "method": "getLedgers",
    "params": {
      "startLedger": 100,
      "pagination": {
        "limit": 1
      }
    }
  }'
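
A successful call returns the requested ledger in the result. The response below is an illustrative sketch only: the hash, close time, and XDR values are placeholders, and the exact field set may vary across RPC versions.

{
  "jsonrpc": "2.0",
  "id": "1",
  "result": {
    "ledgers": [
      {
        "hash": "<ledger-hash>",
        "sequence": 100,
        "ledgerCloseTime": "<close-time>",
        "headerXdr": "<base64-encoded-header>",
        "metadataXdr": "<base64-encoded-metadata>"
      }
    ],
    "latestLedger": 123456,
    "cursor": "100"
  }
}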