Data Lake Integration
Expand your RPC node's capabilities by connecting it to a data lake for complete historical ledger access.
Integration Overview
RPC version 23.0 introduces data lake integration for the getLedgers endpoint, enabling access to historical ledgers outside your node's local retention period (typically 7 days). All other RPC endpoints still operate based on the HISTORY_RETENTION_WINDOW configured on your node.
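To make the split concrete, the routing decision can be sketched as a simple boundary check. The values below are hypothetical placeholders, not real node state; on an actual node the cutoff is derived from the latest ledger and the configured retention window:

```shell
# Illustrative only: which store serves a given ledger sequence.
LATEST=2000000          # hypothetical most recent ledger known to the node
RETENTION=120960        # hypothetical HISTORY_RETENTION_WINDOW (ledgers kept locally)
REQUESTED=100           # ledger sequence requested via getLedgers

# Oldest ledger still held in the node's local storage
OLDEST_LOCAL=$((LATEST - RETENTION + 1))

if [ "$REQUESTED" -ge "$OLDEST_LOCAL" ]; then
  echo "served from local storage"
else
  echo "served from data lake"   # this branch fires for REQUESTED=100
fi
```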
The process involves setting up a ledger data lake and then configuring your RPC node to use it.
1. Accessing a Data Lake
You have two options for utilizing a data lake:
- Public Data Lake: The simplest option is to use a publicly available data lake. For example, the Stellar ledger data lake is available through the AWS Open Data program at s3://aws-public-blockchain/v1.1/stellar/ledgers/pubnet.
- Self-Hosted Data Lake: This option gives you more control over data integrity, availability, and access, but requires you to create and manage your own data lake. The Galexie tool can help you deploy a data lake on either AWS S3 or Google Cloud Storage (GCS). For detailed instructions, refer to the Galexie Admin Guide.
2. Configuring RPC for Data Lake Integration
Prerequisite
Before you begin, configure your RPC node with cloud provider credentials and ensure it has read permissions for the data lake bucket.
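How credentials are supplied depends on your deployment, but the standard cloud SDK environment variables are a common mechanism. The paths and key values below are placeholders, not real credentials:

```shell
# GCS: point the Google Cloud SDK at a service account key that has
# read access to the bucket (placeholder path).
export GOOGLE_APPLICATION_CREDENTIALS="/etc/stellar/gcs-reader-key.json"

# S3: standard AWS SDK credential variables (placeholder values).
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_REGION="us-east-1"
```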
Configuration Steps
Update your RPC node's configuration file with the following settings:
- Specify Storage Path: Define the storage backend (GCS or S3) and provide the full path to the bucket (e.g., my-bucket/path/to/data).
- Enable the Feature Flag: Set SERVE_LEDGERS_FROM_DATASTORE to true.
- Configure Retention Window: The HISTORY_RETENTION_WINDOW must be a non-zero value, even though data is fetched from the datastore. Ledgers within this retention window are served from the RPC's local storage, while older ledgers are retrieved from the data lake. A minimum value of 1 is sufficient.
Configuration Examples
Below are examples for configuring GCS and S3 backends.
A. GCS Configuration Example
# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

# Must be non-zero; a minimum value of 1 is sufficient
HISTORY_RETENTION_WINDOW = 1

# External datastore configuration for GCS
[datastore_config]
type = "GCS"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"

[datastore_config.schema]
ledgers_per_file = 1
files_per_partition = 64000
B. S3 Configuration Example
# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

# Must be non-zero; a minimum value of 1 is sufficient
HISTORY_RETENTION_WINDOW = 1

# External datastore configuration for S3
[datastore_config]
type = "S3"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"
region = "your_s3_region" # e.g., "us-east-1"

[datastore_config.schema]
ledgers_per_file = 1
files_per_partition = 64000
3. Verifying the Setup
After configuring your RPC node, you can verify that the integration is working by making a getLedgers request for a ledger sequence number older than your node's standard retention window. The RPC should successfully return the ledger data from the data lake.
Example Request:
curl -X POST https://<rpc-host>/ \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": "1",
"method": "getLedgers",
"params": {
"startLedger": 100,
"pagination": {
"limit": 1
}
}
}'
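If jq is installed, you can extract the returned sequence to confirm the old ledger came back. The response below is a hypothetical, trimmed-down capture of a getLedgers reply, assuming the documented shape (a result.ledgers array whose entries carry a sequence field):

```shell
# Hypothetical captured response from the getLedgers request above
# (shape per the getLedgers documentation; values are illustrative).
response='{"jsonrpc":"2.0","id":"1","result":{"ledgers":[{"sequence":100}]}}'

# Extract the first returned ledger's sequence; seeing the requested
# sequence (100) confirms the data lake served a ledger far older than
# the local retention window.
echo "$response" | jq -r '.result.ledgers[0].sequence'
```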