Data Lake Integration

Expand your RPC node's capabilities by connecting it to a data lake for complete historical ledger access.

Integration Overview

RPC version 23.0 introduces data lake integration for the getLedgers endpoint, enabling access to historical ledgers outside your node's local retention period (typically 7 days). All other RPC endpoints still operate within the HISTORY_RETENTION_WINDOW configured on your node.

The process involves setting up a ledger data lake and then configuring your RPC node to use it.

1. Accessing a Data Lake

You have two options for using a data lake:

  • Public Data Lake: The simplest option is to use a publicly available data lake. For example, a Stellar ledger data lake is available through the AWS Open Data program at s3://aws-public-blockchain/v1.1/stellar/ledgers/pubnet (see the example command after this list).
  • Self-Hosted Data Lake: This method gives you more control over data integrity, availability, and access, but requires you to create and manage your own data lake. The Galexie tool can help you deploy a data lake on either AWS S3 or Google Cloud Storage (GCS). For detailed instructions, refer to the Galexie Admin Guide.
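
To confirm that the public data lake is reachable from your environment, you can list its contents with the AWS CLI. This is an optional sanity check; it assumes the bucket permits anonymous reads, hence the --no-sign-request flag.

# List the top of the public Stellar ledger data lake (no AWS credentials needed)
aws s3 ls --no-sign-request s3://aws-public-blockchain/v1.1/stellar/ledgers/pubnet/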

2. Configuring RPC for Data Lake Integration

Prerequisite

Before you begin, configure your RPC node with cloud provider credentials and ensure it has read permissions for the data lake bucket.
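
How you supply credentials depends on the provider. As a sketch, assuming the RPC process uses the standard Google Cloud and AWS SDK credential chains, environment variables like the following would work (paths and key values are placeholders):

# GCS: point the client libraries at a service account key with read access
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

# S3: standard AWS environment variables (an instance role or shared
# credentials file works as well)
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"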

Configuration Steps

Update your RPC node's configuration file with the following settings:

  1. Specify Storage Path: Define the storage backend (GCS or S3) and provide the full path to the bucket (e.g., my-bucket/path/to/data).
  2. Enable the Feature Flag: Set SERVE_LEDGERS_FROM_DATASTORE to true.
  3. Configure Retention Window: The HISTORY_RETENTION_WINDOW must be set to a non-zero value, even though older data is fetched from the datastore. Ledgers within this retention window are served from the RPC's local storage, while older ledgers are retrieved from the data lake. A minimum value of 1 is sufficient, as shown in the snippet after this list.
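
For example, the retention window is set alongside the other top-level options in the RPC configuration file; the value below is simply the smallest non-zero setting:

# Serve only the most recent ledger from local storage; fetch everything
# older from the data lake
HISTORY_RETENTION_WINDOW = 1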

Configuration Examples

Below are examples for configuring GCS and S3 backends.

A. GCS Configuration Example

# External datastore configuration for GCS
[datastore_config]
type = "GCS"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"

[datastore_config.schema]
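# These values must match the layout the data lake was written with (for a
# Galexie-created data lake, use the same schema values Galexie was configured with)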
ledgers_per_file = 1
files_per_partition = 64000

# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

B. S3 Configuration Example

# External datastore configuration for S3
[datastore_config]
type = "S3"

[datastore_config.params]
destination_bucket_path = "your-bucket/path/to/data"
region = "your_s3_region" # e.g., "us-east-1"

[datastore_config.schema]
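# As with GCS, these schema values must match how the data lake was written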
ledgers_per_file = 1
files_per_partition = 64000

# Enable fetching historical ledgers from the datastore when not available locally
SERVE_LEDGERS_FROM_DATASTORE = true

3. Verifying the Setup

After configuring your RPC node, you can verify that the integration is working by making a getLedgers request for a ledger sequence number older than your node's standard retention window. The RPC should successfully return the ledger data from the data lake.

Example Request:

curl -X POST https://<rpc-host>/ \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": "1",
    "method": "getLedgers",
    "params": {
      "startLedger": 100,
      "pagination": {
        "limit": 1
      }
    }
  }'
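
A successful call returns the requested ledger in the result. The response below is an illustrative sketch only: the hash, close time, and XDR values are placeholders, and the exact field set may vary across RPC versions.

{
  "jsonrpc": "2.0",
  "id": "1",
  "result": {
    "ledgers": [
      {
        "hash": "<ledger-hash>",
        "sequence": 100,
        "ledgerCloseTime": "<close-time>",
        "headerXdr": "<base64-encoded-header>",
        "metadataXdr": "<base64-encoded-metadata>"
      }
    ],
    "latestLedger": 123456,
    "cursor": "100"
  }
}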