Full History Exporting

This page outlines best practices for using Galexie to build a data lake with the complete history of ledger metadata.

Why Full History Export?

Exporting the full history of Stellar ledger metadata provides a complete data lake of everything that has occurred on-chain. This makes it easy and fast to retrieve data at any point in the network's history.

This enables:

  • Analytics - run historical trend analysis with stellar-etl
  • Full History RPC - data lake backend supplying ledger metadata to RPC instances to enable full history data access
  • Real-Time Data - access to real-time data on top of the full historical data access

Costs and Storage Requirements

The following estimates are based on GCP Compute Engine and Google Cloud Storage (GCS) costs:

  • Total Cost ~= $1,100 USD
    • Compute Costs ~= $500 USD
    • GCS Class A Operations (writes) Costs ~= $600 USD
  • Total Storage Size ~= 3 TB

Export Strategy

The best way to export full history with Galexie is by running multiple individual instances of Galexie in parallel. For reference, it is estimated to take approximately 150 days to export full history using a single Galexie instance. Running in parallel with 40-50 Galexie instances takes roughly 4-5 days.

  1. Make sure you have set up a storage system and have appropriate hardware available, as defined in the Galexie Prerequisites.
  2. Determine how many parallel Galexie instances you want to run.
  3. Pass non-overlapping ledger ranges to each of your Galexie instances.

Note that earlier ledgers in history are smaller and export faster than more recent ledgers. This performance difference becomes apparent around ledger 30,000,000, so it is generally better to allocate more Galexie instances to the more recent ledgers.

How this Looks in Practice

Let's say there are 50,000,000 ledgers that Galexie needs to export.

  • For 50 instances, this split could look like:
    • 15 instances to process genesis to 29,999,999
    • 35 instances to process 30,000,000 to 50,000,000

Each instance follows the same Running Galexie instructions:

galexie append --start <start_ledger> --end <end_ledger>

Your first instance would run:

galexie append --start 2 --end 1999999

The second would run:

galexie append --start 2000000 --end 3999999

and so on, until every ledger range is covered.
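
The ranges can also be generated rather than written by hand. The following is a minimal Python sketch, assuming the 15/35 split from the example above. `split_even` and `FIRST_LEDGER` are hypothetical names introduced for illustration and are not part of Galexie; the first range is clamped to ledger 2 only to match the example commands on this page.

```python
def split_even(start, end, n):
    """Split the inclusive range [start, end] into n contiguous,
    non-overlapping inclusive ranges of near-equal size."""
    total = end - start + 1
    if not 1 <= n <= total:
        raise ValueError("n must be between 1 and the number of ledgers")
    # bounds[k] is the first ledger of range k; bounds[n] is end + 1.
    bounds = [start + total * k // n for k in range(n + 1)]
    return [(bounds[k], bounds[k + 1] - 1) for k in range(n)]


FIRST_LEDGER = 2  # assumption: first ledger used by the example commands

# 15 instances for the older, faster ledgers; 35 for the newer, slower ones.
older = split_even(0, 29_999_999, 15)
older[0] = (FIRST_LEDGER, older[0][1])
newer = split_even(30_000_000, 50_000_000, 35)
plan = older + newer

# One command line per Galexie instance.
for start, end in plan:
    print(f"galexie append --start {start} --end {end}")
```

Because each range begins exactly one ledger after the previous one ends, the plan has no gaps or overlaps. Adjusting the instance counts or the 30,000,000 boundary only requires changing the arguments to `split_even`.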

Methods for Running Multiple Galexie Instances

There are different ways to start up multiple Galexie instances that can vary depending on your cloud provider or local hardware.

GCP Batch

Within GCP, you can use Batch, which accepts a job definition as a JSON or YAML file. The job file can parameterize the start and end ledgers for each Galexie instance.

Example GCP Batch job YAML

job:
  taskGroups:
    - taskSpec:
        computeResource:
          cpuMilli: 2000
          memoryMib: 8000
        maxRetryCount: 1
        container:
          imageUri: "stellar/stellar-galexie:23.0.0"
          entrypoint: "galexie"
          commands: ["append", "--start", "${START}", "--end", "${END}"]
      tasks:
        # It is possible to use the GCP batch index instead of manually naming each task
        - name: "galexie-1"
          environments:
            START: "2"
            END: "1999999"
        - name: "galexie-2"
          environments:
            START: "2000000"
            END: "3999999"
        # ...

      requireHostsFile: true
      requireTaskHostsFile: true
  allocationPolicy:
    instances:
      - policy:
          machineType: "e2-standard-2"
          disks:
            - newDisk:
                type: "pd-standard"
                sizeGb: 100
              mountPoint: "/mnt/shared"
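
With dozens of instances, the per-task entries are tedious to write by hand. As a rough sketch, the Python below builds one entry per ledger range and prints them as JSON, which Batch also accepts. It reuses the field names from the example job file (`name`, `environments`, `START`, `END`); `batch_tasks` is a hypothetical helper, and the ranges are illustrative only.

```python
import json


def batch_tasks(ranges):
    """Build one task entry per (start, end) ledger range, using the same
    field names as the example job file."""
    return [
        {"name": f"galexie-{i}", "environments": {"START": str(s), "END": str(e)}}
        for i, (s, e) in enumerate(ranges, start=1)
    ]


# Illustrative ranges only; substitute your own non-overlapping plan.
ranges = [(2, 1_999_999), (2_000_000, 3_999_999), (4_000_000, 5_999_999)]
tasks = batch_tasks(ranges)

# Print the entries so they can be pasted into a JSON job file.
print(json.dumps(tasks, indent=2))
```

The values are emitted as strings because they are passed to the container as environment variables.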

GCP Compute Instances

You can also spin up multiple individual compute instances manually.

Example GCP Compute Instance

# container-declaration-0.yaml
spec:
  restartPolicy: Always
  containers:
    - name: galexie
      image: stellar/stellar-galexie:23.0.0
      command:
        - galexie
      args:
        - append
        - --start
        - "2"
        - --end
        - "1999999"
      securityContext:
        privileged: true

Then create the instance by running the following gcloud command:

gcloud compute instances create "galexie-0" \
  --zone=us-central1-a \
  --machine-type=e2-standard-2 \
  --image-family=cos-stable \
  --image-project=cos-cloud \
  --boot-disk-size=100GB \
  --boot-disk-type=pd-standard \
  --boot-disk-device-name="galexie-0" \
  --tags=http-server,https-server \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --service-account=<service-account> \
  --metadata-from-file=gce-container-declaration="container-declaration-0.yaml"

Repeat this process for as many parallel Galexie instances as desired, giving each one a unique instance name and its own ledger range.
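
Repeating these steps by hand is error-prone, so it can help to generate the declaration and command for each instance. The Python sketch below is one possible approach, not an official tool: it only prints the text for review and does not call gcloud. `DECLARATION` and `gcloud_command` are hypothetical names, the flags are copied from the example command above, and `<service-account>` remains a placeholder.

```python
import shlex

# Template mirroring the example container declaration.
DECLARATION = """\
spec:
  restartPolicy: Always
  containers:
    - name: galexie
      image: stellar/stellar-galexie:23.0.0
      command:
        - galexie
      args:
        - append
        - --start
        - "{start}"
        - --end
        - "{end}"
      securityContext:
        privileged: true
"""


def gcloud_command(index, service_account):
    """Return the gcloud argument list for instance number `index`,
    mirroring the flags of the example command."""
    name = f"galexie-{index}"
    return [
        "gcloud", "compute", "instances", "create", name,
        "--zone=us-central1-a",
        "--machine-type=e2-standard-2",
        "--image-family=cos-stable",
        "--image-project=cos-cloud",
        "--boot-disk-size=100GB",
        "--boot-disk-type=pd-standard",
        f"--boot-disk-device-name={name}",
        "--tags=http-server,https-server",
        "--scopes=https://www.googleapis.com/auth/cloud-platform",
        f"--service-account={service_account}",
        "--metadata-from-file="
        f"gce-container-declaration=container-declaration-{index}.yaml",
    ]


# Illustrative ranges only; substitute your own non-overlapping plan.
ranges = [(2, 1_999_999), (2_000_000, 3_999_999)]

for index, (start, end) in enumerate(ranges):
    print(f"# container-declaration-{index}.yaml")
    print(DECLARATION.format(start=start, end=end))
    print(shlex.join(gcloud_command(index, "<service-account>")))
```

Save each printed declaration under the file name shown in its comment before running the matching command.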

Local Galexie Instances

You can run multiple Galexie instances locally with a locally built Galexie executable:

./galexie append --start 2 --end 1999999 & \
./galexie append --start 2000000 --end 3999999 & \
./galexie append --start 4000000 --end 5999999 &

...
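
The shell form above starts the processes in the background but does not report when they finish or whether any failed. As a minimal sketch, assuming a locally built `./galexie` binary in the current directory, the Python below launches one process per range and collects the exit codes. `build_commands` and `run_all` are hypothetical helpers.

```python
import subprocess


def build_commands(ranges, binary="./galexie"):
    """Build one `galexie append` argument list per (start, end) range."""
    return [
        [binary, "append", "--start", str(s), "--end", str(e)]
        for s, e in ranges
    ]


def run_all(commands):
    """Start every command as a separate process, then wait for all of
    them and return their exit codes in the same order."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Illustrative ranges only; substitute your own non-overlapping plan.
ranges = [(2, 1_999_999), (2_000_000, 3_999_999), (4_000_000, 5_999_999)]
commands = build_commands(ranges)

# Uncomment to launch the exports; a non-zero exit code marks a failed range.
# exit_codes = run_all(commands)
```

Any range whose process exits with a non-zero code can then be re-run on its own.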