Migrating From 1.x

If you aren’t coming from an existing deployment of Horizon, feel free to skip this section and move on to installing Horizon!

Introduction

Starting with version 1.6.0, Horizon allows using Stellar Core in “captive” mode for ingestion. This mode has been enabled by default since Horizon 2.0, so even though you can enable captive mode on 1.6+, this migration guide targets upgrading to 2.x because of the stability and configuration improvements introduced in the later versions.

Please note that the Horizon team will support the previous non-captive mode for the time being. To use the previous method, set ENABLE_CAPTIVE_CORE_INGESTION=false on your ingesting instances. After 6 months, this feature flag will be removed.
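
For example, a minimal sketch of keeping the old ingestion path on a package-based install (assuming your settings live in /etc/default/stellar-horizon, described later in this guide):

bash
# Sketch: keep the pre-2.0, non-captive ingestion path for now
echo "ENABLE_CAPTIVE_CORE_INGESTION=false" | sudo tee -a /etc/default/stellar-horizon
sudo systemctl restart stellar-horizon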

Please give the blog post a read to understand the major changes that Horizon 2.0 introduces with the Captive Core architecture. In summary, Captive Core is a specialized, narrowed-down Stellar-Core instance with the sole aim of emitting transaction metadata to Horizon. It means:

  • no separate Stellar Core instance
  • no Core database: everything done in-memory
  • much faster ingestion

Captive Stellar Core completely eliminates all Horizon issues caused by connecting to Stellar Core’s database, but it requires extra time to initialize and manage its Stellar Core subprocess. Captive Core can be used in both reingestion (horizon db reingest range) and normal Horizon operation (horizon serve). In fact, using Captive Core to reingest historical data is considerably faster than without it.

How It Works

The blog post linked above gives a high-level overview, while this section dives a little deeper into how the technical details differ from Horizon’s relationship with a standalone, “Watcher” Core.

When using Captive Core, Horizon runs the stellar-core binary as a subprocess. The two processes then communicate over a filesystem pipe: Core writes xdr.LedgerCloseMeta structs with information about each ledger, and Horizon reads them.

The behaviour is slightly different when reingesting old ledgers and when reading recently closed ledgers:

  • When reingesting, Stellar Core is started in a special catchup mode that simply replays the requested range of ledgers. This mode requires an additional 3GiB of RAM because all ledger entries are stored in memory, making it extremely fast. This mode only depends on the history archives, so a Captive Core configuration is not required (see the sketch after this list).

  • When reading recently closed ledgers, Core is started with a normal run command. This mode also requires an additional 3GiB of RAM for in-memory ledger entries. In this case, a configuration file (again, read on below) is required in order to configure a quorum set so that it can connect to the Stellar network.
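
As a hedged illustration of the first mode, a reingestion run needs little more than the history archives and the stellar-core binary (the range below is illustrative; DATABASE_URL and NETWORK_PASSPHRASE are assumed to already be set):

bash
# Illustrative only: reingest a small range using just the history archives
export ENABLE_CAPTIVE_CORE_INGESTION=true
export STELLAR_CORE_BINARY_PATH=$(which stellar-core)
export HISTORY_ARCHIVE_URLS=https://s3-eu-west-1.amazonaws.com/history.stellar.org/prd/core-live/core_live_001
stellar-horizon db reingest range 1 1000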

Known Limitations

As discussed earlier, Captive Core provides much better decoupling for Horizon at the expense of persistence. You should be aware of the following consequences:

  • Captive Core requires a couple of minutes to complete the “apply buckets” stage the first time Horizon is started, but it should reuse the cached buckets on subsequent restarts (as of Horizon 2.5 and Core 17.1).
  • If the Horizon process terminates, Stellar Core is also terminated (unless you are using Remote Captive Core).
  • Running Horizon now requires more RAM and less disk space. You can refer to the earlier Prerequisites page for details.

To hedge against these limitations, we recommend running multiple ingesting Horizon servers in a single cluster. This allows other ingesting instances to maintain service without interruptions if a Captive Core instance is restarted.

Migration

Now, we’ll discuss migrating existing systems running the pre-2.0 versions of Horizon to the new Captive Core world.

Configuration

The first major change from 1.x is how you will configure Horizon. You will no longer need your Stellar Core configuration, but will rather need to craft a configuration file describing Captive Core’s behavior. Read this section to understand what the stub should contain.

Your old configuration cannot be used directly: Horizon needs special settings for Captive Core. Otherwise, running Horizon may fail with the following error, or errors like it:

Invalid captive core toml file: LOG_FILE_PATH in captive core config file does not match Horizon captive-core-log-path flag

Again, while the Captive Core configuration file may appear to just be a subset of Stellar Core’s configuration, you shouldn’t think about it that way and treat it as its own format. It may diverge in the future, and not all of Core’s options are available to Captive Core.

You should pass the location of this new TOML configuration via the --captive-core-config-path command-line flag or the CAPTIVE_CORE_CONFIG_PATH environment variable.

If you want to continue to have access to the underlying Stellar Core subprocess (like you did previously with a standalone Watcher Core), you should set the HTTP_PORT field in your configuration file accordingly.
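
As an illustrative sketch only (placeholder names and values, not a working quorum configuration), such a stub might look like:

TOML
HTTP_PORT=11626

[[VALIDATORS]]
NAME="example"
HOME_DOMAIN="validator.example.com"
PUBLIC_KEY="G..."  # placeholder public key
ADDRESS="core.validator.example.com"
QUALITY="MEDIUM"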

Installation

Once you have a configuration file ready, you can install the Captive Core package to prepare the Horizon configuration:

bash
sudo apt install stellar-captive-core

Note the new block inserted into your /etc/default/stellar-horizon file:

bash
# Captive Core Ingestion Config
ENABLE_CAPTIVE_CORE_INGESTION=true
STELLAR_CORE_BINARY_PATH=/usr/bin/stellar-core
CAPTIVE_CORE_CONFIG_PATH=/etc/default/stellar-captive-core.toml
CAPTIVE_CORE_STORAGE_PATH=/var/lib/stellar
# end Captive Core

You will need to adjust these accordingly, for example by pointing CAPTIVE_CORE_CONFIG_PATH to your configuration file and possibly CAPTIVE_CORE_STORAGE_PATH to where you’d like Captive Core to store its bucket files (but keep in mind the disk space and permissions requirements).
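
For example, if you relocate the storage path, the directory must be writable by the user the Horizon service runs as (assumed here to be the stellar user created by the Debian packages; the path is hypothetical):

bash
# Assumption: the stellar-horizon service runs as the "stellar" user
sudo mkdir -p /mnt/data/captive-core
sudo chown stellar:stellar /mnt/data/captive-core
# ...then set CAPTIVE_CORE_STORAGE_PATH=/mnt/data/captive-core in /etc/default/stellar-horizon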

Finally, the process for upgrading both Stellar Core and Horizon is covered here.

Depending on the version you’re migrating from, you may need to include an additional step here: manual reingestion. This can still be accomplished with Captive Core; see below.

Restarting Services

Now, we can stop Core and restart Horizon:

bash
sudo systemctl stop stellar-core
sudo systemctl restart stellar-horizon

The logs should show Captive Core running successfully as a subprocess, and eventually Horizon will be running as usual except with Captive Core rapidly generating transaction metadata in-memory!
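
To follow along on a systemd-based install, one option is to tail the service logs:

bash
sudo journalctl -u stellar-horizon -f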

A Multi-Machine Setup

If you plan on running Horizon and Captive Core on separate machines, you’ll need to read the page on Remote Captive Core for setting that up.

The most important change is that you will need to point Horizon at Captive Core via the REMOTE_CAPTIVE_CORE_URL (for the wrapper API) and STELLAR_CORE_URL (for the raw Core API, if enabled). For example, via:

bash
echo "STELLAR_CORE_URL='http://captivecore.local:11626'
REMOTE_CAPTIVE_CORE_URL='http://captivecore.local:8000'
" | sudo tee -a /etc/default/stellar-horizon

Where captivecore.local is the machine on which the Remote Captive Core API wrapper is running.
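
As a quick sanity check from the Horizon machine, you can verify the raw Core API is reachable (only applicable if you enabled the HTTP port as described above; /info is the standard Stellar Core HTTP endpoint):

bash
curl -s http://captivecore.local:11626/info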

Private Networks

If you want your Captive Core instance to connect to a private Stellar network, you will need to specify the validator(s) of the private network in the Captive Core configuration file.

Assuming the validator of your private network has a public key of GD5KD2KEZJIGTC63IGW6UMUSMVUVG5IHG64HUTFWCHVZH2N2IBOQN7PS and can be accessed at private1.validator.com, then the Captive Core config would consist of the following:

TOML
UNSAFE_QUORUM=true
FAILURE_SAFETY=0

[[VALIDATORS]]
NAME="private"
HOME_DOMAIN="validator.com"
PUBLIC_KEY="GD5KD2KEZJIGTC63IGW6UMUSMVUVG5IHG64HUTFWCHVZH2N2IBOQN7PS"
ADDRESS="private1.validator.com"
QUALITY="MEDIUM"

UNSAFE_QUORUM=true and FAILURE_SAFETY=0 are required when there are too few validators in the private network to form a quorum.

You will also need to set RUN_STANDALONE=false in the Stellar Core configuration for the validator. Otherwise, the validator will not accept connections on its peer port, which means Captive Core will not be able to connect to the validator.

On a new Stellar network, the first history archive snapshot is published after ledger 63 is closed. Captive Core depends on the history archives, which means that Horizon ingestion via Captive Core will not begin until after ledger 63 is closed. Assuming the standard 5 second delay in between ledgers, it will take ~5 minutes for the network to progress from the genesis ledger to ledger 63.

There are cases where you may need to repeatedly create new private networks (e.g. spawning a private network during integration tests) and this 5 minute delay is too costly. In that case, you can consider including ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING=true in both the validator configuration and the Captive Core configuration. When this parameter is set, Stellar Core will publish a new ledger every second. It will also publish history archive snapshots every 8 ledgers, so you will need to set Horizon’s checkpoint frequency parameter (--checkpoint-frequency/CHECKPOINT_FREQUENCY) to 8.
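
Concretely, that combination looks like the following (the TOML line goes in both the validator and the Captive Core configuration, while the environment variable goes in Horizon’s configuration):

bash
# In BOTH the validator config and the Captive Core config (TOML):
#   ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING=true
# In Horizon's environment (e.g. /etc/default/stellar-horizon):
export CHECKPOINT_FREQUENCY=8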

Reingestion

If you need to manually reingest some ledgers (for example, you want history for some ledgers that closed before your asset got issued), you can still do this with Captive Core.

For example, suppose we’ve ingested from ledger 811520 onwards, but would like the 1000 ledgers before it to be ingested as well. Nothing really changes in how you run the command compared to the “old” way (assuming the configuration updates described above are in place):

bash
stellar-horizon-cmd db reingest range 810520 811520
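
To confirm the extra history is now available, one option (assuming Horizon serves on its default port 8000) is to check the elder ledger reported by the root endpoint:

bash
curl -s http://localhost:8000/ | grep history_elder_ledger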

The biggest change is simply how much faster this gets done! For example, a full reingestion of the entire network only takes ~1.5 days (as opposed to weeks previously) on an m5.8xlarge instance.

Using Captive Core to reingest the full public network history

In some cases, it can be convenient to (re)ingest the full Stellar Public Network history into Horizon (e.g. when running Horizon for the first time).

This process used to take weeks. However, using multiple Captive Core workers on a high performance environment (powerful machines on which to run Horizon + a powerful database) makes this possible in ~1.5 days.

The following instructions assume the reingestion is done on AWS. However, they should be applicable to any other environment with equivalent capacity. In the same way, the instructions can be adapted to reingest only specific parts of the history.

Prerequisites

  1. An m5.8xlarge (32 cores, 128 GB of RAM) EC2 instance with at least 200 GB of disk capacity from which to run Horizon. This is needed to fit 24 parallel Horizon workers (each with its own Captive Core instance). Each Core instance can take up to 3 GiB of RAM and a full CPU core (more on why 24 workers below). If the number of workers is increased, you may need a larger machine.

  2. Horizon version 1.6.0 or newer (ideally 2.x) installed on the machine from (1).

  3. Stellar Core version 17.0 or newer installed on the machine from (1), since the network is on Protocol 17.

  4. A Horizon database into which to reingest the history. Preferably, the database should be at least an RDS r4.8xlarge instance or better (to take full advantage of its IOPS write capacity) and should be empty, to minimize storage (Postgres accumulates data during usage, which is only reclaimed when VACUUMed). When using an RDS instance with General Purpose SSD storage, the reingestion throughput of the DB (namely, write IOPS) is determined by the storage size (3 IOPS per GB). With 5TB you get 15K IOPS, which can be saturated with 24 Horizon workers. As the DB storage grows, the IO capacity grows along with it, so the number of workers (and the size of the instance from (1)) should be increased accordingly to take advantage of it. To make sure reingestion time is minimized, keep an eye on the RDS Write IOPS CloudWatch graph; it should ideally stay close to the theoretical limit of the DB (3,000 IOPS per TB of storage).

Reingestion

Once the prerequisites are satisfied, we can spawn two Horizon reingestion processes in parallel:

  1. One for the first 17 million ledgers (which are almost empty).
  2. Another one for the rest of the history.

This split is needed because the first 17 million ledgers are almost empty, while the rest are much more densely packed. A single Horizon instance with enough workers to saturate the IO capacity of the machine for the first 17 million ledgers would overwhelm the machine when reingesting the rest (during which each worker consumes more CPU and memory).

64 workers for (1) and 24 workers for (2) saturate the IO capacity of an RDS instance with 5TB of General Purpose SSD storage. Again, as the DB storage grows, a larger number of workers should be considered.

In order to run the reingestion, first set the following environment variables in the configuration:

bash
# Replace <password> and <database-host> with your Horizon DB credentials
export DATABASE_URL=postgres://postgres:<password>@<database-host>:5432/horizon
export APPLY_MIGRATIONS=true
export HISTORY_ARCHIVE_URLS=https://s3-eu-west-1.amazonaws.com/history.stellar.org/prd/core-live/core_live_001
export NETWORK_PASSPHRASE="Public Global Stellar Network ; September 2015"
export STELLAR_CORE_BINARY_PATH=$(which stellar-core)
export ENABLE_CAPTIVE_CORE_INGESTION=true
# Number of ledgers per job sent to the workers.
# The larger the job, the better performance from Captive Core's perspective,
# but, you want to choose a job size which maximizes the time all workers are
# busy.
export PARALLEL_JOB_SIZE=100000
# Retries per job
export RETRIES=10
export RETRY_BACKOFF_SECONDS=20

(Naturally, you can also edit the configuration file at /etc/default/stellar-horizon directly.) If Horizon was previously running, first ensure it is stopped. Then, run the following commands in parallel:

  1. stellar-horizon db reingest range --parallel-workers=64 1 16999999
  2. stellar-horizon db reingest range --parallel-workers=24 17000000 <latest_ledger>
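
To fill in <latest_ledger>, one option is to ask a Horizon instance you trust (e.g. SDF’s public Horizon) for the latest ledger it knows about via its root resource:

bash
curl -s https://horizon.stellar.org/ | grep history_latest_ledger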

When saturating an RDS instance with 15K IOPS capacity:

(1) should take a few hours to complete.

(2) should take about 1.5 days to complete.

Although there is a retry mechanism, reingestion may fail halfway through. If it does, Horizon will print the recommended range to use in order to restart it.

Monitoring reingestion process

This script should help monitor the reingestion process by printing the ledger subranges being reingested:

bash
#!/bin/bash
# Lists the catchup argument ("<ledger>/<count>") of every stellar-core
# subprocess currently running, printing five entries per row.
echo "Current ledger ranges being reingested:"
echo
I=1
for S in $(ps aux | grep stellar-core | grep catchup | awk '{print $15}' | sort -n); do
    printf '%15s' $S
    # Break the line after every 5 entries
    if [ $(( I % 5 )) = 0 ]; then
      echo
    fi
    I=$(( I + 1 ))
done

Ideally we would be using Prometheus metrics for this, but they haven’t been implemented yet.

Here is an example run:

Current ledger ranges being reingested:
    99968/99968   199936/99968   299904/99968   399872/99968   499840/99968
   599808/99968   699776/99968   799744/99968   899712/99968   999680/99968
  1099648/99968  1199616/99968  1299584/99968  1399552/99968  1499520/99968
  1599488/99968  1699456/99968  1799424/99968  1899392/99968  1999360/99968
  2099328/99968  2199296/99968  2299264/99968  2399232/99968  2499200/99968
  2599168/99968  2699136/99968  2799104/99968  2899072/99968  2999040/99968
  3099008/99968  3198976/99968  3298944/99968  3398912/99968  3498880/99968
  3598848/99968  3698816/99968  3798784/99968  3898752/99968  3998720/99968
  4098688/99968  4198656/99968  4298624/99968  4398592/99968  4498560/99968
  4598528/99968  4698496/99968  4798464/99968  4898432/99968  4998400/99968
  5098368/99968  5198336/99968  5298304/99968  5398272/99968  5498240/99968
  5598208/99968  5698176/99968  5798144/99968  5898112/99968  5998080/99968
  6098048/99968  6198016/99968  6297984/99968  6397952/99968 17099967/99968
 17199935/99968 17299903/99968 17399871/99968 17499839/99968 17599807/99968
 17699775/99968 17799743/99968 17899711/99968 17999679/99968 18099647/99968
 18199615/99968 18299583/99968 18399551/99968 18499519/99968 18599487/99968
 18699455/99968 18799423/99968 18899391/99968 18999359/99968 19099327/99968
 19199295/99968 19299263/99968 19399231/99968
