Skip to main content

Ingestion

Horizon API provides most of its utility through ingested data, and your Horizon server can be configured to listen for and ingest transaction results from the Stellar network. Ingestion enables API access to both current state (e.g. someone's balance) and historical state (e.g. someone's transaction history).

Ingestion Types

There are two primary ingestion use cases for Horizon operations:

  • Ingesting live data to stay up to date with the latest ledgers from the network, accumulating a sliding window of aged ledgers;
  • Ingesting historical data to retroactively add network data from a time range in the past to the database.

Determine Storage Space

You should think carefully about the historical timeframe of ingested data you'd like to retain in Horizon's database. The storage requirements for transactions on the Stellar network are substantial and are growing unbounded over time. This is something that you may need to continually monitor and reevaluate as the network continues to grow. We have found that most organizations need only a small fraction of recent historical data to satisfy their use cases. Through analyzing traffic patterns on SDF's Horizon instance, we see that most requests are for very recent data.

To keep your storage footprint small, we recommend the following:

  • Use live ingestion, use historical ingestion only in limited exceptional cases.
  • If your application requires access to all network data, no filtering can be done, we recommend limiting historical retention of ingested data to a sliding window of 1 month (HISTORY_RETENTION_COUNT=518400) which is default set by Horizon.
  • If your application can work on a filtered network dataset based on specific accounts and assets, then we recommend applying ingestion filter rules. When using filter rules, it provides benefit of choice in longer historical retention timeframe since the filtering is reducing the overall database size to such a degree, historical retention(HISTORY_RETENTION_COUNT) can be set in terms of years rather than months or even disabled(HISTORY_RETENTION_COUNT=0).
  • If you cannot limit your history retention window to 30 days and cannot use filter rules, we recommend considering Stellar Hubble Data Warehouse for any historical data.

Ingesting Live Data

This option is enabled by default and is the recommended mode of ingestion to run. It is controlled with environment configuration flag INGEST. Refer to Configuration for how an instance of Horizon performs the ingestion role.

For high availability requirements, we recommend deploying more than one live ingesting instance, as this makes it easier to avoid downtime during upgrades and adds resilience, ensuring you always have the latest network data (refer to Ingestion Role Instance).

Ingesting Historical Data

Import network data from a past date range into the database:

stellar-horizon db reingest range <start> <end>

Running any historical range of ingestion requires coordination with the data retention configuration chosen. When setting a temporal limit on history with HISTORY_RETENTION_COUNT=<LEDGER_COUNT>, the temporal limit takes precedence, and any data ingested beyond that limit will be automatically purged.

Typically the only time you need to run historical ingestion is once when boot-strapping a system after first deployment, from that point forward live ingestion will keep the database populated with the expected sliding window of trailing historical data. Maybe one exception is if you think you have a gap in the database caused by the live ingestion being down, in which case you can run historical ingestion range to essentially gap fill.

You can run historical ingestion in parallel in background while your main Horizon server separately performs live ingestion. If the range specified overlaps with data already in the database, it is ok and will simply be overwritten, effectively idempotent.

Parallel Ingestion Workers

You can parallelize the ingestion of target historical ledger range by dividing it into sequential slices of smaller ranges and run the db reingest range command for each sub-range in parallel as a separate process on the same or a different machine. The shorthand rule for best performance is to identify the number of CPU cores available per target machine, if multi-core, then add --parallel-workers <num of cores> to the command, this will enable the command to further parallelize internally within a single process using multiple threads and sub-divided smaller ranges.

# target range 1 30000, on single machine with 1 CPU core
horizon1> stellar-horizon db reingest range 1 30000

# target range 1 30000, on single machine with 4 CPU cores
horizon1> stellar-horizon db reingest range 1 30000 --parallel-workers 4

# target range 1 30000, on two machines, each has 2 CPU cores
horizon1> stellar-horizon db reingest range 1 15000 --parallel-workers 2
horizon2> stellar-horizon db reingest range 15001 30000 --parallel-workers 2

Notes

Some endpoints may report not available during live ingestion

  • Endpoints that display current state information from live ingestion may return 503 Service Unavailable/Still Ingesting error. An example is the /paths endpoint (built using offers). Such endpoints will become available after live ingestion has finished network synchronization and catch up (usually within a couple of minutes).

If more than five minutes has elapsed with no new ingested data:

  • Verify the host machine meets recommended Prerequisites.

  • Check Horizon log output.

    • If there are many level=error messages, it may point to an environmental issue, inability to access the database.
    • Live ingestion will emit two key log lines about once every 5 seconds based on latest ledger emitted from network. Tail the Horizon log output and grep for presence of these lines with a filter:
      tail -f horizon.log | | grep -E 'Processed ledger|Closed ledger'
      If you don't see output from this pipeline every couple of seconds for a new ledger then ingestion is not proceeding, look at full logs and see if any alternative messages are printing reasons to the contrary. May see lines mentioning 'catching up' When connecting to pubnet, as it can take up to 5 minutes for the captive core process started by Horizon to catch up to pubnet network.
    • Check RAM usage on the machine, it's possible that system ran low on RAM and is using swap memory which will result in slow performance. Verify host machine meets minimum RAM prerequisites.
    • Verify the read/write throughput speeds on the volume that current working directory for horizon process is using. Based on prerequisites, volume should have at least 10mb/s, one way to roughly verify this on host machine(linux/mac) command line:
      sudo dd if=/dev/zero of=/tmp/test_speed.img bs=1G count=1

Monitoring Ingestion Process

For high-availability deployments, it is recommended to implement monitoring of ingestion process for visibility on performance/health. Refer to Monitoring for accessing logs and metrics from Horizon. Stellar publishes the example Horizon Grafana Dashboard, which demonstrates queries against key horizon ingestion metrics, specifically look at the Local Ingestion Delay [Ledgers] and Last ledger age in the Health Summary panel.