Monitoring

Metrics

Metrics are emitted from Horizon over HTTP in the de facto text-based exposition format. The Metrics are published on the private /metrics path of Horizon Admin port, which is an optional service to be started and will be bound by the Horizon process onto the host machine loopback network(localhost or 127.0.0.1). To enable the Admin port, add environment configuration parameter ADMIN_PORT=XXXXX, the metrics endpoint will be reachable on the host machine as localhost:<ADMIN_PORT>/metrics. You can verify this by pointing any browser that can reach this address, it will print out all metrics keys.

Exporting

Once the Admin port is enabled, the Horizon metrics endpoint can be 'scraped' by external monitoring infrastructure. Since the metrics output is encoded to standard text-based format it will be compatible for usage with many types of monitoring infrastructure that interoperate with the same standard format.

In the case of Horizon, metrics are published on the Admin HTTP port which is bound to the host machine's loop back network interface(127.0.0.1), so external monitoring systems or services cannot reach the port directly. To expose the metrics securely, we recommend following the exporter pattern, which is common metrics scraping strategy.

One real-world example for exporting from a bare metal Horizon installation (Horizon has been installed directly onto an operating system), use FluentBit and the Prometheus Exporter on the same host machine as Horizon is running. FluentBit will perform a simple port forwarding pipeline on the host machine. Configure the input to be Horizon's localhost:<ADMIN_PORT>/metrics and the output to be the host and port representing the target network interface and port on the host machine, and then configure your monitoring infrastructure to scrape that host address.

In container-orchestrated environments such as Kubernetes, you can use the same exporter strategy. We assume you already have a metrics infrastructure deployment like Prometheus and Grafana setup on the cluster via the Prometheus Operator, and will just need to configure that infrastructure to scrape the Horizon pod based on the ADMIN_PORT.

Data model

There are numerous application metrics keys emitted by Horizon at runtime, encoded in four types of exposition formats: counter, gauge, histogram, summary. Each key is further qualified with labels for more granularity. To summarize, we can highlight the groupings of metrics keys by common function denoted in the prefix of their name:

go_: golang specfic runtime performance
horizon_txsub_: attributes of Horizon transaction submission sub system if enabled.
horizon_stellar_core_: runtime attributes of stellar network reported by the captive core.
horizon_order_book_: runtime attributes of the in memory order book maintained by Horizon of the current stellar network
horizon_log_: counters of how many log messages printed at each severity level
horizon_ingest_: performance measurements and stateful aspects of Horizon's internal ingestion sub system
horizon_http_: statistics and measurements of Horizon's HTTP API service, all aspects of request/response load and timings.
horizon_history_: statistics on Horizon ingested historical ledgers
horizon_db_: measurements on database performance, query times per endpoints, pooling stats
process_: generic host machine compute measurements

And for each key, there will be a possibility of 0 or more labels, the serialized output(exposition) format follows this template:

<metrics_key_name>{label_1="value",label_2="value",,,} <value>

Rather than listing all individual metrics keys in docs, as they change often, the recommendation is to perform an HTTP GET against the Horizon metrics endpoint,localhost:<ADMIN_PORT>/metrics, using any http client(browser, curl, wget, etc) and the response will have the metrics keys and additional meta information on each metric key for description and type(counter, gauge, histogram, summary), as an example for one key, horizon_http_requests_duration_seconds:

# HELP horizon_http_requests_duration_seconds HTTP requests durations, sliding window = 10m
# TYPE horizon_http_requests_duration_seconds summary
horizon_http_requests_duration_seconds{method="GET",route="/",status="200",streaming="false",quantile="0.5"} 0.000186958
horizon_http_requests_duration_seconds{method="GET",route="/",status="200",streaming="false",quantile="0.9"} 0.00043625
horizon_http_requests_duration_seconds{method="GET",route="/",status="200",streaming="false",quantile="0.99"} 0.000645
...

Queries

Build queries against the metrics data model to highlight the performance of a given Horizon deployment. Refer to Stellar's Grafana Horizon Dashboard for examples of metrics queries to derive application performance:

Number of requests per minute.
Number of requests per route (the most popular routes).
Average response time per route.
Maximum response time for non-streaming requests.
Number of streaming vs. non-streaming requests.
Number of rate-limited requests.
List of rate-limited IPs.
Unique IPs.
The most popular SDKs/apps sending requests to a given Horizon node.
Average ingestion time of a ledger.
Average ingestion time of a transaction.

Choose the revisions tab, and download the dashboard source file to have access to the Grafana dashboard source code and metrics queries that build each panel in dashboards.

Alerts

Once queries are developed on a Grafana dashboard, it enables a convenient follow-on step to add alert rules based on specific queries to trigger notifications when thresholds are exceeded.

Here are some example alerts to consider with potential causes and solutions.

Alert	Cause	Solution
Spike in number of requests	Potential DoS attack	network load balance or content switch configurations
Ingestion is slow	host server compute resources are low	increase compute specs
HTTP API responses are returning errors	host server compute resources are low or networking to DB is lost	check the Horizon logs to see what errors are being emitted, narrow down root cause from there

Logs

Horizon will output logs to operating system's standard out. It will log on all aspects of runtime, including HTTP requests and ingestion. Typically, there are very few warn or error severity level messages emitted. The default severity level logged in Horizon is configured to LOG_LEVEL=info, this environment configuration parameter can be set to one of trace, debug, info, warn, error. The verbosity of log output is inverse of the severity level chosen. I.e. for most verbose logs use 'trace', for least verbose logs use 'error'.

For production deployments, we recommend using the default severity setting of info level and choose a log capture strategy depending on the deployment.

Bare metal deployment direct to operating system, redirect the standard out from Horizon process to a file on disk and apply a log rotation tool on the file such as logrotate to manage disk space usage.
Orchestrated deployment on Kubernetes, use an EFK/ELK stack on the cluster and it can be configured to capture the standard out from Horizon pod.

Runtime Profiling

Horizon is written in Golang, therefore it has been enabled to optionally emit the Golang runtime diagnostics and profiling output pprof. The pprof HTTP endpoints are hosted on Horizon's admin HTTP port, it can be enabled by adding environment configuration parameter ADMIN_PORT=XXXXX, since the admin port binding is disabled by default.

Two of the standard predefined profiles are published:

localhost:<ADMIN_PORT>/debug/pprof/heap - heap profiling

localhost:<ADMIN_PORT>/debug/pprof/profile - cpu profiling

Use Go's pprof command line tool to access the published endpoints and visualize the profiled diagnostic data that is emitted. A brief example usage of the pprof tool from command line to get started, using web to display a graphical representation of current heap allocations:

$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ./pprof/pprof.stellar-horizon.alloc_objects.alloc_space.inuse_objects.inuse_space.022.pb.gz
File: stellar-horizon
Type: inuse_space
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) web

I'm Stuck! Help!

If any of the above steps don't work or you are otherwise prevented from correctly setting up Horizon, please join our community and let us know. Either post a question at our Stack Exchange or chat with us on Horizon Discord to ask for help.

Metrics​

Exporting​

Data model​

Queries​

Alerts​

Logs​

Runtime Profiling​

I'm Stuck! Help!​