Skip to main content

Monitoring

Once your node is up and running, it's important to keep an eye on it to make sure it stays afloat and continues to contribute to the health of the overall network. To help with that, Stellar Core exposes vital information that you can use to monitor your node and diagnose potential problems.

You can access this information using commands and inspecting Stellar Core's output. The first half of this page will cover this approach. You can also connect Prometheus to make monitoring easier, combine it with Alertmanager to automate notification, and use pre-built Grafana dashboards to create visual representations of your node's well-being.

info

Stellarbeat, a community-run monitoring dashboard, can also be useful for viewing network health as a whole. You should keep a very close eye on your own node(s) using some of the tools suggested on this page, but monitoring performance of all the network's nodes can also be useful to understand how they all interact with each other.

However you decide to monitor, the most important thing is that you have a system in place to ensure that your integration keeps ticking.

General Node Information

If you run $ stellar-core http-command 'info', the output will look something like this:

{
"info": {
"build": "v20.4.0",
"ledger": {
"age": 0,
"baseFee": 100,
"baseReserve": 100000000,
"closeTime": 0,
"hash": "39c2a3cd4141b2853e70d84601faa44744660334b48f3228e0309342e3f4eb48",
"maxTxSetSize": 100,
"num": 1,
"version": 20
},
"network": "Public Global Stellar Network ; September 2015",
"peers": {
"authenticated_count": 8,
"pending_count": 2
},
"protocol_version": 20,
"quorum": {
"node": "GCRQF",
"qset": {
"agree": 22,
"cost": 37256128,
"delayed": 0,
"disagree": 0,
"fail_at": 6,
"hash": "5c464e",
"lag_ms": 1942,
"ledger": 51251628,
"missing": 1,
"phase": "EXTERNALIZE"
},
"transitive": {
"critical": null,
"intersection": true,
"last_check_ledger": 51251539,
"node_count": 24
}
},
"startedOn": "2024-04-15T16:16:25Z",
"state": "Synced!"
}
}

Some notable fields from this info endpoint are:

  • build: the build number for this Stellar Core instance
  • ledger: a representation of the local state of your node, which may be different from the network state if your node was disconnected from the network for example. Some important sub-fields:
    • age: time elapsed since this ledger closed (during normal operation less than 10 seconds)
    • num: ledger number
    • version: protocol version of this ledger
  • network the network passphrase for the network this core instance is using
  • peers: information on the connectivity to the network
    • authenticated_count: the number of live connections
    • pending_count: the number of connections that are not fully established yet
  • protocol_version: the maximum version of the protocol that this instance recognizes
  • state: the node's synchronization status relative to the network
  • quorum: summary of the state of the SCP protocol participants, which is the same information returned by the quorum command (see below).

Peer Information

The peers command returns information on the peers your node is connected to.

This list is the result of both inbound connections from other peers and outbound connections from this node to other peers. If compact=false is used in the command, then it also returns some extra metrics on each peer such as the number of dropped messages.

stellar-core http-command 'peers'

The output will look something like:

{
"authenticated_peers": {
"inbound": [
{
"address": "18.234.41.75",
"elapsed": 6,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 175,
"peer_capacity_bytes": 291340
},
"id": "SDF 1",
"latency": 172,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
}
],
"outbound": [
{
"address": "3.238.239.100:11625",
"elapsed": 105,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 175,
"peer_capacity_bytes": 291340
},
"id": "SDF 3",
"latency": 172,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
},
{
"address": "85.190.254.217:11625",
"elapsed": 295,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 169,
"peer_capacity_bytes": 288408
},
"id": "SatoshiPay Frankfurt",
"latency": 282,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
}
]
},
"pending_peers": {
"inbound": ["211.249.63.74:11625", "45.77.5.118:11625"],
"outbound": ["178.21.47.226:11625", "178.131.109.241:11625"]
}
}

Overlay Topology Survey

There is a survey mechanism in the overlay that allows a validator to request connection information from other nodes on the network. The survey can be triggered from a validator, and will flood through the network like any other message, but will request information from other nodes about which nodes it is connected to and a brief summary of their per-connection traffic volumes.

By default, a node will relay or respond to a survey message if the message originated from a node in the receiving node's transitive quorum. This behavior can be overridden by setting the SURVEYOR_KEYS field in the config file to a more restrictive set of nodes to relay or respond to. Set SURVIVOR_KEYS to ["$self"] to opt-out of responding to survey requests entirely.

The survey works in two phases: the collecting phase, and the reporting phase. During the collecting phase, nodes record information about themselves and their peers, such as the number of messages sent to a given peer. During the reporting phase, the surveyor requests the results of the collecting phase from nodes on the network.

The surveyor begins the collecting phase by broadcasting a TimeSlicedSurveyStartCollectingMessage. The surveyor ends the collecting phase and initiates the reporting phase by broadcasting a TimeSlicedSurveyStopCollectingMessage. These "start/stop collecting" messages ensure that the collecting phase is roughly equal in duration for all nodes present during the entire collecting phase. We recommend sending the "stop collecting" message about 20 minutes after the "start collecting" message. If 30 minutes elapse without receiving a "stop collecting" message, the survey will automatically transition to the reporting phase.

Additionally, the "stop/start collecting" messages contain a nonce field identifying the survey instance. The nonce in the "stop collecting" message must match the nonce from the "start collecting" message. The surveyor should choose a random 32-bit unsigned integer for the nonce.

During the reporting phase, the surveyor sends TimeSlicedSurveyRequestMessages to individual nodes to gather the information the node recorded during the collecting phase.

Overlay Survey Script

To simplify running an overlay survey, stellar-core ships with a script OverlaySurvey.py in the scripts directory. This script walks the network using the overlay survey HTTP endpoints to build a graph containing the topology of the overlay network. The script outputs this graph both in JSON format, as well as GraphML. You can analyze the GraphML file using a GraphML viewer such as Gephi.

An example usage of the survey script to run an overlay survey is as follows:

$ python3 OverlaySurvey.py survey -n http://127.0.0.1:11626 -c 20 -sr sr.json -gmlw gmlw.graphml

The arguments this example uses are:

  • sub command survey - run survey and analyze
    • -n NODE, --node NODE - address of initial survey node
    • -c DURATION, --collectDuration DURATION - duration of survey collecting phase in minutes
    • -gmlw GRAPHMLWRITE, --graphmlWrite GRAPHMLWRITE - output file for graphml file
    • -sr SURVEYRESULT, --surveyResult SURVEYRESULT - output file for survey results

Therefore, this example will run a survey from a stellar-core node running on the local machine with a collecting phase duration of 20 minutes and output the results to sr.json and gmlw.graphml.

Attaching to a Running Survey

Use the --startPhase option to attach the script to an already running survey. This may be necessary if something happened during the running of the script that caused the script to terminate early (such as losing connection with the surveyor node). --startPhase has three possible values:

  • startCollecting: Start a survey from the beginning of the collecting phase. This is the default value when --startPhase is unspecified. It indicates you would like to start a new survey from the beginning (that is, you are not attaching the script to an existing survey).
  • stopCollecting: Immediately broadcast a TimeSlicedSurveyStopCollectingMessage for the currently running survey and begin surveying individual nodes for results. Use this option if your survey is currently in the collecting phase and you'd like to move it to the reporting phase.
  • surveyResults: Begin surveying individual nodes for results. Use this option if your survey is in the reporting phase.

More Survey Script Options

The survey script contains additional subcommands and options to further analyze the survey results. You can find a complete list of subcommands by running:

$ python3 OverlaySurvey.py -h

From there, you can run:

$ python3 OverlaySurvey.py <subcommand> -h

for more info about any given subcommand.

Example Survey Command Using HTTP Endpoints

This section walks through an example of running an overlay survey by calling the survey HTTP endpoints directly. We highly recommend using the overlay survey script instead. This section may be useful to anyone who wants to modify the survey script, or anyone who is curious about the lower-level details of how the survey works and the data it includes.

In this example, we have three nodes GBBN, GDEX, and GBUI (we'll refer to them by the first four letters of their public keys). We will execute the commands below from GBUI, and note that GBBN has SURVEYOR_KEYS=["$self"] in it's config file, so GBBN will not relay or respond to any survey messages.

# 1. Begin the surveyor collecting phase
stellar-core http-command 'startsurveycollecting?nonce=1234'
# 2. Stop the surveyor collecting phase, and begin the reporting phase
stellar-core http-command 'stopsurveycollecting?nonce=1234'
# 3. Request survey results from the `GBBN` node
stellar-core http-command 'surveytopologytimesliced?node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL&inboundpeerindex=0&outboundpeerindex=0'
# 4. Request survey results from the `GDEX` node
stellar-core http-command 'surveytopologytimesliced?node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4&inboundpeerindex=0&outboundpeerindex=0'
# 3. Retrieve and display the results of issued survey commands
stellar-core http-command 'getsurveyresult'

Once the responses are received, the getsurveyresult command will return a result like this:

{
"backlog": [],
"badResponseNodes": null,
"surveyInProgress": true,
"topology": {
"GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL": null,
"GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4": {
"inboundPeers": [
{
"bytesRead": 26392,
"bytesWritten": 26960,
"duplicateFetchBytesRecv": 0,
"duplicateFetchMessageRecv": 0,
"duplicateFloodBytesRecv": 10424,
"duplicateFloodMessageRecv": 43,
"messagesRead": 93,
"messagesWritten": 96,
"nodeId": "GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL",
"secondsConnected": 22,
"uniqueFetchBytesRecv": 0,
"uniqueFetchMessageRecv": 0,
"uniqueFloodBytesRecv": 11200,
"uniqueFloodMessageRecv": 46,
"version": "v12.2.0-46-g61aadd29"
},
{
"bytesRead": 32204,
"bytesWritten": 31212,
"duplicateFetchBytesRecv": 0,
"duplicateFetchMessageRecv": 0,
"duplicateFloodBytesRecv": 11200,
"duplicateFloodMessageRecv": 46,
"messagesRead": 115,
"messagesWritten": 112,
"nodeId": "GBUICIITZTGKL7PUBHUPWD67GDRAIYUA4KCOH2PUIMMZ6JQLNVA7C4JL",
"secondsConnected": 23,
"uniqueFetchBytesRecv": 176,
"uniqueFetchMessageRecv": 2,
"uniqueFloodBytesRecv": 14968,
"uniqueFloodMessageRecv": 62,
"version": "v12.2.0-46-g61aadd29"
}
],
"numTotalInboundPeers": 2,
"numTotalOutboundPeers": 0,
"maxInboundPeerCount": 64,
"maxOutboundPeerCount": 8,
"addedAuthenticatedPeers": 0,
"droppedAuthenticatedPeers": 0,
"p75SCPFirstToSelfLatencyMs": 72,
"p75SCPSelfToOtherLatencyMs": 112,
"lostSyncCount": 0,
"isValidator": false,
"outboundPeers": null
}
}
}

In this example, note that the node GBBN under the topology field has a null value because it's configured to not respond to the survey message.

Some notable fields from this getsurveyresult endpoint are:

  • backlog: List of nodes for which the survey request are yet to be sent
  • badResponseNodes: List of nodes that sent a malformed response
  • topology: Map of nodes to connection information
    • inboundPeers/outboundPeers: List of connection information by nodes
      • averageLatencyMs: Average latency with this peer in milliseconds.
      • bytesRead: The total number of bytes read from this peer.
      • bytesWritten: The total number of bytes written to this peer.
      • duplicateFetchBytesRecv: The number of bytes received that were duplicate transaction sets and quorum sets.
      • duplicateFetchMessageRecv: The count of duplicate transaction sets and quorum sets received from this peer.
      • duplicateFloodBytesRecv: The number of bytes received that were transactions and SCP votes duplicates.
      • duplicateFloodMessageRecv: The count of duplicate transactions and SCP votes received from this peer.
      • messagesRead: The total number of messages read from this peer.
      • messagesWritten: The total number of messages written to this peer.
      • nodeId: Node's public key.
      • secondsConnected: The total number of seconds this peer has been connected to the surveyed node.
      • uniqueFetchBytesRecv: The number of bytes received that were unique transaction sets and quorum sets.
      • uniqueFetchMessageRecv: The count of unique transaction sets and quorum sets received from this peer.
      • uniqueFloodBytesRecv: The number of bytes received that were unique transactions and SCP votes.
      • uniqueFloodMessageRecv: The count of unique transactions and SCP votes received from this peer.
      • version: stellar-core version.
  • numTotalInboundPeers/numTotalOutboundPeers: The number of total inbound and outbound peers this node is connected to. The response will have a random subset of 25 connected peers per direction (inbound/outbound). These fields tell you if you're missing nodes so you can send another request out to get another random subset of nodes.
  • maxInboundPeerCount/maxOutboundPeerCount: The number of total inbound and outbound peers that this node can accept. These fields correspond to stellar-core configurations MAX_ADDITIONAL_PEER_CONNECTIONS and TARGET_PEER_CONNECTIONS, respectively.
  • addedAuthenticatedPeers: The number of authenticated peers added.
  • droppedAuthenticatedPeers: The number of authenticated peers dropped.
  • p75SCPFirstToSelfLatencyMs: 75th percentile latency to hear about new SCP messages in milliseconds.
  • p75SCPSelfToOtherLatencyMs: 75th percentile latency for other nodes to hear this node's SCP messages in milliseconds.
  • lostSyncCount: The number of times this node lost sync.
  • isValidator: Is this node a validator?

Quorum Health

To help node operators monitor their quorum sets and maintain the health of the overall network, Stellar Core also provides metrics on other nodes in your quorum set. You should monitor them to make sure they're up and running, and that your quorum set is maintaining good overlap with the rest of the network.

Quorum Set Diagnostics

The quorum command allows you to diagnose problems with the quorum set of the local node.

If you run:

stellar-core http-command 'quorum'

The output will look something like:

{
"node": "GCTSF",
"qset": {
"agree": 6,
"cost": 20883268,
"delayed": null,
"disagree": null,
"fail_at": 2,
"fail_with": ["sdf_watcher1", "sdf_watcher2"],
"hash": "d5c247",
"lag_ms": {
"sdf_watcher1": 192,
"sdf_watcher2": 215,
"sdf_watcher3": 79,
"stronghold1": 321,
"eno": 266,
"tempo.eu.com": 225,
"satoshipay": 249
},
"ledger": 24311847,
"missing": ["stronghold1"],
"phase": "EXTERNALIZE",
"value": {
"t": 3,
"v": [
"sdf_watcher1",
"sdf_watcher2",
"sdf_watcher3",
{
"t": 3,
"v": ["stronghold1", "eno", "tempo.eu.com", "satoshipay"]
}
]
}
},
"transitive": {
"critical": [["GDM7M"]],
"intersection": true,
"last_check_ledger": 24311536,
"node_count": 21
}
}

This output has two main sections: qset and transitive. The former describes the node and its quorum set. The latter describes the transitive closure of the node's quorum set.

Per-node Quorum-set Information

Entries to watch for in the qset section, which describe the node and its quorum set, are:

  • agree: the number of nodes in the quorum set that seem to be up and running as expected. The local node has no reason to believe that this node is delayed, disagree or missing. Note that agree has nothing to do with SCP terms such as "accept" or "confirming".
  • delayed: the nodes that are participating in consensus but seem to be behind.
  • disagree: the nodes that are participating but disagree with this instance.
  • fail_at: the number of failed nodes that would cause this instance to halt.
  • fail_with: an example of such potential failure.
  • missing: the nodes that seem down during this consensus round.
  • value: the quorum set used by this node (t is the threshold expressed as a number of nodes).

In the example above, 6 nodes are functioning properly, one is down (stronghold1), and the instance will fail if any two nodes still working (or one node and one inner-quorum-set) fail as well.

If a node is stuck in state Joining SCP, this command allows to quickly find the reason:

  • too many validators missing (down or without a good connectivity), solutions are:
  • network split would cause SCP to stick because of nodes that disagree. This would happen if either there is a bug in SCP, the network does not have quorum intersection, or the disagreeing nodes are misbehaving (compromised, etc.).

Note that the node not being able to reach consensus does not mean that the network as a whole will be unable to reach consensus (and the opposite is true, the network may fail because of a different set of validators failing).

You can get a sense of the quorum set health of a different node using using:

# the `NAME` of a validator
stellar-core http-command 'quorum?node=$sdf1'
# OR the `PUBLIC_KEY` of a validator
stellar-core http-command 'quorum?node=@GABCDE'

Overall network health can be evaluated by walking through all nodes and looking at their health. Note that this is only an approximation, as remote nodes may not have received the same messages (in particular: missing for other nodes is not reliable).

Transitive Closure Summary Information

When showing quorum-set information about the local node rather than some other node, a summary of the transitive closure of the quorum set is also provided in the transitive field. This has several important sub-fields:

  • last_check_ledger: the last ledger in which the transitive closure was checked for quorum intersection. This will reset when the node boots and whenever a node in the transitive quorum changes its quorum set. It may lag behind the last-closed ledger by a few ledgers depending on the computational cost of checking quorum intersection.
  • node_count: the number of nodes in the transitive closure, which are considered when calculating quorum intersection.
  • intersection: whether or not the transitive closure enjoyed quorum intersection at the most recent check. This is of utmost importance in preventing network splits. It should always be true. If it is ever false, one or more nodes in the transitive closure of the quorum set is currently misconfigured, and the network is at risk of splitting. Corrective action should be taken immediately, for which two additional sub-fields will be present to help suggest remedies:
    • last_good_ledger: this will note the last ledger for which the intersection field was evaluated as true; if some node reconfigured at or around that ledger, reverting that configuration change is the easiest corrective action to take.
    • potential_split: this will contain a pair of lists of validator IDs, which is a potential pair of disjoint quorums allowed by the current configuration. In other words, a possible split in consensus allowed by the current configuration. This may help narrow down the cause of the misconfiguration: likely it involves too-low a consensus threshold in one of the two potential quorums, and/or the absence of a mandatory trust relationship that would bridge the two.
  • critical: an "advance warning" field that lists nodes that could cause the network to fail to enjoy quorum intersection, if they were misconfigured sufficiently badly. In a healthy transitive network configuration, this field will be null. If it is non-null then the network is essentially "one misconfiguration" (of the quorum sets of the listed nodes) away from no longer enjoying quorum intersection, and again, corrective action should be taken: careful adjustment to the quorum sets of nodes that depend on the listed nodes, typically to strengthen quorums that depend on them.

Detailed Transitive Quorum Analysis

The quorum endpoint can also retrieve detailed information about the transitive quorum. This is a format that's easier to process than what scp returns, as it doesn't contain all SCP messages.

stellar-core http-command 'quorum?transitive=true'

The output looks something like:

{
"critical": null,
"intersection": true,
"last_check_ledger": 121235,
"node_count": 4,
"nodes": [
{
"distance": 0,
"heard": 121235,
"node": "GB7LI",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf2",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf3",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf1",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
}
]
}

The output begins with the same summary information as in the transitive block of the non-transitive query (if queried for the local node), but also includes a nodes array that represents a walk of the transitive quorum centered on the query node.

Notable fields contained in this response are:

  • node: the identity of the validator
  • distance: how far that node is from the root node (i.e., how many quorum set hops)
  • heard: the latest ledger sequence number at which this node cast a vote
  • qset: the node's quorum set
  • status: one of behind|tracking|ahead (compared to the root node) or missing|unknown (when there are no recent SCP messages for that node)
  • value_id: a unique ID for what the node is voting for (allows you to quickly tell if nodes are voting for the same thing)
  • value: what the node is voting for

Using Prometheus

Monitoring stellar-core using Prometheus is by far the simplest solution, especially if you already have a Prometheus server within your infrastructure. Prometheus is a free and open source time-series database with a simple yet incredibly powerful query language PromQL. Prometheus is also tightly integrated with Grafana, so you can render complex visualisations with ease.

In order for Prometheus to scrape stellar-core application metrics, you will need to install the stellar-core-prometheus-exporter (apt-get install stellar-core-prometheus-exporter) and configure your Prometheus server to scrape this exporter (default port: 9473). On top of that grafana can be used to visualize metrics.

Install a Prometheus Server Within your Infrastructure

Installing and configuring a Prometheus server is out of scope of this document, however it is a fairly simple process: Prometheus is a single Go binary which you can download from https://prometheus.io/docs/prometheus/latest/installation/.

Install the stellar-core-prometheus-exporter

The stellar-core-prometheus-exporter is an exporter that scrapes the stellar-core metrics endpoint (http://localhost:11626/metrics) and renders these metrics in the Prometheus text-based format available for Prometheus to scrape and store in its time series database.

The exporter needs to be installed on every Stellar Core node you wish to monitor.

apt-get install stellar-core-prometheus-exporter

You will need to open up port 9473 between your Prometheus server and all your Stellar Core nodes for your Prometheus server to be able to scrape metrics.

Point Prometheus to stellar-core-prometheus-exporter

Pointing your Prometheus instance to the exporter can be achieved by manually configuring a scrape job; however, depending on the number of hosts you need to monitor this can quickly become unwieldy. Luckily, the process can also be automated using Prometheus' various "service discovery" plugins. For example with AWS hosted instance you can use the ec2_sd_config plugin.

Manual

- job_name: "stellar-core"
scrape_interval: 10s
scrape_timeout: 10s
static_configs:
- targets: [
"core-node-001.example.com:9473",
"core-node-002.example.com:9473",
] # stellar-core-prometheus-exporter default port is 9473
- labels:
application: "stellar-core"

Using Service Discovery (EC2)

- job_name: stellar-core
scrape_interval: 10s
scrape_timeout: 10s
ec2_sd_configs:
- region: eu-west-1
port: 9473
relabel_configs:
# ignore stopped instances
- source_labels: [__meta_ec2_instance_state]
regex: stopped
action: drop
# only keep with `core` in the Name tag
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*core.*)"
action: keep
# use Name tag as instance label
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*)"
action: replace
replacement: "${1}"
target_label: instance
# set application label to stellar-core
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*core.*)"
action: replace
replacement: stellar-core
target_label: application

Create Alerting Rules

Once Prometheus scrapes metrics we can add alerting rules. Recommended rules are here (require Prometheus 2.0 or later). Copy rules to /etc/prometheus/stellar-core-alerting.rules on the Prometheus server and add the following to the prometheus configuration file to include the file:

rule_files:
- "/etc/prometheus/stellar-core-alerting.rules"

Rules are documented in-line,and we strongly recommend that you review and verify all of them as every environment is different.

Configure Notifications Using Alertmanager

Alertmanager is responsible for sending notifications. Installing and configuring an Alertmanager server is out of scope of this document, however it is a fairly simple process. Official documentation is here.

All recommended alerting rules have "severity" label:

  • critical normally require immediate attention. They indicate an ongoing or very likely outage. We recommend that critical alerts notify administrators 24x7
  • warning normally can wait until working hours. Warnings indicate problems that likely do not have production impact but may lead to critical alerts or outages if left unhandled

The following example alertmanager configuration demonstrates how to send notifications using different methods based on severity label:

global:
smtp_smarthost: localhost:25
smtp_from: [email protected]
route:
receiver: default-receiver
group_by: [alertname]
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
routes:
- receiver: critical-alerts
match:
severity: critical
- receiver: warning-alerts
match:
severity: warning
receivers:
- name: critical-alerts
pagerduty_configs:
- routing_key: <PD routing key>
- name: warning-alerts
slack_configs:
- api_url: https://hooks.slack.com/services/slack/warning/channel/webhook
- name: default-receiver
email_configs:
- to: alerts-[email protected]

In the above examples alerts with severity "critical" are sent to pagerduty and warnings are sent to slack.

Useful Exporters

You may find the below exporters useful for monitoring your infrastructure as they provide incredible insight into your operating system and database metrics. Installing and configuring these exporters is out of the scope of this document but should be relatively straightforward.

Visualize Metrics Using Grafana

Once you've configured Prometheus to scrape and store your stellar-core metrics, you will want a nice way to render this data for human consumption. Grafana offers the simplest and most effective way to achieve this. Installing Grafana is out of scope of this document but is a very simple process, especially when using the prebuilt apt packages

We recommend that administrators import the following two dashboards into their grafana deployments:

  • Stellar Core Monitoring - shows the most important metrics, node status and tries to surface common problems. It's a good troubleshooting starting point
  • Stellar Core Full - shows a simple health summary as well as all metrics exposed by the stellar-core-prometheus-exporter. It's much more detailed than the Stellar Core Monitoring and might be useful during in-depth troubleshooting