
# Monitoring

Once your node is up and running, it's important to keep an eye on it to make sure it stays afloat and continues to contribute to the health of the overall network. To help with that, Stellar Core exposes vital information that you can use to monitor your node and diagnose potential problems.

You can access this information using commands and inspecting Stellar Core's output, which is what the first half of this doc covers. You can also connect Prometheus to make monitoring easier, combine it with Alertmanager to automate notification, and use pre-built Grafana dashboards to create visual representations of your node's well-being.

However you decide to monitor, the most important thing is that you have a system in place to ensure that your integration keeps ticking.

If you run:

    $ stellar-core http-command 'info'

the output will look something like this:

    {
      "info": {
        "build": "v11.1.0",
        "history_failure_rate": "0",
        "ledger": {
          "age": 3,
          "baseFee": 100,
          "baseReserve": 5000000,
          "closeTime": 1560350852,
          "hash": "40d884f6eb105da56bea518513ba9c5cda9a4e45ac824e5eac8f7262c713cc60",
          "maxTxSetSize": 1000,
          "num": 24311579,
          "version": 11
        },
        "network": "Public Global Stellar Network ; September 2015",
        "peers": {
          "authenticated_count": 5,
          "pending_count": 0
        },
        "protocol_version": 10,
        "quorum": {
          "qset": {
            "agree": 6,
            "delayed": 0,
            "disagree": 0,
            "fail_at": 2,
            "hash": "d5c247",
            "ledger": 24311579,
            "missing": 1,
            "phase": "EXTERNALIZE"
          },
          "transitive": {
            "critical": null,
            "intersection": true,
            "last_check_ledger": 24311536,
            "node_count": 21
          }
        },
        "startedOn": "2019-06-10T17:40:29Z",
        "state": "Catching up",
        "status": [
          "Catching up: downloading and verifying buckets: 30/30 (100%)"
        ]
      }
    }

Some notable fields in info are:

• build: the build number for this Stellar Core instance
• ledger: the local state of your node, which may be different from the network state if your node was disconnected from the network. Some important sub-fields:
  • age: time elapsed since this ledger closed (during normal operation less than 10 seconds)
  • num: ledger number
  • version: protocol version supported by this ledger
• network: the network passphrase that this core instance is using to decide whether to connect to the testnet or the public network
• peers: information on the connectivity to the network
  • authenticated_count: the number of live connections
  • pending_count: the number of connections that are not fully established yet
• protocol_version: the maximum version of the protocol that this instance recognizes
• state: the node's synchronization status relative to the network
• quorum: summarizes the state of the SCP protocol participants, the same as the information returned by the quorum command (see below).

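The fields described above lend themselves to a simple automated check. The following is a minimal Python sketch, not an official tool: the function name `check_info`, the 10-second age threshold, and the assumption that a healthy node reports the state string "Synced!" are all illustrative choices, so adjust them to what your node actually reports. It expects the info output already parsed into a dictionary.

```python
def check_info(payload, max_ledger_age=10, synced_state="Synced!"):
    """Return a list of human-readable problems found in parsed info output.

    max_ledger_age and synced_state are assumptions for this sketch.
    """
    # Accept either the bare object or one nested under an "info" key.
    info = payload.get("info", payload)
    problems = []

    age = info["ledger"]["age"]
    if age >= max_ledger_age:
        problems.append(f"ledger age is {age}s (expected under {max_ledger_age}s)")

    if info["peers"]["authenticated_count"] == 0:
        problems.append("no authenticated peers")

    if info["state"] != synced_state:
        problems.append(f"state is {info['state']!r}")

    return problems


# Trimmed-down version of the example output shown above.
sample = {
    "ledger": {"age": 3},
    "peers": {"authenticated_count": 5, "pending_count": 0},
    "state": "Catching up",
}
print(check_info(sample))  # ["state is 'Catching up'"]
```

An empty list means none of the three checks fired; any other result is a candidate for an alert or a closer look.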
## Overlay information

The peers command returns information on the peers your node is connected to. This list is the result of both inbound connections from other peers and outbound connections from this node to other peers.

    $ stellar-core http-command 'peers'

    {
      "authenticated_peers": {
        "inbound": [
          {
            "address": "54.161.82.181:11625",
            "elapsed": 6,
            "id": "sdf1",
            "olver": 5,
            "ver": "v9.1.0"
          }
        ],
        "outbound": [
          {
            "address": "54.211.174.177:11625",
            "elapsed": 2303,
            "id": "sdf2",
            "olver": 5,
            "ver": "v9.1.0"
          },
          {
            "address": "54.160.175.7:11625",
            "elapsed": 14082,
            "id": "sdf3",
            "olver": 5,
            "ver": "v9.1.0"
          }
        ]
      },
      "pending_peers": {
        "inbound": ["211.249.63.74:11625", "45.77.5.118:11625"],
        "outbound": ["178.21.47.226:11625", "178.131.109.241:11625"]
      }
    }
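To turn the peers output into numbers you can track over time, something like the following Python sketch may help. The helper name `summarize_peers` is hypothetical, and the code assumes the output has already been parsed into a dictionary with the shape shown above.

```python
def summarize_peers(peers):
    """Count peers per group and direction, and collect the versions seen."""
    counts = {}
    versions = set()
    for group in ("authenticated_peers", "pending_peers"):
        for direction in ("inbound", "outbound"):
            entries = (peers.get(group) or {}).get(direction) or []
            counts[f"{group}.{direction}"] = len(entries)
            # Authenticated entries are objects with a "ver" field;
            # pending entries are plain address strings and are skipped here.
            versions.update(
                e["ver"] for e in entries if isinstance(e, dict) and "ver" in e
            )
    return counts, sorted(versions)


# Abbreviated copy of the example output above.
sample_peers = {
    "authenticated_peers": {
        "inbound": [{"id": "sdf1", "ver": "v9.1.0"}],
        "outbound": [{"id": "sdf2", "ver": "v9.1.0"}, {"id": "sdf3", "ver": "v9.1.0"}],
    },
    "pending_peers": {
        "inbound": ["211.249.63.74:11625", "45.77.5.118:11625"],
        "outbound": ["178.21.47.226:11625", "178.131.109.241:11625"],
    },
}
counts, versions = summarize_peers(sample_peers)
print(counts)
print(versions)
```

A sudden drop in the authenticated counts, or a mix of very different versions, is usually worth investigating.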

## Quorum Health​

To help node operators monitor their quorum sets and maintain the health of the overall network, Stellar Core also provides metrics on other nodes in your quorum set. You should monitor them to make sure they're up and running, and that your quorum set is maintaining good overlap with the rest of the network.

### Quorum set diagnostics​

The quorum command lets you diagnose problems with the quorum set of the local node.

If you run:

    $ stellar-core http-command 'quorum'

The output will look something like:

    {
      "node": "GCTSFJ36M7ZMTSX7ZKG6VJKPIDBDA26IEWRGV65DVX7YVVLBPE5ZWMIO",
      "qset": {
        "agree": 6,
        "delayed": null,
        "disagree": null,
        "fail_at": 2,
        "fail_with": ["sdf_watcher1", "sdf_watcher2"],
        "hash": "d5c247",
        "ledger": 24311847,
        "missing": ["stronghold1"],
        "phase": "EXTERNALIZE",
        "value": {
          "t": 3,
          "v": [
            "sdf_watcher1",
            "sdf_watcher2",
            "sdf_watcher3",
            {
              "t": 3,
              "v": ["stronghold1", "eno", "tempo.eu.com", "satoshipay"]
            }
          ]
        }
      },
      "transitive": {
        "critical": [["GDM7M262ZJJPV4BZ5SLGYYUTJGIGM25ID2XGKI3M6IDN6QLSTWQKTXQM"]],
        "intersection": true,
        "last_check_ledger": 24311536,
        "node_count": 21
      }
    }

This output has two main sections: qset and transitive. The former describes the node and its quorum set; the latter describes the transitive closure of the node's quorum set.

### Per-node Quorum-set Information

Entries to watch for in the qset section, which describes the node and its quorum set, are:

• agree: the number of nodes in the quorum set that agree with this instance.
• delayed: the nodes that are participating in consensus but seem to be behind.
• disagree: the nodes that are participating but disagreed with this instance.
• fail_at: the number of failed nodes that would cause this instance to halt.
• fail_with: an example of such potential failure.
• missing: the nodes that were missing during this consensus round.
• value: the quorum set used by this node (t is the threshold expressed as a number of nodes).

In the example above, 6 nodes are functioning properly, one is down (stronghold1), and the instance will fail if any two nodes still working (or one node and one inner-quorum-set) fail as well.

If a node is stuck in state Joining SCP, this command lets you quickly find the reason:

• too many validators are missing (down or without good connectivity).
• a network split would cause SCP to stick because of nodes that disagree. This would happen if either there is a bug in SCP, the network does not have quorum intersection, or the disagreeing nodes are misbehaving (compromised, etc).

Note that the node not being able to reach consensus does not mean that the network as a whole will not be able to reach consensus (and the opposite is true: the network may fail because of a different set of validators failing).

You can get a sense of the quorum set health of a different node using:

    $ stellar-core http-command 'quorum?node=$sdf1'

or

    $ stellar-core http-command 'quorum?node=@GABCDE'

Overall network health can be evaluated by walking through all nodes and looking at their health. Note that this is only an approximation, as remote nodes may not have received the same messages (in particular: missing for other nodes is not reliable).
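The qset entries described in this section can be checked programmatically. The Python sketch below is one possible approach under stated assumptions: `check_qset` and the `min_fail_at` threshold are hypothetical names and values, and the input is the qset object from the quorum command (where missing, delayed, and disagree are lists or null), not the abbreviated counts that appear in the info output.

```python
def check_qset(qset, min_fail_at=2):
    """Return findings for a qset object taken from the quorum command output."""
    findings = []

    # Each of these fields is either null or a list of node names.
    for field in ("missing", "delayed", "disagree"):
        nodes = qset.get(field) or []
        if nodes:
            findings.append(f"{field}: {', '.join(nodes)}")

    # fail_at is the number of additional failures that would halt this node.
    if qset["fail_at"] < min_fail_at:
        findings.append(f"fail_at is {qset['fail_at']} (below {min_fail_at})")

    return findings


# Relevant fields from the example output earlier in this section.
sample_qset = {
    "agree": 6,
    "delayed": None,
    "disagree": None,
    "fail_at": 2,
    "missing": ["stronghold1"],
}
print(check_qset(sample_qset))  # ['missing: stronghold1']
```

Running the same check against several nodes gives a rough picture of network health, with the caveat already noted that remote values such as missing are only approximate.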

### Transitive Closure Summary Information​

When showing quorum-set information about the local node, a summary of the transitive closure of the quorum set is also provided in the transitive field. This has several important sub-fields:

• last_check_ledger : the last ledger in which the transitive closure was checked for quorum intersection. This will reset when the node boots and whenever a node in the transitive quorum changes its quorum set. It may lag behind the last-closed ledger by a few ledgers depending on the computational cost of checking quorum intersection.
• node_count : the number of nodes in the transitive closure, which are considered when calculating quorum intersection.
• intersection : whether or not the transitive closure enjoyed quorum intersection at the most recent check. This is of utmost importance in preventing network splits. It should always be true. If it is ever false, one or more nodes in the transitive closure of the quorum set is currently misconfigured, and the network is at risk of splitting. Corrective action should be taken immediately, for which two additional sub-fields will be present to help suggest remedies:
• last_good_ledger : this will note the last ledger for which the intersection field was evaluated as true; if some node reconfigured at or around that ledger, reverting that configuration change is the easiest corrective action to take.
• potential_split : this will contain a pair of lists of validator IDs, which is a potential pair of disjoint quorums allowed by the current configuration. In other words, a possible split in consensus allowed by the current configuration. This may help narrow down the cause of the misconfiguration: likely it involves too-low a consensus threshold in one of the two potential quorums, and/or the absence of a mandatory trust relationship that would bridge the two.
• critical: an "advance warning" field that lists nodes that could cause the network to fail to enjoy quorum intersection, if they were misconfigured sufficiently badly. In a healthy transitive network configuration, this field will be null. If it is non-null then the network is essentially "one misconfiguration" (of the quorum sets of the listed nodes) away from no longer enjoying quorum intersection, and again, corrective action should be taken: careful adjustment to the quorum sets of nodes that depend on the listed nodes, typically to strengthen quorums that depend on them.
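The sub-fields above map naturally onto two alert levels. The following Python sketch illustrates that mapping; the function name `check_transitive` and the choice of "critical" for a lost intersection and "warning" for a non-null critical field are assumptions made for this example, not something Stellar Core prescribes.

```python
def check_transitive(transitive):
    """Return (severity, message) pairs for a parsed transitive object."""
    alerts = []

    # intersection should always be true; an explicit false means the
    # network is at risk of splitting.
    if transitive.get("intersection") is False:
        alerts.append((
            "critical",
            "no quorum intersection; "
            f"last_good_ledger={transitive.get('last_good_ledger')}, "
            f"potential_split={transitive.get('potential_split')}",
        ))

    # A non-null critical field is an advance warning.
    critical_nodes = transitive.get("critical")
    if critical_nodes is not None:
        alerts.append(("warning", f"critical node groups: {critical_nodes}"))

    return alerts


# The transitive object from the example quorum output.
sample_transitive = {
    "critical": [["GDM7M262ZJJPV4BZ5SLGYYUTJGIGM25ID2XGKI3M6IDN6QLSTWQKTXQM"]],
    "intersection": True,
    "last_check_ledger": 24311536,
    "node_count": 21,
}
for severity, message in check_transitive(sample_transitive):
    print(severity, message)
```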

### Detailed transitive quorum analysis​

The quorum endpoint can also retrieve detailed information for the transitive quorum.

This is a format that's easier to process than what scp returns as it doesn't contain all SCP messages.

### Create Alerting Rules​

Once Prometheus is scraping metrics, you can add alerting rules. Recommended rules are here (they require Prometheus 2.0 or later). Copy the rules to /etc/prometheus/stellar-core-alerting.rules on the Prometheus server and add the following to the Prometheus configuration file to include the file:

    rule_files:
      - "/etc/prometheus/stellar-core-alerting.rules"

Rules are documented in-line, and we strongly recommend that you review and verify all of them, as every environment is different.

### Configure Notifications Using Alertmanager​

Alertmanager is responsible for sending notifications. Installing and configuring an Alertmanager server is out of scope for this document, but it is a fairly simple process. Official documentation is here.

All recommended alerting rules have a "severity" label:

• critical alerts normally require immediate attention. They indicate an ongoing or very likely outage. We recommend that critical alerts notify administrators 24x7.
• warning alerts can normally wait until working hours. Warnings indicate problems that likely do not have production impact but may lead to critical alerts or outages if left unhandled.

The following example alertmanager configuration demonstrates how to send notifications using different methods based on severity label:

    global:
      smtp_smarthost: localhost:25
      smtp_from: [email protected]
    route:
      receiver: default-receiver
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
      routes:
        - receiver: critical-alerts
          match:
            severity: critical
        - receiver: warning-alerts
          match:
            severity: warning
    receivers:
      - name: critical-alerts
        pagerduty_configs:
          - routing_key: <PD routing key>
      - name: warning-alerts
        slack_configs:
          - api_url: https://hooks.slack.com/services/slack/warning/channel/webhook
      - name: default-receiver
        email_configs:
          - to: alerts-[email protected]

In the example above, alerts with severity "critical" are sent to PagerDuty and warnings are sent to Slack. Alerts that match neither route fall through to the default receiver, which sends email.
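To make the routing behaviour concrete, here is a small Python model of how the configuration above selects a receiver. It is only an illustration of first-match routing on the severity label, not Alertmanager code; `ROUTES`, `DEFAULT_RECEIVER`, and `pick_receiver` are hypothetical names, and the model ignores grouping, timing, and the `continue` option.

```python
# Child routes in the order they appear in the example configuration.
ROUTES = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(labels):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    # No child route matched, so the top-level receiver handles the alert.
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "ExampleAlert", "severity": "critical"}))  # critical-alerts
print(pick_receiver({"alertname": "ExampleAlert", "severity": "info"}))      # default-receiver
```

The alert name "ExampleAlert" is a placeholder; the recommended rules define their own names.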

### Useful Exporters​

You may find the exporters below useful for monitoring your infrastructure, as they provide detailed insight into your operating system and database metrics. Installing and configuring these exporters is out of scope for this document but should be relatively straightforward.

### Visualize metrics using Grafana​

Once you've configured Prometheus to scrape and store your stellar-core metrics, you will want a convenient way to render this data for human consumption. Grafana offers a simple and effective way to achieve this. Installing Grafana is out of scope for this document but is a very simple process, especially when using the prebuilt apt packages.

We recommend that administrators import the following two dashboards into their Grafana deployments:

• Stellar Core Monitoring: shows the most important metrics and node status, and tries to surface common problems. It's a good starting point for troubleshooting.
• Stellar Core Full: shows a simple health summary as well as all metrics exposed by the stellar-core-prometheus-exporter. It's much more detailed than Stellar Core Monitoring and might be useful during in-depth troubleshooting.