Monitoring
Once your node is up and running, it's important to keep an eye on it to make sure it stays afloat and continues to contribute to the health of the overall network. To help with that, Stellar Core exposes vital information that you can use to monitor your node and diagnose potential problems.
You can access this information using commands and inspecting Stellar Core's output. The first half of this page will cover this approach. You can also connect Prometheus to make monitoring easier, combine it with Alertmanager to automate notification, and use pre-built Grafana dashboards to create visual representations of your node's well-being.
Stellarbeat, a community-run monitoring dashboard, can also be useful for viewing network health as a whole. You should keep a very close eye on your own node(s) using some of the tools suggested on this page, but monitoring performance of all the network's nodes can also be useful to understand how they all interact with each other.
However you decide to monitor, the most important thing is that you have a system in place to ensure that your integration keeps ticking.
General Node Information
If you run sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'info'
, the output will look something like this:
{
"info": {
"build": "v20.4.0",
"ledger": {
"age": 0,
"baseFee": 100,
"baseReserve": 100000000,
"closeTime": 0,
"hash": "39c2a3cd4141b2853e70d84601faa44744660334b48f3228e0309342e3f4eb48",
"maxTxSetSize": 100,
"num": 1,
"version": 20
},
"network": "Public Global Stellar Network ; September 2015",
"peers": {
"authenticated_count": 8,
"pending_count": 2
},
"protocol_version": 20,
"quorum": {
"node": "GCRQF",
"qset": {
"agree": 22,
"cost": 37256128,
"delayed": 0,
"disagree": 0,
"fail_at": 6,
"hash": "5c464e",
"lag_ms": 1942,
"ledger": 51251628,
"missing": 1,
"phase": "EXTERNALIZE"
},
"transitive": {
"critical": null,
"intersection": true,
"last_check_ledger": 51251539,
"node_count": 24
}
},
"startedOn": "2024-04-15T16:16:25Z",
"state": "Synced!"
}
}
Some notable fields from this info
endpoint are:
build
: the build number for this Stellar Core instanceledger
: a representation of the local state of your node, which may be different from the network state if your node was disconnected from the network for example. Some important sub-fields:age
: time elapsed since this ledger closed (during normal operation less than 10 seconds)num
: ledger numberversion
: protocol version of this ledger
network
the network passphrase for the network this core instance is usingpeers
: information on the connectivity to the networkauthenticated_count
: the number of live connectionspending_count
: the number of connections that are not fully established yet
protocol_version
: the maximum version of the protocol that this instance recognizesstate
: the node's synchronization status relative to the networkquorum
: summary of the state of the SCP protocol participants, which is the same information returned by thequorum
command (see below).
Peer Information
The peers
command returns information on the peers your node is connected to.
This list is the result of both inbound connections from other peers and outbound connections from this node to other peers. If compact=false
is used in the command, then it also returns some extra metrics on each peer such as the number of dropped messages.
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'peers'
The output will look something like:
{
"authenticated_peers": {
"inbound": [
{
"address": "18.234.41.75",
"elapsed": 6,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 175,
"peer_capacity_bytes": 291340
},
"id": "SDF 1",
"latency": 172,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
}
],
"outbound": [
{
"address": "3.238.239.100:11625",
"elapsed": 105,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 175,
"peer_capacity_bytes": 291340
},
"id": "SDF 3",
"latency": 172,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
},
{
"address": "85.190.254.217:11625",
"elapsed": 295,
"flow_control": {
"local_capacity": {
"flood": 200,
"reading": 200
},
"local_capacity_bytes": {
"flood": 300000
},
"peer_capacity": 169,
"peer_capacity_bytes": 288408
},
"id": "SatoshiPay Frankfurt",
"latency": 282,
"olver": 32,
"ver": "stellar-core 20.4.0 (7fc7671b8bc1ccc3b1f16a6ab83bc9f671db8b70)"
}
]
},
"pending_peers": {
"inbound": ["211.249.63.74:11625", "45.77.5.118:11625"],
"outbound": ["178.21.47.226:11625", "178.131.109.241:11625"]
}
}
Overlay Topology Survey
There is a survey mechanism in the overlay that allows a validator to request connection information from other nodes on the network. The survey can be triggered from a validator, and will flood through the network like any other message, but will request information from other nodes about which nodes it is connected to and a brief summary of their per-connection traffic volumes.
By default, a node will relay or respond to a survey message if the message originated from a node in the receiving node's transitive quorum. This behavior can be overridden by setting the SURVEYOR_KEYS
field in the config file to a more restrictive set of nodes to relay or respond to. Set SURVIVOR_KEYS
to ["$self"]
to opt-out of responding to survey requests entirely.
The survey works in two phases: the collecting phase, and the reporting phase. During the collecting phase, nodes record information about themselves and their peers, such as the number of messages sent to a given peer. During the reporting phase, the surveyor requests the results of the collecting phase from nodes on the network.
The surveyor begins the collecting phase by broadcasting a TimeSlicedSurveyStartCollectingMessage
. The surveyor ends the collecting phase and initiates the reporting phase by broadcasting a TimeSlicedSurveyStopCollectingMessage
. These "start/stop collecting" messages ensure that the collecting phase is roughly equal in duration for all nodes present during the entire collecting phase. We recommend sending the "stop collecting" message about 20 minutes after the "start collecting" message. If 30 minutes elapse without receiving a "stop collecting" message, the survey will automatically transition to the reporting phase.
Additionally, the "stop/start collecting" messages contain a nonce
field identifying the survey instance. The nonce in the "stop collecting" message must match the nonce from the "start collecting" message. The surveyor should choose a random 32-bit unsigned integer for the nonce.
During the reporting phase, the surveyor sends TimeSlicedSurveyRequestMessage
s to individual nodes to gather the information the node recorded during the collecting phase.
Overlay Survey Script
To simplify running an overlay survey, stellar-core ships with a script OverlaySurvey.py
in the scripts
directory. This script walks the network using the overlay survey HTTP endpoints to build a graph containing the topology of the overlay network. The script outputs this graph both in JSON format, as well as GraphML. You can analyze the GraphML file using a GraphML viewer such as Gephi.
An example usage of the survey script to run an overlay survey is as follows:
$ python3 OverlaySurvey.py survey -n http://127.0.0.1:11626 -c 20 -sr sr.json -gmlw gmlw.graphml
The arguments this example uses are:
- sub command
survey
- run survey and analyze-n NODE
,--node NODE
- address of initial survey node-c DURATION
,--collectDuration DURATION
- duration of survey collecting phase in minutes-gmlw GRAPHMLWRITE
,--graphmlWrite GRAPHMLWRITE
- output file for graphml file-sr SURVEYRESULT
,--surveyResult SURVEYRESULT
- output file for survey results
Therefore, this example will run a survey from a stellar-core node running on the local machine with a collecting phase duration of 20 minutes and output the results to sr.json
and gmlw.graphml
.
Attaching to a Running Survey
Use the --startPhase
option to attach the script to an already running survey. This may be necessary if something happened during the running of the script that caused the script to terminate early (such as losing connection with the surveyor node). --startPhase
has three possible values:
startCollecting
: Start a survey from the beginning of the collecting phase. This is the default value when--startPhase
is unspecified. It indicates you would like to start a new survey from the beginning (that is, you are not attaching the script to an existing survey).stopCollecting
: Immediately broadcast aTimeSlicedSurveyStopCollectingMessage
for the currently running survey and begin surveying individual nodes for results. Use this option if your survey is currently in the collecting phase and you'd like to move it to the reporting phase.surveyResults
: Begin surveying individual nodes for results. Use this option if your survey is in the reporting phase.
More Survey Script Options
The survey script contains additional subcommands and options to further analyze the survey results. You can find a complete list of subcommands by running:
$ python3 OverlaySurvey.py -h
From there, you can run:
$ python3 OverlaySurvey.py <subcommand> -h
for more info about any given subcommand.
Example Survey Command Using HTTP Endpoints
This section walks through an example of running an overlay survey by calling the survey HTTP endpoints directly. We highly recommend using the overlay survey script instead. This section may be useful to anyone who wants to modify the survey script, or anyone who is curious about the lower-level details of how the survey works and the data it includes.
In this example, we have three nodes GBBN
, GDEX
, and GBUI
(we'll refer to them by the first four letters of their public keys). We will execute the commands below from GBUI
, and note that GBBN
has SURVEYOR_KEYS=["$self"]
in it's config file, so GBBN
will not relay or respond to any survey messages.
# 1. Begin the surveyor collecting phase
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'startsurveycollecting?nonce=1234'
# 2. Stop the surveyor collecting phase, and begin the reporting phase
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'stopsurveycollecting?nonce=1234'
# 3. Request survey results from the `GBBN` node
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'surveytopologytimesliced?node=GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL&inboundpeerindex=0&outboundpeerindex=0'
# 4. Request survey results from the `GDEX` node
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'surveytopologytimesliced?node=GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4&inboundpeerindex=0&outboundpeerindex=0'
# 3. Retrieve and display the results of issued survey commands
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'getsurveyresult'
Once the responses are received, the getsurveyresult
command will return a result like this:
{
"backlog": [],
"badResponseNodes": null,
"surveyInProgress": true,
"topology": {
"GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL": null,
"GDEXJV6XKKLDUWKTSXOOYVOYWZGVNIKKQ7GVNR5FOV7VV5K4MGJT5US4": {
"inboundPeers": [
{
"bytesRead": 26392,
"bytesWritten": 26960,
"duplicateFetchBytesRecv": 0,
"duplicateFetchMessageRecv": 0,
"duplicateFloodBytesRecv": 10424,
"duplicateFloodMessageRecv": 43,
"messagesRead": 93,
"messagesWritten": 96,
"nodeId": "GBBNXPPGDFDUQYH6RT5VGPDSOWLZEXXFD3ACUPG5YXRHLTATTUKY42CL",
"secondsConnected": 22,
"uniqueFetchBytesRecv": 0,
"uniqueFetchMessageRecv": 0,
"uniqueFloodBytesRecv": 11200,
"uniqueFloodMessageRecv": 46,
"version": "v12.2.0-46-g61aadd29"
},
{
"bytesRead": 32204,
"bytesWritten": 31212,
"duplicateFetchBytesRecv": 0,
"duplicateFetchMessageRecv": 0,
"duplicateFloodBytesRecv": 11200,
"duplicateFloodMessageRecv": 46,
"messagesRead": 115,
"messagesWritten": 112,
"nodeId": "GBUICIITZTGKL7PUBHUPWD67GDRAIYUA4KCOH2PUIMMZ6JQLNVA7C4JL",
"secondsConnected": 23,
"uniqueFetchBytesRecv": 176,
"uniqueFetchMessageRecv": 2,
"uniqueFloodBytesRecv": 14968,
"uniqueFloodMessageRecv": 62,
"version": "v12.2.0-46-g61aadd29"
}
],
"numTotalInboundPeers": 2,
"numTotalOutboundPeers": 0,
"maxInboundPeerCount": 64,
"maxOutboundPeerCount": 8,
"addedAuthenticatedPeers": 0,
"droppedAuthenticatedPeers": 0,
"p75SCPFirstToSelfLatencyMs": 72,
"p75SCPSelfToOtherLatencyMs": 112,
"lostSyncCount": 0,
"isValidator": false,
"outboundPeers": null
}
}
}
In this example, note that the node GBBN
under the topology
field has a null
value because it's configured to not respond to the survey message.
Some notable fields from this getsurveyresult
endpoint are:
backlog
: List of nodes for which the survey request are yet to be sentbadResponseNodes
: List of nodes that sent a malformed responsetopology
: Map of nodes to connection informationinboundPeers
/outboundPeers
: List of connection information by nodesaverageLatencyMs
: Average latency with this peer in milliseconds.bytesRead
: The total number of bytes read from this peer.bytesWritten
: The total number of bytes written to this peer.duplicateFetchBytesRecv
: The number of bytes received that were duplicate transaction sets and quorum sets.duplicateFetchMessageRecv
: The count of duplicate transaction sets and quorum sets received from this peer.duplicateFloodBytesRecv
: The number of bytes received that were transactions and SCP votes duplicates.duplicateFloodMessageRecv
: The count of duplicate transactions and SCP votes received from this peer.messagesRead
: The total number of messages read from this peer.messagesWritten
: The total number of messages written to this peer.nodeId
: Node's public key.secondsConnected
: The total number of seconds this peer has been connected to the surveyed node.uniqueFetchBytesRecv
: The number of bytes received that were unique transaction sets and quorum sets.uniqueFetchMessageRecv
: The count of unique transaction sets and quorum sets received from this peer.uniqueFloodBytesRecv
: The number of bytes received that were unique transactions and SCP votes.uniqueFloodMessageRecv
: The count of unique transactions and SCP votes received from this peer.version
: stellar-core version.
numTotalInboundPeers
/numTotalOutboundPeers
: The number of total inbound and outbound peers this node is connected to. The response will have a random subset of 25 connected peers per direction (inbound/outbound). These fields tell you if you're missing nodes so you can send another request out to get another random subset of nodes.maxInboundPeerCount
/maxOutboundPeerCount
: The number of total inbound and outbound peers that this node can accept. These fields correspond to stellar-core configurationsMAX_ADDITIONAL_PEER_CONNECTIONS
andTARGET_PEER_CONNECTIONS
, respectively.addedAuthenticatedPeers
: The number of authenticated peers added.droppedAuthenticatedPeers
: The number of authenticated peers dropped.p75SCPFirstToSelfLatencyMs
: 75th percentile latency to hear about new SCP messages in milliseconds.p75SCPSelfToOtherLatencyMs
: 75th percentile latency for other nodes to hear this node's SCP messages in milliseconds.lostSyncCount
: The number of times this node lost sync.isValidator
: Is this node a validator?
Quorum Health
To help node operators monitor their quorum sets and maintain the health of the overall network, Stellar Core also provides metrics on other nodes in your quorum set. You should monitor them to make sure they're up and running, and that your quorum set is maintaining good overlap with the rest of the network.
Quorum Set Diagnostics
The quorum
command allows you to diagnose problems with the quorum set of the local node.
If you run:
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'quorum'
The output will look something like:
{
"node": "GCTSF",
"qset": {
"agree": 6,
"cost": 20883268,
"delayed": null,
"disagree": null,
"fail_at": 2,
"fail_with": ["sdf_watcher1", "sdf_watcher2"],
"hash": "d5c247",
"lag_ms": {
"sdf_watcher1": 192,
"sdf_watcher2": 215,
"sdf_watcher3": 79,
"stronghold1": 321,
"eno": 266,
"tempo.eu.com": 225,
"satoshipay": 249
},
"ledger": 24311847,
"missing": ["stronghold1"],
"phase": "EXTERNALIZE",
"value": {
"t": 3,
"v": [
"sdf_watcher1",
"sdf_watcher2",
"sdf_watcher3",
{
"t": 3,
"v": ["stronghold1", "eno", "tempo.eu.com", "satoshipay"]
}
]
}
},
"transitive": {
"critical": [["GDM7M"]],
"intersection": true,
"last_check_ledger": 24311536,
"node_count": 21
}
}
This output has two main sections: qset
and transitive
. The former describes the node and its quorum set. The latter describes the transitive closure of the node's quorum set.
Per-node Quorum-set Information
Entries to watch for in the qset
section, which describe the node and its quorum set, are:
agree
: the number of nodes in the quorum set that seem to be up and running as expected. The local node has no reason to believe that this node isdelayed
,disagree
ormissing
. Note thatagree
has nothing to do with SCP terms such as "accept" or "confirming".delayed
: the nodes that are participating in consensus but seem to be behind.disagree
: the nodes that are participating but disagree with this instance.fail_at
: the number of failed nodes that would cause this instance to halt.fail_with
: an example of such potential failure.missing
: the nodes that seem down during this consensus round.value
: the quorum set used by this node (t
is the threshold expressed as a number of nodes).
In the example above, 6 nodes are functioning properly, one is down (stronghold1
), and the instance will fail if any two nodes still working (or one node and one inner-quorum-set) fail as well.
If a node is stuck in state Joining SCP
, this command allows to quickly find the reason:
- too many validators missing (down or without a good connectivity), solutions are:
- adjust your quorum set (thresholds, grouping, etc.) based on the nodes that are not missing
- try to get a better connectivity path to the missing validators
- network split would cause SCP to stick because of nodes that disagree. This would happen if either there is a bug in SCP, the network does not have quorum intersection, or the disagreeing nodes are misbehaving (compromised, etc.).
Note that the node not being able to reach consensus does not mean that the network as a whole will be unable to reach consensus (and the opposite is true, the network may fail because of a different set of validators failing).
You can get a sense of the quorum set health of a different node using using:
# the `NAME` of a validator
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'quorum?node=$sdf1'
# OR the `PUBLIC_KEY` of a validator
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'quorum?node=@GABCDE'
Overall network health can be evaluated by walking through all nodes and looking at their health. Note that this is only an approximation, as remote nodes may not have received the same messages (in particular: missing
for other nodes is not reliable).
Transitive Closure Summary Information
When showing quorum-set information about the local node rather than some other node, a summary of the transitive closure of the quorum set is also provided in the transitive
field. This has several important sub-fields:
last_check_ledger
: the last ledger in which the transitive closure was checked for quorum intersection. This will reset when the node boots and whenever a node in the transitive quorum changes its quorum set. It may lag behind the last-closed ledger by a few ledgers depending on the computational cost of checking quorum intersection.node_count
: the number of nodes in the transitive closure, which are considered when calculating quorum intersection.intersection
: whether or not the transitive closure enjoyed quorum intersection at the most recent check. This is of utmost importance in preventing network splits. It should always be true. If it is ever false, one or more nodes in the transitive closure of the quorum set is currently misconfigured, and the network is at risk of splitting. Corrective action should be taken immediately, for which two additional sub-fields will be present to help suggest remedies:last_good_ledger
: this will note the last ledger for which theintersection
field was evaluated as true; if some node reconfigured at or around that ledger, reverting that configuration change is the easiest corrective action to take.potential_split
: this will contain a pair of lists of validator IDs, which is a potential pair of disjoint quorums allowed by the current configuration. In other words, a possible split in consensus allowed by the current configuration. This may help narrow down the cause of the misconfiguration: likely it involves too-low a consensus threshold in one of the two potential quorums, and/or the absence of a mandatory trust relationship that would bridge the two.
critical
: an "advance warning" field that lists nodes that could cause the network to fail to enjoy quorum intersection, if they were misconfigured sufficiently badly. In a healthy transitive network configuration, this field will benull
. If it is non-null
then the network is essentially "one misconfiguration" (of the quorum sets of the listed nodes) away from no longer enjoying quorum intersection, and again, corrective action should be taken: careful adjustment to the quorum sets of nodes that depend on the listed nodes, typically to strengthen quorums that depend on them.
Detailed Transitive Quorum Analysis
The quorum endpoint can also retrieve detailed information about the transitive quorum. This is a format that's easier to process than what scp
returns, as it doesn't contain all SCP messages.
sudo -u stellar stellar-core --conf /etc/stellar/stellar-core.cfg http-command 'quorum?transitive=true'
The output looks something like:
{
"critical": null,
"intersection": true,
"last_check_ledger": 121235,
"node_count": 4,
"nodes": [
{
"distance": 0,
"heard": 121235,
"node": "GB7LI",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf2",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf3",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
},
{
"distance": 1,
"heard": 121235,
"node": "sdf1",
"qset": {
"t": 2,
"v": ["sdf1", "sdf2", "sdf3"]
},
"status": "tracking",
"value": "[ txH: d99591, ct: 1557426183, upgrades: [ ] ]",
"value_id": 1
}
]
}
The output begins with the same summary information as in the transitive
block of the non-transitive query (if queried for the local node), but also includes a nodes
array that represents a walk of the transitive quorum centered on the query node.
Notable fields contained in this response are:
node
: the identity of the validatordistance
: how far that node is from the root node (i.e., how many quorum set hops)heard
: the latest ledger sequence number at which this node cast a voteqset
: the node's quorum setstatus
: one ofbehind|tracking|ahead
(compared to the root node) ormissing|unknown
(when there are no recent SCP messages for that node)value_id
: a unique ID for what the node is voting for (allows you to quickly tell if nodes are voting for the same thing)value
: what the node is voting for
Using Prometheus
Monitoring stellar-core
using Prometheus is by far the simplest solution, especially if you already have a Prometheus server within your infrastructure. Prometheus is a free and open source time-series database with a simple yet incredibly powerful query language PromQL
. Prometheus is also tightly integrated with Grafana, so you can render complex visualisations with ease.
In order for Prometheus to scrape stellar-core
application metrics, you will need to install the stellar-core-prometheus-exporter (apt-get install stellar-core-prometheus-exporter
) and configure your Prometheus server to scrape this exporter (default port: 9473
). On top of that grafana can be used to visualize metrics.
Install a Prometheus Server Within your Infrastructure
Installing and configuring a Prometheus server is out of scope of this document, however it is a fairly simple process: Prometheus is a single Go binary which you can download from https://prometheus.io/docs/prometheus/latest/installation/.
Install the stellar-core-prometheus-exporter
The stellar-core-prometheus-exporter is an exporter that scrapes the stellar-core
metrics endpoint (http://localhost:11626/metrics
) and renders these metrics in the Prometheus text-based format available for Prometheus to scrape and store in its time series database.
The exporter needs to be installed on every Stellar Core node you wish to monitor.
apt-get install stellar-core-prometheus-exporter
You will need to open up port 9473
between your Prometheus server and all your Stellar Core nodes for your Prometheus server to be able to scrape metrics.
Point Prometheus to stellar-core-prometheus-exporter
Pointing your Prometheus instance to the exporter can be achieved by manually configuring a scrape job; however, depending on the number of hosts you need to monitor this can quickly become unwieldy. Luckily, the process can also be automated using Prometheus' various "service discovery" plugins. For example with AWS hosted instance you can use the ec2_sd_config
plugin.
Manual
- job_name: "stellar-core"
scrape_interval: 10s
scrape_timeout: 10s
static_configs:
- targets: [
"core-node-001.example.com:9473",
"core-node-002.example.com:9473",
] # stellar-core-prometheus-exporter default port is 9473
- labels:
application: "stellar-core"
Using Service Discovery (EC2)
- job_name: stellar-core
scrape_interval: 10s
scrape_timeout: 10s
ec2_sd_configs:
- region: eu-west-1
port: 9473
relabel_configs:
# ignore stopped instances
- source_labels: [__meta_ec2_instance_state]
regex: stopped
action: drop
# only keep with `core` in the Name tag
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*core.*)"
action: keep
# use Name tag as instance label
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*)"
action: replace
replacement: "${1}"
target_label: instance
# set application label to stellar-core
- source_labels: [__meta_ec2_tag_Name]
regex: "(.*core.*)"
action: replace
replacement: stellar-core
target_label: application
Create Alerting Rules
Once Prometheus scrapes metrics we can add alerting rules. Recommended rules are here (require Prometheus 2.0 or later). Copy rules to /etc/prometheus/stellar-core-alerting.rules on the Prometheus server and add the following to the prometheus configuration file to include the file:
rule_files:
- "/etc/prometheus/stellar-core-alerting.rules"
Rules are documented in-line,and we strongly recommend that you review and verify all of them as every environment is different.
Configure Notifications Using Alertmanager
Alertmanager is responsible for sending notifications. Installing and configuring an Alertmanager server is out of scope of this document, however it is a fairly simple process. Official documentation is here.
All recommended alerting rules have "severity" label:
- critical normally require immediate attention. They indicate an ongoing or very likely outage. We recommend that critical alerts notify administrators 24x7
- warning normally can wait until working hours. Warnings indicate problems that likely do not have production impact but may lead to critical alerts or outages if left unhandled
The following example alertmanager configuration demonstrates how to send notifications using different methods based on severity label:
global:
smtp_smarthost: localhost:25
smtp_from: [email protected]
route:
receiver: default-receiver
group_by: [alertname]
group_wait: 30s
group_interval: 5m
repeat_interval: 1h
routes:
- receiver: critical-alerts
match:
severity: critical
- receiver: warning-alerts
match:
severity: warning
receivers:
- name: critical-alerts
pagerduty_configs:
- routing_key: <PD routing key>
- name: warning-alerts
slack_configs:
- api_url: https://hooks.slack.com/services/slack/warning/channel/webhook
- name: default-receiver
email_configs:
- to: alerts-[email protected]
In the above examples alerts with severity "critical" are sent to pagerduty and warnings are sent to slack.
Useful Exporters
You may find the below exporters useful for monitoring your infrastructure as they provide incredible insight into your operating system and database metrics. Installing and configuring these exporters is out of the scope of this document but should be relatively straightforward.
- node_exporter can be used to track all operating system metrics.
- postgresql_exporter can be used to monitor the local stellar-core database.
Visualize Metrics Using Grafana
Once you've configured Prometheus to scrape and store your stellar-core metrics, you will want a nice way to render this data for human consumption. Grafana offers the simplest and most effective way to achieve this. Installing Grafana is out of scope of this document but is a very simple process, especially when using the prebuilt apt packages
We recommend that administrators import the following two dashboards into their grafana deployments:
- Stellar Core Monitoring - shows the most important metrics, node status and tries to surface common problems. It's a good troubleshooting starting point
- Stellar Core Full - shows a simple health summary as well as all metrics exposed by the
stellar-core-prometheus-exporter
. It's much more detailed than the Stellar Core Monitoring and might be useful during in-depth troubleshooting