The BRIAN Sensu cluster, which schedules SNMP polling checks, had outages totalling approximately 1 hour, spanning Sunday/Monday. No counters were fetched from routers or saved in InfluxDB during this time.
The reason for degradation:
A network outage resulted in loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
This resulted in complete loss of clustering, causing Sensu to unschedule all checks.
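Why a loss of connectivity between all three nodes leaves no working cluster can be illustrated with a short sketch. It assumes the cluster relies on majority-quorum consensus, as etcd-backed clusters do; this report does not state the clustering mechanism, and the function names are hypothetical.

```python
# Minimal sketch, ASSUMING majority-quorum clustering (not confirmed by this report).

def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if a node reaching `reachable_members` members (itself included) is in a majority."""
    return reachable_members >= quorum(cluster_size)


CLUSTER_SIZE = 3  # prod-poller-sensu-agent01, 02 and 03

# Total partition: every node can reach only itself.
isolated = has_quorum(1, CLUSTER_SIZE)
# For comparison: one node unreachable, the other two still connected.
degraded = has_quorum(2, CLUSTER_SIZE)

print(quorum(CLUSTER_SIZE), isolated, degraded)  # 2 False True
```

Under that assumption, a three-node cluster tolerates the loss of one node, but not a partition in which every node is isolated.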
The impact of this service degradation was:
No interfaces were polled between approximately 16:00 and 16:50 UTC, resulting in loss of data on the Production BRIAN instance.
No interfaces were polled between approximately 10:20 and 10:35 UTC, due to a reboot of the degraded Sensu cluster, resulting in loss of data on the Production BRIAN instance.
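As a quick check of the "approximately 1 hour total" figure, the two polling gaps listed above can be added up. This is a minimal sketch using only the approximate times given in this report; the helper name is hypothetical.

```python
from datetime import datetime

# Approximate polling gaps (UTC) as reported in the impact statement.
GAPS = [("16:00", "16:50"), ("10:20", "10:35")]


def gap_minutes(start: str, end: str) -> int:
    """Return the length of a same-day HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = [gap_minutes(start, end) for start, end in GAPS]
total = sum(durations)

print(durations, total)  # [50, 15] 65
```

The two windows add up to about 65 minutes without polling, which matches the stated total of roughly one hour.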
Incident severity: Temporary service outage
Data loss:
Total duration of incident: ~18 hours
Timeline
All times are in UTC
Date | Time (UTC) | Description
30 May 2022 | 12:52:37 | The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN.
May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout
May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"
31 May 2022 | 11:56 | Keith Slater informed APMs that BRIAN is back to normal operation.
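The kapacitord entries quoted in the timeline use a key=value layout, so the first error can be located programmatically when reviewing similar incidents. The following is a minimal sketch that parses the two quoted lines. The regular expression and function name are illustrative assumptions, not part of Kapacitor or of the BRIAN tooling.

```python
import re
from datetime import datetime, timezone

# One key=value pair; a value is either double-quoted or a run of non-space characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_kapacitor_line(line: str) -> dict:
    """Extract the key=value fields from one kapacitord log line."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # strip the surrounding quotes
        fields[key] = value
    return fields


# The two log lines quoted in the timeline.
LINES = [
    'May 30 12:52:37 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" '
    'service=kapacitor task_master=main task=remove_spikes_gwsd_rates '
    'node=influxdb_out3 err=timeout',
    'May 30 12:52:38 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" '
    'service=kapacitor task_master=main task=remove_spikes_interface_rates '
    'node=remove_spikes2 err="keepalive timedout, last keepalive received was: '
    '2022-05-30 12:52:28.069298439 +0000 UTC"',
]

records = [parse_kapacitor_line(line) for line in LINES]
errors = [r for r in records if r.get("lvl") == "error"]

# The timestamps share one ISO 8601 layout, so comparing the strings orders them in time.
first = min(errors, key=lambda r: r["ts"])
first_ts = datetime.strptime(first["ts"], "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

print(first_ts.isoformat(), first["task"], first["err"])
```

Run against a full log file instead of the two sample lines, the same approach would give the earliest error timestamp per task.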