Incident Description

The BRIAN Sensu cluster, which schedules the SNMP polling checks, had an outage of approximately 1 hour in total, split across Sunday 26 and Monday 27 Feb 2023. No counters were fetched from routers or saved in InfluxDB during this time.

The reason for degradation:

A network outage caused loss of connectivity between all Sensu cluster nodes (prod-poller-sensu-agent(01|02|03).geant.org).
This resulted in a complete loss of clustering, causing Sensu to unschedule all checks.

The impact of this service degradation was:

No interfaces were polled between approximately 26 Feb 2023 16:00 and 26 Feb 2023 16:50 UTC, resulting in loss of data on the Production BRIAN instance.
No interfaces were polled between approximately 27 Feb 2023 10:20 and 27 Feb 2023 10:35 UTC while the degraded Sensu cluster was rebooted, resulting in loss of data on the Production BRIAN instance.
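
The two figures quoted in this report (about 1 hour of outage, about 18 hours of incident) can be cross-checked against the two polling gaps listed above. The following is a minimal Python sketch of that arithmetic. It assumes the incident duration is measured from the start of the first gap to the end of the second, which this report does not state explicitly; the names GAPS, total_gap and incident_span are illustrative only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Approximate polling gaps (UTC), taken from the impact list above.
GAPS = [
    ("2023-02-26 16:00", "2023-02-26 16:50"),
    ("2023-02-27 10:20", "2023-02-27 10:35"),
]


def total_gap(gaps):
    """Sum the length of every (start, end) window: time with no polling."""
    total = timedelta()
    for start, end in gaps:
        total += datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return total


def incident_span(gaps):
    """Time from the first gap's start to the last gap's end.

    Assumption: this is how the overall incident duration is measured.
    """
    first_start = min(datetime.strptime(start, FMT) for start, _ in gaps)
    last_end = max(datetime.strptime(end, FMT) for _, end in gaps)
    return last_end - first_start


if __name__ == "__main__":
    print("time without polling:", total_gap(GAPS))
    print("incident span:", incident_span(GAPS))
```

Under these assumptions the gaps add up to 65 minutes without polling, while the span from first loss to final recovery is 18 hours 35 minutes.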

Incident severity: Temporary service outage

Data loss:

Total duration of incident: ~18 hours

Timeline

All times are in UTC

Date Time (UTC) Description

26 Feb 2023

12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline required for the data displayed in BRIAN.

May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout

May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"
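
The kapacitord entries above use key=value pairs after the syslog prefix, so the failing task, node and error can be pulled out mechanically when triaging a larger log. The following is a minimal Python sketch based only on the two lines quoted here. The regular expression and the parse_fields helper are illustrative assumptions, not part of Kapacitor or of the BRIAN tooling.

```python
import re

# A key, "=", then either a double-quoted string or a run of non-space characters.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Return the key=value pairs found in one log line as a dict.

    The syslog prefix contains no "=" and is therefore skipped.
    Surrounding double quotes are removed from quoted values.
    """
    fields = {}
    for key, value in FIELD_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


# First log line quoted in the timeline above.
SAMPLE = (
    'May 30 12:52:37 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:37.802Z lvl=error '
    'msg="failed to write points to InfluxDB" service=kapacitor '
    'task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 '
    'err=timeout'
)

if __name__ == "__main__":
    parsed = parse_fields(SAMPLE)
    print(parsed["task"], parsed["node"], parsed["err"])
```

Grouping the parsed entries by task and err would show which stream functions were affected and when the first failure occurred.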

31 May 2022

11:56

Keith Slater informed APMs - BRIAN is back to normal operation.

Proposed Solution

TBD