Incident Description

The BRIAN Sensu cluster, which schedules the SNMP polling checks, had an outage for approximately 1 hour total, spanning Sunday/Monday. During this time, no counters were fetched from routers or saved in InfluxDB.


The reason for degradation:


The impact of this service degradation was:


Incident severity: Temporary service outage

Data loss: 

Total duration of incident: ~18 hours


Timeline

All times are in UTC

DateTime (UTC)    Description

 

30 May 2022 12:52:37

The first evidence of this incident appeared in the logs of prod-poller-processor.geant.org. remove_spikes_interface_rates is one of several stream functions in the data processing pipeline that produce the data displayed in BRIAN.


May 30 12:52:37 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:37.802Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 err=timeout

May 30 12:52:38 prod-poller-processor kapacitord[124994]: ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" service=kapacitor task_master=main task=remove_spikes_interface_rates node=remove_spikes2 err="keepalive timedout, last keepalive received was: 2022-05-30 12:52:28.069298439 +0000 UTC"
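
The two log entries above use key=value fields (ts, lvl, task, node, err), so the failing tasks can be pulled out of a larger log with a short script. The following is a minimal sketch, not part of the BRIAN tooling: the function names parse_logfmt and error_summary are hypothetical, and the regular expression is a simplified key=value matcher assumed to be sufficient for lines shaped like the two shown here.

```python
import re

# Matches key=value pairs; the value is either a double-quoted string
# or a run of non-whitespace characters. Simplified on purpose.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key=value fields found in one log line as a dict."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def error_summary(lines):
    """Yield (ts, task, node, err) for every line logged at error level."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("lvl") == "error":
            yield (fields.get("ts"), fields.get("task"),
                   fields.get("node"), fields.get("err"))


# The two entries quoted in the timeline, copied verbatim.
SAMPLE = [
    'May 30 12:52:37 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:37.802Z lvl=error '
    'msg="failed to write points to InfluxDB" service=kapacitor '
    'task_master=main task=remove_spikes_gwsd_rates node=influxdb_out3 '
    'err=timeout',
    'May 30 12:52:38 prod-poller-processor kapacitord[124994]: '
    'ts=2022-05-30T12:52:38.069Z lvl=error msg="encountered error" '
    'service=kapacitor task_master=main task=remove_spikes_interface_rates '
    'node=remove_spikes2 err="keepalive timedout, last keepalive received '
    'was: 2022-05-30 12:52:28.069298439 +0000 UTC"',
]

if __name__ == "__main__":
    for row in error_summary(SAMPLE):
        print(row)
```

Run against the full kapacitord log, a summary like this would show which tasks and nodes failed first and with which error, which may help when filling in the reason for degradation.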

31 May 2022 11:56

Keith Slater informed APMs - BRIAN is back to normal operation.

Proposed Solution