Continuous failures writing to InfluxDB and resolving its hostname, for example:
May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout
May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"
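The two failure modes above (write timeouts and DNS lookup failures) can be checked independently from the host. The sketch below is illustrative only; it assumes the Python requests library is available, and the hostname and /ping URL are taken from the log entry above.

```python
# Minimal diagnostic sketch (not part of the production tooling): check DNS
# resolution and the InfluxDB /ping endpoint separately, matching the two
# failure modes seen in the kapacitord logs.
import socket

import requests

INFLUX_HOST = "influx-cluster.service.ha.geant.org"  # hostname from the log above
PING_URL = f"https://{INFLUX_HOST}:8086/ping"


def check_influx():
    # 1. DNS: this is what failed with "no such host" in the 01:26 log entry
    try:
        addrs = {ai[4][0] for ai in socket.getaddrinfo(INFLUX_HOST, 8086)}
        print(f"resolved {INFLUX_HOST} -> {sorted(addrs)}")
    except socket.gaierror as exc:
        print(f"DNS lookup failed: {exc}")
        return

    # 2. Reachability: a slow or unreachable endpoint corresponds to the
    #    "failed to write points to InfluxDB ... err=timeout" case
    try:
        resp = requests.get(PING_URL, timeout=5)
        print(f"/ping -> HTTP {resp.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"InfluxDB ping failed: {exc}")


if __name__ == "__main__":
    check_influx()
```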
Pete Pedersen stopped the system and fixed the corrupt partition.
08:26:55
System was rebooted.
08:26:55
haproxy failed to start because it could not resolve prod-inventory-provider01.geant.org or prod-inventory-provider02.geant.org:
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server prod-inventory-provider01.geant.org' : could not resolve address 'prod-inventory-provider01.geant.org'.
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server prod-inventory-provider02.geant.org' : could not resolve address 'prod-inventory-provider02.geant.org'.
08:27:07
Kapacitor tasks failed to run because the haproxy service wasn't running, for example:
May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"
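The enrichment UDF reaches the inventory provider through haproxy on localhost:8080 (the /poller/interfaces URL in the log above), so a connection-refused error here means haproxy itself was down. The sketch below shows one way such a dependency could be probed with bounded retries before an enrichment run; the function name and retry parameters are illustrative assumptions, not part of the existing UDF code.

```python
# Illustrative sketch only: pre-flight check for the haproxy-fronted inventory
# endpoint that the enrichment UDF depends on.
import time

import requests

INVENTORY_URL = "http://localhost:8080/poller/interfaces"  # URL taken from the log entry


def wait_for_inventory(retries: int = 5, delay: float = 10.0) -> bool:
    """Return True once the haproxy-fronted inventory endpoint answers."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(INVENTORY_URL, timeout=5)
            resp.raise_for_status()
            return True
        except requests.exceptions.RequestException as exc:
            # Errno 111 (connection refused) here means haproxy is not listening
            print(f"attempt {attempt}/{retries}: inventory not reachable: {exc}")
            time.sleep(delay)
    return False
```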
08:41:11
Puppet ran and restarted haproxy.
This time DNS resolution was back to normal and haproxy started successfully.
However, the Kapacitor tasks were still in a non-executing state.
09:27:10
Kapacitor was manually restarted and normal system behavior was restored.
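The "enabled but not executing" state could be detected automatically by polling the Kapacitor 1.x HTTP API. The sketch below is a possible monitoring check, not an existing one, and assumes the API is reachable on its default port 9092.

```python
# Possible monitoring sketch: report Kapacitor tasks that are enabled but not
# executing, which is the state the tasks were stuck in after the reboot.
import requests

KAPACITOR_TASKS_URL = "http://localhost:9092/kapacitor/v1/tasks"


def stuck_tasks():
    resp = requests.get(KAPACITOR_TASKS_URL, timeout=5)
    resp.raise_for_status()
    tasks = resp.json().get("tasks", [])
    # A task with status "enabled" but executing == False matches the failure mode
    return [t["id"] for t in tasks if t.get("status") == "enabled" and not t.get("executing")]


if __name__ == "__main__":
    for task_id in stuck_tasks():
        print(f"task {task_id} is enabled but not executing")
```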
10:39
Sam Roberts copied the lost data points from UAT to production for the following measurements (one possible back-fill approach is sketched after this list):
interface_rates
dscp32_rates
gwsd_rates
multicast_rates
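The sketch below shows one possible way such a back-fill could be done with the influxdb 1.x Python client, not necessarily the procedure that was used. Hostnames, the database name and the time window are placeholders; only the measurement names come from the list above.

```python
# Hypothetical back-fill sketch: copy points for the affected measurements from
# the UAT InfluxDB to production over an assumed outage window.
from influxdb import InfluxDBClient

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
WINDOW = ("2022-05-31T00:00:00Z", "2022-05-31T10:00:00Z")  # assumed outage window

uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="poller")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="poller")

for measurement in MEASUREMENTS:
    # GROUP BY * keeps each tag set separate so it can be re-written as tags
    query = (
        f'SELECT * FROM "{measurement}" '
        f"WHERE time >= '{WINDOW[0]}' AND time < '{WINDOW[1]}' GROUP BY *"
    )
    result = uat.query(query)
    points = []
    for (name, tags), rows in result.items():
        for row in rows:
            ts = row.pop("time")
            fields = {k: v for k, v in row.items() if v is not None}
            if fields:
                points.append({
                    "measurement": name,
                    "tags": dict(tags or {}),
                    "time": ts,
                    "fields": fields,
                })
    if points:
        prod.write_points(points, batch_size=5000)
```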
Proposed Solution
The core issue appears to be related to the VMware infrastructure, and IT needs to provide a solution.
A previously known issue with Kapacitor tasks stopping because of unchecked errors (cf. POL1-529) meant that the tasks were not executing for longer than necessary.