Continuous failures writing to InfluxDB and resolving its hostname, for example:
May 31 00:49:08 prod-poller-processor kapacitord[54933]: ts=2022-05-31T00:49:08.133Z lvl=error msg="failed to write points to InfluxDB" service=kapacitor task_master=main task=interface_rates node=influxdb_out12 err=timeout
May 31 01:26:44 prod-poller-processor kapacitord[54933]: ts=2022-05-31T01:26:44.163Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=read err="Get https://influx-cluster.service.ha.geant.org:8086/ping: dial tcp: lookup influx-cluster.service.ha.geant.org on 83.97.93.200:53: no such host"
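The two failure modes above (write timeouts and DNS lookup failures) can be checked independently from the host. The sketch below is illustrative only; it assumes the Python requests library is available, and the hostname and /ping URL are taken from the log entry above.

```python
# Minimal diagnostic sketch (not part of the production tooling): check DNS
# resolution and the InfluxDB /ping endpoint separately, matching the two
# failure modes seen in the kapacitord logs.
import socket

import requests

INFLUX_HOST = "influx-cluster.service.ha.geant.org"  # hostname from the log above
PING_URL = f"https://{INFLUX_HOST}:8086/ping"


def check_influx():
    # 1. DNS: this is what failed with "no such host" in the 01:26 log entry
    try:
        addrs = {ai[4][0] for ai in socket.getaddrinfo(INFLUX_HOST, 8086)}
        print(f"resolved {INFLUX_HOST} -> {sorted(addrs)}")
    except socket.gaierror as exc:
        print(f"DNS lookup failed: {exc}")
        return

    # 2. Reachability: a slow or unreachable endpoint corresponds to the
    #    "failed to write points to InfluxDB ... err=timeout" case
    try:
        resp = requests.get(PING_URL, timeout=5)
        print(f"/ping -> HTTP {resp.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"InfluxDB ping failed: {exc}")


if __name__ == "__main__":
    check_influx()
```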
Pete Pedersen stopped the system and fixed the corrupt partition.
08:26:55
System was rebooted.
08:26:55
haproxy failed to start because it could not resolve prod-inventory-provider01.geant.org or prod-inventory-provider02.geant.org:
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:30] : 'server prod-inventory-provider01.geant.org' : could not resolve address 'prod-inventory-provider01.geant.org'.
May 31 08:26:55 prod-poller-processor haproxy[976]: [ALERT] 150/082655 (976) : parsing [/etc/haproxy/haproxy.cfg:31] : 'server prod-inventory-provider02.geant.org' : could not resolve address 'prod-inventory-provider02.geant.org'.
08:27:07
Kapacitor tasks failed to run because the haproxy service wasn't running, for example:
May 31 08:27:07 prod-poller-processor kapacitord[839]: ts=2022-05-31T08:27:07.962Z lvl=info msg="UDF log" service=kapacitor task_master=main task=service_enrichment node=inventory_enrichment2 text="urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /poller/interfaces (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f749f4a2978>: Failed to establish a new connection: [Errno 111] Connection refused',))"
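The enrichment UDF reaches the inventory provider through haproxy on localhost:8080 (the /poller/interfaces URL in the log above), so a connection-refused error here means haproxy itself was down. The sketch below shows one way such a dependency could be probed with bounded retries before an enrichment run; the function name and retry parameters are illustrative assumptions, not part of the existing UDF code.

```python
# Illustrative sketch only: pre-flight check for the haproxy-fronted inventory
# endpoint that the enrichment UDF depends on.
import time

import requests

INVENTORY_URL = "http://localhost:8080/poller/interfaces"  # URL taken from the log entry


def wait_for_inventory(retries: int = 5, delay: float = 10.0) -> bool:
    """Return True once the haproxy-fronted inventory endpoint answers."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(INVENTORY_URL, timeout=5)
            resp.raise_for_status()
            return True
        except requests.exceptions.RequestException as exc:
            # Errno 111 (connection refused) here means haproxy is not listening
            print(f"attempt {attempt}/{retries}: inventory not reachable: {exc}")
            time.sleep(delay)
    return False
```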
08:41:11
Puppet ran and restarted haproxy.
This time DNS resolution was back to normal and haproxy started successfully.
However, the Kapacitor tasks were still in a non-executing state.
09:27:10
Kapacitor was manually restarted and normal system behavior was restored.
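The "enabled but not executing" state could be detected automatically by polling the Kapacitor 1.x HTTP API. The sketch below is a possible monitoring check, not an existing one, and assumes the API is reachable on its default port 9092.

```python
# Possible monitoring sketch: report Kapacitor tasks that are enabled but not
# executing, which is the state the tasks were stuck in after the reboot.
import requests

KAPACITOR_TASKS_URL = "http://localhost:9092/kapacitor/v1/tasks"


def stuck_tasks():
    resp = requests.get(KAPACITOR_TASKS_URL, timeout=5)
    resp.raise_for_status()
    tasks = resp.json().get("tasks", [])
    # A task with status "enabled" but executing == False matches the failure mode
    return [t["id"] for t in tasks if t.get("status") == "enabled" and not t.get("executing")]


if __name__ == "__main__":
    for task_id in stuck_tasks():
        print(f"task {task_id} is enabled but not executing")
```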
10:39
Sam Roberts copied the lost data points from UAT to production for the following measurements (one possible back-fill approach is sketched after this list):
interface_rates
dscp32_rates
gwsd_rates
multicast_rates
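The sketch below shows one possible way such a back-fill could be done with the influxdb 1.x Python client, not necessarily the procedure that was used. Hostnames, the database name and the time window are placeholders; only the measurement names come from the list above.

```python
# Hypothetical back-fill sketch: copy points for the affected measurements from
# the UAT InfluxDB to production over an assumed outage window.
from influxdb import InfluxDBClient

MEASUREMENTS = ["interface_rates", "dscp32_rates", "gwsd_rates", "multicast_rates"]
WINDOW = ("2022-05-31T00:00:00Z", "2022-05-31T10:00:00Z")  # assumed outage window

uat = InfluxDBClient(host="uat-influx.example.org", port=8086, database="poller")
prod = InfluxDBClient(host="prod-influx.example.org", port=8086, database="poller")

for measurement in MEASUREMENTS:
    # GROUP BY * keeps each tag set separate so it can be re-written as tags
    query = (
        f'SELECT * FROM "{measurement}" '
        f"WHERE time >= '{WINDOW[0]}' AND time < '{WINDOW[1]}' GROUP BY *"
    )
    result = uat.query(query)
    points = []
    for (name, tags), rows in result.items():
        for row in rows:
            ts = row.pop("time")
            fields = {k: v for k, v in row.items() if v is not None}
            if fields:
                points.append({
                    "measurement": name,
                    "tags": dict(tags or {}),
                    "time": ts,
                    "fields": fields,
                })
    if points:
        prod.write_points(points, batch_size=5000)
```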
Proposed Solution
The core issue appears to be related to the VMware infrastructure, and IT needs to provide a solution.
A previously known issue with Kapacitor tasks stopping because of unchecked errors (cf. POL1-529) meant that the tasks were not executing for longer than necessary.