Downlink message transmission issues
Incident Report for The Things Industries
Postmortem

Today (July 27, 2023), from 04:27 to 05:03 UTC, we faced a minor operational issue with The Things Stack Cloud. It affected the inter-component communication of our infrastructure in the eu1 cluster, mainly the transmission of downlink messages.

Cause

The root cause was that AWS Cloud Map, the service we use for service discovery, was experiencing issues with instance status updates and as a result did not return all of the available instances of a particular service.

Service discovery is used in distributed systems to dynamically address individual instances of a particular service. In our case, one such service is the Gateway Server, which handles the communication with the LoRaWAN gateways.
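
As a rough illustration of the idea, the sketch below uses a toy in-memory registry. It is not the AWS Cloud Map API or The Things Stack code; the `Registry` class, the service name and the addresses are hypothetical. The point is only that a client asks for the healthy instances of a service by name instead of hard-coding addresses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Registry:
    """Toy in-memory service registry (hypothetical, for illustration only)."""

    # Maps a service name to {instance address: healthy flag}.
    instances: dict[str, dict[str, bool]] = field(default_factory=dict)

    def register(self, service: str, address: str, healthy: bool = True) -> None:
        """Record an instance of a service together with its health status."""
        self.instances.setdefault(service, {})[address] = healthy

    def discover(self, service: str) -> list[str]:
        """Return the addresses of instances currently reported as healthy."""
        known = self.instances.get(service, {})
        return sorted(addr for addr, ok in known.items() if ok)


registry = Registry()
registry.register("gateway-server", "10.0.0.1:1884")
registry.register("gateway-server", "10.0.0.2:1884")
registry.register("gateway-server", "10.0.0.3:1884", healthy=False)

# Only the two instances marked healthy are returned to the caller.
print(registry.discover("gateway-server"))
```

If the registry's view of instance status is stale or incomplete, callers see fewer instances than actually exist, even though those instances are running.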

Due to the issues experienced by AWS Cloud Map, our Network Server service was unable to detect the Gateway Server service instances, and in turn was unable to schedule downlinks. Fortunately, a subset of our Network Server instances was still visible, so uplink traffic was still processed successfully, albeit with small delays due to the increased load on the visible instances.
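
The effect of an incomplete discovery result can be sketched with a small simulation. This is a toy model under stated assumptions: the round-robin distribution, the instance names and the message counts are hypothetical and do not describe how The Things Stack actually balances traffic.

```python
from collections import Counter
from itertools import cycle


def spread(messages: int, visible: list[str]) -> Counter:
    """Distribute messages round-robin over the instances discovery returned."""
    if not visible:
        raise LookupError("no instances visible; nothing can be scheduled")
    targets = cycle(visible)
    return Counter(next(targets) for _ in range(messages))


all_ns = ["ns-1", "ns-2", "ns-3", "ns-4"]  # hypothetical Network Server instances

# Normal operation: 1200 uplinks spread over four visible instances.
normal = spread(1200, all_ns)

# Degraded discovery: only two instances are returned, so each takes twice the load.
degraded = spread(1200, all_ns[:2])

# No Gateway Server instances are returned: downlink scheduling has no target.
try:
    spread(10, [])
    downlinks_scheduled = True
except LookupError:
    downlinks_scheduled = False

print(normal["ns-1"], degraded["ns-1"], downlinks_scheduled)  # 300 600 False
```

In this model, a partial result keeps traffic flowing at a higher per-instance load, while an empty result stops scheduling entirely.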

Resolution

Once the AWS Cloud Map issues were fixed, downlink traffic was restored and uplink traffic latency returned to normal values.


Adrian-Ștefan Mareș
Head of Engineering, The Things Industries

Posted Jul 27, 2023 - 17:23 CEST

Resolved
We experienced difficulties with downlink message transmissions between 04:27 and 05:03 UTC due to an issue with our service discovery service.
Posted Jul 27, 2023 - 06:30 CEST