Intermittent Console and API issues on The Things Stack Cloud
Incident Report for The Things Industries
Postmortem

We experienced a minor operational issue with The Things Stack Cloud that reduced external API availability in all of our clusters, mainly affecting Console usage. Traffic processing and delivery were not affected.

Cause

The root cause of this issue was a bug (1, 2) in the HTTP/2 request processing library of Envoy, the service we use for load balancing and request routing.

We upgraded to Envoy v1.29.0, the latest release at the time, as part of our v3.29.0 release, and initially did not see any elevated timeout or error rates in our load balancer. However, over the past two days we received a growing number of reports of request timeouts, and we decided to roll back the Envoy upgrade. We have not observed any elevated timeout rates since.

We monitor failed request rates inside each cluster, close to the components that experience failures, because we consider the rates most accurate there. We do not monitor them at the edge of the cluster, where the load balancer operates. We will look into improving our monitoring so that it accounts for such issues in the future.

Resolution

We have rolled back to the last known working Envoy version, v1.28.1, and can no longer reproduce the sporadic timeouts.


Adrian-Ștefan Mareș
Head of Engineering, The Things Industries

Posted Feb 19, 2024 - 10:44 CET

Resolved
This issue is now resolved.
Posted Feb 17, 2024 - 14:27 CET
Monitoring
A fix has been deployed and we are monitoring the results.
Posted Feb 17, 2024 - 14:08 CET
Identified
We have identified the issue and are deploying a fix to resolve it.
Posted Feb 17, 2024 - 13:26 CET
Investigating
We are investigating Console access and API connection issues on The Things Stack Cloud. We will provide more details as we progress.
Posted Feb 17, 2024 - 12:02 CET
This incident affected: The Things Stack Cloud (Europe 1 (eu1), North America 1 (nam1)).