Yesterday our application was largely unavailable for just over 4 hours. Our application runs on Kubernetes, which automatically restarts failed pods. Initially we noticed that a subset of pods were becoming unresponsive and being restarted quite frequently. The behavior was indicative of a heightened level of traffic, so we scaled up to meet demand. This did not resolve the problem.
We started poring over application logs to find anything that might be out of the ordinary and found that requests to our instant preview cluster were timing out and taking up to 60 seconds. We quickly deployed a fix to decrease the timeouts on requests to the preview cluster. This freed up the hanging connections to serve other requests and brought the main application back into a stable state.
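As a rough illustration of the kind of fix described above, here is a minimal Python sketch of a call to the preview cluster with a short timeout and a fast failure path. It is not our actual code: the function name `fetch_preview`, the 2-second value, and the injectable `opener` parameter are all assumptions made for the example.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical value: the incident only established that ~60 s was far too long.
PREVIEW_TIMEOUT_SECONDS = 2.0


def fetch_preview(url, timeout=PREVIEW_TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Return the preview body, or None if the preview cluster is slow or down."""
    try:
        # The timeout bounds how long this request can hold a connection open.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Fail fast so the calling worker is released to serve other requests.
        return None
```

Returning `None` lets the caller render the page without a preview instead of waiting on a dependency that is not responding.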
On initial investigation of the issues with the preview cluster, we found that all nodes were in a NotReady state. We attempted to deploy a new node pool but were met with an error from Google Cloud stating that we did not have a billing account attached to the project. We confirmed that we did in fact have a billing account attached to the project and called Google Cloud Billing Support for assistance. After a lengthy discussion over the phone, they could not identify any issues on their end and were unable to help us. We spent another hour trying to understand what was causing the error when suddenly the nodes came back online and the Google Cloud error stopped appearing.
As of right now, the cause is unclear, but it appears to have been an error in the Google Cloud Billing system. We will continue to investigate this issue with Google Cloud.