Logging is degraded
Resolved
Dec 13 at 12:38pm IST
Update
The RCA is tracked in the following public Google issue: https://issuetracker.google.com/issues/363324206
They have enabled a liveness probe on the cluster to ensure this doesn't happen again. We are also taking precautionary steps to prevent a recurrence.
Affected services
API Service
Updated
Dec 03 at 11:00pm IST
Update from the Clickhouse team:
While looking at our monitoring tool, we observed a spike in vCPU usage and a high throttling percentage shortly before your pods went down. This happened around 10:06 UTC.
Then, around 10:09 UTC, the pods went down and became unhealthy because they were not responding to K8s readiness probes. We didn't observe any particular change in the usage pattern before the outage, so we are escalating this issue to the cloud team and asking for an investigation.
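The Clickhouse timestamps above are in UTC, while the update timestamps on this page are in IST. As a rough aid for lining the two up, here is a minimal Python sketch of the conversion. It assumes the 10:06 and 10:09 UTC events fall on Dec 03, and it uses a placeholder year because the updates do not state one; the helper name `utc_to_ist` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed UTC+05:30 offset with no daylight saving time.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def utc_to_ist(hour: int, minute: int, year: int = 2024) -> datetime:
    """Convert a Dec 03 wall-clock time in UTC to IST.

    The year is an assumption (the updates do not state one). It does not
    change the clock time because IST has no DST.
    """
    utc_time = datetime(year, 12, 3, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(IST)


# Events reported by the Clickhouse team, given in UTC.
events = {"vCPU spike and throttling": (10, 6), "pods unhealthy": (10, 9)}

for label, (hour, minute) in events.items():
    print(f"{label}: {utc_to_ist(hour, minute).strftime('%I:%M%p %Z')}")
```

Under these assumptions the spike maps to 03:36pm IST and the pods becoming unhealthy to 03:39pm IST, roughly half an hour before this incident was created at 04:07pm IST.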
Affected services
API Service
Updated
Dec 03 at 06:25pm IST
Everything is stable now. We have asked Clickhouse's on-call engineer for a detailed report, which will be attached to the next update.
Affected services
API Service
Updated
Dec 03 at 04:34pm IST
New cluster is up and running. API service is using this new cluster. We are triggering re-indexing now.
Affected services
API Service
Updated
Dec 03 at 04:29pm IST
New cluster is ready and up. Our services will start writing to this cluster. We are triggering migration of old data.
Affected services
API Service
Updated
Dec 03 at 04:15pm IST
The existing cluster is blocked - we are creating a new cluster and migrating all data. ETA 30 mins.
Affected services
API Service
Created
Dec 03 at 04:07pm IST
The Clickhouse-hosted deployment for the us-west-1 region is impacted. We are in touch with them and working on mitigation.
Affected services
API Service