Degraded

Logging is degraded

Dec 03 at 04:07pm IST
Affected services
API Service

Resolved
Dec 13 at 12:38pm IST

Update

The RCA is tracked in the following public Google issue: https://issuetracker.google.com/issues/363324206

They have enabled a liveness probe on the cluster to ensure this doesn't happen again. We are taking some precautionary steps to avoid this entirely in the future.

Updated
Dec 03 at 11:00pm IST

Update from Clickhouse team:

While looking at our monitoring tool, we observed a spike in vCPU usage and a high throttling percentage shortly before your pods went down. This was happening around 10:06 UTC.

Then, around 10:09 UTC, the pods went down and became unhealthy because they were not responding to K8s readiness probes. We didn't observe any particular change in the usage pattern before the outage, so we are escalating this issue to the cloud team and asking for an investigation.

Updated
Dec 03 at 06:25pm IST

Everything is stable now. We have asked Clickhouse's on-call engineer for a detailed report. It will be attached to the next update.

Updated
Dec 03 at 04:34pm IST

New cluster is up and running. API service is using this new cluster. We are triggering re-indexing now.

Updated
Dec 03 at 04:29pm IST

New cluster is ready and up. Our services will start writing to this cluster. We are triggering migration of old data.

Updated
Dec 03 at 04:15pm IST

The existing cluster is blocked - we are creating a new cluster and migrating all data. ETA 30 mins.

Created
Dec 03 at 04:07pm IST

The Clickhouse-hosted deployment for the us-west-1 region is impacted. We are in touch with the Clickhouse team and working on mitigation.