Logging is degraded

Resolved
Dec 13, 2024 at 07:08am UTC

Update

The RCA is tracked in the following public Google issue: https://issuetracker.google.com/issues/363324206

They have enabled a liveness probe on the cluster to ensure this doesn't happen again. We are taking some precautionary steps to avoid this entirely in the future.

Updated
Dec 03, 2024 at 05:30pm UTC

Update from the ClickHouse team:

While looking at our monitoring tool, we observed a spike in vCPU usage and a high throttling percentage shortly before your pods went down. This happened around 10:06 UTC.

Then, around 10:09 UTC, the pods went down and became unhealthy because they were not responding to K8s readiness probes. We didn't observe any particular change in the usage pattern before the outage, so we are escalating this issue to the cloud team and asking for an investigation.

Updated
Dec 03, 2024 at 12:55pm UTC

Everything is stable now. We have asked ClickHouse's on-call engineer for a detailed report, which will be attached to the next update.

Updated
Dec 03, 2024 at 11:04am UTC

New cluster is up and running. API service is using this new cluster. We are triggering re-indexing now.

Updated
Dec 03, 2024 at 10:59am UTC

New cluster is ready and up. Our services will start writing to this cluster. We are triggering migration of old data.

Updated
Dec 03, 2024 at 10:45am UTC

The existing cluster is blocked - we are creating a new cluster and migrating all data. ETA 30 mins.

Created
Dec 03, 2024 at 10:37am UTC

The ClickHouse-hosted deployment for the us-west-1 region is impacted. We are in touch with them and working on mitigation.