Previous incidents

January 2025
Jan 16, 2025
1 incident

AI Models are down

Downtime

Resolved Jan 16 at 12:20pm IST

All models have recovered.

3 previous updates

December 2024
Dec 03, 2024
1 incident

Logging is degraded

Degraded

Resolved Dec 13 at 12:38pm IST

Update

The root cause analysis (RCA) is tracked in the following public Google issue: https://issuetracker.google.com/issues/363324206

Google has enabled a liveness probe on the cluster to ensure this doesn't happen again. We are also taking precautionary steps on our side to avoid this in the future.

6 previous updates

November 2024
Nov 19, 2024
2 incidents

Dashboard, API Service, and 1 other service are down

Downtime

Resolved Nov 19 at 05:57pm IST

AI Models have recovered.

4 previous updates

Dashboard, API Service, and 1 other service are down

Downtime

Resolved Nov 19 at 05:21pm IST

All services are up. One node in the cluster went down, which caused too many reservation loops.

We have mitigated this issue.

4 previous updates