Previous incidents
November 2024
Nov 19, 2024
2 incidents
Dashboard, API Service, and 1 other service are down
Downtime
Resolved Nov 19 at 05:57pm IST
AI Models recovered.
Dashboard, API Service, and 1 other service are down
Downtime
Resolved Nov 19 at 05:21pm IST
All services are up. One node in the cluster went down, which triggered too many reservation loops.
We have mitigated this issue.
October 2024
Oct 24, 2024
1 incident
Degraded test run and log indexing
Degraded
Resolved Oct 24 at 10:48pm IST
Quick RCA -
- CPU for one of our real-time DBs was being throttled even though we had allocated more CPU. This is now fixed, and things should start working normally.
- We are keeping an eye on the overall setup.
Oct 04, 2024
2 incidents
Degraded availability (test runs, prompt playground)
Degraded
Resolved Oct 04 at 08:39pm IST
We are up now.
RCA
- We use a third-party library to acquire distributed locks, and it expects specific Lua scripts to be cached in Redis. At 6:00 AM PT today, we found that the Redis cache had been wiped out by disk corruption, which deleted these scripts.
- We learned that the library does not re-cache the scripts on its own, so we had to reload them manually. Since the reload, the system has been working as expected.
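
The failure mode in this RCA can be sketched in a few lines. Redis runs a cached script by its SHA1 via EVALSHA and returns a NOSCRIPT error when that script is no longer in the cache; SCRIPT LOAD puts it back. The example below is a minimal illustration, not our actual lock library: `FakeRedis`, `NoScriptError`, `SelfHealingScript`, and `ACQUIRE_LOCK` are hypothetical names, and the in-memory client only imitates the script cache so that the example runs without a server.

```python
import hashlib


class NoScriptError(Exception):
    """Stand-in for the NOSCRIPT error Redis returns for an unknown SHA1."""


class FakeRedis:
    """Hypothetical in-memory stand-in that imitates only the script cache."""

    def __init__(self):
        self.scripts = {}  # sha1 -> script source
        self.store = {}    # key -> value

    def script_load(self, source):
        # Like SCRIPT LOAD: cache the script and return its SHA1 digest.
        sha = hashlib.sha1(source.encode()).hexdigest()
        self.scripts[sha] = source
        return sha

    def script_flush(self):
        # Simulates losing the script cache (restart, corruption, flush).
        self.scripts.clear()

    def evalsha(self, sha, key, token):
        # Like EVALSHA: fails if the script is not cached.
        if sha not in self.scripts:
            raise NoScriptError(sha)
        # The fake does not run Lua; it mimics a "set if absent" lock script.
        if key in self.store:
            return 0
        self.store[key] = token
        return 1


# Illustrative lock script (assumed, not the library's real script).
ACQUIRE_LOCK = "return redis.call('SETNX', KEYS[1], ARGV[1])"


class SelfHealingScript:
    """Calls a cached script and reloads it once if the cache was lost."""

    def __init__(self, client, source):
        self.client = client
        self.source = source
        self.sha = client.script_load(source)

    def __call__(self, key, token):
        try:
            return self.client.evalsha(self.sha, key, token)
        except NoScriptError:
            # Script cache was wiped: reload the script, then retry once.
            self.sha = self.client.script_load(self.source)
            return self.client.evalsha(self.sha, key, token)


client = FakeRedis()
acquire = SelfHealingScript(client, ACQUIRE_LOCK)
first = acquire("lock:run-1", "worker-a")        # script is cached
client.script_flush()                            # cache is lost
after_flush = acquire("lock:run-2", "worker-b")  # reloads, then succeeds
```

A client that only ever sends the SHA1 and never falls back to reloading the script would keep failing after the cache is lost, which is consistent with what we observed until the scripts were reloaded by hand.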
Dashboard and API Service are down
Downtime
Resolved Oct 04 at 02:06pm IST
API Service recovered.