Previous incidents
Dashboard, API Service, and 1 other service are down
Resolved Nov 19 at 05:57pm IST
AI Models recovered.
Dashboard, API Service, and 1 other service are down
Resolved Nov 19 at 05:21pm IST
All services are up. One node in the cluster went down, which triggered too many reservation loops.
We have mitigated this issue.
Degraded test run and log indexing
Resolved Oct 24 at 10:48pm IST
Quick RCA -
- CPU for one of our real-time DBs was being throttled even though we had allocated more CPU. This is now fixed, and things should start working normally.
- We are keeping an eye on the overall setup.
Degraded availability (test runs, prompt playground)
Resolved Oct 04 at 08:39pm IST
We are up now.
RCA
- We use a third-party library to acquire distributed locks, and it expects specific Lua scripts to be cached in Redis. At 6:00 AM PT today, we realized that the Redis cache had been lost due to disk corruption, which deleted these scripts.
- We learned that the library does not reindex the scripts on its own, so we had to update them manually. With the scripts updated, the system is working as expected.
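For readers unfamiliar with this failure mode: Redis identifies a cached script by the SHA1 of its body, and calling a script by hash fails with a NOSCRIPT error once the cache is gone. The sketch below shows one common way a caller can recover by reloading the script and retrying. It is an illustration only, not the third-party library's code or our actual fix. `FakeScriptCache`, `run_script`, and `UNLOCK_SCRIPT` are hypothetical names, and the fake client only mimics the script-cache behaviour so that the example runs without a Redis server.

```python
import hashlib


class NoScriptError(Exception):
    """Stand-in for the NOSCRIPT error Redis returns for an unknown script hash."""


class FakeScriptCache:
    """Hypothetical in-memory stand-in for a Redis server's script cache."""

    def __init__(self):
        self._scripts = {}

    def script_load(self, script):
        # Redis keys cached scripts by the SHA1 hex digest of the script body.
        sha = hashlib.sha1(script.encode()).hexdigest()
        self._scripts[sha] = script
        return sha

    def script_flush(self):
        # Simulates losing the script cache (for example after a restart).
        self._scripts.clear()

    def evalsha(self, sha, numkeys, *keys_and_args):
        if sha not in self._scripts:
            raise NoScriptError("NOSCRIPT No matching script")
        # A real server would execute the Lua script here; this stand-in
        # only reports which script was invoked and with what arguments.
        return ("ran", sha, keys_and_args)


def run_script(client, script, numkeys, *keys_and_args):
    """Call a script by hash; reload it once and retry if the cache lost it."""
    sha = hashlib.sha1(script.encode()).hexdigest()
    try:
        return client.evalsha(sha, numkeys, *keys_and_args)
    except NoScriptError:
        client.script_load(script)
        return client.evalsha(sha, numkeys, *keys_and_args)


# Hypothetical unlock script: delete the lock only if the caller still owns it.
UNLOCK_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""

client = FakeScriptCache()
client.script_load(UNLOCK_SCRIPT)
client.script_flush()  # the cached scripts disappear
result = run_script(client, UNLOCK_SCRIPT, 1, "lock:job", "token")
```

With this pattern a lost script cache costs one extra round trip on the next call instead of requiring a manual reload.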
Dashboard and API Service are down
Resolved Oct 04 at 02:06pm IST
API Service recovered.
Degraded performance for test runs
Resolved Sep 25 at 12:37pm IST
The issue is mitigated. We are now working on a long-term fix.