Previous incidents

November 2024
Nov 19, 2024
2 incidents

Dashboard, API Service, and 1 other service are down

Downtime

Resolved Nov 19 at 05:57pm IST

AI Models recovered.


Dashboard, API Service, and 1 other service are down

Downtime

Resolved Nov 19 at 05:21pm IST

All services are up. One node in the cluster went down, which triggered too many reservation loops.

We have mitigated this issue.


October 2024
Oct 24, 2024
1 incident

Degraded test run and log indexing

Degraded

Resolved Oct 24 at 10:48pm IST

Quick RCA -

  • CPU for one of our real-time DBs was being throttled even though we had allocated more CPU. This is now fixed, and things should start working normally.
  • We are keeping an eye on the overall setup.


Oct 04, 2024
2 incidents

Degraded availability (test runs, prompt playground)

Degraded

Resolved Oct 04 at 08:39pm IST

We are up now.

RCA

  • We use a third-party library to acquire distributed locks, and it expects specific Lua scripts to be cached in Redis. At 6:00 AM PT today, we found that the Redis cache had been wiped because of disk corruption, which deleted these scripts.
  • We learned that the library does not reload the scripts on its own, so we had to reload them manually. With the scripts reloaded, the system is working as expected.
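
For readers who hit the same failure mode, here is a minimal sketch of one common guard: call the script by its SHA1 and, if the server reports that the script is missing, load it again and retry once. This is not the third-party library's code or our actual fix. `FakeRedis`, `NoScriptError`, `RELEASE_LOCK`, and `evalsha_with_reload` are hypothetical names used only for illustration. The stand-in client mirrors the `evalsha`, `script_load`, and `script_flush` calls found in redis-py, where the corresponding exception would be `redis.exceptions.NoScriptError`.

```python
import hashlib


class NoScriptError(Exception):
    """Stand-in for the client's missing-script error (assumed name)."""


class FakeRedis:
    """Tiny in-memory stand-in for a Redis client; it only models the script cache."""

    def __init__(self):
        self._scripts = {}

    def script_load(self, script):
        # Redis identifies a cached script by the SHA1 of its source text.
        sha = hashlib.sha1(script.encode()).hexdigest()
        self._scripts[sha] = script
        return sha

    def script_flush(self):
        # Simulates the cache being wiped.
        self._scripts.clear()

    def evalsha(self, sha, numkeys, *keys_and_args):
        if sha not in self._scripts:
            raise NoScriptError("No matching script. Please use EVAL.")
        return 1  # pretend the script ran successfully


# Hypothetical lock-release script, for illustration only.
RELEASE_LOCK = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def evalsha_with_reload(client, script, keys, args, no_script_error=NoScriptError):
    """Run a cached script by SHA1; reload it and retry once if it is missing."""
    sha = hashlib.sha1(script.encode()).hexdigest()
    try:
        return client.evalsha(sha, len(keys), *keys, *args)
    except no_script_error:
        # The script cache was lost (restart, flush, failover): load it again.
        client.script_load(script)
        return client.evalsha(sha, len(keys), *keys, *args)


client = FakeRedis()
client.script_load(RELEASE_LOCK)
client.script_flush()  # simulate the scripts disappearing
result = evalsha_with_reload(client, RELEASE_LOCK, ["lock:job"], ["token"])
```

The retry is limited to a single attempt so that a script that fails to load surfaces as an error instead of looping.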


Dashboard and API Service are down

Downtime

Resolved Oct 04 at 02:06pm IST

API Service recovered.


September 2024
Sep 25, 2024
1 incident

Degraded performance for test runs

Degraded

Resolved Sep 25 at 12:37pm IST

The issue is mitigated. We are now working on a long-term fix.
