Previous incidents
Log Ingestion Delay Due to Google Pub/Sub Slowness
Resolved Aug 28 at 03:00pm IST
This is now resolved, and ingestion is back to normal speed.
Log ingestion has slowed down
Resolved Aug 13 at 01:43am IST
The node has recovered and ingestion has resumed.
[Downstream service issue] Log ingestion buffer is taking more time than expected
Resolved Aug 06 at 01:19am IST
We’ve received an update from the ClickHouse team. Here’s the crux of the issue:
"Due to memory starvation, other processes in your cluster are starting to fail, resulting in degraded performance, likely including the failed writes you are experiencing."
All queued logs have been written to the disk and the instance is back to normal.
We’re still keeping an eye on our data pipelines. We’ve also put together a checklist with the team to help prevent this from happening again.
Degraded service
Resolved Jul 17 at 08:42pm IST
Latency is back to normal. We will keep an eye on the system for the next few hours.
Dashboard is down
Resolved Jul 09 at 09:40pm IST
Postmortem:
A failover of one of the Kafka brokers triggered a cascading effect on an event queue, which led to degraded dashboard rendering and increased latency on some API endpoints.
We fully recovered within 3 minutes, but the system experienced intermittent degradation for the following 20 minutes. To prevent recurrence, we have added an extra replica and increased pod affinity.
We apologize for the inconvenience.
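For readers curious what the mitigation above looks like in practice, here is a minimal sketch of a Kubernetes deployment fragment with an extra replica and an affinity rule that spreads pods across nodes. All names are hypothetical and our actual manifests differ; this is an illustration, not our production config:

```yaml
# Hypothetical fragment illustrating the mitigation: one additional
# replica, plus an anti-affinity rule so that replicas of the
# event-queue consumer are scheduled onto different nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-queue-consumer    # hypothetical workload name
spec:
  replicas: 3                   # raised by one as the extra replica
  selector:
    matchLabels:
      app: event-queue-consumer
  template:
    metadata:
      labels:
        app: event-queue-consumer
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: event-queue-consumer
              topologyKey: kubernetes.io/hostname  # spread across nodes
```

With this rule in place, a single node or broker failover takes out at most one replica, so the event queue keeps draining while the failed pod reschedules.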
API Service is down
Resolved Jun 08 at 09:09am IST
API Service recovered.