[Downstream service issue] Log ingestion buffer is taking more time than expected
Resolved
Aug 06 at 01:19am IST
We’ve received an update from the ClickHouse team. Here’s the crux of the issue:
"Due to memory starvation, other processes in your cluster are starting to fail, resulting in degraded performance, likely including the failed writes you are experiencing."
All queued logs have been written to disk, and the instance is back to normal.
We’re still keeping an eye on our data pipelines. We’ve also put together a checklist with the team to help prevent this from happening again.
Affected services
Updated
Aug 06 at 01:10am IST
The ClickHouse team is still looking into it. Rest assured, all the data is safe in our data lake. We’ve marked the timestamps where the slowdown started. Once the ClickHouse team gives us the green light, we’ll re-audit the volume and replay ingestion if any logs are missing.
Updated
Aug 06 at 12:58am IST
Ingestion is now back to normal speed. We have asked the ClickHouse team for an RCA, which will be attached to this degradation update.
Created
Aug 06 at 12:42am IST
ClickHouse queries are taking longer than expected. We are working with the ClickHouse team to debug the issue and will keep this status updated.