New top story on Hacker News: Show HN: I built an open-source tool to make on-call suck less


Sunday, July 28, 2024


Show HN: I built an open-source tool to make on-call suck less
18 by aray07 | 2 comments on Hacker News.
Hey HN, I am building an open source platform to make on-call better and less stressful for engineers. We are building a tool that can silence alerts and help with debugging and root cause analysis. We also want to automate tedious parts of being on-call (running runbooks manually, answering questions on Slack, dealing with PagerDuty).

Here is a quick video of how it works: https://youtu.be/m_K9Dq1kZDw

I hated being on-call for a couple of reasons:

* Alert volume: The number of alerts kept increasing over time, and it was hard to maintain existing alerts. This led to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved 5 minutes later.
* Debugging: Debugging an alert or a customer support ticket would require me to gain context on a service that I might not have worked on before. The companies I worked at used many observability tools, which made debugging challenging. There was always time pressure to resolve issues quickly.

There were some more tangential issues that used to take up a lot of on-call time:

* Support: Answering questions from other teams. A lot of the time these questions were repetitive and had been answered before.
* Dealing with PagerDuty: These tools are hard to use, e.g. it was hard to schedule an override in PD or set up holiday schedules.

I am building an on-call tool that is Slack-native, since Slack has become the de facto tool for on-call engineers.

We heard from a lot of engineers that maintaining good alert hygiene is a challenge. To start off, Opslane integrates with Datadog and can classify alerts as actionable or noisy.

We analyze your alert history across various signals:

1. Alert frequency
2. How quickly the alerts have resolved in the past
3. Alert priority
4. Alert response history

Our classification is conservative, and it can be tuned as teams gain more confidence in the predictions. We want to make sure that you aren't accidentally missing a critical alert.
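
To make the idea concrete, here is a minimal sketch of how the four signals above could be combined into a conservative actionable/noisy decision. This is an illustration only and not Opslane's actual implementation: the `AlertHistory` structure, the `classify` function, the priority convention, and every threshold are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class AlertHistory:
    """Hypothetical summary of one alert's past behaviour (not an Opslane type)."""
    fires_per_week: float             # signal 1: alert frequency
    median_minutes_to_resolve: float  # signal 2: how quickly it resolved in the past
    priority: int                     # signal 3: 1 = highest priority (assumed convention)
    acted_on_ratio: float             # signal 4: share of past firings a human responded to


def classify(history: AlertHistory,
             min_noisy_fires_per_week: float = 10.0,
             max_noisy_resolve_minutes: float = 5.0,
             max_noisy_acted_on_ratio: float = 0.05) -> str:
    """Return "noisy" only when every signal agrees; otherwise "actionable"."""
    # Conservative default: never mark the highest-priority alerts as noisy.
    if history.priority == 1:
        return "actionable"
    looks_noisy = (
        history.fires_per_week >= min_noisy_fires_per_week
        and history.median_minutes_to_resolve <= max_noisy_resolve_minutes
        and history.acted_on_ratio <= max_noisy_acted_on_ratio
    )
    return "noisy" if looks_noisy else "actionable"


# Made-up example data: a flapping alert and a rare but real one.
flappy = AlertHistory(fires_per_week=40, median_minutes_to_resolve=4,
                      priority=3, acted_on_ratio=0.0)
disk_full = AlertHistory(fires_per_week=1, median_minutes_to_resolve=90,
                         priority=2, acted_on_ratio=1.0)

print(classify(flappy))     # noisy
print(classify(disk_full))  # actionable
```

Requiring all signals to agree before calling an alert noisy is one way to stay conservative, and exposing the thresholds as keyword arguments mirrors the idea that teams can tune the classifier as they gain confidence.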
Additionally, we generate a weekly report based on all your alerts to give you a picture of your overall alert hygiene.

What’s next?

1. Building more integrations (Prometheus, Splunk, Sentry, PagerDuty) to continue making on-call quality of life better
2. Making debugging and root cause analysis easier
3. Runbook automation

We’re still pretty early in development and we want to make on-call quality of life better. Any feedback would be much appreciated!
