
Original link: https://news.ycombinator.com/item?id=41086620

An engineer is building an open source platform aimed at making on-call more efficient and less stressful. The goal is a tool that minimizes unnecessary alerts, streamlines debugging, and simplifies tasks such as running runbooks, answering questions from other teams, and interacting with PagerDuty. The tool integrates with Datadog, with further observability integrations (Prometheus, Splunk, Sentry, PagerDuty) planned. To do this, it classifies alerts as actionable or noisy, taking into account factors such as alert frequency, how quickly alerts have resolved in the past, alert priority, and previous response history. To avoid missing critical alerts, the classification starts out conservative and can be tuned as teams gain confidence in its predictions. A weekly report detailing overall alert hygiene is also provided. As development progresses, the focus is on building further integrations, making debugging and root cause analysis easier, and automating runbooks. Constructive criticism and feedback from other developers is encouraged.


Original text
Hey HN,

I am building an open source platform to make on-call better and less stressful for engineers. We are building a tool that can silence alerts and help with debugging and root cause analysis. We also want to automate tedious parts of being on-call (running runbooks manually, answering questions on Slack, dealing with PagerDuty). Here is a quick video of how it works: https://youtu.be/m_K9Dq1kZDw

I hated being on-call for a couple of reasons:

* Alert volume: The number of alerts kept increasing over time, and it was hard to maintain the existing ones. This led to a lot of noisy, unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved five minutes later.

* Debugging: Debugging an alert or a customer support ticket would require me to gain context on a service I might not have worked on before. The companies I worked at used many observability tools, which made debugging challenging. There was always time pressure to resolve issues quickly.

There were also some more tangential issues that used to take up a lot of on-call time:

* Support: Answering questions from other teams. A lot of the time these questions were repetitive and had already been answered before.

* Dealing with PagerDuty: These tools are hard to use. For example, it was hard to schedule an override in PD or set up holiday schedules.

I am building an on-call tool that is Slack-native since that has become the de-facto tool for on-call engineers.
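
Since the tool is Slack-native, one way to picture the interaction is a bot that posts each alert into a channel with feedback buttons and records how responders judge it. The following is a minimal sketch using Slack's Bolt for Python (slack_bolt); the button action IDs, Socket Mode setup, and message layout are illustrative assumptions rather than Opslane's actual implementation.

    # Sketch of a Slack-native alert flow with Bolt for Python.
    # Action IDs and message layout are hypothetical.
    import os

    from slack_bolt import App
    from slack_bolt.adapter.socket_mode import SocketModeHandler

    app = App(token=os.environ["SLACK_BOT_TOKEN"])

    def post_alert(channel: str, title: str) -> None:
        # Post an alert into Slack with buttons the responder can use to
        # mark it actionable or noisy.
        app.client.chat_postMessage(
            channel=channel,
            text=title,
            blocks=[
                {"type": "section",
                 "text": {"type": "mrkdwn", "text": f":rotating_light: {title}"}},
                {"type": "actions", "elements": [
                    {"type": "button", "action_id": "mark_actionable",
                     "text": {"type": "plain_text", "text": "Actionable"}},
                    {"type": "button", "action_id": "mark_noisy",
                     "text": {"type": "plain_text", "text": "Noisy"}},
                ]},
            ],
        )

    @app.action("mark_noisy")
    def on_noisy(ack, body, say):
        # Acknowledge the button click and record the feedback in a thread.
        ack()
        say(text="Noted - this will count toward the noisy signal.",
            thread_ts=body["message"]["ts"])

    @app.action("mark_actionable")
    def on_actionable(ack, body, say):
        ack()
        say(text="Noted - keeping this alert as actionable.",
            thread_ts=body["message"]["ts"])

    if __name__ == "__main__":
        SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()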

We heard from a lot of engineers that maintaining good alert hygiene is a challenge.

To start off, Opslane integrates with Datadog and can classify alerts as actionable or noisy.

We analyze your alert history across various signals:

1. Alert frequency

2. How quickly the alerts have resolved in the past

3. Alert priority

4. Alert response history

Our classification is conservative, and it can be tuned as teams gain confidence in the predictions. We want to make sure that you aren't accidentally missing a critical alert.
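
As a rough illustration of how those four signals could be combined, here is a hypothetical scoring sketch; the weights, thresholds, and record fields are assumptions made for the example and not the actual classification logic.

    # Hypothetical noise scoring over the four signals listed above.
    from dataclasses import dataclass
    from datetime import timedelta
    from typing import List

    @dataclass
    class AlertEvent:
        monitor_id: str
        priority: int               # 1 = highest priority
        time_to_resolve: timedelta
        acknowledged: bool          # did anyone respond before it resolved?

    def noise_score(history: List[AlertEvent]) -> float:
        # Return a 0..1 score for one monitor's history; higher = noisier.
        if not history:
            return 0.0

        # 1. Alert frequency: monitors that fire very often tend to be noisy.
        frequency = min(len(history) / 50.0, 1.0)

        # 2. Resolution speed: alerts that auto-resolve within minutes suggest flapping.
        fast = sum(1 for e in history if e.time_to_resolve < timedelta(minutes=5))
        auto_resolve_rate = fast / len(history)

        # 3. Priority: low-priority monitors get a higher noise prior.
        low_priority_rate = sum(1 for e in history if e.priority >= 3) / len(history)

        # 4. Response history: alerts nobody acknowledges are rarely actionable.
        unacked_rate = sum(1 for e in history if not e.acknowledged) / len(history)

        return (0.25 * frequency + 0.35 * auto_resolve_rate
                + 0.15 * low_priority_rate + 0.25 * unacked_rate)

    def classify(history: List[AlertEvent], threshold: float = 0.8) -> str:
        # Conservative by default: only a high score is labeled noisy; the
        # threshold can be lowered as a team gains confidence in the predictions.
        return "noisy" if noise_score(history) >= threshold else "actionable"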

Additionally, we generate a weekly report based on all your alerts to give you a picture of your overall alert hygiene.
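
To make that concrete, here is a small hypothetical example of the kind of rollup such a report could contain; the record schema and output format are assumptions, not the real report.

    # Hypothetical weekly alert-hygiene rollup.
    from collections import Counter
    from typing import Dict, List

    def weekly_report(alerts: List[Dict]) -> str:
        # Each record carries 'monitor', 'label' and 'woke_someone_up' keys
        # (an illustrative schema, not the real data model).
        total = len(alerts)
        noisy = sum(1 for a in alerts if a["label"] == "noisy")
        off_hours = sum(1 for a in alerts if a["woke_someone_up"])
        top = Counter(a["monitor"] for a in alerts).most_common(3)

        lines = [
            "*Weekly alert hygiene report*",
            f"Total alerts: {total}",
            f"Classified as noisy: {noisy} ({noisy / total:.0%})" if total else "Classified as noisy: 0",
            f"Off-hours pages: {off_hours}",
            "Top firing monitors: " + ", ".join(f"{m} ({n})" for m, n in top),
        ]
        return "\n".join(lines)

    if __name__ == "__main__":
        sample = [
            {"monitor": "api-latency", "label": "noisy", "woke_someone_up": True},
            {"monitor": "api-latency", "label": "noisy", "woke_someone_up": False},
            {"monitor": "db-disk-full", "label": "actionable", "woke_someone_up": True},
        ]
        print(weekly_report(sample))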

What’s next?

1. Building more integrations (Prometheus, Splunk, Sentry, PagerDuty) to continue making on-call quality of life better

2. Making debugging and root cause analysis easier

3. Runbook automation

We’re still pretty early in development and we want to make on-call quality of life better. Any feedback would be much appreciated!
