OpenClaw recipe

Datadog and PagerDuty Alert Noise Audit and Threshold Tuner

aka “Alert Tuning Audit”

Find noisy alerts and propose threshold changes backed by real paging data

Alert fatigue is one of the fastest ways to burn out an on-call rotation. The alerts waking people up the most are often old monitors nobody has revisited. This recipe audits Datadog monitor history and PagerDuty incidents, ranks noisy alerts by fire rate and self-resolution rate, and proposes threshold changes with the evidence behind each recommendation.

House Recipe · Work · 15 min

PROMPT

Run an alert tuning audit for a DevOps or SRE engineer.

Goal: Help me find the alerts creating the most noise on my team's rotation and propose data-backed threshold changes I can safely review and ship.

Ask me for:
- Lookback window in days (default 90)
- Service or team scope, if I want to limit the audit
- My team's noise tolerance: how many fires per week is acceptable per alert
- Whether to open a GitHub PR or just produce a report
- The repo where monitor config lives, if I want a PR

Use available integrations this way:
- Datadog: list monitors, query firing history, and pull metric values during fires
- PagerDuty: cross-reference which alerts paged a human and which auto-resolved
- GitHub: locate monitor config files and prepare a PR with proposed threshold changes
- Linear: create tickets for alerts that need owner action beyond a threshold change
- Slack: post a summary of findings to the team channel
- Google Docs: write the audit report

Output:
1. Top 10 noisiest alerts ranked by fire rate and self-resolution rate
2. Alerts that never fire in the lookback window, marked for review
3. Per-alert proposed threshold with the data behind it (percentile, baseline, suggested value)
4. A GitHub PR with the monitor config changes
5. Linear tickets for any alerts that need redesign, not just tuning
6. A Slack summary for the team channel
7. The full audit report in Google Docs

Rules:
- Never auto-merge the PR; human review is required
- Do not propose deleting an alert without a documented reason
- Show the data behind every threshold proposal; no opaque recommendations
- If an alert is tied to an SLO, flag it; SLO alerts get reviewed differently
- Distinguish flapping alerts from genuinely noisy ones; they need different fixes

How It Works

This recipe runs a noise audit across your alerting stack. It pulls firing history from Datadog and PagerDuty, identifies monitors that fire often or self-resolve quickly, and proposes threshold changes based on observed signal rather than guesswork.
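The core ranking step can be sketched in a few lines. This is a minimal illustration, not the recipe's actual implementation: the `Fire` record, the 5-minute self-resolution cutoff, and the field names are all assumptions standing in for data the agent would pull from Datadog and PagerDuty.

```python
from dataclasses import dataclass

@dataclass
class Fire:
    monitor: str
    duration_min: float   # how long the alert stayed triggered
    paged_human: bool     # cross-referenced from PagerDuty

def noise_rank(fires, lookback_days=90):
    """Rank monitors by fire rate and self-resolution rate."""
    weeks = lookback_days / 7
    stats = {}
    for f in fires:
        s = stats.setdefault(f.monitor, {"fires": 0, "self_resolved": 0})
        s["fires"] += 1
        # A fire that cleared quickly without paging anyone is likely noise.
        if f.duration_min < 5 and not f.paged_human:
            s["self_resolved"] += 1
    ranked = [
        {
            "monitor": monitor,
            "fires_per_week": round(s["fires"] / weeks, 2),
            "self_resolve_rate": round(s["self_resolved"] / s["fires"], 2),
        }
        for monitor, s in stats.items()
    ]
    # Noisiest first: high fire rate, then high self-resolution rate.
    ranked.sort(key=lambda r: (r["fires_per_week"], r["self_resolve_rate"]),
                reverse=True)
    return ranked
```

A monitor that fires daily and clears itself in two minutes sorts to the top; a monitor that fires rarely but always pages a human sorts to the bottom, which is exactly the signal you want when deciding what to tune first.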

What You Get

  • Top 10 noisiest alerts ranked by fire rate and self-resolution rate
  • Proposed threshold per alert with supporting data
  • Alerts that have not fired in the lookback window
  • Alerts that consistently require human action
  • GitHub PR draft updating monitor config
  • Slack post summarizing findings for the team channel
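The per-alert threshold proposal boils down to comparing the current threshold against a high percentile of the metric's observed baseline. A minimal sketch, assuming nearest-rank percentiles and an illustrative 20% headroom factor (neither is prescribed by the recipe):

```python
def propose_threshold(samples, current_threshold, percentile=99, headroom=1.2):
    """Suggest a threshold just above the observed baseline.

    samples: metric values pulled from Datadog during the lookback window.
    headroom: multiplier so routine peaks do not trigger the alert.
    """
    values = sorted(samples)
    # Nearest-rank percentile of the baseline.
    idx = min(len(values) - 1, int(len(values) * percentile / 100))
    baseline = values[idx]
    suggested = round(baseline * headroom, 2)
    return {
        f"baseline_p{percentile}": baseline,
        "current": current_threshold,
        "suggested": suggested,
        # Only recommend a change when the baseline routinely
        # crosses the current threshold.
        "change_recommended": suggested > current_threshold,
    }
```

Every proposal carries its evidence (the percentile, the baseline value, and the suggested number), which is what makes the resulting PR reviewable rather than opaque.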

Setup Steps

  1. Ask OpenClaw to run the "Alert Tuning Audit" recipe using the prompt below
  2. Connect Datadog, PagerDuty, GitHub, Linear, Slack, and Google so the agent can audit and propose changes
  3. Specify the time window (90 days is the default; use longer for slow-firing alerts)
  4. Review the proposed thresholds. The data is real, but the decision is yours.
  5. Open the GitHub PR and have a teammate review before merging

Tips

  • Self-resolution under 5 minutes usually means the threshold is too sensitive.
  • Alerts that never fire are not free: they create false confidence and config debt.
  • Tune one alert at a time, then watch it for two weeks before touching the next.
  • If the runbook references the threshold, update the runbook in the same PR.
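The flapping-vs-noisy distinction from the prompt's rules also deserves a concrete test: flapping monitors cycle trigger/recover in quick succession (the fix is a recovery window or hysteresis), while merely noisy ones fire short-lived but independent alerts (the fix is a higher threshold). A rough classifier, where the 30-minute gap and 5-minute duration cutoffs are illustrative assumptions:

```python
def classify(fire_start_times_min, durations_min,
             flap_gap_min=30, short_fire_min=5):
    """Label a monitor as flapping, noisy, or healthy."""
    # Flapping: most consecutive fires are separated by small gaps.
    gaps = [b - a for a, b in zip(fire_start_times_min,
                                  fire_start_times_min[1:])]
    rapid = sum(1 for g in gaps if g < flap_gap_min)
    if gaps and rapid / len(gaps) > 0.5:
        return "flapping"   # fix: add a recovery window or hysteresis
    # Noisy: most fires are short-lived even though they are spaced out.
    short = sum(1 for d in durations_min if d < short_fire_min)
    if durations_min and short / len(durations_min) > 0.5:
        return "noisy"      # fix: raise the threshold
    return "healthy"
```

Raising the threshold on a flapping monitor just moves the flap point; applying hysteresis to a merely noisy one hides real fires longer. Separating the two keeps each fix targeted.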
Tags: #devops #sre #alerting #monitoring #on_call #reliability

Related Recipes

On-Call Shift Handoff Brief

Hand off the pager without making the next engineer reconstruct the shift

On-call handoffs usually happen fast, right when context is easiest to lose. The next engineer starts their shift digging through incidents, deploys, noisy alerts, and half-finished Slack threads just to understand the current state. This recipe pulls the shift's PagerDuty incidents, deploy activity, Datadog alerts, and open Slack threads into a clean handoff brief the next on-call can use immediately.

Work · 5 min

Runbook Freshness Checker

Find stale runbooks before they fail during an incident

Runbooks go stale quietly. Deploy paths change, services move, dependencies get renamed, and the doc only gets noticed at 3 a.m. when it points to infrastructure that no longer exists. This recipe finds runbooks that have not kept up with the services they describe and proposes updates based on recent service changes.

Work · 10 min

VFR Gatekeeper

Stop audio drift by quarantining variable-frame-rate clips at ingest

Audio slowly drifts out of sync or randomly desyncs in your timeline when footage is variable frame rate — common with iPhone footage, screen recordings, and some OBS workflows. This recipe catches VFR clips at ingest, transcodes them to constant frame rate, and quarantines the originals so drift never reaches your edit.

Creative · 15 min setup

CLAWBITE AI

Local-first AI assistant that automates small daily tasks safely on your device

A personal, local-first AI assistant that automates small daily tasks—organizing files, setting reminders, and monitoring system events—without touching sensitive data or taking risky actions without your approval.

Personal · 5 min