Alert System: Real-time Monitoring with Slack and Anomaly Detection

🎯 Overview

Collecting and storing data is not enough.
The real value of a monitoring system comes from real-time alerts and anomaly detection.

This post implements:

  • Slack-based alert notifications
  • Threshold-based alert rules
  • Basic anomaly detection strategy

🏗️ Alert System Architecture

[InfluxDB (Raw Data)]
        ↓
[Python Alert Engine]
        ↓
[Slack Webhook]
        ↓
[User Notification]


🧠 Alert Strategy

We define two types of alerts:

1. Threshold-based Alerts

  • Simple and predictable
  • Example:
    • CPU Temperature > 80°C
    • Fan Speed < 500 RPM

2. Anomaly-based Alerts

  • Detect abnormal patterns
  • Based on statistical deviation

📡 Slack Integration

Why Slack?

  • Real-time notification
  • Easy integration (Webhook)
  • Lightweight and flexible

Webhook Setup

  1. Create a Slack app
  2. Enable Incoming Webhooks for the app
  3. Copy the generated webhook URL
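
Once the webhook URL exists, posting to it is a plain HTTP POST with a JSON body containing a `text` field. The following is a minimal sketch using only the standard library; the function names `build_slack_request` and `send_slack_message` and the 5-second timeout are illustrative assumptions, not part of the original setup.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the HTTP POST that delivers `text` to a Slack Incoming Webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code (200 on success)."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The webhook URL works like a credential, so it is probably best read from an environment variable or secrets store rather than committed with the code.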

Example Message Format

🔥 ALERT: High CPU Temperature
Host: pc-01
Value: 85°C
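
The message format above can be generated by a small helper so every alert looks the same. This is a sketch only; `format_alert` and its parameters are hypothetical names chosen for illustration.

```python
def format_alert(title: str, host: str, value: float, unit: str = "°C") -> str:
    """Render an alert as a short multi-line Slack message."""
    return "\n".join([
        f"🔥 ALERT: {title}",
        f"Host: {host}",
        f"Value: {value:g}{unit}",   # :g drops a trailing ".0" from whole numbers
    ])


message = format_alert("High CPU Temperature", "pc-01", 85)
```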


🐍 Python Alert Engine

Workflow

1. Fetch latest data from InfluxDB
2. Evaluate threshold rules
3. Detect anomalies
4. Send alert to Slack
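
The four steps above can be sketched as a single pass that takes its dependencies as arguments. Everything here is an assumption for illustration: `run_once`, the `Reading` alias, and the idea of passing the InfluxDB query and the Slack sender in as plain callables (which also keeps the loop testable without a database or network).

```python
from typing import Callable, Iterable, Mapping

Reading = Mapping[str, float]           # e.g. {"cpu_temp": 85.0} (hypothetical keys)


def run_once(
    fetch_latest: Callable[[], Reading],
    checks: Iterable[Callable[[Reading], list[str]]],
    notify: Callable[[str], None],
) -> int:
    """Run one pass of the workflow and return the number of alerts sent."""
    reading = fetch_latest()            # step 1: fetch the latest data
    messages: list[str] = []
    for check in checks:                # steps 2 and 3: threshold and anomaly checks
        messages.extend(check(reading))
    for message in messages:            # step 4: send each alert
        notify(message)
    return len(messages)
```

In a real deployment `fetch_latest` would wrap the InfluxDB query and `notify` would wrap the Slack webhook call; a scheduler or a simple sleep loop would call `run_once` at a fixed interval.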


🔥 Threshold-based Detection

Example Rules

| Metric | Condition |
|---|---|
| CPU Temp | > 80°C |
| Fan Speed | < 500 RPM |
| Power | > 120 W |

Logic

# Fire an alert when the CPU temperature exceeds the 80°C threshold
if cpu_temp > 80:
    trigger_alert()
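
With more than one rule, it may be easier to keep the rules as data and evaluate them in a loop. The sketch below encodes the three example rules from the table; the class name, the function name, and the metric keys (`cpu_temp`, `fan_rpm`, `power_w`) are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class ThresholdRule:
    metric: str                               # key in the reading dict
    compare: Callable[[float, float], bool]   # operator.gt or operator.lt
    limit: float
    label: str


# Limits come from the example rules table; the metric keys are assumptions.
RULES = [
    ThresholdRule("cpu_temp", operator.gt, 80, "CPU Temp > 80°C"),
    ThresholdRule("fan_rpm", operator.lt, 500, "Fan Speed < 500 RPM"),
    ThresholdRule("power_w", operator.gt, 120, "Power > 120 W"),
]


def check_thresholds(reading: Mapping[str, float], rules=RULES) -> list[str]:
    """Return the label of every rule the reading violates."""
    return [
        rule.label
        for rule in rules
        if rule.metric in reading and rule.compare(reading[rule.metric], rule.limit)
    ]
```

Metrics missing from a reading are skipped rather than treated as violations, which is one possible choice; a stricter engine might alert on missing data instead.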


📊 Anomaly Detection (Basic)

Method: Z-Score

A Z-score measures how many standard deviations (σ) a value X lies from the mean (μ).


Formula

Z = (X - μ) / σ


Interpretation

| Z-Score | Meaning |
|---|---|
| 0 to 2 | Normal |
| > 3 | Anomaly |

Example Logic

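# Alert when the value is more than 3 standard deviations from the mean;
# abs() also catches unusually low readings (negative Z-scores)
if abs(z_score) > 3:
    trigger_alert()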
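
The Z-score itself has to be computed from a window of recent values. Here is a minimal sketch using the standard `statistics` module; the function names, the use of the population standard deviation, and the decision to return 0 for a flat history are all assumptions. The check uses the absolute value so that unusually low readings are flagged as well.

```python
import statistics
from typing import Sequence


def z_score(value: float, history: Sequence[float]) -> float:
    """Return how many standard deviations `value` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)   # population standard deviation
    if stdev == 0:
        return 0.0                       # flat history: no spread to compare against
    return (value - mean) / stdev


def is_anomaly(value: float, history: Sequence[float], limit: float = 3.0) -> bool:
    """Flag values more than `limit` standard deviations from the mean, in either direction."""
    return abs(z_score(value, history)) > limit


recent = [60, 62, 61, 59, 63]            # hypothetical recent temperatures, mean 61
print(is_anomaly(70, recent))            # 70 is about 6.4 standard deviations above the mean
```

`history` must contain at least one value, otherwise `statistics.fmean` raises `StatisticsError`. In practice a window of a few dozen samples gives a more stable estimate than the five used here.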


🧠 Combining Alerts

We combine both methods:

  • Threshold → immediate alerts when a known limit is crossed
  • Anomaly → alerts on unusual deviations that a fixed threshold would miss
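
One way to combine the two checks for a single metric is to test the hard limit first and fall back to the Z-score only when the limit has not been crossed. This is a sketch under assumptions: `classify` is a hypothetical name, and the defaults reuse the 80°C limit and the Z-score limit of 3 from this post.

```python
import statistics
from typing import Optional, Sequence


def classify(value: float, history: Sequence[float],
             limit: float = 80.0, z_limit: float = 3.0) -> Optional[str]:
    """Return "threshold", "anomaly", or None for one metric reading."""
    if value > limit:                       # hard limit crossed: alert immediately
        return "threshold"
    if len(history) >= 2:                   # need some history to estimate spread
        stdev = statistics.pstdev(history)
        if stdev > 0:
            z = (value - statistics.fmean(history)) / stdev
            if abs(z) > z_limit:            # unusual, although still under the limit
                return "anomaly"
    return None
```

With a recent history around 61°C, a reading of 70°C stays under the threshold but would still be reported as an anomaly, which is the case a fixed limit alone cannot catch.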

⏱️ Alert Frequency Control

Problem

  • Too many alerts → noise
  • Too few alerts → missed issues

Solution

  • Cooldown period (e.g. 5 minutes)
  • Deduplication logic

Example

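import time

# Skip if the previous alert (last_alert, a Unix timestamp) was sent under 5 minutes ago
if time.time() - last_alert < 5 * 60:
    skip_alert()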
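
Cooldown and deduplication can share one structure if the last send time is stored per alert key. The sketch below assumes a key of `(host, rule)` and an in-memory dictionary; the `Cooldown` class and its injectable `clock` parameter are hypothetical and exist mainly to make the logic testable.

```python
import time
from typing import Callable, Dict, Tuple


class Cooldown:
    """Suppress repeats of the same alert key within `seconds`."""

    def __init__(self, seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.seconds = seconds
        self.clock = clock
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_send(self, host: str, rule: str) -> bool:
        """Return True and record the send time unless the key is still cooling down."""
        key = (host, rule)                  # deduplication key
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.seconds:
            return False                    # same alert fired too recently
        self._last_sent[key] = now
        return True
```

Because the state lives in memory, it resets when the process restarts; persisting it would be needed if the engine runs as a short-lived scheduled job.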


🧪 Example Alert Flow

1. Read latest temperature
2. Check threshold
3. Calculate Z-score
4. If triggered → send Slack message


⚠️ Challenges

1. False Positives

  • Temporary spikes
  • Need smoothing
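
One simple form of smoothing is to compare the threshold against a moving average instead of the raw reading. The sketch below is an assumption about how this could look; `MovingAverage` and the window size are illustrative.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `size` samples."""

    def __init__(self, size: int = 5) -> None:
        self._window = deque(maxlen=size)   # old samples drop out automatically

    def update(self, value: float) -> float:
        """Add a sample and return the current windowed mean."""
        self._window.append(value)
        return sum(self._window) / len(self._window)


smoother = MovingAverage(size=3)
samples = [70, 70, 95, 70, 70]              # hypothetical readings with a one-sample spike
smoothed = [smoother.update(v) for v in samples]
```

With these numbers the smoothed series never exceeds 80, so the single 95°C spike would not trigger the threshold rule. The trade-off is latency: a genuine, sustained rise is also detected a few samples later.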

2. Alert Flooding

  • Repeated alerts
  • Requires throttling

3. Data Delay

  • Late ingestion
  • Impacts real-time alerts
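
Late ingestion can be made visible by checking the timestamp of the newest sample before evaluating any rules. This is a sketch only: `is_stale` and the 2-minute `max_age` are assumptions, and `sample_time` is expected to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(sample_time: datetime,
             max_age: timedelta = timedelta(minutes=2),
             now: Optional[datetime] = None) -> bool:
    """Return True when the newest sample is older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    return now - sample_time > max_age
```

When the data is stale, the engine could skip the normal checks and send a separate "no recent data" alert, since silence from a host may itself indicate a problem.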

🎯 Best Practices

  • Use threshold + anomaly together
  • Add cooldown logic
  • Keep messages simple and clear

📊 Example Alert Message

🔥 ALERT: Temperature Spike
Host: pc-01
Temp: 87°C
Status: Critical


🚀 Next Step

In the next post, we will build the Analytics & Dashboard layer, including:

  • KPI design
  • Metabase dashboard
  • Insight visualization

This post is licensed under CC BY 4.0 by the author.