Building a near real-time topic monitoring system on a VPS
Too much noise, not enough signal
I was spending way too much time trying to keep up with multiple topics at once. Google Alerts sent half-relevant stuff three times a day. Raw RSS feeds? Drowning in dozens of articles saying the same thing with different words. Twitter/X has become useless for any kind of structured monitoring.
And the SaaS tools - Mention, Talkwalker, Feedly Pro - run $30-300/month for features you can honestly reproduce with RSS, a script, and an LLM.
So I wrote down what I actually needed:
- Multi-topic: track 3-5 subjects in parallel, each with its own sources and keywords
- Near real-time: checks every 10 minutes
- Smart filtering: not just keyword matching, actual relevance scoring that understands context
- Actionable summaries: 2-3 sentences that give you the gist without clicking
- Zero duplicates: same event covered by 10 outlets = 1 message
- Self-hosted: no SaaS dependency, my data stays on my server
The architecture
A $10/month VPS, Docker, Node.js, an LLM CLI, and a cron job. Nothing else.
┌──────────────────────────────────────────────────────┐
│                          VPS                         │
│                                                      │
│   Cron */10 min              Cron */30 min           │
│        │                          │                  │
│        ▼                          ▼                  │
│    monitor.js              page-monitor.js           │
│        │                          │                  │
│        ├── Google News RSS       ├── Page A (hash)   │
│        ├── GitHub Releases       ├── Page B (hash)   │
│        ├── Blogs (Atom/RSS)      └── Page C (hash)   │
│        ├── Al Jazeera, etc.                          │
│        │                                             │
│        ├─ 1. Freshness filter (< 6h)                 │
│        ├─ 2. URL + normalized title dedup (SQLite)   │
│        ├─ 3. Keyword pre-filter (free)               │
│        ├─ 4. LLM scoring + summary + semantic dedup  │
│        │                                             │
│        ▼                                             │
│              Slack / Discord / Telegram              │
└──────────────────────────────────────────────────────┘
Two scripts:
- monitor.js: aggregates RSS feeds, filters, scores via LLM, pushes to Slack
- page-monitor.js: watches specific web pages (changelogs, blogs) by comparing a SHA-256 hash of the content. Alerts only when the page changes.
The Docker stack
I burned a solid hour fighting Docker Hub rate limits ("You have reached your unauthenticated pull rate limit"). 100 anonymous pulls per 6 hours, and you hit that fast when you're iterating. Switched everything to ghcr.io and never looked back.
services:
  n8n:
    image: ghcr.io/n8n-io/n8n:latest
    ports:
      - '5678:5678'
    environment:
      - GENERIC_TIMEZONE=Europe/Paris
    volumes:
      - n8n-data:/home/node/.n8n

  changedetection:
    image: ghcr.io/dgtlmoon/changedetection.io:latest
    ports:
      - '5000:5000'
    volumes:
      - changedetection-data:/datastore

  rsshub:
    image: ghcr.io/diygod/rsshub:latest
    ports:
      - '1200:1200'
    environment:
      - CACHE_TYPE=memory
      - CACHE_EXPIRE=600

volumes:
  n8n-data:
  changedetection-data:
Docker Hub enforces rate limits on anonymous pulls (100/6h). All three images are also published on GitHub Container Registry (ghcr.io), with no limits. A good habit for any automated deployment.
- n8n: visual automation platform. I deployed it mostly as a "might need it later" thing, for graphical workflows or connecting other services. Not required for the core monitoring.
- RSSHub: turns almost anything into an RSS feed - GitHub repos, subreddits, YouTube channels... Essential when the source has no native feed.
- changedetection.io: web UI for watching pages. Handy for adding watchers on the fly without touching code.
The whole stack sits at about 1.5 GB of RAM. A 4 GB VPS handles it without breaking a sweat.
The filtering pipeline: 4 layers
This is where all the logic lives. The goal: eliminate as much noise as possible before calling the LLM, because every call costs time and tokens.
Layer 1: freshness
const MAX_AGE_HOURS = config.max_age_hours || 6

function isRecent(item) {
  // No date at all? Let the item through rather than silently dropping it
  if (!item.pubDate && !item.isoDate) return true
  const pubDate = new Date(item.isoDate || item.pubDate)
  if (isNaN(pubDate.getTime())) return true
  const ageMs = Date.now() - pubDate.getTime()
  // Reject future-dated items (bad feed clocks) and anything older than the window
  return ageMs >= 0 && ageMs < MAX_AGE_HOURS * 60 * 60 * 1000
}
Google News feeds return 100 articles per query. Most are over 24h old. With a 6h threshold, you go from 100 down to 5-20. It's configurable in config.json - bump it to 12h or 24h if you'd rather get a daily digest.
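For reference, max_age_hours sits at the top level of config.json, next to the topic definitions (the topics wrapper key is my illustration here; the per-topic shape is shown under Layer 3):
{
  "max_age_hours": 12,
  "topics": []
}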
Layer 2: URL + title deduplication
Here's what drove me crazy early on. Google News generates a unique redirect URL for every single result, even when two results point to the same story. "Iran strike - Reuters" and "Iran strike - BBC" get completely different news.google.com/rss/articles/CBM... URLs.
URL dedup alone is useless. I added title normalization:
function normalizeTitle(title) {
  if (!title) return ''
  return title
    .toLowerCase()
    // Strip the trailing source suffix ("- Reuters", "| BBC"); outlet list elided here
    .replace(/\s*[-–—|:]\s*(the\s+)?(reuters|ap|bbc|cnn|...).*$/i, '')
    // Drop spaces and punctuation, keep accented characters
    .replace(/[^a-z0-9àâäéèêëïîôùûüÿçæœ]/g, '')
    // Drop editorial prefixes
    .replace(/^(update|breaking|live|exclusive|watch|video)\s*/i, '')
    .slice(0, 60)
}
Strip the source suffix (- Reuters, | BBC) and editorial prefixes (BREAKING:, LIVE:), drop punctuation, then compare the first 60 normalized characters. Two articles about the same event with near-identical titles? Only one gets through.
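To make the effect concrete, here's what the function produces on two headlines for the same event (assuming the elided outlet list in the regex actually contains reuters and bbc):
normalizeTitle('BREAKING: Iran strike kills commander - Reuters')
// -> 'iranstrikekillscommander'
normalizeTitle('Iran strike kills commander | BBC')
// -> 'iranstrikekillscommander'
// Identical normalized titles -> identical title_hash -> the second article is dropped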
SQLite storage with an index on the title hash:
CREATE TABLE seen_articles (
  url        TEXT PRIMARY KEY,
  title      TEXT,
  title_hash TEXT,
  topic      TEXT,
  seen_at    TEXT DEFAULT (datetime('now'))
);
CREATE INDEX idx_title_hash ON seen_articles(title_hash);
30-day retention, automatic cleanup.
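The cleanup itself is one statement; a minimal sketch, assuming better-sqlite3 (the same synchronous db.prepare() style as the page-monitor snippet further down):
// Purge dedup entries older than 30 days, run at the start of each monitor.js pass
db.prepare("DELETE FROM seen_articles WHERE seen_at < datetime('now', '-30 days')").run()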
Layer 3: keyword pre-filter
Free, instant. Each topic has its keyword list in the config:
{
  "name": "My Topic",
  "keywords": ["keyword-1", "keyword-2", "exact phrase"],
  "slack_channel": "C0XXXXXXX",
  "feeds": [
    "https://news.google.com/rss/search?q=...",
    "https://github.com/org/repo/releases.atom"
  ]
}
Keywords matched against title + snippet. A lowercase includes(), deliberately permissive. The fine-grained filtering is the LLM's job right after.
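The matcher is a few lines; a sketch of how it can look (the function name is mine, and item.contentSnippet assumes an rss-parser-style item, so adjust to your feed library):
// True if any of the topic's keywords appears in the title or snippet
function matchesKeywords(item, keywords) {
  const haystack = `${item.title || ''} ${item.contentSnippet || ''}`.toLowerCase()
  return keywords.some((kw) => haystack.includes(kw.toLowerCase()))
}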
Layer 4: LLM scoring + semantic deduplication
The layer that makes the whole thing work. Articles that survived the first 3 layers get sent to the LLM in batches of 25:
Score each article (0-10) based on relevance to the topic.
DEDUPLICATE: if multiple articles cover the same event,
keep only the best one.
Summarize each kept article in 2-3 sentences with key facts.
Include ONLY scores >= 6.
One call, three results:
- Scores relevance (an article that mentions a keyword in passing -> score 3, not sent)
- Semantic dedup (5 articles about the same event -> the best one is kept)
- Summarizes in 2-3 actionable sentences
The result is structured JSON parsed on the Node.js side. If the LLM times out or crashes, a fallback returns articles with a default score. The system never breaks.
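The glue around the LLM CLI is a child-process call with a try/catch on the parse; a minimal sketch (the CLI name, flags, and fallback score are placeholders, not my exact implementation):
import { execFileSync } from 'node:child_process'

function scoreBatch(topic, articles) {
  const prompt = buildPrompt(topic, articles) // the instructions above + the batch as JSON
  try {
    const out = execFileSync('llm-cli', ['--output-format', 'json'], {
      input: prompt,
      timeout: 120_000, // give up after 2 minutes
      encoding: 'utf8',
    })
    return JSON.parse(out) // expected shape: [{ url, score, summary }, ...]
  } catch (err) {
    // Timeout, crash, or malformed JSON: deliver with a default score
    // and the raw title as summary rather than losing the articles
    console.error('LLM scoring failed:', err.message)
    return articles.map((a) => ({ url: a.link, score: 6, summary: a.title }))
  }
}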
Why score before summarizing? You could summarize all articles then filter. But scoring first cuts the volume by 80-90%, slashing token costs. Free keywords first, paid LLM only on serious candidates.
Page monitoring
Some sources don't have RSS. A changelog, a documentation page, a blog with no feed. For those I built page-monitor.js, and honestly it's the piece of code I'm most satisfied with:
import { createHash } from 'node:crypto'

const content = extractMainContent(html)
const hash = createHash('sha256').update(content).digest('hex')

const existing = db.prepare('SELECT hash FROM page_hashes WHERE url = ?').get(url)
if (existing && existing.hash !== hash) {
  // Change detected -> Slack alert
  await postToSlack(channel, `🔔 Change detected on ${pageName}`)
}

// Upsert the latest hash; the first run only seeds the baseline, no alert
db.prepare('INSERT INTO page_hashes (url, hash) VALUES (?, ?) ON CONFLICT(url) DO UPDATE SET hash = excluded.hash').run(url, hash)
Extract the <main> content (to ignore headers/footers/ads that change constantly), hash it, compare. Zero false positives in 2 weeks of running.
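extractMainContent does the heavy lifting; a minimal sketch of the idea with a plain regex rather than a real HTML parser (my production version may differ):
// Grab <main> if present, fall back to <body>, then strip scripts and tags
// and collapse whitespace so markup churn doesn't change the hash
function extractMainContent(html) {
  const main = html.match(/<main[^>]*>([\s\S]*?)<\/main>/i)
  const body = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i)
  const raw = (main || body || [, html])[1]
  return raw
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
}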
Runs every 30 minutes via cron. Uses virtually no resources.
The cron
*/10 * * * * cd ~/news-monitor/app && node monitor.js >> monitor.log 2>&1
*/30 * * * * cd ~/news-monitor/app && node page-monitor.js >> monitor.log 2>&1
Cron has a minimal PATH. If your LLM CLI isn't in /usr/bin/, remember to prepend its path: PATH=/usr/local/bin:/usr/bin:/home/user/.local/bin before the command.
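Concretely, the top of the crontab ends up looking like this:
PATH=/usr/local/bin:/usr/bin:/home/user/.local/bin
*/10 * * * * cd ~/news-monitor/app && node monitor.js >> monitor.log 2>&1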
One thing that bit me: my VPS runs fish shell, and fish doesn't support heredoc syntax (<<EOF). So writing config files required either bash -c wrappers or scp transfers. Spent 15 minutes wondering why my commands kept failing before I figured that out.
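The workaround is to hand the heredoc to bash explicitly (the file content here is just a placeholder):
bash -c 'cat > config.json <<EOF
{ "max_age_hours": 6 }
EOF'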
The result in Slack
Each check produces a structured message per topic:
🔴 Monitoring Geopolitics - 03/17/2026 09:50
• Article title
> 2-3 sentence summary with key information.
> The reader gets the gist without clicking through.
_8/10 - 03/17, 08:30_
• Another article on a different subject
> Context and important details summarized here.
> Impact and consequences mentioned.
_7/10 - 03/17, 07:15_
No duplicates, no noise. If nothing new since the last run, nothing is sent. Slack channels stay clean.
The numbers
On a typical run with an active geopolitical topic:
| Step | Articles | Reduction |
|---|---|---|
| Raw RSS feeds | ~400 | - |
| After freshness filter (6h) | ~25 | -94% |
| After URL + title dedup | ~20 | -20% |
| After keyword filter | ~15 | -25% |
| After LLM scoring (>= 6) | ~8 | -47% |
| After LLM semantic dedup | ~5 | -37% |
400 articles down to 5 - roughly one article kept for every 80 in the raw feeds. The first run pushed 26 articles to Slack. The second one, 10 minutes later: zero. Dedup was working.
What's next
This setup covers 90% of my monitoring needs. A few things on my list:
- Telegram OSINT sources via MTProto - some channels break news 15-30 min before traditional media
- Local LLM via Ollama to drop the external API dependency. A quantized Llama 3.1 8B runs in 8 GB of RAM and is more than enough for scoring
- Web dashboard with history and stats (n8n is already deployed, might as well use it)
- Push alerts for scores 9-10 instead of waiting for the next poll
The full code fits in two files (~200 lines each), a JSON config, and a Docker Compose. No framework, no exotic dependency. If the VPS goes down, you redeploy in 10 minutes.
Stack: Debian 13 - Docker Compose - Node.js 22 - RSSHub - changedetection.io - SQLite - LLM CLI - Slack API