Why Incremental Link Checking Beats Traditional Website Crawlers
The Problem with Traditional Crawlers
Most website monitoring tools use a “snapshot” approach: they hit your site hard with simultaneous requests, check everything at once, then wait until the next scheduled scan. This creates several problems:
- Server strain: Hundreds of concurrent requests can spike CPU and bandwidth
- Incomplete coverage: Large sites get partially scanned within time limits
- Missed changes: New pages added between scans go unnoticed
- False negatives: Intermittent issues that occur between scans go undetected
How Incremental Checking Works
HITS Scout takes a fundamentally different approach inspired by how search engines crawl the web:
1. Discovery-Driven
Instead of requiring you to list every URL upfront, we start with your homepage and automatically discover links as we go. Each check extracts new URLs, building a complete map of your site organically.
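The discovery step can be sketched in a few lines. This is a minimal illustration using Python's standard library, not HITS Scout's actual implementation; the same-host filter and the sample HTML are assumptions for the example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects same-site links from anchor tags in an HTML response."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep only links on the same host; skips mailto:, external sites, etc.
        if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
            self.links.add(absolute)

# Hypothetical response body for illustration
html = '<a href="/about">About</a> <a href="https://other.example/x">Ext</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(html)
print(sorted(parser.links))  # ['https://example.com/about']
```

Each check feeds its response through an extractor like this, and any URL not already known enters the queue.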
2. Oldest-First Scheduling
We maintain a queue of discovered links and always check the oldest unvisited one next. This ensures:
- Even coverage: Every link gets checked on a predictable rotation
- No blind spots: Pages buried deep in your site structure get attention
- Automatic updates: New pages enter the queue immediately when discovered
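In pseudocode terms, "oldest-first" just means selecting the link with the smallest `last_checked_at`, with never-checked links winning outright. A toy in-memory sketch (the tuple layout and sample data are invented for illustration):

```python
from datetime import datetime, timezone

# Hypothetical queue entries: (url, last_checked_at); None = never checked.
queue = [
    ("https://example.com/", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ("https://example.com/blog", None),  # newly discovered
    ("https://example.com/about", datetime(2024, 1, 1, tzinfo=timezone.utc)),
]

EPOCH = datetime.min.replace(tzinfo=timezone.utc)

def next_link(entries):
    """Oldest-first: never-checked links sort before everything else."""
    return min(entries, key=lambda e: e[1] or EPOCH)[0]

print(next_link(queue))  # https://example.com/blog
```

New discoveries have no `last_checked_at` yet, so they sort to the front automatically, which is what gives new pages a first check within one rotation.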
3. One Link at a Time
Rather than bombarding your server, we check a single URL per configured interval (e.g., every 5 minutes). This gentle approach:
- Respects your infrastructure: Negligible performance impact
- Maintains freshness: Frequent small checks vs. infrequent big scans
- Catches intermittent issues: More opportunities to detect problems
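The pacing itself is a simple loop: pull one URL, check it, wait out the interval. A minimal sketch under the assumptions above (the `run` helper and its parameters are hypothetical, not HITS Scout's API):

```python
import time

CHECK_INTERVAL_SECONDS = 300  # e.g. every 5 minutes

def run(queue, check, interval=CHECK_INTERVAL_SECONDS):
    """Issue exactly one request per interval instead of crawling in a burst."""
    while queue:
        url = queue.pop(0)
        check(url)  # single request; server load stays negligible
        if queue:
            time.sleep(interval)

# Demo with interval=0 so it finishes instantly
checked = []
run(["https://example.com/a", "https://example.com/b"], checked.append, interval=0)
print(checked)
```

In production the loop would run indefinitely against the oldest-first queue rather than draining a list, but the one-request-per-interval shape is the same.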
Real-World Example
Consider a blog with 200 posts:
Traditional crawler (every 24 hours):
- Tries to check all 200 pages in one burst
- Your server sees 200 requests in < 5 minutes
- Posts added mid-cycle wait up to 24 hours for first check
- A page that goes down briefly outside that 5-minute scan window stays undetected until the next day's scan
Incremental checker (5-minute intervals):
- Checks 1 page every 5 minutes = 12 pages/hour = 288 pages/day
- Your server sees 1 request every 5 minutes (barely noticeable)
- New posts discovered and added to queue within one rotation cycle
- Checks distributed across the entire day provide better outage detection
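The throughput arithmetic above generalizes to any site size. A quick sketch (one check per interval, no parallelism):

```python
def full_coverage_hours(pages, interval_minutes=5):
    """Time to check every page once at one check per interval."""
    return pages * interval_minutes / 60

print(full_coverage_hours(200))  # ~16.7 hours for the 200-post blog
```

At 288 checks per day, a 200-page site is fully revisited faster than a daily snapshot scan, with the checks spread evenly instead of bunched into one burst.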
When Traditional Crawling Makes Sense
Incremental checking isn’t always the right fit:
- Small static sites (< 20 pages): Full scans are fine and complete quickly
- Intranet tools: Where you control the infrastructure and can afford bursts
- One-time audits: When you need a snapshot for reporting purposes
For growing sites, shared hosting environments, or continuous monitoring, incremental checking provides superior coverage with minimal footprint.
Implementation Details
At HITS Scout, our incremental crawler:
- Stores every discovered link with discovered_at and last_checked_at timestamps
- Queries for the oldest link per monitor using indexed database queries
- Extracts new links from HTML responses and adds them to the queue
- Verifies the primary URL when any child link fails (smart alerting)
- Tracks coverage percentage so you know how much of your site has been checked
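The "oldest link per monitor" lookup is the heart of the scheduler. Here is a minimal SQLite sketch of that query pattern, with an invented schema and sample rows (HITS Scout's actual database may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        monitor_id      INTEGER NOT NULL,
        url             TEXT NOT NULL,
        discovered_at   TEXT NOT NULL,
        last_checked_at TEXT
    )
""")
# Index so "oldest link per monitor" is a cheap lookup, not a table scan.
conn.execute(
    "CREATE INDEX idx_links_monitor_checked ON links (monitor_id, last_checked_at)"
)

conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?)",
    [
        (1, "https://example.com/", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"),
        (1, "https://example.com/new", "2024-01-03T00:00:00Z", None),
    ],
)

# In SQLite, NULL sorts before any value in ASC order, so never-checked
# links are picked first -- matching the oldest-first policy.
row = conn.execute(
    "SELECT url FROM links WHERE monitor_id = ? "
    "ORDER BY last_checked_at ASC LIMIT 1",
    (1,),
).fetchone()
print(row[0])  # https://example.com/new
```

The composite index lets the database answer this query per monitor without scanning the whole table, which matters once a queue holds thousands of links.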
Get Started
Try incremental link checking for free with HITS Scout:
- 3 monitors on the free plan
- 5-minute check intervals
- Automatic link discovery
- No server impact
FAQ
Q: How long until my entire site is checked?
A: Depends on your site size and check interval. With 100 pages and 5-minute intervals, full coverage takes ~8 hours. Our dashboard shows current coverage percentage.
Q: What about pages requiring authentication?
A: Currently, we only check publicly accessible pages. Private page monitoring is on our roadmap.
Q: Do you execute JavaScript?
A: Not yet. We parse static HTML, which covers most use cases. JavaScript rendering support is planned for Pro/Enterprise tiers.
Q: Can I exclude certain paths?
A: Yes! Use the exclude patterns feature to skip admin areas, private sections, or high-churn pages.
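For a sense of how path exclusion typically works, here is a glob-style sketch using Python's standard library; the pattern syntax shown is an assumption for illustration, not necessarily HITS Scout's exact format:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical exclude patterns: skip admin, private, and search pages.
EXCLUDE = ["/admin/*", "/private/*", "/search*"]

def excluded(url, patterns=EXCLUDE):
    """Return True if the URL's path matches any exclude pattern."""
    path = urlparse(url).path
    return any(fnmatch(path, p) for p in patterns)

print(excluded("https://example.com/admin/users"))  # True
print(excluded("https://example.com/blog/post-1"))  # False
```

Excluded URLs are simply never added to the queue, so they consume no check slots.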