Why Incremental Link Checking Beats Traditional Website Crawlers


The Problem with Traditional Crawlers

Most website monitoring tools use a “snapshot” approach: they hit your site hard with simultaneous requests, check everything at once, then wait until the next scheduled scan. This creates several problems:

  • Server strain: Hundreds of concurrent requests can spike CPU and bandwidth
  • Incomplete coverage: Scans of large sites hit time limits before every page is reached
  • Missed changes: New pages added between scans go unnoticed
  • False negatives: Outages that occur between scans go completely undetected

How Incremental Checking Works

HITS Scout takes a fundamentally different approach inspired by how search engines crawl the web:

1. Discovery-Driven

Instead of requiring you to list every URL upfront, we start with your homepage and automatically discover links as we go. Each check extracts new URLs, building a complete map of your site organically.
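The discovery step can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not HITS Scout's actual crawler: it pulls `href` targets out of a page, resolves them against the page's URL, and keeps only same-site links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects same-site link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Keep only links on the same host; external links are out of scope here.
        if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
            self.links.add(absolute.split("#")[0])  # drop fragments

extractor = LinkExtractor("https://example.com/")
extractor.feed('<a href="/blog/post-1">Post</a> <a href="https://other.site/">Ext</a>')
print(sorted(extractor.links))  # ['https://example.com/blog/post-1']
```

Each checked page feeds new URLs back into the queue, so the site map grows on its own.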

2. Oldest-First Scheduling

We maintain a queue of discovered links and always check the oldest unvisited one next. This ensures:

  • Even coverage: Every link gets checked on a predictable rotation
  • No blind spots: Pages buried deep in your site structure get attention
  • Automatic updates: New pages enter the queue immediately when discovered
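A minimal in-memory version of the oldest-first rule looks like this (the dict-of-timestamps store is hypothetical; the real queue lives in a database): never-checked links win first, then the stalest timestamp.

```python
# Hypothetical in-memory queue keyed by URL; None means never checked.
links = {
    "https://example.com/":         {"last_checked_at": 1700000300},
    "https://example.com/about":    {"last_checked_at": None},
    "https://example.com/blog/old": {"last_checked_at": 1700000000},
}

def next_link_to_check(links):
    """Oldest-first: unvisited links (None) first, then the stalest timestamp."""
    return min(
        links,
        key=lambda url: (links[url]["last_checked_at"] is not None,
                         links[url]["last_checked_at"] or 0),
    )

print(next_link_to_check(links))  # https://example.com/about (never checked)
```

Because newly discovered pages start with no `last_checked_at`, they jump to the front of the rotation automatically.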

3. Gentle Pacing

Rather than bombarding your server, we check a single URL per configured interval (e.g., every 5 minutes). This gentle approach:

  • Respects your infrastructure: Negligible performance impact
  • Maintains freshness: Frequent small checks vs. infrequent big scans
  • Catches intermittent issues: More opportunities to detect problems
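The pacing loop itself is simple. This sketch (a stand-in, not our production code) checks exactly one URL per interval, so the target server never sees a burst:

```python
import time

def run_checker(pick_next, check_url, interval=300, max_checks=None):
    """Check one URL per interval -- never a burst of requests.

    pick_next:  returns the oldest unvisited URL (or None if the queue is empty).
    check_url:  performs the single HTTP request for that URL.
    max_checks: optional stop condition, mainly useful for demos and tests.
    """
    checked = 0
    while max_checks is None or checked < max_checks:
        url = pick_next()
        if url is not None:
            check_url(url)  # exactly one request per cycle
        checked += 1
        time.sleep(interval)  # e.g. 300 s = one request every 5 minutes

# Demo with a stub checker and zero delay: three checks, one at a time.
seen = []
run_checker(pick_next=lambda: "https://example.com/", check_url=seen.append,
            interval=0, max_checks=3)
print(seen)
```

In production the same loop would run indefinitely with `max_checks=None` and the configured interval.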

Real-World Example

Consider a blog with 200 posts:

Traditional crawler (every 24 hours):

  • Tries to check all 200 pages in one burst
  • Your server sees 200 requests in < 5 minutes
  • Posts added mid-cycle wait up to 24 hours for first check
  • A brief outage at any other time of day goes completely undetected

Incremental checker (5-minute intervals):

  • Checks 1 page every 5 minutes = 12 pages/hour = 288 pages/day
  • Your server sees 1 request every 5 minutes (barely noticeable)
  • New posts discovered and added to queue within one rotation cycle
  • Checks distributed across the entire day provide better outage detection

When Traditional Crawling Makes Sense

Incremental checking isn’t always the right fit:

  • Small static sites (< 20 pages): Full scans are fine and complete quickly
  • Intranet tools: Where you control the infrastructure and can afford bursts
  • One-time audits: When you need a snapshot for reporting purposes

For growing sites, shared hosting environments, or continuous monitoring, incremental checking provides superior coverage with minimal footprint.

Implementation Details

At HITS Scout, our incremental crawler:

  • Stores every discovered link with discovered_at and last_checked_at timestamps
  • Queries for the oldest link per monitor using indexed database queries
  • Extracts new links from HTML responses and adds them to the queue
  • Verifies the primary URL when any child link fails (smart alerting)
  • Tracks coverage percentage so you know how much of your site has been checked
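The "oldest link per monitor" lookup above can be sketched against a throwaway SQLite table. The schema and index name here are illustrative only; they mirror the timestamps described above, not our actual database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE links (
        monitor_id      INTEGER NOT NULL,
        url             TEXT NOT NULL,
        discovered_at   INTEGER NOT NULL,
        last_checked_at INTEGER          -- NULL means never checked
    );
    -- Index so "oldest link per monitor" is a cheap lookup, not a table scan.
    CREATE INDEX idx_links_monitor_checked ON links (monitor_id, last_checked_at);
""")
db.executemany(
    "INSERT INTO links VALUES (?, ?, ?, ?)",
    [(1, "https://example.com/",      1700000000, 1700000600),
     (1, "https://example.com/about", 1700000100, None),
     (1, "https://example.com/blog",  1700000200, 1700000300)],
)

# In SQLite, NULL sorts before any value in ASC order, so
# never-checked links naturally come first.
row = db.execute(
    "SELECT url FROM links WHERE monitor_id = ? "
    "ORDER BY last_checked_at ASC LIMIT 1",
    (1,),
).fetchone()
print(row[0])  # https://example.com/about
```

The index on `(monitor_id, last_checked_at)` is what keeps this query fast even as the link table grows.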

Get Started

Try incremental link checking for free with HITS Scout:

  • 3 monitors on the free plan
  • 5-minute check intervals
  • Automatic link discovery
  • No server impact

Start monitoring your site →


FAQ

Q: How long until my entire site is checked?
A: It depends on your site size and check interval. With 100 pages checked every 5 minutes, a full pass takes 100 × 5 = 500 minutes, or roughly 8 hours. Our dashboard shows current coverage percentage.

Q: What about pages requiring authentication?
A: Currently, we only check publicly accessible pages. Private page monitoring is on our roadmap.

Q: Do you execute JavaScript?
A: Not yet. We parse static HTML, which covers most use cases. JavaScript rendering support is planned for Pro/Enterprise tiers.

Q: Can I exclude certain paths?
A: Yes! Use the exclude patterns feature to skip admin areas, private sections, or high-churn pages.
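Conceptually, exclude patterns work like glob matching against the URL path. The pattern syntax below is hypothetical, just to show the idea; see your monitor settings for the exact syntax HITS Scout accepts.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical glob-style exclude patterns.
EXCLUDE_PATTERNS = ["/wp-admin/*", "/private/*", "/search*"]

def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """True if the URL's path matches any exclude pattern."""
    path = urlparse(url).path
    return any(fnmatch(path, pattern) for pattern in patterns)

print(is_excluded("https://example.com/wp-admin/settings"))  # True
print(is_excluded("https://example.com/blog/post-1"))        # False
```

Excluded URLs are simply never added to the check queue, so they consume none of your check budget.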