Broken Link Monitoring Automation: Full Setup Guide

Set up broken link monitoring automation that scans your site on a schedule, flags 404s and slow URLs, and exports structured reports for $0.0008 per link.

To set up automatic broken link monitoring for a website, deploy a scheduled crawler that scans your pages at a fixed interval, flags target HTTP status codes (404, 410, 500, etc.), and ships a structured report to your team. The cleanest path is to use a hosted actor like Dead-Link Watchdog on Apify, point it at your sitemap, configure which status codes count as failures, and attach the Apify Scheduler to run it weekly or daily. No servers, no cron jobs, no Puppeteer maintenance.

Quick Answer

Broken link monitoring automation works by combining three pieces: a recurring schedule, a configurable crawler that traverses your site, and a structured output destination. Set the crawler to follow internal links from your homepage or sitemap, define which status codes (400, 403, 404, 410, 500, 502, 503, 504) trigger a flag, and route results to JSON, CSV, or XLSX. Run it weekly for marketing sites or daily for high-traffic ecommerce and documentation portals. At roughly $0.0008 per link checked, a 5,000-link site costs $4 per scan — cheap insurance against silent SEO decay.

Most "broken link checker" tutorials walk you through a single scan with a Chrome extension or a free online tool. That catches today's broken links, but link rot is continuous. Studies of editorial content show roughly 25% of external links break within 7 years, and internal links break every time someone deletes a page, edits a slug, or migrates a CMS.

Real failure modes you only catch with recurring scans:

  • A vendor renames their docs URL → 30 internal links 404 overnight
  • A redirect chain grows from 1 hop to 4 hops after a domain migration → page speed tanks
  • A third-party CDN starts returning 503s during peak hours → users see broken images
  • Google indexes a 410 page you forgot to remove from your sitemap

A scheduled monitor catches each of these within one scan interval, not the next time you remember to run a manual audit.

The fastest production setup using Dead-Link Watchdog:

  1. Create an Apify account (free tier covers small sites).
  2. Open the Dead-Link Watchdog actor page and click "Try for free."
  3. Configure the input JSON:
{
  "startUrls": ["https://yoursite.com/sitemap.xml"],
  "maxCrawlDepth": 5,
  "maxRequestsPerCrawl": 10000,
  "flagStatusCodes": [400, 403, 404, 410, 500, 502, 503, 504],
  "respectRobotsTxt": true
}
  4. Run once manually to confirm coverage and baseline link count.
  5. Open the "Schedules" tab in Apify and create a new schedule. Cron expression 0 6 * * 1 runs the scan every Monday at 6 AM UTC.
  6. Attach a webhook or integration: Slack, email, Google Sheets, or a direct download URL for the dataset.
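
To sanity-check what the cron expression 0 6 * * 1 means, here is a small Python sketch that computes the next Monday 06:00 UTC after a given moment. It uses only the standard library and does not talk to Apify; `next_weekly_run` is a hypothetical helper name, not part of any Apify API.

```python
from datetime import datetime, timedelta, timezone

def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 6) -> datetime:
    """Next run time, strictly after `now`, for a weekly schedule in UTC.

    `weekday` follows Python's convention (Monday = 0), whereas the cron
    expression `0 6 * * 1` uses 1 for Monday. `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday (0 days if already on it).
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so take next week's.
        candidate += timedelta(days=7)
    return candidate

# Wednesday 3 Jan 2024, noon UTC -> next run is Monday 8 Jan 2024, 06:00 UTC
wednesday = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
print(next_weekly_run(wednesday).isoformat())  # 2024-01-08T06:00:00+00:00
```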

Total setup time: under 15 minutes. After that, it runs on schedule without manual intervention.

What HTTP status codes should I flag as broken?

Not all non-200 responses are bugs. Here's how to think about each code:

| Code | Meaning | Flag it? |
|------|---------|----------|
| 301 | Permanent redirect | Optional; flag if you want clean canonical links |
| 302 | Temporary redirect | Optional; often legitimate |
| 400 | Bad request | Yes |
| 401 | Auth required | Yes, if public page expected |
| 403 | Forbidden | Yes |
| 404 | Not found | Always |
| 410 | Gone | Always |
| 429 | Rate limited | Yes; your crawler may be too aggressive, or the target is blocking you |
| 500/502/503/504 | Server errors | Yes, but rescan to confirm not transient |

For most marketing sites, start with [400, 403, 404, 410, 500, 502, 503, 504]. Add 301 if you're cleaning up after a migration and want a punch list of internal links to update.
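
If you post-process results yourself, the triage logic above can be sketched in a few lines of Python. This is an illustration of the decision rules only, not the actor's internal behavior; the `triage` function and its return labels are hypothetical.

```python
# Starting list recommended above for most marketing sites.
DEFAULT_FLAG_CODES = frozenset({400, 403, 404, 410, 500, 502, 503, 504})

def triage(status_code: int, flag_codes=DEFAULT_FLAG_CODES) -> str:
    """Return 'ok', 'broken', 'recheck', or 'throttled' for one response."""
    if status_code == 429:
        # Rate limiting says more about crawl speed than about the link.
        return "throttled"
    if status_code not in flag_codes:
        return "ok"
    if 500 <= status_code <= 599:
        # Server errors may be transient, so confirm on a rescan.
        return "recheck"
    return "broken"

print(triage(404))  # broken
print(triage(503))  # recheck
print(triage(301))  # ok
# After a migration, add 301 to get a punch list of links to update.
print(triage(301, DEFAULT_FLAG_CODES | {301}))  # broken
```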

Match the cadence to your publish velocity and traffic risk:

  • Static marketing site, monthly updates → weekly scan
  • Blog publishing 5+ posts/week → twice weekly
  • Documentation portal with external API references → daily
  • Ecommerce with thousands of SKUs → daily, with shorter scans on category pages and a full crawl weekly
  • News or aggregator site → daily, sometimes hourly on the homepage and top sections

Cost math at $0.0008 per link:

  • 1,000-link blog scanned weekly = $0.80/week = ~$42/year
  • 10,000-link docs site scanned daily = $8/day = ~$2,920/year
  • 50,000-link ecommerce scanned weekly + 5,000 hot pages daily = $40 + $4 × 7 = $68/week

Compare that to losing organic rankings on a single product page that 404s for three weeks — the ROI is obvious.
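
To rerun the cost math for your own link counts, a minimal Python sketch follows. It assumes the flat $0.0008 per link quoted in this guide and uses `Decimal` to avoid floating-point rounding noise; `scan_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_LINK = Decimal("0.0008")  # per-link price quoted in this guide

def scan_cost(links: int, scans: int = 1) -> Decimal:
    """Cost in USD of checking `links` links, `scans` times."""
    return PRICE_PER_LINK * links * scans

blog_yearly = scan_cost(1_000, scans=52)    # weekly scans for a year
docs_yearly = scan_cost(10_000, scans=365)  # daily scans for a year
# Full weekly crawl plus seven daily scans of 5,000 hot pages.
shop_weekly = scan_cost(50_000) + scan_cost(5_000, scans=7)

print(f"${blog_yearly:.2f}")  # $41.60
print(f"${docs_yearly:.2f}")  # $2920.00
print(f"${shop_weekly:.2f}")  # $68.00
```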

How do I monitor only specific sections of a large site?

Three techniques for scoping:

1. Use a filtered sitemap. Point startUrls at sitemap-blog.xml instead of the root sitemap. Most CMSes generate per-section sitemaps automatically.

2. Cap depth. Set maxCrawlDepth: 2 if you only want to verify links from the homepage and one click deep. Useful for monitoring high-value landing pages.

3. Cap total requests. maxRequestsPerCrawl: 2000 puts a hard ceiling on cost. The crawler exits cleanly at that point and reports what it found.

Combine all three for surgical monitoring. For example, a daily scan limited to /pricing, /docs, and /blog stays under $5/month even on a large site, as long as each run checks no more than about 200 links (200 × $0.0008 × 30 days ≈ $4.80).
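
One way to pick a request cap is to work backwards from a budget. The Python sketch below does that and builds an input object using only the field names shown earlier (startUrls, maxCrawlDepth, maxRequestsPerCrawl). It assumes each request is billed as one link check at $0.0008, which you should confirm against your own run costs; both helper names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_LINK = Decimal("0.0008")  # per-link price quoted in this guide

def request_cap_for_budget(monthly_budget: str, scans_per_month: int) -> int:
    """Largest per-scan request cap that keeps a month of scans within budget."""
    per_scan_budget = Decimal(monthly_budget) / scans_per_month
    return int(per_scan_budget / PRICE_PER_LINK)  # round down to stay under budget

def scoped_input(sitemap_url: str, max_depth: int, max_requests: int) -> dict:
    """Actor input limited to one sitemap, a shallow depth, and a request ceiling."""
    return {
        "startUrls": [sitemap_url],
        "maxCrawlDepth": max_depth,
        "maxRequestsPerCrawl": max_requests,
    }

cap = request_cap_for_budget("5", scans_per_month=30)
print(cap)  # 208
print(scoped_input("https://yoursite.com/sitemap-blog.xml", 2, cap))
```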

Dead-Link Watchdog exports structured results as JSON, CSV, or XLSX. To turn that into an alert:

Option A: Apify webhook. In the actor's settings, add a webhook that fires on ACTOR.RUN.SUCCEEDED. Point it at a Zapier or Make.com endpoint that:

  • Reads the dataset
  • Filters rows where flagged == true
  • Posts a Slack message with the count and a link to the full CSV

Option B: Email integration. Use Apify's built-in email notification on run completion, then download the dataset manually from the run page.

Option C: Google Sheets sync. Send the JSON output to a Sheets row via webhook, then trigger a conditional email when row count > 0.

The structured output makes filtering easy — every row includes sourceUrl, targetUrl, statusCode, responseTimeMs, and redirectChain, so you can sort by severity or by which page contains the most broken outbound links.
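
If you would rather write the filtering step yourself than configure it in Zapier or Make, a minimal Python sketch looks like this. It assumes the rows have already been loaded from the exported JSON and carry the flagged and sourceUrl fields mentioned above; the sample rows are made up for illustration and `summarize_flagged` is a hypothetical helper.

```python
from collections import Counter

def summarize_flagged(rows: list[dict], top_n: int = 3) -> str:
    """Build a short alert message from dataset rows."""
    flagged = [row for row in rows if row.get("flagged")]
    if not flagged:
        return "No broken links found."
    # Count broken links per page so the worst offenders come first.
    by_source = Counter(row["sourceUrl"] for row in flagged)
    lines = [f"{len(flagged)} broken link(s) found."]
    for source, count in by_source.most_common(top_n):
        lines.append(f"- {source}: {count}")
    return "\n".join(lines)

# Made-up sample rows in the shape described above.
sample = [
    {"sourceUrl": "https://example.com/a", "targetUrl": "https://example.com/x", "statusCode": 404, "flagged": True},
    {"sourceUrl": "https://example.com/a", "targetUrl": "https://example.com/y", "statusCode": 500, "flagged": True},
    {"sourceUrl": "https://example.com/b", "targetUrl": "https://example.com/x", "statusCode": 404, "flagged": True},
    {"sourceUrl": "https://example.com/b", "targetUrl": "https://example.com/z", "statusCode": 200, "flagged": False},
]
print(summarize_flagged(sample))
```

The returned string can be used as the body of a Slack or email notification.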

The crawler checks every link it encounters, including external ones referenced from your pages. This matters because:

  • External links that 404 hurt user experience even though they're not "your" content
  • Affiliate links that break cost you revenue silently
  • Citations in documentation that go dead reduce trust

Set checkExternalLinks: true in the input to verify external destinations. The crawler won't traverse into external sites (it won't recursively crawl Wikipedia from one citation), it just performs a HEAD/GET on each external URL and records the status.

For a typical blog, external links roughly double the link count, so budget accordingly. A site with 1,000 internal links and about as many external links becomes ~2,000 checks = $1.60 per scan.
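
To estimate your own internal/external split before enabling the option, you can classify links by hostname. The Python sketch below is one reasonable definition, not necessarily the one the actor uses: it treats subdomains of your site as internal and relative links as internal. `is_external` is a hypothetical helper.

```python
from urllib.parse import urlparse

def is_external(link: str, site_host: str) -> bool:
    """True if `link` points to a host outside `site_host` and its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    site_host = site_host.lower()
    if not host:
        return False  # relative links such as "/pricing" stay on the site
    return host != site_host and not host.endswith("." + site_host)

links = [
    "/pricing",
    "https://yoursite.com/blog/post",
    "https://docs.yoursite.com/api",
    "https://en.wikipedia.org/wiki/HTTP_404",
]
external = [link for link in links if is_external(link, "yoursite.com")]
print(external)  # ['https://en.wikipedia.org/wiki/HTTP_404']
```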

How does this compare to Screaming Frog or Ahrefs?

| Tool | Recurring scans | Cost | Setup | Output |
|------|-----------------|------|-------|--------|
| Screaming Frog desktop | Manual only | £199/year + your time | Local install, manual run | CSV |
| Ahrefs Site Audit | Weekly, fixed | $129+/month | SaaS | Dashboard |
| Sitebulb | Manual or scheduled (server required) | $176/year + hosting | Self-hosted | Dashboard |
| Dead-Link Watchdog | Any cron schedule | $0.0008/link, pay per use | 15 min, no install | JSON/CSV/XLSX |

If you already pay for Ahrefs and use the dashboard daily, stick with it. If you want headless, scriptable monitoring that plugs into your own tooling and only bills when it runs, the actor model wins on cost and flexibility.

Common mistakes to avoid:

  • Skipping respectRobotsTxt when scanning external destinations. Crawling without it on your own site is fine, but enable it for external targets to avoid getting your IP blocked.
  • Leaving maxRequestsPerCrawl uncapped on a site with infinite pagination (faceted search, calendars). This can run up unexpected costs, so always set a ceiling.
  • Treating 429 as broken. It usually means your crawler is too fast, not that the link is dead. Lower concurrency before flagging.
  • Ignoring redirect chains. A link that returns 200 after 3 redirects still hurts page speed. The output includes the full chain so you can fix the source.
  • Running daily on a site that changes weekly. Match cadence to actual change rate; daily scans of a static brochure site waste budget.
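
Redirect chains are easy to surface from the exported rows. The Python sketch below assumes redirectChain is a list with one entry per redirect hop; the actual shape of that field is not documented here, so check a real export first. `long_redirect_chains` is a hypothetical helper and the sample rows are made up.

```python
def long_redirect_chains(rows, min_hops=3):
    """Return (hops, sourceUrl, targetUrl) for long chains, longest first."""
    hits = []
    for row in rows:
        chain = row.get("redirectChain") or []  # assumed: one entry per hop
        if len(chain) >= min_hops:
            hits.append((len(chain), row["sourceUrl"], row["targetUrl"]))
    return sorted(hits, reverse=True)

# Made-up sample rows for illustration.
sample = [
    {"sourceUrl": "https://example.com/a", "targetUrl": "http://example.com/old",
     "redirectChain": ["https://example.com/old", "https://www.example.com/old",
                       "https://www.example.com/new"]},
    {"sourceUrl": "https://example.com/b", "targetUrl": "https://example.com/ok",
     "redirectChain": []},
]
print(long_redirect_chains(sample))
# [(3, 'https://example.com/a', 'http://example.com/old')]
```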

FAQ

Q: How much does it cost to monitor a 10,000-page site weekly? At $0.0008 per link checked, a scan of 10,000 links costs $8, so 52 scans/year comes to $416/year. If each page averages 30 internal + external links and you check all of them, you're closer to 300,000 checks per scan, or $240/scan. At that scale, restrict to internal links or sample a subset.

Q: Can I integrate broken link reports with Jira or Linear? Yes. Use the Apify webhook to POST the run results to a middleware (Make, Zapier, n8n) that creates one ticket per unique broken URL, with the source page and status code in the description. Deduplicate against existing tickets to avoid noise.
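
If you write the middleware step yourself, the deduplication could look like the Python sketch below. It groups flagged rows by targetUrl and skips URLs that already have a ticket. The ticket fields (title, description) are placeholders rather than the Jira or Linear API schema, and `tickets_to_create` is a hypothetical helper.

```python
def tickets_to_create(rows, existing_ticket_urls):
    """One ticket per unique broken URL that is not already ticketed."""
    sources_by_target = {}
    status_by_target = {}
    for row in rows:
        if not row.get("flagged"):
            continue
        target = row["targetUrl"]
        if target in existing_ticket_urls:
            continue  # already tracked, so avoid duplicate noise
        sources_by_target.setdefault(target, []).append(row["sourceUrl"])
        status_by_target[target] = row["statusCode"]
    return [
        {
            "title": f"Broken link ({status_by_target[target]}): {target}",
            "description": "Found on: " + ", ".join(sources),
        }
        for target, sources in sources_by_target.items()
    ]

# Made-up sample rows for illustration.
sample = [
    {"sourceUrl": "https://example.com/a", "targetUrl": "https://example.com/x", "statusCode": 404, "flagged": True},
    {"sourceUrl": "https://example.com/b", "targetUrl": "https://example.com/x", "statusCode": 404, "flagged": True},
    {"sourceUrl": "https://example.com/b", "targetUrl": "https://example.com/y", "statusCode": 410, "flagged": True},
]
print(tickets_to_create(sample, existing_ticket_urls={"https://example.com/y"}))
```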

Q: Does Dead-Link Watchdog handle JavaScript-rendered links? The actor crawls server-rendered HTML by default for speed and cost. For SPAs where links only appear after JS execution, enable headless browser mode in the input — it costs more per page but catches links that a plain HTTP crawler would miss.

Q: What's the difference between a broken link and a redirect? A broken link returns 4xx or 5xx: the destination doesn't exist or errored. A redirect (301/302) points to a new URL that normally does work. Redirects aren't broken, but long chains slow page loads and waste crawl budget, so most teams flag chains of 3+ hops.

Q: How do I prevent the crawler from hitting login-protected pages? Add URL patterns to an exclusion list (e.g., /account/*, /admin/*) in the actor input, or rely on respectRobotsTxt if your robots.txt already disallows those paths. For pages that require auth to test, pass session cookies via the input — the actor supports custom headers per run.
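
To preview which URLs a pattern like /account/* would exclude, you can test it locally. The Python sketch below uses shell-style matching from the standard library's fnmatch module; how the actor itself interprets exclusion patterns may differ, so treat this as an approximation. `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

EXCLUDE_PATTERNS = ["/account/*", "/admin/*"]  # example patterns from this FAQ

def is_excluded(url: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """True if the URL's path matches any shell-style exclusion pattern."""
    path = urlparse(url).path
    return any(fnmatchcase(path, pattern) for pattern in patterns)

print(is_excluded("https://yoursite.com/account/settings"))  # True
print(is_excluded("https://yoursite.com/admin/users/42"))    # True
print(is_excluded("https://yoursite.com/accounting-tips"))   # False
```

Note that with this matching style /account (no trailing slash) is not excluded by /account/*; add a separate pattern if you need it.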