
Aggregate Job Listings from Multiple Boards: A Guide

Learn how to aggregate job listings from multiple boards automatically using APIs and scrapers — including company career sites, not just Indeed.

To aggregate job listings from multiple boards automatically, you need a scraper or API that hits each source (Indeed, LinkedIn, Glassdoor, company career pages) on a schedule, normalizes the data into one schema, and deduplicates overlapping postings. The fastest path is to use a managed actor that already handles anti-bot logic and ghost-job filtering, then pipe results into a database or spreadsheet via webhook or scheduled export.

Quick Answer

To aggregate job listings across multiple boards, run a unified scraper that pulls from Indeed, LinkedIn, Glassdoor, Google Jobs, and direct company career sites, then merge the results by hashing title + company + location. Open-source tools like JobSpy cover the major boards, but they miss thousands of postings that live only on company ATS pages (Greenhouse, Lever, Workday). A managed solution like Global Jobs Scraper 2 handles the global scraping, ghost-listing filtering, and dedup in one call at ~$0.01 per result. Schedule it hourly or daily depending on freshness needs, and store the output in Postgres, Airtable, or Google Sheets.

Why aggregate job listings from multiple boards in the first place?

Single-board scraping misses 40–60% of the market. Indeed indexes a lot but throttles aggressively. LinkedIn hides salary ranges and locks listings behind login walls. Glassdoor duplicates Indeed heavily. Meanwhile, around 70% of mid-to-senior engineering roles at companies like Stripe, Datadog, and Anthropic appear on their own Greenhouse or Ashby pages days before they hit aggregators — if they hit aggregators at all.

Real use cases for aggregation:

  • Recruiting intelligence: Track when competitors post new roles to infer hiring strategy.
  • Job alert products: Build a niche board (e.g., "remote Rust jobs in EU") by filtering a firehose.
  • Salary benchmarking: Pull thousands of postings with comp data to build market reports.
  • Personal job search: Stop refreshing 8 tabs every morning.

The economics matter. At $0.01 per result, pulling 10,000 fresh listings daily costs ~$100/day, which is cheaper than one engineer-hour of building and maintaining custom scrapers per board.

What tools can scrape multiple job boards at once?

There are four broad approaches:

1. Open-source libraries (JobSpy, JobFunnel): JobSpy on GitHub wraps LinkedIn, Indeed, Glassdoor, ZipRecruiter, and Google Jobs into one Python call. Free, but you own the proxy bill, the CAPTCHA breaks, and the maintenance when boards change their DOM (which happens roughly every 4–8 weeks on Indeed).

2. Official APIs: Indeed deprecated its public Publisher API in 2023. LinkedIn's Talent API requires partner status (~6-figure annual minimum). Adzuna and TheMuse offer free tiers but cap at a few hundred calls/day. Limited coverage, restrictive terms.

3. Managed scraping actors (Apify ecosystem): Run on someone else's infrastructure. Pay per result. No proxy management. Global Jobs Scraper 2 goes beyond the standard board list and pulls directly from company career sites, which is where roughly half of high-quality listings actually originate. At $9.99 per 1,000 results (dropping to $6.99 at volume), it's competitive with running your own infrastructure once you factor in proxy costs ($75–$300/month for residential IPs).

4. Enterprise data vendors (Greenhouse Job Board API, LinkUp, Lightcast): $2,000–$50,000/month. Worth it only if you're building a commercial product with SLAs.

For most developers and small teams, option 3 hits the sweet spot.

How do you deduplicate jobs across boards?

The same role often appears 3–7 times: once on the company site, once on LinkedIn, once on Indeed (scraped from LinkedIn), once on Glassdoor (scraped from Indeed), plus aggregator copies. Without dedup, your dataset is garbage.

A reliable dedup key:

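import hashlib

def job_fingerprint(job):
    """Return a stable SHA-256 key built from company, title, and city."""
    # Treat missing or None fields as empty strings so one bad record
    # cannot crash a whole batch.
    company = (job.get("company") or "").strip().lower()
    title = (job.get("title") or "").strip().lower()
    city = (job.get("location") or "").split(",")[0].strip().lower()  # city only
    normalized = "|".join([company, title, city])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()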

For fuzzier matches (e.g., "Senior Software Engineer" vs "Sr. Software Engineer"), apply token sort ratio with rapidfuzz at a threshold of 90+. Prefer the listing with the most fields populated (salary, benefits, full JD) when collapsing duplicates. Generally, the company's own career page version wins because it's authoritative.
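
The fuzzy step can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses the standard library's difflib as a stand-in for a token sort ratio so it runs without extra dependencies, and rapidfuzz's `fuzz.token_sort_ratio` could be swapped in for `title_similarity`. The `ABBREVIATIONS` map and the `source == "company_site"` tie-break are assumptions made for the example. Abbreviations are expanded first because a raw comparison of "Sr." and "Senior" may score below a 90 threshold.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation map; extend it for the titles you actually see.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "eng": "engineer", "mgr": "manager"}

def normalize_title(title):
    """Lowercase, strip punctuation, expand abbreviations, and sort tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(sorted(tokens))

def title_similarity(a, b):
    """Token-sort similarity on a 0-100 scale (stdlib stand-in for rapidfuzz)."""
    return 100 * SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()

def titles_match(a, b, threshold=90):
    """True when two titles are similar enough to be treated as one role."""
    return title_similarity(a, b) >= threshold

def pick_best(duplicates):
    """Keep the listing with the most populated fields; company site wins ties."""
    def rank(job):
        populated = sum(1 for value in job.values() if value not in (None, "", []))
        return (populated, job.get("source") == "company_site")
    return max(duplicates, key=rank)
```

Here the company career page only breaks ties. If you want it to win outright, swap the order of the two values in the `rank` tuple.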

How do you filter out ghost jobs and scams?

Studies from Clarify Capital in 2024 estimated that ~40% of job postings are "ghost jobs" — kept active without active hiring intent. Heuristics that work:

  • Age threshold: Listings older than 30 days have a 3x higher ghost rate. Discard or flag.
  • Repost frequency: If the same job_id reappears every 14 days for 3+ cycles, it's likely a pipeline-builder.
  • Vague descriptions: Postings under 400 characters with no specific tech stack or team mention.
  • Suspicious salary ranges: $40k–$400k bands signal a copy-paste template.
  • Domain mismatch: Apply link goes to a random .info domain instead of the company site.

Global Jobs Scraper 2 applies these filters server-side, so you don't have to write the logic yourself. If you're rolling your own, build a scoring function that flags rather than deletes — borderline listings sometimes turn out real.
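
If you do roll your own, a flag-only scorer built from the heuristics above might look like this. It is a sketch under stated assumptions: the field names (`posted_at`, `repost_cycles`, `apply_url`, `company_domain`) and the 5x salary-band ratio are hypothetical choices for the example, not part of any scraper's output format.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

def ghost_flags(job, now=None):
    """Return a list of warning flags; an empty list means nothing looked off."""
    now = now or datetime.now(timezone.utc)
    flags = []

    # Age threshold: older than 30 days.
    posted_at = job.get("posted_at")
    if posted_at and now - posted_at > timedelta(days=30):
        flags.append("stale")

    # Repost frequency: assumes an upstream job counts repost cycles.
    if job.get("repost_cycles", 0) >= 3:
        flags.append("frequent_repost")

    # Vague description: under 400 characters.
    if len(job.get("description") or "") < 400:
        flags.append("vague_description")

    # Suspicious salary band: max is at least 5x min (assumed cutoff).
    low, high = job.get("salary_min"), job.get("salary_max")
    if low and high and high / low >= 5:
        flags.append("wide_salary_band")

    # Domain mismatch: apply link is not on the company's domain.
    # Legitimate ATS hosts will also trip this, which is why it only flags.
    host = urlparse(job.get("apply_url") or "").hostname or ""
    domain = job.get("company_domain")
    if domain and host and host != domain and not host.endswith("." + domain):
        flags.append("domain_mismatch")

    return flags
```

Storing the flag list rather than a single boolean lets you tune thresholds later without re-scraping.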

How often should you run the scraper?

Depends on use case:

Use case            | Frequency     | Daily volume
Personal job alerts | Every 6 hours | 200–500
Niche job board     | Hourly        | 2,000–10,000
Recruiting intel    | Daily         | 5,000–20,000
Market research     | Weekly        | 50,000+

Hourly full crawls are usually overkill: most companies post during business hours in their timezone, and the boards themselves don't refresh more often than every few hours. Run a daily full crawl plus hourly incremental checks on high-priority companies if you need recency.

How do you store and query aggregated job data?

For under 1 million rows, SQLite or Postgres is fine. Schema suggestion:

CREATE TABLE jobs (
  fingerprint TEXT PRIMARY KEY,
  title TEXT,
  company TEXT,
  location TEXT,
  remote BOOLEAN,
  salary_min INT,
  salary_max INT,
  currency TEXT,
  description TEXT,
  source TEXT,        -- 'indeed', 'company_site', etc.
  source_url TEXT,
  posted_at TIMESTAMP,
  scraped_at TIMESTAMP,
  is_ghost BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_company ON jobs(company);
CREATE INDEX idx_posted ON jobs(posted_at DESC);
CREATE INDEX idx_location ON jobs(location);

Add a full-text index on title + description if users will search by keyword. For larger datasets (10M+), use Postgres with tsvector columns or push the data into Elasticsearch / Meilisearch.
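
Because the fingerprint is the primary key, re-running the scraper can refresh existing rows instead of inserting duplicates. The sketch below shows one way to do that with Python's built-in sqlite3 module. It uses a trimmed version of the schema above to stay short, and the choice to update only `scraped_at` and `source_url` on conflict is an assumption for the example.

```python
import sqlite3

# Trimmed version of the jobs table, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
  fingerprint TEXT PRIMARY KEY,
  title TEXT,
  company TEXT,
  location TEXT,
  source TEXT,
  source_url TEXT,
  posted_at TIMESTAMP,
  scraped_at TIMESTAMP
);
"""

# On a fingerprint collision, keep the row and refresh the scrape metadata.
UPSERT = """
INSERT INTO jobs (fingerprint, title, company, location,
                  source, source_url, posted_at, scraped_at)
VALUES (:fingerprint, :title, :company, :location,
        :source, :source_url, :posted_at, :scraped_at)
ON CONFLICT(fingerprint) DO UPDATE SET
  scraped_at = excluded.scraped_at,
  source_url = excluded.source_url;
"""

def upsert_jobs(conn, jobs):
    """Insert new jobs and refresh existing ones in a single transaction."""
    with conn:
        conn.executemany(UPSERT, jobs)
```

Usage: open a connection with `sqlite3.connect("jobs.db")`, run `conn.executescript(SCHEMA)` once, then call `upsert_jobs(conn, rows)` after each scrape, where each row is a dict with the eight keys named in the statement. Postgres supports the same `ON CONFLICT ... DO UPDATE` form.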

What's the cheapest way to aggregate jobs internationally?

International coverage is where most tools fall short. Indeed has 60+ country versions, but each has different anti-bot rules. LinkedIn surfaces non-US jobs inconsistently. Local boards (StepStone in Germany, Seek in Australia, Naukri in India) require dedicated scrapers.

Costs to compare for 10,000 listings/day:

  • DIY with rotating residential proxies: ~$200/month proxies + 15–30 hrs/month maintenance = $1,000+ true cost
  • JobSpy + budget proxies: ~$50/month proxies, but limited to the handful of boards JobSpy supports, no global coverage
  • Global Jobs Scraper 2 at base tier: 10,000 × $0.00999 × 30 days = ~$3,000/month, but zero maintenance and includes company career sites globally
  • Volume tier ($6.99/1k): drops to ~$2,100/month at sustained high usage

If you're scraping under 1,000 results/day, the managed actor is cheaper once maintenance time is counted. At 10,000/day the DIY estimate above is already lower on raw cost, and the gap widens past 50,000/day, but only if you have an engineer who genuinely enjoys fixing selectors.
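
The arithmetic behind these figures is simple enough to script, which makes it easy to rerun with your own volumes. The sketch below assumes a 30-day month and, for the break-even estimate, a DIY cost that stays flat at the $1,000/month figure quoted above regardless of volume. Real DIY costs grow with proxy usage, so treat the result as a rough lower bound.

```python
def monthly_cost(results_per_day, price_per_1k, days=30):
    """Managed-actor cost per month at a per-1,000-results price."""
    return results_per_day / 1000 * price_per_1k * days

def break_even_per_day(diy_monthly_cost, price_per_1k, days=30):
    """Daily volume at which the managed cost equals a flat DIY cost."""
    return diy_monthly_cost / (price_per_1k / 1000 * days)

base = monthly_cost(10_000, 9.99)          # about $2,997/month
volume = monthly_cost(10_000, 6.99)        # about $2,097/month
small = monthly_cost(1_000, 9.99)          # about $300/month
crossover = break_even_per_day(1_000, 9.99)  # roughly 3,300 results/day
```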

How do you stay compliant when scraping job boards?

Three rules of thumb:

  1. Respect robots.txt where it's enforced — but most public job boards don't disallow listing pages.
  2. Don't republish copyrighted descriptions verbatim on a competing site. Summarize, link out, or only show metadata (title, company, salary).
  3. GDPR: Job postings are corporate data, not personal data, so generally safe. But avoid scraping recruiter names or contact info without basis.

For internal use (recruiting intel, personal search, research), legal exposure is minimal. For public products, consult a lawyer once you cross a few thousand users.
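
The first rule of thumb can be checked in code before each crawl. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the URLs, and the `my-job-aggregator` user agent are made up for the example. In a real crawler you would point the parser at the site's own robots.txt with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /jobs/
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a reusable RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed(parser, url, user_agent="my-job-aggregator"):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(SAMPLE_ROBOTS_TXT)
```

With this sample file, `allowed(parser, "https://example.com/jobs/123")` is true and `allowed(parser, "https://example.com/admin/users")` is false.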

FAQ

Q: Can I aggregate jobs from LinkedIn without getting blocked?
LinkedIn aggressively rate-limits and bans scrapers. Using a managed actor with rotating residential proxies and human-like request patterns is the only reliable approach. Even then, expect occasional gaps, so supplement LinkedIn data with company career site scraping for redundancy.

Q: How is Global Jobs Scraper 2 different from JobSpy?
JobSpy targets the major aggregators (LinkedIn, Indeed, Glassdoor, ZipRecruiter, Google Jobs). Global Jobs Scraper 2 adds direct scraping of company career sites and international boards, which is where ~50% of senior-level openings actually live first. It also filters ghost jobs server-side, which JobSpy leaves to the user.

Q: What's a realistic cost to aggregate 1,000 jobs per day?
At the base tier of $9.99 per 1,000 results, that's about $300/month. Running it yourself with proxies plus engineering time typically costs $400–$1,000/month once you account for breakage and maintenance, so the managed option wins for low-to-mid volumes.

Q: How do I export aggregated jobs to Google Sheets or Airtable?
Apify actors expose a JSON dataset and a webhook on each run completion. Point the webhook at a Zapier/Make.com scenario, or use Apify's native Google Sheets integration to append rows automatically. For Airtable, use the Airtable REST API with a simple upsert keyed on the job fingerprint.

Q: Can I filter aggregated jobs by tech stack or salary?
Yes. Global Jobs Scraper 2 supports filters for location, salary range, benefits, and stack at scrape time, so you don't waste credits on irrelevant results. For finer-grained filtering (e.g., "Rust + remote + EU + $120k+"), post-process the JSON output with a simple Python or SQL query.
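
As a rough illustration of that post-processing step, the filter below handles the "Rust + remote + EU + $120k+" case on a list of job dicts. The field names (`stack`, `remote`, `region`, `salary_min`) are assumptions for the example, so map them to whatever your scraper's JSON output actually contains.

```python
def matches(job, stack_keyword, min_salary, regions):
    """True if a job dict satisfies all four filter criteria."""
    stack = [item.lower() for item in job.get("stack") or []]
    return (
        stack_keyword.lower() in stack
        and job.get("remote") is True
        and job.get("region") in regions
        and (job.get("salary_min") or 0) >= min_salary
    )

def filter_jobs(jobs, stack_keyword="rust", min_salary=120_000, regions=("EU",)):
    """Return only the jobs that pass every criterion."""
    return [job for job in jobs if matches(job, stack_keyword, min_salary, regions)]
```

Jobs with a missing salary are treated as below the threshold here. Loosen that check if you would rather keep listings that do not publish compensation.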