You can scrape any website using an AI web scraper with no CSS selectors by describing the data you want in plain English to a tool like ScrapeGraphAI, Thunderbit, or Browse.AI — the LLM reads the rendered page and returns structured JSON. This approach survives most site redesigns because the model identifies content by meaning, not by div.product-title-v2. For production work, pair it with a stealth-capable fetcher like the Scrapling Media & Web Extractor to handle anti-bot defenses before the LLM step.
Quick Answer
An AI web scraper with no CSS selectors works by sending the rendered HTML (or a cleaned text version) to a large language model along with a natural-language prompt like "extract product name, price, and rating." The model returns structured JSON without you ever inspecting the DOM. This pattern has gained ground because class-based CSS selectors stop matching whenever a site renames its classes, while LLM-based extraction tolerates layout shifts. The tradeoff: higher cost per page (~$0.001–$0.01 in LLM tokens) and slower response (2–8 seconds). For high-volume jobs, hybrid pipelines fetch with stealth scrapers and parse with AI only when needed.
Why do CSS selectors keep breaking?
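Many modern frontends use CSS-in-JS libraries or CSS Modules (styled-components, CSS Modules) that generate hashed class names like sc-bdVaJa kPzpqo or _1xj8h2k. Those hashes can change whenever the styles or the build change, so a selector written against them may stop matching after a deploy. Utility frameworks like Tailwind cause a different problem: the class names are stable, but they describe styling rather than content, so they make poor anchors for extraction. Amazon changes product page selectors roughly every 2–3 weeks. LinkedIn rotates them weekly. Even a stable site like Wikipedia restructured infoboxes three times in 2024.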
If you maintain 50 scrapers with hardcoded selectors, expect 5–10 to break per week. The maintenance cost — engineer time at $100/hour debugging XPath — quickly exceeds the cost of running LLM extraction at $0.005 per page.
How does an AI web scraper work without selectors?
The typical pipeline has three stages:
- Fetch: A headless browser or HTTP client retrieves the page. This step still needs to handle JavaScript rendering, cookies, and anti-bot challenges.
- Clean: The raw HTML gets stripped to relevant text. Libraries like readability-lxml or trafilatura remove navigation, ads, and footers, cutting tokens by 70–90%.
- Extract: The cleaned content plus a JSON schema or natural-language prompt goes to an LLM (GPT-4o-mini, Claude Haiku, Gemini Flash). The model returns structured data.
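To make the Clean stage concrete, here is a minimal sketch that uses only Python's standard library. It is a crude stand-in for readability-lxml or trafilatura, not a replacement: the `SKIP_TAGS` set and the `clean_html` helper are assumptions made for illustration, and real pages usually need the smarter heuristics those libraries provide.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption;
# dedicated libraries use far more sophisticated heuristics).
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def clean_html(html: str) -> str:
    """Return the page text outside SKIP_TAGS, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

The cleaned string is what would go to the LLM in the Extract stage, which is where the token savings come from.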
Example prompt for an e-commerce page:
Extract from this page:
- product_name (string)
- price_usd (number)
- in_stock (boolean)
- review_count (integer)
Return only valid JSON.
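A model can still return malformed JSON or the wrong types, so it helps to check the reply before storing it. The sketch below validates a reply against the four fields from the prompt above; `SCHEMA` and `parse_product` are hypothetical names introduced for illustration, not part of any library.

```python
import json

# Field names and types taken from the example prompt above.
SCHEMA = {
    "product_name": str,
    "price_usd": (int, float),
    "in_stock": bool,
    "review_count": int,
}


def parse_product(reply: str) -> dict:
    """Parse the model's reply and check each field against SCHEMA.

    Raises ValueError if the reply is not a JSON object or if a field is
    missing or has the wrong type.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    for field, expected in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject True/False
        # explicitly wherever a number is expected.
        if expected is not bool and isinstance(value, bool):
            raise ValueError(f"{field} must not be a boolean")
        if not isinstance(value, expected):
            raise ValueError(f"{field} has wrong type: {type(value).__name__}")
    return data
```

A failed check is a reasonable trigger for a single retry with the error message appended to the prompt.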
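A 4,000-token page processed by GPT-4o-mini costs about $0.0006 in input tokens (4,000 tokens at $0.15 per million); the short JSON reply adds little on top. Claude Haiku costs more per token: roughly $0.001 per page for Claude 3 Haiku and a few times that for Haiku 3.5. At 10,000 pages per month, that's roughly $6–$10 in extraction costs with the cheapest options, less than one hour of selector debugging.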
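The arithmetic behind these figures is simple enough to script. The sketch below counts input tokens only, and the $0.15 per million rate is the one implied by the GPT-4o-mini figure above; treat it as an assumption and check current pricing before budgeting.

```python
def extraction_cost(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Input-token cost in USD for `pages` pages of `tokens_per_page` tokens each."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


# Assumed rate: $0.15 per million input tokens (4,000 tokens -> $0.0006).
per_page = extraction_cost(1, 4_000, 0.15)
per_month = extraction_cost(10_000, 4_000, 0.15)
print(f"per page: ${per_page:.4f}, per 10,000 pages: ${per_month:.2f}")
```

Swapping in a different rate or page size shows how quickly the monthly figure moves, which is the main reason to clean pages before extraction.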
What are the best AI scraping tools in 2025?
Here's how the main players compare for no-selector scraping:
| Tool | Best for | Pricing | Stealth |
|---|---|---|---|
| ScrapeGraphAI | Developers, open-source pipelines | Free self-host, $20/mo cloud | Manual proxy setup |
| Browse.AI | Non-technical users, monitoring | $48–$500/mo | Built-in |
| Thunderbit | Chrome-based one-off jobs | Free tier, $15/mo | Browser-native |
| Firecrawl | LLM-ready markdown output | $19–$333/mo | Limited |
| Scrapling + LLM | High-volume + protected sites | Pay-per-use on Apify | Yes |
ScrapeGraphAI wins on flexibility — you write Python, point it at any LLM, and pipe results anywhere. Browse.AI wins on usability — point and click, no code. Thunderbit wins on speed for ad-hoc work.
For sites with Cloudflare, PerimeterX, or DataDome protection, you need a stealth-capable fetcher upstream. That's where Scrapling Media & Web Extractor fits: it handles the stealth fetch and media extraction, then you feed the HTML to your LLM of choice for structured parsing.
How do I build an AI scraper with no CSS selectors?
Here's a minimal working pipeline in Python:
import os

from apify_client import ApifyClient
from openai import OpenAI

apify = ApifyClient(os.environ["APIFY_TOKEN"])
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: Fetch with stealth
# Actor ID as given here; Apify actor IDs usually take the form username/actor-name.
run = apify.actor("scrapling-actor").call(run_input={
    "urls": ["https://example-shop.com/product/123"],
    "stealth": True,
    "outputFormat": "html"
})
# Take the first dataset item and read its "html" field
html = next(apify.dataset(run["defaultDatasetId"]).iterate_items())["html"]

# Step 2: Extract with LLM
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for a JSON object reply
    messages=[{
        "role": "user",
        # Only the first 15,000 characters of HTML are sent
        "content": f"""Extract product_name, price_usd, in_stock from this HTML.
Return JSON only.
{html[:15000]}"""
    }]
)
print(response.choices[0].message.content)
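This pattern handles most e-commerce, news, and listing pages. Note that html[:15000] simply truncates the input, so anything past the first 15,000 characters never reaches the model. For long pages, chunk the HTML or pre-clean with trafilatura.extract(html) to drop boilerplate before sending it to the model.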
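One way to chunk a long page is to split the cleaned text on line boundaries so that each piece fits a size budget. The sketch below is one possible approach, not the only one; `chunk_text` is a hypothetical helper, and the 15,000-character default simply mirrors the slice used in the pipeline above.

```python
def chunk_text(text: str, max_chars: int = 15_000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on line boundaries.

    A single line longer than max_chars is hard-split so no chunk exceeds the limit.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        # Hard-split oversized lines; an empty line still yields one empty piece.
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)] or [""]
        for piece in pieces:
            added = len(piece) + (1 if current else 0)  # +1 for the joining newline
            if current and size + added > max_chars:
                chunks.append("\n".join(current))
                current, size = [], 0
                added = len(piece)
            current.append(piece)
            size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk would then be sent in its own request and the partial results merged afterwards. A field that straddles a chunk boundary can be missed, so cleaning first and keeping chunks large is usually the safer order.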
Handling lists and pagination
For pages with repeated items (search results, product grids), ask the LLM for an array:
Extract every product card as an array of objects with
keys: title, price, url. Return JSON.
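GPT-4o-mini handles up to ~80 items per call reliably. If a single page holds more than that, split it into chunks and extract each one separately. For listings that span several pages, send each page URL (for example ?page=2) to your fetcher in turn and run the same extraction prompt on every page.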
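When results arrive from several pages or chunks, they need to be merged and deduplicated. The sketch below assumes each reply is either a JSON array or an object wrapping one array, since models in JSON-object mode typically wrap lists in an object. `merge_items` is a hypothetical helper, and using `url` as the identity key is an assumption based on the prompt above.

```python
import json


def merge_items(replies: list[str], key: str = "url") -> list[dict]:
    """Combine JSON replies from several pages, keeping the first item per key.

    Items that are not objects, or that lack the key, are dropped.
    """
    seen, merged = set(), []
    for reply in replies:
        data = json.loads(reply)
        if isinstance(data, dict):
            # Unwrap replies shaped like {"products": [...]}.
            arrays = [value for value in data.values() if isinstance(value, list)]
            data = arrays[0] if arrays else []
        for item in data:
            if not isinstance(item, dict):
                continue
            ident = item.get(key)
            if ident is None or ident in seen:
                continue
            seen.add(ident)
            merged.append(item)
    return merged
```

Deduplicating on the URL also guards against sites that repeat the last item of one page at the top of the next.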
When should I still use CSS selectors?
AI extraction isn't always the right tool. Use traditional selectors when:
- You scrape one site at high volume: 1M pages on a single domain. The 100x cost gap matters. Write the selector once, fix it monthly.
- Latency must be under 500ms: LLM calls add 2–8 seconds per page. Selectors return in milliseconds.
- The data is in a structured API: Many sites expose JSON endpoints (/api/products.json). Hit those directly. The Scrapling actor supports JSON API extraction natively.
- You need exact pixel-level data: Coordinates, exact HTML structure for archiving, etc.
A practical hybrid: use selectors for the 80% of pages on stable sites, fall back to AI extraction when selectors fail. Log selector breaks, run AI as backup, then update selectors in batches.
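The hybrid can be sketched as a small wrapper that tries the selector path first and only calls the model when a required field is missing. Everything named here is hypothetical: `extract_hybrid` is illustrative, a regex stands in for a CSS selector, and the AI extractor is a stub that calls no model.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]


def extract_hybrid(
    html: str,
    selector_extract: Extractor,
    ai_extract: Extractor,
    required: list[str],
    broken_log: list[str],
    page_id: str = "",
) -> tuple[Optional[dict], str]:
    """Return (data, source), where source is "selector" or "ai"."""
    try:
        data = selector_extract(html)
    except Exception:
        data = None  # a selector that throws is treated like one that misses
    if data and all(data.get(field) not in (None, "") for field in required):
        return data, "selector"
    broken_log.append(page_id)  # record the break so selectors can be fixed in batches
    return ai_extract(html), "ai"


# Hypothetical stand-ins for the two extraction paths.
def by_selector(html: str) -> Optional[dict]:
    match = re.search(r'class="product-title-v2">([^<]+)<', html)
    return {"product_name": match.group(1)} if match else None


def by_ai(html: str) -> Optional[dict]:
    return {"product_name": "Blue Mug"}  # stubbed reply, no model is called


broken: list[str] = []
for page_id, page in [
    ("p1", '<div class="product-title-v2">Blue Mug</div>'),
    ("p2", '<div class="sc-bdVaJa">Blue Mug</div>'),  # class renamed, selector misses
]:
    print(page_id, extract_hybrid(page, by_selector, by_ai, ["product_name"], broken, page_id))
print("pages needing selector fixes:", broken)
```

Passing the extractors in as callables keeps the fallback logic independent of any particular parser or model client.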
Can AI scrapers bypass anti-bot protection?
No — and this trips up most beginners. An LLM only sees what your fetcher delivers. If Cloudflare blocks the request, the AI gets an error page and extracts nothing useful.
You still need:
- Residential or mobile proxies for sites that fingerprint IPs (Instagram, LinkedIn, Amazon)
- Browser fingerprint spoofing — TLS, canvas, WebGL, font enumeration
- Stealth headless browsers — Patched Chromium, undetected-chromedriver, or Camoufox
The Scrapling actor includes stealth mode that handles TLS fingerprinting and common bot-detection bypasses. Run it as the fetch stage, then route the clean HTML to your LLM. This separation keeps your AI extraction code simple and your anti-bot logic in one place.
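Because the model only sees what the fetcher delivers, it is worth checking for a block page before spending tokens on it. The heuristic below is a rough sketch: the marker strings, status codes, and length threshold are all illustrative assumptions and will need tuning for each target site.

```python
# Illustrative markers only; real challenge pages vary by vendor and over time.
BLOCK_MARKERS = ("just a moment", "verify you are human", "access denied", "captcha")
BLOCK_STATUSES = {403, 429, 503}


def looks_blocked(status_code: int, html: str, min_length: int = 500) -> bool:
    """Guess whether a response is an anti-bot page rather than real content."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = html.lower()
    # Challenge pages tend to be short, so a long page that merely mentions
    # "captcha" somewhere in its content is let through.
    return len(html) < min_length and any(marker in lowered for marker in BLOCK_MARKERS)
```

A page flagged this way can be retried through the stealth fetcher instead of being passed to the LLM.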
What does this cost at scale?
Real numbers for 100,000 pages per month:
- Fetch (Scrapling on Apify): ~$30–$80 depending on stealth usage
- LLM extraction (GPT-4o-mini): ~$60 at 4K tokens per page
- Proxy bandwidth (if needed): $20–$200
- Total: ~$110–$340/month
Compare to Browse.AI's enterprise plan at $500/month for similar volume, or a full-time scraping engineer at $10,000+/month. The DIY hybrid wins on cost above ~50K pages/month.
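The monthly total is just the sum of the ranges listed above, which a few lines can confirm. The dictionary keys are labels chosen for this sketch; the dollar figures are copied from the list.

```python
# (low, high) monthly USD estimates for 100,000 pages, from the list above.
COSTS = {
    "fetch": (30, 80),
    "llm_extraction": (60, 60),
    "proxy_bandwidth": (20, 200),
}
PAGES = 100_000

low = sum(lo for lo, _ in COSTS.values())
high = sum(hi for _, hi in COSTS.values())
print(f"total: ${low}-${high} per month")
print(f"per page: ${low / PAGES:.4f}-${high / PAGES:.4f}")
```

Proxy bandwidth accounts for most of the spread, so whether a target needs residential proxies matters more to the budget than the choice of model.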
For smaller jobs (under 5K pages/month), Thunderbit's $15 plan or Browse.AI's $48 plan beats building anything yourself.
FAQ
Q: Is AI web scraping legal? It depends on the jurisdiction and the data. In the US, hiQ v. LinkedIn held that scraping publicly accessible pages likely does not violate the Computer Fraud and Abuse Act, but that ruling does not make all scraping lawful: a site's Terms of Service can still support a breach-of-contract claim, copyright limits how you reuse content, and GDPR/CCPA apply to personal data. Respect robots.txt and check a site's Terms of Service before commercial scraping.
Q: Which LLM is best for web scraping in 2025? GPT-4o-mini and Claude Haiku give a strong cost-to-accuracy ratio for structured extraction, at a fraction of a cent per typical page (about $0.0006 for GPT-4o-mini). Gemini Flash 2.0 is cheapest but slightly less reliable on complex schemas. Use GPT-4o or Claude Sonnet only when extracting nuanced unstructured content like sentiment or summaries.
Q: Can AI scrapers handle JavaScript-heavy sites? Yes, but only if your fetcher renders JavaScript. The LLM itself sees only the HTML you send it. Use a headless browser (Playwright, Puppeteer) or an actor like Scrapling that renders pages before extraction. Plain HTTP clients will miss anything that is rendered or loaded client-side, which is common on sites built with React, Vue, or HTMX.
Q: How accurate is AI extraction compared to CSS selectors? On well-defined schemas (product price, article title, author), GPT-4o-mini hits 95–99% accuracy versus selectors at 100% — when the selectors work. Over a 6-month window, selectors typically drop to 60–80% accuracy due to site changes, while AI stays at 95%+ without maintenance.
Q: Do I need to know Python to use no-selector AI scraping? No. Browse.AI and Thunderbit offer fully visual workflows with no code. ScrapeGraphAI requires Python. For Apify-based pipelines like Scrapling, you can trigger runs from the web UI, Zapier, Make, or a single curl command — Python is optional but unlocks custom LLM routing.