Website to PDF Screenshot API: How to Do It Right

To screenshot a website as PDF programmatically, load the target URL in a headless browser (through Puppeteer, Playwright, or a hosted actor), wait for the page to fully render, then call the browser's page.pdf() method with options like format: 'A4' and printBackground: true. The cleanest path for most teams is a hosted actor that handles lazy-loaded content, viewport sizing, and storage in one call, with no infrastructure to maintain.

Quick Answer

A website to PDF screenshot API works by rendering the target page in a headless Chromium instance, scrolling to trigger lazy-loaded images, and exporting the rendered DOM as a PDF file. You POST a URL (plus options like format, wait time, and viewport), and the API returns either a binary PDF or a download URL. Self-hosting via Puppeteer costs ~$0 in software but eats engineering time on Chrome crashes, memory leaks, and proxy rotation. A managed actor like Full Page Screenshot runs at $0.05 per capture plus platform usage, with no servers to babysit. Pick the route that matches your volume and tolerance for ops work.

What's the difference between PNG screenshots and PDF screenshots?

PNG captures the visual pixels of a rendered page — great for thumbnails, previews, and visual diffing. PDF preserves text as selectable, searchable content, supports multi-page pagination, embeds fonts, and is the standard for archiving, legal evidence, invoices, and printed reports.

Practical differences:

Feature	PNG	PDF
Text searchable	No	Yes
File size (long page)	2–8 MB	200 KB – 2 MB
Multi-page split	No	Yes (A4, Letter, etc.)
Print-ready	No	Yes
Vector elements	Rasterized	Preserved

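If you're archiving 10,000 product pages a month, the typical sizes above suggest PDFs can cut storage by 70% or more (the exact figure depends on how image-heavy the pages are), and the archive stays text-searchable for compliance audits.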
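
As a rough sanity check on the storage claim, the sketch below plugs the midpoints of the table's size ranges into a small calculation. The midpoint choice and the function name estimateMonthlyStorage are assumptions for illustration, not measured data.

```javascript
// Size ranges (in MB) taken from the comparison table above.
const SIZE_RANGES_MB = {
  png: { min: 2, max: 8 },
  pdf: { min: 0.2, max: 2 },
};

// Assumption: use the midpoint of each range as a "typical" long-page size.
function midpoint({ min, max }) {
  return (min + max) / 2;
}

// Estimate monthly storage for each format and the relative saving of PDF.
function estimateMonthlyStorage(pagesPerMonth) {
  const pngMb = pagesPerMonth * midpoint(SIZE_RANGES_MB.png);
  const pdfMb = pagesPerMonth * midpoint(SIZE_RANGES_MB.pdf);
  return {
    pngGb: pngMb / 1000,
    pdfGb: pdfMb / 1000,
    savingsPercent: Math.round((1 - pdfMb / pngMb) * 100),
  };
}

console.log(estimateMonthlyStorage(10000));
```

With midpoints of 5 MB and 1.1 MB, 10,000 pages come to about 50 GB of PNGs versus about 11 GB of PDFs, a saving of roughly 78%. Because the two ranges overlap at 2 MB, a text-light, image-heavy site can land well below that.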

How do I screenshot a website as PDF using Puppeteer?

Here's the minimal self-hosted recipe with Puppeteer:

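const puppeteer = require('puppeteer');

async function urlToPdf(url, outputPath) {
  // The sandbox flags are often required inside containers; drop them where the host allows sandboxing
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Wait until the network has been idle for 500 ms, or give up after 60 s
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });

    // Force lazy-loaded images to render by scrolling to the bottom in 100px steps
    await page.evaluate(async () => {
      await new Promise((resolve) => {
        let totalHeight = 0;
        const distance = 100;
        const timer = setInterval(() => {
          window.scrollBy(0, distance);
          totalHeight += distance;
          // scrollHeight is re-read on every tick, so pages that grow while scrolling are covered
          if (totalHeight >= document.body.scrollHeight) {
            clearInterval(timer);
            resolve();
          }
        }, 100);
      });
    });

    await page.pdf({
      path: outputPath,
      format: 'A4',
      printBackground: true,
      margin: { top: '20px', bottom: '20px', left: '20px', right: '20px' }
    });
  } finally {
    // Always release Chromium, even when navigation or rendering throws
    await browser.close();
  }
}

urlToPdf('https://example.com', 'output.pdf').catch((err) => {
  console.error('PDF capture failed:', err);
  process.exitCode = 1;
});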

This works for ~80% of pages. The other 20% will trip on:

Sites that detect headless Chrome and block (LinkedIn, Cloudflare-protected pages)
Fonts that don't load before page.pdf() fires
Single-page apps where networkidle0 never resolves
Memory leaks when running 500+ jobs in a long-lived process

Plan for ~3 days of engineering to harden Puppeteer for production, plus $20–40/month for a VPS with 4 GB RAM minimum.
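
Two of the failure modes listed above (networkidle0 never resolving, and fonts not loading before page.pdf() fires) can be softened with a pair of small helpers. This is a minimal sketch rather than a complete hardening: the helper names gotoWithFallback, waitForFonts, and preparePage and the 30-second default are assumptions, and each helper takes an already-created Puppeteer page as its argument.

```javascript
// Navigate with 'networkidle0', but treat a timeout as "good enough":
// the page has usually rendered even if some connection never goes idle.
// Returns 'idle' when the network settled, 'timed-out' otherwise.
async function gotoWithFallback(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
    return 'idle';
  } catch (err) {
    // Puppeteer's timeout errors carry the name 'TimeoutError';
    // anything else (DNS failure, bad URL) is re-thrown.
    if (err && err.name === 'TimeoutError') return 'timed-out';
    throw err;
  }
}

// Block until the browser reports that all declared web fonts have loaded.
async function waitForFonts(page) {
  await page.evaluate(async () => {
    await document.fonts.ready;
  });
}

// Hypothetical wrapper showing the intended call order before page.pdf().
async function preparePage(page, url) {
  const status = await gotoWithFallback(page, url);
  await waitForFonts(page);
  return status;
}
```

Calling await preparePage(page, url) in place of the bare page.goto() and logging the returned status shows how often the fallback path is taken. Headless detection and memory leaks are separate problems and are not addressed here.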

How do I use a hosted website to PDF screenshot API?

A hosted actor removes the ops layer. Using Full Page Screenshot on Apify:

import { ApifyClient } from 'apify-client';

// Read the token from the environment instead of hard-coding it
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('apify-screenshot').call({
  url: 'https://news.ycombinator.com',
  output: 'pdf',          // pdf, png, or base64
  fullPage: true,
  viewport: { width: 1280, height: 800 },
  delay: 2000             // wait 2s after load for animations
});

// Fetch the download URL from the dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].pdfUrl);

The actor handles:

Lazy-loading via auto-scroll
Mobile/desktop viewport switching
Temporary signed download URLs (no need to host the file)
Retry on transient failures
Proxy rotation (if needed for geo-locked sites)

At $0.05 per screenshot plus platform usage, capturing 1,000 product pages costs about $50 in capture fees, which is less than one hour of engineering time billed at industry rates.

When should I self-host vs. use a hosted API?

A rough decision matrix based on monthly volume:

< 5,000 captures/month: Hosted API wins once setup and maintenance time are counted. At $0.05/capture the bill tops out around $250, and below roughly 400–800 captures/month it is lower than the $20–40 VPS alone.
5,000–100,000/month: Hosted is still cheaper than engineering time. At $0.05/capture, 50,000/month = $2,500 — less than half a senior dev's monthly cost.
> 100,000/month: Run the math. A dedicated Puppeteer cluster on Kubernetes can hit $0.005–0.01 per capture at scale, but you need a full-time engineer to maintain it.
Anything regulated (legal, financial, healthcare): Hosted with a clear audit trail and SLA is usually the right call.

The hidden cost of self-hosting is the long tail: a Chromium update breaks your fonts, a customer's marketing page uses a new lazy-load library, an anti-bot vendor adds a fingerprint check. Each one is a half-day of debugging.
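
The decision matrix can be turned into a back-of-the-envelope calculation. The sketch below uses only figures quoted in this article ($0.05 per hosted capture, a $20–40 VPS, $0.005–0.01 per capture for a cluster at scale). The function names are hypothetical, and engineering salaries and platform usage fees are deliberately left out because they vary too much to hard-code.

```javascript
const HOSTED_PRICE_PER_CAPTURE = 0.05; // USD, from the pricing quoted above

// Hosted bill for a month, excluding platform usage fees.
function hostedMonthlyCost(captures) {
  return captures * HOSTED_PRICE_PER_CAPTURE;
}

// Capture volume at which the hosted bill equals a fixed monthly server cost.
function breakEvenCaptures(serverCostPerMonth) {
  return Math.round(serverCostPerMonth / HOSTED_PRICE_PER_CAPTURE);
}

// Infrastructure-only cost range for a self-hosted cluster at scale.
function clusterMonthlyCostRange(captures) {
  return { low: captures * 0.005, high: captures * 0.01 };
}

for (const captures of [1000, 50000, 100000]) {
  console.log(captures, 'captures ->', hostedMonthlyCost(captures), 'USD hosted');
}
console.log('Break-even vs $20 VPS:', breakEvenCaptures(20), 'captures');
console.log('Break-even vs $40 VPS:', breakEvenCaptures(40), 'captures');
console.log('Cluster at 100k:', clusterMonthlyCostRange(100000));
```

On infrastructure cost alone, the hosted bill matches the VPS bill at 400–800 captures a month. Above that volume, the case for hosted rests on the engineering time described in this section rather than on the server invoice.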

How do I capture a multi-page website as a single PDF?

There are two approaches:

1. Single long-page PDF. Use fullPage: true so the actor scrolls the whole page and captures it into one PDF file. Works for blog posts, product pages, and documentation. The content is split into pages automatically according to the chosen paper size (A4, Letter, etc.).

2. Multi-URL stitching. If you need 10 different URLs combined (e.g., a full product catalog), capture each as a PDF, then merge with pdf-lib or pdftk:

import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

async function mergePdfs(pdfPaths, outputPath) {
  const merged = await PDFDocument.create();
  for (const path of pdfPaths) {
    const bytes = fs.readFileSync(path);
    const pdf = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(pdf, pdf.getPageIndices());
    pages.forEach((p) => merged.addPage(p));
  }
  fs.writeFileSync(outputPath, await merged.save());
}

This pattern is common for compliance archives where you need a date-stamped snapshot of an entire site.

How do I handle authenticated or paywalled pages?

Three options, in order of reliability:

1. Cookie injection. Most hosted screenshot APIs accept a cookies array. Log in once manually, export the session cookie, pass it in. Works for sites where sessions last days or weeks.
2. Bearer tokens in headers. Pass customHeaders: { Authorization: 'Bearer xyz' } for API-backed dashboards.
3. Pre-screenshot login script. Some actors accept a preNavigationHook where you can run a login flow before capture. Slower and more brittle.

Never embed plaintext passwords in API calls. Use cookie or token injection from a separately maintained session manager.
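
On the self-hosted Puppeteer route, cookie and token injection map onto page.setCookie() and page.setExtraHTTPHeaders(). The sketch below is illustrative only: the applyAuth and authFromEnv helpers, the shape of the auth object, and the environment variable names are all hypothetical, and the cookie API has been evolving across Puppeteer versions, so check the docs for the release you run.

```javascript
// Apply session credentials to a Puppeteer page before navigating.
// `auth.cookies` is an array of { name, value, domain } objects exported
// from a logged-in session; `auth.bearerToken` is an API token string.
async function applyAuth(page, auth) {
  if (auth.cookies && auth.cookies.length > 0) {
    await page.setCookie(...auth.cookies);
  }
  if (auth.bearerToken) {
    await page.setExtraHTTPHeaders({
      Authorization: `Bearer ${auth.bearerToken}`,
    });
  }
}

// Read secrets from the environment rather than hard-coding them.
// SESSION_COOKIES_JSON and DASHBOARD_TOKEN are hypothetical variable names.
function authFromEnv(env = process.env) {
  return {
    cookies: env.SESSION_COOKIES_JSON ? JSON.parse(env.SESSION_COOKIES_JSON) : [],
    bearerToken: env.DASHBOARD_TOKEN || null,
  };
}
```

Call await applyAuth(page, authFromEnv()) before page.goto(). Note that extra HTTP headers are attached to every request the page makes, including requests to third-party hosts, so the bearer-token path is best kept to pages that only load first-party resources.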

What about archiving for compliance?

For legal-grade archiving (FINRA, GDPR data requests, IP enforcement), you need three things alongside the PDF:

Cryptographic hash of the PDF at capture time (SHA-256 is fine)
Timestamp from a trusted source (RFC 3161 timestamping authority)
Source metadata: full URL, HTTP response code, user agent, IP geolocation

Wrap the screenshot API in a service that records these three to an append-only log (S3 with Object Lock, or a database with row-level immutability). Total cost: ~$0.06 per archived page including the actor call.
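
The record such a wrapper might write can be sketched as follows. This is a minimal example under stated assumptions: buildArchiveRecord and its field names are hypothetical, the capturedAt value comes from the local clock and stands in for a real RFC 3161 timestamp token, and writing to the append-only store is left out.

```javascript
const { createHash } = require('node:crypto');

// Build the evidence record that accompanies an archived PDF.
// `pdfBytes` is a Buffer or Uint8Array; `meta` holds capture details.
function buildArchiveRecord(pdfBytes, meta) {
  return {
    sha256: createHash('sha256').update(pdfBytes).digest('hex'),
    // Local clock only: a placeholder until an RFC 3161 token is attached.
    capturedAt: new Date().toISOString(),
    url: meta.url,
    httpStatus: meta.httpStatus,
    userAgent: meta.userAgent,
    ipGeolocation: meta.ipGeolocation,
  };
}

// Example values below are placeholders, not real capture data.
const record = buildArchiveRecord(Buffer.from('abc'), {
  url: 'https://example.com',
  httpStatus: 200,
  userAgent: 'ExampleArchiver/1.0',
  ipGeolocation: 'DE',
});

// Serialize as one JSON line, suitable for an append-only log.
console.log(JSON.stringify(record));
```

Hashing the exact bytes that get stored, rather than re-downloading the file later, keeps the digest tied to the moment of capture.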

FAQ

Q: How much does a website to PDF screenshot API cost? Hosted options range from $0.02 to $0.10 per capture depending on features. Full Page Screenshot on Apify is $50 per 1,000 captures ($0.05 each) plus platform usage, which works out cheaper than ScreenshotAPI.net's $0.07 tier at comparable volume.

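Q: Can I capture JavaScript-heavy sites like React apps as PDF? Yes. Any modern headless-Chromium-based API renders React, Vue, and Angular apps fully. Set a delay of 2–5 seconds after page load to let client-side rendering finish, and use waitUntil: 'networkidle0' so the capture waits for XHR calls to settle. If the app keeps connections open (polling, analytics beacons), networkidle0 may never resolve; in Puppeteer, fall back to 'networkidle2' or wait for a specific selector instead.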

Q: How do I trigger lazy-loaded images before PDF capture? Auto-scroll the page from top to bottom in 100px increments with a 100ms delay between scrolls — this triggers IntersectionObserver-based lazy loaders. The Full Page Screenshot actor does this automatically when fullPage: true is set, so you don't need custom scroll logic.

Q: What's the max page length for a PDF screenshot? Most APIs cap at ~16,384 pixels of page height. At Chromium's default 96 DPI an A4 page is about 1,123 px tall, so that is roughly 15 A4 pages at scale 1. For longer pages, split into sections by CSS selector or use a multi-shot strategy: capture in 8,000px chunks and merge with pdf-lib. Memory consumption grows linearly with page height, so a 50,000px page can crash a 4 GB Chrome instance.
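
The chunking arithmetic for the multi-shot strategy can be sketched as below. The chunkRanges helper is hypothetical, and how each slice is actually captured (a clip region, a scroll offset, or a selector) depends on the API you use and is not shown.

```javascript
// Split a page of `totalHeight` pixels into consecutive vertical slices
// no taller than `chunkHeight`. Each slice is { top, height } in pixels.
function chunkRanges(totalHeight, chunkHeight = 8000) {
  if (totalHeight <= 0 || chunkHeight <= 0) {
    throw new RangeError('heights must be positive');
  }
  const ranges = [];
  for (let top = 0; top < totalHeight; top += chunkHeight) {
    ranges.push({ top, height: Math.min(chunkHeight, totalHeight - top) });
  }
  return ranges;
}

console.log(chunkRanges(50000));
```

A 50,000px page yields seven slices: six of 8,000px and a final one of 2,000px starting at 48,000px.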

Q: Can I add headers, footers, or watermarks to the PDF? Puppeteer's page.pdf() supports displayHeaderFooter: true together with the headerTemplate and footerTemplate options, which take HTML strings. For watermarks, inject a fixed-position div with page.evaluate() before capture, or post-process with pdf-lib to stamp text or images onto each page after the screenshot is taken.
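
A header and page-number footer might be configured as in the sketch below. The pdfOptionsWithHeaderFooter helper, the 9px font size, and the 60px margins are assumptions; the pageNumber and totalPages classes are the placeholders Puppeteer documents for these templates.

```javascript
// Build page.pdf() options with a text header and a page-number footer.
// Chromium fills in elements with the classes `pageNumber`, `totalPages`,
// `date`, `title`, and `url` inside these templates.
// `headerText` is inserted as raw HTML, so escape it if it is user-supplied.
function pdfOptionsWithHeaderFooter(outputPath, headerText) {
  // Set an explicit font size; templates can render too small to read without one.
  const baseStyle = 'font-size:9px; width:100%; text-align:center;';
  return {
    path: outputPath,
    format: 'A4',
    printBackground: true,
    displayHeaderFooter: true,
    headerTemplate: `<div style="${baseStyle}">${headerText}</div>`,
    footerTemplate:
      `<div style="${baseStyle}">Page <span class="pageNumber"></span>` +
      ` of <span class="totalPages"></span></div>`,
    // Leave room for the templates; they are drawn inside the margins.
    margin: { top: '60px', bottom: '60px', left: '20px', right: '20px' },
  };
}

console.log(pdfOptionsWithHeaderFooter('output.pdf', 'Archived copy'));
```

Pass the result straight to the capture call, for example await page.pdf(pdfOptionsWithHeaderFooter('output.pdf', 'Archived copy')).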