Glossary

JSON Schema

JSON Schema is a way to define the shape of JSON data: which fields exist, what types they are, and what counts as valid. In scraping, it gives you a contract for the output so you get structured data you can actually rely on instead of vaguely shaped JSON that breaks downstream.

Examples

A simple schema for extracting product data from a page:

{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" },
    "in_stock": { "type": "boolean" }
  },
  "required": ["title", "price"]
}

What valid output might look like:

{
  "title": "Mechanical Keyboard",
  "price": 129.99,
  "in_stock": true
}

And what breaks the contract:

{
  "title": "Mechanical Keyboard",
  "price": "129.99",
  "in_stock": "yes"
}

That kind of mismatch matters more than people think. The scrape might look fine in a demo, then your pipeline falls over because one site returned strings, another returned numbers, and now you're cleaning up junk instead of shipping anything.
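To make the mismatch concrete, here is a minimal sketch of the kind of type check a schema validator performs against the product schema above. A real pipeline would use a proper validator library; this hand-rolled version covers only these three fields, and (unlike a real JSON Schema validator) doesn't special-case booleans masquerading as numbers.

```python
# Types the product schema above expects, and which fields are required.
SCHEMA_TYPES = {"title": str, "price": (int, float), "in_stock": bool}
REQUIRED = {"title", "price"}

def violations(record):
    """Return a list of contract violations for one scraped record."""
    errors = [f"missing required field: {f}" for f in sorted(REQUIRED - record.keys())]
    for field, expected in SCHEMA_TYPES.items():
        if field in record and not isinstance(record[field], expected):
            errors.append(f"{field}: wrong type {type(record[field]).__name__}")
    return errors

good = {"title": "Mechanical Keyboard", "price": 129.99, "in_stock": True}
bad = {"title": "Mechanical Keyboard", "price": "129.99", "in_stock": "yes"}

print(violations(good))  # []
print(violations(bad))   # ['price: wrong type str', 'in_stock: wrong type str']
```

The bad record from the example above fails on both coerced fields, which is exactly the failure you want surfaced at scrape time rather than in your analytics queries.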

Practical tips

  • Keep schemas tight enough to be useful, but not so strict that every minor page variation fails.
  • Use required only for fields you actually need downstream.
  • Be explicit about types: string, number, boolean, array, object.
  • If a field is optional in the real world, model it that way. Production data is messy; pretending otherwise just creates retries and post-processing.
  • For list pages, define array item structure clearly so every row comes back in the same shape.
  • If you're using schema-based extraction with a scraping API, validate the response before it hits your database.
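The last tip, validating before storage, can be sketched like this. The code uses sqlite3 as a stand-in for whatever database your pipeline writes to, and the `is_valid` gate is a simplified, hypothetical check, not a full schema validator.

```python
import sqlite3

def is_valid(record):
    """Gate: only records matching the product schema's required types pass."""
    return (
        isinstance(record.get("title"), str)
        and isinstance(record.get("price"), (int, float))
        and not isinstance(record.get("price"), bool)  # JSON booleans aren't numbers
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, price REAL)")

scraped = [
    {"title": "Mechanical Keyboard", "price": 129.99},
    {"title": "Desk Mat", "price": "29.99"},  # string price: rejected at the gate
]

rows = [(r["title"], r["price"]) for r in scraped if is_valid(r)]
conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # 1
```

Rejected records can be logged or queued for retry instead of silently dropped; the point is that nothing malformed reaches the table.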

Example pattern for a results list:

{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "url": { "type": "string" },
          "rating": { "type": "number" }
        },
        "required": ["name", "url"]
      }
    }
  },
  "required": ["products"]
}

Use cases

  • LLM-based extraction: define the fields you want, then force the model output into a predictable structure.
  • Scraping APIs: send a schema with the request so the API returns normalized JSON instead of raw HTML or loosely formatted text.
  • Data pipelines: validate scraped output before storing it, so bad records get caught early.
  • Multi-site scraping: keep one output format across different page layouts, which saves a lot of cleanup later.
  • Team handoffs: schemas make expectations explicit, so your scraper, backend, and analytics code are all working against the same contract.

This is one of those things that feels optional when you're scraping one page for yourself. It stops feeling optional when you have hundreds of pages, multiple sources, and something downstream that expects the data to be sane.

Related terms

  • Structured Data
  • Data Extraction
  • HTML Parsing
  • CSS Selectors
  • XPath
  • Validation
  • JSON
  • LLM Extraction