Glossary

CDATA

CDATA is an XML section that tells the parser to treat the contents as raw text instead of markup. It’s mainly there so characters like < and & can appear without being escaped, which comes up a lot in RSS feeds, XML APIs, and embedded HTML or JavaScript.

Examples

A CDATA section in XML looks like this:

<description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
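Without CDATA, the same element would need every special character escaped as an entity — the two forms are equivalent to an XML parser:

```xml
<description>&lt;p&gt;Price dropped to $19.99 &amp; shipping is free&lt;/p&gt;</description>
```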

If you're scraping XML in Python, the parser will typically give you the text inside the CDATA block:

from lxml import etree

xml = '''
<item>
  <title>Example</title>
  <description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''

root = etree.fromstring(xml)
# findtext() returns the element's text content; the CDATA wrapper is already stripped
description = root.findtext('description')
print(description)
# <p>Price dropped to $19.99 & shipping is free</p>
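The same extraction works with the standard library if lxml isn't installed — a minimal sketch using xml.etree.ElementTree, which also merges CDATA content into the element's text:

```python
import xml.etree.ElementTree as ET

xml = '''
<item>
  <title>Example</title>
  <description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''

root = ET.fromstring(xml)
# Just like lxml, ElementTree hands back the text with the CDATA wrapper removed
description = root.findtext('description')
print(description)
# <p>Price dropped to $19.99 & shipping is free</p>
```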

If that CDATA contains HTML, you may need to parse it a second time:

from bs4 import BeautifulSoup
from lxml import etree

xml = '''
<item>
  <description><![CDATA[<div class="price">$19.99</div>]]></description>
</item>
'''

root = etree.fromstring(xml)
html_fragment = root.findtext('description')  # the raw HTML string inside the CDATA
soup = BeautifulSoup(html_fragment, 'html.parser')
price = soup.select_one('.price').get_text(strip=True)
print(price)
# $19.99

Practical tips

  • Don’t treat CDATA as a separate data source. In most XML parsers, it comes through as text, and that’s all you need.
  • If the CDATA contains HTML, parse the XML first, then parse the extracted string as HTML. Trying to do both at once gets messy fast.
  • Watch for feeds that stuff JSON, JavaScript, or broken HTML inside CDATA: valid XML outside, messy content inside.
  • CDATA is an XML thing, not an HTML thing. If you’re scraping regular web pages, this usually only shows up inside embedded XML, SVG, or feed endpoints.
  • The raw CDATA wrapper is <![CDATA[ ... ]]>. Good parsers remove that wrapper for you.
  • CDATA does not make bad content safe or structured. It just stops the XML parser from treating the contents as markup.
  • If you’re pulling RSS or other XML endpoints through ScrapeRouter, the main job is still the same: fetch reliably first, then parse the XML cleanly after.
  • Use a tolerant parser when feeds are technically XML but still kind of broken, which happens more than people want to admit:

from lxml import etree

# xml_bytes is the raw response body you fetched
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xml_bytes, parser=parser)
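As a concrete case of the JSON-in-CDATA situation mentioned above — a minimal sketch with a made-up `payload` element name; the standard library behaves the same as lxml here:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical feed item that stuffs a JSON blob inside CDATA
xml = '''
<item>
  <payload><![CDATA[{"price": 19.99, "currency": "USD"}]]></payload>
</item>
'''

root = ET.fromstring(xml)
# The parser strips the CDATA wrapper; what's left is an ordinary JSON string
data = json.loads(root.findtext('payload'))
print(data['price'])
# 19.99
```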

Use cases

  • RSS and Atom feeds: descriptions and content fields often wrap HTML in CDATA so feed readers can render formatted text.
  • XML APIs: some older APIs return HTML snippets, SQL fragments, or JSON blobs inside CDATA instead of modeling the response properly.
  • Sitemaps and product feeds: merchants sometimes stuff rich descriptions into CDATA to avoid escaping every tag.
  • Config and integration files: CDATA is sometimes used for embedded scripts or templates where escaping would be annoying and error-prone.
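Going the other direction — writing CDATA rather than reading it — lxml exposes an etree.CDATA wrapper. A minimal sketch; the element name is just for illustration:

```python
from lxml import etree

root = etree.Element('description')
# Assigning etree.CDATA makes lxml serialize the text as a CDATA section,
# so the < and & characters stay unescaped
root.text = etree.CDATA('<p>5 < 10 & 10 > 5</p>')
print(etree.tostring(root).decode())
# <description><![CDATA[<p>5 < 10 & 10 > 5</p>]]></description>
```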

In scraping, the practical issue is simple: the XML parses fine, but the useful data inside CDATA may still need another parsing step. That’s where people trip up.

Related terms

  • XML
  • RSS
  • XPath
  • HTML parsing
  • Structured Data