Glossary

CDATA

CDATA is an XML section that tells the parser to treat the contents as raw text instead of markup. It’s mainly there so characters like < and & can appear without being escaped, which comes up a lot in RSS feeds, XML APIs, and embedded HTML or JavaScript.

Examples

A CDATA section in XML looks like this:

<description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
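Without CDATA, the same element would need every special character escaped as an entity — the two forms are equivalent to an XML parser:

```xml
<description>&lt;p&gt;Price dropped to $19.99 &amp; shipping is free&lt;/p&gt;</description>
```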

If you're scraping XML in Python, the parser will typically give you the text inside the CDATA block:

from lxml import etree

xml = '''
<item>
  <title>Example</title>
  <description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''

root = etree.fromstring(xml)
# findtext() returns the element's text content; the CDATA wrapper is already stripped
description = root.findtext('description')
print(description)
# <p>Price dropped to $19.99 & shipping is free</p>
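The same extraction works with the standard library if lxml isn't installed — a minimal sketch using xml.etree.ElementTree, which also merges CDATA content into the element's text:

```python
import xml.etree.ElementTree as ET

xml = '''
<item>
  <title>Example</title>
  <description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''

root = ET.fromstring(xml)
# Just like lxml, ElementTree hands back the text with the CDATA wrapper removed
description = root.findtext('description')
print(description)
# <p>Price dropped to $19.99 & shipping is free</p>
```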

If that CDATA contains HTML, you may need to parse it a second time:

from bs4 import BeautifulSoup
from lxml import etree

xml = '''
<item>
  <description><![CDATA[<div class="price">$19.99</div>]]></description>
</item>
'''

root = etree.fromstring(xml)
html_fragment = root.findtext('description')  # the raw HTML string inside the CDATA
soup = BeautifulSoup(html_fragment, 'html.parser')
price = soup.select_one('.price').get_text(strip=True)
print(price)
# $19.99

Practical tips

  • Don’t treat CDATA as a separate data source. In most XML parsers, it comes through as text, and that’s all you need.
  • If the CDATA contains HTML, parse the XML first, then parse the extracted string as HTML. Trying to do both at once gets messy fast.
  • Watch for feeds that stuff JSON, JavaScript, or broken HTML inside CDATA: valid XML outside, messy content inside.
  • CDATA is an XML thing, not an HTML thing. If you’re scraping regular web pages, this usually only shows up inside embedded XML, SVG, or feed endpoints.
  • The raw CDATA wrapper is <![CDATA[ ... ]]>. Good parsers remove that wrapper for you.
  • CDATA does not make bad content safe or structured. It just stops the XML parser from treating the contents as markup.
  • If you’re pulling RSS or other XML endpoints through ScrapeRouter, the main job is still the same: fetch reliably first, then parse the XML cleanly after.
  • Use a tolerant parser when feeds are technically XML but still kind of broken, which happens more than people want to admit:

from lxml import etree

# xml_bytes is the raw response body you fetched
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xml_bytes, parser=parser)
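As a concrete case of the JSON-in-CDATA situation mentioned above — a minimal sketch with a made-up `payload` element name; the standard library behaves the same as lxml here:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical feed item that stuffs a JSON blob inside CDATA
xml = '''
<item>
  <payload><![CDATA[{"price": 19.99, "currency": "USD"}]]></payload>
</item>
'''

root = ET.fromstring(xml)
# The parser strips the CDATA wrapper; what's left is an ordinary JSON string
data = json.loads(root.findtext('payload'))
print(data['price'])
# 19.99
```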

Use cases

  • RSS and Atom feeds: descriptions and content fields often wrap HTML in CDATA so feed readers can render formatted text.
  • XML APIs: some older APIs return HTML snippets, SQL fragments, or JSON blobs inside CDATA instead of modeling the response properly.
  • Sitemaps and product feeds: merchants sometimes stuff rich descriptions into CDATA to avoid escaping every tag.
  • Config and integration files: CDATA is sometimes used for embedded scripts or templates where escaping would be annoying and error-prone.
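Going the other direction — writing CDATA rather than reading it — lxml exposes an etree.CDATA wrapper. A minimal sketch; the element name is just for illustration:

```python
from lxml import etree

root = etree.Element('description')
# Assigning etree.CDATA makes lxml serialize the text as a CDATA section,
# so the < and & characters stay unescaped
root.text = etree.CDATA('<p>5 < 10 & 10 > 5</p>')
print(etree.tostring(root).decode())
# <description><![CDATA[<p>5 < 10 & 10 > 5</p>]]></description>
```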

In scraping, the practical issue is simple: the XML parses fine, but the useful data inside CDATA may still need another parsing step. That’s where people trip up.

Related terms

  • XML
  • RSS
  • XPath
  • HTML parsing
  • Structured Data