Examples
A CDATA section in XML looks like this:
<description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
If you're scraping XML in Python, the parser will typically give you the text inside the CDATA block:
from lxml import etree
xml = '''
<item>
<title>Example</title>
<description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''
root = etree.fromstring(xml)
description = root.findtext('description')
print(description)
# <p>Price dropped to $19.99 & shipping is free</p>
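The same behavior holds for the standard library's xml.etree.ElementTree, so you don't need lxml just to read CDATA; a minimal check of the same claim:

```python
import xml.etree.ElementTree as ET

xml = '''
<item>
<description><![CDATA[<p>Price dropped to $19.99 & shipping is free</p>]]></description>
</item>
'''
root = ET.fromstring(xml)
# The CDATA wrapper is gone; you get the inner text directly.
print(root.findtext('description'))
# <p>Price dropped to $19.99 & shipping is free</p>
```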
If that CDATA contains HTML, you may need to parse it a second time:
from bs4 import BeautifulSoup
from lxml import etree
xml = '''
<item>
<description><![CDATA[<div class="price">$19.99</div>]]></description>
</item>
'''
root = etree.fromstring(xml)
html_fragment = root.findtext('description')
soup = BeautifulSoup(html_fragment, 'html.parser')
price = soup.select_one('.price').get_text(strip=True)
print(price)
# $19.99
Practical tips
- Don’t treat CDATA as a separate data source. In most XML parsers, it comes through as text, and that’s all you need.
- If the CDATA contains HTML, parse the XML first, then parse the extracted string as HTML. Trying to do both at once gets messy fast.
- Watch for feeds that stuff JSON, JavaScript, or broken HTML inside CDATA: valid XML outside, messy content inside.
- CDATA is an XML thing, not an HTML thing. If you’re scraping regular web pages, this usually only shows up inside embedded XML, SVG, or feed endpoints.
- The raw CDATA wrapper is <![CDATA[ ... ]]>. Good parsers remove that wrapper for you.
- CDATA does not make bad content safe or structured. It just stops the XML parser from treating the contents as markup.
- If you’re pulling RSS or other XML endpoints through ScrapeRouter, the main job is still the same: fetch reliably first, then parse the XML cleanly after.
- Use a tolerant parser when feeds are technically XML but still kind of broken, which happens more than people want to admit:
from lxml import etree
# xml_bytes: the raw feed bytes you fetched earlier
parser = etree.XMLParser(recover=True)
root = etree.fromstring(xml_bytes, parser=parser)
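When a feed stuffs JSON into CDATA, as the tips above warn, extraction is the same two-step idea: read the text out of the element, then hand it to the right second parser. A minimal sketch using the standard library (the payload element and its JSON contents are made up for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical feed item whose CDATA payload is JSON rather than HTML.
xml = '''
<item>
<payload><![CDATA[{"price": 19.99, "currency": "USD"}]]></payload>
</item>
'''
root = ET.fromstring(xml)
# Step 1: the XML parser hands back the CDATA contents as plain text.
raw = root.findtext('payload')
# Step 2: parse that text with the parser it actually needs.
data = json.loads(raw)
print(data['price'])
# 19.99
```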
Use cases
- RSS and Atom feeds: descriptions and content fields often wrap HTML in CDATA so feed readers can render formatted text.
- XML APIs: some older APIs return HTML snippets, SQL fragments, or JSON blobs inside CDATA instead of modeling the response properly.
- Sitemaps and product feeds: merchants sometimes stuff rich descriptions into CDATA to avoid escaping every tag.
- Config and integration files: CDATA is sometimes used for embedded scripts or templates where escaping would be annoying and error-prone.
In scraping, the practical issue is simple: the XML parses fine, but the useful data inside CDATA may still need another parsing step. That’s where people trip up.
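For the RSS case, that second parsing step looks like this in practice; a minimal sketch against a hand-written RSS 2.0 snippet (a real scrape would fetch the feed bytes over HTTP first):

```python
import xml.etree.ElementTree as ET

# Minimal RSS 2.0 document; real feeds would be fetched, not inlined.
rss = '''
<rss version="2.0">
  <channel>
    <item>
      <title>Deal alert</title>
      <description><![CDATA[<p>Price dropped to $19.99</p>]]></description>
    </item>
  </channel>
</rss>
'''
root = ET.fromstring(rss)
for item in root.iter('item'):
    # The CDATA arrives as plain text; the HTML inside it still
    # needs its own parse if you want structured data out of it.
    print(item.findtext('title'), '->', item.findtext('description'))
```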