Reading HTML with PHP and XPath

Learning new things is fun, especially when you get paid to learn them! A couple of months ago I was tasked with scraping data from a website, as a proof of concept of course. I thought it was going to take me forever: regular expressions to find certain strings in the data, trial and error, you know what I mean. Well, I found out that PHP lets you query a loaded document with XPath instead, either through SimpleXMLElement's xpath() method or through the DOMXPath class that goes with DOMDocument.

Normally, XPath is used for reading nodes within the DOM of an XML document, but DOMDocument's loadHTML() method is forgiving enough to read HTML that isn't well-formed; awesome, right? Here is a quick example of how to get this working:

$sampleContent = '<html>
	<head>
	<title>Sample Content</title>
	</head>
	<body>
	<a href="mailto:test@testing.com">Email Me!</a>
	</body>
	</html>';

//Disable libxml errors and allow user to
//fetch error information as needed
libxml_use_internal_errors(true);

$dom = new DOMDocument();

$dom->loadHTML($sampleContent);

$xpath = new DOMXPath($dom);

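//find the href attribute of the first mailto: link
$result = $xpath->query('//a[contains(@href, "mailto:")]/@href');

//query() returns false for a bad expression and an empty list when nothing matches
if ($result === false || $result->length === 0) {
	exit('No mailto: link found');
}

$email = $result->item(0)->nodeValue;

//we need to strip out the leading mailto: portion
$email = preg_replace('/^mailto:/i', '', $email);

echo $email;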
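
The libxml_use_internal_errors(true) call is what keeps sloppy markup from spraying warnings all over your output. As a rough sketch of what that looks like in practice, the snippet below feeds loadHTML() a deliberately broken fragment, still pulls the link out with XPath, and then looks at whatever libxml complained about. The markup and the variable names ($brokenContent, $href, $errors) are made up for illustration, and which errors get reported can vary with your libxml version:

```php
<?php
// Hypothetical, deliberately malformed markup: unclosed <p> and <b>,
// a made-up <widget> tag, and no <html> or <body> wrapper.
$brokenContent = '<p>Contact: <b><a href="mailto:test@testing.com">Email Me!</a>
<widget>not a real HTML tag</widget>';

// Collect parse errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($brokenContent);

// The parser repairs the tree as it goes, so the query still works
$xpath = new DOMXPath($dom);
$links = $xpath->query('//a/@href');
$href = $links->length > 0 ? $links->item(0)->nodeValue : null;

// Fetch whatever libxml complained about, then empty its error buffer
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $href, PHP_EOL;
foreach ($errors as $error) {
	// Each entry is a LibXMLError object with line and message properties
	echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}
```

Clearing the buffer with libxml_clear_errors() matters if you parse many pages in one script, since otherwise the collected errors keep piling up in memory.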

This is the first step to being able to read the data within your HTML document. One thing that made it easier for me to work out the right query was installing XPath Checker for Firefox. Now if only there were a comparable extension for Chrome.
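
Whatever expression you settle on, it should also work if you would rather go the SimpleXMLElement route. One way to do that (a sketch, not the only way) is to let DOMDocument do the forgiving HTML parsing and then hand the tree over with simplexml_import_dom(). The sample page is the same made-up one as before, and $sxml and $hrefs are just illustrative names:

```php
<?php
$sampleContent = '<html><head><title>Sample Content</title></head>
<body><a href="mailto:test@testing.com">Email Me!</a></body></html>';

libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($sampleContent);

// Convert the DOM tree into a SimpleXMLElement rooted at <html>
$sxml = simplexml_import_dom($dom);

// xpath() hands back a plain array of SimpleXMLElement objects
$hrefs = $sxml->xpath('//a[starts-with(@href, "mailto:")]/@href');

$email = null;
if (!empty($hrefs)) {
	// Cast the attribute to a string, then drop the leading "mailto:"
	$email = substr((string) $hrefs[0], strlen('mailto:'));
}

echo $email, PHP_EOL;
```

Since xpath() returns an ordinary array here rather than a DOMNodeList, you can loop over it with foreach or count() it without any extra ceremony.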

In the very distant future (like in the next couple of days), I will expand on this article to detail, from start to finish, how to use cURL and XPath to scrape data from a website.
