Tag Archives: scraping

Scraping Sites Using cURL & XPath

As promised, this article explains how I have gone about scraping data from websites, as a proof of concept of course! Scraping is a grey area, especially if you take data that is not yours and use it for commercial purposes; in some cases that may even be illegal. So for testing purposes, I am going to write code that crawls my company’s support forum for some useful data, and you can learn from it and adapt it for your own purposes.

To start, the code below initializes a cURL request and fetches the HTML from the specified URL:

$ch = @curl_init();

if($ch){

	//the url to crawl
	$url = 'http://forum.laflabs.com/';

	//why not fake a user agent?
	$user_agent =
		'Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9';

	//let's fake a referer while we are at it!
	$referer =
		'http://www.google.com';

	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
	curl_setopt($ch, CURLOPT_REFERER, $referer);

	$content = curl_exec($ch);
	$headers = curl_getinfo($ch);

	curl_close($ch);

	if($content !== false && $headers['http_code'] == 200){

		//traverse the dom

	}

}

Pretty easy so far, right? The code mostly speaks for itself: set a couple of variables and execute the request. Of those variables, only the URL is strictly necessary; the user agent and referer are optional. If the requested URL returns an HTTP status code of 200 (OK), the page was found and we can traverse the DOM using XPath.
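
To give an idea of what could go where the "traverse the dom" comment sits, here is a minimal sketch. The real forum markup is not shown here, so the HTML string, the `$links` array and the `//a[@href]` query are all assumptions for illustration; in practice `$content` would be whatever `curl_exec()` returned.

```php
<?php
// Stand-in for the $content returned by curl_exec(); this markup is
// hypothetical, not the actual forum HTML.
$content = '<html><body>
	<a href="/topic/1">First topic</a>
	<a href="/topic/2">Second topic</a>
	<p>Unclosed paragraph
	</body></html>';

// Keep libxml quiet about sloppy real-world HTML
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($content);

$xpath = new DOMXPath($dom);

// Collect the link text, keyed by href, for every anchor on the page
$links = array();
foreach ($xpath->query('//a[@href]') as $anchor) {
	$links[$anchor->getAttribute('href')] = trim($anchor->textContent);
}

print_r($links);
```

For a real page you would swap the generic `//a[@href]` for a query that targets the elements you actually care about.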

Reading HTML with PHP and XPath

Learning new things is fun, especially when you get paid to learn said new thing! A couple of months ago I was tasked with scraping data from a website, as a proof of concept of course. I thought it was going to take me forever: using regular expressions to find certain strings in the data, trial and error, you know what I mean. Well, I found out that PHP lets you query the DOM you have loaded with XPath, either through SimpleXMLElement’s xpath() method or, as in the example below, through the DOMXPath class.

Normally, XPath is used for reading nodes within the DOM of an XML document, but you can get PHP to read HTML that is not well-formed too; awesome, right? Here is a quick example of how to get this working:

$sampleContent = '<html>
	<head>
	<title>Sample Content</title>
	</head>
	<body>
	<a href="mailto:test@testing.com">Email Me!</a>
	</body>
	</html>';

//Disable libxml errors and allow user to
//fetch error information as needed
libxml_use_internal_errors(true);

$dom = new DOMDocument();

$dom->loadHTML($sampleContent);

$xpath = new DOMXPath($dom);

//find the email
$result = $xpath->query('//a[contains(@href, "mailto:")]/@href');
$email = $result->item(0)->nodeValue;

//we need to strip out the mailto: portion
$email = preg_replace('/^mailto:/', '', $email);

echo $email;
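
The comment above mentions fetching error information as needed. A minimal sketch of that step follows, assuming deliberately sloppy markup (the `$html` string is made up for this example). Which messages libxml reports, if any, depends on the installed libxml version, so no particular output is promised.

```php
<?php
// Hypothetical, deliberately sloppy markup so libxml may have something to report
$html = '<html><body><p>Unclosed paragraph<div></span></body></html>';

// Buffer parse errors internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message
$errors = libxml_get_errors();
foreach ($errors as $error) {
	echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
}

// Empty the buffer so errors do not pile up across documents
libxml_clear_errors();
```

Clearing the buffer matters most when you parse many pages in one run, since the errors otherwise keep accumulating in memory.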

This is the first step to being able to read the data within your HTML document. One thing that made it easier for me to write queries and find the data I was looking for was installing XPath Checker for Firefox. Now if only there were a comparable extension for Chrome.
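
Since SimpleXMLElement came up at the start of this post, here is a rough sketch of running the same kind of query through it instead of DOMXPath. It assumes the DOM and SimpleXML extensions are both enabled; the `$emails` array and the `starts-with()` query are my own choices for illustration.

```php
<?php
$sampleContent = '<html>
	<body>
	<a href="mailto:test@testing.com">Email Me!</a>
	</body>
	</html>';

libxml_use_internal_errors(true);

// Let DOMDocument do the forgiving HTML parsing first
$dom = new DOMDocument();
$dom->loadHTML($sampleContent);

// Hand the parsed tree over to SimpleXML
$xml = simplexml_import_dom($dom);

// xpath() returns an array of SimpleXMLElement objects
$emails = array();
foreach ($xml->xpath('//a[starts-with(@href, "mailto:")]') as $link) {
	// Attributes are read with array syntax; drop the mailto: prefix
	$emails[] = substr((string) $link['href'], strlen('mailto:'));
}

print_r($emails);
```

Going through DOMDocument first is what makes this work on real-world HTML, since SimpleXML on its own expects well-formed XML.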

In the very distant future (like in the next couple of days), I will expand upon this article to detail, from start to finish, how to use cURL and XPath to scrape data from a website.