As promised, this article explains how I have gone about scraping data from websites, as a proof of concept of course! Scraping is a grey area, especially if you take data that is not yours and use it for commercial purposes; that may even be illegal! So for testing purposes, I am going to write code that crawls my company’s support forum to get some useful data, and you can learn from it and adapt it for your own purposes.
To start, the code below initializes a cURL request and gets the HTML from the specified URL:
$ch = curl_init();
if ($ch !== false) {
    // The URL to crawl
    $url = 'http://forum.laflabs.com/';
    // Why not fake a user agent?
    $user_agent = 'Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9';
    // Let's fake a referer while we are at it!
    $referer = 'http://www.google.com';

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);   // give up connecting after 15 seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_REFERER, $referer);

    $content = curl_exec($ch);    // the HTML as a string, or false on failure
    $headers = curl_getinfo($ch); // transfer details, including 'http_code'
    curl_close($ch);

    if ($content !== false && $headers['http_code'] == 200) {
        // traverse the DOM
    }
}
Pretty easy so far, right? The code mostly speaks for itself: set a couple of variables (only the URL is strictly necessary) and execute the request. If the requested URL returns an HTTP status code of 200 (OK), the request succeeded and we can traverse the DOM using XPath.
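To give a rough idea of what the XPath step might look like, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, the "topic" class name, and the extract_topic_links() helper are all assumptions made up for illustration; the real forum's HTML will differ, so the XPath expression would need to be adapted to it.

```php
<?php
// Hypothetical stand-in for the $content returned by curl_exec().
// The markup and the "topic" class are assumptions, not the forum's real HTML.
$content = '<html><body>
<a class="topic" href="/topic/1">First topic</a>
<a class="topic" href="/topic/2">Second topic</a>
<a href="/about">About</a>
</body></html>';

// Hypothetical helper: returns the text and href of every <a class="topic">.
function extract_topic_links($html)
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml parse warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();

    // Select only anchors whose class attribute is exactly "topic".
    foreach ($xpath->query('//a[@class="topic"]') as $node) {
        $links[] = array(
            'title' => trim($node->textContent),
            'href'  => $node->getAttribute('href'),
        );
    }

    return $links;
}

$links = extract_topic_links($content);
foreach ($links as $link) {
    echo $link['title'], ' => ', $link['href'], PHP_EOL;
}
```

In the request code above, a call like extract_topic_links($content) would go where the "traverse the DOM" comment sits, once the XPath expression matches the page you are actually crawling.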