So Long 2010

Well, 2010 was definitely an enlightening and quick year. I made some great strides in my programming knowledge, really stepped it up a notch, I guess… This next year is going to be another year of refining my skills, learning new ones and, most likely, spending some time looking for my next adventure (i.e. a job). Cindy and I set a goal of one year to move up to Washington, but that may have to be put on hold for a while due to our current living situation. Word to the wise: do not buy a place if you are not 200% (yes, 200%) sure you will want to live there for 3 to 5 years. So I will probably be looking for a new job in SoCal for the time being, unless something drastic happens at my current job (doubting it).

Anyway, my main focus for the next year is to delve further into C#, work on some proof-of-concept projects and finish reading a reference book or two along the way. It is an amazingly easy language to learn, at least so far; I hope it stays that way when I start doing more advanced development. My extra-curricular goal is to build a real-world application that has the potential to be used, abused and ultimately sold. We will see how that turns out when I am sitting here a year from now.

Happy New Year!

Scraping Sites Using cURL & XPath

As promised, this article will explain how I have gone about scraping data from websites, for proof of concept of course! Obviously this is considered a grey area, especially if you are taking data that is not yours and using it for commercial purposes; that may even be considered illegal! So for testing purposes, I am going to write code that crawls my company’s support forum to get some useful data, and you can learn from it and adapt it for your own purposes.

To start, the below code will initialize your cURL request and get the HTML from the specified URL:

//the @ suppresses the warning on failure; curl_init() returns false if it cannot initialize
$ch = @curl_init();

if($ch){

	//the url to crawl
	$url = 'http://forum.laflabs.com/';

	//why not fake a user agent?
	$user_agent =
		'Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9';

	//lets fake a referer while we are at it!
	$referer =
		'http://www.google.com';

	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 15);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
	curl_setopt($ch, CURLOPT_REFERER, $referer);

	$content = curl_exec($ch);

	//transfer info for the request (http_code, content_type, etc.)
	$info = curl_getinfo($ch);

	curl_close($ch);

	if($info['http_code'] == 200){

		//traverse the dom

	}

}

Now that is pretty easy so far, right? I think it pretty much speaks for itself: set a couple of variables (really, only the URL is necessary), make the request and check the response. If the requested URL returns an HTTP code of 200 (OK), that means content was found and we can traverse the DOM using XPath.
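
To give you an idea of where this is headed, here is a minimal sketch of what could live in that "traverse the dom" spot, feeding the fetched $content into DOMDocument and DOMXPath (the next article covers this approach in detail). The XPath expression here is just a placeholder that lists every link on the page; you would swap in whatever matches the markup of the site you are crawling:

//the forum HTML is unlikely to be well-formed, so silence libxml warnings
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($content);

$xpath = new DOMXPath($dom);

//placeholder query: grab every link on the page
$links = $xpath->query('//a[@href]');

foreach($links as $link){
	echo trim($link->nodeValue) . "\n";
}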

Reading HTML with PHP and XPath

Learning new things is fun, especially when you get paid to learn said new thing! A couple of months ago I was tasked with scraping data from a website, as a proof of concept of course. I thought it was going to take me forever: trying to use regular expressions to find certain strings in the data, trial and error, you know what I mean. Well, I found out that through PHP’s SimpleXMLElement (and the DOM extension) you can use XPath to query the document you have loaded.

Normally, XPath is used for reading nodes within the DOM of an XML document, but you can trick PHP into reading non-well-formed HTML; awesome, right? Here is a quick example of how to get this working:

$sampleContent = '<html>
	<head>
	<title>Sample Content</title>
	</head>
	<body>
	<a href="mailto:test@testing.com">Email Me!</a>
	</body>
	</html>';

//Disable libxml errors and allow user to
//fetch error information as needed
libxml_use_internal_errors(true);

$dom = new DOMDocument();

$dom->loadHTML($sampleContent);

$xpath = new DOMXPath($dom);

//find the email
$result = $xpath->query('//a[contains(@href, "mailto:")]/@href');
$email = $result->item(0)->nodeValue;

//we need to strip out the mailto: portion
$email = preg_replace('/mailto:/', '', $email);

echo $email;
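
As an aside, since I mentioned SimpleXMLElement above, here is a rough sketch of the same query done through simplexml_import_dom() and SimpleXMLElement::xpath(), continuing from the $dom created in the example; treat it as an alternative flavor rather than anything definitive:

//import the DOM we already parsed into a SimpleXMLElement
$xml = simplexml_import_dom($dom);

//xpath() returns an array of matching SimpleXMLElement objects
$matches = $xml->xpath('//a[contains(@href, "mailto:")]/@href');

if(!empty($matches)){
	//cast the attribute node to a string and strip the mailto: prefix
	echo str_replace('mailto:', '', (string) $matches[0]);
}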

This is the first step to being able to read the data within your HTML document. One thing I found that makes it easier to build and test queries is installing XPath Checker for Firefox. Now if only there were a comparable extension for Chrome.

In the very distant future (like in the next couple of days), I will expand upon this article to detail from start to finish how to utilize cURL and XPath to successfully scrape data from a website.

Mootools Nested Accordion

I noticed some traffic coming in from old links on a topic I wrote quite a long time ago about nesting Mootools Accordions. I am in no mood to resurrect a post from that long ago, especially since I know my method was not optimal and I no longer use or support Mootools. It’s all about jQuery, baby! I will, however, point you to an article that was inspired by my work and greatly improved upon it.

You can read about the nesting of Mootools Accordions at Bogdan’s blog, Medianotions. Remember, that article is a couple of years old, so things may have changed, but it is a good read nonetheless.

Adventures in VPS Providers

I thought I would take a few minutes and write about my disastrous trip through VPS provider territory. About four years ago, around the time I started coming up with more ideas for sites as well as acquiring hosting accounts, I realized that the slow reseller hosting account I had was no longer going to cut it. Since I had my business, I decided I would pick up a couple of servers off eBay and host them on my business line located at my grandpa’s house; smart move (not really). Things went well for a while, using some open source control panel on the boxes, but as it turns out, home-office internet just can’t hang with a data center in terms of quality and speed.

After a year of struggling with modem issues, router lockups and a couple of DoS attacks, I decided to look for alternatives. The issue was that my business was not making enough money to justify a $150-250/month dedicated server. Well, behold the almighty VPS: cheaper, less management overhead (compared to hosting servers myself) and generally easier to work with.

The first VPS provider I hooked up with was VPSLink; they had prices I could fit into my budget, especially if I paid for the year upfront. So I took a stab in the dark and went for it. Throughout the year I had a couple of small issues, the kind you would expect from time to time; no complaints. Everything was peachy until VPSLink slyly announced they had been acquired; I mean, they were cryptic about it; I didn’t even get the email until a week after it initially went out. There was supposed to be some sort of pre-planned and announced transition. Well, that didn’t work out too well for quite a few people, me included. Once again, I never got an email until the transition was finished, complete with my new username and password. It’s lucky I didn’t have any mission-critical sites at that point.

I did quite a bit of research on the acquiring party and noticed a trend: VPS provider acquisitions followed by a loss of customers due to poor support and a lack of professionalism. To clarify, I read horror stories about them. I decided it was time to move somewhere else, and I still wanted a bargain.

The second provider I found was IntoVPS; their prices were a tad higher, but I decided you do get what you pay for. I signed up, this time on a month-to-month basis, just in case my luck followed me. To shorten the story a bit, I used IntoVPS for about 3 or 4 months. The server performance was sub-par, as if someone else on the node was just screwing it up for the rest of us. After trying to battle with their support about constant load issues (the 1-minute load average was at 12.0+ every night; I didn’t know that was possible), I decided to move somewhere else.

After those trials and tribulations, I ended up at VPS.NET (Affiliate Link). I now pay about double what I was paying with the above two providers, but the quality of service is, at a minimum, 200% better. I think I have had one small issue in the 8 or so months I have been there, and small issues are expected from time to time; their support is quick and helpful. Their interface and scalability impressed me from the beginning; I think it is pretty cool that you can upgrade your VPS in place without having to contact sales or support.

Oh, and apparently you do get what you pay for.

Welcome back!

Well, I spent like 240 hours (quite an exaggeration) typing up my About Me section, so I am pretty well tapped out on the writing front. However, the 404 on the main page was bothering me, you know, since there were no posts… Anyway, I am working hard on teaching myself C# to make myself a bit more marketable, robust, etc. I found out PHP can only get you so far in this business, and with my desire to one day work in some capacity for Microsoft, I figured something like C# would be a safe bet.

So far, I am pleased with C# and my transition is working out pretty well. I mean, it is quite a bit different from PHP, but since I understand how code is supposed to work syntactically, it is a little less painful to learn. The biggest issue is figuring out what resource to use to do certain things, but the internet is pretty damn helpful, so things seem to be clicking fairly well.

I have the next week off from work, and I am sure shortly after that I will be laid off. This is, of course, calculated based on the current rate of asshattery going on in this wonderful company. The good news, though, is that I have a C# resource book coming in the mail by the end of the week, so I am sure devoting a few hours here and there is going to help this learning process.

As far as a job goes, I have a couple of projects pending release, and how they are received will determine a portion of my “next steps”. The wife and I have been looking at Redmond, WA as our next place of residence, but that obviously has some contingencies, i.e. renting or selling our condo here and finding jobs (Hey Microsoft, I love you, and so does my wife!). We already found the apartment we want to rent during our awesome road trip to visit Washington. We shall see.