<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>CodeUtopia - The blog of Jani Hartikainen &#187; Python</title>
	<atom:link href="http://codeutopia.net/blog/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://codeutopia.net/blog</link>
	<description>Software development with a focus on web-related technologies</description>
	<lastBuildDate>Wed, 08 Sep 2010 19:50:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Scraping HTML with Python</title>
		<link>http://codeutopia.net/blog/2008/12/14/scraping-html-with-python/</link>
		<comments>http://codeutopia.net/blog/2008/12/14/scraping-html-with-python/#comments</comments>
		<pubDate>Sun, 14 Dec 2008 14:00:13 +0000</pubDate>
		<dc:creator>Jani Hartikainen</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://codeutopia.net/blog/2008/12/14/scraping-html-with-python/</guid>
		<description><![CDATA[Have you ever had to write a script that scrapes data from an HTML page? Was the page horribly bad HTML too? If so, you probably know how annoying and time consuming it can be to write a script that reliably fetches data from such a mess. I was recently asked to write a script [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever had to write a script that scrapes data from an HTML page? Was the page horribly bad HTML, too? If so, you probably know how annoying and time-consuming it can be to write a script that reliably fetches data from such a mess.</p>
<p>I was recently asked to write a script that scrapes an ASP.NET page at work. It was a paginated list of people, and each person was linked to a page with more details about them. Naturally, the HTML code was also a horrible mess.</p>
<p>The usual first step in such cases is to check whether the XML functionality in your language of choice can make something workable out of the markup. Since that only works with well-formed XHTML, it was not an option here. The next option is regular expressions, but they are a huge pain to write and maintain for something like parsing specific data out of HTML.</p>
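<p>To illustrate why the XML route fails on sloppy markup, here is a minimal sketch using the standard library&#8217;s <i>xml.etree.ElementTree</i>. The sample markup and the helper name <i>parses_as_xml</i> are made up for this example, not taken from the page I was scraping.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet of sloppy HTML: the attribute value is unquoted and
# the <br> is never closed, both of which are illegal in XML.
sloppy_html = "<table><tr><td width=50>Name:<br></td></tr></table>"

# The same cell written as well-formed XHTML.
clean_xhtml = '<table><tr><td width="50">Name:<br /></td></tr></table>'


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(sloppy_html))
print(parses_as_xml(clean_xhtml))
```

<p>An XML parser stops at the first well-formedness error, so a single unquoted attribute is enough to make the whole page unusable this way.</p>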
<p>Luckily, there&#8217;s a better way to do it in Python, using a library called <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>. It&#8217;s definitely the best tool for this job I&#8217;ve seen.</p>
<p><span id="more-174"></span></p>
<h3>No more regular expressions</h3>
<p>I have used my fair share of regular expressions to get data out of HTML. Writing a regex to pull some data from HTML is often not a simple task, and sometimes you&#8217;ll need dozens of them to get all the data you want.</p>
<p>With BeautifulSoup, it&#8217;s really simple. You probably know which tag the data is in and what&#8217;s near it. You can look up elements by their attributes or parents, find text nodes based on what they contain, and so on.</p>
<p>For example, if the page has a person&#8217;s name and a phone number listed in a table:</p>

<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;!-- this is somewhere in some malformed HTML --&gt;
&lt;td&gt;Name:&lt;/td&gt;&lt;td&gt;Dirty Harry&lt;/td&gt;
&lt;td&gt;Phone:&lt;/td&gt;&lt;td&gt;123 345 6000&lt;/td&gt;</pre></div></div>

<p>Then you can get the data out like this:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> BeautifulSoup <span style="color: #ff7700;font-weight:bold;">import</span> BeautifulSoup
&nbsp;
<span style="color: #808080; font-style: italic;"># Let's assume the_html contains the html code for the page</span>
s = BeautifulSoup<span style="color: black;">&#40;</span>the_html<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Look up element which contains &quot;Name:&quot;, and get the next node's string contents</span>
name = s.<span style="color: black;">find</span><span style="color: black;">&#40;</span>text=<span style="color: #483d8b;">&quot;Name:&quot;</span><span style="color: black;">&#41;</span>.<span style="color: black;">next</span>.<span style="color: #dc143c;">string</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Do same for &quot;Phone:&quot;</span>
phone = s.<span style="color: black;">find</span><span style="color: black;">&#40;</span>text=<span style="color: #483d8b;">&quot;Phone:&quot;</span><span style="color: black;">&#41;</span>.<span style="color: black;">next</span>.<span style="color: #dc143c;">string</span></pre></div></div>

<p>Doing the above with regular expressions could have been tricky, for example if there were a varying amount of whitespace between the td elements, or if the cells weren&#8217;t laid out like that at all.</p>
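<p>The same &#8220;find the label, take the next cell&#8221; idea can also be sketched without BeautifulSoup, using <i>html.parser</i> from the Python 3 standard library. This is only an illustration of why a parser copes with stray whitespace better than a regex does. The names <i>CellCollector</i> and <i>value_after</i> are made up for this example, and the approach is far less forgiving of broken markup than BeautifulSoup.</p>

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every td cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text between cells (such as whitespace) is ignored.
        if self._in_cell:
            self.cells[-1] += data


def value_after(label, html):
    """Return the text of the cell following the cell that contains label."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    for current, following in zip(cells, cells[1:]):
        if current == label:
            return following
    return None


# Same data as above, but with irregular whitespace between the cells.
the_html = """
<td>Name:</td>   <td>Dirty Harry</td>
<td>Phone:</td>
    <td>123 345 6000</td>
"""

name = value_after("Name:", the_html)
phone = value_after("Phone:", the_html)
print(name, phone)
```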
<p>That was just one small example. BeautifulSoup has much more functionality to offer, so you should definitely check out <a href="http://www.crummy.com/software/BeautifulSoup/">the BeautifulSoup homepage</a> for downloads and documentation!</p>
]]></content:encoded>
			<wfw:commentRss>http://codeutopia.net/blog/2008/12/14/scraping-html-with-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Django = Awesome</title>
		<link>http://codeutopia.net/blog/2008/05/16/django-awesome/</link>
		<comments>http://codeutopia.net/blog/2008/05/16/django-awesome/#comments</comments>
		<pubDate>Fri, 16 May 2008 10:03:55 +0000</pubDate>
		<dc:creator>Jani Hartikainen</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Zend Framework]]></category>

		<guid isPermaLink="false">http://codeutopia.net/blog/2008/05/16/django-awesome/</guid>
		<description><![CDATA[So in the lack of anything &#8220;useful&#8221; to post, and in an attempt to at least post something interesting, I shall dedicate this post to talking about what makes Django such an awesome framework! I&#8217;m going to compare it to Zend Framework, as it&#8217;s the framework I&#8217;m most familiar with. Reduce redundancy Django has many [...]]]></description>
			<content:encoded><![CDATA[<p>So, lacking anything &#8220;useful&#8221; to post, and in an attempt to at least post <i>something</i> interesting, I shall dedicate this post to talking about what makes <a href="http://www.djangoproject.com">Django</a> such an awesome framework!</p>
<p>I&#8217;m going to compare it to Zend Framework, as it&#8217;s the framework I&#8217;m most familiar with.</p>
<p><span id="more-99"></span></p>
<h3>Reduce redundancy</h3>
<p>Django has many features to reduce the amount of redundant code you need to write. </p>
<ul>
<li>Generic Views</li>
<li>Forms</li>
<li>Modelforms</li>
<li>Automatic admin panel</li>
</ul>
<h3>Generic views</h3>
<p>This was initially slightly confusing to me, but then I figured it out. Django&#8217;s &#8220;generic views&#8221; are basically like functions &#8211; they perform some task based on the parameters you pass to them.</p>
<p>Instead of writing a controller action for every request, you can use a generic view for common tasks, such as listing something in your database.</p>
<p>Combined with Django&#8217;s way of defining URLs separately for all actions, you can implement some parts of your app without writing anything other than a URL pattern.</p>
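<p>To make the idea concrete, here is a framework-free sketch in plain Python. None of these names (<i>object_list</i>, <i>urlpatterns</i> as a dict, <i>dispatch</i>) are Django&#8217;s actual API. They are stand-ins that only show how a parameterised view plus a URL table can replace hand-written actions.</p>

```python
# Framework-free sketch of the generic-view idea; these are NOT Django APIs.


def object_list(queryset, template_name):
    """Build a view function that lists every item in queryset."""

    def view(request):
        # A real framework would render the template; here we just return
        # what would be handed to it.
        return {
            "template": template_name,
            "context": {"object_list": list(queryset)},
        }

    return view


# Hypothetical in-memory data standing in for database queries.
PEOPLE = ["Dirty Harry", "Jane Doe"]
POSTS = ["Scraping HTML with Python", "Django = Awesome"]

# The URL table is the only per-page code: each path reuses the same
# generic view with different parameters.
urlpatterns = {
    "/people/": object_list(PEOPLE, "people_list.html"),
    "/posts/": object_list(POSTS, "post_list.html"),
}


def dispatch(path, request=None):
    """Look up the view registered for path and call it."""
    return urlpatterns[path](request)


response = dispatch("/people/")
print(response)
```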
<h3>Forms</h3>
<p>Django&#8217;s newforms library makes it really easy to build forms. Just define a form class by declaring its fields and you can have a nice form <i>with validation</i> ready in no time.</p>
<p>Compared to <abbr title="Zend Framework">ZF</abbr>, it&#8217;s a lot less typing. In ZF, the syntax for defining the form is only slightly longer, but the main difference is that you&#8217;ll need to define the validation by hand. Of course, if you don&#8217;t like Django&#8217;s validation, it&#8217;s easy to define custom validation as well&#8230; it just has <i>sensible defaults</i>, which in my opinion is a Good Thing &trade;</p>
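<p>The &#8220;sensible defaults&#8221; idea can be sketched in plain Python as follows. This is a simplified stand-in written for this post, not the newforms implementation: the classes <i>Field</i>, <i>IntegerField</i>, <i>Form</i> and <i>ContactForm</i> are hypothetical, and only mimic the style of declaring fields and getting validation for free.</p>

```python
class Field:
    """Base field: required by default, subclasses convert the raw value."""

    def __init__(self, required=True):
        self.required = required

    def clean(self, raw):
        if raw in (None, ""):
            if self.required:
                raise ValueError("This field is required.")
            return None
        return self.convert(raw)

    def convert(self, raw):
        return raw


class IntegerField(Field):
    def convert(self, raw):
        try:
            return int(raw)
        except ValueError:
            raise ValueError("Enter a whole number.")


class Form:
    def __init__(self, data):
        self.data = data
        self.errors = {}
        self.cleaned_data = {}

    def is_valid(self):
        """Validate every field declared on the form class."""
        self.errors, self.cleaned_data = {}, {}
        for name, field in vars(type(self)).items():
            if not isinstance(field, Field):
                continue
            try:
                self.cleaned_data[name] = field.clean(self.data.get(name))
            except ValueError as exc:
                self.errors[name] = str(exc)
        return not self.errors


class ContactForm(Form):
    # Declaring the fields is all the validation setup there is.
    name = Field()
    age = IntegerField(required=False)


good = ContactForm({"name": "Dirty Harry", "age": "42"})
bad = ContactForm({"name": "", "age": "forty"})
print(good.is_valid(), good.cleaned_data)
print(bad.is_valid(), bad.errors)
```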
<h3>Modelforms</h3>
<p>Django&#8217;s modelforms let you create forms based on your models by just defining a form class and adding an inner Meta class that tells which model class the form is for. You won&#8217;t need to define any fields or anything, and you&#8217;ll get a ready-to-use form for creating and editing model data! Simply add a view that will display the form and call form.save() when done.</p>
<p>You often need to spend time building forms for editing your database records, and with modelforms it takes almost no code. Even relationships between objects work &#8211; you get a select box for one-to-many relations, and so on.</p>
<p>Of course, if needed, modelforms can be extended to perform their job a bit differently too. That does require digging through Django&#8217;s code a bit, though, as I wasn&#8217;t able to find any examples. It&#8217;s definitely doable and not too difficult if you know what you&#8217;re doing.</p>
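<p>As a rough illustration of the mechanism, the sketch below derives a form&#8217;s fields from a model named in an inner Meta class. It uses a standard-library dataclass in place of a real Django model, and the <i>ModelForm</i> base class here is a hypothetical toy written for this post, not Django&#8217;s implementation.</p>

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    """Hypothetical model; a real Django model would be defined differently."""

    name: str
    age: int


class ModelForm:
    """Toy base class: takes its fields from the model named in Meta."""

    def __init__(self, data):
        self.data = data
        self.instance = None

    def field_names(self):
        # No fields are declared on the form; they come from the model.
        return [f.name for f in fields(self.Meta.model)]

    def save(self):
        """Convert submitted strings to each field's type and build the model."""
        values = {
            f.name: f.type(self.data[f.name]) for f in fields(self.Meta.model)
        }
        self.instance = self.Meta.model(**values)
        return self.instance


class PersonForm(ModelForm):
    class Meta:
        model = Person


form = PersonForm({"name": "Dirty Harry", "age": "42"})
person = form.save()
print(form.field_names(), person)
```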
<h3>Automatic admin panel</h3>
<p>This is one very, very useful feature. Many times I&#8217;ve written similar admin panel code for different projects &#8211; that stuff usually isn&#8217;t very reusable. Django takes all that away and gives you a great automagically generated admin panel, which you can even customize to a good degree to make it better still. Need to create a database-driven website? This alone should be reason enough to at least give Django a try &#8211; not to mention all the other things it has.</p>
<h3>Unicode support</h3>
<p>Python, unlike PHP, has good Unicode support out of the box. There&#8217;s no weird mbstring stuff or the like, and Django&#8217;s svn release is fully Unicode-aware. This will make your life a whole lot easier if you need to work with multiple languages.</p>
<h3>PS: Regarding Zend Framework</h3>
<p>While Zend Framework has a different approach to things &#8211; it is meant to be loosely coupled, with most components usable without the others, whereas Django has parts that are much more tightly coupled (like modelforms with the ORM) &#8211; it could still learn a thing or two from Django.</p>
]]></content:encoded>
			<wfw:commentRss>http://codeutopia.net/blog/2008/05/16/django-awesome/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>
