An Interview with the Author of HarvestMan



An Interview with Anand B Pillai

Author of the HarvestMan Webcrawler

Question: HarvestMan is a webcrawler written in Python. What made you select Python?

Back in October 2002, I was working for Dassault Systemes, a French software developer, at their Bangalore office.

My projects at work were done in a proprietary language built on top of C++ and Java, using component technologies such as COM/CORBA. The coding was quite mechanical and really boring. I started searching for a language I could learn in my free time that would also enhance my resume and hone my programming skills. I tried Perl for a month and gave up. It was then that I chanced upon the name Python in some mailing list or newsgroup. I visited the Python.org homepage, downloaded the latest version of the software and installed it on my machine. I tried the examples and the official Python tutorial the next day. By the weekend I was hooked on Python!

I started learning Python in earnest, spending most of my spare time at the office writing Python code that would help me learn the nitty-gritty of the language and come to grips with its standard library. I also started browsing the web for software related to Python. I downloaded some of it and gave it a try.

Question: Why choose to write a webcrawler?

By mid 2003, I was getting pretty confident in Python. I thought it was time to start a Python project of my own which could help sharpen my skills in the language further. I am a firm believer in the principle that the best way to learn a programming language is to actually do a full-fledged project in it. Books and tutorials can help, but nothing helps as much as that. If the project idea is your own, all the better.

At that time I was using many webcrawler programs to download files from websites, mostly GNU manuals, Python documentation and such. I was not happy with the proprietary programs I was using: they were badly designed, and since most of them were shareware, they also used to expire.

I had a friend who was also learning Python at that time. During a visit to his house, I mentioned to him that I was thinking of writing a web crawler program in Python all on my own. This was in June 2003. He thought it was a great idea. My logic was that a web crawler program, being pretty complex, would allow me to use the full power of Python and its standard library modules. I was not mistaken in this assessment.

My friend proved to be as good as his word. The next day I received a mail with an attachment containing a couple of Python modules that could run as a simple web crawler! He had developed the URL parsing module in just under a day, which was pretty amazing. Given a URL, this program could crawl it in a simple-minded way and save files to the disk. It was also able to recreate the directory structure of the website on the disk. We just called it 'Webcrawler' at that time. It consisted of only two modules: a 'crawler' module, which contained the web page parsing, crawling and downloading logic, and a 'urlparser' module, which parsed a URL and associated it with the directory structure on the disk.
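To illustrate what a 'urlparser' module of this kind does, here is a minimal sketch in present-day Python. It is not the original code: the function name url_to_local_path, the "downloads" base directory and the index.html default are all assumptions made for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_local_path(url, base_dir="downloads"):
    """Map a URL to a file path that mirrors the site's directory layout.

    Hypothetical helper; the name and defaults are illustrative only.
    """
    parts = urlparse(url)
    path = parts.path
    # Treat directory-style URLs as requests for an index page.
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result stays under base_dir.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return str(PurePosixPath(base_dir, parts.netloc, *segments))


print(url_to_local_path("http://www.python.org/doc/tut/tut.html"))
print(url_to_local_path("http://www.python.org/doc/"))
```

Each host gets its own top-level directory, so pages from different sites cannot overwrite one another.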

Question: How did you get started and what challenges did you overcome?

First I added multithreading to the crawler module. Then I added a third module which tied together the two existing modules (urlparser, crawler) and invoked their routines to run the program. I kept working on this project in my free time, and by July 2003 it was getting kind of usable. Since I wanted to take my efforts online, I put up the code on my personal website and also created a freshmeat project page for the program. I also re-christened it HarvestMan, the name taken from a kind of spider found in the northern U.S.A.

I got a number of feedback mails from people who used the program and offered constructive suggestions. Since the code was of alpha quality, there were a huge number of bugs, and the design was not very good either.

I worked on the design and improved it by adding a symmetric queue paradigm, in which two queues of data flow control two different families of threads inside HarvestMan in a synergistic manner. (For details, go to the HarvestMan FAQ.)

To cut a long story short, by mid 2004 HarvestMan was in pretty good shape, having seen around 9-10 releases. I had stuck to the open source principle of 'release early, release often' in order to improve the visibility of the project online as well as to fix bugs as and when they were found.
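The general idea of two queues feeding two families of threads can be sketched as follows. This is not HarvestMan's actual code: the split into 'fetcher' and 'parser' threads, the FAKE_SITE dictionary that stands in for real downloads, and the counter used for shutdown are all assumptions made for this example.

```python
import queue
import threading

# Hypothetical in-memory "website": URL -> links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}


def crawl(start_url, site=FAKE_SITE, n_fetchers=2, n_parsers=2):
    url_queue = queue.Queue()    # URLs waiting to be downloaded
    data_queue = queue.Queue()   # downloaded pages waiting to be parsed
    lock = threading.Lock()
    seen = {start_url}
    pending = 1                  # items queued or in flight in either stage
    finished = threading.Event()

    def fetcher():
        # First thread family: turns URLs into page data.
        while True:
            url = url_queue.get()
            if url is None:      # sentinel: shut down
                return
            links = site.get(url, [])   # stands in for an HTTP download
            data_queue.put((url, links))

    def parser():
        # Second thread family: turns page data into new URLs.
        nonlocal pending
        while True:
            page = data_queue.get()
            if page is None:     # sentinel: shut down
                return
            _url, links = page
            new_links = []
            with lock:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        new_links.append(link)
                # This page is done; its unseen links become pending work.
                pending += len(new_links) - 1
                if pending == 0:
                    finished.set()
            for link in new_links:
                url_queue.put(link)

    threads = [threading.Thread(target=fetcher) for _ in range(n_fetchers)]
    threads += [threading.Thread(target=parser) for _ in range(n_parsers)]
    for t in threads:
        t.start()
    url_queue.put(start_url)
    finished.wait()
    # Wake every worker with a sentinel so it can exit cleanly.
    for _ in range(n_fetchers):
        url_queue.put(None)
    for _ in range(n_parsers):
        data_queue.put(None)
    for t in threads:
        t.join()
    return seen


print(sorted(crawl("/")))
```

Each queue is the output of one thread family and the input of the other, so downloading and parsing proceed in parallel and neither side needs to know how the other works.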

Question: How has HarvestMan been received and where is it headed?

By mid 2004, HarvestMan had around 30 subscribers to the project online and as many as 1000 downloads. (I have not kept track of the downloads after that, so I don't have the current figure.) As many as 50 bug reports were filed at the HarvestMan bugtracker (which I developed myself using Zope Metapublisher), and these helped to improve the program.

By that time I had joined a large MNC based in Bangalore, in their group working on Grid computing technologies. I gave a presentation on HarvestMan at the company LUG (Linux Users' Group) meeting in May 2004. Around then, the HarvestMan website was also getting mentioned on a number of Linux/OSS websites. Prominent among them was the mention on the Linuxlinks.com website.

HarvestMan saw about 4 releases in 2004. I improved the performance of the program by fixing a number of critical bugs and by redesigning and rejigging things where it was fit. HarvestMan also became a lot more modular. The initial three modules had expanded to nearly 16 modules. I also redesigned the configuration file of HarvestMan, changing it from a simple-minded text file with name-value pairs to a complex XML file, whose schema I designed myself. HarvestMan at that time had as many as 50 configuration options. You could tweak the program to your heart's content, controlling the download rate, giving time/file limits, specifying depth of URLs, specifying URL filters, URL priorities, etcetera. The configuration options for HarvestMan are too many to list here. In fact, they could fill a whole article of their own.

Still, the project was not getting adopted by any open source groups, which I had set as an important goal for the project. I had made the project very modular with this idea in mind, since I believe that the best way to open source a project and make it successful is to make it very, very modular. That way, groups which want to extend the project can easily do so, since they need to concentrate their efforts only on a few modules, not on a monolithic 'big ball of mud'!
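The difference between the two configuration styles can be shown with a small sketch. The option names, XML elements and attributes below are invented for this example and do not reproduce the real HarvestMan schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical old-style config: one "name value" pair per line.
TEXT_CONFIG = """\
url http://www.python.org/doc/
depth 3
maxfiles 500
"""

# Hypothetical XML equivalent; element and attribute names are invented.
XML_CONFIG = """\
<crawler>
  <project url="http://www.python.org/doc/"/>
  <limits depth="3" maxfiles="500"/>
  <filters>
    <exclude pattern="*.zip"/>
    <exclude pattern="*.exe"/>
  </filters>
</crawler>
"""


def parse_text_config(text):
    """Read flat name-value pairs; every value stays a string."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.split(None, 1)
        options[name] = value
    return options


def parse_xml_config(text):
    """Read the nested XML form, including a repeated <exclude> option."""
    root = ET.fromstring(text)
    limits = root.find("limits")
    return {
        "url": root.find("project").get("url"),
        "depth": int(limits.get("depth")),
        "maxfiles": int(limits.get("maxfiles")),
        "exclude": [e.get("pattern") for e in root.iter("exclude")],
    }


print(parse_text_config(TEXT_CONFIG))
print(parse_xml_config(XML_CONFIG))
```

A flat file works well for a handful of scalar options, while the XML form can group related options and repeat an element such as a URL filter as many times as needed.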

Question: How did the European Union get involved?

This holy grail was achieved in Feb 2005, when I got a mail from Prof. Mikael Snaprud of Agder University College, Norway, explaining that they were in the initial stages of a large-scale web accessibility measurement and analysis project for all European websites, called E.I.A.O (European Internet Accessibility Observatory). For this, they required a crawler which could download pages from websites, on which they would then perform their accessibility measurements. However, since their project was open source and had a policy of using only open source components, they required an open source webcrawler. Moreover, they were planning to develop most of the code in Python, which made it logical to look for an open source webcrawler written in Python, and HarvestMan happened to be the only one which satisfied both requirements. On top of that, they required a program which was already quite advanced and which they could configure in many ways. Again, HarvestMan fit the bill perfectly.

They had conducted some initial tests on HarvestMan and prepared their own assessment of the program. They mailed it to me and asked my opinion on it. They also asked me if I could come to Norway to give a presentation on HarvestMan in a workshop on Web accessibility they were planning from April 15-20 2005.

I made my travel plans and reached Norway on April 14, 2005. I gave a presentation on HarvestMan at Agder University College in front of an audience that included some of the top minds in Europe working on aspects of the Web such as accessibility, meta-modelling and data mining. The EIAO project has participation from a number of top universities and technical institutes from all over Europe, including England, Italy, Germany, Sweden, Norway and Denmark.

I am right now an invited contributor to EIAO until the project finishes in 2007. Under the agreement, the EIAO project will use HarvestMan as the crawler component. Since HarvestMan is under the GPL, they will contribute back any changes they make to the program that they want to publish. I also happen to be the only contributor to the EIAO project from outside Europe.

I will work on improving the program and, if required, tailor it to EIAO requirements. We have also started work on a distributed version of HarvestMan called D-HarvestMan. D-HarvestMan is a joint project under the aegis of AUC. I am doing this project along with two Norwegians who work for the Norwegian telecom company Teleca. From E.I.A.O reports, I find that HarvestMan is being used to download as many as 100,000 web pages in a day from different websites. Apparently it has downloaded nearly 1 million web pages so far from around 100 websites. Not a small achievement for a program which started off with the much more modest goal of downloading manuals for offline browsing!

I also contribute to conference papers authored by the EIAO participants. We have presented one paper so far, at the Conference on Digital Inclusion and Open Source, which took place in Oslo from Oct 20-21, 2005. Another paper has been accepted for presentation at the World Summit on the Information Society conference, to take place in Tunisia from 13-15 Nov 2005.

Question: What have you learnt from working on HarvestMan?

My experiences with HarvestMan have helped cement my belief in open source, and they continue to motivate me in my work at Spikesource. The project has taken me to places I thought I would never go and brought me experiences I thought I would never have. It is an ideal example of what can happen when the power of the open source model is combined with the desire to explore and learn.

A project which started off as a wild idea in my friend's house in June 2003 has now become an international project which is being used by top technical institutes and academia in Europe and by a few universities in the U.S.A.

References

Anand B Pillai is a Staff Engineer at Spikesource India. He is an invited contributor to the European Internet Accessibility Observatory, a founding member of the OVC (The Open Voting Consortium), and the developer and maintainer of HarvestMan, http://harvestman.freezope.org



This page has been accessed 4,117 times.

This page was last modified 05:38, 9 November 2005.