How to Write a Simple Recursive Web and Image Crawler in Perl

Aim
The aim of this tutorial is to demonstrate a simple yet effective Perl-based web crawler. In this example we will open a socket to a starting website, grab the links on the page and then recursively trawl those links for more links. This builds on the previous Perl tutorial, Perl Socket and File Example, and is a great starting point for building your very own customized Perl web crawler. I am not using the Mechanize package because the aim is to keep everything light, tight and hard core. Leave any questions or problems as comments and I will endeavour to answer them.
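If you have not read the socket tutorial, the heart of it is only a few lines: open a TCP connection to port 80 and write a plain HTTP GET request down the socket. A minimal sketch of that idea (example.com is just a placeholder host):

use IO::Socket;

my $sock = IO::Socket::INET->new(PeerAddr => 'example.com',
                                 PeerPort => 80,
                                 Proto    => 'tcp') or die "connect failed: $!";

print $sock "GET / HTTP/1.0\r\nHost: example.com\r\nConnection: close\r\n\r\n";

my $html = do { local $/; <$sock> };   # slurp the whole response
print length($html), " bytes fetched\n";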

Assumptions
This article assumes that you have a compatible Perl installation and a reasonable understanding of regular expressions. Previous Perl experience is not required.
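If your regular expressions are rusty, note that the whole crawler hinges on two patterns: one that pulls src attributes out of img tags and one that pulls href attributes out of anchors. Here is a simplified version of the image pattern run against some made-up HTML:

my $html = '<p><img src="logo.png"/> some text <IMG src="photos/cat.jpg"></p>';

# /i makes the match case-insensitive, /g finds every occurrence in turn
while ($html =~ m/<img[^>]*\ssrc="(.+?)"/ig) {
    print "found image: $1\n";     # prints logo.png, then photos/cat.jpg
}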

Versions used in this example
Software/Component        Image
Active Perl 5             ActivePerl-5.10.1.1006-MSWin32-x86-291086.msi
Microsoft Windows XP      N/A


Quick Overview
The aim of this example is to trawl through a website and its links, find all the images, and write the image references to a file. You can then simply view the HTML file and see all the images without having to navigate through the site. To keep life simple the program is recursive, so we don't need a database to store the URLs we have already seen.
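The bookkeeping that replaces a database is nothing more than two in-memory hashes: one records every URL already visited, the other every image already written, so nothing is fetched or saved twice. Stripped down to the bare idea, it works something like this (a hypothetical helper, not the actual subroutine from the listing):

my %urlhash;          # pages we have already crawled
my $depth_limit = 3;  # example depth

sub visit {
    my ($url, $depth) = @_;
    return if exists $urlhash{$url};   # already seen - don't fetch it again
    $urlhash{$url} = 1;

    # ... fetch $url here, then for every link found on the page:
    # visit($link, $depth + 1) unless $depth + 1 > $depth_limit;
}

visit("codediaries.blogspot.com", 0);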

The depth of the drill specifies how deep the crawler will go. Be careful: if a site averages 10 links per page and you drill 5 levels deep, you can end up requesting on the order of 10^5 = 100,000 pages and using up your bandwidth. If you give the script a regex, it will ignore any page whose HTML body does not match it. As you can see, I'm not actually downloading any images; there is no need to. All you need are the HTML IMG tags, and your browser will display the images for you.

For example:

pwc.pl codediaries.blogspot.com -d 3 -s "(java|C\+\+)" -p 50

This will start at codediaries.blogspot.com, follow links 3 levels deep, and save references to all the images it finds in the images.html file (50 pics per page), but only images from pages whose HTML contains either "java" or "C++".
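Under the hood the -s filter is nothing more than a pattern match against the raw HTML of each downloaded page: if the page does not match, its images are skipped but its links are still followed. Roughly, the idea looks like this (the $data value here is made-up stand-in content, not a real download):

my $search_terms = "(java|C\\+\\+)";
my $data = "<html><body>A short post about java sockets</body></html>";

if ($data =~ m/$search_terms/i) {
    print "page matches - its images will be saved\n";
}
else {
    print "page skipped - links still followed, images ignored\n";
}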

You can download the file or copy it from the listing below.


Example
# Righteous Ninja            #
# scriptdiaries.blogspot.com #

use IO::Socket;

my %imagehash=();   # image URLs already written to the output file
my %urlhash=();     # page URLs already visited

# defaults - the -d, -p and -s command line switches override the
# depth, pics per page and search regex
my $depth_limit=0;
my $pics_per_page=50;
my $output_file="images.html";
my $search_terms=" ";

#recursive subroutine - careful of stack overflow.
sub recursiveget{

    my $hostname;
    my $port;
    my $page="/";
    my $current_depth= $_[1];
    my $baseurl=$_[0];

    $current_depth++;

    $_[0]=~s/http:\/\///;       # strip the scheme

    $urlhash{"$_[0]"} = 1;      # mark this URL as visited

    # pull the port, hostname and page path out of the URL
    if($_[0] =~ m/:(\d+)/ ){
        $port = $1;
    }
    if($_[0] =~ m/(.+?)(:|\/)/ ){
        $hostname = $1;
    }
    else{
        $hostname = $_[0];
    }
    if($_[0] =~ m/(\/.+)/ ){
        $page = $1;
    }

    $port=($port)?$port:80;

    my $sock = IO::Socket::INET->new(PeerAddr => $hostname,
                                     PeerPort => $port,
                                     Proto    => 'tcp');

    if(!$sock){
        print "$baseurl\t\tFailed\n";
        return;
    }

    eval{ #catch any processing errors and ignore them

    # send a plain HTTP/1.0 request; "Connection: close" makes the server
    # close the socket, so the read loop below stops at the end of the response
    print $sock "GET $page HTTP/1.0\r\n";
    print $sock "Host: $hostname\r\n";
    print $sock "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n";
    print $sock "User-Agent: Mozilla/5\r\n";
    print $sock "Connection: close\r\n";
    print $sock "\r\n";

    # read the whole response (headers and body) into $data
    my $data="";
    while (<$sock>) {
        if($_=~ m/validate_response/i) {
            last;
        }
        else{
            $data.=$_;
        }
    }
    #check to see if it's a redirect - and if it is go to the new site.
    if($data=~ m/HTTP\/1\.\d +302/i && $data=~ m/Location: (.*)\r\n/){
        print "$baseurl\t\tRedirect\n";
        &recursiveget("".$1, $current_depth);
    }
    elsif($data!~ m/HTTP\/1\.\d +200/i){
        print "$baseurl\t\tHTTP Error\n";
        return;
    }
    #grab the images
    else{
        #check if the page matches the search terms - if not, skip its images
        if($data !~ m/$search_terms/i){
            #ignore
        }
        else{
            # strip $page back to its directory so relative image paths resolve
            $page =~ m/(.*)\//;
            $page=$1;
            while($data =~ m/\<.*?img.*?src="(.+?)".*?\>/g){
                $temp=$1;
                if($temp =~ m/http:\/\//i){
                    $picurl=$temp;      # already an absolute URL
                }
                else{
                    $picurl= "http://$hostname:$port$page/$temp";
                }
                if( not exists $imagehash{$picurl} ){
                    $imagehash{$picurl}=1;
                    writetofile($picurl, $baseurl);
                }
            }
        }
        #follow every link on the page, skipping mail/irc/anchor/css references
        while($data=~ m/href=\"(.*?)\"/ig ){
            if($1 !~ m/(mailto:|irc:|^#|\.css)/i){
                $temp=$1;
                if($temp =~ m/http:\/\//i){
                    $newurl=$temp;
                }
                elsif($temp =~ m/^\//){
                    $newurl=$hostname.":"."$port$temp";
                }
                else{
                    $newurl=$hostname.":"."$port$page/$temp";
                }
                if ($current_depth <= $depth_limit){
                    if ( ! exists $urlhash{ $newurl } ){
                        &recursiveget($newurl,$current_depth);
                    }
                }
            }
        }
        print "$baseurl\t\tOK\n";
    }
    };#end eval - ignore any errors
    if($@){
        print "$@\n";
        #ignore
    }
}

#called from recursiveget to write out the image references
my $counter=1;
my $fileseq=2;
sub writetofile{
    # print, not printf - image URLs may contain % characters
    print IMAGEFILE "<a href=\"$_[1]\"><img src=\"$_[0]\"/></a>";
    # after every $pics_per_page images, link to and start the next page
    if(++$counter % $pics_per_page == 0){
        print IMAGEFILE "<p><a href=\"$fileseq$output_file\">Next Page</a></p>";
        print IMAGEFILE "</body></html>";
        close (IMAGEFILE);

        open (IMAGEFILE, ">$fileseq$output_file");
        $fileseq++;
        print IMAGEFILE "<html><body>";
    }
}

#show usage information
sub showusage{
    print "This application will crawl for images and put the references into a file.\n\n";
    print "usage:   PWC.PL site [-d depth] [-p pics/page] [-s search regex]\n";
    print "Example: PWC.PL codediaries.blogspot.com -d 2 -p 100 -s \"(ninja|barbie)\" \n\n";
    exit;
}


#main - parse the command line switches
&showusage unless $ARGV[0];
for ($i=1; $i <= $#ARGV; $i++){

    if( $ARGV[$i] eq "-d"){
        $depth_limit=$ARGV[++$i] if defined $ARGV[$i+1];
    }
    elsif( $ARGV[$i] eq "-p"){
        $pics_per_page=$ARGV[++$i] if defined $ARGV[$i+1];
    }
    elsif( $ARGV[$i] eq "-s"){
        $search_terms=$ARGV[++$i] if defined $ARGV[$i+1];
    }
    else{
        &showusage;
    }
}

open (IMAGEFILE, ">$output_file");
print IMAGEFILE "<html><body>";

#call the recursive routine, starting at depth 0
&recursiveget($ARGV[0],0);

print IMAGEFILE "</body></html>";
close (IMAGEFILE);

print "Finished\n";

One improvement that should probably be made is to limit the crawl to the starting domain, so the crawler does not wander all over the web; a sketch of one way to do that follows.
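This is a hedged sketch built around a hypothetical same_domain helper, with the starting host hard-coded for illustration: record the starting hostname and skip any absolute link whose host does not end with it.

# $start_host would normally come from $ARGV[0]
my $start_host = "codediaries.blogspot.com";

sub same_domain {
    my ($url) = @_;
    $url =~ s/^http:\/\///i;                     # drop the scheme, as the script does
    my ($host) = $url =~ m/^(.+?)(?::|\/|$)/;    # everything before the port or path
    return $host =~ m/\Q$start_host\E$/i;        # same host or one of its subdomains
}

# inside the href loop in recursiveget, before recursing:
# next unless same_domain($newurl);

print same_domain("http://codediaries.blogspot.com/2009/10/post.html") ? "crawl\n" : "skip\n";
print same_domain("http://example.com/somewhere") ? "crawl\n" : "skip\n";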
