Multithreaded web page fetching in Perl
Perl is very good at fetching web pages, so I tried crawling pages with multiple threads.
#!/usr/bin/perl
use threads;
use threads::shared;
use LWP;
use LWP::Simple;
use LWP::UserAgent;
use LWP::ConnCache;
use HTML::TreeBuilder;
my @urls:shared;
my %uniq_url:shared;
# The start URL is the first command-line argument.
my $starturl=$ARGV[0] or die "usage: $0 start_url\n";
push @urls,$starturl;
$uniq_url{$starturl}=1;
my $browser=LWP::UserAgent->new();
$browser->timeout(10);
$browser->protocols_allowed(['http','gopher']);
$browser->conn_cache(LWP::ConnCache->new());
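# Work through the queue: a single URL is fetched inline,
# otherwise up to 20 URLs are fetched in parallel per batch.
my $num=1;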
while(scalar @urls >0)
{
if(scalar @urls == 1)
{
my $url=shift @urls;
&parse($url,'old');
}
if(scalar @urls >= 2)
{
if(scalar @urls >= 20)
{
$num=20;
}
else{
$num= scalar @urls;
}
my @thread;
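# Start one thread per URL in this batch.
for(my $j=0;$j<$num;$j++)
{
my $url=shift @urls;
$thread[$j]=threads->create(\&parse,$url,"thread$j");
}
# Wait for the whole batch before starting the next one.
for(my $j=0;$j<$num;$j++)
{
$thread[$j]->join();
}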
}
}
# Fetch one URL, queue any new links found in it, and print the page title.
sub parse
{
my $url=shift;
my $type=shift;
my $response=$browser->get($url);
unless($response->is_success)
{
print "can't access $url: ",$response->status_line,"\n";
return 0;
}
my $html=$response->content;
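# The queue limit and the link pattern were lost when this post was extracted;
# the limit of 1000 and the href regex below are assumed stand-ins.
if(scalar @urls < 1000)
{
while($html=~m/href\s*=\s*["']([^"'#]+)["']/ig)
{
my $new_url=URI->new_abs($1,$response->base);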
if(!exists($uniq_url{$new_url}))
{
push @urls,"$new_url";
$uniq_url{$new_url}=1;
}
}
}
$|=1;
my $root=HTML::TreeBuilder->new_from_content($html);
my $title=$root->find_by_tag_name('title');
if($title)
{
my $str_title=$title->as_text();
print "$url\t$str_title\t$type\n";
}
else{
print "$url\tno title\t$type\n";
}
}
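One thing to watch in the script above: several threads test %uniq_url and push onto @urls at the same time, so two threads can both decide the same link is new. Below is a minimal sketch of one way to close that gap with lock() from threads::shared. The helper name enqueue_if_new and the example.com URLs are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my @urls :shared;       # pending URLs, as in the script above
my %uniq_url :shared;   # URLs already seen

# enqueue_if_new is a hypothetical helper, not part of the original script.
# It checks and records a URL under one lock, so two threads cannot both add it.
sub enqueue_if_new {
    my ($new_url) = @_;
    lock(%uniq_url);                  # held until this sub returns
    return 0 if exists $uniq_url{$new_url};
    $uniq_url{$new_url} = 1;
    push @urls, "$new_url";           # stringify: shared arrays hold plain scalars
    return 1;
}

# Four threads all try to enqueue the same five (made-up) URLs.
my @candidates = map { "http://example.com/page$_" } 1 .. 5;
my @workers = map {
    threads->create(sub { enqueue_if_new($_) for @candidates; })
} 1 .. 4;
$_->join() for @workers;

printf "%d unique URLs queued\n", scalar @urls;
```

In parse, the exists/push pair inside the link loop could be replaced by a call to a helper like this. The lock is released automatically when the sub returns.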
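It seems to work quite well for crawling Baidu Music. From there you could store the links and song titles in MySQL and write a CGI script with a fuzzy SQL query to get a simple search. It is pretty crude, so please don't laugh.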
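For the fuzzy query idea, here is a rough sketch of how the search term could be turned into a LIKE pattern and passed as a bind parameter through DBI. The table songs with columns url and title is an assumed schema, and the sub names like_pattern and search_songs are invented for this example. The search sub expects an already connected DBI handle and is not called here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a LIKE pattern that matches $term anywhere in a column.
# '!' is used as the escape character, so a literal %, _ or ! in the
# search term is not treated as a wildcard.
sub like_pattern {
    my ($term) = @_;
    $term =~ s/([!%_])/!$1/g;
    return "%$term%";
}

# Hypothetical schema: table songs(url, title).
# $dbh is assumed to be a connected DBI database handle.
sub search_songs {
    my ($dbh, $term) = @_;
    my $sql = "SELECT url, title FROM songs WHERE title LIKE ? ESCAPE '!'";
    # Returns a reference to an array of hashes, one per matching row.
    return $dbh->selectall_arrayref($sql, { Slice => {} }, like_pattern($term));
}

print like_pattern("love"), "\n";
```

Passing the pattern as a placeholder value rather than pasting it into the SQL string keeps user input from a CGI form out of the query text.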