z7369 posted on 2018-8-31 09:25:12

Multithreaded web page crawling with Perl

  Perl is very capable at fetching web pages, so I tried crawling pages with multiple threads.
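  #!/usr/bin/perl
  use strict;
  use warnings;
  use threads;
  use threads::shared;
  use LWP;
  use LWP::Simple;
  use LWP::UserAgent;
  use LWP::ConnCache;
  use HTML::TreeBuilder;
  use URI;

  $| = 1;    # unbuffered output, so lines printed by the threads show up promptly

  my @urls :shared;        # queue of URLs waiting to be fetched
  my %uniq_url :shared;    # URLs already seen, so nothing is queued twice

  my $starturl = $ARGV[0] or die "usage: $0 start_url\n";
  push @urls, $starturl;
  $uniq_url{$starturl} = 1;

  # Not shared: every thread works on its own copy of this object.
  my $browser = LWP::UserAgent->new();
  $browser->timeout(10);
  $browser->protocols_allowed(['http', 'gopher']);
  $browser->conn_cache(LWP::ConnCache->new());

  # Assumption: the original queue limit was lost when the post was archived;
  # 1000 is only a placeholder value.
  my $max_queue = 1000;

  my $num = 1;
  while (scalar @urls > 0)
  {
      if (scalar @urls == 1)
      {
          # Only one URL queued: fetch it in the main thread.
          my $url = shift @urls;
          parse($url, 'old');
      }
      if (scalar @urls >= 2)
      {
          # One thread per queued URL, at most 20 at a time.
          if (scalar @urls >= 20)
          {
              $num = 20;
          }
          else
          {
              $num = scalar @urls;
          }
          my @thread;
          for (my $j = 0; $j < $num; $j++)
          {
              my $url = shift @urls;
              $thread[$j] = threads->create(\&parse, $url, "thread$j");
          }
          # Wait for the whole batch before starting the next one.
          for (my $j = 0; $j < $num; $j++)
          {
              $thread[$j]->join();
          }
      }
  }

  # Fetch one page, queue the links found in it, and print its title.
  sub parse
  {
      my $url  = shift;
      my $type = shift;
      my $response = $browser->get($url);
      unless ($response->is_success)
      {
          print "can't access $url: ", $response->status_line, "\n";
          return 0;
      }
      my $html = $response->content;
      if (scalar @urls < $max_queue)
      {
          # Assumption: the link pattern was also lost in the archive;
          # this one matches double-quoted href values.
          while ($html =~ m/href="(.*?)"/ig)
          {
              my $new_url = URI->new_abs($1, $response->base);
              # Hold the lock so two threads cannot queue the same URL.
              lock(%uniq_url);
              if (!exists($uniq_url{$new_url}))
              {
                  push @urls, "$new_url";
                  $uniq_url{$new_url} = 1;
              }
          }
      }
      my $root  = HTML::TreeBuilder->new_from_content($html);
      my $title = $root->find_by_tag_name('title');
      if ($title)
      {
          my $str_title = $title->as_text();
          print "$url\t$str_title\t$type\n";
      }
      else
      {
          print "$url\tno title\t$type\n";
      }
      $root->delete;    # free the parse tree explicitly
      return 1;
  }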
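  It seems to work quite well for crawling Baidu Music. From there you could store the links and song titles in MySQL and write a CGI script with a fuzzy SQL query to get a simple search. It's pretty crude, so please don't laugh.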
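
  For the fuzzy query idea, here is a rough sketch of how the lookup might go. Everything specific in it is my own assumption and not from the script above: the table `songs` with columns `title` and `url`, the helper names `like_pattern` and `search_songs`, and the connection string in the comment. It turns the keyword into a `LIKE` pattern, escaping `%` and `_` so they match literally, and binds it through a DBI placeholder instead of pasting it into the SQL.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a user keyword into a LIKE pattern that matches it anywhere in a string.
# Backslash, % and _ are escaped so they are matched literally
# (backslash is MySQL's LIKE escape character by default).
sub like_pattern
{
    my ($keyword) = @_;
    $keyword =~ s/([\\%_])/\\$1/g;
    return "%$keyword%";
}

# Hypothetical lookup: assumes a table songs(title, url) filled by the crawler.
# $dbh is a DBI database handle, for example from
#   DBI->connect('dbi:mysql:database=music', $user, $password, { RaiseError => 1 });
sub search_songs
{
    my ($dbh, $keyword) = @_;
    # The pattern goes in through a placeholder, never into the SQL text itself.
    return $dbh->selectall_arrayref(
        'SELECT title, url FROM songs WHERE title LIKE ? LIMIT 50',
        { Slice => {} },    # return each row as a hash reference
        like_pattern($keyword),
    );
}

print like_pattern('love'), "\n";    # prints %love%
print like_pattern('100%'), "\n";    # prints %100\%%
```

  A CGI script would only need to read the keyword from the request, call something like `search_songs`, and print each `title` with its `url`. A leading `%` means MySQL cannot use an ordinary index on `title`, which is probably fine for a small table like this one.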
