Scraping Baidu Tieba images with a PHP crawler

xxqyzsc · posted 2015-8-24 12:59:56

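I recently needed to bulk-download images from Baidu Tieba, that is, to download every image from one particular forum (tieba).
I originally planned to write it in Python, but I am not familiar with Python. I tried minidom, HtmlParser and others and could not get comfortable with them, so I fell back on PHP, which I know better.
Here is the source code: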

<?php
// Maximum execution time in seconds (0 removes the limit)
@set_time_limit(60);
// Forum (tieba) name: URL-encoded GBK bytes of "图片" ("pictures")
$tbname = "%CD%BC%C6%AC";
// Crawl mode: 0 = walk the thread list, 1 = walk the picture list
$type = 0;
// List page URL template; in mode 0 the page offset is appended after "&pn="
$listurltpl = "http://tieba.baidu.com/f?kw=%s".($type?"&tp=1":"&pn=");
// Gallery page URL template
$galleryurltpl = "http://tieba.baidu.com/photo/bw/picture/guide?kw=%s&tid=%s&next=9999";
// Image URL template
$imageurltpl = "http://imgsrc.baidu.com/forum/pic/item/%s.jpg";
// Local save directory
$savepath = "h:/images/";
// Per-thread subdirectory
$filedirtpl = $savepath."%s/";
// Image file path
$filenametpl = $savepath."%s/%s.jpg";

// Base list URL; kept unchanged so offsets are not appended on top of each other
$listurlbase = sprintf($listurltpl,$tbname);
// Starting offset
$pn = 0;
while(1)
{
   // Build this page's URL from the base (the original appended $pn to the
   // same string on every pass, producing "pn=0", "pn=050", "pn=050100", ...)
   $listurl = $type ? $listurlbase : $listurlbase.$pn;
   // Fetch the list page source
   $listhtml = @file_get_contents($listurl);
   if($listhtml === false) break;
   // Match the thread ids
   if($type)
      preg_match_all('/<div class=\"aep_wrapper\" id=\"pic_item_(\d+)\" tid=\"\d+\">/',$listhtml,$m1);
   else
      preg_match_all('/<ul class=\"threadlist_media j_threadlist_media\" id=\"fm(\d+)\"/',$listhtml,$m1);
   // Thread id list
   $tidlist = $m1[1];
   // Stop once a page yields no threads
   if(!$tidlist) break;
   echo "Fetching ... <br /> \r\n";
   foreach($tidlist as $tid)
   {
      echo "--Gallery $tid <br /> \r\n";
      $galleryurl = sprintf($galleryurltpl,$tbname,$tid);
      // Fetch the source of the thread's gallery page
      $galleryhtml = @file_get_contents($galleryurl);
      if($galleryhtml === false) continue;
      // Match the image ids
      preg_match_all('/\{\"original\":\{\"id\":\"(\w+)\"/',$galleryhtml,$m2);
      // Image id list
      $pidlist = $m2[1];
      foreach($pidlist as $pid)
      {
         echo "----Picture {$tid}/{$pid}.jpg ";
         $filedir = sprintf($filedirtpl,$tid);
         $filename = sprintf($filenametpl,$tid,$pid);
         // Skip files that already exist
         if(!is_file($filename))
         {
               $imageurl = sprintf($imageurltpl,$pid);
               // Download the image
               $imagebin = @file_get_contents($imageurl);
               if($imagebin === false)
               {
                  echo "Failed! <br />\r\n";
                  continue;
               }
               // Create the directory if needed (recursively)
               if(!is_dir($filedir))
                  mkdir($filedir,0777,true);
               // Save the image
               file_put_contents($filename,$imagebin);
               $rnd = rand(2000,5000);
               echo "Downloaded! ";
               // Pause 2-5 s; sleep() only takes whole seconds, so use usleep() (microseconds)
               usleep($rnd*1000);
               echo "Sleep $rnd ms <br />\r\n";
         }
         else
               echo "Existed! <br />\r\n";
      }
   }
   // Mode 1 always requests the same list URL, so one pass is enough
   if($type) break;
   // Move on to the next page
   $pn += 50;
}
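The ID-matching step can be tried on its own, without hitting the network. What follows is only a small sketch: the helper name `extract_thread_ids` and the sample markup are made up for illustration, while the two patterns are the ones used in the script. Baidu's page markup may well have changed since this was written, so treat the patterns as a snapshot.

```php
<?php
// Hypothetical helper (not in the original script): pull thread ids out of
// a list page's HTML using the same two patterns as the crawler.
function extract_thread_ids(string $html, int $type): array
{
    $pattern = $type
        ? '/<div class="aep_wrapper" id="pic_item_(\d+)" tid="\d+">/'
        : '/<ul class="threadlist_media j_threadlist_media" id="fm(\d+)"/';
    preg_match_all($pattern, $html, $m);
    // $m[1] holds the captured ids; it is an empty array when nothing matches.
    return $m[1];
}

// Made-up sample markup, shaped like what the mode-0 pattern expects.
$sample = '<ul class="threadlist_media j_threadlist_media" id="fm123456">'
        . '<li></li></ul>'
        . '<ul class="threadlist_media j_threadlist_media" id="fm789">';

print_r(extract_thread_ids($sample, 0));
```

An empty result is the signal the main loop can use to stop paging.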
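Test run:
The program basically does what I need, but after it has been fetching images for a long time Baidu starts showing a captcha. When that happens, re-dialing the modem gets a new IP address and the crawl can continue.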
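Since long runs eventually get blocked, one option is to notice failed downloads and back off rather than keep hammering the server. This is a minimal sketch and not part of the original script: `fetch_with_retry` and its parameters are hypothetical names, the fetcher is passed in as a callable so the logic can be tried offline, and it only reacts to outright failures (`false`). It does not recognize a captcha page, whose markup is not known here.

```php
<?php
// Hypothetical helper: call $fetch($url) up to $maxTries times, doubling the
// pause between attempts. Returns the body, or false if every attempt failed.
function fetch_with_retry(callable $fetch, string $url, int $maxTries = 3, int $baseDelayMs = 1000)
{
    $delayMs = $baseDelayMs;
    for ($try = 1; $try <= $maxTries; $try++) {
        $body = $fetch($url);
        if ($body !== false) {
            return $body;
        }
        if ($try < $maxTries) {
            usleep($delayMs * 1000); // usleep() takes microseconds
            $delayMs *= 2;
        }
    }
    return false;
}

// Stand-in fetcher that fails twice and then succeeds (no network needed).
$calls = 0;
$fake = function (string $url) use (&$calls) {
    $calls++;
    return $calls < 3 ? false : "image-bytes";
};
$result = fetch_with_retry($fake, "http://example.invalid/a.jpg", 3, 1);
var_dump($result);
```

In the crawler the fetcher would be something like `function ($u) { return @file_get_contents($u); }`. If all attempts fail, that is a reasonable moment to stop and change IP as described above.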
(For learning and reference only. Please do not use it for anything illegal.)
