Batch Downloading with Python via Web Page Analysis

michellc · Posted on 2017-5-5 10:53:28

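Final version: the earlier versions (see my previous posts) mostly used regular expressions to match and extract the download links. That approach has two drawbacks: 1. The pages being analyzed are highly regular, so doing the extraction with regular expressions is needlessly cumbersome and clearly not a good fit. 2. Every task is different, so each one had to be coded from scratch; the implementations followed no common pattern and could not be extended. This post is therefore the final version for this topic. There is plenty more to learn, so I will not dwell on this area any further.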

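Runnable version: this time I finally have a runnable version that I am reasonably satisfied with, so I am posting the source code as a summary of this stretch of work. Fetching other related resources can be done by extending the code below, and quite easily, so I will not provide those separately.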

Requirements: you must first download and install python-2.4.2.msi and set up the Python environment; then click start.bat to run the batch download.

Source code: the project consists of just two files, start.bat and CustomParser.py:

start.bat

   rem make the dir for files and run the project

   mkdir files
   python CustomParser.py
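
When start.bat is run a second time, `mkdir files` reports an error because the directory already exists (the script still goes on to run the parser). As a minimal sketch, the same preparation step could be done from Python instead. This assumes Python 3.2 or later (for the `exist_ok` parameter) rather than the Python 2.4 used in this post, and the function name `ensure_download_dir` is made up for illustration:

```python
import os


def ensure_download_dir(path="files"):
    """Create the download directory if it is missing and return its absolute path."""
    # exist_ok=True makes repeated runs harmless: no error if the directory is already there
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)


if __name__ == "__main__":
    print(ensure_download_dir())
```

Calling this once before starting the downloads would make the `mkdir files` line in start.bat unnecessary.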

CustomParser.py

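# -*- coding: utf-8 -*-
# Written for Python 2 (sgmllib does not exist in Python 3).
# The docstring below contains non-ASCII text, so save this file as UTF-8.
from sgmllib import SGMLParser
from string import find, replace
from threading import Thread
import urllib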

__author__ = "Chen Peng (peng.ch@hotmail.com)"
__version__ = "$Revision: 1.0 $"
__date__ = "$Date: 2006/03/03 $"
__copyright__ = "Copyright (c) 2006 Chen Peng"
__license__ = "Python"

__all__ = ["Gif_163_Parser"]

class PDownloadThread(Thread):
    """
    Download the files in the dict and save them to local files with the given name
    """
    def __init__(self, DictList, i):
        Thread.__init__(self)
        self.DictList = DictList
        self.pageno = str(i)

    def run(self):
        for k in self.DictList.keys():
            url = self.DictList[k]
            try:
                print 'Download ' + url + '......'
                # Save as .\files\<name>.<extension taken from the end of the URL>
                urllib.urlretrieve(url, '.\\files\\' + k + '.' + url.split('.')[-1])
            except IOError:
                # Record page number, URL and name of the failed download
                logfile = open('error.log', 'a')
                logfile.write(self.pageno + ' ' + url + ' ' + k + '\n')
                logfile.close()
            else:
                print 'Save to file ' + k

class Gif_163_Parser(SGMLParser):
    r"""
    Task: download color images from 163
    How it works: http://mms.163.com/new_web/cm_lv2_pic.jsp?catID=&ord=dDate&page=2&type=1&key=
        Pages 1 to 415 (6637 in total) are parsed for paths such as "/fgwx/hhsj/1_060302175613_186/128x128.gif"
    eg: <script>showPic('22930','1','/fgwx/hhsj/1_060302175613_186/128x128.gif','1','编号：22930\n名字: 因为有你\n人气:100');</script>
    Download URL: http://mmsimg.163.com/new_web/loaditem.jsp/type=1/path=/fgwx/llfj/1_060302175612_995/176x176.gif
    """
    def reset(self):
        SGMLParser.reset(self)
        self.headURL = 'http://mmsimg.163.com/new_web/loaditem.jsp/type=1/path='
        self.SubURL = []
        self.Links = {}

    def start_script(self, attrs):
        pass

    def end_script(self):
        pass

    def handle_data(self, text):
        if find(text, 'showPic') != -1:
            # The caption fields are separated by the JavaScript escape \n, which
            # appears in the page source as a literal backslash followed by n.
            # '\xc3\xfb\xd7\xd6: ' is the GBK-encoded "name" label (see the eg above).
            name = replace(text.split('\\n')[1], '\xc3\xfb\xd7\xd6: ', '')
            # The image path is the third comma-separated argument, minus its quotes
            self.Links[name] = self.headURL + replace(text.split(',')[2], '\'', '')

    def Execute(self):
        for i in range(1, 416):  # pages 1 through 415
            # Fresh parser state and a new, empty Links dict for every page, so
            # each download thread only receives the links of its own page
            self.reset()
            try:
                usock = urllib.urlopen("http://mms.163.com/new_web/cm_lv2_pic.jsp?catID=&ord=dDate&page=" + str(i) + "&type=1&key=")
                self.feed(usock.read())
                usock.close()
                self.close()  # flush buffered data before handing the links over
                TestThread = PDownloadThread(self.Links, i)
                TestThread.start()
            except IOError:
                pass
        #print ( ["%s=%s\n"% ( k, self.Links[k] ) for k in self.Links.keys()] )
        #print self.Links

if __name__ == '__main__':
    #Gif_163_Parser().Execute();
    testtask = Gif_163_Parser()
    testtask.Execute()
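
The listing above depends on `sgmllib`, which was removed in Python 3. As a rough sketch only, the link-extraction step could be written against the standard `html.parser` module instead. The names `ShowPicParser`, `parse_links`, `HEAD_URL`, `NAME_LABEL` and `SAMPLE` are made up for illustration, and the sketch assumes the page has already been decoded to text (for example from GBK), so the name label is matched as a string rather than as raw bytes:

```python
from html.parser import HTMLParser

# Download prefix taken from the listing above
HEAD_URL = "http://mmsimg.163.com/new_web/loaditem.jsp/type=1/path="
# Label that precedes the picture name in the caption (assumes decoded text)
NAME_LABEL = "名字: "

# Sample markup from the docstring above; the raw string keeps the JavaScript
# escape as a literal backslash + n, the way it appears in the page source
SAMPLE = (
    r"<script>showPic('22930','1','/fgwx/hhsj/1_060302175613_186/128x128.gif',"
    r"'1','编号：22930\n名字: 因为有你\n人气:100');</script>"
)


class ShowPicParser(HTMLParser):
    """Collect {name: download URL} from showPic(...) calls inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if not self._in_script or "showPic" not in data:
            return
        # Third comma-separated argument: the quoted image path
        path = data.split(",")[2].strip().strip("'")
        # Second caption field: the label followed by the picture name
        name = data.split("\\n")[1].replace(NAME_LABEL, "")
        self.links[name] = HEAD_URL + path


def parse_links(html_text):
    """Return the name -> URL mapping found in one page of HTML text."""
    parser = ShowPicParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


links = parse_links(SAMPLE)
print(links)
```

Creating a new parser for every page keeps each page's links in their own dict, so they can be handed to a download thread without being shared between pages.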
