Scraping Lagou job postings with Python and doing simple data analysis with pandas

9778998 · Posted 2016-10-10 08:33:54

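I recently arrived in Shanghai to look for a Python backend job. I have interviewed at quite a few companies and browsed all kinds of recruiting sites, so I decided to do something a bit more interesting along the way: scrape the job postings from one of them. This time I picked the Python postings on Lagou.
Browser: Firefox (it lets you inspect a site's requests and responses in detail).
1. Enter the keyword "python" and click search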

QQ截图20161010083252.png

2. Press F12 to open Firebug. You can see the data submitted via POST, and the response body is exactly the search result data shown on the page.
QQ截图20161010083257.png

3. These are the parameters that control pagination
QQ截图20161010083302.png
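
The paging parameter can be illustrated with a small sketch. It assumes the query parameters are the ones that appear in the request URL used by the spider code later in this post (city, needAddtionalResult, first, kd, pn); the helper name build_page_url is made up for illustration and is not part of the original script.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in the spider code in this post.
BASE_URL = "http://www.lagou.com/jobs/positionAjax.json"


def build_page_url(keyword, page, city="上海"):
    """Return the search URL for one page of results (hypothetical helper)."""
    params = {
        "city": city,                    # percent-encoded as UTF-8 by urlencode
        "needAddtionalResult": "false",  # spelled as in the original request
        "first": "false",
        "kd": keyword,                   # search keyword
        "pn": page,                      # page number
    }
    return BASE_URL + "?" + urlencode(params)


print(build_page_url("python", 2))
```

Only pn changes from one page to the next, so looping over page numbers is enough to walk through the results.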

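4. Putting the analysis together, the API that returns the data is
http://www.lagou.com/jobs/positi ... p;kd=python&pn=<page number>
All that remains is to write the crawler in Python.
The code is as follows: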


# encoding: utf-8
"""
@author: chenhuachao
@license: Apache Licence
@file: LaGouSpider.py
@time: 2016/10/1 8:17
"""
import json
import time

import requests

from utils.util import MysqlOp  # the author's own MySQL helper (not shown in this post)


class LaGouSpiders(object):
    '''Test spider for Lagou'''

    def __init__(self, keyword):
        self.keyword = keyword

    def spider_run(self):
        url = "http://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&first=false&kd={0}&pn=".format(self.keyword)
        content = json.loads(requests.get(url + str(1)).text)
        # pageSize from the first response is used as the loop bound,
        # so pages 1 .. pageSize-1 are requested.
        pagesize = content.get('content').get('pageSize')
        for page in range(1, pagesize):
            content_next = json.loads(requests.get(url + str(page)).text)
            company_inro_list = content_next.get('content').get('positionResult').get('result')
            if company_inro_list:
                for company_inro in company_inro_list:
                    companyId = company_inro.get("companyId")  # company ID
                    businessZones = ','.join(company_inro.get('businessZones')) if company_inro.get('businessZones') else "无"  # work location
                    companyFullName = company_inro.get("companyFullName", "")  # company name
                    positionName = company_inro.get("positionName")  # position name
                    education = company_inro.get("education", "")  # education requirement
                    city = company_inro.get('city', "")  # city
                    financeStage = company_inro.get("financeStage", "")  # company stage (listed, Series A/B/C/D)
                    salary = company_inro.get("salary", "")  # salary
                    workYear = company_inro.get("workYear", "")  # years of experience
                    companySize = company_inro.get("companySize", "")  # company headcount
                    industryField = company_inro.get("industryField", "")  # industry
                    positionAdvantage = company_inro.get("positionAdvantage", "")  # company culture
                    companyLabelList = ','.join(company_inro.get("companyLabelList")) if company_inro.get("companyLabelList", "") else "无"  # company benefits
                    insert_mysql = MysqlOp()
                    # Values are interpolated straight into the statement,
                    # so a quote character in any field will break the INSERT.
                    sql = '''INSERT INTO spider.lagou_spider (companyId,job_type,businessZones,companyFullName,positionName,
                                       education,city,financeStage,salary,workYear,companySize,industryField,positionAdvantage,companyLabelList)
                                    VALUES ({0},'{1}','{2}','{3}','{4}','{5}','{6}','{7}','{8}','{9}','{10}','{11}','{12}','{13}')'''.format(companyId,
                                          self.keyword, businessZones, companyFullName, positionName, education, city,
                                          financeStage, salary, workYear, companySize, industryField, positionAdvantage, companyLabelList)
                    print(sql)
                    insert_mysql.insert(sql)
            time.sleep(15)  # pause between pages to avoid hammering the site


if __name__ == '__main__':
    keyword = input("Enter the position to crawl: ")
    spider = LaGouSpiders(keyword)
    spider.spider_run()
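
The loop above takes its upper bound from pageSize. As an alternative, here is a minimal sketch that keeps requesting pages until one comes back with no results. It assumes the same response shape the spider reads (content, then positionResult, then result); iter_results, fetch_page and the max_pages cap are hypothetical names introduced for this example, and the fake responses exist only so the sketch runs offline.

```python
def iter_results(fetch_page, max_pages=30):
    """Yield job records page by page until a page comes back empty.

    fetch_page is a hypothetical callable: page number -> parsed JSON dict.
    max_pages is an arbitrary safety cap, not a value taken from the site.
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        content = data.get("content") or {}
        results = (content.get("positionResult") or {}).get("result") or []
        if not results:
            break  # first empty page: assume there is nothing further
        for record in results:
            yield record


def fake_fetch(page):
    """Stand-in for the HTTP request; returns two pages of made-up records."""
    pages = {
        1: [{"positionName": "A"}, {"positionName": "B"}],
        2: [{"positionName": "C"}],
    }
    return {"content": {"positionResult": {"result": pages.get(page, [])}}}


names = [record["positionName"] for record in iter_results(fake_fetch)]
print(names)
```

In the real spider, fetch_page could wrap the requests.get call for url + str(page) together with the time.sleep pause between pages.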

5. Run the code and enter python

The collected data is shown in the screenshot below:
QQ截图20161010083309.png
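
The spider builds its INSERT statement by string formatting, which fails as soon as a company name or benefit text contains a quote. The sketch below shows the placeholder-based alternative. It is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for MySQL, the table has a reduced set of columns, and insert_row and the sample row are invented for the example. The MysqlOp helper from the post is not shown, so whether it accepts parameters is unknown.

```python
import sqlite3

# Reduced column set for illustration; the real table has more fields.
COLUMNS = ["companyId", "job_type", "companyFullName", "positionName", "salary"]


def insert_row(conn, row):
    """Insert one record using placeholders instead of string formatting."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO lagou_spider ({0}) VALUES ({1})".format(
        ", ".join(COLUMNS), placeholders
    )
    # The driver handles quoting, so quotes inside values are safe.
    conn.execute(sql, [row.get(col, "") for col in COLUMNS])


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.execute(
    "CREATE TABLE lagou_spider (companyId INTEGER, job_type TEXT, "
    "companyFullName TEXT, positionName TEXT, salary TEXT)"
)

# Made-up sample row with a quote in the company name.
insert_row(conn, {
    "companyId": 1,
    "job_type": "python",
    "companyFullName": "O'Example Ltd",
    "positionName": "backend",
    "salary": "n/a",
})
conn.commit()

stored = conn.execute("SELECT companyFullName FROM lagou_spider").fetchone()[0]
print(stored)
```

Common MySQL drivers for Python follow the same DB-API pattern but use %s as the placeholder instead of ?.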