@twein89
2017-06-12T09:51:58.000000Z
Repository: git@gitlab.qjdchina.com:yantengwei/business-python.git
Branch: master
Tag: d0.0.3-1
Reference: the pyspider API documentation
Page parser: the pyquery documentation
The framework author's blog has a Chinese pyspider tutorial series; it is worth reading first.
To debug a single project script, for example court_zhixing.py, you can run it locally with the pyspider one command, e.g.:
pyspider one court/court_zhixing.py
Alternatively, you can debug it in the webUI.
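For context, a script that pyspider one can run is an ordinary pyspider handler; a minimal sketch (placeholder URL and fields, not the actual court_zhixing.py):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        # seed the crawl; pyspider one runs everything in a single process, no webUI
        self.crawl('http://example.com/', callback=self.detail_page)

    def detail_page(self, response):
        # response.doc is a pyquery object, as covered in the pyquery docs above
        return {'url': response.url, 'title': response.doc('title').text()}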
1. Project data migration:
Projects saved in the webUI are stored by pyspider in the projectdb table of its database, so they sometimes need to be migrated between the file-based database and another database type.
Taking sqlite and mongodb as an example:
cd spider_tender
Migrate projects from sqlite to mongodb:
python tools/migrate.py sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db mongodb+projectdb://127.0.0.1:27017/projectdb
Migrate projects from mongodb back to the local sqlite:
python tools/migrate.py mongodb+projectdb://127.0.0.1:27017/projectdb sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db
2. Project submission: start the project locally from local_config:
pyspider -c config/local_config.json
Paste the code into the webUI and save; the project is written automatically to the local sqlite file spider_tender/data/project.db. A git commit is then enough to update the project.
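For orientation, such a config file follows pyspider's standard JSON config format; a sketch of what local_config.json could contain (the paths and port here are assumptions, not the repo's actual file):

{
    "projectdb": "sqlite+projectdb:///data/project.db",
    "taskdb": "sqlite+taskdb:///data/task.db",
    "resultdb": "sqlite+resultdb:///data/result.db",
    "webui": {"port": 5000}
}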
3. Project deployment: see the deployment documentation for environment configuration and deployment.
1. Config file description:
The pre and online configuration files are in the spider_tender/config folder.
Taking online_config.json as an example:

{
    "result_worker": {"result_cls": "mongo_result_worker.OnlineResultWorker"},
    "court_mongo_param": {
        "ip": "10.171.250.51",
        "port": "27017",
        "database": "proxy",
        "username": "proxy",
        "password": "proxy_1",
        "replica": "online"
    }
}
The result_worker entry is a built-in framework setting:
"result_worker": {"result_cls": "mongo_result_worker.OnlineResultWorker"}
This part is custom configuration:
"court_mongo_param": {"ip": "10.171.250.51", "port": "27017", "database": "proxy", "username": "proxy", "password": "proxy_1", "replica": "online"}
Custom configuration is read via the click module's API. The code that reads the court mongodb configuration lives in spider_tender/court/court_db.py:

import click

# pyspider stores the parsed config file on the click context's obj
config = click.get_current_context().__dict__['obj']['config']
db_param = config.get('court_mongo_param', 0)
ip = db_param['ip']
port = int(db_param['port'])
database = db_param['database']
replica = db_param['replica']
username = db_param['username']
password = db_param['password']
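These parameters are then used to open the mongodb connection; a minimal sketch with pymongo (the get_court_db helper is illustrative, not the actual court_db.py code):

from pymongo import MongoClient

def get_court_db():
    # build an authenticated connection URI from the values read above
    uri = 'mongodb://%s:%s@%s:%d/%s?replicaSet=%s' % (
        username, password, ip, port, database, replica)
    return MongoClient(uri)[database]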
2. Saving crawl results:
There are two ways: write a result_worker, or write on_result in the crawl script itself (a sketch of the on_result approach follows below).
If you want to return multiple results from one response, use the self.send_message and on_message methods; see http://docs.pyspider.org/en/latest/apis/self.send_message/ in the documentation:

def detail_page(self, response):
    for i, each in enumerate(response.json['products']):
        self.send_message(self.project_name, {
            "name": each['name'],
            'price': each['prices'],
        }, url="%s#%s" % (response.url, i))

def on_message(self, project, msg):
    return msg

There is also an example of this in court_wenshu.py.
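For the second approach, overriding on_result directly in the crawl script, a minimal sketch (the collection attribute is hypothetical):

class Handler(BaseHandler):
    def on_result(self, result):
        # called with the return value of every callback; can be None (e.g. for on_start)
        if not result:
            return
        self.court_collection.insert_one(result)  # hypothetical pymongo collection
        # keep the default behavior so results still reach the result_worker/webUI
        super(Handler, self).on_result(result)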
3. Proxy configuration:

from pyspider.libs.base_handler import *
from settings import USER_AGENT

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v10',
        'proxy': 'http://duoipyewfynqt:wAgJx3wLYjF6A@222.184.35.196:33610',
        'headers': {'User-Agent': USER_AGENT},
    }

    @every(minutes=3 * 60)
    def on_start(self):
        # fetch the IP-echo page repeatedly to verify the proxy is in effect
        for i in range(20):
            self.crawl("http://ip.chinaz.com/getip.aspx#{}".format(i),
                       callback=self.page_parse)

    def page_parse(self, response):
        print(response.text)
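The proxy can also be set per request rather than globally in crawl_config; a minimal sketch (the proxy address is a placeholder):

def on_start(self):
    # the proxy argument of self.crawl overrides crawl_config for this single request
    self.crawl('http://ip.chinaz.com/getip.aspx',
               proxy='127.0.0.1:8888',
               callback=self.page_parse)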