@twein89 · 2017-06-12

pyspider Handover Document

Crawler


Repository:
* repository: git@gitlab.qjdchina.com:yantengwei/business-python.git
* branch: master
* tag: d0.0.3-1

Basic documentation

* pyspider API reference
* pyquery documentation (the page parser)
* The framework author's blog has a Chinese pyspider tutorial series; start with that.

Debugging

To debug a project script such as court_zhixing.py, run it locally in one-off mode:

    pyspider one court/court_zhixing.py

Alternatively, debug it in the webUI.


Deployment

1. Project data migration:
Projects saved in the webUI are stored in the projectdb table of the database, so they sometimes need to be migrated between the file-based database and other database types. Taking sqlite and mongodb as an example:

    cd spider_tender

Migrate projects from sqlite to mongodb:

    python tools/migrate.py sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db mongodb+projectdb://127.0.0.1:27017/projectdb

Migrate projects from mongodb back to local sqlite:

    python tools/migrate.py mongodb+projectdb://127.0.0.1:27017/projectdb sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db

2. Project submission: start pyspider locally from local_config:

    pyspider -c config/local_config.json

Paste the code into the webUI and save; the project is automatically written to the local sqlite database at spider_tender/data/project.db. Committing that file to git is then enough to update the project.
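As a sanity check before committing, you can list the saved projects straight from the sqlite file (a minimal sketch; the projectdb table with name and status columns follows pyspider's sqlite schema, so verify against your build):

    import sqlite3

    # data/project.db is where pyspider persists webUI-saved projects
    conn = sqlite3.connect('data/project.db')
    for name, status in conn.execute('SELECT name, status FROM projectdb'):
        print(name, status)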

3. Project deployment: see the deployment document for environment setup and deployment steps.


Development

1. Config files:
The pre and online config files live in the spider_tender/config folder. Taking online_config.json as an example:

    {
        "result_worker": {
            "result_cls": "mongo_result_worker.OnlineResultWorker"
        },
        "court_mongo_param": {
            "ip": "10.171.250.51",
            "port": "27017",
            "database": "proxy",
            "username": "proxy",
            "password": "proxy_1",
            "replica": "online"
        }
    }

The result_worker part is a built-in framework setting; result_cls names the result worker class to load (sketched below):

  1. "result_worker": {
  2. "result_cls": "mongo_result_worker.OnlineResultWorker"
  3. },
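A minimal sketch of the shape of such a class, based on pyspider's ResultWorker base class (the real OnlineResultWorker lives in this repo's mongo_result_worker.py; the write helper below is hypothetical):

    from pyspider.result import ResultWorker

    class OnlineResultWorker(ResultWorker):
        def on_result(self, task, result):
            # called once per crawl result; persist it to MongoDB instead of
            # pyspider's default resultdb
            write_to_mongo(task, result)  # hypothetical helper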

This part is custom configuration:

  1. "court_mongo_param": {
  2. "ip": "10.171.250.51",
  3. "port": "27017",
  4. "database": "proxy",
  5. "username": "proxy",
  6. "password": "proxy_1",
  7. "replica": "online"
  8. }

Custom configuration is read through the click module's API; the code that reads the court MongoDB config lives in spider_tender/court/court_db.py:

    import click

    # pyspider stores the parsed -c config file in the click context object
    config = click.get_current_context().__dict__['obj']['config']
    db_param = config.get('court_mongo_param', 0)
    ip = db_param['ip']
    port = int(db_param['port'])  # port is stored as a string in the JSON
    database = db_param['database']
    replica = db_param['replica']
    username = db_param['username']
    password = db_param['password']
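For illustration, a minimal sketch of turning these parameters into a MongoDB connection with pymongo (the URI construction is an assumption, including that "replica" names the replica set; the real connection code in court_db.py may differ):

    from pymongo import MongoClient

    # Assumption: the fields map onto a standard MongoDB URI, with 'replica'
    # used as the replica set name
    uri = 'mongodb://%s:%s@%s:%d/%s?replicaSet=%s' % (
        username, password, ip, port, database, replica)
    client = MongoClient(uri)
    db = client[database]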

2. Saving crawl results:
There are two ways: write a result_worker, or implement on_result in the crawl script itself (a sketch of the latter follows the example below).
To return multiple results from a single response, use self.send_message together with on_message; see http://docs.pyspider.org/en/latest/apis/self.send_message/ in the docs:

    def detail_page(self, response):
        for i, each in enumerate(response.json['products']):
            self.send_message(self.project_name, {
                "name": each['name'],
                'price': each['prices'],
            }, url="%s#%s" % (response.url, i))

    def on_message(self, project, msg):
        return msg

There is also a working example of this in court_wenshu.py.
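For the on_result route, a minimal sketch (the method name and signature follow pyspider's BaseHandler; the storage helper is hypothetical):

    def on_result(self, result):
        # pyspider calls this for every result the script produces;
        # result is None when a callback returns nothing
        if not result:
            return
        save_result(result)  # hypothetical helper writing to your own store
        # keep the default behavior (write to resultdb) as well
        super(Handler, self).on_result(result)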

3. Proxy configuration:
Set proxy in crawl_config, which pyspider applies to every request the script issues; the example below fetches an IP echo page through the proxy, so the printed address doubles as a proxy check:

    from pyspider.libs.base_handler import *
    from settings import USER_AGENT

    class Handler(BaseHandler):
        crawl_config = {
            'itag': 'v10',
            'proxy': 'http://duoipyewfynqt:wAgJx3wLYjF6A@222.184.35.196:33610',
            'headers': {
                'User-Agent': USER_AGENT
            }
        }

        @every(minutes=3 * 60)
        def on_start(self):
            for i in range(20):
                self.crawl("http://ip.chinaz.com/getip.aspx#{}".format(i), callback=self.page_parse)

        def page_parse(self, response):
            print(response.text)
