@twein89
2017-06-12T09:51:58.000000Z
Repository:
* repository: git@gitlab.qjdchina.com:yantengwei/business-python.git
* branch: master
* tag: d0.0.3-1
See the pyspider API documentation for reference.
For the page parser, see the pyquery documentation.
The framework author's blog has a Chinese pyspider tutorial series; it is a good starting point.
To debug a single project, e.g. the script court_zhixing.py, you can run it locally with the pyspider one command, for example:
pyspider one court/court_zhixing.py
Or you can debug it in the webUI.
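For orientation, a minimal script runnable this way might look like the sketch below (hypothetical; the real court scripts in the repo are more involved, and example.com is a placeholder URL):
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        # example.com is a placeholder; the real scripts crawl court sites
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        # response.doc is a pyquery object over the fetched page
        return {'url': response.url, 'title': response.doc('title').text()}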
1. Project data migration:
Projects saved in the pyspider webUI are stored in the projectdb table of the database, so you sometimes need to migrate them between the file database and another database type.
Taking sqlite and mongodb as an example:
cd spider_tender
Migrate projects from sqlite to mongodb:
python tools/migrate.py sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db mongodb+projectdb://127.0.0.1:27017/projectdb
Migrate projects from mongodb back to local sqlite:
python tools/migrate.py mongodb+projectdb://127.0.0.1:27017/projectdb sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db
2. Project submission: start the project locally from local_config:
pyspider -c config/local_config.json
Paste the code into the webUI and save it; the project is automatically written to the local sqlite database at spider_tender/data/project.db. A git commit then updates the project in the repository.
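One plausible shape for local_config.json, assuming the default sqlite layout under spider_tender/data (the repo's actual file may differ; pyspider config keys mirror its command-line options):
{
    "projectdb": "sqlite+projectdb:///data/project.db",
    "taskdb": "sqlite+taskdb:///data/task.db",
    "resultdb": "sqlite+resultdb:///data/result.db"
}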
3. Project deployment: see the deployment document for environment setup and deployment.
1. Config file notes:
The pre and online config files are in the spider_tender/config folder.
Taking online_config.json as an example:
{
    "result_worker": {
        "result_cls": "mongo_result_worker.OnlineResultWorker"
    },
    "court_mongo_param": {
        "ip": "10.171.250.51",
        "port": "27017",
        "database": "proxy",
        "username": "proxy",
        "password": "proxy_1",
        "replica": "online"
    }
}
The result_worker key is a built-in framework setting:
"result_worker": {
"result_cls": "mongo_result_worker.OnlineResultWorker"
},
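In other words, result_cls swaps in a custom result worker class. A minimal sketch of what such a class could look like (hypothetical; the real mongo_result_worker.OnlineResultWorker lives in the repo, and write_result is an invented helper):
from pyspider.result import ResultWorker

class OnlineResultWorker(ResultWorker):
    def on_result(self, task, result):
        # Called for every finished task; skip empty results.
        if not result:
            return
        # write_result is a hypothetical helper that inserts into MongoDB
        # using the court_mongo_param settings described below.
        write_result(task['taskid'], task['url'], result)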
This part is custom configuration:
"court_mongo_param": {
"ip": "10.171.250.51",
"port": "27017",
"database": "proxy",
"username": "proxy",
"password": "proxy_1",
"replica": "online"
}
Reading the custom configuration relies on the click module's API. The code that reads the court MongoDB configuration lives in spider_tender/court/court_db.py:
import click

# pyspider stores the parsed config file on the click context object,
# so any module can read custom sections from it.
config = click.get_current_context().obj['config']
db_param = config.get('court_mongo_param', {})
ip = db_param['ip']
port = int(db_param['port'])  # the port is stored as a string in the JSON
database = db_param['database']
replica = db_param['replica']
username = db_param['username']
password = db_param['password']
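These parameters can then be used to open a MongoDB connection, e.g. with pymongo. A minimal sketch, assuming "replica" names a MongoDB replica set (the actual code in court_db.py may differ):
from pymongo import MongoClient

# Build a connection URI from the values read above.
uri = 'mongodb://%s:%s@%s:%d/%s?replicaSet=%s' % (
    username, password, ip, port, database, replica)
client = MongoClient(uri)
db = client[database]  # e.g. the "proxy" database from the config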
2. Saving crawl results
There are two ways: write a result_worker, or override on_result in the crawl script; a sketch of the latter follows.
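A minimal sketch of the on_result approach inside a handler (hypothetical; save_to_court_db stands in for whatever court_db.py actually exposes):
def on_result(self, result):
    # result is whatever a callback returned; skip empty ones.
    if not result:
        return
    save_to_court_db(result)  # hypothetical helper writing to the court MongoDB
    # Optionally keep the default path so the result_worker still runs:
    super(Handler, self).on_result(result)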
To return multiple results from a single response, use the self.send_message and on_message methods; see http://docs.pyspider.org/en/latest/apis/self.send_message/ in the docs:
def detail_page(self, response):
    # Emit one message per product; the #i fragment makes each url unique.
    for i, each in enumerate(response.json['products']):
        self.send_message(self.project_name, {
            'name': each['name'],
            'price': each['prices'],
        }, url="%s#%s" % (response.url, i))

def on_message(self, project, msg):
    # Each message becomes a result of the generated task.
    return msg
There is also an example of this in court_wenshu.py.
3. Proxy configuration:
from pyspider.libs.base_handler import *
from settings import USER_AGENT

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v10',
        # Route every request in this project through the HTTP proxy.
        'proxy': 'http://duoipyewfynqt:wAgJx3wLYjF6A@222.184.35.196:33610',
        'headers': {
            'User-Agent': USER_AGENT
        }
    }

    @every(minutes=3 * 60)
    def on_start(self):
        # The #{} fragment keeps the 20 URLs distinct so pyspider does not
        # deduplicate them as the same task.
        for i in range(20):
            self.crawl("http://ip.chinaz.com/getip.aspx#{}".format(i), callback=self.page_parse)

    def page_parse(self, response):
        # Prints the IP the target site sees, confirming the proxy is in use.
        print(response.text)