@twein89
2016-06-23T03:30:25.000000Z
Word count: 1321 · Views: 817
Tags: crawler, python, scrapy
This crawler is built on the Scrapy framework and consists of the following modules:
- base module
- JS data-handling module
- proxy module
Environment: pyenv-virtualenv, Python 3.5.1
PyPI packages: scrapy / pymongo / scrapy-splash
Function: parse HTML page data and store it in a database (currently MongoDB)
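The storage step can be sketched as a Scrapy item pipeline backed by pymongo. This is a minimal sketch, not the project's actual code: the class name, database name, and per-spider collection layout are all illustrative assumptions.

```python
class MongoPipeline:
    """Sketch of an item pipeline that writes scraped items into MongoDB.

    All names here (class, URI, database) are assumptions for illustration.
    """

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="scrapy_data"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.client = None

    def open_spider(self, spider):
        # Imported lazily so this module loads even without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        if self.client:
            self.client.close()

    def process_item(self, item, spider):
        # One document per item, grouped into a collection named after the spider
        self.db[spider.name].insert_one(dict(item))
        return item
```

A pipeline like this would be enabled through Scrapy's `ITEM_PIPELINES` setting.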
Run the spider:
$ scrapy crawl <spidername>
There are two approaches to JS rendering: Splash and PhantomJS.
See the scrapy-splash GitHub repository.
Install Splash under Docker.
Package dependency:
$ pip install scrapy-splash
Run the Splash service:
$ docker run -p 8050:8050 scrapinghub/splash
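Once the container is up, Splash can also be exercised directly through its HTTP API, independent of scrapy-splash. A minimal sketch, assuming the service is on localhost:8050 as started above:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumed: the local container started above

def splash_render_url(target_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint (returns JS-rendered HTML)."""
    qs = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(SPLASH_URL, qs)

if __name__ == "__main__":
    # Requires the Splash container to be running
    html = urlopen(splash_render_url("http://example.com")).read()
    print(html[:200])
```

The `wait` argument gives the page's JavaScript time to run before the HTML is captured.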
Install PhantomJS globally via npm:
$ npm -g install phantomjs
Asynchronous calls are implemented by extending the downloader handler;
downloader.py invokes the JS rendering script render.js.
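A simplified, synchronous sketch of the PhantomJS path (the real downloader handler is asynchronous): invoke render.js under the phantomjs binary and read the rendered HTML from stdout. The function names are assumptions; only `phantomjs` and `render.js` come from the text above, and the script is assumed to take the target URL as its first argument.

```python
import subprocess

PHANTOMJS = "phantomjs"      # assumes PhantomJS is on PATH (npm -g install phantomjs)
RENDER_SCRIPT = "render.js"  # the rendering script referenced above

def render_cmd(url):
    # Command line handed to PhantomJS; render.js receives the URL as an argument
    return [PHANTOMJS, RENDER_SCRIPT, url]

def render(url, timeout=30):
    """Run render.js under PhantomJS and return the rendered HTML from stdout."""
    out = subprocess.run(render_cmd(url), capture_output=True,
                         timeout=timeout, check=True)
    return out.stdout.decode("utf-8")
```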
a. Implemented by extending middlewares.py: each request is sent through a proxy IP picked at random from proxy.txt.
To verify that the proxies work, run:
$ python proxychecker.py
Unverified proxy list: proxylist.txt
Verified proxy list: proxy.txt
b. When scraping through Splash, the proxy only needs to be configured in Splash.
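Scheme (a) can be sketched as a downloader middleware. The class name is an assumption; proxy.txt is assumed to hold one `host:port` entry per line, as produced by the checker above.

```python
import random

class RandomProxyMiddleware:
    """Sketch of a downloader middleware that picks a random verified proxy.

    Names here are assumptions, not the project's exact middlewares.py code.
    """

    def __init__(self, proxies=None, path="proxy.txt"):
        if proxies is None:
            # Load the verified list: one "host:port" per line
            with open(path) as f:
                proxies = [line.strip() for line in f if line.strip()]
        self.proxies = proxies

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever request.meta["proxy"] names
        if self.proxies:
            request.meta["proxy"] = "http://" + random.choice(self.proxies)
```

A middleware like this would be registered in the `DOWNLOADER_MIDDLEWARES` setting.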
$ curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
Edit .bash_profile as prompted, then install the interpreter:
$ pyenv install 3.5.1
The Splash cluster setup follows flyingelephantlab/aquarium on GitHub.
First, make sure Docker and Docker Compose are installed.
Then install cookiecutter:
$ pip install cookiecutter
Then generate a folder with config files:
$ cookiecutter gh:flyingelephantlab/aquarium
With all default options it will create an aquarium folder in the current path. Go to this folder and start the Splash cluster:
$ cd ./aquarium
$ docker-compose up
After deployment, configure the proxy. For example, if the proxy is 192.168.199.177:1080 with type socks5,
edit ~/aquarium/proxy-profiles/default.ini as follows:
; enable tor for .onion links
[proxy]
;host = tor
;host = 192.168.99.100
host = 192.168.199.177
;port = 9050
port = 1080
type = socks5
[rules]
whitelist=
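With the profile in place, a render request can name it via Splash's `proxy` argument, which accepts a proxy-profile name from the proxy-profiles directory. A minimal sketch, assuming the cluster answers on localhost:8050:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050/render.html"  # assumed local aquarium/Splash endpoint

def render_via_proxy(url, profile="default"):
    """Build a render.html URL whose traffic goes through a named proxy profile."""
    # "proxy" names a profile file under proxy-profiles/ (default.ini above)
    return SPLASH + "?" + urlencode({"url": url, "proxy": profile})

if __name__ == "__main__":
    # Requires the Splash cluster to be running
    print(urlopen(render_via_proxy("http://httpbin.org/ip")).read())
```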