@twein89
2016-06-23T03:30:25.000000Z
Word count: 1321 · Views: 817
Tags: crawler, python, scrapy
This crawler is built on the Scrapy framework and consists of the following modules:
- base module
- JS data-handling module
- proxy module
Environment: pyenv-virtualenv, Python 3.5.1
PyPI packages: scrapy / pymongo / scrapy-splash
Function: parse HTML page data and store it in a database (currently MongoDB)
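The storage step can be sketched as a Scrapy item pipeline backed by pymongo. This is a minimal sketch, not the project's actual code: the class name, database name, and per-spider collection layout are all illustrative assumptions.

```python
class MongoPipeline:
    """Sketch of an item pipeline that writes scraped items into MongoDB.

    All names here (class, URI, database) are assumptions for illustration.
    """

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="scrapy_data"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.client = None

    def open_spider(self, spider):
        # Imported lazily so this module loads even without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        if self.client:
            self.client.close()

    def process_item(self, item, spider):
        # One document per item, grouped into a collection named after the spider
        self.db[spider.name].insert_one(dict(item))
        return item
```

A pipeline like this would be enabled through Scrapy's `ITEM_PIPELINES` setting.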
Run the spider:
$ scrapy crawl <spidername>
There are two approaches to JS rendering: Splash and PhantomJS.
See the scrapy-splash GitHub repository.
Install Splash under Docker.
Package dependency:
$ pip install scrapy-splash
Run the Splash service:
$ docker run -p 8050:8050 scrapinghub/splash
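Once the container is up, Splash can also be exercised directly through its HTTP API, independent of scrapy-splash. A minimal sketch, assuming the service is on localhost:8050 as started above:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumed: the local container started above

def splash_render_url(target_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint (returns JS-rendered HTML)."""
    qs = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(SPLASH_URL, qs)

if __name__ == "__main__":
    # Requires the Splash container to be running
    html = urlopen(splash_render_url("http://example.com")).read()
    print(html[:200])
```

The `wait` argument gives the page's JavaScript time to run before the HTML is captured.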
Install PhantomJS globally via npm:
$ npm -g install phantomjs
Asynchronous calls are implemented by extending the downloader handler;
downloader.py invokes the JS rendering script render.js.
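A simplified, synchronous sketch of the PhantomJS path (the real downloader handler is asynchronous): invoke render.js under the phantomjs binary and read the rendered HTML from stdout. The function names are assumptions; only `phantomjs` and `render.js` come from the text above, and the script is assumed to take the target URL as its first argument.

```python
import subprocess

PHANTOMJS = "phantomjs"      # assumes PhantomJS is on PATH (npm -g install phantomjs)
RENDER_SCRIPT = "render.js"  # the rendering script referenced above

def render_cmd(url):
    # Command line handed to PhantomJS; render.js receives the URL as an argument
    return [PHANTOMJS, RENDER_SCRIPT, url]

def render(url, timeout=30):
    """Run render.js under PhantomJS and return the rendered HTML from stdout."""
    out = subprocess.run(render_cmd(url), capture_output=True,
                         timeout=timeout, check=True)
    return out.stdout.decode("utf-8")
```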
a. Implemented by extending middlewares.py: each request is sent through a proxy IP picked at random from proxy.txt.
To verify that the proxies work, run:
$ python proxychecker.py
Unverified proxy list: proxylist.txt
Verified proxy list: proxy.txt
b. When scraping through Splash, the proxy only needs to be configured in Splash.
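Scheme (a) can be sketched as a downloader middleware. The class name is an assumption; proxy.txt is assumed to hold one `host:port` entry per line, as produced by the checker above.

```python
import random

class RandomProxyMiddleware:
    """Sketch of a downloader middleware that picks a random verified proxy.

    Names here are assumptions, not the project's exact middlewares.py code.
    """

    def __init__(self, proxies=None, path="proxy.txt"):
        if proxies is None:
            # Load the verified list: one "host:port" per line
            with open(path) as f:
                proxies = [line.strip() for line in f if line.strip()]
        self.proxies = proxies

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever request.meta["proxy"] names
        if self.proxies:
            request.meta["proxy"] = "http://" + random.choice(self.proxies)
```

A middleware like this would be registered in the `DOWNLOADER_MIDDLEWARES` setting.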
$ curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
Edit .bash_profile as prompted, then install the interpreter:
$ pyenv install 3.5.1
The Splash cluster setup follows flyingelephantlab/aquarium on GitHub.
First, make sure Docker and Docker Compose are installed.
Then install cookiecutter:
$ pip install cookiecutter
Then generate a folder with config files:
$ cookiecutter gh:flyingelephantlab/aquarium
With all default options it will create an aquarium folder in the current path. Go to this folder and start the Splash cluster:
$ cd ./aquarium
$ docker-compose up
After deployment, configure the proxy. For example, if the proxy is 192.168.199.177:1080 with type socks5,
edit ~/aquarium/proxy-profiles/default.ini as follows:
; enable tor for .onion links
[proxy]
;host = tor
;host = 192.168.99.100
host = 192.168.199.177
;port = 9050
port = 1080
type = socks5
[rules]
whitelist=
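With the profile in place, a render request can name it via Splash's `proxy` argument, which accepts a proxy-profile name from the proxy-profiles directory. A minimal sketch, assuming the cluster answers on localhost:8050:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH = "http://localhost:8050/render.html"  # assumed local aquarium/Splash endpoint

def render_via_proxy(url, profile="default"):
    """Build a render.html URL whose traffic goes through a named proxy profile."""
    # "proxy" names a profile file under proxy-profiles/ (default.ini above)
    return SPLASH + "?" + urlencode({"url": url, "proxy": profile})

if __name__ == "__main__":
    # Requires the Splash cluster to be running
    print(urlopen(render_via_proxy("http://httpbin.org/ip")).read())
```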