@buoge
2017-06-05T14:50:41.000000Z
Python
https://github.com/dongweiming/daenerys
https://www.zhihu.com/question/20899988/answer/124261539
https://zhuanlan.zhihu.com/p/21479334
https://github.com/luyishisi/Anti-Anti-Spider
TensorFlow Chinese-language resource roundup:
https://www.urlteam.org/2017/03/tensorflow-%E8%B5%84%E6%BA%90%E5%A4%A7%E5%85%A8-%E4%B8%AD%E6%96%87%E7%89%88/
https://github.com/awolfly9/IPProxyTool
https://github.com/lepture/captcha
https://github.com/coleifer/huey
requests+lxml+xpath
https://github.com/dongweiming/mtime
https://xlzd.me/2016/01/12/python-crawler-08
https://zhuanlan.zhihu.com/p/25633789
https://juejin.im/post/58dce2248d6d8100613a4cfb
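Fiddler: inspect request parameters and monitor request/response traffic
Scraping rendered pages is not always the best approach; calling the app's API is often a very good alternative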
https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py
https://llimllib.github.io/bloomfilter-tutorial/
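In a crawler, a Bloom filter is typically used to remember which URLs have already been visited without storing every URL. The following is a minimal sketch under that assumption, not the implementation from the tutorial linked above. The class name `BloomFilter`, the default sizes, and the salted SHA-256 hashing scheme are illustrative choices.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter for remembering already-crawled URLs (illustrative)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest with the index.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("https://example.com/page/1")
```

A membership test such as `"https://example.com/page/1" in seen` never gives a false negative, but it can give a false positive once the bit array fills up, so `size_bits` should be chosen for the expected number of URLs.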
https://github.com/camsong/blog/issues/2
SlimIt is a JavaScript minifier written in Python. It compiles JavaScript into more compact code so that it downloads and runs faster.
SlimIt also provides a library that includes a JavaScript parser, lexer, pretty printer and a tree visitor.
http://www.bjhee.com/celery.html
http://funhacks.net/2016/12/13/celery/
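Beginners are better off writing their own crawler rather than using a framework; later on, a framework can be used to improve efficiency
Crawling Douban with Scrapy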
http://www.ituring.com.cn/article/114408
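* High-anonymity (elite) proxies
* Hidden input fields, with fake or test data returned
* Check the robots.txt file
* Honeypots: hidden <a> links; an IP that requests one can be judged an unfriendly visitor
* In most cases the IPs of large companies' crawlers can be obtained, so those crawlers can be identified
* Once you are identified as a crawler, you are served test data or fake data, looped data, or partial data
* Blocking policies, including manual blocking
* Your machine may be a compromised host (a zombie), so a ban is not permanent either and can be lifted after a while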
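Two items in the list above, checking robots.txt and avoiding hidden honeypot links, can be sketched with the standard library alone. This is a rough illustration: the helper names `allowed_by_robots` and `VisibleLinkCollector` are hypothetical, and the hidden-link check only looks at the `hidden` attribute and inline `style`, so it will miss links hidden through external CSS or JavaScript.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class VisibleLinkCollector(HTMLParser):
    """Collect href values, setting aside <a> tags that look hidden."""

    def __init__(self):
        super().__init__()
        self.links = []    # links that appear visible
        self.skipped = []  # links that look like honeypots

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attributes
            or "display:none" in style
            or "visibility:hidden" in style
        )
        (self.skipped if hidden else self.links).append(href)


robots = "User-agent: *\nDisallow: /private/\n"
page = '<a href="/ok">ok</a><a href="/trap" style="display: none">x</a>'
collector = VisibleLinkCollector()
collector.feed(page)
```

With this sample input, `collector.links` holds only `/ok`, and `allowed_by_robots(robots, "my-crawler", "https://example.com/private/data")` reports that the path is disallowed. In a real crawler the robots.txt text would be downloaded from the target site first.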
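from multiprocessing import Pool        # process-based pool
from multiprocessing.dummy import Pool  # thread-based pool with the same API
Use whichever one is faster. Ever since, I have tried to write code in a way that works with both, which makes switching between multithreading and multiprocessing very convenient.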
FROM https://zhuanlan.zhihu.com/p/22246193
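A minimal sketch of that "compatible" style: because both pools expose the same `map` interface, the choice can be reduced to a single flag. The names `fetch_length`, `run`, and `use_threads` are hypothetical, and `fetch_length` is only a stand-in for real download or parsing work.

```python
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool


def fetch_length(url):
    # Placeholder for real work (an HTTP request, parsing, and so on).
    return len(url)


def run(urls, use_threads=True, workers=4):
    """Map fetch_length over urls with a thread pool or a process pool."""
    pool_cls = ThreadPool if use_threads else ProcessPool
    with pool_cls(workers) as pool:
        return pool.map(fetch_length, urls)


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/bb"]
    print(run(urls, use_threads=True))   # threads: usually better for I/O-bound work
    print(run(urls, use_threads=False))  # processes: usually better for CPU-bound work
```

The worker function is defined at module level so that the process pool can pickle it; timing both variants on a realistic workload is the simplest way to decide which one to keep.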
https://github.com/istresearch/scrapy-cluster