@Dukebf
2017-07-11T16:07:01.000000Z
python
lxml is a Python module for parsing HTML. It works best together with the cssselect module, which lxml needs for CSS selector queries.
Installation:
    git clone https://github.com/lxml/lxml.git lxml
    cd lxml
    python setup.py install
    python -m pip install cssselect
Assumption
Suppose there is a website, http://demo.org, whose source is as follows:
    <html>
      <body>
        <div>
          <ul>
            <li class="item-0" title='0.3'><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
          </ul>
        </div>
      </body>
    </html>
Usage
Use lxml's html module to parse the HTML string, select the nodes with class item-0, and print the first of them.
Example:
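    import lxml.html
    import requests

    # fromstring needs the markup itself, not the Response object
    html = requests.get('http://demo.org').text
    tree = lxml.html.fromstring(html)
    # select every <li> with class "item-0" (requires the cssselect package)
    item = tree.cssselect("li.item-0")
    # serialize the first match as plain text
    print(lxml.html.tostring(item[0], method='text', encoding='unicode'))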
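lxml's html.tostring method serializes a node to a string so that it can be printed. Some usage examples follow (the output shown is from Python 2; Python 3 shows the byte strings with a b prefix and the text strings without the u prefix):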
    >>> from lxml import html
    >>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')
    >>> html.tostring(root)
    '<p>Hello<br>world!</p>'
    >>> html.tostring(root, method='html')
    '<p>Hello<br>world!</p>'
    >>> html.tostring(root, method='xml')
    '<p>Hello<br/>world!</p>'
    >>> html.tostring(root, method='text')
    'Helloworld!'
    >>> html.tostring(root, method='text', encoding='unicode')
    u'Helloworld!'
    >>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
    >>> html.tostring(root[0], method='text', encoding='unicode')
    u'Helloworld!TAIL'
    >>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
    u'Helloworld!'
    >>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
    >>> html.tostring(doc, method='html', encoding='unicode')
    u'<html><body><p>Hello<br>world!</p></body></html>'
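To read an attribute of a node, call its get method. cssselect is a method of the parsed tree, not a function of lxml.html:
    >>> item0 = tree.cssselect(".item-0")[0]
    >>> value = item0.get('title')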
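A slightly fuller sketch of attribute access, assuming a shortened two-item copy of the demo list. FRAGMENT and rows are illustrative names only. It shows that get accepts a default for attributes that are missing, which is the case for title on the second item.

```python
import lxml.html

# Shortened copy of the demo list; only the first <li> carries a title attribute.
FRAGMENT = (
    '<ul>'
    '<li class="item-0" title="0.3"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul>'
)

ul = lxml.html.fragment_fromstring(FRAGMENT)

rows = []
for li in ul.iter("li"):
    link = li.find("a")  # first direct <a> child of this <li>
    rows.append({
        "class": li.get("class"),
        "title": li.get("title", "n/a"),  # default used when the attribute is absent
        "href": link.get("href"),
    })

print(rows)
```

Without a default, get returns None for a missing attribute, so the result should be checked before it is used as a string.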
Suppose there is a file demo.html with the same content as demo.org.
Use lxml's etree.parse to read the file, for example:
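    from lxml import etree

    # etree.parse uses the strict XML parser by default; HTMLParser also accepts markup that is not well-formed
    html = etree.parse('demo.html', etree.HTMLParser())
    result = etree.tostring(html, pretty_print=True)
    # tostring returns bytes, so decode before printing
    print(result.decode('utf-8'))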
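The file-based approach can be tried end to end with a temporary file. The sketch below is an illustration under assumptions: DEMO is a shortened, hypothetical stand-in for demo.html, and its second li is deliberately left unclosed to show that etree.HTMLParser recovers from markup the default XML parser would reject.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for demo.html; the second <li> is left unclosed on purpose.
DEMO = (
    '<html><body><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul></body></html>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(DEMO)
    # The tree is built fully in memory, so it stays usable after the file is gone.
    doc = etree.parse(path, etree.HTMLParser())

# Collect the link targets in document order.
hrefs = [a.get("href") for a in doc.iter("a")]

# encoding="unicode" makes tostring return text instead of bytes.
pretty = etree.tostring(doc, pretty_print=True, method="html", encoding="unicode")

print(hrefs)
print(pretty)
```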