@Dukebf
2017-07-12T00:07:01.000000Z
python
lxml is a Python module for parsing HTML. It works best together with the cssselect module, which lets you query nodes with CSS selectors.
Installation:
git clone https://github.com/lxml/lxml.git lxml
cd lxml
python setup.py install
python -m pip install cssselect
Assumption
Suppose there is a website http://demo.org whose source is as follows:
<html>
<body>
<div>
<ul>
<li class="item-0" title='0.3'><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</body>
</html>
Use lxml's html module to parse the HTML string, select the nodes with class=item-0, and print the first one.
Example:
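import lxml.html
import requests
response = requests.get('http://demo.org')
# fromstring expects the page source (bytes or str), not the Response object
tree = lxml.html.fromstring(response.content)
# select every <li> whose class list contains item-0 (needs cssselect installed)
item = tree.cssselect("li.item-0")
# encoding='unicode' returns text rather than bytes, so print shows it cleanly
print(lxml.html.tostring(item[0], method='text', encoding='unicode'))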
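lxml's html.tostring method serializes a node to a string, which is what you print. The examples below show Python 2 output; under Python 3 the calls without encoding='unicode' return bytes (b'...') and the unicode results have no u prefix: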
>>> from lxml import html
>>> root = html.fragment_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(root)
'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='html')
'<p>Hello<br>world!</p>'
>>> html.tostring(root, method='xml')
'<p>Hello<br/>world!</p>'
>>> html.tostring(root, method='text')
'Helloworld!'
>>> html.tostring(root, method='text', encoding='unicode')
u'Helloworld!'
>>> root = html.fragment_fromstring('<div><p>Hello<br>world!</p>TAIL</div>')
>>> html.tostring(root[0], method='text', encoding='unicode')
u'Helloworld!TAIL'
>>> html.tostring(root[0], method='text', encoding='unicode', with_tail=False)
u'Helloworld!'
>>> doc = html.document_fromstring('<p>Hello<br>world!</p>')
>>> html.tostring(doc, method='html', encoding='unicode')
u'<html><body><p>Hello<br>world!</p></body></html>'
>>> # Read an attribute from a selected node; tree is the document parsed in the earlier example
>>> item0 = tree.cssselect(".item-0")[0]
>>> value = item0.get('title')
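The attribute lookup and the text serialization can be combined on a single node. The following is a small sketch under Python 3, reusing the fragment from the examples above with `class` and `title` attributes added for illustration; the variable names are assumptions made here.

```python
from lxml import html

root = html.fragment_fromstring(
    '<div><p class="item-0" title="0.3">Hello<br>world!</p>TAIL</div>'
)
node = root[0]  # the <p> element inside the wrapping <div>

# get() returns the attribute value, or the fallback when the attribute is absent
title = node.get("title")
missing = node.get("href", "n/a")

# Without an encoding argument tostring returns bytes under Python 3
as_bytes = html.tostring(node, method="text", with_tail=False)

# encoding="unicode" returns a str; the tail text is included unless with_tail=False
as_text = html.tostring(node, method="text", encoding="unicode", with_tail=False)
with_tail = html.tostring(node, method="text", encoding="unicode")

print(title, missing, as_bytes, as_text, with_tail)
```

Passing with_tail=False is usually what you want when extracting the text of one node, because the tail belongs to the parent's content rather than to the node itself.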
Suppose there is a file demo.html with the same content as demo.org.
Use lxml's etree.parse to read the file, for example:
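from lxml import etree
# etree.parse uses the XML parser by default; HTMLParser tolerates real-world HTML
html = etree.parse('demo.html', etree.HTMLParser())
# encoding='unicode' returns text instead of bytes
result = etree.tostring(html, pretty_print=True, encoding='unicode')
print(result)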
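Once the file is parsed, the resulting tree can be queried directly. The sketch below writes a shortened copy of the demo markup to a temporary demo.html so that it runs on its own, then pulls out link targets with XPath. The temporary path, `DEMO_HTML`, and the XPath expressions are assumptions chosen for illustration.

```python
import os
import tempfile
from lxml import etree

# Shortened, assumed copy of the demo page
DEMO_HTML = """<html><body><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div></body></html>"""

# Write a throwaway demo.html so the example does not depend on an existing file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.html")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(DEMO_HTML)

# Parse the file with the HTML parser and query the resulting tree
doc = etree.parse(path, etree.HTMLParser())
hrefs = doc.xpath("//li/a/@href")                            # every link target
first_text = doc.xpath("string(//li[@class='item-0']/a)")    # text of the first match

print(hrefs)
print(first_text)
```

XPath is used here because etree.parse returns an ElementTree; the cssselect method shown earlier is available on the elements produced by lxml.html.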