@Wayne-Z
2017-11-19T10:18:09.000000Z
NLP
word2vec
There are plenty of posts about this online, but they are all nearly identical, and they all leave one key point unexplained: in Python 3.5.2, the distinction between the bytes and str types means the code downloaded straight from GitHub would not run on my machine. After a few rounds of quick (and painful) trial and error I did find a solution, so I am posting it here (yes, I really am that much of a newbie).
OK, let's start with downloading the corpora.
English wiki corpus: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Chinese wiki corpus: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
text8: http://mattmahoney.net/dc/text8.zip
While the corpora are downloading, open process_wiki.py, the script that extracts plain text from the wiki dump, and tweak a few lines. The original reads:
__author__ = 'huang'

import os
import logging
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = ' '
    i = 0
    output = open(outp, 'w')
    # dictionary={} skips the vocabulary-building pass; we only want the text
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + '\n')
        i = i + 1
        if i % 10000 == 0:
            logger.info('Saved ' + str(i) + ' articles')
    output.close()
    logger.info('Finished ' + str(i) + ' articles')
Yep, so far the only change is a few deleted blank lines. Once the corpora have finished downloading, put everything in the same folder and try running the following in cmd:
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
If it runs, congratulations! But it did not run on my setup, which is the reason this post exists. The problem shows up here:
(C:\Anaconda3) E:\NLP\word2vec-for-wiki-master>python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 30, in <module>
output.write(space.join(text).decode() + '\n')
TypeError: sequence item 0: expected str instance, bytes found
A quick search showed that in Python 3, bytes and str are two different types. The original article mentions this too, but its solution still left me baffled (yes, that much of a newbie). After rereading the earlier articles and checking the docs, the picture became clear: str.join() only accepts str items, bytes.join() returns bytes, and write() on a file opened in text mode only accepts str. I first tried (read: flailed through) the following variants, and every one of them failed:
Traceback (most recent call last):
File "process_wiki.py", line 30, in <module>
output.write(bytes.join(space,text).decode() + '\n')
TypeError: descriptor 'join' requires a 'bytes' object but received a 'str'
(C:\Anaconda3) E:\NLP\word2vec-for-wiki-master>python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 30, in <module>
output.write(bytes.join(space.encode(),text).decode() + '\n')
UnicodeEncodeError: 'gbk' codec can't encode character '\u1f00' in position 1714: illegal multibyte sequence
(C:\Anaconda3) E:\NLP\word2vec-for-wiki-master>python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 30, in <module>
output.write(bytes.join(''.encode(),text).decode() + '\n')
UnicodeEncodeError: 'gbk' codec can't encode character '\u1f00' in position 1474: illegal multibyte sequence
(C:\Anaconda3) E:\NLP\word2vec-for-wiki-master>python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 30, in <module>
output.write(bytes.join(b'',text).decode() + '\n')
UnicodeEncodeError: 'gbk' codec can't encode character '\u1f00' in position 1474: illegal multibyte sequence
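Before getting to the fix that finally worked, here is a minimal demo (my own snippet, not from the original post) of the type rules these attempts kept running into; it assumes, as the tracebacks suggest, that get_texts() yielded UTF-8 bytes tokens on this gensim version:
# Demo of the Python 3 bytes/str rules behind the errors above.
words = [b'anarchism', b'originated']  # bytes tokens, like old gensim's get_texts()

print(b' '.join(words))  # fine: bytes separator joins bytes items

try:
    ' '.join(words)  # str separator, bytes items
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, bytes found

try:
    open('demo.txt', 'w').write(b' '.join(words))  # text-mode write() wants str
except TypeError as e:
    print(e)  # write() argument must be str, not bytes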
Note where the last attempts died: join() and decode() both succeeded, but write() then had to encode the resulting str with the file's default codec, which on a Chinese Windows installation is GBK and cannot represent characters such as '\u1f00'. In the end I rewrote the separator line as
space = ' '.encode()
and first tested with the following:
data = space.join(text)
print(data)
#output.write(str(data) + '\n')
The output appeared in cmd. But print() can display bytes directly, while write() only accepts str, so the final running version became:
data = space.join(text)
output.write(str(data) + '\n')
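One caveat worth flagging (my observation, not part of the original post): str(data) on a bytes object produces its literal representation, so the b'...' prefix, the quotes, and \xNN escapes for every non-ASCII byte all land in the output file; those escapes are also why the GBK codec stops complaining. A cleaner sketch, assuming the tokens really are UTF-8 bytes, decodes explicitly and forces UTF-8 on the output file:
# Decode each article to str and open the file as UTF-8, so neither
# join() nor write() trips over bytes vs. str or the GBK default.
output = open(outp, 'w', encoding='utf-8')
for text in wiki.get_texts():
    output.write(b' '.join(text).decode('utf-8') + '\n')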
And then the results:
(C:\Anaconda3) E:\NLP\word2vec-for-wiki-master>python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2016-07-28 10:48:11,057: INFO: Saved 10000 articles
2016-07-28 10:49:44,660: INFO: Saved 20000 articles
2016-07-28 10:51:04,023: INFO: Saved 30000 articles
2016-07-28 10:52:13,199: INFO: Saved 40000 articles
2016-07-28 10:53:07,548: INFO: Saved 50000 articles
2016-07-28 10:53:45,695: INFO: Saved 60000 articles
2016-07-28 10:54:18,993: INFO: Saved 70000 articles
2016-07-28 10:54:51,188: INFO: Saved 80000 articles
2016-07-28 10:55:50,520: INFO: Saved 90000 articles
···
2016-07-28 15:24:22,182: INFO: Saved 4040000 articles
2016-07-28 15:25:09,770: INFO: Saved 4050000 articles
2016-07-28 15:25:46,915: INFO: Saved 4060000 articles
2016-07-28 15:26:24,892: INFO: Saved 4070000 articles
2016-07-28 15:27:05,343: INFO: Saved 4080000 articles
2016-07-28 15:27:48,280: INFO: Saved 4090000 articles
2016-07-28 15:28:22,146: INFO: finished iterating over Wikipedia corpus of 4099408 documents with 2229304913 positions (total 16753779 articles, 2290359456 positions before pruning articles shorter than 50 words)
2016-07-28 15:28:22,155: INFO: Finished 4099408 articles
Five hours and forty minutes, and more than 4.09 million articles saved; that is quite a lot...
The extracted text file was by then too big to open in Notepad. Next, enter
python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
to start training the English wiki word vectors:
2016-07-28 15:47:35,297: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2016-07-28 15:47:35,302: INFO: collecting all words and their counts
2016-07-28 15:47:35,370: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-07-28 15:48:05,500: INFO: PROGRESS: at sentence #10000, processed 29336126 words, keeping 434884 word types
2016-07-28 15:48:39,042: INFO: PROGRESS: at sentence #20000, processed 55594275 words, keeping 628122 word types
While the training runs, open another cmd window, cd into the folder, and preprocess the Chinese wiki corpus:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
E:\NLP\word2vec-for-wiki-master>python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2016-07-28 16:28:21,686: INFO: Saved 10000 articles
2016-07-28 16:29:07,536: INFO: Saved 20000 articles
Then comes the long wait...
text8 is already plain text, so it can be trained on directly:
python train_word2vec_model.py text8 text8.model text8.vector
The result:
2016-07-28 20:03:42,295: INFO: PROGRESS: at 99.82% examples, 405001 words/s, in_qsize 12, out_qsize 3
2016-07-28 20:03:42,435: INFO: worker thread finished; awaiting finish of 7 more threads
2016-07-28 20:03:42,445: INFO: worker thread finished; awaiting finish of 6 more threads
2016-07-28 20:03:42,445: INFO: worker thread finished; awaiting finish of 5 more threads
2016-07-28 20:03:42,445: INFO: worker thread finished; awaiting finish of 4 more threads
2016-07-28 20:03:42,465: INFO: worker thread finished; awaiting finish of 3 more threads
2016-07-28 20:03:42,495: INFO: worker thread finished; awaiting finish of 2 more threads
2016-07-28 20:03:42,495: INFO: worker thread finished; awaiting finish of 1 more threads
2016-07-28 20:03:42,505: INFO: worker thread finished; awaiting finish of 0 more threads
2016-07-28 20:03:42,505: INFO: training on 85026035 raw words (62532401 effective words) took 154.3s, 405163 effective words/s
2016-07-28 20:03:42,505: INFO: saving Word2Vec object under text8.model, separately None
2016-07-28 20:03:42,505: INFO: storing numpy array 'syn0' to text8.model.syn0.npy
2016-07-28 20:03:43,506: INFO: not storing attribute syn0norm
2016-07-28 20:03:43,506: INFO: not storing attribute cum_table
2016-07-28 20:03:43,506: INFO: storing numpy array 'syn1neg' to text8.model.syn1neg.npy
2016-07-28 20:03:45,225: INFO: storing 71290x400 projection weights into text8.vector.
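For reference, train_word2vec_model.py is essentially a thin wrapper around gensim's Word2Vec class. The post never lists it, so the following is only a sketch: the 400 dimensions and 8 workers are read off the logs above, the other parameters are guesses, and save_word2vec_format is the pre-1.0 gensim API in use here:
# Sketch of a train_word2vec_model.py-style script (parameters assumed).
import logging
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)

inp, outp_model, outp_vec = sys.argv[1:4]
model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=8)
model.save(outp_model)                              # text8.model plus .npy arrays
model.save_word2vec_format(outp_vec, binary=False)  # text8.vector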
Now the model can be tested:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load('text8.model')
In [3]: model.most_similar('man')
Out[3]:
[('woman', 0.6650575399398804),
('girl', 0.5865204334259033),
('creature', 0.5350353717803955),
('boy', 0.510942816734314),
('person', 0.5094308257102966),
('men', 0.5073959827423096),
('evil', 0.48292240500450134),
('totoro', 0.47985178232192993),
('god', 0.476554274559021),
('vanity', 0.47478240728378296)]
In [4]: model.most_similar('girl')
Out[4]:
[('blonde', 0.7728073596954346),
('baby', 0.7689986824989319),
('kid', 0.7603048086166382),
('woman', 0.7313079833984375),
('girls', 0.7117128968238831),
('boy', 0.6976305246353149),
('joey', 0.6945637464523315),
('boys', 0.6894382238388062),
('bride', 0.685029149055481),
('rabbit', 0.6838369369506836)]
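Beyond nearest neighbors, the same most_similar API handles the classic word-analogy test; with a corpus as small as text8 the exact ranking will vary, so no output is shown here:
In [5]: model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)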
Processing the English wiki takes so long that I did not see its training through to the end. Moving on: after preprocessing, the Chinese wiki corpus still needs Traditional-to-Simplified conversion. The article I was following, and the internet at large, favor opencc. After downloading it from GitHub and reading through its installation guide, I found it was not a great fit for my Windows 10 setup, so I fell back on Licstar's article and downloaded the opencc0.4.2-win32 build directly. Per the install notes, after unzipping you still need to add it to PATH before it can be called from cmd. Then type
opencc -help
to check whether the installation succeeded. If the output is
Open Chinese Convert (OpenCC) Command Line Tool
Version 0.4.2
Author: BYVoid <byvoid@byvoid.com>
Bug Report: http://github.com/BYVoid/OpenCC/issues
Usage:
opencc [Options]
Options:
-i [file], --input=[file] Read original text from [file].
-o [file], --output=[file] Write converted text to [file].
-c [file], --config=[file] Load configuration of conversion from [file].
-v, --version Print version and build information.
-h, --help Print this help.
With no input file, reads standard input and writes converted stream to standard output.
Default configuration(zhs2zht.ini) will be loaded if not set.
then it works. Now run the conversion directly (the -c zht2zhs.ini config selects Traditional-to-Simplified):
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
About three minutes later the conversion is done. If jieba is not installed yet, install it first:
pip install jieba
Once it is installed, run the segmentation step:
python separate_words.py wiki.zh.text.jian wiki.zh.text.jian.seq
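separate_words.py is not listed in the post either; here is a minimal sketch of what such a jieba segmentation script usually looks like (the structure is my assumption, only the filenames come from the command above):
# Sketch: segment the simplified-Chinese text line by line with jieba
# and write space-separated tokens, one article per line.
import sys

import jieba

inp, outp = sys.argv[1:3]
with open(inp, 'r', encoding='utf-8') as fin, open(outp, 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')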
Then wait for it to finish and start training on the segmented text:
python train_word2vec_model.py wiki.zh.text.jian.seq wiki.zh.text.model wiki.zh.text.vector
Test after training completes, the same way as before.
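For example (a hypothetical session; the model filename follows from the training command above):
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load('wiki.zh.text.model')
In [3]: model.most_similar('足球')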