@hanxiaoyang 2016-11-08T23:15:26.000000Z

Training a Chinese word2vec model

word2vec


1. Preparing and preprocessing the data

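First you need a fairly large Chinese corpus. Chinese Wikipedia is one option (the Sogou news corpus is also worth trying). The Chinese Wikipedia dump is available at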
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

The Chinese Wikipedia data is not very large: the compressed XML file is about 1 GB. First process this compressed XML file with process_wiki_data.py by running: python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # process_wiki_data.py: parse the Wikipedia XML dump and convert it to plain text
    import logging
    import os.path
    import sys

    from gensim.corpora import WikiCorpus

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 3:
            print("Usage: python %s <wiki dump .xml.bz2> <output text file>" % program)
            sys.exit(1)
        inp, outp = sys.argv[1:3]
        space = " "
        i = 0

        output = open(outp, 'w')
        # dictionary={} skips building a vocabulary; only the article text is needed here.
        # lemmatize=False matches the gensim release used in 2016; newer releases may
        # no longer accept this argument.
        wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
        for text in wiki.get_texts():
            # one article per line, tokens separated by single spaces
            output.write(space.join(text) + "\n")
            i = i + 1
            if (i % 10000 == 0):
                logger.info("Saved " + str(i) + " articles")
        output.close()
        logger.info("Finished Saved " + str(i) + " articles")

This produces the following log output:

    2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
    2016-08-11 20:40:08,329: INFO: Saved 10000 articles
    2016-08-11 20:40:45,501: INFO: Saved 20000 articles
    2016-08-11 20:41:23,659: INFO: Saved 30000 articles
    2016-08-11 20:42:01,748: INFO: Saved 40000 articles
    2016-08-11 20:42:33,779: INFO: Saved 50000 articles
    ......
    2016-08-11 20:55:23,094: INFO: Saved 200000 articles
    2016-08-11 20:56:14,692: INFO: Saved 210000 articles
    2016-08-11 20:57:04,614: INFO: Saved 220000 articles
    2016-08-11 20:57:57,979: INFO: Saved 230000 articles
    2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
    2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles

In Python, jieba can handle word segmentation, producing the segmented file wiki.zh.text.seg.
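
The segmentation step is not shown in the original, so here is a minimal sketch of how it might look, written for Python 3. The function names `segment_line` and `segment_file` are hypothetical helpers introduced for illustration; `jieba.cut` is jieba's standard tokenizer entry point. The tokenizer is passed in as a parameter so that a different segmenter could be swapped in.

```python
import io
import sys


def segment_line(line, cut):
    """Join the tokens produced by `cut` with single spaces.

    Whitespace-only tokens are dropped, so the spaces already present in
    wiki.zh.text do not turn into empty words.
    """
    return u" ".join(tok for tok in cut(line.strip()) if tok.strip())


def segment_file(src_path, dst_path, cut):
    """Segment a UTF-8 text file line by line (one article per line)."""
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(segment_line(line, cut) + u"\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    import jieba  # third-party package, assumed to be installed

    # e.g. python segment.py wiki.zh.text wiki.zh.text.seg
    segment_file(sys.argv[1], sys.argv[2], jieba.cut)
```

Processing the file line by line keeps memory use flat regardless of how large the corpus is.
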
Then train with the word2vec tool:
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # train_word2vec_model.py: train a word2vec model on the segmented corpus
    # Written against the gensim release used in 2016; newer releases rename
    # some of these arguments and methods.
    import logging
    import multiprocessing
    import os.path
    import sys

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 4:
            print("Usage: python %s <segmented corpus> <model file> <vector file>" % program)
            sys.exit(1)
        inp, outp1, outp2 = sys.argv[1:4]

        # 400-dimensional vectors, context window of 5, drop words seen fewer
        # than 5 times, one worker thread per CPU core
        model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                         workers=multiprocessing.cpu_count())

        # trim unneeded model memory = use (much) less RAM
        # model.init_sims(replace=True)

        model.save(outp1)  # full model, reloadable with Word2Vec.load
        model.save_word2vec_format(outp2, binary=False)  # plain-text vectors
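
The training script hands `LineSentence(inp)` to `Word2Vec`, which expects an iterable of token lists. As a rough illustration of what that input looks like, the sketch below is a simplified stand-in, not gensim's implementation; the class name `SegmentedCorpus` is hypothetical, and the real `LineSentence` has additional behaviour not reproduced here.

```python
import io


class SegmentedCorpus:
    """Iterate over a segmented corpus file: one line in, one token list out."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-opening the file on every call lets the corpus be scanned
        # several times (e.g. once to count words, again to train).
        with io.open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens
```

A plain generator function would be exhausted after one pass, which is why this is written as a class with `__iter__`: a vocabulary scan followed by training needs to read the corpus more than once.
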

Run log:

    2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
    2016-08-12 09:50:02,592: INFO: collecting all words and their counts
    2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
    2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
    2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
    2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
    ...
    2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
    2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
    2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
    2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
    2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5
    2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words
    2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25
    2016-08-12 09:52:29,683: INFO: resetting layer weights
    2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
    2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
    2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
    2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
    2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
    2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
    2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
    2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
    ......
    2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
    2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
    2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
    2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
    2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
    2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
    2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm
    2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
    2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
    2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector
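
The last log line reports 278291x400 projection weights written to wiki.zh.text.vector. Since the script saves with binary=False, that file should be in the plain-text word2vec format: a header line with the vocabulary size and vector dimension, then one line per word followed by its components. The sketch below reads such a file without gensim; `read_word2vec_text` and its `limit` parameter are hypothetical names introduced for illustration.

```python
import io


def read_word2vec_text(path, limit=None):
    """Parse a plain-text word2vec file.

    Returns (vocab_size, dim, vectors), where vectors maps each word to a
    list of floats. `limit` stops after that many words, which is handy
    for peeking at a large file.
    """
    vectors = {}
    with io.open(path, encoding="utf-8") as handle:
        vocab_size, dim = (int(x) for x in handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError("expected %d values for %r, got %d"
                                 % (dim, word, len(values)))
            vectors[word] = values
            if limit is not None and len(vectors) >= limit:
                break
    return vocab_size, dim, vectors
```

Calling `read_word2vec_text("wiki.zh.text.vector", limit=5)` would be a quick way to confirm the header and inspect the first few vectors.
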

Testing the model:

    In [1]: import gensim
    In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
    In [3]: model.most_similar(u"足球")
    Out[3]:
    [(u'\u8054\u8d5b', 0.6553816199302673),
     (u'\u7532\u7ea7', 0.6530429720878601),
     (u'\u7bee\u7403', 0.5967546701431274),
     (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
     (u'\u4e59\u7ea7', 0.5840631723403931),
     (u'\u8db3\u7403\u961f', 0.5560152530670166),
     (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
     (u'allsvenskan', 0.5249762535095215),
     (u'\u4ee3\u8868\u961f', 0.5214947462081909),
     (u'\u7532\u7ec4', 0.5177896022796631)]
    In [4]: result = model.most_similar(u"足球")
    In [5]: for e in result:
        print e[0], e[1]
       ....:
    联赛 0.65538161993
    甲级 0.653042972088
    篮球 0.596754670143
    俱乐部 0.587228953838
    乙级 0.58406317234
    足球队 0.556015253067
    亚足联 0.530800580978
    allsvenskan 0.52497625351
    代表队 0.521494746208
    甲组 0.51778960228
    In [6]: result = model.most_similar(u"男人")
    In [7]: for e in result:
        print e[0], e[1]
       ....:
    女人 0.77537125349
    家伙 0.617369174957
    妈妈 0.567102909088
    漂亮 0.560832381248
    잘했어 0.540875017643
    谎言 0.538448691368
    爸爸 0.53660941124
    傻瓜 0.535608053207
    예쁘다 0.535151124001
    mc 0.529670000076
    In [8]: result = model.most_similar(u"女人")
    In [9]: for e in result:
        print e[0], e[1]
       ....:
    男人 0.77537125349
    我的某 0.589010596275
    妈妈 0.576344847679
    잘했어 0.562340974808
    美丽 0.555426716805
    爸爸 0.543958246708
    新娘 0.543640494347
    谎言 0.540272831917
    妞儿 0.531066179276
    老婆 0.528521537781
    In [10]: result = model.most_similar(u"青蛙")
    In [11]: for e in result:
        print e[0], e[1]
       ....:
    老鼠 0.559612870216
    乌龟 0.489831030369
    蜥蜴 0.478990525007
     0.46728849411
    鳄鱼 0.461885392666
    蟾蜍 0.448014199734
    猴子 0.436584025621
    白雪公主 0.434905380011
    蚯蚓 0.433413207531
    螃蟹 0.4314712286
    In [12]: result = model.most_similar(u"姨夫")
    In [13]: for e in result:
        print e[0], e[1]
       ....:
    堂伯 0.583935439587
    祖父 0.574735701084
    妃所生 0.569327116013
    内弟 0.562012672424
    早卒 0.558042645454
     0.553856015205
    胤祯 0.553288519382
    陈潜 0.550716996193
    愔之 0.550510883331
    叔父 0.550032019615
    In [14]: result = model.most_similar(u"衣服")
    In [15]: for e in result:
        print e[0], e[1]
       ....:
    鞋子 0.686688780785
    穿着 0.672499775887
    衣物 0.67173999548
    大衣 0.667605519295
    裤子 0.662670075893
    内裤 0.662210345268
    裙子 0.659705817699
    西装 0.648508131504
    洋装 0.647238850594
    围裙 0.642895817757
    In [16]: result = model.most_similar(u"公安局")
    In [17]: for e in result:
        print e[0], e[1]
       ....:
    司法局 0.730189085007
    公安厅 0.634275555611
    公安 0.612798035145
    房管局 0.597343325615
    商业局 0.597183346748
    军管会 0.59476184845
    体育局 0.59283208847
    财政局 0.588721752167
    戒毒所 0.575558543205
    新闻办 0.573395550251
    In [18]: result = model.most_similar(u"铁道部")
    In [19]: for e in result:
        print e[0], e[1]
       ....:
    盛光祖 0.565509021282
    交通部 0.548688530922
    批复 0.546967327595
    刘志军 0.541010737419
    立项 0.517836689949
    报送 0.510296344757
    计委 0.508456230164
    水利部 0.503531932831
    国务院 0.503227233887
    经贸委 0.50156635046
    In [20]: result = model.most_similar(u"清华大学")
    In [21]: for e in result:
        print e[0], e[1]
       ....:
    北京大学 0.763922810555
    化学系 0.724210739136
    物理系 0.694550514221
    数学系 0.684280991554
    中山大学 0.677202701569
    复旦 0.657914161682
    师范大学 0.656435549259
    哲学系 0.654701948166
    生物系 0.654403865337
    中文系 0.653147578239
    In [22]: result = model.most_similar(u"卫视")
    In [23]: for e in result:
        print e[0], e[1]
       ....:
    湖南 0.676812887192
    中文台 0.626506924629
    収蔵 0.621356606483
    黄金档 0.582251906395
    cctv 0.536769032478
    安徽 0.536752820015
    非同凡响 0.534517168999
    唱响 0.533438682556
    最强音 0.532605051994
    金鹰 0.531676828861
    In [24]: result = model.most_similar(u"习近平")
    In [25]: for e in result:
        print e[0], e[1]
       ....:
    胡锦涛 0.809472680092
    江泽民 0.754633367062
    李克强 0.739740967751
    贾庆林 0.737033963203
    曾庆红 0.732847094536
    吴邦国 0.726941585541
    总书记 0.719057679176
    李瑞环 0.716384887695
    温家宝 0.711952567101
    王岐山 0.703570842743
    In [26]: result = model.most_similar(u"林丹")
    In [27]: for e in result:
        print e[0], e[1]
       ....:
    黄综翰 0.538035452366
    蒋燕皎 0.52646958828
    刘鑫 0.522252976894
    韩晶娜 0.516120731831
    王晓理 0.512289524078
    王适 0.508560419083
    杨影 0.508159279823
    陈跃 0.507353425026
    龚智超 0.503159761429
    李敬元 0.50262516737
    In [28]: result = model.most_similar(u"语言学")
    In [29]: for e in result:
        print e[0], e[1]
       ....:
    社会学 0.632598280907
    人类学 0.623406708241
    历史学 0.618442356586
    比较文学 0.604823827744
    心理学 0.600066184998
    人文科学 0.577783346176
    社会心理学 0.575571238995
    政治学 0.574541330338
    地理学 0.573896467686
    哲学 0.573873817921
    In [30]: result = model.most_similar(u"计算机")
    In [31]: for e in result:
        print e[0], e[1]
       ....:
    自动化 0.674171924591
    应用 0.614087462425
    自动化系 0.611132860184
    材料科学 0.607891201973
    集成电路 0.600370049477
    技术 0.597518980503
    电子学 0.591316461563
    建模 0.577238917351
    工程学 0.572855889797
    微电子 0.570086717606
    In [32]: model.similarity(u"计算机", u"自动化")
    Out[32]: 0.67417196002404789
    In [33]: model.similarity(u"女人", u"男人")
    Out[33]: 0.77537125129824813
    In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    Out[34]: u'\u4e2d\u5fc3'
    In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
    中心
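
The scores in the session above are cosine similarities between word vectors: `similarity` compares two words, and `most_similar` ranks the rest of the vocabulary by the same measure. The sketch below reimplements these ideas in plain Python 3 on a toy, made-up set of 3-dimensional vectors, purely to show the arithmetic. It is not gensim's code, and the `doesnt_match` here follows my understanding of the approach (pick the word least similar to the mean of the unit-length vectors) rather than a verified copy of the library's logic.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def doesnt_match(vectors, words):
    """Return the word least similar to the mean of the unit-length vectors."""
    units = {}
    for w in words:
        norm = math.sqrt(sum(a * a for a in vectors[w]))
        units[w] = [a / norm for a in vectors[w]]
    dim = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dim)]
    return min(words, key=lambda w: cosine(units[w], mean))


# Toy vectors invented for illustration; real ones have 400 dimensions.
toy = {
    "早餐": [0.90, 0.10, 0.00],
    "午餐": [0.80, 0.20, 0.10],
    "晚餐": [0.85, 0.15, 0.05],
    "中心": [0.00, 0.20, 0.90],
}

print(most_similar(toy, "早餐", topn=2))
print(doesnt_match(toy, ["早餐", "晚餐", "午餐", "中心"]))
```

On the toy data the three meal words point in nearly the same direction, so the word that stands apart is the one whose vector points elsewhere, mirroring the 早餐/晚餐/午餐/中心 example in the session.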