@kpatrick
2019-10-29T11:42:20.000000Z
Handover
conda activate /home/xiaojie/.conda/envs/xiaojie
English-Chinese bilingual corpus directory (not yet cleaned): /home/xiaojie/big_files/corpus
Data directories used for training:
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/data
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_en-zh/training/data
Tool code root directory: /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox
.
├── apis
├── bleu
├── jieba
├── mosesdecoder
├── nematus
└── subword-nmt
Main purposes of the tools:
mosesdecoder: Moses scripts (a plain download is sufficient; no compilation needed)
subword-nmt: subword (BPE) segmentation
nematus: the NMT toolkit used both for training and for serving the translation API
jieba: Chinese word segmentation (only needed for the Chinese side)
bleu: notebooks for computing BLEU manually
The zh-en and en-zh directory structures are identical, so only zh-en is described here.
Data directory:
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/data
corpus.en: English training set
corpus.zh: Chinese training set
corpus_valid.en: English validation set
corpus_valid.zh: Chinese validation set
corpus.bpe.en: English training set (BPE-segmented), the input to the model
corpus.bpe.zh: Chinese training set (BPE-segmented), the input to the model
vocab.en: English vocabulary
vocab.zh: Chinese vocabulary
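For orientation, here is a minimal sketch of how files like corpus.bpe.* and vocab.* are typically produced with subword-nmt; the authoritative commands are in scripts/preprocess.sh, and the merge count (32000) and the bpe.codes.en file name below are only illustrative:
# learn BPE merge operations on the English training set (32000 merges is an assumption)
# depending on the checkout, the scripts may live under $bpe_scripts/subword_nmt/ instead
bpe_scripts=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/subword-nmt
python $bpe_scripts/learn_bpe.py -s 32000 < corpus.en > bpe.codes.en
# apply the learned merges to produce the BPE-segmented file fed to the model
python $bpe_scripts/apply_bpe.py -c bpe.codes.en < corpus.en > corpus.bpe.en
# count the resulting subword vocabulary (the real vocab.en format follows whatever preprocess.sh does)
python $bpe_scripts/get_vocab.py < corpus.bpe.en > vocab.en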
Scripts directory:
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/scripts
.
├── bleu.py
├── evaluate.sh
├── evaluation_utils.py
├── postprocess.sh
├── preprocess.sh
├── rouge.py
├── test.py
├── train.sh
└── validate.sh
The main scripts:
preprocess.sh: preprocessing of the text before it is fed into the model
postprocess.sh: post-processing of the model output, converting it into the final target-language text
validate.sh: evaluates metrics on the validation set
train.sh: important; it defines the various hyperparameters (for the exact hyperparameter definitions, look in the nematus training scripts) and passes them to the nematus training script, which trains the translation model
Model directory:
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/model
See Section 1.3.1.
Chinese and English are preprocessed differently: Chinese requires jieba word segmentation. The corresponding script (an executable shell script) is described in Section 1.3.2.
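Just for illustration, command-line segmentation with jieba could look like the line below; the actual call (using the toolbox copy at zh_segment_home) is inside preprocess.sh, and corpus.seg.zh is a hypothetical output name:
# split each Chinese sentence into space-separated words before BPE
python -m jieba -d ' ' corpus.zh > corpus.seg.zh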
How to run:
cd /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/scripts
./preprocess.sh
Configuration directory (tool paths and the GPU configuration used for training):
/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/vars
Configuration contents:
# scripts directory of moses decoder: http://www.statmt.org/moses/
# you do not need to compile moses; a simple download is sufficient
moses_scripts=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/mosesdecoder/scripts
#scripts for subword segmentation: https://github.com/rsennrich/subword-nmt
bpe_scripts=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/subword-nmt
#nematus (theano version): https://github.com/EdinburghNLP/nematus/tree/theano
nematus_home=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/nematus
#jieba word segmentation utility: https://pypi.python.org/pypi/jieba/
#this is only required for Chinese
zh_segment_home=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/jieba
# Theano/TensorFlow device; change this to execute Nematus on GPU
#
# For Theano, a typical value is 'cuda'
#
# For TensorFlow, the value will be passed to CUDA_VISIBLE_DEVICES. It should
# be a list of GPU identifiers. For example, '1' or '0,1,3'
device='3'
The hardware used for training is specified by the last entry, device.
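As a rough sketch of how this configuration is typically consumed (the real logic is in train.sh, and the relative path to the vars file below is an assumption):
# load the tool paths and GPU setting, then restrict training to that GPU
source ../../vars
export CUDA_VISIBLE_DEVICES=$device
Training is then started with: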
cd /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/scripts
./train.sh
During training, the model's metrics are printed. If you need to compute them manually, you can use the BLEU tool from Section 1.2. When evaluating the models earlier, I put together BLEU computation scripts for zh-en and en-zh; their paths are listed below and can be opened as notebooks:
zh-en: /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/bleu/Bleu_en.ipynb
en-zh: /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/bleu/Bleu_zh.ipynb
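For a quick manual check outside the notebooks, BLEU can also be scored with Moses' multi-bleu.perl from the toolbox; reference.en and hypothesis.en below are placeholder file names, and both files should be tokenized consistently:
# hypothesis on stdin, reference as an argument; prints BLEU to stdout
moses_scripts=/home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/toolbox/mosesdecoder/scripts
perl $moses_scripts/generic/multi-bleu.perl reference.en < hypothesis.en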
The API deployment uses the nematus toolkit. The logic is: the model is loaded into GPU memory once, each incoming API request runs one prediction with the model on the GPU, and the result is returned to the caller.
How to run (already wrapped in a shell script; the port number can be configured in the script):
cd /home/xiaojie/URun.ResearchPrototype/People/Xiaojie/MachineTranslation/Transformer_zh-en/training/server
./start_server_zh-en.sh
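Once the server is up, a request can be sent over HTTP. The call below is only a hypothetical example: the port (8080), the /translate endpoint, and the JSON payload format all depend on the nematus server version and on what start_server_zh-en.sh configures, so check the script and the nematus server documentation before relying on it:
# hypothetical request; adjust port, endpoint and payload to the actual server configuration
curl -s -X POST http://localhost:8080/translate \
     -H 'Content-Type: application/json' \
     -d '{"segments": [["这", "是", "一个", "测试", "。"]]}'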