- 2017-08-22 发布于江苏
A Detailed Walkthrough of SRILM
Generating the N-gram Count File
ngram-count -text train.zh -order 5 -write train.count -unk
-text: the training corpus file
-order: the maximum n-gram order
-write: the output count file name
-unk: map out-of-vocabulary (OOV) words to <unk>
Contents of train.count: the first column is the n-gram itself (unigrams through 5-grams); the second column is its count in the training corpus.
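To make the count file concrete, here is a minimal Python sketch of what ngram-count does in this step: pad each sentence with <s> and </s>, then tally every 1..order gram. This is an illustration, not SRILM's exact implementation (SRILM has additional options controlling which grams are counted).

```python
from collections import Counter

def ngram_counts(sentences, order=5):
    """Count all 1..order grams, padding each sentence with <s> and </s>
    (a sketch of ngram-count -text ... -order ... -write behavior)."""
    counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Tiny demo corpus; real input would be train.zh, one sentence per line.
counts = ngram_counts(["a b a", "a b"], order=2)
for gram, c in sorted(counts.items()):
    # train.count-style lines: the n-gram, a tab, then its count
    print(" ".join(gram), c, sep="\t")
```

Each printed line mirrors one row of train.count: the gram in the first column, its training-corpus count in the second.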
Generating the N-gram Language model
ngram-count -read train.count -order 5 -lm train.lm -gt1min 3 -gt1max 7 -gt2min 3 -gt2max 7 -gt3min 3 -gt3max 7 -gt4min 3 -gt4max 7 -gt5min 3 -gt5max 7
-read: read the n-gram count file
-lm: the output LM file name
-gtNmin / -gtNmax (for N = 1..5): lower and upper count cutoffs for Good-Turing discounting of N-grams
Contents of train.lm (ARPA format): the first column is the log probability (base 10) of the n-gram; the third column, when present, is the log (base 10) of its backoff weight.
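A body line in the ARPA file can be pulled apart with a few lines of Python. This is a minimal sketch for a single n-gram line (it does not handle the \data\ header or \N-grams: section markers); the sample line and its numbers are made up for illustration.

```python
def parse_arpa_ngram_line(line):
    """Parse one body line of an ARPA LM file:
    <log10 prob> TAB <n-gram words> [TAB <log10 backoff weight>].
    The backoff weight column is absent for highest-order n-grams."""
    fields = line.strip().split("\t")
    logprob = float(fields[0])
    words = fields[1].split()
    backoff = float(fields[2]) if len(fields) > 2 else None
    return logprob, words, backoff

# Hypothetical unigram entry: log10 p("the") and its backoff weight
lp, words, bow = parse_arpa_ngram_line("-1.3010\tthe\t-0.30103")
```

Note that both columns are log base 10, matching the description above.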
Calculate the Test Data Perplexity
ngram -ppl test.zh -order 5 -lm train.lm
-ppl: compute the perplexity of the test data
The formulas for ppl and ppl1 as reported by ngram:
ppl  = 10^(-logprob / (N + S))
ppl1 = 10^(-logprob / N)
where logprob is the total log probability (base 10) of the test set, N is the number of words, and S is the number of sentences (ppl counts the end-of-sentence tokens; ppl1 does not).
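The two formulas can be checked directly in Python; the numbers in the example call are made up for illustration.

```python
def perplexities(logprob, num_words, num_sentences):
    """Compute ppl and ppl1 from ngram -ppl's summary quantities:
    ppl  = 10 ** (-logprob / (num_words + num_sentences))
    ppl1 = 10 ** (-logprob / num_words)
    logprob is the total log10 probability of the test set."""
    ppl = 10 ** (-logprob / (num_words + num_sentences))
    ppl1 = 10 ** (-logprob / num_words)
    return ppl, ppl1

# Hypothetical test set: total log10 prob -200 over 90 words, 10 sentences
ppl, ppl1 = perplexities(logprob=-200.0, num_words=90, num_sentences=10)
```

ppl1 is always at least as large as ppl for the same test set, since it divides the same log probability by fewer tokens.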
The steps above show SRILM's three core functions:
Generate the n-gram count file from the corpus
Train the language model from the n-gram count file
Calculate the test data perplexity using the trained language model
Perplexity when the smoothing method is Good-Turing:
ngram-count -read train.count -order 5 -lm train.lm -gt1min 3 -gt1max 7 -gt2min 3 -gt2max 7 -gt3min 3 -gt3max 7 -gt4min 3 -gt4max 7 -gt5min 3 -gt5max 7
Perplexity when the smoothing method is absolute discounting:
ngram-count -read train.count -order 5 -lm train.lm -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 -cdiscount4 0.5 -cdiscount5 0.5
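To illustrate what the -cdiscountN 0.5 options mean, here is a minimal sketch of absolute discounting for the unigram case: subtract a fixed discount d from every observed count and give the freed probability mass to the whole vocabulary. This is an interpolated simplification for illustration; SRILM itself redistributes the freed mass through backoff to lower orders.

```python
from collections import Counter

def absolute_discount_prob(word, counts, total, vocab_size, d=0.5):
    """Unigram probability under absolute discounting with discount d
    (the role of -cdiscount1 0.5). Seen counts are reduced by d; the
    freed mass d * (distinct seen words) / total is spread uniformly
    over the vocabulary, so unseen words get nonzero probability."""
    seen = len(counts)                 # number of distinct seen words
    freed = d * seen / total           # probability mass freed by discounting
    if counts.get(word, 0) > 0:
        return (counts[word] - d) / total + freed / vocab_size
    return freed / vocab_size          # unseen words share the freed mass
```

Because exactly the discounted mass is redistributed, the probabilities over the full vocabulary still sum to one.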
Perplexity when the smoothing method is Witten-Bell discounting:
ngram-count -read train.count -order 5 -lm train.lm -wbdiscount1 -wbdiscount2 -wbdiscount3 -wbdiscount4 -wbdiscount5
Perplexity when the smoothing method is modified Kneser-Ney discounting:
ngram-count -read train.count -order 5 -lm train.lm -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5