Punctuation prediction for Vietnamese texts using conditional random fields

Bibliographic Details
Main Authors: PHAM, Hong Quang, NGUYEN, Binh T., CUONG, Nguyen Viet
Format: text
Language: English
Published: Institutional Knowledge at Singapore Management University 2019
Online Access: https://ink.library.smu.edu.sg/sis_research/7816
https://ink.library.smu.edu.sg/context/sis_research/article/8819/viewcontent/3368926.3369716_pv.pdf
Description
Summary: We investigate punctuation prediction for the Vietnamese language. This problem is important because its solution can be used to add suitable punctuation marks to machine-transcribed speech, which usually lacks such information. Similar to previous work on English and Chinese, we formulate this task as a sequence labeling problem. We then apply the conditional random field model to solve the problem and propose a set of features that are useful for prediction. Moreover, we build two corpora from Vietnamese online news and movie subtitles and perform extensive experiments on these data. Finally, we ask four volunteers to insert punctuation marks into a small sample of our dataset. The experimental results show that this problem is challenging, even for a human, and that our model can achieve performance close to that of a human.
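
The sequence labeling formulation mentioned in the summary can be illustrated with a minimal sketch. It assumes a simple scheme in which each token is labeled with the punctuation mark that follows it, or "O" when none does. The label names, the `PUNCT` table, and both function names are hypothetical and are not taken from the paper; the record does not describe the authors' actual label set or features.

```python
# Hypothetical label scheme (an assumption, not the paper's):
# each token is tagged with the punctuation mark that follows it, or "O" for none.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EXCL"}
SYMBOL = {label: mark for mark, label in PUNCT.items()}


def to_sequence_labels(text):
    """Convert punctuated text into (token, label) pairs for a sequence labeler."""
    marks = "".join(PUNCT)
    pairs = []
    for raw in text.split():
        word = raw.rstrip(marks)
        if not word:
            continue  # skip tokens that consist only of punctuation
        trailing = raw[len(word):]
        label = PUNCT[trailing[0]] if trailing else "O"
        # Lowercase to mimic transcribed speech, which carries no casing cues.
        pairs.append((word.lower(), label))
    return pairs


def restore(pairs):
    """Re-insert punctuation marks from (token, label) pairs."""
    return " ".join(token + SYMBOL.get(label, "") for token, label in pairs)


pairs = to_sequence_labels("Xin chào, bạn khỏe không?")
print(pairs)
print(restore(pairs))
```

Under this scheme, a sequence model such as a conditional random field would be trained to predict the label column from the token column; the reference labels for training come directly from already punctuated text such as news articles or subtitles.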