转译器 (Translator) | On the Unsupervised Analysis of Domain-Specific Modern Chinese Texts

Original | 陈静 | 零壹Lab | 2022-10-08


Word discovery | Text segmentation | EM algorithm | Chinese history | Blogs

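As more and more text is digitized, whether publicly or privately, demand has grown sharply for computational tools that extract information from text effectively and automatically. Because Chinese, unlike alphabetic languages, does not mark word boundaries, most existing Chinese text-mining tools require a predefined vocabulary or a large relevant training corpus, which not every text collection has. Here we introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), which discovers words and segments text simultaneously from large volumes of unstructured Chinese text, provides ways to rank the discovered words, and supports higher-level text analysis. TopWORDS is especially suited to mining online and domain-specific texts whose underlying vocabulary is unknown or which differ substantially from the available training corpora. Output from TopWORDS can be fed into text-analysis tools such as topic modeling, word embedding, and association pattern finding, with results as good as or better than those obtained with a supervised segmentation tool.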

On the unsupervised analysis of domain-specific Chinese texts

Ke Deng^a, Peter K. Bol^b, Kate J. Li^c, and Jun S. Liu^a,d,1
^a Center for Statistical Science & Department of Industrial Engineering, Tsinghua University, Beijing 100084, China; ^b Department of East Asian Languages & Civilizations, Harvard University, Cambridge, MA 02138; ^c Sawyer Business School, Suffolk University, Boston, MA 02108; and ^d Department of Statistics, Harvard University, Cambridge, MA 02138

Word Discovery | Text Segmentations | EM Algorithm | Chinese History | Blogs

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than that from using outputs of a supervised segmentation method.

(For the full paper, see "Read the original".)

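Repost Notice

Articles published by 零壹Lab may not be reposted without permission. If you repost, credit the source in a prominent position and include a clear 零壹Lab QR code. For long-term authorization or any matter not covered here, please contact 零壹Lab by email. 零壹Lab reserves the right to pursue legal action against unauthorized reposting.

零壹Lab email: dh01lab@hotmail.com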

END

Editor-in-Chief / 陈静

Editors / 徐力恒, 顾佳蕙

Art Editor / 张家伟

零壹Lab

Documenting the everyday life of digital media

Reflecting on technology and the humanistic spirit
