Automatic Summarization

Dataset overview:

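The LCSTS (Large-scale Chinese Short Text Summarization) dataset was collected automatically from Sina Weibo. The full text of a post written by the original author serves as the input, and the summary sentence enclosed in square brackets at the head of the post serves as the output, making it the largest Chinese summarization dataset to date. Of the more than 2.4 million automatically labeled samples, over 10,000 were manually annotated with quality scores. LCSTS has been widely adopted in subsequent research on (Chinese) short-summary algorithms.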
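The pairing described above (post body as input, bracketed headline as output) can be sketched as follows. This is a minimal illustration and not the authors' collection pipeline: the full-width bracket pair `【…】`, the regular expression, and the function name `split_post` are assumptions made for the example.

```python
import re

# Assumption: the headline is wrapped in full-width square brackets 【…】
# at the very start of the post, and everything after it is the body.
HEADLINE_RE = re.compile(r"^\s*【(?P<summary>[^【】]+)】\s*(?P<body>.+)$", re.S)


def split_post(post: str):
    """Return (body, summary) if the post opens with a bracketed headline, else None."""
    match = HEADLINE_RE.match(post)
    if match is None:
        return None
    return match.group("body").strip(), match.group("summary").strip()


if __name__ == "__main__":
    pair = split_post("【标题】正文内容")
    print(pair)
```

Posts without a leading bracketed headline, or with a headline but no body, yield `None` and would simply be skipped in this sketch.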
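Dataset details:

| Name | Size | Created | Authors | Affiliation | Paper | Download | Evaluation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LCSTS | 2.4 million short-summary samples | 2015-08 | Baotian Hu (户保田) et al. | Intelligent Computing Research Center, Harbin Institute of Technology (Shenzhen) | Link | Link | N/A |

Related papers published on this dataset: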
- Gu, Jiatao, Zhengdong Lu, Hang Li, and Victor OK Li. "Incorporating Copying Mechanism in Sequence-to-Sequence Learning." In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631-1640. 2016.
- Li, Piji, Wai Lam, Lidong Bing, and Zihao Wang. "Deep Recurrent Generative Decoder for Abstractive Text Summarization." In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2091-2100. 2017.
- Lin, Junyang, Xu Sun, Shuming Ma, and Qi Su. "Global Encoding for Abstractive Summarization." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 163-169. 2018.

Dataset overview:

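NLPCC (International Conference on Natural Language Processing and Chinese Computing) is an annual academic conference organized by the Technical Committee on Chinese Information Technology of the China Computer Federation, held every year since 2012. One of the NLPCC2017 evaluation tasks (Task3) is single-document summarization. It contains 50,000 annotated news articles, each annotated with a short summary of no more than 60 characters.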
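The 60-character limit on summaries can be checked with a few lines of code before training or submitting system output. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the limit counts every Unicode character, punctuation included, after trimming surrounding whitespace.

```python
MAX_SUMMARY_CHARS = 60  # limit stated in the task description


def summary_length(summary: str) -> int:
    """Length in Unicode characters, ignoring leading and trailing whitespace."""
    return len(summary.strip())


def find_overlong(summaries, limit: int = MAX_SUMMARY_CHARS):
    """Return the indices of summaries that exceed the character limit."""
    return [i for i, s in enumerate(summaries) if summary_length(s) > limit]


if __name__ == "__main__":
    samples = ["短摘要", "字" * 61, "字" * 60]
    print(find_overlong(samples))
```

Python's `len` counts code points, so each Chinese character counts as one toward the limit.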
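Dataset details:

| Name | Size | Created | Authors | Affiliation | Paper | Download | Evaluation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NLPCC2017 | 50,000 short-summary samples | 2017-08 | N/A | N/A | N/A | Link | Link |

Related papers published on this dataset: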
- Zheng, Hao, and Mirella Lapata. "Sentence Centrality Revisited for Unsupervised Summarization." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6236-6247. 2019.
