English | 简体中文
# Data Annotation Instructions
DeepKE is an open-source knowledge graph extraction and construction toolkit that supports low-resource, long-text and multimodal knowledge extraction. Based on PyTorch, it provides named entity recognition, relation extraction and attribute extraction. DeepKE-cnSchema is an out-of-the-box version that allows users to download models for entity and relation extraction supporting cnSchema.
Chapter | Description |
---|---|
Introduction | The basic principles and supported data types of DeepKE |
Manual Data Annotation | How to manually annotate data |
Automatic Data Annotation | How to automatically annotate data based on DeepKE |
FAQ | Frequently Asked Questions |
References | Technical reports related to this document |
Also available for beginners are detailed documentation, Google Colab tutorials, online presentations and slideshows.
It is well known that data is very important for model training. To facilitate the use of this tool, DeepKE provides detailed annotation of entity identification and relationship extraction data, so that users can obtain training data manually or automatically. The annotated data can be directly used by DeepKE for model training.
doccano is an open-source manual data annotation tool. It provides annotation functions for text classification, sequence labeling, and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Simply create a project, upload the data, and start labeling; you can build a dataset ready for DeepKE training in a matter of hours. Using doccano to annotate data for entity recognition and relation extraction is described below.
For details about doccano installation and configuration, see GitHub (doccano). Once the server is installed and started, point your browser to http://0.0.0.0:8000 and click Log in. To create a project, click Create in the upper left corner to open the project creation interface.
doccano supports a variety of text formats. The differences are as follows:

- `Textfile`: the uploaded file is in `txt` format. When annotating, a whole `txt` file is displayed as one page of content.
- `TextLine`: the uploaded file is in `txt` format. When annotating, each line of the `txt` file is displayed as one page of content.
- `JSONL`: short for `JSON Lines`; each line is a valid `JSON` value.
- `CoNLL`: a file in `CoNLL` format; each line contains a series of tab-separated words.

Add task labels of type `Span` and `Relation`. Here, `Span` refers to the target information fragment in the original text, that is, an entity of a certain type in entity recognition, such as `PER`, `LOC` or `ORG`. You can define a shortcut key for each label (e.g. `p` for the `PER` label) and define the label color.

Task annotation: click the `Annotate` button to the far right of each data item to annotate it, for example adding `Span` type tags for people and places.

Click `Options`, then `Export Dataset` in the Dataset column to export the annotated data.
The exported data is stored in a single text file, one sample per line in `jsonl` format, with the following fields:
- `id`: the unique identifier of the sample in the dataset.
- `text`: the raw text data.
- `entities`: the `Span` tags contained in the data; each `Span` tag contains four fields:
  - `id`: the unique identifier of the `Span` in the dataset.
  - `start_offset`: the starting position of the `Span`.
  - `end_offset`: the position one past the end of the `Span`.
  - `label`: the type of the `Span`.

Example of exported data:
```json
{
  "id": 10,
  "text": "University of California is located in California, United States.",
  "entities": [
    {"id": 15, "label": "ORG", "start_offset": 0, "end_offset": 24},
    {"id": 16, "label": "LOC", "start_offset": 39, "end_offset": 49},
    {"id": 17, "label": "LOC", "start_offset": 51, "end_offset": 64}
  ],
  "relations": []
}
```
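The `start_offset`/`end_offset` convention can be checked with a short Python sketch; the sample dict below mirrors the export example above, and `end_offset` being exclusive means Python slicing applies directly:

```python
# Sketch: recovering a Span's text from its offsets.
# end_offset is "the next position from the end", i.e. exclusive,
# so text[start_offset:end_offset] yields the entity string.
sample = {
    "text": "University of California is located in California, United States.",
    "entities": [
        {"id": 15, "label": "ORG", "start_offset": 0, "end_offset": 24},
        {"id": 16, "label": "LOC", "start_offset": 39, "end_offset": 49},
        {"id": 17, "label": "LOC", "start_offset": 51, "end_offset": 64},
    ],
}

for ent in sample["entities"]:
    span = sample["text"][ent["start_offset"]:ent["end_offset"]]
    print(ent["label"], span)
```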
The input format expected by DeepKE for NER training is a `txt` file, with each line including words, separators and labels (see the `CoNLL` data format). The exported data will be pre-processed into the DeepKE input format for training; please refer to the detailed README.
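As an illustration of the kind of pre-processing involved, the sketch below converts one exported JSONL sample into character-level BIO pairs (a common CoNLL-style scheme). `jsonl_to_bio` is an illustrative helper, not DeepKE's actual preprocessing script:

```python
import json

def jsonl_to_bio(line):
    """Turn one doccano JSONL sample into (character, BIO tag) pairs.
    Illustrative only; DeepKE's own preprocessing may differ."""
    sample = json.loads(line)
    tags = ["O"] * len(sample["text"])
    for ent in sample["entities"]:
        s, e, label = ent["start_offset"], ent["end_offset"], ent["label"]
        tags[s] = "B-" + label          # entity start
        for i in range(s + 1, e):       # inside the entity
            tags[i] = "I-" + label
    return list(zip(sample["text"], tags))

# A hypothetical exported sample.
line = ('{"id": 1, "text": "Bob lives in Rome.", "entities": '
        '[{"id": 1, "label": "PER", "start_offset": 0, "end_offset": 3}, '
        '{"id": 2, "label": "LOC", "start_offset": 13, "end_offset": 17}], '
        '"relations": []}')
pairs = jsonl_to_bio(line)
```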
For relation extraction, each piece of data to be annotated follows the format `{text}*{head entity}*{tail entity}*{head entity type}*{tail entity type}`, where the head and tail entity types can be empty.

Add task labels of type `Span` and `Relation`; the `Relation` type is used here. `Relation` refers to the relation between `Span`s in the original text, that is, the relation between two entities in relation extraction, such as `Graduation` or `Causal`. You can define a shortcut key for each label (e.g. `b` for the `Graduation` label) and define the label color.

Task annotation: click the `Annotate` button to the far right of each data item to annotate it. First add the `Span` type tags `PER` and `LOC`, followed by the relation tag `Graduation` between the entities; the `Relation` tag points from the `Subject` entity to the `Object` entity.

Click `Options`, then `Export Dataset` in the Dataset column to export the annotated data.
The exported data is stored in a single text file, one sample per line in `jsonl` format, with the following fields:
- `id`: the unique identifier of the sample in the dataset.
- `text`: the raw text data.
- `entities`: the `Span` tags contained in the data; each `Span` tag contains four fields:
  - `id`: the unique identifier of the `Span` in the dataset.
  - `start_offset`: the starting position of the `Span`.
  - `end_offset`: the position one past the end of the `Span`.
  - `label`: the type of the `Span`.
- `relations`: the `Relation` tags contained in the data; each `Relation` tag contains four fields:
  - `id`: the unique identifier of the (`Span1`, `Relation`, `Span2`) triple in the dataset; the same triple in different samples corresponds to the same `id`.
  - `from_id`: the identifier of `Span1`.
  - `to_id`: the identifier of `Span2`.
  - `type`: the type of the `Relation`.

Example of exported data:
```json
{
  "id": 13,
  "text": "The collision resulted in two more crashes in the intersection, including a central concrete truck that was about to turn left onto college ave. *collision*crashes**",
  "entities": [
    {"id": 20, "label": "MISC", "start_offset": 4, "end_offset": 13},
    {"id": 21, "label": "MISC", "start_offset": 35, "end_offset": 42}
  ],
  "relations": [
    {"id": 2, "from_id": 20, "to_id": 21, "type": "Cause-Effect"}
  ]
}
```
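The `from_id`/`to_id` references can be resolved back to entity text as in the sketch below; `extract_triples` is an illustrative helper (not part of DeepKE), and the sample is a shortened version of the export example above:

```python
def extract_triples(sample):
    """Resolve from_id/to_id references in a doccano relation export
    into (head text, relation type, tail text) triples."""
    text = sample["text"]
    spans = {e["id"]: text[e["start_offset"]:e["end_offset"]]
             for e in sample["entities"]}
    return [(spans[r["from_id"]], r["type"], spans[r["to_id"]])
            for r in sample["relations"]]

sample = {
    "id": 13,
    "text": "The collision resulted in two more crashes in the intersection.",
    "entities": [
        {"id": 20, "label": "MISC", "start_offset": 4, "end_offset": 13},
        {"id": 21, "label": "MISC", "start_offset": 35, "end_offset": 42},
    ],
    "relations": [
        {"id": 2, "from_id": 20, "to_id": 21, "type": "Cause-Effect"},
    ],
}
triples = extract_triples(sample)
```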
To help users complete entity recognition tasks with DeepKE, we provide an easy-to-use automatic annotation tool based on dictionary matching. Two entity dictionaries (one Chinese and one English) are provided in advance, and samples are automatically tagged using the entity dictionary together with jieba part-of-speech tagging. The dictionaries can be downloaded as follows:

```shell
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
```
If you need to build a domain-specific dictionary, please follow the format of the provided dictionaries: a `csv` file with two columns, the entity and its corresponding label.

Entity | Label |
---|---|
Washington | LOC |
... | ... |
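The core idea of dictionary matching can be sketched as a greedy longest-match tagger; the real tool additionally uses jieba part-of-speech tagging, and `dict_label` below is illustrative rather than the actual script:

```python
def dict_label(text, entity_dict):
    """Greedy longest-match dictionary tagging, producing BIO tags."""
    tags = ["O"] * len(text)
    entities = sorted(entity_dict, key=len, reverse=True)  # longest first
    i = 0
    while i < len(text):
        for ent in entities:
            if text.startswith(ent, i):
                label = entity_dict[ent]
                tags[i] = "B-" + label
                for j in range(i + 1, i + len(ent)):
                    tags[j] = "I-" + label
                i += len(ent) - 1  # skip past the matched entity
                break
        i += 1
    return tags

tags = dict_label("Washington is a city.", {"Washington": "LOC"})
```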
The data to be automatically annotated (`txt` format, one sample per line) should be placed under the `source_data` path; the script traverses all `txt` files in this folder and annotates them line by line. The annotated data is split into a training set, validation set and test set (the split ratio can be customized) and can be used directly as training data in DeepKE. The parameters are as follows:
- `language`: `cn` or `en`.
- `source_dir`: corpus path (all `txt` files under this folder are traversed and annotated line by line; defaults to `source_data`).
- `dict_dir`: entity dictionary path (defaults to `vocab_dict.csv`).
- `train_rate, dev_rate, test_rate`: the split ratio of the training, validation and test sets (please make sure the sum is 1; defaults to `0.8:0.1:0.1`).

```shell
python prepare_weaksupervised_data.py --language cn --dict_dir vocab_dict_cn.csv
python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv
```
We provide a simple distant-supervision-based tool to assign relation labels for RE tasks. The source file (the dataset to be labeled) must be in `.json` format, and each entry includes one pair of entities, a head entity and a tail entity. Each piece of data should contain at least the following five items: `sentence`, `head`, `tail`, `head_offset`, `tail_offset`. The detailed json pattern is as follows:
```json
[
  {
    "sentence": "This summer, the United States Embassy in Beirut, Lebanon, once again made its presence felt on the cultural scene by sponsoring a photo exhibition, an experimental jazz performance, a classical music concert and a visit from the Whiffenpoofs, Yale University's a cappella singers.",
    "head": "Lebanon",
    "tail": "Beirut",
    "head_offset": "50",
    "tail_offset": "42",
    //...
  },
  //...
]
```
Entity pairs in the source file are matched against the triples in the triple file. An entity pair is labeled with the relation type if it matches a triple in the triple file; if there is no matching triple, the pair is labeled with the `None` type.
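The matching step can be sketched as a simple lookup; `label_with_triples` is an illustrative helper, not the actual `ds_label_data.py` logic:

```python
def label_with_triples(samples, triple_rows):
    """Label each (head, tail) pair with the relation from the triple
    table, or "None" if no triple matches."""
    lookup = {(row["head"], row["tail"]): row["rel"] for row in triple_rows}
    for s in samples:
        s["relation"] = lookup.get((s["head"], s["tail"]), "None")
    return samples

# Hypothetical triple table and samples to be labeled.
triples = [{"head": "Lebanon", "tail": "Beirut",
            "rel": "/location/location/contains"}]
samples = [{"head": "Lebanon", "tail": "Beirut"},
           {"head": "Rome", "tail": "Italy"}]
labeled = label_with_triples(samples, triples)
```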
We provide an English triple file and a Chinese triple file. The English triple file comes from the `NYT` dataset and contains the following relation types:

```
"/business/company/place_founded",
"/people/person/place_lived",
"/location/country/administrative_divisions",
"/business/company/major_shareholders",
"/sports/sports_team_location/teams",
"/people/person/religion",
"/people/person/place_of_birth",
"/people/person/nationality",
"/location/country/capital",
"/business/company/advisors",
"/people/deceased_person/place_of_death",
"/business/company/founders",
"/location/location/contains",
"/people/person/ethnicity",
"/business/company_shareholder/major_shareholder_of",
"/people/ethnicity/geographic_distribution",
"/people/person/profession",
"/business/person/company",
"/people/person/children",
"/location/administrative_division/country",
"/people/ethnicity/people",
"/sports/sports_team/location",
"/location/neighborhood/neighborhood_of",
"/business/company/industry"
```
The Chinese triple file comes from here, with the following relation types:

```
{"object_type": "地点", "predicate": "祖籍", "subject_type": "人物"}
{"object_type": "人物", "predicate": "父亲", "subject_type": "人物"}
{"object_type": "地点", "predicate": "总部地点", "subject_type": "企业"}
{"object_type": "地点", "predicate": "出生地", "subject_type": "人物"}
{"object_type": "目", "predicate": "目", "subject_type": "生物"}
{"object_type": "Number", "predicate": "面积", "subject_type": "行政区"}
{"object_type": "Text", "predicate": "简称", "subject_type": "机构"}
{"object_type": "Date", "predicate": "上映时间", "subject_type": "影视作品"}
{"object_type": "人物", "predicate": "妻子", "subject_type": "人物"}
{"object_type": "音乐专辑", "predicate": "所属专辑", "subject_type": "歌曲"}
{"object_type": "Number", "predicate": "注册资本", "subject_type": "企业"}
{"object_type": "城市", "predicate": "首都", "subject_type": "国家"}
{"object_type": "人物", "predicate": "导演", "subject_type": "影视作品"}
{"object_type": "Text", "predicate": "字", "subject_type": "历史人物"}
{"object_type": "Number", "predicate": "身高", "subject_type": "人物"}
{"object_type": "企业", "predicate": "出品公司", "subject_type": "影视作品"}
{"object_type": "Number", "predicate": "修业年限", "subject_type": "学科专业"}
{"object_type": "Date", "predicate": "出生日期", "subject_type": "人物"}
{"object_type": "人物", "predicate": "制片人", "subject_type": "影视作品"}
{"object_type": "人物", "predicate": "母亲", "subject_type": "人物"}
{"object_type": "人物", "predicate": "编剧", "subject_type": "影视作品"}
{"object_type": "国家", "predicate": "国籍", "subject_type": "人物"}
{"object_type": "Number", "predicate": "海拔", "subject_type": "地点"}
{"object_type": "网站", "predicate": "连载网站", "subject_type": "网络小说"}
{"object_type": "人物", "predicate": "丈夫", "subject_type": "人物"}
{"object_type": "Text", "predicate": "朝代", "subject_type": "历史人物"}
{"object_type": "Text", "predicate": "民族", "subject_type": "人物"}
{"object_type": "Text", "predicate": "号", "subject_type": "历史人物"}
{"object_type": "出版社", "predicate": "出版社", "subject_type": "书籍"}
{"object_type": "人物", "predicate": "主持人", "subject_type": "电视综艺"}
{"object_type": "Text", "predicate": "专业代码", "subject_type": "学科专业"}
{"object_type": "人物", "predicate": "歌手", "subject_type": "歌曲"}
{"object_type": "人物", "predicate": "作词", "subject_type": "歌曲"}
{"object_type": "人物", "predicate": "主角", "subject_type": "网络小说"}
{"object_type": "人物", "predicate": "董事长", "subject_type": "企业"}
{"object_type": "Date", "predicate": "成立日期", "subject_type": "机构"}
{"object_type": "学校", "predicate": "毕业院校", "subject_type": "人物"}
{"object_type": "Number", "predicate": "占地面积", "subject_type": "机构"}
{"object_type": "语言", "predicate": "官方语言", "subject_type": "国家"}
{"object_type": "Text", "predicate": "邮政编码", "subject_type": "行政区"}
{"object_type": "Number", "predicate": "人口数量", "subject_type": "行政区"}
{"object_type": "城市", "predicate": "所在城市", "subject_type": "景点"}
{"object_type": "人物", "predicate": "作者", "subject_type": "图书作品"}
{"object_type": "Date", "predicate": "成立日期", "subject_type": "企业"}
{"object_type": "人物", "predicate": "作曲", "subject_type": "歌曲"}
{"object_type": "气候", "predicate": "气候", "subject_type": "行政区"}
{"object_type": "人物", "predicate": "嘉宾", "subject_type": "电视综艺"}
{"object_type": "人物", "predicate": "主演", "subject_type": "影视作品"}
{"object_type": "作品", "predicate": "改编自", "subject_type": "影视作品"}
{"object_type": "人物", "predicate": "创始人", "subject_type": "企业"}
```
You can also use your own customized triple file, but it should be in `.csv` format with the following pattern:
head | tail | rel |
---|---|---|
Lebanon | Beirut | /location/location/contains |
... | ... | ... |
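A file in that pattern can be read with Python's standard `csv` module; the inline `csv_text` below is a stand-in for a real file on disk:

```python
import csv
import io

# Stand-in for the contents of a customized triple file
# (columns: head, tail, rel).
csv_text = "head,tail,rel\nLebanon,Beirut,/location/location/contains\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
```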
The output files are named `labeled_train.json`, `labeled_dev.json` and `labeled_test.json` for the train, dev and test sets respectively. The format of the output file is as follows:
```json
[
  {
    "sentence": "This summer, the United States Embassy in Beirut, Lebanon, once again made its presence felt on the cultural scene by sponsoring a photo exhibition, an experimental jazz performance, a classical music concert and a visit from the Whiffenpoofs, Yale University's a cappella singers.",
    "head": "Lebanon",
    "tail": "Beirut",
    "head_offset": "50",
    "tail_offset": "42",
    "relation": "/location/location/contains",
    //...
  },
  //...
]
```
We automatically split the source data into three splits with the ratio `0.8:0.1:0.1`; you can set your own split ratio.
- `language`: `en` or `cn`.
- `source_file`: the data file to be labeled.
- `triple_file`: the triple file path.
- `train_rate, dev_rate, test_rate`: the split ratio of the training, validation and test sets (please make sure the sum is 1; defaults to `0.8:0.1:0.1`).

```shell
python ds_label_data.py --language en --source_file source_data.json --triple_file triple_file.csv
```
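The 0.8/0.1/0.1 split described above can be sketched as follows; `split_data` is illustrative, while the actual script exposes the rates as command-line arguments:

```python
import random

def split_data(samples, train_rate=0.8, dev_rate=0.1, seed=42):
    """Shuffle and split into train/dev/test; the remaining
    1 - train_rate - dev_rate fraction becomes the test set."""
    data = list(samples)
    random.Random(seed).shuffle(data)  # deterministic shuffle
    n_train = int(len(data) * train_rate)
    n_dev = int(len(data) * dev_rate)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])

train, dev, test = split_data(range(100))  # 80 / 10 / 10 samples
```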
Q: How much data do I need to annotate?

A: Around 10K samples or so.

Q: Is there any labeled data available?

Q: What if a model trained on automatically labeled data does not work well?

A: Consider self-training approaches.
If the resources or techniques in this project have been helpful to your research, you are welcome to cite the following paper:
```bibtex
@article{zhang2022deepke,
  title={DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population},
  author={Zhang, Ningyu and Xu, Xin and Tao, Liankuan and Yu, Haiyang and Ye, Hongbin and Qiao, Shuofei and Xie, Xin and Chen, Xiang and Li, Zhoubo and Li, Lei and Liang, Xiaozhuan and others},
  journal={arXiv preprint arXiv:2201.03335},
  year={2022}
}
```
The contents of this project are for technical research purposes only and are not intended as a basis for any conclusive findings. Users are free to use the model as they wish within the scope of the licence, but we cannot be held responsible for direct or indirect damage resulting from the use of the contents of the project.
If you have any questions, please submit them in the GitHub Issue.