EASE: Entity-Aware Contrastive Learning of Sentence Embedding


EASE is a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities, proposed in our paper EASE: Entity-Aware Contrastive Learning of Sentence Embedding. This repository contains the source code to train the model and to evaluate it on downstream tasks. Our code is mainly based on that of SimCSE.
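For intuition, the objective can be pictured as an InfoNCE-style loss that pulls each sentence embedding toward the embedding of its related entity, with the other entities in the batch serving as negatives. The snippet below is a minimal, hypothetical sketch of that idea; the function name, temperature value, and use of in-batch negatives are our assumptions, and the training code (which, per the paper, also combines this with a SimCSE-style dropout-based self-supervised loss) is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def entity_cl_loss(sent_emb: torch.Tensor, ent_emb: torch.Tensor,
                   temperature: float = 0.05) -> torch.Tensor:
    """Hypothetical InfoNCE-style loss between sentences and their entities.

    sent_emb: (batch, dim) sentence embeddings.
    ent_emb:  (batch, dim) embeddings of each sentence's related entity;
              the other rows act as in-batch negatives.
    """
    sent_emb = F.normalize(sent_emb, dim=-1)
    ent_emb = F.normalize(ent_emb, dim=-1)
    # Similarity of every sentence to every entity in the batch.
    logits = sent_emb @ ent_emb.t() / temperature
    # The matching entity for each sentence sits on the diagonal.
    labels = torch.arange(sent_emb.size(0), device=sent_emb.device)
    return F.cross_entropy(logits, labels)
```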

Released Models


Our published models are listed below. You can load them via Hugging Face's Transformers.

| Monolingual Models | Avg. STS | Avg. STC |
| --- | --- | --- |
| sosuke/ease-bert-base-uncased | 77.0 | 63.1 |
| sosuke/ease-roberta-base | 76.8 | 58.6 |

| Multilingual Models | Avg. mSTS | Avg. mSTC |
| --- | --- | --- |
| sosuke/ease-bert-base-multilingual-cased | 57.2 | 36.1 |
| sosuke/ease-xlm-roberta-base | 57.1 | 36.3 |

Use EASE with Hugging Face


```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load our pretrained model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained("sosuke/ease-bert-base-multilingual-cased")
model = AutoModel.from_pretrained("sosuke/ease-bert-base-multilingual-cased")

# Mean pooling: average the token embeddings, weighted by the attention mask.
pooler = lambda last_hidden, att_mask: (last_hidden * att_mask.unsqueeze(-1)).sum(1) / att_mask.sum(-1).unsqueeze(-1)

# Tokenize input texts.
texts = [
    "Ils se préparent pour un spectacle à l'école.",
    "They are preparing for a show at school.",
    "Two medical professionals in green look on at something."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings.
with torch.no_grad():
    last_hidden = model(**inputs, output_hidden_states=True, return_dict=True).last_hidden_state
embeddings = pooler(last_hidden, inputs["attention_mask"])

# Calculate cosine similarities.
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print(f"Cosine similarity between {texts[0]} and {texts[1]} is {cosine_sim_0_1}")
print(f"Cosine similarity between {texts[0]} and {texts[2]} is {cosine_sim_0_2}")
```

Please see here for other pooling methods.
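For instance, a CLS-token pooler (a common alternative to the mean pooling used above) could be sketched as follows. This is an illustrative assumption, not the repository's implementation; the repository's pooling code is the authoritative reference for the supported options.

```python
# Hypothetical CLS-token pooler: take the hidden state of the first token.
# The attention-mask argument is kept only to mirror the pooler signature above.
cls_pooler = lambda last_hidden, att_mask: last_hidden[:, 0]
embeddings = cls_pooler(last_hidden, inputs["attention_mask"])
```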

Setups

Python

Run the following command to install the required libraries.

```bash
pip install -r requirements.txt
```

Before training, please download the datasets for training and evaluation.

```bash
bash download_all.sh
```

Evaluation

We provide evaluation code for sentence embeddings covering Semantic Textual Similarity (STS 2012-2016, STS Benchmark, SICK-Relatedness, and the extended version of the STS 2017 dataset), Short Text Clustering (eight STC benchmarks and MewsC-16), Cross-lingual Parallel Matching (Tatoeba), and Cross-lingual Text Classification (MLDoc).

Set your model name or the path to a Transformers-based checkpoint (--model_name_or_path), the pooling method (--pooler), and the task set (--task_set). See the example commands below.

Semantic Textual Similarity

```bash
python evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl-sts
```

Short Text Clustering

```bash
python downstreams/text-clustering/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg \
    --task_set cl
```

Cross-lingual Parallel Matching

```bash
python downstreams/parallel-matching/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg
```

Cross-lingual Text Classification

```bash
python downstreams/cross-lingual-transfer/evaluation.py \
    --model_name_or_path sosuke/ease-bert-base-multilingual-cased \
    --pooler avg
```

Please refer to each evaluation code for detailed descriptions of arguments.

Training

You can train an EASE model in a monolingual setting using English Wikipedia sentences or in a multilingual setting using Wikipedia sentences in 18 languages.

We provide example training scripts for both the monolingual (train_monolingual_ease.sh) and multilingual (train_multilingual_ease.sh) settings, which can be launched as shown below.
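```bash
# Monolingual setting (English Wikipedia sentences)
bash train_monolingual_ease.sh

# Multilingual setting (Wikipedia sentences in 18 languages)
bash train_multilingual_ease.sh
```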

MewsC-16

We construct MewsC-16 (Multilingual Short Text Clustering Dataset for News in 16 languages) from Wikinews. This dataset contains topic sentences from Wikinews articles in 13 categories and 16 languages. More detailed information is available in our paper, Appendix E.
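For a rough picture of how such a clustering benchmark might be scored, the sketch below clusters sentence embeddings with k-means and compares the predicted clusters to gold topic labels using V-measure. The clustering algorithm, metric, and function name here are illustrative assumptions; downstreams/text-clustering/evaluation.py defines the actual protocol behind the scores below.

```python
# Hypothetical scoring sketch for short-text clustering (assumes k-means
# and V-measure; see the repository's evaluation code for the real protocol).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def clustering_score(embeddings: np.ndarray, gold_labels, n_clusters: int) -> float:
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return v_measure_score(gold_labels, pred)
```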

Statistics and Scores

| Language | Sentences | Label types | XLM-R (base) | EASE-XLM-R (base) |
| --- | --- | --- | --- | --- |
| ar | 2,224 | 11 | 27.9 | 27.4 |
| ca | 3,310 | 11 | 27.1 | 27.9 |
| cs | 1,534 | 9 | 25.2 | 41.2 |
| de | 6,398 | 8 | 30.5 | 39.5 |
| en | 12,892 | 13 | 25.8 | 39.6 |
| eo | 227 | 8 | 24.7 | 37.0 |
| es | 6,415 | 11 | 20.8 | 38.2 |
| fa | 773 | 9 | 37.2 | 41.5 |
| fr | 10,697 | 13 | 25.3 | 33.3 |
| ja | 1,984 | 12 | 44.0 | 47.6 |
| ko | 344 | 10 | 24.1 | 33.7 |
| pl | 7,247 | 11 | 28.8 | 39.9 |
| pt | 8,921 | 11 | 27.4 | 32.9 |
| ru | 1,406 | 12 | 20.1 | 27.2 |
| sv | 584 | 7 | 30.1 | 29.8 |
| tr | 459 | 7 | 30.7 | 44.9 |
| Avg. | | | 28.1 | 36.3 |

Note that these results differ slightly from those reported in the original paper because we further cleaned the data after publication.

Citation


```bibtex
@inproceedings{nishikawa-etal-2022-ease,
    title = "{EASE}: Entity-Aware Contrastive Learning of Sentence Embedding",
    author = "Nishikawa, Sosuke  and
      Ri, Ryokan  and
      Yamada, Ikuya  and
      Tsuruoka, Yoshimasa  and
      Echizen, Isao",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.284",
    pages = "3870--3885",
    abstract = "We present EASE, a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities. The advantage of using entity supervision is twofold: (1) entities have been shown to be a strong indicator of text semantics and thus should provide rich training signals for sentence embeddings; (2) entities are defined independently of languages and thus offer useful cross-lingual alignment supervision. We evaluate EASE against other unsupervised models both in monolingual and multilingual settings. We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks and it significantly outperforms baseline methods in multilingual settings on a variety of tasks. Our source code, pre-trained models, and newly constructed multi-lingual STC dataset are available at https://github.com/studio-ousia/ease.",
}
```
License

MIT License, Copyright (c) 2022 Studio Ousia.