# Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa)

https://arxiv.org/pdf/1911.02116.pdf

XLM-R (XLM-RoBERTa) is a generic cross-lingual sentence encoder that obtains state-of-the-art results on many cross-lingual understanding (XLU) benchmarks. It was trained on 2.5TB of filtered CommonCrawl data covering 100 languages (listed below).
Language | Language | Language | Language | Language |
---|---|---|---|---|
Afrikaans | Albanian | Amharic | Arabic | Armenian |
Assamese | Azerbaijani | Basque | Belarusian | Bengali |
Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese |
Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian |
Czech | Danish | Dutch | English | Esperanto |
Estonian | Filipino | Finnish | French | Galician |
Georgian | German | Greek | Gujarati | Hausa |
Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic |
Indonesian | Irish | Italian | Japanese | Javanese |
Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) |
Kyrgyz | Lao | Latin | Latvian | Lithuanian |
Macedonian | Malagasy | Malay | Malayalam | Marathi |
Mongolian | Nepali | Norwegian | Oriya | Oromo |
Pashto | Persian | Polish | Portuguese | Punjabi |
Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian |
Sindhi | Sinhala | Slovak | Slovenian | Somali |
Spanish | Sundanese | Swahili | Swedish | Tamil |
Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish |
Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek |
Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish |
Model | Description | #params | vocab size | Download |
---|---|---|---|---|
`xlmr.base` | XLM-R using the BERT-base architecture | 250M | 250k | xlmr.base.tar.gz |
`xlmr.large` | XLM-R using the BERT-large architecture | 560M | 250k | xlmr.large.tar.gz |

(Note: the above are the final model checkpoints. If you were using the previously released v0 version, we recommend switching to these; they have the same architecture and dictionary.)
**XNLI test accuracy:**

Model | average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
roberta.large.mnli (TRANSLATE-TEST) | 77.8 | 91.3 | 82.9 | 84.3 | 81.2 | 81.7 | 83.1 | 78.3 | 76.8 | 76.6 | 74.2 | 74.1 | 77.5 | 70.9 | 66.7 | 66.8 |
xlmr.large (TRANSLATE-TRAIN-ALL) | 83.6 | 89.1 | 85.1 | 86.6 | 85.7 | 85.3 | 85.9 | 83.5 | 83.2 | 83.1 | 83.7 | 81.5 | 83.7 | 81.6 | 78.0 | 78.1 |
**MLQA results (F1 / EM):**

Model | average | en | es | de | ar | hi | vi | zh |
---|---|---|---|---|---|---|---|---|
BERT-large | - | 80.2 / 67.4 | - | - | - | - | - | - |
mBERT | 57.7 / 41.6 | 77.7 / 65.2 | 64.3 / 46.6 | 57.9 / 44.3 | 45.7 / 29.8 | 43.8 / 29.7 | 57.1 / 38.6 | 57.5 / 37.3 |
xlmr.large | 70.7 / 52.7 | 80.6 / 67.8 | 74.1 / 56.0 | 68.5 / 53.6 | 63.1 / 43.5 | 69.2 / 51.6 | 71.3 / 50.9 | 68.0 / 45.4 |
Load XLM-R from torch.hub:

```python
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
Alternatively, download the checkpoint and load it directly in fairseq:

```bash
# Download the xlmr.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/xlmr.large.tar.gz
tar -xzvf xlmr.large.tar.gz
```

```python
# Load the model in fairseq
from fairseq.models.roberta import XLMRModel
xlmr = XLMRModel.from_pretrained('/path/to/xlmr.large', checkpoint_file='model.pt')
xlmr.eval()  # disable dropout (or leave in train mode to finetune)
```
Encode and decode text in several languages:

```python
en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378, 8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

hi_tokens = xlmr.encode('नमस्ते दुनिया')
assert hi_tokens.tolist() == [0, 68700, 97883, 29405, 2]
xlmr.decode(hi_tokens)  # 'नमस्ते दुनिया'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens)  # 'مرحبا بالعالم'

fr_tokens = xlmr.encode('Bonjour le monde')
assert fr_tokens.tolist() == [0, 84602, 95, 11146, 2]
xlmr.decode(fr_tokens)  # 'Bonjour le monde'
```
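Note the common frame in the token IDs above: every encoded sentence starts with ID 0 (the `<s>` / BOS symbol) and ends with ID 2 (the `</s>` / EOS symbol), with the subword IDs in between. A minimal sketch of that framing (the `wrap` helper is hypothetical, not part of fairseq):

```python
BOS, EOS = 0, 2  # <s> and </s> IDs in the XLM-R dictionary

def wrap(subword_ids):
    """Frame a list of subword IDs the way xlmr.encode() does."""
    return [BOS] + list(subword_ids) + [EOS]

# 'Hello world!' maps to subword IDs [35378, 8999, 38] (from the assert above)
assert wrap([35378, 8999, 38]) == [0, 35378, 8999, 38, 2]
```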
Extract features from the model:

```python
# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
```
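A common next step is to pool the per-token features into a single sentence vector, e.g. by averaging over the token axis. A minimal sketch using NumPy on a dummy array with the same shape as `last_layer_features` (real usage would pass the tensor returned by `extract_features`; the `mean_pool` helper is an illustration, not a fairseq API):

```python
import numpy as np

def mean_pool(features):
    """Average token features of shape (batch, seq_len, dim) into (batch, dim)."""
    return features.mean(axis=1)

# Dummy stand-in for last_layer_features: batch=1, 6 tokens, 1024 dims
dummy = np.random.rand(1, 6, 1024)
sentence_vec = mean_pool(dummy)
assert sentence_vec.shape == (1, 1024)
```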
Citation:

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}
```