PaddlePaddle / ERNIE

[Figure: ERNIE milestones (./.metas/ERNIE_milestone.png)]

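ERNIE is a knowledge-enhanced continual-learning framework for semantic understanding, pioneered by Baidu. It combines large-scale pre-training with rich knowledge from multiple sources and, through continual learning, keeps absorbing lexical, structural, and semantic knowledge from massive text corpora, so that model quality improves over time. ERNIE significantly outperforms the previous state of the art on 16 public datasets covering sentiment analysis, text matching, natural language inference, lexical analysis, reading comprehension, and question answering. On GLUE, the authoritative general language understanding benchmark, it was the first model to score above 90 and ranked first worldwide. At SemEval 2020, the world's largest semantic evaluation, which concluded in March 2020, ERNIE won five first places. The technology was also covered on the official website of MIT Technology Review, and the related work was accepted at the top academic conferences AAAI and IJCAI. ERNIE is widely deployed in industry, for example in search engines, news recommendation, advertising systems, voice interaction, and intelligent customer service.

Note: the code for older ERNIE versions has moved to the repro branch. We encourage you to develop with the newly upgraded ERNIE toolkit, which combines dynamic and static graphs. You are also welcome to try EasyDL for more features (such as ERNIE 2.0, ERNIE 2.1, and ERNIE domain models).

[Learn more]

News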

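  • 2020.12.29:

    • The ERNIE open-source toolkit has been fully upgraded to PaddlePaddle v2.0.
    • All demo tutorials now use AMP (mixed-precision training), with an average speedup of 2.3x.
    • Gradient accumulation has been added, so the ERNIE-large model can run with 8 GB of GPU memory.
  • 2020.9.24:

    • ERNIE-ViL is now open source! (link)
      • A knowledge-enhanced pre-training framework for vision-language tasks, and the first to bring structured knowledge into vision-language pre-training.
        • Uses knowledge from scene graphs to build object, attribute, and relationship prediction tasks that capture fine-grained semantic alignment across modalities.
      • Achieves the best results on five vision-language downstream tasks and ranks first on the Visual Commonsense Reasoning leaderboard.
  • 2020.5.20:

    • Try the dynamic-graph implementation of ERNIE:
      • Eager execution: what you see is what you get.
      • Large-scale distributed training.
      • Easy to deploy.
      • Get started with NLP quickly through the AIStudio tutorials.
      • Backward compatible with old-style checkpoints.
    • ERNIE-GEN is now open source! (link)
      • A state-of-the-art pre-trained model for text generation is now open source; the work was accepted at IJCAI-2020.
        • The first to extend ERNIE's pre-training capabilities to text generation, achieving the best results on several typical tasks.
        • All models reported in the paper (base/large/large-430G) are available for download now.
      • The first to add a span-by-span generation task during pre-training, so the model can generate a semantically complete span at each step.
      • Proposes an infilling generation mechanism and a noise-aware mechanism to alleviate exposure bias.
      • An elegant Multi-Flow Attention implementation.
  • 2020.4.30: Released ERNIESage, a new graph neural network model that uses ERNIE as its aggregator. Implemented with PGL.

  • 2020.3.27: Won first place in five SemEval 2020 subtasks.

  • 2019.12.26: Ranked first on the GLUE leaderboard.

  • 2019.11.6: Released ERNIE Tiny.

  • 2019.7.7: Released ERNIE 2.0.

  • 2019.3.16: Released ERNIE 1.0.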

Quick Start

import numpy as np
import paddle as P
from ernie.tokenizing_ernie import ErnieTokenizer
from ernie.modeling_ernie import ErnieModel

model = ErnieModel.from_pretrained('ernie-1.0')    # downloads the pretrained model from the server; requires a network connection
model.eval()
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')

ids, _ = tokenizer.encode('hello world')
ids = P.to_tensor(np.expand_dims(ids, 0))  # insert extra `batch` dimension
pooled, encoded = model(ids)                 # eager execution
print(pooled.numpy())                        # convert results to numpy
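
The snippet above feeds a single sentence, so adding a batch dimension is enough. To batch several sentences, their id sequences first need a common length. The following is a minimal sketch of right-padding, assuming a padding id of 0; `pad_batch` and `PAD_ID` are illustrative names and not part of the ernie package, and the ids are made up rather than produced by the tokenizer.

```python
import numpy as np

PAD_ID = 0  # assumption: padding token id; check your tokenizer's vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad variable-length id sequences into one [BATCH, SEQLEN] int64 array."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


# Hypothetical token ids standing in for the output of tokenizer.encode(...)
example = pad_batch([[1, 17, 23, 2], [1, 42, 2]])
print(example.shape)  # (2, 4)
```

The resulting array can be passed to `P.to_tensor` in place of the single expanded sequence.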

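Tutorials

No GPU at hand? Try ERNIE directly in AIStudio. (Please choose the latest version of each tutorial and apply for a GPU runtime environment.)

  1. Learn ERNIE from scratch
  2. Sentiment analysis
  3. Cloze test
  4. Knowledge distillation
  5. When in doubt, ask ERNIE
  6. Loading and reading old-style checkpoints
  7. Writing poetry with ERNIE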

Installation

1. Install PaddlePaddle

This project requires PaddlePaddle 1.7.0+. Please see here for instructions on installing PaddlePaddle.

2. Install the ERNIE toolkit

pip install paddle-ernie

or

git clone https://github.com/PaddlePaddle/ERNIE.git --depth 1
cd ERNIE
pip install -r requirements.txt
pip install -e .

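propeller is a high-level framework that assists model training and includes common NLP pre-processing and post-processing pipelines. You can import propeller by adding the root directory of this repo to PYTHONPATH: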

export PYTHONPATH=$PWD:$PYTHONPATH

3. Download pretrained models (optional)

| Model | Details | Abbreviation |
| --- | --- | --- |
| ERNIE 1.0 Base for Chinese | Layer:12, Hidden:768, Heads:12 | ernie-1.0 |
| ERNIE Tiny | Layer:3, Hidden:1024, Heads:16 | ernie-tiny |
| ERNIE 2.0 Base for English | Layer:12, Hidden:768, Heads:12 | ernie-2.0-en |
| ERNIE 2.0 Large for English | Layer:24, Hidden:1024, Heads:16 | ernie-2.0-large-en |
| ERNIE Gen Base for English | Layer:12, Hidden:768, Heads:12 | ernie-gen-base-en |
| ERNIE Gen Large for English | Layer:24, Hidden:1024, Heads:16 | ernie-gen-large-en |
| ERNIE Gen Large 430G for English | Layer:24, Hidden:1024, Heads:16, plus 430 GB of additional pre-training corpus | ernie-gen-large-430g-en |
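
4. Download datasets

English datasets

Run the script to download the GLUE datasets.

Arrange the data directory in the following layout so that the demo tutorials below can use it (pass the data path to the training script with the --data_dir parameter):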

data/xnli
├── dev
│   └── 1
├── test
│   └── 1
└── train
    └── 1
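
Before launching a training run, it can help to confirm that the data directory matches this layout. The sketch below is a hypothetical helper, not part of the ERNIE toolkit: it only checks that each of the train, dev, and test sub-directories exists and is non-empty, and it assumes the entry named `1` is an ordinary file, which the layout above does not specify.

```python
import tempfile
from pathlib import Path

EXPECTED_SPLITS = ("train", "dev", "test")


def missing_splits(data_dir):
    """Return the split names whose sub-directory is absent or empty."""
    root = Path(data_dir)
    missing = []
    for split in EXPECTED_SPLITS:
        split_dir = root / split
        if not split_dir.is_dir() or not any(split_dir.iterdir()):
            missing.append(split)
    return missing


# Demo on a temporary directory that deliberately lacks the test split.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "dev"):
        (Path(tmp) / split).mkdir()
        (Path(tmp) / split / "1").write_text("placeholder\n")  # assumed to be a file
    result = missing_splits(tmp)
print(result)  # ['test']
```

Calling `missing_splits("./data/xnli")` on a correctly prepared directory returns an empty list.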

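Sample data (MNLI test and training sets).

Chinese datasets

| Dataset | Description |
| --- | --- |
| XNLI | XNLI is a natural language inference dataset built jointly by researchers at Facebook and New York University, covering 15 languages. We use its Chinese data to evaluate the model's language understanding. (link) |
| ChnSentiCorp | ChnSentiCorp is a Chinese sentiment analysis dataset containing online shopping reviews of hotels, laptops, and books. |
| MSRA-NER | The MSRA-NER (SIGHAN2006) dataset was released by Microsoft Research Asia. The task is to recognize named entities in text, including person, location, and organization names. |
| NLPCC2016-DBQA | NLPCC2016-DBQA is an evaluation task held in 2016 by NLPCC, the International Conference on Natural Language Processing and Chinese Computing. The goal is to select, from a set of candidates, the document that answers a question. (link) |
| CMRC2018 | CMRC2018 is an evaluation held by the Chinese Information Processing Society of China. The task is extractive reading comprehension. (link) |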

Supported NLP tasks

  • Finetune with the dynamic-graph model:
python3 ./ernie_d/demo/finetune_classifier.py \
       --from_pretrained ernie-1.0 \
       --data_dir ./data/xnli
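  • Add --use_amp to enable AMP (enable it only on devices that support Tensor Cores).

  • Use --bsz to set the global batch size (the number of samples the model sees in one optimization step) and --micro_bsz to set the number of samples fed to each GPU. If --bsz > --micro_bsz, the script automatically enables gradient accumulation.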
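
To illustrate why gradient accumulation lets a small micro-batch stand in for a large global batch, here is a minimal pure-Python sketch. It assumes a single GPU and that the number of accumulation steps is bsz // micro_bsz; how the actual script derives this number is not stated here, and the function names are hypothetical. The toy loss is a mean squared error with one weight.

```python
def accumulation_steps(bsz, micro_bsz):
    """Number of micro-batches to accumulate before one optimizer step."""
    if bsz % micro_bsz != 0:
        raise ValueError("bsz must be a multiple of micro_bsz")
    return bsz // micro_bsz


def accumulated_gradient(samples, weight, micro_bsz):
    """Gradient of mean((weight * x - y)^2) w.r.t. weight, built from micro-batches."""
    steps = accumulation_steps(len(samples), micro_bsz)
    total = 0.0
    for i in range(steps):
        chunk = samples[i * micro_bsz:(i + 1) * micro_bsz]
        micro_grad = sum(2 * (weight * x - y) * x for x, y in chunk) / len(chunk)
        total += micro_grad / steps  # scale so the sum equals the full-batch mean
    return total


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
full = sum(2 * (0.5 * x - y) * x for x, y in data) / len(data)
print(accumulation_steps(32, 8))                               # 4
print(abs(accumulated_gradient(data, 0.5, 2) - full) < 1e-9)   # True
```

Because the micro-batches have equal size, averaging their gradients reproduces the gradient of the full batch, so only one micro-batch needs to fit in GPU memory at a time.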

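  • Distributed finetune

paddle.distributed.launch is a process manager. We use it to start one Python process per GPU and to set the environment variables needed for distributed training.

In distributed training, we use max_steps rather than epochs as the stopping condition, in order to avoid deadlocks between processes. You can compute the required max_steps as EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH. Note also that the training set must be sharded across processes; otherwise every process trains on the same data, which leads to overfitting.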
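
The two points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the toolkit's implementation: the function names are hypothetical, rounding up to a whole step is a choice made here, the example numbers are invented, and striding by rank is only one simple way to give each process a disjoint shard.

```python
import math


def max_steps(epochs, num_train_examples, total_batch):
    """EPOCH * NUM_TRAIN_EXAMPLES / TOTAL_BATCH, rounded up to a whole step."""
    return math.ceil(epochs * num_train_examples / total_batch)


def shard(examples, rank, world_size):
    """Keep every world_size-th example, starting at this process's rank."""
    return examples[rank::world_size]


print(max_steps(epochs=3, num_train_examples=100000, total_batch=256))  # 1172
print(shard(list(range(10)), rank=1, world_size=4))                     # [1, 5, 9]
```

Here `total_batch` stands for the number of samples consumed per optimization step across all processes.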

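Example script (make sure you have two or more GPUs. Online model download does not work under paddle.distributed.launch, so you may need to download the pretrained model first by running a single-GPU finetune, or download and extract it manually as described here):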

python3 -m paddle.distributed.launch \
./demo/finetune_classifier_distributed.py \
    --data_dir data/mnli \
    --max_steps 10000 \
    --from_pretrained ernie-2.0-en

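More example scripts:

  1. Sentiment analysis
  2. Semantic matching
  3. Named entity recognition (NER)
  4. Machine reading comprehension (requires a multi-GPU environment; see the section on distributed finetune above)
  5. Text summarization
  6. Text classification with static graphs

Recommended hyperparameters:

| Task | Batch size | Learning rate |
| --- | --- | --- |
| CoLA | 32 / 64 (base) | 3e-5 |
| SST-2 | 64 / 256 (base) | 2e-5 |
| STS-B | 128 | 5e-5 |
| QQP | 256 | 3e-5(base)/5e-5(large) |
| MNLI | 256 / 512 (base) | 3e-5 |
| QNLI | 256 | 2e-5 |
| RTE | 16 / 4 (base) | 2e-5(base)/3e-5(large) |
| MRPC | 16 / 32 (base) | 3e-5 |
| WNLI | 8 | 2e-5 |
| XNLI | 512 | 1e-4(base)/4e-5(large) |
| CMRC2018 | 64 | 3e-5 |
| DRCD | 64 | 5e-5(base)/3e-5(large) |
| MSRA-NER(SIGHAN2006) | 16 | 5e-5(base)/1e-5(large) |
| ChnSentiCorp | 24 | 5e-5(base)/1e-5(large) |
| LCQMC | 32 | 2e-5(base)/5e-6(large) |
| NLPCC2016-DBQA | 64 | 2e-5(base)/1e-5(large) |
| VCR | 64 | 2e-5(base)/2e-5(large) |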

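Pre-training (ERNIE 1.0)

See here.

Online inference

If the --inference_model_dir parameter is passed to finetune_classifier.py, the finetune script serializes your model and produces an inference_model that can be deployed directly for online inference.

For implementation details on using the online inference code in production, see the C++ inference API. Alternatively, you can start a multi-GPU inference service with propeller (requires a GPU environment) by running:

python -m propeller.tools.start_server -m /path/to/saved/inference_model  -p 8881

The service can then be accessed from Python with the following code (Python 3 only):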

import numpy as np
from propeller.service.client import InferenceClient
from ernie.tokenizing_ernie import ErnieTokenizer

client = InferenceClient('tcp://localhost:8881')
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
ids, sids = tokenizer.encode('hello world')
ids = np.expand_dims(ids, 0)
sids = np.expand_dims(sids, 0)
result = client(ids, sids)

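You can also download a pre-built inference_model of ernie-1.0 base from here. This model has not been finetuned; it is typically used for feature-based finetuning of upper-layer model structures, or as a text feature extractor. Because this model was produced with the old API, client requests need an extra dimension appended to the end of each input tensor: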

ids = np.expand_dims(ids, -1) # ids.shape==[BATCH, SEQLEN, 1]
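
As a quick shape check, the sketch below applies the same trailing-axis expansion to both inputs of the client example. The ids are made up for illustration, and treating `sids` the same way as `ids` is an assumption based on the statement that input tensors need the extra dimension.

```python
import numpy as np

# Hypothetical token ids and sentence ids, already batched: shape [BATCH, SEQLEN]
ids = np.array([[1, 17, 23, 2]], dtype=np.int64)
sids = np.zeros_like(ids)

# Append a trailing axis for the old-API model: shape [BATCH, SEQLEN, 1]
ids_legacy = np.expand_dims(ids, -1)
sids_legacy = np.expand_dims(sids, -1)

print(ids.shape, ids_legacy.shape)  # (1, 4) (1, 4, 1)
```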

Distillation

Knowledge distillation is an effective way to compress and accelerate ERNIE models. For implementation details, see here.

Citation

ERNIE 1.0

@article{sun2019ernie,
  title={Ernie: Enhanced representation through knowledge integration},
  author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Chen, Xuyi and Zhang, Han and Tian, Xin and Zhu, Danxiang and Tian, Hao and Wu, Hua},
  journal={arXiv preprint arXiv:1904.09223},
  year={2019}
}

ERNIE 2.0

@article{sun2019ernie20,
  title={ERNIE 2.0: A Continual Pre-training Framework for Language Understanding},
  author={Sun, Yu and Wang, Shuohuan and Li, Yukun and Feng, Shikun and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:1907.12412},
  year={2019}
}

ERNIE-GEN

@article{xiao2020ernie-gen,
  title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
  author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2001.11314},
  year={2020}
}

ERNIE-ViL

@article{yu2020ernie,
  title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
  author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2006.16934},
  year={2020}
}

To reproduce all experiments in the papers, switch to the repro branch of this repo.

Discussion groups

  • ERNIE homepage
  • Github Issues: bug reports, feature requests, install issues, usage issues, etc.
  • QQ group: 760439550 (ERNIE discussion group).
  • QQ group 2: 958422639 (ERNIE discussion group-v2).
  • Forums: discuss implementations, research, etc.
