2 Star 13 Fork 52

chinnasamy / CodeGeeX

forked from CodeGeeX / CodeGeeX 
加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
Apache-2.0

🏠 主页 | 📖 博客 | 🪧 示例 | 🤖 模型下载 | 📒 API申请 | 📃 论文(即将推出!)

🛠 VS CodeJetbrains 插件 | 👋 欢迎加入 SlackTelegram微信开发者交流群 | 🌐 English

CodeGeeX vscode extension version CodeGeeX vscode extension last update CodeGeeX download

CodeGeeX: 多语言代码生成模型

CodeGeeX是一个具有130亿参数的多编程语言代码生成预训练模型。CodeGeeX采用华为MindSpore框架实现,在鹏城实验室“鹏城云脑II”中的192个节点(共1536个国产昇腾910 AI处理器)上训练而成。截至2022年6月22日,CodeGeeX历时两个月在20多种编程语言的代码语料库(>8500亿Token)上预训练得到。CodeGeeX有以下特点:

  • 高精度代码生成:支持生成Python、C++、Java、JavaScript和Go等多种主流编程语言的代码,在HumanEval-X代码生成任务上取得47%~60%求解率,较其他开源基线模型有更佳的平均性能。代码生成示例
  • 跨语言代码翻译:支持代码片段在不同编程语言间进行自动翻译转换,翻译结果正确率高,在HumanEval-X代码翻译任务上超越了其它基线模型。代码翻译示例
  • 自动编程插件:CodeGeeX插件现已上架VSCode插件市场(完全免费),用户可以通过其强大的少样本生成能力,自定义代码生成风格和能力,更好辅助代码编写。插件下载
  • 模型跨平台开源: 所有代码和模型权重开源开放,用作研究用途。CodeGeeX同时支持昇腾和英伟达平台,可在单张昇腾910或英伟达V100/A100上实现推理。申请模型权重

全新多编程语言评测基准HumanEval-X:HumanEval-X是第一个支持功能正确性评测的多语言、多任务的基准,包含820个人工编写的高质量代码生成题目、测试用例与参考答案,覆盖5种编程语言(Python、C++、Java、JavaScript、Go),支持代码生成与代码翻译能力的评测。如何使用

在HumanEval-X代码生成任务上,与其它开源基线模型相比,CodeGeeX取得了最佳的平均性能。

新闻

  • 🌟 2022-02: CodeGeeX "Coding With AI"黑客松正在进行中,为CodeGeeX设计应用并赢取奖品(RTX 4090、DJI无人机等)!

  • 2022-12-31: 我们在 codegeex-fastertransformer 中发布了 CodeGeeX 的 FasterTransformer 版本。INT8加速版本达到 <15ms/token 的平均速度。祝大家新年快乐!

  • 2022-12-13: 我们开源了VS Code插件源码:codegeex-vscode-extension,参考 QuickStart 开始开发吧!

  • 2022-12-11: CodeGeeX for Jetbrains IDEs已上线,支持IntelliJ IDEA, PyCharm, GoLand, CLion等,点击下载

  • 2022-12-04: 我们开源了量化代码(需要更少的显存:27GB -> 15GB)以及模型并行代码(可以运行在多个显存至少8GB的GPUs上)。

  • 2022-09-30: 我们开源了跨平台代码和模型权重,同时支持昇腾和英伟达平台。

使用指南

CodeGeeX最初使用Mindspore框架实现,并在昇腾910AI芯片上进行训练。为适配更多平台,我们将其转换到Megatron-LM框架,支持Pytorch+GPU环境。

安装

需要Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+,通过以下命令安装 codegeex:

git clone git@github.com:THUDM/CodeGeeX.git
cd CodeGeeX
pip install -e .

模型权重

通过该链接申请权重,您将收到一个包含临时下载链接文件urls.txt的邮件。推荐使用aria2通过以下命令快速下载(请保证有足够的硬盘空间存放权重(~26GB)):

aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt 

使用以下命令合并得到完整的权重:

cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz

用GPU进行推理

尝试使用CodeGeeX模型生成第一个程序吧!首先,在配置文件configs/codegeex_13b.sh中写明存放权重的路径。其次,将提示(可以是任意描述或代码片段)写入文件tests/test_prompt.txt,运行以下脚本即可开始推理(需指定GPU序号):

# On a single GPU (with more than 27GB RAM)
bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt

# With quantization (with more than 15GB RAM)
bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt

# On multiple GPUs (with more than 6GB RAM, need to first convert ckpt to MP_SIZE partitions)
bash ./scripts/convert_ckpt_parallel.sh <LOAD_CKPT_PATH> <SAVE_CKPT_PATH> <MP_SIZE>
bash ./scripts/test_inference_parallel.sh <MP_SIZE> ./tests/test_prompt.txt

插件使用指南

基于CodeGeeX,我们开发了免费的插件,支持 VS Code 与 Jetbrains IDEs,未来会支持更多平台。

VS Code版本,在应用市场搜索“codegeex”或通过该链接安装。详细的使用指南在CodeGeeX VS Code插件使用指南。我们也开源了VS Code插件源码:codegeex-vscode-extension,参考QuickStart 开始开发吧!

Jetbrains版本,在Plugins市场搜索“codegeex”或通过该链接安装。 请确保IDE版本在2021.1或更高。CodeGeeX目前支持 IntelliJ IDEA, PyCharm, GoLand, CLion, Android Studio, AppCode, Aqua, DataSpell, DataGrip, Rider, RubyMine, WebStorm。

CodeGeeX: 多语言代码生成模型

架构:CodeGeeX是一个基于transformers的大规模预训练编程语言模型。它是一个从左到右生成的自回归解码器,将代码或自然语言标识符(token)作为输入,预测下一个标识符的概率分布。CodeGeeX含有40个transformer层,每层自注意力块的隐藏层维数为5120,前馈层维数为20480,总参数量为130亿。模型支持的最大序列长度为2048。

左侧:CodeGeeX训练数据中各编程语言占比。 右侧:CodeGeeX训练损失函数随训练步数下降曲线。

语料:CodeGeeX的训练语料由两部分组成。第一部分是开源代码数据集,The PileCodeParrot。The Pile包含GitHub上拥有超过100颗星的一部分开源仓库,我们从中选取了23种编程语言的代码。第二部分是补充数据,直接从GitHub开源仓库中爬取Python、Java、C++代码;为了获取高质量数据,我们根据以下准则选取代码仓库:1)至少拥有1颗星;2)总大小<10MB;3)不在此前的开源代码数据集中。我们还去掉了符合下列任一条件的文件:1)平均每行长度大于100字符;2)由自动生成得到;3)含有的字母不足字母表内的40%;4)大于100KB或小于1KB。为了让模型区分不同语言,我们在每个样本的开头加上一个前缀,其形式为[注释符] language: [语言],例如:# language: Python。我们使用与GPT-2相同的分词器,并将空格处理为特殊标识符,词表大小为50400。整个代码语料含有23种编程语言、总计1587亿个标识符(不含填充符)。

国产平台实现与训练

我们在Mindspore 1.7框架上实现了CodeGeeX模型,并使用鹏城实验室的全国产计算平台上进行训练。具体来说,CodeGeeX使用了其一个计算集群中的1536个昇腾910 AI处理器(32GB)进行了两个月左右的训练(2022年4月18日至6月22日)。除了Layer-norm与Softmax使用FP32格式以获得更高的精度与稳定性,模型参数整体使用FP16格式,最终整个模型需要占用约27GB显存。为了增加训练效率,我们使用8路模型并行和192路数据并行的训练策略,微批大小为16、全局批大小为3072,并采用ZeRO-2优化器降低显存占用。

在开发与训练过程中,我们和华为Mindspore团队合作,对MindSpore框架进行了部分优化,进而大幅度提升训练效率。比如,我们发现矩阵乘法的计算时间占比仅为22.9%,大量时间被用于各类其它算子,因此实现了一系列算子融合,包括单元素算子融合、层归一化算子融合、FastGelu与矩阵乘法融合、批量矩阵乘法与加法融合等;再比如我们还对矩阵乘法算子的维度实现自动搜索调优,使其搜索出效率最高的计算维度组合。这些优化为训练速度带来了显著提升,在同等GPU卡数规模下(128卡),昇腾910对CodeGeeX这一模型的训练效率从约为NVIDIA A100的16.7%提升至43%;在千卡规模下,昇腾910训练效率相比自身优化前提升近300%。使用优化后的软硬件训练时,CodeGeeX单日训练量可达到54.3B个标识符(含填充符),证明了国产深度学习平台与工具的快速迭代能力以及强大竞争力。

HumanEval-X: 多语言代码生成基准

为了更好地评测代码生成模型的多语言生成能力,我们构建了一个新基准HumanEval-X。此前,多语言代码生成能力是基于语义相似度(比如CodeBLEU)衡量的,具有一定误导性;HumanEval-X则可用于衡量生成代码的功能正确性。HumanEval-X包含820个高质量手写样本,覆盖Python、C++、Java、JavaScript、Go,可用于多种任务。

HumanEval-X支持的任务示例。声明、描述、解答分别用红、绿、蓝色标注。代码生成将声明与描述作为输入,输出解答。代码翻译将两种语言的声明与源语言的解答作为输入,输出目标语言的解答。

HumanEval-X中每个语言的样本,包含了声明、描述和解答,它们之间的组合可以支持不同的下游任务,包括生成、翻译、概括等。我们目前关注两个任务:代码生成代码翻译。对于代码生成任务,模型将函数声明与文档字符串作为输入,输出函数实现;对于代码翻译任务,模型将两种语言的函数声明与源语言的实现作为输入,输出目标语言上的实现。我们在代码翻译任务中不将文档字符串输入模型,以避免模型直接通过描述生成答案。在两种任务下,我们都采用Codex所使用的无偏pass@k指标,判断生成代码的功能正确性: $\text{pass}@k:= \mathbb{E}[1-\frac{\tbinom{n-c}{k}}{\tbinom{n}{k}}]$, $n=200$, $k\in(1,10,100)$.

多语言代码生成

左侧: HumanEval-X中五种语言具体的pass@k(k=1,10,100)性能。右侧: 模型在所有语言上的平均性能。CodeGeeX的平均表现优于InCoder-6.7B和CodeGen-Multi-6B/16B。

我们将CodeGeeX与另外两个开源代码生成模型进行比较,分别为Meta的InCoder与Salesforce的CodeGen,选取InCoder-6.7B、CodeGen-Multi-6B 与 CodeGen-Multi-16B。CodeGeeX能获得最佳的平均性能,显著超越了参数量更小的模型(7.5%~16.3%的提升),与参数量更大的模型CodeGen-Multi-16B表现相当(平均性能 54.76% vs. 54.39%)。

跨语言代码翻译

HumanEval-X上的代码翻译任务结果。加粗结果表示在每种语言pass@k上的最佳效果。

我们还评测了模型在多语言间代码翻译上的性能。对于CodeGeeX,我们评测了未经微调的CodeGeeX-13B与经过微调的CodeGeeX-13B-FT(使用XLCoST中代码翻译任务的训练集与一部分Go语言数据微调)。如上表显示,模型对特定语言存在偏好,比如CodeGeeX擅长将其他语言翻译为Python与C++,而CodeGen-Multi-16B擅长翻译为JavaScript和Go,这可能是由于训练集中的语料占比存在差异。在20个翻译对中,我们还观察到两种语言互相翻译的表现常常是呈负相关的,这可能说明现有的模型还不足以学好所有的语言。

在线生成与翻译DEMO 我们为上述两个任务开发了DEMO:代码生成代码翻译,欢迎点击体验!
致谢
这一项目由国家自然科学基金杰出青年科学基金项目(No. 61825602)支持。 ​

学生负责人

郑勤锴(清华大学知识工程实验室),夏箫(清华大学知识工程实验室),邹旭(清华大学知识工程实验室)

技术贡献

清华大学知识工程实验室:曾奥涵,郑问迪,薛理龙

清华大学交叉信息学院:刘益枫,陈彦儒,徐奕辰(北邮大四访问清华期间研究工作)

鹏城实验室:陈庆玉,李忠琦,范高俊

智谱AI:薛宇飞,王山,陕杰才,姜皓瀚,刘璐,薛旋,张鹏

华为昇腾团队:姚逸璠,苏腾,邓启辉,周斌

数据标注

程锐杰(清华大学),于沛楠(清华大学),张竞尧(智谱AI),黄铂文(智谱AI),王炤宇(智谱AI)

指导教师

杨植麟(清华大学交叉信息学院),东昱晓(清华大学知识工程实验室),陈文光(清华大学PACMAN实验室),唐杰(清华大学知识工程实验室)

计算资源支持

鹏城实验室

智谱AI

项目总负责

唐杰(清华大学知识工程实验室 & 北京智源人工智能研究院)

如果遇到问题或有任何建议,欢迎通过邮件与我们联系codegeex@aminer.cn.

许可证

代码使用Apache-2.0许可证

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

简介

CodeGeeX是一个具有130亿参数的多编程语言代码生成预训练模型 展开 收起
Python
Apache-2.0
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
Python
1
https://gitee.com/chinnasamy/CodeGeeX.git
git@gitee.com:chinnasamy/CodeGeeX.git
chinnasamy
CodeGeeX
CodeGeeX
main

搜索帮助

53164aa7 5694891 3bd8fe86 5694891