The hardware configuration for Baichuan-7B training is as follows:
Hardware | Configuration |
---|---|
NPU | 8 x Ascend NPUs |
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt
# python3.8
conda create -n test python=3.8
conda activate test
# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Install the acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..
# Install the remaining dependencies
pip install -r requirements.txt
Download the pretrained weights from Hugging Face:
mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
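The per-file wget calls above can also be written as a loop. The following is a minimal sketch, not part of ModelLink: `hf_url`, `hf_fetch`, and `BAICHUAN_7B_FILES` are hypothetical names introduced here, and the file list simply mirrors the commands above.

```shell
# Hypothetical helper: build the download URL for one file of a Hugging Face repo.
hf_url() {
    printf 'https://huggingface.co/%s/resolve/main/%s\n' "$1" "$2"
}

# Hypothetical helper: fetch the listed files into a directory.
# wget -c resumes partial downloads, -P sets the target directory.
hf_fetch() {
    local dest="$1" repo="$2" f
    shift 2
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        wget -c -P "$dest" "$(hf_url "$repo" "$f")" || return 1
    done
}

# Same ten files as the individual wget commands.
BAICHUAN_7B_FILES="config.json configuration_baichuan.py generation_config.json handler.py modeling_baichuan.py pytorch_model.bin special_tokens_map.json tokenization_baichuan.py tokenizer.model tokenizer_config.json"

# Usage (starts the real download; relies on word splitting of the list):
# hf_fetch ./baichuan-7B-hf baichuan-inc/Baichuan-7B $BAICHUAN_7B_FILES
```

Stopping at the first failed download keeps a partial directory from being mistaken for a complete one; rerunning the same call resumes where it left off.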
Convert the model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model on Megatron):
mkdir baichuan-7B-mt
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./baichuan-7B-hf \
--save-dir ./baichuan-7B-mt \
--tokenizer-model ./baichuan-7B-hf/tokenizer.model \
--w-pack True
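Because the conversion reads several gigabytes of weights, it can help to confirm that the input directory is complete before launching it. The sketch below is only an illustration: `check_files` is a hypothetical helper, and the files to check are taken from the download step rather than from any documented requirement of tools/checkpoint/util.py.

```shell
# Hypothetical pre-flight check: report every listed file that is missing
# or empty in a directory; returns non-zero if any is.
check_files() {
    local dir="$1" f missing=0
    shift
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage before running the conversion:
# check_files ./baichuan-7B-hf config.json tokenizer.model pytorch_model.bin || exit 1
```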
Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):
cd ModelLink/
# Adjust the set_env.sh path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ../Baichuan7B-v0.1-pt8-pp1 \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ../Baichuan7B_downloaded # <-- Path to the original HF model; the new weights are saved to ../Baichuan7B_downloaded/mg2hg
Download the dataset for Baichuan-7B:
# Download the dataset
mkdir dataset-baichuan-7B
cd ./dataset-baichuan-7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
# Preprocess the dataset
python ./tools/preprocess_data.py \
--input ./dataset-baichuan-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-7B-hf \
--output-prefix ./dataset-baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
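The training step refers to the preprocessed data by the prefix ./dataset-baichuan-7B/alpaca_text_document. Assuming the preprocessing tool writes a Megatron-style indexed dataset, that is, a `.bin` and an `.idx` file sharing this prefix (an assumption, not something stated in this README), a quick existence check could look like this; `dataset_ready` is a hypothetical name.

```shell
# Hypothetical check: succeed only if both files of an indexed dataset exist
# and are non-empty. The .bin/.idx pair is an assumed Megatron-style layout.
dataset_ready() {
    local prefix="$1"
    [ -s "$prefix.bin" ] && [ -s "$prefix.idx" ]
}

# Usage:
# dataset_ready ./dataset-baichuan-7B/alpaca_text_document \
#     || echo "run tools/preprocess_data.py first" >&2
```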
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-7B-mt"
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
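The four variables above are values to set inside the launch script before running it. If the script defines them as top-level `NAME="..."` assignments (an assumption about its layout), they can be filled in without opening an editor. `set_var` below is a hypothetical helper, not a ModelLink tool.

```shell
# Hypothetical helper: rewrite a top-level NAME="..." assignment in a script.
# Values must not contain |, & or backslashes, which sed would interpret.
set_var() {
    local file="$1" name="$2" value="$3" tmp rc
    grep -q "^${name}=" "$file" || { echo "no ${name}= line in $file" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # | is the sed delimiter because the values are paths containing /.
    # Writing back with cat keeps the script's permissions unchanged.
    sed "s|^${name}=.*|${name}=\"${value}\"|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage:
# set_var examples/baichuan/pretrain_baichuan_ptd_7B.sh CKPT_SAVE_DIR ./ckpt
# set_var examples/baichuan/pretrain_baichuan_ptd_7B.sh CKPT_LOAD_DIR ./baichuan-7B-mt
```

The helper refuses to act when the variable is absent, so a misspelled name does not silently leave the script unchanged.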
Performance of Baichuan-7B on Ascend NPUs compared with the reference device:
Device | Model | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Time per step (s/step) |
---|---|---|---|---|---|
NPUs | Baichuan-7B | 1000 | 5.24 | 2685 | 6.1 |
Reference | Baichuan-7B | - | - | 2036 | - |
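To read the table as a single number, the two token-throughput figures can be divided. This is plain arithmetic on the values reported above; `speedup` is a hypothetical helper name.

```shell
# Ratio of two throughput figures, rounded to two decimals.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 2685 2036   # Baichuan-7B, Ascend vs. reference: prints 1.32
```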
First, configure the Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh
# Run set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set the model weight path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
Then launch generate_baichuan_7b_ptd.sh directly:
bash tasks/inference/generate_baichuan_7b_ptd.sh
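An inference example is shown below:
We evaluate the model on the BoolQ benchmark. Benchmark download.
# Set the paths to the original weights and the tokenizer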
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# Set the task and the data path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh
Task | Validation set | Model | Ascend | Community |
---|---|---|---|---|
Boolq | test | Baichuan 7B | 0.69 | 0.67 |
The hardware configuration for Baichuan-13B training is as follows:
Hardware | Configuration |
---|---|
NPU | 8 x Ascend NPUs |
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt
mkdir ckpt_lora
# python3.8
conda create -n test python=3.8
conda activate test
# Install torch and torch_npu
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Install the acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..
# Install the remaining dependencies
pip install -r requirements.txt
**Note:** If a later step fails with the error AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model', run the following command to resolve it:
pip install transformers==4.32.0 --force-reinstall
Download the pretrained weights from Hugging Face:
mkdir baichuan-13B-hf
cd ./baichuan-13B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ..
Convert the Baichuan-13B model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model on Megatron):
mkdir baichuan-13B-mt
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py \
--model-type GPT \
--loader llama2_hf \
--saver megatron \
--target-tensor-parallel-size 8 \
--load-dir ./baichuan-13B-hf \
--save-dir ./baichuan-13B-mt \
--tokenizer-model ./baichuan-13B-hf/tokenizer.model \
--w-pack True
Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):
cd ModelLink/
# Adjust the set_env.sh path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
--loader megatron \
--saver megatron \
--save-model-type save_huggingface_llama \
--load-dir ../Baichuan13B-v0.1-pt8-pp1 \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--w-pack True \
--save-dir ../Baichuan13B_downloaded # <-- Path to the original HF model; the new weights are saved to ../Baichuan13B_downloaded/mg2hg
Download the Baichuan-13B dataset:
mkdir dataset-baichuan-13B
cd ./dataset-baichuan-13B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..
python ./tools/preprocess_data.py \
--input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-13B-hf \
--output-prefix ./dataset-baichuan-13B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-13B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-13B-mt"
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh
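After training has saved at least one checkpoint, it is useful to confirm what ended up in CKPT_SAVE_DIR. Megatron-style checkpoint directories normally carry a tracker file named latest_checkpointed_iteration.txt; that this ModelLink version writes the same file is an assumption here, and `latest_iteration` is a hypothetical helper.

```shell
# Hypothetical helper: print the iteration recorded in a Megatron-style
# checkpoint directory (assumes the latest_checkpointed_iteration.txt tracker).
latest_iteration() {
    local tracker="$1/latest_checkpointed_iteration.txt"
    [ -s "$tracker" ] || { echo "no checkpoint tracker in $1" >&2; return 1; }
    # Strip whitespace so the value can be compared or reused directly.
    tr -d '[:space:]' < "$tracker"
    echo
}

# Usage:
# latest_iteration ./ckpt
```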
Performance of Baichuan-13B on Ascend NPUs compared with the reference device:
Device | Model | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Time per step (s/step) |
---|---|---|---|---|---|
NPUs | Baichuan-13B | 1000 | 2.37 | 1213 | 13.5 |
Reference | Baichuan-13B | - | - | 862 | - |
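The sample and token throughput columns can be cross-checked against each other: tokens per second per device equals samples per second times the sequence length, divided by the number of devices. The sketch below assumes a sequence length of 4096 tokens, which is not stated in this README, together with the 8 NPUs from the hardware table; `tokens_per_device` is a hypothetical helper.

```shell
# tokens/s/p = samples/s * sequence length / number of devices.
# The sequence length passed below (4096) is an assumption.
tokens_per_device() {
    awk -v sps="$1" -v seq="$2" -v n="$3" 'BEGIN { printf "%.0f\n", sps * seq / n }'
}

tokens_per_device 2.37 4096 8   # Baichuan-13B: prints 1213, matching the table
```

Under the same assumption the Baichuan-7B row gives 2683 from its rounded 5.24 samples/s, close to the reported 2685.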
Baichuan-13B supports LoRA fine-tuning.
Fine-tuning uses an instruction-tuning dataset, prepared as follows; note the added --handler-name GeneralInstructionHandler option:
mkdir alpaca_preprocessed
python tools/preprocess_data.py \
--input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--output-prefix ./alpaca_preprocessed/alpaca \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ./baichuan-13B-hf \
--tokenizer-not-use-fast \
--handler-name GeneralInstructionHandler \
--append-eod
Configure the Baichuan-13B LoRA script: tasks/finetune/tune_baichuan_ptd_13B.sh
# Set the checkpoint save path, dataset path, weight path, and tokenizer path
CKPT_SAVE_DIR="./ckpt_lora"
DATA_PATH="./alpaca_preprocessed/alpaca"
CHECKPOINT="./baichuan-13B-mt"
TOKENIZER_PATH="./baichuan-13B-hf"
Launch the Baichuan-13B LoRA fine-tuning script: tasks/finetune/tune_baichuan_ptd_13B.sh
bash ./tasks/finetune/tune_baichuan_ptd_13B.sh
Then run inference with the fine-tuned weights:
# Set the weight path, the LoRA weight path, and the tokenizer path
CHECKPOINT="./baichuan-13B-mt"
LORA_CHECKPOINT="./ckpt_lora"
TOKENIZER_PATH="./baichuan-13B-hf"
bash ./tasks/inference/generate_baichuan_13b_lora_ptd.sh
Inference output after LoRA fine-tuning:
Configure the Baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh
# Run set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set the model weight path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"
Then launch generate_baichuan_13b_ptd.sh directly:
bash tasks/inference/generate_baichuan_13b_ptd.sh
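An inference example is shown below:
We evaluate the model on the BoolQ benchmark. Benchmark download.
# Set the paths to the original weights and the tokenizer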
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# Set the task and the data path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh
Task | Validation set | Model | Ascend | Community |
---|---|---|---|---|
Boolq | test | Baichuan 13B | 0.747 | 0.736 |