BaiChuan


Baichuan-7B

Training

The hardware configuration for Baichuan-7B training is as follows:

Hardware	Configuration
NPU	8 x Ascend NPUs

Script

Clone the repository to your server:

git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt

Set up the environment

# python3.8
conda create -n test python=3.8
conda activate test

# Install torch, torch_npu, and apex
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Install the AscendSpeed acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# Install the remaining dependencies
pip install -r requirements.txt

(Optional) Prepare pretrained weights

Download the pretrained weights from Hugging Face:

mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
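
The wget calls above differ only in the file name, so the URLs can be generated from a list. The following is a minimal sketch, not part of ModelLink: `hf_file_urls` and `BAICHUAN_7B_FILES` are hypothetical names, while the repository ID and file names are copied from the commands above.

```shell
# Hypothetical helper: print one download URL per file for a Hugging Face repo.
# Usage: hf_file_urls <repo-id> <file>...
hf_file_urls() {
    repo="$1"
    shift
    for f in "$@"; do
        printf 'https://huggingface.co/%s/resolve/main/%s\n' "$repo" "$f"
    done
}

# File list copied from the wget commands above.
BAICHUAN_7B_FILES="config.json configuration_baichuan.py generation_config.json
handler.py modeling_baichuan.py pytorch_model.bin special_tokens_map.json
tokenization_baichuan.py tokenizer.model tokenizer_config.json"

# Word splitting of $BAICHUAN_7B_FILES is intentional.
# To download (resuming partial files), pipe the URLs into wget:
#   hf_file_urls baichuan-inc/Baichuan-7B $BAICHUAN_7B_FILES | wget -c -i - -P ./baichuan-7B-hf
```

Using `wget -c` lets an interrupted download of the large pytorch_model.bin resume instead of starting over.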

Weight conversion

Convert the model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model on Megatron):

mkdir baichuan-7B-mt

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh
   
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./baichuan-7B-hf \
    --save-dir ./baichuan-7B-mt \
    --tokenizer-model ./baichuan-7B-hf/tokenizer.model \
    --w-pack True
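
After the conversion, it can be worth checking that the output holds one shard per tensor-parallel rank (8 in the command above). The sketch below assumes the usual Megatron checkpoint layout, `<save-dir>/<iter_* or release>/mp_rank_XX/`; this layout is an assumption and is not stated in this document. `count_tp_ranks` is a hypothetical helper name.

```shell
# Count rank directories (mp_rank_*) two levels below a checkpoint directory.
# Assumed layout: <save-dir>/<iter_* or release>/mp_rank_XX/
# With pipeline parallelism > 1 the count would be TP x PP, not TP alone.
count_tp_ranks() {
    find "$1" -mindepth 2 -maxdepth 2 -type d -name 'mp_rank_*' | wc -l | tr -d ' '
}

# Usage:
#   [ "$(count_tp_ranks ./baichuan-7B-mt)" = "8" ] || echo "unexpected number of shards"
```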

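Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):

cd ModelLink/
# Adjust the set_env.sh path to match your environment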
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
    --loader megatron \
    --saver megatron \
    --save-model-type save_huggingface_llama \
    --load-dir ../Baichuan7B-v0.1-pt8-pp1 \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --w-pack True \
    --save-dir ../Baichuan7B_downloaded     # <-- set this to the original HF model path; the new weights are written to ../Baichuan7B_downloaded/mg2hg

Prepare the dataset

Download the dataset for Baichuan-7B:

# Download the dataset
mkdir dataset-baichuan-7B
cd ./dataset-baichuan-7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# Preprocess the dataset
python ./tools/preprocess_data.py \
    --input ./dataset-baichuan-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan-7B-hf \
    --output-prefix ./dataset-baichuan-7B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF
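
The preprocessing step writes an indexed dataset as a .bin/.idx pair, and the DATA_PATH used by the pretraining script (./dataset-baichuan-7B/alpaca_text_document) is the common prefix of that pair. A minimal sketch of a check before launching training; `dataset_ready` is a hypothetical helper name, not a ModelLink command.

```shell
# Return 0 only if both <prefix>.bin and <prefix>.idx exist and are non-empty.
# Usage: dataset_ready <prefix>
dataset_ready() {
    [ -s "$1.bin" ] && [ -s "$1.idx" ]
}

# Usage:
#   dataset_ready ./dataset-baichuan-7B/alpaca_text_document || echo "dataset files missing"
```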

Configure the Baichuan-7B pretraining script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-7B-mt"

Launch the Baichuan-7B pretraining script: examples/baichuan/pretrain_baichuan_ptd_7B.sh

bash examples/baichuan/pretrain_baichuan_ptd_7B.sh

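Performance

Throughput

Performance of Baichuan-7B on Ascend NPUs compared with the reference device:

Device	Model	Iterations	Sample throughput (samples/s)	Token throughput (tokens/s/p)	Time per step (s/step)
NPUs	Baichuan-7B	1000	5.24	2685	6.1
Reference	Baichuan-7B	-	-	2036	-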
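
The two throughput columns are related by simple arithmetic: tokens per second per device is samples per second times the sequence length, divided by the number of devices. The sketch below assumes a sequence length of 4096 tokens, which this document does not state (it is an assumption that happens to be consistent with the table), and the 8 NPUs listed in the hardware configuration. `tokens_per_sec_per_device` is a hypothetical helper name.

```shell
# tokens/s/p = samples/s * sequence length / number of devices
# Usage: tokens_per_sec_per_device <samples_per_sec> <seq_len> <num_devices>
tokens_per_sec_per_device() {
    awk -v s="$1" -v l="$2" -v n="$3" 'BEGIN { printf "%.0f\n", s * l / n }'
}

# Sequence length 4096 is an assumption, not a value given in this document.
tokens_per_sec_per_device 5.24 4096 8   # prints 2683, close to the 2685 in the table
```

The small gap from 2685 is what one would expect if the samples/s figure was rounded to two decimals.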

Inference

First, configure the Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh

# Source set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Set the model checkpoint path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"

Then launch generate_baichuan_7b_ptd.sh directly:

bash tasks/inference/generate_baichuan_7b_ptd.sh

An inference example is shown below:

[Image: Inference]

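Evaluation

We use the BoolQ benchmark to evaluate the model. Download the benchmark before running the evaluation.

# Set the paths to the original checkpoint and the tokenizer
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# Set the task and the data path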
DATA_PATH="./boolq/"
TASK="boolq"

bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh

Task	Subset	Model	Ascend	Community
BoolQ	test	Baichuan 7B	0.69	0.67

Baichuan-13B

Training

The hardware configuration for Baichuan-13B training is as follows:

Hardware	Configuration
NPU	8 x Ascend NPUs

Script

Clone the repository to your server:

git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt
mkdir ckpt_lora

Set up the environment

# python3.8
conda create -n test python=3.8
conda activate test

# Install torch, torch_npu, and apex
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Install the AscendSpeed acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt
pip3 install -e .
cd ..

# Install the remaining dependencies
pip install -r requirements.txt

Note: if a later step fails with AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model', run the following command to fix it:

pip install transformers==4.32.0 --force-reinstall

(Optional) Prepare pretrained weights

Download the pretrained weights from Hugging Face:

mkdir baichuan-13B-hf
cd ./baichuan-13B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ..
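
Unlike the 7B model, the 13B weights are split into three shards, so a partial download is easy to miss. A minimal sketch that lists missing or empty shards, assuming only the file-name pattern visible in the wget commands above; `shard_names` and `missing_shards` are hypothetical helper names.

```shell
# Print the shard file names pytorch_model-0000i-of-0000N.bin for i = 1..N.
# Usage: shard_names <num_shards>
shard_names() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'pytorch_model-%05d-of-%05d.bin\n' "$i" "$1"
        i=$((i + 1))
    done
}

# Print the name of every shard that is missing or empty in a directory.
# Usage: missing_shards <dir> <num_shards>
missing_shards() {
    shard_names "$2" | while read -r name; do
        [ -s "$1/$name" ] || printf '%s\n' "$name"
    done
}

# Usage:
#   missing_shards ./baichuan-13B-hf 3
```

An empty result means all three shards are present; it does not verify their contents.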

Weight conversion

Convert the Baichuan-13B model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model on Megatron):

mkdir baichuan-13B-mt

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh
   
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./baichuan-13B-hf \
    --save-dir ./baichuan-13B-mt \
    --tokenizer-model ./baichuan-13B-hf/tokenizer.model \
    --w-pack True

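Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):

cd ModelLink/
# Adjust the set_env.sh path to match your environment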
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
    --loader megatron \
    --saver megatron \
    --save-model-type save_huggingface_llama \
    --load-dir ../Baichuan13B-v0.1-pt8-pp1 \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --w-pack True \
    --save-dir ../Baichuan13B_downloaded     # <-- set this to the original HF model path; the new weights are written to ../Baichuan13B_downloaded/mg2hg

Prepare the dataset

Download the dataset for Baichuan-13B:

mkdir dataset-baichuan-13B
cd ./dataset-baichuan-13B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

python ./tools/preprocess_data.py \
    --input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan-13B-hf \
    --output-prefix ./dataset-baichuan-13B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF

Configure the Baichuan-13B training script (Baichuan-13B does not support Flash Attention yet): examples/baichuan/pretrain_baichuan_ptd_13B.sh

# Adjust the ascend-toolkit path to match your installation
source /usr/local/Ascend/ascend-toolkit/set_env.sh

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-13B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-13B-mt"

Launch the Baichuan-13B training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh

bash examples/baichuan/pretrain_baichuan_ptd_13B.sh

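Performance

Throughput

Performance of Baichuan-13B on Ascend NPUs compared with the reference device:

Device	Model	Iterations	Sample throughput (samples/s)	Token throughput (tokens/s/p)	Time per step (s/step)
NPUs	Baichuan-13B	1000	2.37	1213	13.5
Reference	Baichuan-13B	-	-	862	-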
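
To compare the two rows directly, the token throughput on NPUs can be divided by the reference value. A minimal sketch using only the numbers from the throughput tables in this document; `speedup` is a hypothetical helper name.

```shell
# Relative throughput: NPU tokens/s/p divided by reference tokens/s/p.
# Usage: speedup <npu_tokens_per_sec> <reference_tokens_per_sec>
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 1213 862    # Baichuan-13B: prints 1.41
speedup 2685 2036   # Baichuan-7B:  prints 1.32
```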

LoRA fine-tuning

Baichuan-13B supports LoRA fine-tuning. Fine-tuning uses an instruction dataset, prepared as follows; note the added --handler-name GeneralInstructionHandler option:

mkdir alpaca_preprocessed
python tools/preprocess_data.py \
    --input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --output-prefix ./alpaca_preprocessed/alpaca \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ./baichuan-13B-hf \
    --tokenizer-not-use-fast \
    --handler-name GeneralInstructionHandler \
    --append-eod
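
The fine-tuning script takes the bare output prefix (./alpaca_preprocessed/alpaca) as its DATA_PATH, so a check that does not depend on the exact output file names is convenient here. The sketch below only assumes that each generated .bin file should have a matching .idx file; `unpaired_bins` is a hypothetical helper name.

```shell
# Print every <prefix>*.bin file that has no matching .idx file.
# Usage: unpaired_bins <prefix>
unpaired_bins() {
    for bin in "$1"*.bin; do
        [ -e "$bin" ] || continue                       # glob matched nothing
        [ -e "${bin%.bin}.idx" ] || printf '%s\n' "$bin"
    done
}

# Usage:
#   unpaired_bins ./alpaca_preprocessed/alpaca
```

No output means every .bin file found under the prefix has its index; it also prints nothing if preprocessing produced no files at all, so check the directory listing as well.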

Configure the Baichuan-13B LoRA script: tasks/finetune/tune_baichuan_ptd_13B.sh

# Set the checkpoint save path, dataset path, weight path, and tokenizer path
CKPT_SAVE_DIR="./ckpt_lora"
DATA_PATH="./alpaca_preprocessed/alpaca"
CHECKPOINT="./baichuan-13B-mt"
TOKENIZER_PATH="./baichuan-13B-hf"

Launch the Baichuan-13B LoRA fine-tuning script: tasks/finetune/tune_baichuan_ptd_13B.sh

bash ./tasks/finetune/tune_baichuan_ptd_13B.sh

Then run inference with the fine-tuned weights:

# Set the weight path, the LoRA weight path, and the tokenizer path
CHECKPOINT="./baichuan-13B-mt"
LORA_CHECKPOINT="./ckpt_lora"
TOKENIZER_PATH="./baichuan-13B-hf"

bash ./tasks/inference/generate_baichuan_13b_lora_ptd.sh

Inference output after LoRA fine-tuning:

Inference

Configure the Baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh

# Source set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Set the model checkpoint path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"

Then launch generate_baichuan_13b_ptd.sh directly:

bash tasks/inference/generate_baichuan_13b_ptd.sh

An inference example is shown below:

[Image: Inference]

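Evaluation

We use the BoolQ benchmark to evaluate the model. Download the benchmark before running the evaluation.

# Set the paths to the original checkpoint and the tokenizer
CHECKPOINT=<origin-ckpt-path>
TOKENIZER_PATH=<tokenizer-path>
# Set the task and the data path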
DATA_PATH="./boolq/"
TASK="boolq"

bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh

Task	Subset	Model	Ascend	Community
BoolQ	test	Baichuan 13B	0.747	0.736

甄文奇 / ModelLink
