甄文奇 / ModelLink

forked from Ascend / ModelLink 

BaiChuan

Baichuan-7B

Training

The hardware configuration for Baichuan-7B training is as follows:

| Hardware | Configuration |
| --- | --- |
| NPU | 8 x Ascend NPUs |

Script

  1. Clone the repository to your own server:
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt
  2. Set up the environment:
# python3.8
conda create -n test python=3.8
conda activate test

# Install torch, torch_npu, and apex
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Install the acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt 
pip3 install -e .
cd ..

# Install the remaining dependencies
pip install -r requirements.txt 
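
Since the toolkit path differs between machines, it can help to confirm that set_env.sh is readable before sourcing it. The following is a minimal sketch, not part of ModelLink: `source_ascend_env` and the `ASCEND_ENV` variable are hypothetical names, and the default path is simply the one used in this README.

```shell
# Hypothetical helper (not part of ModelLink): source set_env.sh only if it is readable.
# ASCEND_ENV is an assumed variable name; its default is the path used in this README.
ASCEND_ENV="${ASCEND_ENV:-/usr/local/Ascend/ascend-toolkit/set_env.sh}"

source_ascend_env() {
    env_file="$1"
    if [ -r "$env_file" ]; then
        . "$env_file"
    else
        echo "set_env.sh not found or not readable: $env_file" >&2
        return 1
    fi
}

# Usage (uncomment on a machine with the toolkit installed):
# source_ascend_env "$ASCEND_ENV" && python -c "import torch, torch_npu; print(torch.__version__)"
```

If the import in the commented usage line fails, the wheels installed above probably do not match the Python version or architecture of the environment.
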
  3. (Optional) Prepare the pretrained weights

Download the pretrained weights from Hugging Face:

mkdir baichuan-7B-hf
cd ./baichuan-7B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/handler.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/pytorch_model.bin
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer.model
wget https://huggingface.co/baichuan-inc/Baichuan-7B/resolve/main/tokenizer_config.json
cd ..
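
An interrupted download can leave a file missing or empty, and such a problem may only surface later during weight conversion. The sketch below checks for that; `missing_files` is a hypothetical helper, not something ModelLink provides.

```shell
# Hypothetical helper: print every expected file that is missing or empty in a directory.
# Returns 0 when all files are present and non-empty, 1 otherwise.
missing_files() {
    dir="$1"
    shift
    rc=0
    for name in "$@"; do
        if [ ! -s "$dir/$name" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Usage with the file names downloaded above:
# missing_files ./baichuan-7B-hf config.json configuration_baichuan.py \
#     generation_config.json handler.py modeling_baichuan.py pytorch_model.bin \
#     special_tokens_map.json tokenization_baichuan.py tokenizer.model tokenizer_config.json
```

The check only covers presence and size; it does not verify file integrity.
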
  4. Weight conversion

Convert the model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model with Megatron):

mkdir baichuan-7B-mt

# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
   
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./baichuan-7B-hf \
    --save-dir ./baichuan-7B-mt \
    --tokenizer-model ./baichuan-7B-hf/tokenizer.model \
    --w-pack True    
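
Megatron-style tensor parallelism splits the attention heads across devices, so the head count generally has to be divisible by the value passed to --target-tensor-parallel-size. The sketch below checks this from the Hugging Face config.json. Both helpers are hypothetical, and the sed pattern assumes the key appears as "num_attention_heads": followed by an integer.

```shell
# Hypothetical helpers (not part of ModelLink).
# Extract the integer value of "num_attention_heads" from a Hugging Face config.json.
heads_from_config() {
    sed -n 's/.*"num_attention_heads"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if the head count is divisible by the tensor-parallel size.
check_tp() {
    heads="$1"
    tp="$2"
    if [ $((heads % tp)) -eq 0 ]; then
        echo "ok: $heads heads / TP $tp = $((heads / tp)) heads per rank"
    else
        echo "error: $heads heads cannot be split evenly across TP $tp" >&2
        return 1
    fi
}

# Usage:
# check_tp "$(heads_from_config ./baichuan-7B-hf/config.json)" 8
```
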

Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):

cd ModelLink/
# Adjust the set_env.sh path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
    --loader megatron \
    --saver megatron \
    --save-model-type save_huggingface_llama \
    --load-dir ../Baichuan7B-v0.1-pt8-pp1 \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --w-pack True \
    --save-dir ../Baichuan7B_downloaded     # <-- Set this to the original HF model path; the new weights are saved to ../Baichuan7B_downloaded/mg2hg
  5. Prepare the dataset

Download the dataset for Baichuan-7B:

# Download the dataset
mkdir dataset-baichuan-7B
cd ./dataset-baichuan-7B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

# Preprocess the dataset
python ./tools/preprocess_data.py \
--input ./dataset-baichuan-7B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
--tokenizer-name-or-path ./baichuan-7B-hf \
--output-prefix ./dataset-baichuan-7B/alpaca \
--workers 4 \
--log-interval 1000 \
--tokenizer-type PretrainedFromHF
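
The DATA_PATH configured in the pretraining script is the --output-prefix with a _text_document suffix appended. The sketch below derives that path and checks that the preprocessed files exist. The helper names are hypothetical, and the .bin/.idx file pair is an assumption based on the Megatron-style indexed dataset layout, not something stated in this README.

```shell
# Hypothetical helper: derive the dataset path prefix from --output-prefix.
# The "_text_document" suffix matches DATA_PATH in the pretraining script config.
dataset_prefix() {
    echo "${1}_text_document"
}

# Assumption: preprocessing writes a Megatron-style <prefix>.bin / <prefix>.idx pair.
dataset_ready() {
    prefix="$(dataset_prefix "$1")"
    [ -s "$prefix.bin" ] && [ -s "$prefix.idx" ]
}

# Usage:
# dataset_ready ./dataset-baichuan-7B/alpaca \
#     && echo "DATA_PATH=$(dataset_prefix ./dataset-baichuan-7B/alpaca)"
```
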
  6. Configure the Baichuan-7B pretraining script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh 

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-7B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-7B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-7B-mt"
  7. Launch the Baichuan-7B pretraining script: examples/baichuan/pretrain_baichuan_ptd_7B.sh
bash examples/baichuan/pretrain_baichuan_ptd_7B.sh 
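
Step 1 creates a logs directory, but this README does not say whether the launch script writes to it. If you want your own copy of the console output, a small wrapper such as the one below may be enough. `run_logged` is a hypothetical helper, and the log file name in the usage line is an arbitrary example.

```shell
# Hypothetical wrapper: run a command, show its output, and keep a copy in a log file.
# Note: the return status is that of tee, not of the wrapped command.
run_logged() {
    log_file="$1"
    shift
    mkdir -p "$(dirname "$log_file")"
    "$@" 2>&1 | tee "$log_file"
}

# Usage:
# run_logged "logs/pretrain_baichuan_7B_$(date +%Y%m%d_%H%M%S).log" \
#     bash examples/baichuan/pretrain_baichuan_ptd_7B.sh
```
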

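Performance

Throughput

Performance of Baichuan-7B on Ascend NPUs compared with the reference device:

| Device | Model | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Step time (s/step) |
| --- | --- | --- | --- | --- | --- |
| NPUs | Baichuan-7B | 1000 | 5.24 | 2685 | 6.1 |
| Reference | Baichuan-7B | - | - | 2036 | - |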
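
To read the numbers above, two quick calculations are useful: the ratio of NPU to reference token throughput, and the number of samples processed per step (sample throughput multiplied by step time). The function names below are illustrative only.

```shell
# Ratio of two throughput values, rounded to two decimals.
speedup() {
    awk -v npu="$1" -v ref="$2" 'BEGIN { printf "%.2f\n", npu / ref }'
}

# Samples processed per step = sample throughput (samples/s) x step time (s/step).
samples_per_step() {
    awk -v sps="$1" -v step="$2" 'BEGIN { printf "%.0f\n", sps * step }'
}

speedup 2685 2036          # tokens/s/p, NPUs vs. reference
samples_per_step 5.24 6.1  # samples consumed in one step
```

This prints 1.32 and 32: the NPU run reaches roughly 1.32 times the reference token throughput, and the figures suggest about 32 samples per step. The second value is inferred from the table, not a batch size stated in this README.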

Inference

First, configure the Baichuan-7B inference script: tasks/inference/generate_baichuan_7b_ptd.sh

# Source set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 
 
# Set the model checkpoint path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"

Then launch generate_baichuan_7b_ptd.sh directly:

bash tasks/inference/generate_baichuan_7b_ptd.sh
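
The CHECKPOINT and TOKENIZER_PATH values shown above are placeholders and must be replaced with real paths. A small guard like the following could catch a forgotten placeholder before launch. `is_placeholder` is a hypothetical helper, and its patterns only cover the placeholder styles that appear in this README.

```shell
# Hypothetical helper: detect values that still look like README placeholders,
# e.g. "your model directory path" or "<tokenizer-path>", or that are empty.
is_placeholder() {
    case "$1" in
        ""|your\ *|"<"*">") return 0 ;;
        *) return 1 ;;
    esac
}

# Usage inside a launch script, after CHECKPOINT and TOKENIZER_PATH are set:
# for value in "$CHECKPOINT" "$TOKENIZER_PATH"; do
#     if is_placeholder "$value"; then
#         echo "Please set a real path (got: '$value')" >&2
#         exit 1
#     fi
# done
```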

An example of the inference output (image: Inference)

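Evaluation

We use the BoolQ benchmark to evaluate the model. Download the benchmark before running the evaluation.

# Set the paths to the original checkpoint and the tokenizer
CHECKPOINT="<origin-ckpt-path>"
TOKENIZER_PATH="<tokenizer-path>"
# Set the task and the data path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./tasks/evaluation/evaluate_baichuan_7B_ptd.sh

| Task | Subset | Model | Ascend value | Community value |
| --- | --- | --- | --- | --- |
| BoolQ | test | Baichuan 7B | 0.69 | 0.67 |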

Baichuan-13B

Training

The hardware configuration for Baichuan-13B training is as follows:

| Hardware | Configuration |
| --- | --- |
| NPU | 8 x Ascend NPUs |

Script

  1. Clone the repository to your own server:
git clone https://gitee.com/ascend/ModelLink.git
cd ModelLink
mkdir logs
mkdir ckpt
mkdir ckpt_lora
  2. Set up the environment:
# python3.8
conda create -n test python=3.8
conda activate test

# Install torch, torch_npu, and apex
pip install torch-2.1.0-cp38-cp38m-linux_aarch64.whl
pip install torch_npu-2.1.0.XXX-cp38-cp38m-linux_aarch64.whl
pip install apex-0.1_ascend*-cp38-cp38m-linux_aarch64.whl

# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Install the acceleration library
git clone https://gitee.com/ascend/AscendSpeed.git
cd AscendSpeed
pip install -r requirements.txt 
pip3 install -e .
cd ..

# Install the remaining dependencies
pip install -r requirements.txt 

**Note:** If a later step fails with the error AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model', run the following command to fix it:

pip install transformers==4.32.0 --force-reinstall
  3. (Optional) Prepare the pretrained weights

Download the pretrained weights from Hugging Face:

mkdir baichuan-13B-hf
cd ./baichuan-13B-hf
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/configuration_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/generation_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/modeling_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model-00003-of-00003.bin
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/pytorch_model.bin.index.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/quantizer.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/special_tokens_map.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenization_baichuan.py
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer_config.json
wget https://huggingface.co/baichuan-inc/Baichuan-13B-Base/resolve/main/tokenizer.model
cd ..
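
All of the wget commands above share one URL pattern, so the list can also be written as a loop. This is a minimal sketch: `hf_file_url` and `download_hf_files` are hypothetical helper names, and the URL format is the one used by the commands above.

```shell
# Build the download URL for one file in a Hugging Face model repository,
# following the URL pattern of the wget commands above.
hf_file_url() {
    printf 'https://huggingface.co/%s/resolve/main/%s\n' "$1" "$2"
}

# Hypothetical helper: download a list of files into a directory.
# wget -c resumes partial downloads; -P sets the target directory.
download_hf_files() {
    repo="$1"
    dest="$2"
    shift 2
    mkdir -p "$dest"
    for name in "$@"; do
        wget -c -P "$dest" "$(hf_file_url "$repo" "$name")" || return 1
    done
}

# Usage (append the remaining file names from the list above):
# download_hf_files baichuan-inc/Baichuan-13B-Base ./baichuan-13B-hf \
#     config.json tokenizer.model pytorch_model.bin.index.json
```

Resuming with -c can be worthwhile here, since the three weight shards are large and an interrupted transfer would otherwise restart from the beginning.
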
  4. Weight conversion

Convert the Baichuan-13B model weights from Hugging Face format to Megatron format (typically used to train an open-source Hugging Face model with Megatron):

mkdir baichuan-13B-mt

# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
   
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --load-dir ./baichuan-13B-hf \
    --save-dir ./baichuan-13B-mt \
    --tokenizer-model ./baichuan-13B-hf/tokenizer.model \
    --w-pack True      

Convert Megatron weights with any parallel partitioning strategy to Hugging Face format (typically used to convert a trained Megatron model back to Hugging Face format):

cd ModelLink/
# Adjust the set_env.sh path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
python tools/checkpoint/util.py --model-type GPT \
    --loader megatron \
    --saver megatron \
    --save-model-type save_huggingface_llama \
    --load-dir ../Baichuan13B-v0.1-pt8-pp1 \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --w-pack True \
    --save-dir ../Baichuan13B_downloaded     # <-- Set this to the original HF model path; the new weights are saved to ../Baichuan13B_downloaded/mg2hg
  5. Prepare the dataset

Download and preprocess the Baichuan-13B dataset:

mkdir dataset-baichuan-13B
cd ./dataset-baichuan-13B
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..

python ./tools/preprocess_data.py \
    --input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --tokenizer-name-or-path ./baichuan-13B-hf \
    --output-prefix ./dataset-baichuan-13B/alpaca \
    --workers 4 \
    --log-interval 1000 \
    --tokenizer-type PretrainedFromHF 
  6. Configure the Baichuan-13B training script (Baichuan-13B does not yet support Flash Attention): examples/baichuan/pretrain_baichuan_ptd_13B.sh
# Adjust the ascend-toolkit path to match your environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh 

CKPT_SAVE_DIR="./ckpt"
DATA_PATH="./dataset-baichuan-13B/alpaca_text_document"
TOKENIZER_MODEL="./baichuan-13B-hf/tokenizer.model"
CKPT_LOAD_DIR="./baichuan-13B-mt" 
  7. Launch the Baichuan-13B training script: examples/baichuan/pretrain_baichuan_ptd_13B.sh
bash examples/baichuan/pretrain_baichuan_ptd_13B.sh

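Performance

Throughput

Performance of Baichuan-13B on Ascend NPUs compared with the reference device:

| Device | Model | Iterations | Sample throughput (samples/s) | Token throughput (tokens/s/p) | Step time (s/step) |
| --- | --- | --- | --- | --- | --- |
| NPUs | Baichuan-13B | 1000 | 2.37 | 1213 | 13.5 |
| Reference | Baichuan-13B | - | - | 862 | - |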
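
The per-device token throughput can be related to the sample throughput as tokens/s/p = samples/s x sequence length / number of devices. The sketch below applies this with the 8 NPUs from the hardware table and an assumed sequence length of 4096. That sequence length is not stated in this README, so treat it as an assumption to verify against the training script.

```shell
# tokens/s/p = samples/s * sequence length / number of devices.
# Assumption: a sequence length of 4096 is NOT stated in this README;
# the device count of 8 comes from the hardware table.
tokens_per_sec_per_device() {
    awk -v sps="$1" -v seq="$2" -v n="$3" 'BEGIN { printf "%.0f\n", sps * seq / n }'
}

tokens_per_sec_per_device 2.37 4096 8
```

Under these assumptions the result is 1213, which agrees with the NPU row of the table.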

LoRA fine-tuning

LoRA fine-tuning is supported for Baichuan-13B. Fine-tuning uses an instruction dataset, prepared as follows; note the added --handler-name GeneralInstructionHandler option:

mkdir alpaca_preprocessed
python tools/preprocess_data.py \
    --input ./dataset-baichuan-13B/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
    --output-prefix ./alpaca_preprocessed/alpaca \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ./baichuan-13B-hf \
    --tokenizer-not-use-fast \
    --handler-name GeneralInstructionHandler \
    --append-eod

Configure the Baichuan-13B LoRA script: tasks/finetune/tune_baichuan_ptd_13B.sh

# Set the checkpoint save path, dataset path, checkpoint path, and tokenizer path
CKPT_SAVE_DIR="./ckpt_lora"
DATA_PATH="./alpaca_preprocessed/alpaca"
CHECKPOINT="./baichuan-13B-mt"
TOKENIZER_PATH="./baichuan-13B-hf"

Launch the Baichuan-13B LoRA fine-tuning script: tasks/finetune/tune_baichuan_ptd_13B.sh

bash ./tasks/finetune/tune_baichuan_ptd_13B.sh

Then run inference with the fine-tuned weights:

# Set the base checkpoint path, the LoRA checkpoint path, and the tokenizer path
CHECKPOINT="./baichuan-13B-mt"
LORA_CHECKPOINT="./ckpt_lora"
TOKENIZER_PATH="./baichuan-13B-hf"
bash ./tasks/inference/generate_baichuan_13b_lora_ptd.sh

Inference output after LoRA fine-tuning (image: 13B-lora-inference.png)

Inference

Configure the Baichuan-13B inference script: tasks/inference/generate_baichuan_13b_ptd.sh

# Source set_env.sh from your own ascend-toolkit path
source /usr/local/Ascend/ascend-toolkit/set_env.sh 
 
# Set the model checkpoint path and the tokenizer path
CHECKPOINT="your model directory path"
TOKENIZER_PATH="your tokenizer directory path"

Then launch generate_baichuan_13b_ptd.sh directly:

bash tasks/inference/generate_baichuan_13b_ptd.sh

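An example of the inference output (image: Inference)

Evaluation

We use the BoolQ benchmark to evaluate the model. Download the benchmark before running the evaluation.

# Set the paths to the original checkpoint and the tokenizer
CHECKPOINT="<origin-ckpt-path>"
TOKENIZER_PATH="<tokenizer-path>"
# Set the task and the data path
DATA_PATH="./boolq/"
TASK="boolq"
bash ./tasks/evaluation/evaluate_baichuan_13B_ptd.sh

| Task | Subset | Model | Ascend value | Community value |
| --- | --- | --- | --- | --- |
| BoolQ | test | Baichuan 13B | 0.747 | 0.736 |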