# Model Name (use the original English name)

## Model Description

Model Name is ... (briefly describe the model and its specifications, repository support status, etc.)

[Paper title](paper URL), author information, etc.

Citation information for the paper or GitHub repository, for example:

```text
@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}
```

## Model Performance (device performance + evaluation metrics)

> Note: performance and accuracy both depend on the config, so the config must match.

|                    config                     |         task         |  Datasets   |  metric  | score | train performance |     predict performance     |
| :-------------------------------------------: | :------------------: | :---------: | :------: | :---: | :---------------: | :-------------------------: |
|   [model_name_typeA](path/to/configA.yaml)    |   text_generation    |  wikitext2  |   ppl    |  xx   |   xxx tokens/s    | xxx tokens/s(use past True) |
|   [model_name_typeB](path/to/configB.yaml)    |   text_generation    |    ADGEN    | accuracy |  xx%  |   xxx tokens/s    | xxx tokens/s(use past True) |
| [model_name_typeC_lora](path/to/configC.yaml) |   text_generation    |   alpaca    |   ppl    |  xx   |   xxx tokens/s    | xxx tokens/s(use past True) |
|   [model_name_typeD](path/to/configD.yaml)    | image_classification | ImageNet-1K | accuracy |  xx%  |    xxx frame/s    |         xxx frame/s         |

## Repository Overview

`Model Name` is implemented on top of `mindformers`. The main files involved are:

1. Model implementation: `path/to/model_src_folder`

    ```bash
    model
        ├── __init__.py
        ├── convert_weight.py         # weight conversion script
        ├── model.py                  # model implementation
        ├── model_config.py           # model configuration items
        ├── model_layer.py            # Model network layer definitions
        ├── model_processor.py        # Model preprocessing
        ├── model_tokenizer.py        # tokenizer
        └── model_transformer.py      # transformer layer implementation
    ```

2. Model configuration: `path/to/model_config`

    ```bash
    model
        ├── run_model_typeA.yaml         # launch config for typeA full-parameter fine-tuning
        ├── run_model_typeA_lora.yaml    # launch config for typeA LoRA fine-tuning
        ├── run_model_typeB.yaml         # launch config for typeB full-parameter fine-tuning
        └── run_model_typeC.yaml         # launch config for typeC full-parameter fine-tuning
    ```

    The two items above are required. Further items can be added as the model needs.

3. Preprocessing scripts and task launch scripts: `path/to/model_preprocess`

    ```bash
    model
        ├── datasetA_data_preprocess.py     # preprocessing for datasetA
        ├── datasetB_data_preprocess.py     # preprocessing for datasetB
        ├── convert_weight.py               # weight conversion
        └── run_model.py                    # script using the high-level API
    ```

## Preparation

<a id="生成rank_table_file多卡运行必须环节"></a>

### Generating RANK_TABLE_FILE (**required for multi-card runs**)

Run mindformers/tools/hccl_tools.py to generate the RANK_TABLE_FILE JSON file.

```shell
# Run the following command to generate the RANK_TABLE_FILE JSON file for the current machine
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
```

> Note: in a ModelArts notebook environment, the rank table can be taken directly from `/user/config/jobstart_hccl.json`, so there is no need to generate it manually.

RANK_TABLE_FILE reference sample for a single machine with 8 cards:

```json
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"},
                {"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"},
                {"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"},
                {"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"},
                {"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"},
                {"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"},
                {"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"},
                {"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}],
             "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```

<a id="多机rank_table_file合并多机多卡必备环节"></a>

### Merging RANK_TABLE_FILE across machines (**required for multi-machine multi-card runs**)

- step 1. Following the previous section, generate a `RANK_TABLE_FILE` on each machine, then copy the files generated on the different machines to a single machine.
- step 2. Run mindformers/tools/merge_hccl.py to merge the `RANK_TABLE_FILE` files generated on the different machines.

```shell
# Run the following command to merge the RANK_TABLE_FILE JSON files from each machine.
python ./mindformers/tools/merge_hccl.py hccl*.json
```

- step 3. Copy the merged `RANK_TABLE_FILE` to every machine, so that all machines use the same `RANK_TABLE_FILE`.

RANK_TABLE_FILE reference sample for two machines with 16 cards:

```json
{
    "version": "1.0",
    "server_count": "2",
    "server_list": [
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {"device_id": "0", "device_ip": "192.168.0.0", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.168.1.0", "rank_id": "1"},
                {"device_id": "2", "device_ip": "192.168.2.0", "rank_id": "2"},
                {"device_id": "3", "device_ip": "192.168.3.0", "rank_id": "3"},
                {"device_id": "4", "device_ip": "192.168.0.1", "rank_id": "4"},
                {"device_id": "5", "device_ip": "192.168.1.1", "rank_id": "5"},
                {"device_id": "6", "device_ip": "192.168.2.1", "rank_id": "6"},
                {"device_id": "7", "device_ip": "192.168.3.1", "rank_id": "7"}
            ],
            "host_nic_ip": "reserve"
        },
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {"device_id": "0", "device_ip": "192.168.0.1", "rank_id": "8"},
                {"device_id": "1", "device_ip": "192.168.1.1", "rank_id": "9"},
                {"device_id": "2", "device_ip": "192.168.2.1", "rank_id": "10"},
                {"device_id": "3", "device_ip": "192.168.3.1", "rank_id": "11"},
                {"device_id": "4", "device_ip": "192.168.0.2", "rank_id": "12"},
                {"device_id": "5", "device_ip": "192.168.1.2", "rank_id": "13"},
                {"device_id": "6", "device_ip": "192.168.2.2", "rank_id": "14"},
                {"device_id": "7", "device_ip": "192.168.3.2", "rank_id": "15"}
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```

### Model Weight Download and Conversion (convert_weight.py)

For reference, this section describes how to convert a checkpoint between HuggingFace (or the official open-source GitHub repository) and MindSpore, and between different distributed strategies. Skip this section if no weights need to be loaded, or if they are downloaded automatically through from_pretrained.

Provide the download address of the target weights (Hugging Face URL, or the weights in the official GitHub repository).

Describe any weights that need no manual download because from_pretrained can fetch them.

Weight conversion steps and conversion command:

```bash
python path/to/convert_weight.py --torch_path "PATH OF model.pth" --mindspore_path "SAVE PATH OF model.ckpt" --otherargs xxx
```

Additional arguments can be explained here.

### [Model Weight Splitting and Merging](path/to/docs/feature_cards/Transform_Ckpt.md)

Weights converted from Hugging Face or the official GitHub repository are usually single-card weights. Multi-card fine-tuning, evaluation, or inference based on them requires switching the ckpt from a single-machine strategy to a distributed strategy.

Training is usually distributed, while evaluation and inference based on the trained weights mostly run on a single card, which requires switching the ckpt from a distributed strategy to a single-machine strategy.
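To make the single-card versus distributed distinction concrete, here is a toy sketch in plain Python. It is not the MindFormers implementation: `split_rows` and `merge_rows` are hypothetical helpers, the weight is a plain list of rows, and a real checkpoint holds named tensors whose split axis depends on the parallel strategy. It only illustrates that splitting for N cards and merging back are inverse operations.

```python
# Toy illustration only: hypothetical helpers, not the MindFormers ckpt tools.

def split_rows(weight, num_ranks):
    """Split a 2-D weight (a list of rows) into equal row slices, one per rank."""
    if len(weight) % num_ranks != 0:
        raise ValueError("row count must be divisible by the number of ranks")
    rows_per_rank = len(weight) // num_ranks
    return [weight[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(num_ranks)]


def merge_rows(shards):
    """Concatenate per-rank row slices back into the full single-card weight."""
    return [row for shard in shards for row in shard]


# An 8x4 "weight" split for 4 cards, then merged back to a single card.
full_weight = [[float(4 * i + j) for j in range(4)] for i in range(8)]
shards = split_rows(full_weight, num_ranks=4)   # single card -> 4 cards
restored = merge_rows(shards)                   # 4 cards -> single card
print(len(shards), len(shards[0]), restored == full_weight)  # 4 2 True
```

The round trip returning the original weight is the property that matters when moving between a training strategy and a single-card inference setup.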
For the single-card and multi-card ckpt conversions described above, see the feature document [Model Weight Splitting and Merging](path/to/docs/feature_cards/Transform_Ckpt.md) for a detailed tutorial.

## Quick Start with the API

### Using AutoClass

The AutoClass interface returns the matching model/preprocess/tokenizer instances from a model name, and downloads and loads the weights automatically.

`from_pretrained()` downloads the pretrained model from the cloud automatically. Storage path: `mindformers/checkpoint_download/model_name`

```python
from mindformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("model_name_type")
model = AutoModel.from_pretrained("model_name_type")

query = "你好"
prompted_inputs = tokenizer.build_prompt(query)
input_tokens = tokenizer([prompted_inputs])

outputs = model.generate(input_tokens["input_ids"], max_length=100)
response = tokenizer.decode(outputs)[0]
print(response)
```

Taking glm as an example:

The AutoClass interface returns the matching model/tokenizer instances from a model name, and downloads and loads the weights automatically.

`from_pretrained()` downloads the pretrained model from the cloud automatically. Storage path: `mindformers/checkpoint_download/glm`

The first pipeline inference run has to compile the model, so expect to wait for a while.

```python
>>> from mindformers import AutoModel, AutoTokenizer, TextGenerationPipeline
>>> model = AutoModel.from_pretrained("glm_6b_chat")
>>> tokenizer = AutoTokenizer.from_pretrained("glm_6b")
>>> pipeline = TextGenerationPipeline(model, tokenizer, max_length=2048)
>>> pipeline("你好")
[{'text_generation_text': ['你好 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。']}]
```

> Note: `AutoModel.from_pretrained()` currently supports two kinds of models, `glm_6b` and `glm_6b_chat`. The former is the general-purpose model. The latter has inference acceleration and is for inference only. The two share weights, and the latter is recommended in inference scenarios for faster inference.

### Training, fine-tuning, evaluation, and inference with Trainer

Normally the trainer should be defined once and then used for train, finetune, eval, and predict in turn. Showing how the interface is called is enough, and the results must be given.

If the model also supports parameter-efficient fine-tuning, it can be added here as well.

glm2 example:

```python
from mindformers import Trainer
trainer = Trainer(task="text_generation", model="glm2_6b", pet_method="lora",
                  train_dataset="/path/to/AdvertiseGen/train.json")
trainer.finetune(finetune_checkpoint="glm2_6b")
```

clip example:

```python
from mindformers.trainer import Trainer
from mindformers.tools.image_tools import load_image
# Initialize the pretraining task
trainer = Trainer(task='contrastive_language_image_pretrain',
    model='clip_vit_b_32',
    train_dataset='./Flickr8k')
trainer.train() # start pretraining

# Initialize the zero-shot image classification downstream task
trainer = Trainer(task='zero_shot_image_classification',
    model='clip_vit_b_32',
    eval_dataset='./cifar-100-python')
img = load_image("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")

# Option 1: evaluate and predict with the trained weights
trainer.evaluate(eval_checkpoint=True)
predict_result = trainer.predict(predict_checkpoint=True, input_data=img, top_k=3)
print(predict_result)

# Option 2: download trained weights from obs, then evaluate and predict
trainer.evaluate() # download weights and evaluate
predict_result = trainer.predict(input_data=img, top_k=3) # download weights and predict
print(predict_result)
```

If the preprocessing, data, or workflow differ too much between Trainer-based training, fine-tuning, evaluation, and inference, write them separately as in the blip2 example. Separate the parts with `-` and do not add extra headings.

blip2 example:

- Training with the Trainer API:

```python
from mindformers.dataset.dataloader.multi_image_cap_dataloader import MultiImgCapDataLoader
from mindformers.trainer import Trainer
# Initialize the image-text dataset
dataset_dir = "/data"
annotation_files = [
  "vg/annotations/vg_caption.json",
  "coco2014/coco/annotations/coco_karpathy_train.json"
]
image_dirs = [
  "vg/images",
  "coco2014/coco/images"
]
train_dataset = MultiImgCapDataLoader(dataset_dir=dataset_dir, annotation_files=annotation_files, image_dirs = image_dirs, stage="train")
# Initialize the blip2 stage-1 pretraining task
trainer = Trainer(task='contrastive_language_image_pretrain',
    model='blip2_stage1_vit_g',
    train_dataset=train_dataset)
# Start pretraining
trainer.train()
```

- Fine-tuning with the Trainer API:

```python
from mindformers.dataset.dataloader.multi_image_cap_dataloader import MultiImgCapDataLoader
from mindformers.trainer import Trainer
# Initialize the image-text dataset
dataset_dir = "/data"
annotation_files = [
  "vg/annotations/vg_caption.json",
  "coco2014/coco/annotations/coco_karpathy_train.json"
]
image_dirs = [
  "vg/images",
  "coco2014/coco/images"
]
train_dataset = MultiImgCapDataLoader(dataset_dir=dataset_dir, annotation_files=annotation_files, image_dirs = image_dirs, stage="train")
# Initialize the blip2 stage-1 pretraining task
trainer = Trainer(task='contrastive_language_image_pretrain',
    model='blip2_stage1_vit_g',
    train_dataset=train_dataset)
# Start pretraining
trainer.train()
```

- Evaluation with the Trainer API:

```python
from mindformers.dataset.dataloader.multi_image_cap_dataloader import MultiImgCapDataLoader
from mindformers.trainer import Trainer
# Initialize the image-text dataset
dataset_dir = "/data"
annotation_files = [
  "flickr30k/annotations/test.json"
]
image_dirs = [
  "flickr30k/images"
]
eval_dataset = MultiImgCapDataLoader(dataset_dir=dataset_dir, annotation_files=annotation_files, image_dirs = image_dirs, stage="eval")
# Initialize the evaluation task
trainer = Trainer(task='blip2_retireval',
    model='blip2_stage1_vit_g',
    eval_dataset=eval_dataset)
# Start evaluation (evaluate() as in the clip example above)
trainer.evaluate()
```

- Inference with the Trainer API:

```python
from mindformers.tools.image_tools import load_image
from mindformers import Trainer

cls_trainer = Trainer(task='zero_shot_image_classification',
                      model='blip2_stage1_classification',
                      candidate_labels=["sunflower", "tree", "dog", "cat", "toy"])
# Load the input, a picture of a sunflower
input_data = load_image("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")

# Load the specified weights to run inference
predict_result = cls_trainer.predict(input_data=input_data,
                                     predict_checkpoint='your_path_to/blip2_pretrained.ckpt')
print(predict_result)
# Output
# output result is: [[{'score': 0.99999976, 'label': 'sunflower'}]]
# output result is saved at: zero_shot_image_classification_result.txt
# .........Predict Over!.............
```

### Inference with Pipeline

```python
from mindformers.pipeline import pipeline
pipeline_task = pipeline(task="task_name", model="model_name_type")
pipeline_result = pipeline_task("predict_data", top_k=3)
print(pipeline_result)
# output:
# pipeline_result
```

text_generation example:

```python
from mindformers.pipeline import pipeline
pipeline_task = pipeline(task="text_generation", model="llama_7b", max_length=50)
pipeline_result = pipeline_task("I love Beijing, because", top_k=3)
print(pipeline_result)
# output:
# [{'text_generation_text': ['I love Beijing, because it’s a city that’s constantly changing. It’s a city that’s constantly evolving. It’s a city that’s constantly reinventing itself. And I think that’s what makes it']}]
```

image_classification example:

```python
from mindformers.pipeline import pipeline
from mindformers.tools.image_tools import load_image

pipeline_task = pipeline("image_classification", model='swin_base_p4w7')
img = load_image("https://ascend-repo-modelzoo.obs.cn-east-2."
          "myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
pipeline_result = pipeline_task(img, top_k=3)
print(pipeline_result)
# output:
# [[{'score': 0.89573187, 'label': 'daisy'}, {'score': 0.005366202, 'label': 'bee'}, {'score': 0.0013296203, 'label': 'fly'}]]
```

## Pretraining

### Dataset preparation - pretraining (may be omitted, but say why, e.g. the data behind the official weights is not open source; a recommended dataset can also be given)

A simple dataset needs no extra headings: put the content directly under the level-3 heading. If there is a lot of data-related content, add level-4 headings as needed.

Include the dataset source, download link, and format specification.

If conversion to MindRecord format is needed, provide the preprocess script and the commands to use it.

If the yaml dataset config has to be modified, add an explanation.

Taking llama as an example, with the Wikitext2 dataset:

- Dataset download: [WikiText2 dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip)

- Tokenizer model download: for example, the huggingface [tokenizer.model](https://huggingface.co/openlm-research/open_llama_7b/blob/main/tokenizer.model)

- Generate the mindrecord training data with the following preprocessing script

```bash
# Use tools/dataset_preprocess/llama/llama_preprocess.py for data preprocessing and Mindrecord generation
python llama_preprocess.py \
--dataset_type wiki \
--input_glob  /{path}/wiki.train.tokens \
--model_file /{path}/tokenizer.model \
--seq_length 2048 \
--output_file /{path}/wiki2048.mindrecord
```

Taking PanguAlpha as an example, with the WuDao dataset:

- Dataset download: [WuDao dataset](https://data.baai.ac.cn/details/WuDaoCorporaText#a2)

- Vocabulary download: [model.vocab](https://openi.pcl.ac.cn/PCL-Platform.Intelligence/PanGu-Alpha/src/branch/master/tokenizer/vocab.model)

- Following [ModelZoo](https://gitee.com/mindspore/models/tree/master/official/nlp/Pangu_alpha#%E6%95%B0%E6%8D%AE%E9%9B%86%E7%94%9F%E6%88%90), convert the data to Mindrecord format. Note: when processing training data, the sequence length should equal the model's input length plus one.

```bash
# Sample preprocessing command, taken from ModelZoo
# Generate Mindrecord data; output_file must end with the string mindrecord
python -m preprocess --input_glob  'data/*.txt' --tokenizer jieba --eot 40000 --data_column_name input_ids --seq_length 1025
```

Taking t5 as an example:
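Dataset used: [WMT16](https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz)

The corresponding file layout is:

```bash
└── wmt_en_ro
    ├── test.source
    ├── test.target
    ├── train.source
    ├── train.target
    ├── val.source
    └── val.target
```

### Launching with scripts

#### Single-card training

```shell
python run_mindformer.py --config path/to/config.yaml --run_mode train
```

```shell
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] train
```

#### Multi-card training

Multi-card runs require a RANK_TABLE_FILE. See Preparation - [Generating RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节)

- Single machine, multiple cards

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] train 8
```

Multi-machine multi-card runs require merging the RANK_TABLE_FILE of the different machines. See Preparation - [Merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节)

- Multiple machines, multiple cards

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] train $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    let rank_start=8*$idx
    let rank_end=$rank_start+8
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] train $device_num"
done
```

where

- `RANK_TABLE_FILE` is the merged rank table file that was collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
# Bash array elements are separated by spaces; this expands to 192.168.0.0 ... 192.168.0.11
IP_LIST=(192.168.0.{0..11})
```

## Fine-tuning (for fine-tuning on different datasets such as SFT and RLHF, or for LoRA-style fine-tuning)

### Dataset preparation - fine-tuning dataset A

A simple dataset needs no extra headings: put the content directly under the level-3 heading. If there is a lot of data-related content, add level-4 headings as needed.

Include the dataset source, download link, and format specification.

If conversion to MindRecord format is needed, provide the preprocess script and the commands to use it.

If the yaml dataset config has to be modified, add an explanation.

Taking bloom as an example:

Alpaca is used here. The data is about 21MB and is meant for debugging.
First download the [alpaca_data.json file](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) from the official repository.
Then call the `mindformers/tools/dataset_preprocess/bloom/make_mindrecord.py` script to convert the json into a mindrecord file.

```bash
python mindformers/tools/dataset_preprocess/bloom/make_mindrecord.py --input_dataset_file=XXX/alpaca_data.json --output_path=XXX --N=51200
```

`--N=51200` converts the first 51200 of the 52002 records in the json to mindrecord (recommended). `--N=-1` converts all records in the json.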
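The `--N` semantics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of `make_mindrecord.py`: `select_records` is a hypothetical helper, the records are placeholders, and how the real script treats other negative values is not stated here, so the sketch simply rejects them.

```python
def select_records(records, n):
    """Return the first n records, or every record when n is -1 (hypothetical helper)."""
    if n == -1:
        return list(records)
    if n < 0:
        # Assumption: only -1 is a meaningful negative value.
        raise ValueError("n must be -1 or a non-negative integer")
    return list(records[:n])


# Placeholder records standing in for the 52002 entries of alpaca_data.json.
records = [{"instruction": f"task {i}", "input": "", "output": ""} for i in range(52002)]
print(len(select_records(records, 51200)))  # 51200
print(len(select_records(records, -1)))     # 52002
```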
When this script runs, the following is done for each prompt:

- The question and answer are turned into a prompt text according to the template;
- BloomTokenizer converts the prompt from text to token ids;
- eos_token_id is appended until seq_length is reached.

After the script finishes, the mindrecord files are generated in the `--output_path` directory.

Taking llama as an example:

A preprocessing script for the alpaca dataset is currently provided for full-parameter fine-tuning and LoRA fine-tuning tasks.

Dataset download link:

- [alpaca_data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)

Sample of the original alpaca dataset format:

```text
# alpaca examples:
    {
        "instruction": "Describe a time when you had to make a difficult decision.",
        "input": "",
        "output": "I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client\u2019s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team\u2019s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client\u2019s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
    },
    {
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
        "output": "Telegram"
    },
```

- step 1. Run `alpaca_converter.py`, which uses the fastchat tool to add the prompt template and converts the original dataset into multi-turn conversation format.

```bash
# Script path: tools/dataset_preprocess/llama/alpaca_converter.py
# Run the conversion script
python alpaca_converter.py \
--data_path /{path}/alpaca_data.json \
--output_path /{path}/alpaca-data-conversation.json
```

```text
# Parameters
data_path: path of the alpaca data
output_path: path of the output data in conversation format
```

Sample of the converted format:

```text
{
    "id": "1",
    "conversations": [
      {
        "from": "human",
        "value": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:"
      },
      {
        "from": "gpt",
        "value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
      }
    ]
  },
```

- step 2. Run `llama_preprocess.py` for data preprocessing and Mindrecord generation, converting the data with the prompt template into mindrecord format.

```bash
# Script path: tools/dataset_preprocess/llama/llama_preprocess.py
# This tool relies on the fschat package to parse the prompt template; install fschat >= 0.2.13 with Python 3.9 beforehand
python llama_preprocess.py \
--dataset_type qa \
--input_glob /{path}/alpaca-data-conversation.json \
--model_file /{path}/tokenizer.model \
--seq_length 2048 \
--output_file /{path}/alpaca-fastchat2048.mindrecord
```

### Full-parameter fine-tuning

#### Single-card fine-tuning

```shell
python run_mindformer.py --config path/to/config.yaml --run_mode finetune
```

```shell
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] finetune
```

#### Multi-card fine-tuning

Multi-card runs require a RANK_TABLE_FILE. See Preparation - [Generating RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节)

- Single machine, multiple cards

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] finetune 8
```

Multi-machine multi-card runs require merging the RANK_TABLE_FILE of the different machines. See Preparation - [Merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节)

- Multiple machines, multiple cards

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] finetune $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    let rank_start=8*$idx
    let rank_end=$rank_start+8
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] finetune $device_num"
done
```

where

- `RANK_TABLE_FILE` is the merged rank table file that was collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
# Bash array elements are separated by spaces; this expands to 192.168.0.0 ... 192.168.0.11
IP_LIST=(192.168.0.{0..11})
```

### LoRA fine-tuning

#### Single-card fine-tuning

```shell
python run_mindformer.py --config path/to/config_lora.yaml --run_mode finetune
```

```shell
cd scripts
bash run_standalone.sh path/to/config_lora.yaml [DEVICE_ID] finetune
```

#### Multi-card fine-tuning

Multi-card runs require a RANK_TABLE_FILE. See Preparation - [Generating RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节)

- Single machine, multiple cards

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config_lora.yaml [0,8] finetune 8
```

Multi-machine multi-card runs require merging the RANK_TABLE_FILE of the different machines. See Preparation - [Merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节)

- Multiple machines, multiple cards

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config_lora.yaml [0,8] finetune $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    let rank_start=8*$idx
    let rank_end=$rank_start+8
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config_lora.yaml [$rank_start,$rank_end] finetune $device_num"
done
```

where

- `RANK_TABLE_FILE` is the merged rank table file that was collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
# Bash array elements are separated by spaces; this expands to 192.168.0.0 ... 192.168.0.11
IP_LIST=(192.168.0.{0..11})
```

## Evaluation

### Task name

Taking GPT2 as an example:

- Text generation:

- Text classification:

### Dataset preparation - task name

- Obtain the dataset:
    - [WikiText2 dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip): part of the WikiText language modeling dataset, a collection of over 100 million tokens extracted from verified high-quality Wikipedia articles.

- Convert the data to mindrecord format
    - WikiText2:

```bash
cd mindformers/tools/dataset_preprocess/gpt2
python wikitext2_data_process.py --input_file {your_path/wiki.valid.tokens} \
                             --output_file {your_path/wikitext-2.valid.mindrecord}
```

- Obtain the datasets:
    - [SST-2 dataset](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip): sentences from movie reviews with human annotations of their sentiment. There are two classes: positive (sample label 1) and negative (sample label 0).
    - [IMDB dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews): a movie review dataset of 50,000 IMDB reviews with binary sentiment, intended for sentiment analysis.
    - [AG-News dataset](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html): 496,835 news articles from more than 2000 news sources in the 4 largest categories of the AG news corpus.
    - [COLA dataset](https://nyu-mll.github.io/CoLA/): sentences from books and journal articles on linguistic theory, each annotated for whether it is a grammatical sequence of words.

- Convert the data to mindrecord format

```bash
# The model must be fine-tuned before evaluation, so both training and evaluation datasets are generated. Note: the generated dataset files must end with .mindrecord
cd mindformers/tools/dataset_preprocess/gpt2
python txtcls_dataset_to_mindrecord.py --dataset_name {select one from ['cola', 'sst_2', 'ag_news', 'imdb']} \
                                     --input_file {your_path/train.tsv} \
                                     --output_file {your_path/dataset_name.train.mindrecord}
python txtcls_dataset_to_mindrecord.py --dataset_name {the same as above} \
                                     --input_file {your_path/dev.tsv} \
                                     --output_file {your_path/dataset_name.dev.mindrecord}
```

#### Single-card evaluation

```shell
python run_mindformer.py --config path/to/config.yaml --run_mode eval
```

- Start evaluation:
    - WikiText2

```bash
python run_mindformer.py --config configs/gpt2/run_gpt2.yaml \
                       --eval_dataset_dir {your_path/wikitext-2.valid.mindrecord} \
                       --run_mode eval \
                       --epochs 1
# gpt2: PerplexityMetric: {'PerplexityMetric': {'loss': 3.24, 'PPL': 25.55}}
# gpt2_13b (requires replacing the yaml file): PerplexityMetric: {'PerplexityMetric': {'loss': 2.35, 'PPL': 10.49}}
```

- Start fine-tuning: the original weights contain no parameters that map hidden vectors to classes, so zero-shot evaluation is not possible and the model must be fine-tuned before evaluation.

```bash
# Before running, make sure model.model_config.num_labels in run_gpt2_txtcls.yaml is correct. Specifically,
# sst2/cola/imdb: num_labels = 2, agnews: num_labels = 4
python run_mindformer.py --config configs/gpt2/run_gpt2_txtcls.yaml \
                       --train_dataset_dir {your_path/dataset_name.train.mindrecord} \
                       --run_mode finetune
```

- Start evaluation: the metric is ACC

```bash
# Before running, make sure model.model_config.num_labels in run_gpt2_txtcls.yaml is correct. Specifically,
# sst2/cola/imdb: num_labels = 2, agnews: num_labels = 4
python run_mindformer.py --config configs/gpt2/run_gpt2_txtcls.yaml \
                       --eval_dataset_dir {your_path/dataset_name.dev.mindrecord} \
                       --run_mode eval \
                       --epochs 1
# ACC: COLA-0.693, SST-2-0.908, IMDB-0.934, AG-News-0.941
```

```shell
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] eval
```

#### Multi-card evaluation

Multi-card runs require a RANK_TABLE_FILE. See Preparation - [Generating RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节)

- Single machine, multiple cards

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] eval 8
```
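The single-machine command above passes `[0,8]` as the rank range handled by the machine and `8` as the total device count. The following minimal sketch shows how those two values could be derived per server, assuming 8 devices per server and consecutive rank blocks, as in the examples in this document. `rank_range` and `launch_args` are hypothetical helpers, not part of `run_distribute.sh`.

```python
DEVICES_PER_SERVER = 8  # assumption: 8 cards per machine, as in the examples


def rank_range(server_idx, devices_per_server=DEVICES_PER_SERVER):
    """Return (rank_start, rank_end) for the server with the given index."""
    start = server_idx * devices_per_server
    return start, start + devices_per_server


def launch_args(server_count, devices_per_server=DEVICES_PER_SERVER):
    """Return one (rank range string, total device count) pair per server."""
    total = server_count * devices_per_server
    args = []
    for idx in range(server_count):
        start, end = rank_range(idx, devices_per_server)
        args.append((f"[{start},{end}]", total))
    return args


print(launch_args(1))  # [('[0,8]', 8)]
print(launch_args(2))  # [('[0,8]', 16), ('[8,16]', 16)]
```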
Multi-machine multi-card runs require merging the RANK_TABLE_FILE of the different machines. See Preparation - [Merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节)

- Multiple machines, multiple cards

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] eval $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    let rank_start=8*$idx
    let rank_end=$rank_start+8
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] eval $device_num"
done
```

where

- `RANK_TABLE_FILE` is the merged rank table file that was collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
# Bash array elements are separated by spaces; this expands to 192.168.0.0 ... 192.168.0.11
IP_LIST=(192.168.0.{0..11})
```

## Export

### Export process

> Provide the usage of the export script

### Export result

> Provide the export result log

## Inference

### Inference with generate

Taking glm2 as an example:

A sample model inference script `infer.py` is provided below.

```python
from mindformers import AutoConfig, AutoModel, AutoTokenizer
import mindspore as ms

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend", device_id=0)

config = AutoConfig.from_pretrained("glm2_6b")
config.checkpoint_name_or_path = "/path/to/glm2_6b_finetune.ckpt"
model = AutoModel.from_config(config)
tokenizer = AutoTokenizer.from_pretrained("glm2_6b")

inputs = tokenizer(tokenizer.build_prompt("你好"))["input_ids"]
print(inputs)
print(tokenizer.decode(inputs))
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs))

inputs = tokenizer(tokenizer.build_prompt("请介绍一下华为"))["input_ids"]
print(inputs)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs))

inputs = tokenizer(tokenizer.build_prompt("晚上睡不着应该怎么办"))["input_ids"]
print(inputs)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs))

inputs = tokenizer(tokenizer.build_prompt("类型#上衣*材质#牛仔布*颜色#白色*风格#简约*图案#刺绣*衣样式#外套*衣款式#破洞"))["input_ids"]
print(inputs)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs))
```

Taking bloom as an example:

```python
import numpy as np
import mindspore as ms
from mindformers import AutoTokenizer
from mindformers.models.bloom import BloomConfig, BloomLMHeadModel

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend", device_id=0)

# ##############################
# # bloom_560m config
# CKPT_FILE="bloom_560m"
# SEQ_LENGTH = 256
# config = BloomConfig(
#     param_init_type="float16",
#     embedding_init_type="float16",
#     checkpoint_name_or_path=CKPT_FILE,
#     max_decode_length=SEQ_LENGTH,
#     seq_length=SEQ_LENGTH,
#     hidden_size=1024,
#     num_layers=24,
#     num_heads=16,
#     hidden_dropout_rate=0.0,
#     attention_dropout_rate=0.0,
#     batch_size = 1,
#     use_past = True
#
# )
# ##############################

# 7B
CKPT_FILE = "bloom_7.1b"
# CKPT_FILE also takes absolute path to ckpt file, e.g.
# "/home/xxx/mindformers/checkpoint_download/bloom/bloom_7.1b.ckpt"
SEQ_LENGTH = 256
config = BloomConfig(
    param_init_type="float16",
    embedding_init_type="float16",
    checkpoint_name_or_path=CKPT_FILE,
    max_decode_length=SEQ_LENGTH,
    seq_length=SEQ_LENGTH,
    hidden_size=4096,
    num_layers=30,
    num_heads=32,
    hidden_dropout_rate=0.0,
    attention_dropout_rate=0.0,
    batch_size = 1,
    use_past = True
)


def chat():
    tokenizer = AutoTokenizer.from_pretrained("bloom_560m")
    model = BloomLMHeadModel(config)
    model.set_train(False)

    question_list = [
        "what color is the sky?",
        "Translate to English: Je t’aime.",
        ]

    while True:
        if question_list:
            question = question_list.pop(0)
        else:
            question = input("please input your question: ")
        inputs = tokenizer.encode(question)
        inputs = np.array([inputs]).astype(np.int32) # add batch dim
        outputs = model.generate(inputs, max_length=None, do_sample=False, eos_token_id=2)
        outputs = outputs[0] # remove batch dim
        print(tokenizer.decode(outputs))


if __name__ == "__main__":
    chat()
```

### Launching with scripts

#### Single-card inference

```shell
python run_mindformer.py --config path/to/config.yaml --run_mode predict
```

```shell
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] predict
```

#### Multi-card inference
Multi-card runs require a RANK_TABLE_FILE. See Preparation - [Generating RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节)

- Single machine, multiple cards

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] predict 8
```

Multi-machine multi-card runs require merging the RANK_TABLE_FILE of the different machines. See Preparation - [Merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节)

- Multiple machines, multiple cards

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] predict $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    let rank_start=8*$idx
    let rank_end=$rank_start+8
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] predict $device_num"
done
```

where

- `RANK_TABLE_FILE` is the merged rank table file that was collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
# Bash array elements are separated by spaces; this expands to 192.168.0.0 ... 192.168.0.11
IP_LIST=(192.168.0.{0..11})
```
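Because every machine must use the same merged rank table, a quick sanity check before launching may save a failed run. The sketch below is an assumption-laden illustration, not a MindFormers tool: `check_rank_table` is a hypothetical helper, and it relies only on the field names visible in the rank table samples in this document (`server_count`, `server_list`, `device`, `rank_id`).

```python
import json


def check_rank_table(table):
    """Return a list of problems found in a rank table dict (empty if none)."""
    problems = []
    servers = table.get("server_list", [])
    # server_count is stored as a string in the samples, so compare as strings.
    if str(len(servers)) != str(table.get("server_count")):
        problems.append("server_count does not match the number of servers")
    rank_ids = sorted(int(dev["rank_id"])
                      for srv in servers
                      for dev in srv.get("device", []))
    if rank_ids != list(range(len(rank_ids))):
        problems.append("rank_id values are not exactly 0..N-1")
    return problems


# A small made-up table with one server and two devices, for illustration.
sample = json.loads("""
{
  "version": "1.0",
  "server_count": "1",
  "server_list": [
    {"server_id": "xx.xx.xx.xx",
     "device": [
       {"device_id": "0", "device_ip": "192.168.0.0", "rank_id": "0"},
       {"device_id": "1", "device_ip": "192.168.1.0", "rank_id": "1"}
     ],
     "host_nic_ip": "reserve"}
  ],
  "status": "completed"
}
""")
print(check_rank_table(sample))  # []
```

In practice the dict would come from `json.load` on the merged file rather than an inline string.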