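>Note: Every part that begins with ">Note:" is an important template instruction and must be deleted once the document is finished.
>Note: Everything else is reusable and may also be modified as needed.
>Note: Do not add, reorder, or change the content of level 1, 2, and 3 headings; only level-4 headings may be added as needed.
>Note: If the model has no content for a heading, delete that heading as the notes indicate.
>Note: `model`, `model_name`, `model_name_type`, `config.yaml`, `config_lora.yaml`, and similar names in the template must be replaced according to the model.
>Note: Every path in the template is written as `path/to/` so it can be replaced in bulk; all links to documents and configs must be replaced.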

# Model Name

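## Model Description

Model Name is ... (briefly describe the model, its sizes, and how the repository supports it)
[Paper title](paper_url), author information, etc.

>Note: Citation information for the paper or GitHub repository (citing)
>For example: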
>``` text
>@article{touvron2023llama,
>  title={LLaMA: Open and Efficient Foundation Language Models},
>  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
>  journal={arXiv preprint arXiv:2302.13971},
>  year={2023}
>}
>```

## Model Performance

>Note: Performance and accuracy both depend on the config; make sure the config matches.

|                    config                     |         task         |  Datasets   |  metric  | score | [train performance](#预训练) |     predict performance     |
| :-------------------------------------------: | :------------------: | :---------: | :------: | :---: | :---------------: | :-------------------------: |
|   [model_name_typeA](path/to/configA.yaml)    |   text_generation    |  wikitext2  |   ppl    |  xx   |   xxx tokens/s    | xxx tokens/s(use past True) |
|   [model_name_typeB](path/to/configB.yaml)    |   text_generation    |    ADGEN    | accuracy |  xx%  |   xxx tokens/s    | xxx tokens/s(use past True) |
| [model_name_typeC_lora](path/to/configC.yaml) |   text_generation    |   alpaca    |   ppl    |  xx   |   xxx tokens/s    | xxx tokens/s(use past True) |
|   [model_name_typeD](path/to/configD.yaml)    | image_classification | ImageNet-1K | accuracy |  xx%  |   xxx  frame/s    |        xxx  frame/s         |

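## Repository Overview

>Note: Replace this item according to the model.

`Model Name` is implemented on top of `mindformers`. The main files involved are:

1. Model implementation: `path/to/model_src_folder`

    ```bash
    model
        ├── __init__.py
        ├── convert_weight.py         # weight conversion script
        ├── model.py                  # model implementation
        ├── model_config.py           # model configuration options
        ├── model_layer.py            # Model network layer definitions
        ├── model_processor.py        # Model preprocessing
        ├── model_tokenizer.py        # tokenizer
        └── model_transformer.py      # transformer layer implementation
    ```

2. Model configuration: `path/to/model_config`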

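    ```bash
    model
        ├── run_model_typeA.yaml         # launch config for the typeA model
        ├── run_model_typeA_lora.yaml    # launch config for typeA LoRA parameter-efficient fine-tuning
        ├── run_model_typeB.yaml         # launch config for the typeB model
        └── run_model_typeC.yaml         # launch config for the typeC model
    ```

>Note: The 2 items above are mandatory.
>Note: Further items may be added as the model requires.

3. Preprocessing scripts and task launch scripts: `path/to/model_preprocess`

    ```bash
    model
        ├── datasetA_data_preprocess.py     # preprocessing for datasetA
        ├── datasetB_data_preprocess.py     # preprocessing for datasetB
        ├── convert_weight.py               # weight conversion
        └── run_model.py                    # script using the high-level interface
    ```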

## Preparation

### [Installing mindformers](path/to/README.md#二mindformers安装)

### 生成RANK_TABLE_FILE(多卡运行必须环节)

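Run `mindformers/tools/hccl_tools.py` to generate the RANK_TABLE_FILE json file.

```bash
# Generate the RANK_TABLE_FILE json file for the current machine
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
```

**Note: In a ModelArts notebook environment, the rank table can be taken directly from `/user/config/jobstart_hccl.json`; there is no need to generate it manually.**

Reference RANK_TABLE_FILE for a single machine with 8 devices: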

```json
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"},
                {"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"},
                {"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"},
                {"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"},
                {"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"},
                {"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"},
                {"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"},
                {"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}],
             "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```
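Before launching, it can help to sanity-check the generated file. The following is a minimal sketch, not part of mindformers: `check_rank_table` is a hypothetical helper, and it assumes the file follows the layout of the sample above (string-valued `server_count` and `rank_id`, one `device` list per server).

```python
import json


def check_rank_table(table: dict) -> int:
    """Return the number of devices; raise ValueError if the table is inconsistent."""
    servers = table["server_list"]
    if int(table["server_count"]) != len(servers):
        raise ValueError("server_count does not match the number of servers listed")
    rank_ids = [int(dev["rank_id"]) for srv in servers for dev in srv["device"]]
    # rank ids are expected to be exactly 0..N-1, with no gaps or repeats
    if sorted(rank_ids) != list(range(len(rank_ids))):
        raise ValueError("rank_id values must be 0..N-1 without gaps or repeats")
    return len(rank_ids)


# A trimmed two-device table in the same layout as the sample above.
sample = json.loads("""
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {"device_id": "0", "device_ip": "192.1.27.6", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.2.27.6", "rank_id": "1"}
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
""")
print(check_rank_table(sample))
```

In practice the dictionary would come from `json.load` on the generated file rather than from an inline string.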

### 多机RANK_TABLE_FILE合并(多机多卡必备环节)

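- step 1. Following the previous section, generate a `RANK_TABLE_FILE` on each machine, then copy the `RANK_TABLE_FILE` files from all machines onto a single machine.

```bash
# Generate the RANK_TABLE_FILE json file for the current machine
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)" --server_ip xx.xx.xx.xx
```

**Note: Set `--server_ip` to the IP address of the machine the command runs on; an incorrect server_ip can cause communication between nodes to fail.**

- step 2. Run `mindformers/tools/merge_hccl.py` to merge the `RANK_TABLE_FILE` files generated on the different machines.

```bash
# Merge the RANK_TABLE_FILE json files from all machines
python ./mindformers/tools/merge_hccl.py hccl*.json
```

- step 3. Copy the merged `RANK_TABLE_FILE` to every machine so that all machines use the same `RANK_TABLE_FILE`.

Reference RANK_TABLE_FILE for two machines with 16 devices: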

```json
{
    "version": "1.0",
    "server_count": "2",
    "server_list": [
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {
                    "device_id": "0", "device_ip": "192.168.0.0", "rank_id": "0"
                },
                {
                    "device_id": "1", "device_ip": "192.168.1.0", "rank_id": "1"
                },
                {
                    "device_id": "2", "device_ip": "192.168.2.0", "rank_id": "2"
                },
                {
                    "device_id": "3", "device_ip": "192.168.3.0", "rank_id": "3"
                },
                {
                    "device_id": "4", "device_ip": "192.168.0.1", "rank_id": "4"
                },
                {
                    "device_id": "5", "device_ip": "192.168.1.1", "rank_id": "5"
                },
                {
                    "device_id": "6", "device_ip": "192.168.2.1", "rank_id": "6"
                },
                {
                    "device_id": "7", "device_ip": "192.168.3.1", "rank_id": "7"
                }
            ],
            "host_nic_ip": "reserve"
        },
        {
            "server_id": "xx.xx.xx.xx",
            "device": [
                {
                    "device_id": "0", "device_ip": "192.168.0.1", "rank_id": "8"
                },
                {
                    "device_id": "1", "device_ip": "192.168.1.1", "rank_id": "9"
                },
                {
                    "device_id": "2", "device_ip": "192.168.2.1", "rank_id": "10"
                },
                {
                    "device_id": "3", "device_ip": "192.168.3.1", "rank_id": "11"
                },
                {
                    "device_id": "4", "device_ip": "192.168.0.2", "rank_id": "12"
                },
                {
                    "device_id": "5", "device_ip": "192.168.1.2", "rank_id": "13"
                },
                {
                    "device_id": "6", "device_ip": "192.168.2.2", "rank_id": "14"
                },
                {
                    "device_id": "7", "device_ip": "192.168.3.2", "rank_id": "15"
                }
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
```
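To make the shape of the merged file concrete, the sketch below reproduces the effect seen in the sample: server entries are concatenated and `rank_id` is renumbered globally. It is an illustration under stated assumptions, not the implementation of `merge_hccl.py`; `merge_rank_tables` is a hypothetical name, and the server order is simply the order in which the tables are passed in.

```python
def merge_rank_tables(tables):
    """Concatenate per-machine rank tables and renumber rank_id globally."""
    merged = {"version": "1.0", "server_count": "0",
              "server_list": [], "status": "completed"}
    next_rank = 0
    for table in tables:
        for server in table["server_list"]:
            devices = []
            for dev in server["device"]:
                # keep device_id and device_ip, assign the next global rank
                devices.append({**dev, "rank_id": str(next_rank)})
                next_rank += 1
            merged["server_list"].append({**server, "device": devices})
    merged["server_count"] = str(len(merged["server_list"]))
    return merged


def one_device_table(server_id, device_ip):
    """Build a minimal single-device table for the demonstration."""
    return {"version": "1.0", "server_count": "1", "status": "completed",
            "server_list": [{"server_id": server_id, "host_nic_ip": "reserve",
                             "device": [{"device_id": "0",
                                         "device_ip": device_ip,
                                         "rank_id": "0"}]}]}


merged = merge_rank_tables([one_device_table("node-a", "192.168.0.0"),
                            one_device_table("node-b", "192.168.0.1")])
print(merged["server_count"],
      [d["rank_id"] for s in merged["server_list"] for d in s["device"]])
```

Note that `device_id` stays local to each machine (0 to 7 on an 8-device server), while `rank_id` is unique across the whole job.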

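### Downloading and Converting Model Weights

For reference, this section describes how to convert checkpoints between HuggingFace (or the official open-source GitHub repository) and MindSpore, and between different distributed strategies.

If you do not need to load weights, or you rely on `from_pretrained` to download them automatically, you can skip this section.

>Note: Because Hugging Face, GitHub, and Google Drive may be unreachable or unstable, it is recommended to provide a mindformers download channel here.
>For example: [mindformers weight download](obs://abc/abc)
>Note: Provide the download address of the target weights (Hugging Face URL, or weights from the official GitHub repository).
>Note: Describe the weights that need no manual download and can be loaded with from_pretrained.
>Note: Weight conversion steps plus the weight conversion command.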

```bash
python path/to/convert_weight.py --torch_path "PATH OF model.pth" --mindspore_path "SAVE PATH OF model.ckpt" --otherargs xxx
```
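A large part of a typical conversion script is renaming parameters. The sketch below illustrates only that step with plain dictionaries. The layer names are hypothetical and the rules are assumptions to adapt per model: they rely on MindSpore's `LayerNorm` naming its parameters `gamma`/`beta` and its `Embedding` naming its table `embedding_table`, where PyTorch uses `weight`/`bias`. A real `convert_weight.py` would also load the PyTorch checkpoint and save a MindSpore ckpt.

```python
def rename_param(name: str) -> str:
    """Map one PyTorch-style parameter name to a MindSpore-style name."""
    if "norm" in name:
        # LayerNorm: weight/bias -> gamma/beta
        if name.endswith(".weight"):
            return name[: -len(".weight")] + ".gamma"
        if name.endswith(".bias"):
            return name[: -len(".bias")] + ".beta"
    if "embed" in name and name.endswith(".weight"):
        # Embedding: weight -> embedding_table
        return name[: -len(".weight")] + ".embedding_table"
    return name


def rename_state_dict(params: dict) -> dict:
    """Apply rename_param to every key, leaving the values untouched."""
    return {rename_param(key): value for key, value in params.items()}


# Hypothetical parameter names, with strings standing in for tensors.
torch_params = {
    "embed_tokens.weight": "tensor-0",
    "layers.0.input_layernorm.weight": "tensor-1",
    "layers.0.input_layernorm.bias": "tensor-2",
    "layers.0.attention.wq.weight": "tensor-3",
}
for new_name in rename_state_dict(torch_params):
    print(new_name)
```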

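>Note: Extra arguments can be explained here.

### [Splitting and Merging Model Weights](path/to/docs/feature_cards/Transform_Ckpt.md)

Weights converted from Hugging Face or an official GitHub repository are usually single-device weights. Multi-device fine-tuning, evaluation, or inference based on them requires converting the ckpt from a single-device strategy to a distributed strategy.

Training is usually distributed, whereas evaluation and inference on the resulting weights mostly run on a single device, which requires converting the ckpt from a distributed strategy to a single-device strategy.

For these single-device and multi-device ckpt conversions, see the feature document [Splitting and Merging Model Weights](path/to/docs/feature_cards/Transform_Ckpt.md) for a detailed tutorial.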

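## Quick Start with the API

### Quick Start with AutoClass

The AutoClass interfaces return the model/preprocess/tokenizer instances for a given model name, and download and load the weights automatically.

`from_pretrained()` automatically downloads the pretrained model from the cloud and stores it under `mindformers/checkpoint_download/model_name`.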

```python
import mindspore
from mindformers import AutoModel, AutoTokenizer

# Use graph mode (mode=0) and select the device id
mindspore.set_context(mode=0, device_id=0)

tokenizer = AutoTokenizer.from_pretrained("model_name_type")
model = AutoModel.from_pretrained("model_name_type")

inputs = tokenizer("input text")
# Non-LM models
outputs = model(inputs)
# LM models
outputs = model.generate(inputs["input_ids"], max_length=100)
response = tokenizer.decode(outputs)[0]
print(response)
# output
```

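### Quick Training, Fine-tuning, Evaluation, and Inference with Trainer

>Note: For Trainer-based training, fine-tuning, evaluation, and inference, define the trainer once and then call train, finetune, evaluate, and predict in turn. Showing how the interfaces are called is enough, but the results must be given.
>Note: If the model also supports parameter-efficient fine-tuning, it can be added here.
>Note: Replace task_name, model_name, the paths, and so on in the template according to the model.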

```python
import mindspore
from mindformers.trainer import Trainer

# Use graph mode (mode=0) and select the device id
mindspore.set_context(mode=0, device_id=0)
# Initialize the pretraining task
trainer = Trainer(task='task_name',
                  model='model_name',
                  train_dataset='path/to/train_dataset',
                  eval_dataset='path/to/eval_dataset')
# Start pretraining
trainer.train()

# Start full-parameter fine-tuning
trainer.finetune()

# Start evaluation
trainer.evaluate()

# Start inference
predict_result = trainer.predict(input_data="predict_data")
print(predict_result)

# LoRA fine-tuning
trainer = Trainer(task="task_name", model="model_name", pet_method="lora",
                  train_dataset="path/to/train_dataset")
trainer.finetune(finetune_checkpoint="model_name_type")
```

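>Note: If the preprocessing, data, or workflow differ too much between Trainer-based training, fine-tuning, evaluation, and inference, write them separately as blip2 does, using `-` to separate the segments; do not add extra headings.

### Quick Inference with Pipeline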

```python
import mindspore
from mindformers.pipeline import pipeline

# Use graph mode (mode=0) and select the device id
mindspore.set_context(mode=0, device_id=0)
pipeline_task = pipeline(task="task_name", model="model_name_type")
pipeline_result = pipeline_task("predict_data", top_k=3)
print(pipeline_result)
# output
```

## 预训练

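### Dataset Preparation - Pretraining

>Note: This section may be omitted, but explain why (for example, the data used for the official weights is not open source); a recommended dataset may also be provided.
>Note: For a simple dataset, put the content directly under the level-3 heading without extra headings; if there is a lot of dataset-related content, add level-4 headings as needed.
>Note: Include the dataset source, download link, and format specification.
>Note: If conversion to MindRecord format is required, provide the preprocess script and the commands to run it.
>Note: If the dataset config in the yaml needs to be modified, add the relevant instructions.

### Launching with Scripts

#### Single-device Training

- Launch with python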

```bash
python run_mindformer.py --config path/to/config.yaml --run_mode train
```

- Launch with bash

```bash
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] train
```

#### Multi-device Training

Multi-device runs require a RANK_TABLE_FILE; see [generating the RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节).

- Single node, multiple devices

```bash
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] train 8
```

Multi-node runs require merging the RANK_TABLE_FILE of the different machines; see [merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节).

- Multiple nodes, multiple devices

Launch `bash run_distribute.sh` on every machine.

**Note: The order of the nodes on which the script runs must match the node order in RANK_TABLE_FILE, i.e. the rank_id values must match.**

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] train $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    rank_start=$((8 * idx))
    rank_end=$((rank_start + 8))
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] train $device_num"
done
```

Where:

- `RANK_TABLE_FILE` is the merged rank table file collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
IP_LIST=("192.168.0.0" "192.168.0.1" ... "192.168.0.11")
```
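The rank range passed to each server follows a simple pattern: server `idx` owns ranks `8*idx` up to, but not including, `8*idx + 8`, and every server receives the same total device count. The sketch below prints the command each server would run. It is only an illustration of the loop above: `launch_commands` is a hypothetical helper, and it assumes 8 devices per server and the `run_distribute.sh` argument order used in this document.

```python
def launch_commands(ip_list, config, run_mode, devices_per_server=8,
                    rank_table="$RANK_TABLE_FILE"):
    """Return (ip, command) pairs, one per server, in rank order."""
    total = devices_per_server * len(ip_list)
    commands = []
    for idx, ip in enumerate(ip_list):
        start = devices_per_server * idx
        end = start + devices_per_server
        cmd = (f"bash run_distribute.sh {rank_table} {config} "
               f"[{start},{end}] {run_mode} {total}")
        commands.append((ip, cmd))
    return commands


# Two servers are enough to show the pattern.
for ip, cmd in launch_commands(["192.168.0.0", "192.168.0.1"],
                               "path/to/config.yaml", "train"):
    print(ip, "->", cmd)
```

Printing the commands first is a cheap way to confirm that the server order matches the node order in the merged rank table before anything is launched.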

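## Fine-tuning

### Dataset Preparation - Fine-tuning Dataset A

>Note: This section may be omitted, but explain why (for example, the data used for the official weights is not open source); a recommended dataset may also be provided.
>Note: For a simple dataset, put the content directly under the level-3 heading without extra headings; if there is a lot of dataset-related content, add level-4 headings as needed.
>Note: Include the dataset source, download link, and format specification.
>Note: If conversion to MindRecord format is required, provide the preprocess script and the commands to run it.
>Note: If the dataset config in the yaml needs to be modified, add the relevant instructions.

### Full-parameter Fine-tuning

#### Single-device Fine-tuning

- Launch with python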

```bash
python run_mindformer.py --config path/to/config.yaml --run_mode finetune
```

- Launch with bash

```bash
cd scripts
bash run_standalone.sh path/to/config.yaml [DEVICE_ID] finetune
```

#### Multi-device Fine-tuning

Multi-device runs require a RANK_TABLE_FILE; see [generating the RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节).

- Single node, multiple devices

```bash
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] finetune 8
```

Multi-node runs require merging the RANK_TABLE_FILE of the different machines; see [merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节).

- Multiple nodes, multiple devices

**Note: The order of the nodes on which the script runs must match the node order in RANK_TABLE_FILE, i.e. the rank_id values must match.**

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] finetune $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    rank_start=$((8 * idx))
    rank_end=$((rank_start + 8))
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] finetune $device_num"
done
```

Where:

- `RANK_TABLE_FILE` is the merged rank table file collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
IP_LIST=("192.168.0.0" "192.168.0.1" ... "192.168.0.11")
```

### LoRA Fine-tuning

#### Single-device Fine-tuning

```bash
python run_mindformer.py --config path/to/config_lora.yaml --run_mode finetune
```

```bash
cd scripts
bash run_standalone.sh path/to/config_lora.yaml [DEVICE_ID] finetune
```

#### Multi-device Fine-tuning

Multi-device runs require a RANK_TABLE_FILE; see [generating the RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节).

- Single node, multiple devices

```bash
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config_lora.yaml [0,8] finetune 8
```

Multi-node runs require merging the RANK_TABLE_FILE of the different machines; see [merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节).

- Multiple nodes, multiple devices

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config_lora.yaml [0,8] finetune $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    rank_start=$((8 * idx))
    rank_end=$((rank_start + 8))
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config_lora.yaml [$rank_start,$rank_end] finetune $device_num"
done
```

Where:

- `RANK_TABLE_FILE` is the merged rank table file collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
IP_LIST=("192.168.0.0" "192.168.0.1" ... "192.168.0.11")
```

## Evaluation

### Task Name

### Dataset Preparation - Task Name

- Get the dataset:
    - [XXX dataset](dataset_download_link): dataset introduction
- Convert the data to MindRecord format

```bash
cd mindformers/tools/dataset_preprocess/model_name
python dataset_process.py --input_file {your_path/dataset} --output_file {your_path/wikitext-2.valid.mindrecord} --otherargs
```

>Note: Extra arguments can be explained here.

#### Single-device Evaluation

```bash
python run_mindformer.py --config path/to/config.yaml --run_mode eval --eval_dataset_dir {your_path/wikitext-2.valid.mindrecord} --otherargs
# output
# eg: model_name： metric: {'loss': 3.24, 'PPL': 25.55}
```
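When both a loss and a PPL are reported, as in the sample output above, a quick consistency check is possible. The sketch assumes that PPL is the exponential of the mean cross-entropy loss (natural logarithm), which is the usual definition; whether a given model's metric is computed exactly this way should be confirmed against its metric implementation.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity as the exponential of the mean cross-entropy loss."""
    return math.exp(mean_loss)


# The sample line above reports loss 3.24 and PPL 25.55.
print(round(perplexity(3.24), 2))
```

The result is close to, but not exactly, the PPL in the sample line; a gap of that size is what rounding the loss to two decimals before printing would produce.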

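>Note: The output must be shown.
>Note: Special cases, such as a model that must be fine-tuned before it can be evaluated, may be explained here, together with the fine-tuning launch command.

#### Multi-device Evaluation

>Note: Multi-device evaluation may be added as appropriate; it is not mandatory.

Multi-device runs require a RANK_TABLE_FILE; see [generating the RANK_TABLE_FILE](#生成rank_table_file多卡运行必须环节).

- Single node, multiple devices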

```bash
cd scripts
bash run_distribute.sh RANK_TABLE_FILE path/to/config.yaml [0,8] eval 8
```

Multi-node runs require merging the RANK_TABLE_FILE of the different machines; see [merging RANK_TABLE_FILE across machines](#多机rank_table_file合并多机多卡必备环节).

- Multiple nodes, multiple devices

Launch `bash run_distribute.sh` on every machine.

```bash
server_count=12
device_num=$((8 * server_count))
# launch ranks in the 0th server
cd scripts
bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [0,8] eval $device_num

# launch ranks in the 1-11 server via ssh
for idx in {1..11}
do
    rank_start=$((8 * idx))
    rank_end=$((rank_start + 8))
    ssh ${IP_LIST[$idx]} "cd scripts; bash run_distribute.sh $RANK_TABLE_FILE path/to/config.yaml [$rank_start,$rank_end] eval $device_num"
done
```

Where:

- `RANK_TABLE_FILE` is the merged rank table file collected and distributed in the previous step;
- `IP_LIST` holds the IP addresses of the 12 servers, e.g. 192.168.0.[0-11]

```bash
IP_LIST=("192.168.0.0" "192.168.0.1" ... "192.168.0.11")
```

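## Inference

### Inference with pipeline

>Note: LLM models must include this section, implementing multi-device inference with pipeline.
>Note: The purpose is to let users reproduce some of the figures in the model performance section.
>Note: Replace pangualpha_2_6b in the template with the model being documented; adapting the inputs is recommended.

The following custom inference script is based on the pipeline interface and supports multi-device, multi-batch inference.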

```python
# predict_custom.py
import os
import argparse

import numpy as np
import mindspore as ms
from mindspore.train import Model
from mindspore import load_checkpoint, load_param_into_net

from mindformers import AutoConfig, AutoTokenizer, AutoModel, pipeline
from mindformers import init_context, ContextConfig, ParallelContextConfig
from mindformers.trainer.utils import get_last_checkpoint
from mindformers.tools.utils import str2bool


def context_init(use_parallel=False, device_id=0):
    """init context for mindspore."""
    context_config = ContextConfig(mode=0, device_target="Ascend", device_id=device_id)
    parallel_config = None
    if use_parallel:
        parallel_config = ParallelContextConfig(parallel_mode='SEMI_AUTO_PARALLEL',
                                                gradients_mean=False,
                                                full_batch=True)
    init_context(use_parallel=use_parallel,
                 context_config=context_config,
                 parallel_config=parallel_config)


def main(use_parallel=False,
         device_id=0,
         checkpoint_path="",
         use_past=True):
    """main function."""
    # Initialize the single-device or multi-device environment
    context_init(use_parallel, device_id)

    # Multi-batch inputs
    inputs = ["上联：欢天喜地度佳节 下联：",
              "四川的省会是哪里？",
              "李大钊如果在世，他会对今天的青年人说："]

    # set model config
    model_config = AutoConfig.from_pretrained("pangualpha_2_6b")
    model_config.use_past = use_past
    if checkpoint_path and not use_parallel:
        model_config.checkpoint_name_or_path = checkpoint_path
    print(f"config is: {model_config}")

    # build tokenizer
    tokenizer = AutoTokenizer.from_pretrained("pangualpha_2_6b")
    # build model from config
    network = AutoModel.from_config(model_config)

    # if use parallel, load distributed checkpoints
    if use_parallel:
        # find the sharded ckpt path for this rank
        ckpt_path = os.path.join(checkpoint_path, "rank_{}".format(os.getenv("RANK_ID", "0")))
        ckpt_path = get_last_checkpoint(ckpt_path)
        print("ckpt path: %s" % str(ckpt_path))

        # shard pangualpha and load sharded ckpt
        model = Model(network)
        model.infer_predict_layout(ms.Tensor(np.ones(shape=(1, model_config.seq_length)), ms.int32))
        checkpoint_dict = load_checkpoint(ckpt_path)
        # load_param_into_net expects the network Cell, not the Model wrapper
        not_load_network_params = load_param_into_net(network, checkpoint_dict)
        print("Network parameters are not loaded: %s" % str(not_load_network_params))

    text_generation_pipeline = pipeline(task="text_generation", model=network, tokenizer=tokenizer)
    outputs = text_generation_pipeline(inputs)
    for output in outputs:
        print(output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--use_parallel', default=False, type=str2bool,
                        help='whether use parallel.')
    parser.add_argument('--device_id', default=0, type=int,
                        help='set device id.')
    parser.add_argument('--checkpoint_path', default='', type=str,
                        help='set checkpoint path.')
    parser.add_argument('--use_past', default=True, type=str2bool,
                        help='whether use past.')
    args = parser.parse_args()

    main(args.use_parallel,
         args.device_id,
         args.checkpoint_path,
         args.use_past)
```

The following script runs the custom multi-batch inference on multiple devices.

```bash
# >>> `run_predict.sh`
CHECKPOINT_PATH=$2
export RANK_TABLE_FILE=$1

# define variable
export RANK_SIZE=8
export START_RANK=0 # this server start rank
export END_RANK=8 # this server end rank

# run
for((i=${START_RANK}; i<${END_RANK}; i++))
do
    export RANK_ID=$i
    export DEVICE_ID=$((i-START_RANK))
    echo "Start distribute running for rank $RANK_ID, device $DEVICE_ID"
    python3 ./predict_custom.py --use_parallel True --checkpoint_path $CHECKPOINT_PATH &> mindformers_$RANK_ID.log &
done
```

#### Single-device pipeline Inference

```bash
python predict_custom.py
```

#### Multi-device pipeline Inference

```bash
bash run_predict.sh RANK_TABLE_FILE path/to/pangualpha_2_6b_shard_checkpoint_dir
```

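### Inference with generate

>Note: LLM models must include this section, implementing multi-batch inference with generate, with multi-device support.
>Note: The template provides multi-device and multi-batch inference that can be adapted.
>Note: The purpose is to let users reproduce some of the figures in the model performance section.
>Note: Replace pangualpha_2_6b in the template with the model being documented; adapting the inputs is recommended.

The following custom inference script is based on the model.generate interface and supports multi-device, multi-batch inference.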

```python
# predict_custom.py
import os
import argparse

import numpy as np
import mindspore as ms
from mindspore.train import Model
from mindspore import load_checkpoint, load_param_into_net

from mindformers import AutoConfig, AutoTokenizer, AutoModel
from mindformers import init_context, ContextConfig, ParallelContextConfig
from mindformers.trainer.utils import get_last_checkpoint
from mindformers.tools.utils import str2bool


def context_init(use_parallel=False, device_id=0):
    """init context for mindspore."""
    context_config = ContextConfig(mode=0, device_target="Ascend", device_id=device_id)
    parallel_config = None
    if use_parallel:
        parallel_config = ParallelContextConfig(parallel_mode='SEMI_AUTO_PARALLEL',
                                                gradients_mean=False,
                                                full_batch=True)
    init_context(use_parallel=use_parallel,
                 context_config=context_config,
                 parallel_config=parallel_config)


def main(use_parallel=False,
         device_id=0,
         checkpoint_path="",
         use_past=True):
    """main function."""
    # Initialize the single-device or multi-device environment
    context_init(use_parallel, device_id)

    # Multi-batch inputs
    inputs = ["上联：欢天喜地度佳节 下联：",
              "四川的省会是哪里？",
              "李大钊如果在世，他会对今天的青年人说："]

    # set model config
    model_config = AutoConfig.from_pretrained("pangualpha_2_6b")
    model_config.batch_size = len(inputs)
    model_config.use_past = use_past
    if checkpoint_path and not use_parallel:
        model_config.checkpoint_name_or_path = checkpoint_path
    print(f"config is: {model_config}")

    # build tokenizer
    tokenizer = AutoTokenizer.from_pretrained("pangualpha_2_6b")
    # build model from config
    model = AutoModel.from_config(model_config)

    # if use parallel, load distributed checkpoints
    if use_parallel:
        # find the sharded ckpt path for this rank
        ckpt_path = os.path.join(checkpoint_path, "rank_{}".format(os.getenv("RANK_ID", "0")))
        ckpt_path = get_last_checkpoint(ckpt_path)
        print("ckpt path: %s" % str(ckpt_path))

        # shard pangualpha and load sharded ckpt
        # keep `model` bound to the network so that model.generate() still works below
        ms_model = Model(model)
        ms_model.infer_predict_layout(ms.Tensor(np.ones(shape=(1, model_config.seq_length)), ms.int32))
        checkpoint_dict = load_checkpoint(ckpt_path)
        not_load_network_params = load_param_into_net(model, checkpoint_dict)
        print("Network parameters are not loaded: %s" % str(not_load_network_params))

    inputs_ids = tokenizer(inputs, max_length=model_config.max_decode_length, padding="max_length")["input_ids"]
    outputs = model.generate(inputs_ids, max_length=model_config.max_decode_length)
    for output in outputs:
        print(tokenizer.decode(output))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--use_parallel', default=False, type=str2bool,
                        help='whether use parallel.')
    parser.add_argument('--device_id', default=0, type=int,
                        help='set device id.')
    parser.add_argument('--checkpoint_path', default='', type=str,
                        help='set checkpoint path.')
    parser.add_argument('--use_past', default=True, type=str2bool,
                        help='whether use past.')
    args = parser.parse_args()

    main(args.use_parallel,
         args.device_id,
         args.checkpoint_path,
         args.use_past)
```

The following script runs the custom multi-batch inference on multiple devices.

```bash
# >>> `run_predict.sh`
CHECKPOINT_PATH=$2
export RANK_TABLE_FILE=$1

# define variable
export RANK_SIZE=8
export START_RANK=0 # this server start rank
export END_RANK=8 # this server end rank

# run
for((i=${START_RANK}; i<${END_RANK}; i++))
do
    export RANK_ID=$i
    export DEVICE_ID=$((i-START_RANK))
    echo "Start distribute running for rank $RANK_ID, device $DEVICE_ID"
    python3 ./predict_custom.py --use_parallel True --checkpoint_path $CHECKPOINT_PATH &> mindformers_$RANK_ID.log &
done
```

#### Single-device generate Inference

```bash
python predict_custom.py
```

#### Multi-device generate Inference

```bash
bash run_predict.sh RANK_TABLE_FILE path/to/pangualpha_2_6b_shard_checkpoint_dir
```

### Launching with Scripts

#### Single-device Inference

```bash
python run_mindformer.py --config configs/pangualpha/run_pangualpha_2_6b.yaml --run_mode predict --predict_data "上联：欢天喜地度佳节 下联：" --use_parallel False
# output result is: [{'text_generation_text': ['上联:欢天喜地度佳节 下联:笑逐颜开迎佳期 横批:幸福快乐<eot>']}]
```

**Note**: To speed up inference, enable incremental inference by setting `use_past` to True in the corresponding model config file, as follows.

```yaml
# model config
use_past: True          # enable incremental inference
use_moe: False
expert_num: 1
per_token_num_experts_chosen: 1
checkpoint_name_or_path: "pangualpha_2_6b"
repetition_penalty: 1
max_decode_length: 1024
top_k: 3
top_p: 1
do_sample: False
```

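### Launching with Scripts

#### Single-device Inference

```bash
python run_mindformer.py --config path/to/config.yaml --run_mode predict
```

#### Multi-device Inference

>Note: Multi-device inference may be added as appropriate; it is not mandatory, but it is recommended for some models with very large parameter counts.
>Note: The bash launch scripts currently do not support passing in predict data, so multi-device inference has to be implemented by the user.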

## [mindspore-lite](../feature_cards/Inference.md#模型导出增量推理为例)

### Using export

>Note: Describe how to use the export script.
>Note: The output must be included.
>Note: Add this section as needed.