baichuan2 single-node 8-card fine-tuning fails in a 910A environment: ValueError: For 'Optimizer', the argument parameters must not be empty

DONE
Bug-Report
Created on 2024-05-11 20:18

Describe the current behavior / 问题描述 (Mandatory / 必填)

Single-node 8-card LoRA fine-tuning of the baichuan2 model fails in a 910A environment. Reference document: https://gitee.com/mindspore/mindformers/blob/dev/research/baichuan2/baichuan2.md#lora%E5%BE%AE%E8%B0%83

Environment / 环境信息 (Mandatory / 必填)

  • Hardware Environment (Ascend/GPU/CPU) / 硬件环境:
    Ascend 910A

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- MindSpore version: 2.2.11
    -- Python version: 3.9.18
    -- OS platform and distribution: Linux version 4.19.36-vhulk1907.1.0.h1524.eulerosv2r8.aarch64 (abuild@szxrtosci10000) (gcc version 7.3.0 (GCC)) #1 SMP Mon Jan 1 17:11:14 UTC 2024
    -- GCC/Compiler version (if compiled from source):
  • Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):
    /mode pynative
    /mode graph

Related testcase / 关联用例 (Mandatory / 必填)

Steps to reproduce the issue / 重现步骤 (Mandatory / 必填)

  1. Start the test container with docker and mount the model and algorithm files into the container:
docker run -itd  -u root --ipc=host --network host --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /etc/localtime:/etc/localtime -v /etc/hccn.conf:/etc/hccn.conf -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /var/log/npu/:/usr/slog -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/bin/hccn_tool:/usr/bin/hccn_tool -v /mindformer-share/nj/dataset:/workspace/dataset -v /mindformer-share/nj/model:/workspace/model -v /mindformer-share/nj/training-framework:/workspace/training-framework  --name mindformers_test  swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.0_mindspore2.2.11:aarch_20240125 /bin/bash
  2. Launch the fine-tuning task inside the container:
bash research/run_singlenode.sh "python research/baichuan2/run_baichuan2.py --config research/baichuan2/run_baichuan2_7b_lora_910b.yaml --load_checkpoint /workspace/model/baichuan2 --auto_trans_ckpt True --use_parallel True --run_mode finetune --train_data  /workspace/dataset/baichuan2/train" /workspace/training-framework/mindformers-1.0/hccl_8p_01234567_192.168.12.126.json  [0,8] 8
  3. Check the task log:
cat output/log/rank_0/mindformer.log

Describe the expected behavior / 预期结果 (Mandatory / 必填)

The model fine-tuning job completes successfully.

Related log / screenshot / 日志 / 截图 (Mandatory / 必填)

2024-05-11 17:37:35,948 - mindformers[mindformers/core/context/build_context.py:194] - INFO - full_batch will be forced to False when the parallel mode is stand_alone or data_parallel
[WARNING] HCCL_ADPT(17565,ffff82a4b1c0,python):2024-05-11-17:37:36.616.898 [mindspore/ccsrc/plugin/device/ascend/hal/hccl_adapter/hccl_adapter.cc:63] GenHcclOptions] The environment variable DEPLOY_MODE is not set. Now set to default value 0
2024-05-11 17:37:36,891 - mindformers[mindformers-1.0/research/baichuan2/run_baichuan2.py:88] - INFO - 当前工作路径:/workspace/training-framework/mindformers-1.0/
2024-05-11 17:37:36,927 - mindformers[mindformers/tools/utils.py:153] - INFO - set output path to '/workspace/training-framework/mindformers-1.0/output'
2024-05-11 17:37:36,930 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'CausalLanguageModelingTrainer', 'model_name': 'baichuan2_7b_lora'}
2024-05-11 17:37:36,931 - mindformers[mindformers/trainer/base_trainer.py:90] - INFO - Now Running Task is: text_generation, Model is: baichuan2_7b_lora
2024-05-11 17:37:36,933 - mindformers[mindformers/trainer/base_trainer.py:131] - WARNING - Input model name is not in the supported list or unspecified.
2024-05-11 17:37:36,934 - mindformers[mindformers/trainer/base_trainer.py:132] - WARNING - See the list of supported task and model name: OrderedDict([('general', OrderedDict([('common', '/workspace/training-framework/mindformers-1.0/configs/general/run_general_task.yaml')])), ('masked_image_modeling', OrderedDict([('mae_vit_base_p16', '/workspace/training-framework/mindformers-1.0/configs/mae/run_mae_vit_base_p16_224_800ep.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/mae/run_mae_vit_base_p16_224_800ep.yaml')])), ('image_classification', OrderedDict([('vit_base_p16', '/workspace/training-framework/mindformers-1.0/configs/vit/run_vit_base_p16_224_100ep.yaml'), ('swin_base_p4w7', '/workspace/training-framework/mindformers-1.0/configs/swin/run_swin_base_p4w7_224_100ep.yaml'), ('mindspore/vit_base_p16', '/workspace/training-framework/mindformers-1.0/configs/vit/run_vit_base_p16_224_100ep.yaml'), ('mindspore/swin_base_p4w7', '/workspace/training-framework/mindformers-1.0/configs/swin/run_swin_base_p4w7_224_100ep.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/vit/run_vit_base_p16_224_100ep.yaml')])), ('fill_mask', OrderedDict([('bert_base_uncased', '/workspace/training-framework/mindformers-1.0/configs/bert/run_bert_base_uncased.yaml'), ('bert_tiny_uncased', '/workspace/training-framework/mindformers-1.0/configs/bert/run_bert_tiny_uncased.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/bert/run_bert_tiny_uncased.yaml')])), ('contrastive_language_image_pretrain', OrderedDict([('clip_vit_b_32', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_pretrain_flickr8k.yaml'), ('blip2_stage1_vit_g', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage1_vit_g_qformer_pretrain.yaml'), ('blip2_stage2_vit_g_baichuan_7b', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage2_vit_g_baichuan_7b.yaml'), ('blip2_stage2_vit_g_llama_7b', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage2_vit_g_llama_7b.yaml'), ('mindspore/clip_vit_b_32', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_pretrain_flickr8k.yaml'), ('clip_vit_b_16', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_16_pretrain_flickr8k.yaml'), ('clip_vit_l_14', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_l_14_pretrain_flickr8k.yaml'), ('clip_vit_l_14@336', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_l_14@336_pretrain_flickr8k.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_pretrain_flickr8k.yaml')])), ('image_to_text_retrieval', OrderedDict([('blip2_stage1_evaluator', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage1_vit_g_retrieval_flickr30k.yaml')])), ('zero_shot_image_classification', OrderedDict([('clip_vit_b_32', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_zero_shot_image_classification_cifar100.yaml'), ('mindspore/clip_vit_b_32', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_zero_shot_image_classification_cifar100.yaml'), ('clip_vit_b_16', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_16_zero_shot_image_classification_cifar100.yaml'), ('clip_vit_l_14', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_l_14_zero_shot_image_classification_cifar100.yaml'), 
('clip_vit_l_14@336', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_l_14@336_zero_shot_image_classification_cifar100.yaml'), ('blip2_stage1_classification', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage1_vit_g_zero_shot_image_classification_cifar100.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/clip/run_clip_vit_b_32_zero_shot_image_classification_cifar100.yaml')])), ('image_to_text_generation', OrderedDict([('itt_blip2_stage2_vit_g_baichuan_7b', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage2_vit_g_baichuan_7b_image_to_text_generation.yaml'), ('itt_blip2_stage2_vit_g_llama_7b', '/workspace/training-framework/mindformers-1.0/configs/blip2/run_blip2_stage2_vit_g_llama_7b_image_to_text_generation.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/clip/run_blip2_stage2_vit_g_llama_7b_image_to_text_generation.yaml')])), ('translation', OrderedDict([('t5_small', '/workspace/training-framework/mindformers-1.0/configs/t5/run_t5_small_on_wmt16.yaml'), ('t5_tiny', '/workspace/training-framework/mindformers-1.0/configs/t5/run_t5_tiny_on_wmt16.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/t5/run_t5_small_on_wmt16.yaml')])), ('text_classification', OrderedDict([('txtcls_bert_base_uncased', '/workspace/training-framework/mindformers-1.0/configs/txtcls/run_txtcls_bert_base_uncased.yaml'), ('txtcls_bert_base_uncased_mnli', '/workspace/training-framework/mindformers-1.0/configs/txtcls/run_txtcls_bert_base_uncased_mnli.yaml'), ('mindspore/txtcls_bert_base_uncased_mnli', '/workspace/training-framework/mindformers-1.0/configs/txtcls/run_txtcls_bert_base_uncased_mnli.yaml'), ('gpt2_txtcls', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_txtcls.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/txtcls/run_txtcls_bert_base_uncased.yaml')])), ('token_classification', OrderedDict([('tokcls_bert_base_chinese', '/workspace/training-framework/mindformers-1.0/configs/tokcls/run_tokcls_bert_base_chinese.yaml'), ('tokcls_bert_base_chinese_cluener', '/workspace/training-framework/mindformers-1.0/configs/tokcls/run_tokcls_bert_base_chinese_cluener.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/tokcls/run_tokcls_bert_base_chinese.yaml')])), ('question_answering', OrderedDict([('qa_bert_base_uncased', '/workspace/training-framework/mindformers-1.0/configs/qa/run_qa_bert_base_uncased.yaml'), ('qa_bert_base_uncased_squad', '/workspace/training-framework/mindformers-1.0/configs/qa/run_qa_bert_base_uncased.yaml'), ('mindspore/qa_bert_base_uncased', '/workspace/training-framework/mindformers-1.0/configs/qa/run_qa_bert_base_uncased.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/qa/run_qa_bert_base_uncased.yaml')])), ('text_generation', OrderedDict([('gpt2', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2.yaml'), ('gpt2_lora', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_lora.yaml'), ('gpt2_13b', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_13b.yaml'), ('gpt2_52b', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_52b.yaml'), ('gpt2_xl', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_xl.yaml'), ('gpt2_xl_lora', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2_xl_lora.yaml'), ('llama_7b', 
'/workspace/training-framework/mindformers-1.0/configs/llama/run_llama_7b.yaml'), ('llama_13b', '/workspace/training-framework/mindformers-1.0/configs/llama/run_llama_13b.yaml'), ('llama_65b', '/workspace/training-framework/mindformers-1.0/configs/llama/run_llama_65b.yaml'), ('llama2_7b', '/workspace/training-framework/mindformers-1.0/configs/llama2/run_llama2_7b.yaml'), ('llama2_13b', '/workspace/training-framework/mindformers-1.0/configs/llama2/run_llama2_13b.yaml'), ('llama2_70b', '/workspace/training-framework/mindformers-1.0/configs/llama2/run_llama2_70b.yaml'), ('codellama_34b', '/workspace/training-framework/mindformers-1.0/configs/codellama/run_codellama_34b_910b.yaml'), ('llama_7b_lora', '/workspace/training-framework/mindformers-1.0/configs/llama/run_llama_7b_lora.yaml'), ('pangualpha_2_6b', '/workspace/training-framework/mindformers-1.0/configs/pangualpha/run_pangualpha_2_6b.yaml'), ('pangualpha_13b', '/workspace/training-framework/mindformers-1.0/configs/pangualpha/run_pangualpha_13b.yaml'), ('glm_6b', '/workspace/training-framework/mindformers-1.0/configs/glm/run_glm_6b_finetune.yaml'), ('glm_6b_chat', '/workspace/training-framework/mindformers-1.0/configs/glm/run_glm_6b_infer.yaml'), ('glm_6b_lora', '/workspace/training-framework/mindformers-1.0/configs/glm/run_glm_6b_lora.yaml'), ('glm_6b_lora_chat', '/workspace/training-framework/mindformers-1.0/configs/glm/run_glm_6b_lora_infer.yaml'), ('glm2_6b', '/workspace/training-framework/mindformers-1.0/configs/glm2/run_glm2_6b.yaml'), ('glm2_6b_lora', '/workspace/training-framework/mindformers-1.0/configs/glm2/run_glm2_6b_lora.yaml'), ('glm2_6b_ptuning2', '/workspace/training-framework/mindformers-1.0/configs/glm2/run_glm2_6b_ptuning2.yaml'), ('glm3_6b', '/workspace/training-framework/mindformers-1.0/configs/glm3/run_glm3_6b.yaml'), ('codegeex2_6b', '/workspace/training-framework/mindformers-1.0/configs/codegeex2/run_codegeex2_6b.yaml'), ('bloom_560m', '/workspace/training-framework/mindformers-1.0/configs/bloom/run_bloom_560m.yaml'), ('bloom_7.1b', '/workspace/training-framework/mindformers-1.0/configs/bloom/run_bloom_7.1b.yaml'), ('bloom_65b', '/workspace/training-framework/mindformers-1.0/configs/bloom/run_bloom_65b.yaml'), ('bloom_176b', '/workspace/training-framework/mindformers-1.0/configs/bloom/run_bloom_176b.yaml'), ('baichuan_7b', '/workspace/training-framework/mindformers-1.0/research/baichuan/run_baichuan_7b.yaml'), ('baichuan2_7b', '/workspace/training-framework/mindformers-1.0/research/baichuan2/run_baichuan2_7b.yaml'), ('baichuan2_13b', '/workspace/training-framework/mindformers-1.0/research/baichuan2/run_baichuan2_13b.yaml'), ('ziya_13b', '/workspace/training-framework/mindformers-1.0/research/ziya/run_ziya_13b.yaml'), ('skywork_13b', '/workspace/training-framework/mindformers-1.0/research/skywork/run_skywork_13b.yaml'), ('internlm_7b', '/workspace/training-framework/mindformers-1.0/research/internlm/run_internlm_7b.yaml'), ('internlm_7b_lora', '/workspace/training-framework/mindformers-1.0/research/internlm/run_internlm_7b_lora.yaml'), ('qwen_7b', '/workspace/training-framework/mindformers-1.0/research/qwen/run_qwen_7b.yaml'), ('qwen_7b_lora', '/workspace/training-framework/mindformers-1.0/research/qwen/run_qwen_7b_lora.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2.yaml')])), ('segment_anything', OrderedDict([('sam_vit_b', '/workspace/training-framework/mindformers-1.0/configs/sam/run_sam_vit-b.yaml'), ('sam_vit_l', 
'/workspace/training-framework/mindformers-1.0/configs/sam/run_sam_vit-l.yaml'), ('sam_vit_h', '/workspace/training-framework/mindformers-1.0/configs/sam/run_sam_vit-h.yaml'), ('common', '/workspace/training-framework/mindformers-1.0/configs/sam/run_sam_vit-h.yaml')]))])
2024-05-11 17:37:36,937 - mindformers[mindformers/trainer/base_trainer.py:133] - WARNING - The default model config: /workspace/training-framework/mindformers-1.0/configs/gpt2/run_gpt2.yaml will now be used for the text_generation task
2024-05-11 17:37:36,939 - mindformers[mindformers/core/parallel_config.py:45] - INFO - initial recompute_config from dict: {'recompute': True, 'select_recompute': False, 'parallel_optimizer_comm_recompute': False, 'mp_comm_recompute': True, 'recompute_slice_activation': True}
2024-05-11 17:37:36,941 - mindformers[mindformers/core/parallel_config.py:51] - INFO - initial parallel_config from dict: {'data_parallel': 8, 'model_parallel': 1, 'pipeline_stage': 1, 'micro_batch_num': 1, 'vocab_emb_dp': True, 'gradient_aggregation_group': 4}
2024-05-11 17:37:36,943 - mindformers[mindformers/trainer/base_trainer.py:233] - INFO - The current parallel mode is data_parallel, batch size per card will not be changed: batch_size_per_card = 2
2024-05-11 17:37:36,944 - mindformers[mindformers/trainer/base_trainer.py:237] - INFO - global_batch_size = batch_size_per_card * device_num * gradient_accumulation_steps = 16 = 2 * 8 * 1
2024-05-11 17:37:36,946 - mindformers[mindformers/trainer/base_trainer.py:246] - INFO - parallel_config will be change to default config: [ParallelConfig]
_recompute:[ParallelConfig]
_recompute:True
_select_recompute:False
_parallel_optimizer_comm_recompute:False
_mp_comm_recompute:True
_recompute_slice_activation:True

select_recompute:False
use_seq_parallel:False
_gradient_aggregation_group:4
_embed_dp_mp_config:[ParallelConfig]
_dp_mp_config:[ParallelConfig]
_data_parallel:1
_model_parallel:1
use_seq_parallel:False
select_recompute:False

_vocab_emb_dp:True
use_seq_parallel:False
select_recompute:False

_pp_config:[ParallelConfig]
_pipeline_stage:1
_micro_batch_num:1

_moe_config:[ParallelConfig]
_dpmp:[ParallelConfig]
_data_parallel:1
_model_parallel:1
use_seq_parallel:False
select_recompute:False

_expert_parallel:1
use_seq_parallel:False
select_recompute:False

.
2024-05-11 17:37:36,951 - mindformers[mindformers/trainer/base_trainer.py:629] - INFO - .........Build Dataset For Train..........
2024-05-11 17:37:36,953 - mindformers[mindformers/trainer/base_trainer.py:353] - INFO - .........Build Dataset From Config..........
2024-05-11 17:37:36,955 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'CausalLanguageModelDataset', 'dataset_config': {'data_loader': {'type': 'MindDataset', 'dataset_dir': '/workspace/dataset/baichuan2/train', 'shuffle': True}, 'tokenizer': {'type': 'Baichuan2Tokenizer', 'vocab_file': '../../model/baichuan2/tokenizer.model'}, 'input_columns': ['input_ids', 'labels'], 'num_parallel_workers': 8, 'python_multiprocessing': False, 'drop_remainder': True, 'repeat': 1, 'numa_enable': False, 'prefetch_size': 1, 'do_eval': False, 'seed': 0, 'auto_tune': False, 'filepath_prefix': './autotune', 'autotune_per_step': 10, 'profile': False, 'batch_size': 2}}
2024-05-11 17:37:36,957 - mindformers[mindformers/dataset/causal_language_model_dataset.py:166] - INFO - Now Create Causal Language Model Dataset.
2024-05-11 17:37:36,960 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'MindDataset', 'shuffle': True}
2024-05-11 17:37:36,969 - mindformers[mindformers/trainer/utils.py:149] - INFO - Will be Training epochs:1, sink_size:4
2024-05-11 17:37:36,971 - mindformers[mindformers/trainer/utils.py:151] - INFO - Create training dataset finish, dataset size:625
2024-05-11 17:37:36,973 - mindformers[mindformers/tools/check_rules.py:122] - WARNING - full_batch could only be used under semi_auto_parallel or auto_parallel, but get data_parallel, full_batch has been forced to False
2024-05-11 17:37:36,975 - mindformers[mindformers/trainer/base_trainer.py:661] - INFO - .........Build Net For Train..........
2024-05-11 17:37:36,977 - mindformers[mindformers/trainer/base_trainer.py:388] - INFO - .........Build Network From Config..........
2024-05-11 17:37:36,978 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'LlamaConfig', 'batch_size': 1, 'seq_length': 512, 'hidden_size': 4096, 'num_layers': 32, 'num_heads': 32, 'vocab_size': 125696, 'multiple_of': 256, 'rms_norm_eps': 1e-06, 'bos_token_id': 1, 'eos_token_id': 2, 'pad_token_id': 0, 'ignore_token_id': -100, 'user_token_id': 195, 'assistant_token_id': 196, 'compute_dtype': 'float16', 'layernorm_compute_type': 'float32', 'softmax_compute_type': 'float32', 'rotary_dtype': 'float32', 'param_init_type': 'float16', 'use_past': False, 'compute_in_2d': True, 'use_flash_attention': False, 'offset': 0, 'checkpoint_name_or_path': None, 'repetition_penalty': 1.05, 'temperature': 1.0, 'max_decode_length': 512, 'top_k': 5, 'top_p': 0.85, 'do_sample': False, 'pet_config': {'pet_type': 'lora', 'lora_rank': 8, 'lora_alpha': 32, 'lora_dropout': 0.1, 'target_modules': '.*query_key_value*'}}
2024-05-11 17:37:36,980 - mindformers[mindformers/models/llama/llama_config.py:184] - WARNING - Argument `compute_in_2d` is deprecated.
2024-05-11 17:37:36,981 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'Baichuan7BV2ForCausalLM'}
2024-05-11 17:37:36,983 - mindformers[mindformers/version_control.py:60] - INFO - The Cell Reuse compilation acceleration feature is not supported when the environment variable ENABLE_CELL_REUSE is 0 or MindSpore version is earlier than 2.1.0 or stand_alone mode or pipeline_stages <= 1
2024-05-11 17:37:36,985 - mindformers[mindformers/version_control.py:64] - INFO -
The current ENABLE_CELL_REUSE=0, please set the environment variable as follows:
export ENABLE_CELL_REUSE=1 to enable the Cell Reuse compilation acceleration feature.
2024-05-11 17:37:36,986 - mindformers[mindformers/version_control.py:73] - INFO - The Cell Reuse compilation acceleration feature only works in pipeline parallel mode(pipeline_stage>1).Current pipeline stage=1, the feature is disabled by default.
[WARNING] ME(17565:281472873574848,MainProcess):2024-05-11-17:37:36.997.741 [mindspore/ops/primitive.py:228] The in_strategy of the operator in your network will not take effect in data_parallel mode. This means the the shard function called in the network is ignored.
If you want to enable it, please use semi auto or auto parallel mode by context.set_auto_parallel_context(parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL or context.set_auto_parallel_context(parallel_mode=ParallelMode.AUTO_PARALLEL)
2024-05-11 17:37:45,608 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
[WARNING] ME(17565:281472873574848,MainProcess):2024-05-11-17:37:45.613.373 [mindspore/common/parameter.py:786] This interface may be deleted in the future.
2024-05-11 17:37:48,173 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:37:50,832 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:37:53,419 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:37:55,974 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:37:58,651 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:38:01,326 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:38:04,069 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:38:06,745 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:38:09,424 - mindformers[mindformers/modules/layers.py:554] - WARNING - The user passed the custom defined activation function True. If the user want to enable shard for the activation cell, the user should set the shard for each primitives in the cell.
2024-05-11 17:39:15,593 - mindformers[mindformers/models/base_model.py:117] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None.
2024-05-11 17:39:15,611 - mindformers[mindformers/models/base_model.py:117] - INFO - model built, but weights is unloaded, since the config has no checkpoint_name_or_path attribute or checkpoint_name_or_path is None.
[INFO] 2024-05-11 17:39:15,616 [17565] [SDK] : Start to freeze model for delta, mode: lora, include list: None, exclude list: None
[INFO] 2024-05-11 17:39:15,616 [17565] [SDK] : Start to freeze model, include list: ['*'], exclude list: ['*mindpet_delta_lora*']
[INFO] 2024-05-11 17:39:15,625 [17565] [SDK] : End to freeze model.
[INFO] 2024-05-11 17:39:15,625 [17565] [SDK] : End to freeze model for delta.
2024-05-11 17:39:15,633 - mindformers[mindformers/trainer/base_trainer.py:540] - INFO - Network Parameters: 0 M.
2024-05-11 17:39:15,635 - mindformers[mindformers/trainer/base_trainer.py:686] - INFO - .........Build Optimizer For Train..........
2024-05-11 17:39:15,637 - mindformers[mindformers/trainer/base_trainer.py:435] - INFO - .........Build Optimizer From Config..........
2024-05-11 17:39:15,639 - mindformers[mindformers/trainer/base_trainer.py:469] - INFO - .........Build LR Schedule From Config..........
2024-05-11 17:39:15,640 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'CosineWithWarmUpLR', 'learning_rate': 5e-05, 'lr_end': 2e-06, 'total_steps': 624, 'warmup_steps': 0}
2024-05-11 17:39:15,646 - mindformers[mindformers/trainer/optimizer_grouped_parameters.py:74] - WARNING - dynamic_lr_schedule will be reset and invalid when layer_scale is False.
2024-05-11 17:39:15,650 - mindformers[mindformers/trainer/optimizer_grouped_parameters.py:113] - INFO - Param groups = {}
2024-05-11 17:39:15,652 - mindformers[mindformers/trainer/base_trainer.py:451] - INFO - .........Build Optimizer From Config config.optimizer={'type': 'FP32StateAdamWeightDecay', 'beta1': 0.9, 'beta2': 0.98, 'eps': 1e-08}..........
2024-05-11 17:39:15,653 - mindformers[mindformers/tools/register/register.py:160] - INFO - get_instance_from_cfg.cfg={'type': 'FP32StateAdamWeightDecay', 'beta1': 0.9, 'beta2': 0.98, 'eps': 1e-08}
2024-05-11 17:39:15,659 - mindformers[mindformers/tools/cloud_adapter/cloud_monitor.py:43] - ERROR - Traceback (most recent call last):
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/register/register.py", line 193, in get_instance_from_cfg
   return obj_cls(**args)
 File "/workspace/training-framework/mindformers-1.0/mindformers/core/optim/optim.py", line 437, in __init__
   super(nn.AdamWeightDecay, self).__init__(learning_rate, params, weight_decay)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/optim/optimizer.py", line 199, in __init__
   parameters = self._parameters_base_check(parameters, "parameters")
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/optim/optimizer.py", line 400, in _parameters_base_check
   raise ValueError(f"For 'Optimizer', the argument {param_info} must not be empty.")
ValueError: For 'Optimizer', the argument parameters must not be empty.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/workspace/training-framework/mindformers-1.0/research/baichuan2/run_baichuan2.py", line 279, in main
   trainer.finetune(finetune_checkpoint=ckpt, auto_trans_ckpt=config.auto_trans_ckpt, resume_training=resume)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
   return func(*args, **kwargs)
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/trainer.py", line 485, in finetune
   self.trainer.train(
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 99, in train
   self.training_process(
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/base_trainer.py", line 688, in training_process
   optimizer = self.create_optimizer_scheduler(network, layer_scale=config.layer_scale)
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/base_trainer.py", line 452, in create_optimizer_scheduler
   self.optimizer = build_optim(
 File "/workspace/training-framework/mindformers-1.0/mindformers/core/optim/build_optim.py", line 67, in build_optim
   return MindFormerRegister.get_instance_from_cfg(
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/register/register.py", line 195, in get_instance_from_cfg
   raise type(e)('{}: {}'.format(obj_cls.__name__, e))
ValueError: FP32StateAdamWeightDecay: For 'Optimizer', the argument parameters must not be empty.

Traceback (most recent call last):
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/register/register.py", line 193, in get_instance_from_cfg
   return obj_cls(**args)
 File "/workspace/training-framework/mindformers-1.0/mindformers/core/optim/optim.py", line 437, in __init__
   super(nn.AdamWeightDecay, self).__init__(learning_rate, params, weight_decay)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/optim/optimizer.py", line 199, in __init__
   parameters = self._parameters_base_check(parameters, "parameters")
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/nn/optim/optimizer.py", line 400, in _parameters_base_check
   raise ValueError(f"For 'Optimizer', the argument {param_info} must not be empty.")
ValueError: For 'Optimizer', the argument parameters must not be empty.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/workspace/training-framework/mindformers-1.0/research/baichuan2/run_baichuan2.py", line 347, in <module>
   main(task=args.task,
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
   raise exc
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/workspace/training-framework/mindformers-1.0/research/baichuan2/run_baichuan2.py", line 279, in main
   trainer.finetune(finetune_checkpoint=ckpt, auto_trans_ckpt=config.auto_trans_ckpt, resume_training=resume)
 File "/root/miniconda3/envs/mindspore2.2.11_py39/lib/python3.9/site-packages/mindspore/_checkparam.py", line 1313, in wrapper
   return func(*args, **kwargs)
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/trainer.py", line 485, in finetune
   self.trainer.train(
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 99, in train
   self.training_process(
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/base_trainer.py", line 688, in training_process
   optimizer = self.create_optimizer_scheduler(network, layer_scale=config.layer_scale)
 File "/workspace/training-framework/mindformers-1.0/mindformers/trainer/base_trainer.py", line 452, in create_optimizer_scheduler
   self.optimizer = build_optim(
 File "/workspace/training-framework/mindformers-1.0/mindformers/core/optim/build_optim.py", line 67, in build_optim
   return MindFormerRegister.get_instance_from_cfg(
 File "/workspace/training-framework/mindformers-1.0/mindformers/tools/register/register.py", line 195, in get_instance_from_cfg
   raise type(e)('{}: {}'.format(obj_cls.__name__, e))
ValueError: FP32StateAdamWeightDecay: For 'Optimizer', the argument parameters must not be empty.
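
Reading the log above: the LoRA freeze step keeps only parameters matching '*mindpet_delta_lora*' trainable (include list: ['*'], exclude list: ['*mindpet_delta_lora*']), the trainer then reports "Network Parameters: 0 M", and the optimizer is finally constructed with an empty parameter list, which is exactly what the ValueError complains about. Below is a minimal, hypothetical pre-flight check, not part of mindformers, that assumes a built MindSpore network object `network` and the target_modules pattern shown in the logged pet_config; it would surface the empty-parameter condition before the optimizer is built:

```python
import re
import mindspore.nn as nn

def check_lora_trainable(network: nn.Cell,
                         target_modules: str = ".*query_key_value*"):
    """Hypothetical sanity check (not a mindformers API): verify that the
    LoRA target_modules pattern matches some parameter names and that
    trainable parameters remain after freezing."""
    all_names = [p.name for p in network.get_parameters()]
    matched = [n for n in all_names if re.match(target_modules, n)]
    trainable = network.trainable_params()

    print(f"total params: {len(all_names)}, "
          f"matched by target_modules: {len(matched)}, "
          f"trainable after freeze: {len(trainable)}")

    if not trainable:
        # An empty list here is what later triggers
        # "For 'Optimizer', the argument parameters must not be empty."
        raise ValueError("No trainable parameters left; check that "
                         "pet_config.target_modules matches this model's "
                         "parameter names.")
```

If `matched` comes back as zero, one plausible reading consistent with the log is that the configured target_modules pattern does not match this model's parameter names, so no LoRA weights are injected and nothing remains trainable after freezing.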

Special notes for this issue/备注 (Optional / 选填)

Comments (4)

liuliu910 created the Bug-Report

Please assign a maintainer to check this issue.
@fangwenyi @chengxiaoli @Shawny

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
     1. If you hit a PyNative (dynamic graph) issue, you can set set_context(pynative_synchronize=True) to inspect the error stack and help locate the problem (a minimal example follows this list)
     2. For model accuracy tuning issues, refer to the tuning guide on the official website
     3. If you are reporting a framework BUG, please make sure the ISSUE provides the information needed to locate it: the MindSpore version, the backend in use (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch commands that reproduce the error
     4. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open-source community and we will review it as soon as possible
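
For reference, a minimal sketch of the pynative_synchronize switch mentioned in point 1 above (plain MindSpore API usage, not specific to this issue):

```python
import mindspore as ms

# Run PyNative-mode operators synchronously so the Python error stack
# points at the operator that actually failed, rather than at an
# asynchronous launch site.
ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)
```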

Hello, we suggest moving to the mindformers issue tracker for more support: https://gitee.com/mindspore/mindformers/issues

i-robot added the gitee label
Shawny set the assignee to Shawny
Shawny added Shawny as a collaborator
Shawny changed the assignee from Shawny to lzy0920232
Shawny set the associated project to MindSpore Issue Assistant
Shawny set the planned start date to 2024-05-13
Shawny set the planned due date to 2024-06-13
Shawny added the mindspore-assistant label
Shawny added the sig/mindformers label
Shawny changed the task status from TODO to VALIDATION

Hello, since this issue has received no reply, we will close it later. If you still have questions, please provide the specific information and change the ISSUE status to WIP, and we will follow up further. Thank you.

Shawny changed the task status from VALIDATION to DONE
