
[ST][MS][master][baichuan2_13b][2 nodes, 16p][finetune][910B] Fine-tuning fails on 910B1, RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False.

REJECTED
Bug-Report
Created on 2024-05-16 20:08

Describe the current behavior (Mandatory)

[ST][MS][master][baichuan2_13b][2 nodes, 16p][finetune][910B] Fine-tuning fails on 910B1 with: RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Daily build versions:
CANN version: MILAN-Florence-ASL/ABL V100R001C17SPC001B240 Alpha
MindSpore version: MindSpore_master_04959216 (MindSporeDaily)
MindFormers version: MindFormers_r1.1.0_23ac9b10 (MindFormersDaily)

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_mf_qwen_7b_infer_batch_incremental_1p_0001

Steps to reproduce the issue (Mandatory)

  1. get code from mindformers

  2. cd mindformers

  3. bash run_multinode.sh '''python ./baichuan2/run_baichuan2.py --config /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/baichuan2/finetune_baichuan2_13b.yaml --load_checkpoint /home/workspace/large_model_ckpt//baichuan2/13b/train/rank_0/Baichuan2_13B_Base.ckpt --auto_trans_ckpt False --use_parallel True --run_mode finetune --train_data /home/workspace/large_model_dataset//baichuan2/belle_4096.mindrecord''' /home/workspace/config/hccl_16p.json [0,8] 16 > /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/sh_distribute.log 2>&1

  4. Check the log.
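
Checking the log in step 4 can be partly automated. The following is a minimal sketch, assuming the log uses the `- Framework Error Message:` and `- Ascend Error Message:` section headers that appear in the log attached to this issue. The function name `extract_error_sections` is hypothetical and is not part of MindSpore or MindFormers.

```python
# Section headers copied from the error report format shown in this issue.
SECTION_MARKERS = ("- Framework Error Message:", "- Ascend Error Message:")


def extract_error_sections(log_text):
    """Return {marker: first meaningful line after that marker}."""
    found = {}
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        marker = line.strip()
        if marker in SECTION_MARKERS and marker not in found:
            for follow in lines[i + 1:]:
                text = follow.strip()
                # Skip blank lines and the dashed separator rows.
                if text and set(text) != {"-"}:
                    found[marker] = text
                    break
    return found
```

Calling `extract_error_sections(open(path).read())` on the redirected `sh_distribute.log` would return the first line of each error section, which is usually enough to tell a device-side failure from a configuration problem.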

Describe the expected behavior (Mandatory)

The network evaluation succeeds.

Related log / screenshot (Mandatory)

Traceback (most recent call last):
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 139, in main
   build_context(config)
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 47, in build_context
   local_rank, device_num = init_context(use_parallel=config.use_parallel,
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 122, in init_context
   raise RuntimeError("Notice: if you are trying to run with a single device, please set "
RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.

Traceback (most recent call last):
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 120, in init_context
   init()
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindspore/communication/management.py", line 188, in init
   init_hccl()
RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Malloc device memory failed, size[64424509440], ret[207001], Device 7 Available HBM size:65464696832 free size:65165602816 may be other processes occupying this card, check as: ps -ef|grep python

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EL0004: 2024-05-16-13:53:19.552.992 Failed to allocate memory.
       Possible Cause: Available memory is insufficient.
       Solution: Close applications not in use.
       TraceBack (most recent call last):
       rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
       alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:357 Init
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:293 MallocFromRts


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 216, in <module>
   main(task=args.task,
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
   raise exc
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
   result = run_func(*args, **kwargs)
 File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 139, in main
   build_context(config)
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 47, in build_context
   local_rank, device_num = init_context(use_parallel=config.use_parallel,
 File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 122, in init_context
   raise RuntimeError("Notice: if you are trying to run with a single device, please set "
RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
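
The "During handling of the above exception, another exception occurred" line means the `use_parallel=False` notice is a wrapper and the HCCL initialization failure is the underlying error. The sketch below reproduces that pattern with stand-in functions; it is an illustration of Python's implicit exception chaining, not the actual MindFormers `init_context` source, and `root_cause` is a hypothetical helper.

```python
def init():
    # Stand-in for the communication init call; it always fails here.
    raise RuntimeError("Ascend kernel runtime initialization failed.")


def init_context(use_parallel):
    if use_parallel:
        try:
            init()
        except Exception:
            # Raising inside `except` without `from` links the original
            # error through __context__, which produces the
            # "During handling of the above exception" traceback.
            raise RuntimeError(
                "Notice: if you are trying to run with a single device, "
                "please set use_parallel=False. If not, please check the "
                "error message above."
            )


def root_cause(exc):
    """Follow __cause__/__context__ links back to the first exception."""
    while exc.__cause__ is not None or exc.__context__ is not None:
        exc = exc.__cause__ or exc.__context__
    return exc
```

Catching the error from `init_context(use_parallel=True)` and passing it to `root_cause` returns the initialization failure rather than the notice, which is the error to investigate first.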

Special notes for this issue (Optional)

Route to 张森镇

Comments (2)

sunjiawei999 created the Bug-Report
sunjiawei999 copied from task I9PUWD
sunjiawei999 added labels attr/function, device/ascend, stage/func-debug, v2.3.0.rc2, v2.3.0.rc3, sig/mindformers, kind/bug
sunjiawei999 added collaborator hsshuai
sunjiawei999 edited the description (twice)
sunjiawei999 edited the title
wangxingyan added collaborator wangxingyan
wangxingyan changed the assignee from wangxingyan to 森镇
zhongjicheng removed labels v2.3.0.rc2, v2.3.0.rc3
zhongjicheng added label master
linzhengshu removed label master
linzhengshu added label v2.3.0.rc2
linzhengshu removed label master
linzhengshu added label v2.3.0.rc2
linzhengshu removed label v2.3.0.rc2
zhongjicheng set the linked repository to MindSpore/mindspore
zhongjicheng set the linked branch to master
zhongjicheng added label v2.3.0.rc3
zhongjicheng changed the linked branch from master to r2.3.q1
zhongjicheng added label master
zhongjicheng changed the linked branch from r2.3.q1 to master

Cause: the log contains "Malloc device memory failed, size[64424509440], ret[207001], Device 7 Available HBM size:65464696832 free size:65165602816 may be other processes occupying this card, check as: ps -ef|grep python".

This is a test environment problem, so the issue is rejected and returned.
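
The byte counts in that message are easier to judge in GiB. The sketch below parses the line and converts the values; the regular expression is written against the exact wording quoted in this issue and may not match other MindSpore versions, and `summarize_malloc_failure` is a hypothetical helper name.

```python
import re

# Pattern based on the message text quoted in this issue (an assumption
# about the format, not a documented MindSpore contract).
MALLOC_RE = re.compile(
    r"Malloc device memory failed, size\[(\d+)\].*?"
    r"Device (\d+) Available HBM size:(\d+) free size:(\d+)"
)


def summarize_malloc_failure(line):
    """Return device id and sizes in GiB, or None if the line differs."""
    match = MALLOC_RE.search(line)
    if match is None:
        return None
    requested, device, total, free = (int(g) for g in match.groups())
    gib = 1024 ** 3
    return {
        "device": device,
        "requested_gib": requested / gib,
        "available_gib": total / gib,
        "free_gib": free / gib,
        "in_use_bytes": total - free,  # HBM already held when init ran
    }
```

For the line above, this gives a 60.0 GiB request on device 7 with 299094016 bytes (about 285 MiB) already in use. The reported free size is slightly larger than the request, so the numbers alone do not show why the driver returned out-of-memory; the log's own suggestion is to check for other processes on the card with `ps -ef|grep python`.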

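Lin changed the task status from TODO to REJECTED
Lin changed the assignee from 森镇 to sunjiawei999
Lin removed collaborator sunjiawei999
Lin added collaborator 森镇
Lin changed the milestone from B-SIG-MindFormers to B-SolutionTest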
