| name | about | labels |
| --- | --- | --- |
| Bug Report | Use this template for reporting a bug | kind/bug |
[ST][MS][master][baichuan2_13b][dual-node 16p][finetune][910B] Finetuning baichuan2_13b on 910B1 fails with RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
Hardware Environment (Ascend/GPU/CPU):
/device ascend
Daily versions:
CANN version: MILAN-Florence-ASL/ABL V100R001C17SPC001B240 Alpha
MindSpore version: MindSpore_master_04959216 (MindSporeDaily)
MindFormers version: MindFormers_r1.1.0_23ac9b10 (MindFormersDaily)
Execution Mode (PyNative/Graph):
/mode graph
Testcase: test_mf_baichuan2_13b_train_belle_16p_0001
Steps to reproduce:
1. Get the code from mindformers.
2. cd mindformers
3. bash run_multinode.sh "python ./baichuan2/run_baichuan2.py --config /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/baichuan2/finetune_baichuan2_13b.yaml --load_checkpoint /home/workspace/large_model_ckpt//baichuan2/13b/train/rank_0/Baichuan2_13B_Base.ckpt --auto_trans_ckpt False --use_parallel True --run_mode finetune --train_data /home/workspace/large_model_dataset//baichuan2/belle_4096.mindrecord" /home/workspace/config/hccl_16p.json [0,8] 16 > /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/sh_distribute.log 2>&1
4. Check the log.

Expected result: the network evaluation succeeds.
Traceback (most recent call last):
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
result = run_func(*args, **kwargs)
File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 139, in main
build_context(config)
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 47, in build_context
local_rank, device_num = init_context(use_parallel=config.use_parallel,
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 122, in init_context
raise RuntimeError("Notice: if you are trying to run with a single device, please set "
RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
Traceback (most recent call last):
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 120, in init_context
init()
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindspore/communication/management.py", line 188, in init
init_hccl()
RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.
----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Malloc device memory failed, size[64424509440], ret[207001], Device 7 Available HBM size:65464696832 free size:65165602816 may be other processes occupying this card, check as: ps -ef|grep python
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EL0004: 2024-05-16-13:53:19.552.992 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:357 Init
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:293 MallocFromRts
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 216, in <module>
main(task=args.task,
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 44, in wrapper
raise exc
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
result = run_func(*args, **kwargs)
File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 139, in main
build_context(config)
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 47, in build_context
local_rank, device_num = init_context(use_parallel=config.use_parallel,
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 122, in init_context
raise RuntimeError("Notice: if you are trying to run with a single device, please set "
RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
Routed to 张森镇.
Cause: the log shows "Malloc device memory failed, size[64424509440], ret[207001], Device 7 Available HBM size:65464696832 free size:65165602816 may be other processes occupying this card, check as: ps -ef|grep python". The use_parallel RuntimeError is only a follow-up error ("During handling of the above exception, another exception occurred"): init_hccl() failed because device memory could not be allocated on Device 7, most likely because another process was still occupying the card.
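For reference, a minimal sketch of the occupancy check the log itself suggests. The `ps` invocation is quoted from the error message; `npu-smi info` is the standard Ascend device query tool (output fields vary by driver version), and neither command is specific to this test case:

```bash
# Look for stray Python processes that may still hold the card,
# as the error message itself suggests.
ps -ef | grep '[p]ython'

# Query HBM usage and the processes attached to each Ascend device.
npu-smi info
```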
Test environment issue; the ticket is returned (rejected).
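Since the root cause was a dirty test host rather than the framework, one way to fail fast is a pre-run guard on the CI machine. This is a hedged sketch, not part of the MindFormers tooling; the `run_baichuan2.py` pattern is just this case's entry script, and the guard assumes stale jobs appear as Python processes:

```bash
#!/bin/bash
# Hypothetical pre-run guard: abort before launching the 16p job if
# stale training processes are still holding the NPUs.
# The '[p]ython' pattern keeps grep from matching itself.
if ps -ef | grep '[p]ython' | grep -q 'run_baichuan2.py'; then
    echo "Stale training processes detected; clean them up before relaunching:"
    ps -ef | grep '[p]ython'
    exit 1
fi
```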