openGauss / openGauss-server

【Test type: fault injection】【Test version: 3.0.3】 openGauss fails to start after a fault-injected server reboot

Status: Accepted
Type: Defect
Created: 2023-12-20 17:15

【Title】:
【Test type: fault injection】【Test version: 3.0.3】 openGauss fails to start after a fault-injected server reboot

【OS and hardware information】(query commands: cat /etc/system-release, uname -a):
CentOS Linux release 7.9.2009 (Core)
Linux test123 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

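【Test environment】
Multi-instance environment (1 primary, 1 standby)
Single-instance environment

【Feature under test】:
Recovery behavior after fault injection

【Test type】:
Fault injection

【Database version】(query command: gaussdb -V):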
gaussdb (openGauss 3.0.3 build 94f6a79e) compiled at 2023-09-13 08:32:24 commit 0 last mr

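【Preconditions】:

  1. Install the enterprise edition following the official documentation

【Steps】(please provide detailed steps):

  1. Inject the fault (this forces an immediate reboot without syncing disks)
    echo b > /proc/sysrq-trigger

【Expected output】:

  1. After the server reboots, openGauss runs normally

【Actual output】: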
Failed to read gaussdb.state: 0
Failed to set gaussdb.state with UNKNOWN_STATE.

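【Root cause analysis】:

  1. Root cause
    When the server goes down abnormally, data/gaussdb.state can be left with a size of 0.

    During gs_ctl start, a validation step writes the instance state to gaussdb.state and first checks the existing file:
    If the file does not exist, startup continues.
    If the file exists but its size does not equal the size of the state struct, the process exits abnormally.
    A zero-length file therefore causes openGauss to fail to start.

  2. How the cause was identified

    1. After the server reboot, delete the file: restart succeeds
    2. After the server reboot, manually set the file size to 72 bytes: restart succeeds
    3. After the server reboot, empty the file: restart fails
    4. Conclusions cross-checked against the source code
  3. Other causes that may produce similar symptoms

    1. An unplanned shutdown: shutdown -h now (the problem reproduces only rarely this way)
  4. Temporary workarounds

    1. Delete the file
    2. Manually set the file size to 72 bytes (truncate -s 72 data/gaussdb.state)
  5. Proposed fix

    1. Evaluate whether the state can be written directly to the file when its size is 0, so that an abnormal crash does not block openGauss startup
  6. Estimated time to fix

    1. 1 day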
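
The start-time check and the first workaround described above can be sketched as a small pre-start helper. This is an illustrative shell sketch, not the gs_ctl implementation: the function names `state_file_status` and `repair_state_file` are hypothetical, and the expected size of 72 bytes is taken from the workaround in this report rather than from the struct definition in the source.

```shell
# Hypothetical helpers; the names and the 72-byte constant are assumptions
# drawn from this report, not from the openGauss source.
EXPECTED_STATE_SIZE=72

# Print "missing", "ok", or "size_mismatch" for a gaussdb.state path,
# mirroring the cases described in the root cause analysis.
state_file_status() {
    local path="$1" size
    if [ ! -e "$path" ]; then
        echo "missing"
        return 0
    fi
    size=$(wc -c < "$path" | tr -d '[:space:]')
    if [ "$size" -eq "$EXPECTED_STATE_SIZE" ]; then
        echo "ok"
    else
        echo "size_mismatch"
    fi
}

# Workaround 1 from the report: remove the file, but only when it is
# zero-length, so that a healthy state file is never touched.
repair_state_file() {
    local path="$1"
    if [ -e "$path" ] && [ ! -s "$path" ]; then
        rm -f -- "$path"
        echo "removed empty $path"
    fi
}
```

For example, `repair_state_file data/gaussdb.state` could be run before `gs_ctl start` on a node that was rebooted uncleanly. Removing only zero-length files is a deliberately conservative choice: a file of any other unexpected size is left in place for inspection.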

【Log information】(please attach log files, screenshots, coredump information):
2023-12-20 17:09:44.443 6582af58.1 [unknown] 140611239308416 [unknown] 0 dn_6001 DB010 0 [REDO] LOG: Recovery parallelism, cpu count = 4, max = 4, actual = 4
2023-12-20 17:09:44.443 6582af58.1 [unknown] 140611239308416 [unknown] 0 dn_6001 DB010 0 [REDO] LOG: ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0Failed to set gaussdb.state with UNKNOWN_STATE.[2023-12-20 17:09:45.471][118648][][gs_ctl]: waitpid 118654 failed, exitstatus is 256, ret is 2

[2023-12-20 17:09:45.471][118648][][gs_ctl]: stopped waiting
[2023-12-20 17:09:45.471][118648][][gs_ctl]: could not start server

【Test code】:
The fault was injected with the command shown above

Comments (5)

samli3388 created the defect


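samli3388 edited the title
samli3388 edited the description
samli3388 edited the title
samli3388 edited the description
周斌 set the assignee to lukeman
周斌 added collaborator pengjiong
周斌 set the associated project to openGauss 3.0.0 community
周斌 edited the remarks
lukeman changed the task status from To Do to Confirmed
lukeman changed the task status from Confirmed to Fixing
lukeman changed the task status from Fixing to Completed via opengauss/openGauss-server Pull Request !4659

Self-test on build B011:
[image]
[image]

lukeman changed the task status from Completed to Pending Regression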

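Acceptance date: 2024-1-16
Acceptance version: gsql (openGauss 6.0.0 build d2533e77) compiled at 2024-01-10 08:37:56 commit 0 last mr
Acceptance result: passed
[image]
[image]

wan005 changed the task status from Pending Regression to Accepted

Acceptance date: 2024-4-19
Acceptance version: gsql (openGauss 5.0.2 build 0db5202f) compiled at 2024-04-17 15:28:50 commit 0 last mr
Acceptance result: passed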

[peilq_502@kwepwebenv02644 dn1]$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node           node_ip         port      instance                                                     state
-------------------------------------------------------------------------------------------------------------------------------
1  kwepwebenv02644 10.29.180.204   50200      6001 /openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1   P Primary Normal
2  kwepwebenv07952 10.243.194.134  50200      6002 /openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1   S Standby Normal
3  kwemhisprc10431 7.212.123.28    50200      6003 /openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1   S Standby Normal
[peilq_502@kwepwebenv02644 dn1]$ ll gaussdb.state
-rw------- 1 peilq_502 peilq_502 72 Apr 18 14:50 gaussdb.state
[peilq_502@kwepwebenv02644 dn1]$ truncate -s 0 gaussdb.state
[peilq_502@kwepwebenv02644 dn1]$ ll gaussdb.state
-rw------- 1 peilq_502 peilq_502 0 Apr 19 15:47 gaussdb.state
[peilq_502@kwepwebenv02644 dn1]$ gs_ctl -D /openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1 restart
[2024-04-19 15:47:31.581][47700][][gs_ctl]: gs_ctl restarted ,datadir is /openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1
waiting for server to shut down... done
server stopped
[2024-04-19 15:47:35.595][47700][][gs_ctl]: waiting for server to start...
.0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

0 LOG:  [Alarm Module]Host Name: kwepwebenv02644

0 LOG:  [Alarm Module]Host IP: kwepwebenv02644. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

0 LOG:  [Alarm Module]Cluster Name: peilq_502

0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58

0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
0 LOG:  bbox_dump_path is set to /openGauss1/core/
2024-04-19 15:47:35.752 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  Recovery parallelism, cpu count = 8, max = 4, actual = 4
2024-04-19 15:47:35.752 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 DB010  0 [REDO] LOG:  ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
Failed to read gaussdb.state: 0, len: 02024-04-19 15:47:35.760 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

2024-04-19 15:47:35.760 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host Name: kwepwebenv02644

2024-04-19 15:47:35.760 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Host IP: kwepwebenv02644. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

2024-04-19 15:47:35.760 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Cluster Name: peilq_502

2024-04-19 15:47:35.760 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 58

2024-04-19 15:47:35.768 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  loaded library "security_plugin"
2024-04-19 15:47:35.771 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2024-04-19 15:47:35.780 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  InitNuma numaNodeNum: 1 numa_distribute_mode: none inheritThreadPool: 0.
2024-04-19 15:47:35.780 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 01000  0 [BACKEND] WARNING:  Failed to initialize the memory protect for g_instance.attr.attr_storage.cstore_buffers (1024 Mbytes) or shared memory (4477 Mbytes) is larger.
2024-04-19 15:47:35.879 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [CACHE] LOG:  set data cache  size(805306368)
2024-04-19 15:47:36.370 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [SEGMENT_PAGE] LOG:  Segment-page constants: DF_MAP_SIZE: 8156, DF_MAP_BIT_CNT: 65248, DF_MAP_GROUP_EXTENTS: 4175872, IPBLOCK_SIZE: 8168, EXTENTS_PER_IPBLOCK: 1021, IPBLOCK_GROUP_SIZE: 4090, BMT_HEADER_LEVEL0_TOTAL_PAGES: 8323072, BktMapEntryNumberPerBlock: 2038, BktMapBlockNumber: 25, BktBitMaxMapCnt: 512
2024-04-19 15:47:36.441 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  gaussdb: fsync file "/openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1/gaussdb.state.temp" success
2024-04-19 15:47:36.441 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  create gaussdb state file success: db state(STARTING_STATE), server mode(Primary), connection index(1)
2024-04-19 15:47:36.464 66222197.1 [unknown] 139828598239552 [unknown] 0 dn_6001_6002_6003 00000  0 [BACKEND] LOG:  max_safe_fds = 973, usable_fds = 1000, already_open = 17
bbox_dump_path is set to /openGauss1/core/
.
[2024-04-19 15:47:37.621][47700][][gs_ctl]:  done
[2024-04-19 15:47:37.621][47700][][gs_ctl]: server started (/openGauss/peilq_all/peilq_app/peilq_502/cluster/dn1)
[peilq_502@kwepwebenv02644 dn1]$
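
The manual verification in the transcript above (empty the state file, restart, confirm the server comes up) could be wrapped in a small repeatable check. The following is a sketch under stated assumptions: `check_empty_state_recovery` is a hypothetical name, the restart command is supplied by the caller, and the only gs_ctl form assumed is the `gs_ctl -D <datadir> restart` invocation shown in the transcript. It treats a non-empty gaussdb.state after restart as success, which matches the "create gaussdb state file success" log line but is not a documented guarantee.

```shell
# Hypothetical regression check: empty gaussdb.state, run the given restart
# command, then report whether the state file was rewritten.
# Usage: check_empty_state_recovery <datadir> <restart command...>
check_empty_state_recovery() {
    local datadir="$1"
    shift
    local state="$datadir/gaussdb.state"

    # Simulate the post-crash condition from the report: a zero-length file.
    : > "$state" || return 2

    # Run the caller-supplied restart command and fail if it fails.
    if ! "$@"; then
        echo "FAIL: restart command exited with an error"
        return 1
    fi

    # After a successful start the file is expected to be non-empty again.
    if [ -s "$state" ]; then
        echo "PASS: $state rewritten"
    else
        echo "FAIL: $state is still empty"
        return 1
    fi
}
```

On a test instance this might be invoked as `check_empty_state_recovery "$DN" gs_ctl -D "$DN" restart`, where `DN` is a placeholder variable holding the data directory. It empties the state file of a running instance, so it should not be pointed at a production node.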
