@gigstox2

Errors "CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0": how do I diagnose what the problem is? Memory? CPU?

Good afternoon.
I bought myself this configuration:
Xeon E5 v2, 4×16 GB DDR3 ECC
It runs Proxmox with several Ubuntu virtual machines. It can work fine for about 5 hours, but periodically performance problems begin and errors like this appear in the system journal:

CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0

The slots and channels differ from one error to the next. Here is a large chunk of the log with the errors:

Log excerpt with errors

Feb 02 14:55:38 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x1d832
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc0000c1000800c1
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: TSC 1b1b0cd4ea7c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: ADDR 1d832000
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: MISC 9000043c3c0028c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706874938 SOCKET 0 APIC 0
Feb 02 14:55:38 vhcalnplci kernel: EDAC MC0: 3 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x1d832 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255 )
Feb 02 14:55:38 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x100a43
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00008000010091
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: TSC 1b1b0cd67aa2
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: ADDR 100a43000
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: MISC 205030b086
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706874938 SOCKET 0 APIC 0
Feb 02 14:55:38 vhcalnplci kernel: EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x100a43 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:0 )
Feb 02 14:55:38 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x1d832
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000101000800c1
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: TSC 1b1b0cd67aa2
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: ADDR 1d832000
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: MISC 90000424080028c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706874938 SOCKET 0 APIC 0
Feb 02 14:55:38 vhcalnplci kernel: EDAC MC0: 4 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x1d832 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255 )
Feb 02 14:55:38 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x1d833
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000081000800c1
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: TSC 1b1b0cd8190c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: ADDR 1d833000
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: MISC 9000181c041828c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706874938 SOCKET 0 APIC 0
Feb 02 14:55:38 vhcalnplci kernel: EDAC MC0: 2 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x1d833 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255 )
Feb 02 14:55:38 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x1d833
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: cc000581000800c1
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: TSC 1b1b0cdaacdd
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: ADDR 1d833000
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: MISC 90001010200028c
Feb 02 14:55:38 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706874938 SOCKET 0 APIC 0
Feb 02 14:55:38 vhcalnplci kernel: EDAC MC0: 22 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x1d833 offset:0x0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:255 )
Feb 02 14:55:38 vhcalnplci kernel: Memory failure: 0x100a43: unhandlable page.
Feb 02 14:55:38 vhcalnplci kernel: soft_offline_page: 0x1d832 page already poisoned
Feb 02 14:55:38 vhcalnplci kernel: soft_offline_page: 0x1d833 page already poisoned
Feb 02 14:56:19 vhcalnplci smartd[846]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 115 to 117
Feb 02 14:56:19 vhcalnplci smartd[846]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27
Feb 02 14:56:19 vhcalnplci systemd[1]: Starting apt-daily.service - Daily apt download activities...
Feb 02 14:56:19 vhcalnplci systemd[1]: apt-daily.service: Deactivated successfully.
Feb 02 14:56:19 vhcalnplci systemd[1]: Finished apt-daily.service - Daily apt download activities.
Feb 02 15:00:39 vhcalnplci kernel: mce: CMCI storm subsided: switching to interrupt mode
Feb 02 15:03:16 vhcalnplci kernel: mce: CMCI storm detected: switching to poll mode
Feb 02 15:03:16 vhcalnplci kernel: mce_notify_irq: 26 callbacks suppressed
Feb 02 15:03:16 vhcalnplci kernel: mce: [Hardware Error]: Machine check events logged
Feb 02 15:03:16 vhcalnplci kernel: mce: [Hardware Error]: Machine check events logged
Feb 02 15:03:16 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x100a43
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00010000010091
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: TSC 1c051d53836a
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: ADDR 100a43140
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: MISC 205070f086
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706875396 SOCKET 0 APIC 0
Feb 02 15:03:16 vhcalnplci kernel: EDAC MC0: 4 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x100a43 offset:0x140 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:0 )
Feb 02 15:03:16 vhcalnplci kernel: RAS: Soft-offlining pfn: 0x100a4e
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00020000010091
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: TSC 1c051d54bc5d
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: ADDR 100a4e240
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: MISC 2050565686
Feb 02 15:03:16 vhcalnplci kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1706875396 SOCKET 0 APIC 0
Feb 02 15:03:16 vhcalnplci kernel: EDAC MC0: 8 CE memory read error on CPU_SrcID#0_Ha#0_Chan#3_DIMM#1 (channel:3 slot:1 page:0x100a4e offset:0x240 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:8 rank:5 )
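To check the "different slots, different channels" observation, the EDAC summary lines in the excerpt above can be tallied per channel and slot. Below is a minimal Python sketch, assuming the line format is exactly the one shown in this log (other kernels or EDAC drivers may print it differently). The names `CE_LINE` and `tally_ce` are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Matches the "EDAC MCn: <count> CE ..." summary line as it appears in this log.
# The "EDAC sbridge MCn: ..." detail lines are intentionally not matched.
# Assumption: field order (channel, optional slot, page) is as in the excerpt.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>.+?) "
    r"\(channel:(?P<channel>\d+)(?: slot:(?P<slot>\d+))? page:(?P<page>0x[0-9a-f]+)"
)


def tally_ce(lines):
    """Sum corrected-error counts per (channel, slot).

    slot is None when the line names two candidate DIMMs and carries
    no "slot:" field, i.e. EDAC could not pin the error to one slot.
    """
    totals = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m is None:
            continue  # not a CE summary line
        slot = int(m["slot"]) if m["slot"] is not None else None
        totals[(int(m["channel"]), slot)] += int(m["count"])
    return totals
```

Applied to the lines of this excerpt, the tally comes out as 31 errors on channel 1 with no resolved slot, 2 on channel 1 slot 0, 4 on channel 3 slot 0, and 8 on channel 3 slot 1. Errors spread over more than one channel within a few minutes are what the sketch is meant to make visible; on its own it does not say whether the DIMMs, the slots, or the CPU's memory controller are at fault.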



memtest reports no errors. Can you suggest which direction to dig in?
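One place to start is the raw status word printed on each "Machine Check Event" line (for example `cc000581000800c1`). The sketch below splits such a value into fields, assuming the standard architectural layout of the `IA32_MCi_STATUS` register described in the Intel SDM. The function name `decode_mci_status` and the dictionary keys are hypothetical, and model-specific bits are left undecoded.

```python
def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields.

    Assumption: bit positions follow the Intel SDM machine-check
    architecture chapter; bits not listed here are ignored.
    """
    return {
        "valid": bool(status >> 63 & 1),            # VAL
        "overflow": bool(status >> 62 & 1),         # OVER: earlier errors were overwritten
        "uncorrected": bool(status >> 61 & 1),      # UC
        "enabled": bool(status >> 60 & 1),          # EN
        "misc_valid": bool(status >> 59 & 1),       # MISCV
        "addr_valid": bool(status >> 58 & 1),       # ADDRV
        "context_corrupt": bool(status >> 57 & 1),  # PCC
        "corrected_count": status >> 38 & 0x7FFF,   # bits 52:38
        "model_specific_code": status >> 16 & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


fields = decode_mci_status(0xCC000581000800C1)
print(fields["uncorrected"], fields["overflow"], fields["corrected_count"])
```

For the status values in this log the uncorrected bit is clear and the overflow bit is set, which agrees with the "CE" and "OVERFLOW" markers in the EDAC lines, and the two 16-bit codes reproduce the `err_code:0008:00c1` field.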