@KaLans

How do I fix the "osds down" error in CEPH?

Hello. Our server cluster went down hard: three Proxmox servers with CEPH built on top of them.
After booting back up, the VMs did not start, and Ceph is now showing a pile of warnings.
Is it realistic to recover all of this?

ceph df

[root@altpve-1 ~]# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    2.7 TiB  1.8 TiB  965 GiB   965 GiB      34.53
TOTAL  2.7 TiB  1.8 TiB  965 GiB   965 GiB      34.53
 
--- POOLS ---
POOL          ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr           1    1  9.5 MiB        4   19 MiB      0    3.9 TiB
CEPH_POOL_01   3  119  497 GiB  132.85k  963 GiB  10.77    4.0 TiB
[root@altpve-1 ~]#



ceph -s

[root@altpve-1 ~]# ceph -s
  cluster:
    id:     c3e6d6cb-03a2-4fec-86df-231ccf7a4100
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            Reduced data availability: 9 pgs inactive
            Degraded data redundancy: 141155/398559 objects degraded (35.416%), 119 pgs degraded, 120 pgs undersized
            9 pgs not deep-scrubbed in time
            9 pgs not scrubbed in time
 
  services:
    mon: 1 daemons, quorum altpve-1 (age 12m)
    mgr: altpve-1(active, since 4m)
    osd: 3 osds: 2 up (since 8m), 3 in (since 25m); 1 remapped pgs
 
  data:
    pools:   2 pools, 120 pgs
    objects: 132.85k objects, 512 GiB
    usage:   965 GiB used, 1.8 TiB / 2.7 TiB avail
    pgs:     7.500% pgs not active
             141155/398559 objects degraded (35.416%)
             2100/398559 objects misplaced (0.527%)
             110 active+undersized+degraded
             9   undersized+degraded+peered
             1   active+undersized+remapped



systemctl status ceph-osd@0.service

[root@altpve-1 ~]# systemctl status ceph-osd@0.service 
× ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
    Drop-In: /lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: signal) since Mon 2024-10-14 09:17:51 +05; 28min ago
    Process: 7290 ExecStartPre=/usr/libexec/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
    Process: 7294 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Main PID: 7294 (code=killed, signal=ABRT)
        CPU: 22.178s

окт 14 09:17:51 altpve-1 systemd[1]: ceph-osd@0.service: Consumed 22.178s CPU time.
окт 14 09:17:51 altpve-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
окт 14 09:17:51 altpve-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
окт 14 09:17:51 altpve-1 systemd[1]: Failed to start Ceph object storage daemon osd.0.
окт 14 09:27:30 altpve-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
окт 14 09:27:30 altpve-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
окт 14 09:27:30 altpve-1 systemd[1]: Failed to start Ceph object storage daemon osd.0.
окт 14 09:32:54 altpve-1 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
окт 14 09:32:54 altpve-1 systemd[1]: ceph-osd@0.service: Failed with result 'signal'.
окт 14 09:32:54 altpve-1 systemd[1]: Failed to start Ceph object storage daemon osd.0.
[root@altpve-1 ~]#
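
Note: the "Start request repeated too quickly" lines only mean that systemd hit its start rate limit after ceph-osd kept dying with SIGABRT; the actual abort message is earlier in the journal. A minimal way to read it and to clear the counter for another start attempt (plain systemd/journalctl commands, nothing cluster-specific assumed):

journalctl -u ceph-osd@0.service -b --no-pager | tail -n 200
systemctl reset-failed ceph-osd@0.service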



ceph health detail

[root@altpve-1 ~]# ceph health detail
HEALTH_WARN Reduced data availability: 9 pgs inactive; Degraded data redundancy: 141155/398559 objects degraded (35.416%), 119 pgs degraded, 120 pgs undersized; 9 pgs not deep-scrubbed in time; 9 pgs not scrubbed in time
[WRN] PG_AVAILABILITY: Reduced data availability: 9 pgs inactive
    pg 3.4 is stuck inactive for 30m, current state undersized+degraded+peered, last acting [2]
    pg 3.12 is stuck inactive for 65m, current state undersized+degraded+peered, last acting [2]
    pg 3.1b is stuck inactive for 30m, current state undersized+degraded+peered, last acting [2]
    pg 3.27 is stuck inactive for 30m, current state undersized+degraded+peered, last acting [2]
    pg 3.36 is stuck inactive for 2h, current state undersized+degraded+peered, last acting [2]
    pg 3.4d is stuck inactive for 30m, current state undersized+degraded+peered, last acting [2]
    pg 3.4e is stuck inactive for 2h, current state undersized+degraded+peered, last acting [2]
    pg 3.5d is stuck inactive for 30m, current state undersized+degraded+peered, last acting [2]
    pg 3.61 is stuck inactive for 31m, current state undersized+degraded+peered, last acting [2]
[WRN] PG_DEGRADED: Degraded data redundancy: 141155/398559 objects degraded (35.416%), 119 pgs degraded, 120 pgs undersized
    pg 3.44 is active+undersized+degraded, acting [1,2]
    pg 3.45 is stuck undersized for 30m, current state active+undersized+degraded, last acting [1,2]
    pg 3.46 is stuck undersized for 30m, current state active+undersized+degraded, last acting [1,2]
    
...
    pg 3.72 is stuck undersized for 30m, current state active+undersized+degraded, last acting [2,1]
    pg 3.73 is stuck undersized for 30m, current state active+undersized+degraded, last acting [2,1]
    pg 3.74 is stuck undersized for 30m, current state active+undersized+degraded, last acting [2,1]
    pg 3.75 is stuck undersized for 30m, current state active+undersized+degraded, last acting [1,2]
    pg 3.76 is stuck undersized for 30m, current state active+undersized+remapped, last acting [2,1]
[WRN] PG_NOT_DEEP_SCRUBBED: 9 pgs not deep-scrubbed in time
    pg 3.61 not deep-scrubbed since 2023-11-29T10:45:22.842288+0500
    pg 3.5d not deep-scrubbed since 2023-11-29T10:45:22.842288+0500

...

    pg 3.27 not deep-scrubbed since 2023-11-29T10:45:22.842288+0500
    pg 3.36 not deep-scrubbed since 2023-11-29T10:45:22.842288+0500
[WRN] PG_NOT_SCRUBBED: 9 pgs not scrubbed in time
    pg 3.61 not scrubbed since 2023-11-29T10:45:22.842288+0500
    pg 3.5d not scrubbed since 2023-11-29T10:45:22.842288+0500

...

    pg 3.27 not scrubbed since 2023-11-29T10:45:22.842288+0500
    pg 3.36 not scrubbed since 2023-11-29T10:45:22.842288+0500
[root@altpve-1 ~]#

Answers: 1
@KaLans (question author)
If I run
systemctl daemon-reload
and then
systemctl start ceph-osd@0.service
everything starts, but for half a minute at most, and then it crashes again.
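
Since osd.0 stays up for roughly half a minute and then dies with SIGABRT, the next step would normally be to look at the actual crash report rather than the systemd wrapper messages. A rough sketch with standard Ceph tools (the crash ID below is a placeholder, and the data path assumes the default /var/lib/ceph/osd/ceph-0):

ceph crash ls
ceph crash info <crash-id>

and, with the OSD stopped, an offline consistency check of its BlueStore store:

ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0

If the fsck reports read errors, checking the underlying disk (e.g. smartctl -a /dev/sdX) would be the next thing to try.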