Fixing a Kubesphere Login Failure Caused by a Downed Redis Pod

Published: 6/20/2024, 8:00:16 PM @孙博
Tech Share | Kubesphere, K8S, K3S, Redis, Ops
License: Attribution-NonCommercial (by-nc)

A record of a Kubesphere login failure. The symptom: clicking the login button did nothing, with no error message shown in the UI.
Pressing F12 to open the browser developer tools revealed that clicking the login button made the login endpoint respond with:

{"status":500,"reason":"Internal Server Error"}

That didn't reveal anything more useful, so I had to log in to the machine and investigate.

The Kubesphere version is 3.4.1.

The runtime environment is a K3S cluster with 3 masters and 6 agents; version info below:

[root@k3s-master-01 ~]# kubectl version
Client Version: v1.29.3+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3+k3s1

First, check what error is being reported:

[root@k3s-master-01 ~]# kubectl -n kubesphere-system logs -l app=ks-console
unable to retrieve container logs for containerd://eb35**********35b6  <-- POST /login 2024/06/20T17:22:49.459
{
  code: 500,
  error: 'server_error',
  error_description: 'dial tcp 10.**.**.88:6379: connect: connection refused',
  statusText: 'Internal Server Error'
}
  --> POST /login 200 1,123ms 47b 2024/06/20T17:22:50.581
  <-- GET /login 2024/06/20T19:02:54.937
  --> GET /login 200 160ms 17.79kb 2024/06/20T19:02:55.097
}
  --> POST /login 200 1,122ms 47b 2024/06/20T18:50:38.581
  <-- POST /login 2024/06/20T19:03:15.057
{
  code: 500,
  error: 'server_error',
  error_description: 'dial tcp 10.**.**.88:6379: connect: connection refused',
  statusText: 'Internal Server Error'
}
  --> POST /login 200 1,119ms 47b 2024/06/20T19:03:16.175
    at getCurrentUser (/opt/kubesphere/console/server/server.js:6992:14)
    at renderView (/opt/kubesphere/console/server/server.js:72141:7)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async /opt/kubesphere/console/server/server.js:40154:7
    at async logger (/opt/kubesphere/console/server/server.js:37098:7)
    at async /opt/kubesphere/console/server/server.js:31861:26
    at async /opt/kubesphere/console/server/server.js:31861:26
    at async /opt/kubesphere/console/server/server.js:31861:26
  --> GET / 302 131ms 43b 2024/06/20T19:02:54.907

Clearly Redis was at fault. Per the documentation at https://www.kubesphere.io/zh/docs/v3.4/faq/access-control/cannot-login/#redis-%E5%BC%82%E5%B8%B8, ks-console and ks-apiserver rely on Redis to share data across multiple replicas. Since none of the Redis-related configuration had been changed when Kubesphere was installed, I followed the official guide to see where the problem was.
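
The error above mentions a refused connection to 10.**.**.88:6379. A quick way to cross-check that this address really belongs to Kubesphere's built-in Redis (a sketch; assumes the default service name redis in the kubesphere-system namespace):

kubectl -n kubesphere-system get svc redis -o wide
# With the redis pod down, the Endpoints list will be empty, which matches
# the "connection refused" seen in the ks-console log:
kubectl -n kubesphere-system get endpoints redis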

[root@k3s-master-01 ~]# kubectl get pod -A | grep redis
kubesphere-system              redis-5596f89c8f-csvmg                                            0/1     Pending                  0              16m

Sure enough, redis never started. Let's see why:

[root@k3s-master-01 ~]# kubectl describe pod redis-5596f89c8f-csvmg -n kubesphere-system
# ...
# irrelevant output omitted
# ...
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  17m                  default-scheduler  0/9 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 8 node is filtered out by the prefilter result. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  7m21s (x2 over 12m)  default-scheduler  0/9 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 8 node is filtered out by the prefilter result. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
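
To identify which node is carrying the disk-pressure taint (a sketch; the node name will be whichever machine is actually full):

# List every node alongside its taint keys to spot node.kubernetes.io/disk-pressure
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
# Then verify disk usage on that node directly, e.g.:
#   ssh <node> df -h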

So the disk filled up and the pod could no longer be scheduled: one node carries the disk-pressure taint, and the other 8 are rejected by the prefilter, most likely because the local-path PV pins the pod to that full node via nodeAffinity. This redis comes with the default install and serves no other purpose, so let's just reschedule it. Since Kubesphere's redis is deployed via a Deployment and owns a PVC, first manually scale it down to 0 replicas:

[root@k3s-master-01 ~]# kubectl scale deploy redis --replicas=0 -n kubesphere-system
deployment.apps/redis scaled

Then delete the now-useless PVC, so that after scaling back up the pod won't land on that node again. I had already found the PVC's name to be redis-pvc; if you're worried about getting it wrong, you can check with describe before destroying the pod (I'll skip pasting that output since it's long); by default the name is redis-pvc and shouldn't differ.
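
For instance (a minimal sketch; assumes the default Deployment name redis):

# List all PVCs in the namespace -- the claim named redis-pvc is the one to remove
kubectl -n kubesphere-system get pvc
# Or read the claim name straight out of the Deployment's pod template
kubectl -n kubesphere-system get deploy redis \
  -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'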

[root@k3s-master-01 ~]# kubectl delete pvc redis-pvc -n kubesphere-system
persistentvolumeclaim "redis-pvc" deleted

Don't forget to delete the PV as well; locate it first:

[root@k3s-master-01 ~]# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                                             STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-cefb1990-dede-4205-b847-f070201f78b3   2Gi        RWO            Delete           Released   kubesphere-system/redis-pvc                                                       local-path     <unset>                          58d
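
Before deleting, you can also confirm why the other 8 nodes were filtered out by the prefilter: a local-path PV carries a nodeAffinity that pins it to the node it was provisioned on, which here is the node under disk pressure (a sketch using the PV name from above):

# Print the node-affinity constraint baked into the local-path PV
kubectl get pv pvc-cefb1990-dede-4205-b847-f070201f78b3 \
  -o jsonpath='{.spec.nodeAffinity.required}'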

Then delete it:

[root@k3s-master-01 ~]# kubectl delete pv pvc-cefb1990-dede-4205-b847-f070201f78b3
persistentvolume "pvc-cefb1990-dede-4205-b847-f070201f78b3" deleted

Next, recreate the PVC. Create a redis-pvc.yaml locally with a configuration matching the original claim (same name, namespace, size, and storage class, so the Deployment can mount it unchanged):

cat > redis-pvc.yaml << EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: redis-pvc
  namespace: kubesphere-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: local-path
EOF

Then apply it to recreate the claim:

[root@k3s-master-01 ~]# kubectl apply -f redis-pvc.yaml
persistentvolumeclaim/redis-pvc created
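
Note that if the local-path StorageClass uses volumeBindingMode: WaitForFirstConsumer (the provisioner's default), the new PVC will sit in Pending until a pod actually mounts it; that is expected and harmless:

kubectl -n kubesphere-system get pvc redis-pvc
# STATUS "Pending" here is fine -- it will bind once the redis pod is scheduled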

Then scale redis back up:

[root@k3s-master-01 ~]# kubectl scale deploy redis --replicas=1 -n kubesphere-system
deployment.apps/redis scaled

Kubesphere's default install runs a single-replica redis, so there's no point setting more replicas here.
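
Instead of polling get pod, you can also block until the rollout completes (again assuming the Deployment name redis):

# Returns once the new replica reports Ready
kubectl -n kubesphere-system rollout status deploy/redis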

Wait a little while; once the redis pod is up and running, the console login works again:

[root@k3s-master-01 ~]# kubectl get pod -A | grep redis
kubesphere-system              redis-5596f89c8f-nlhz2                                            1/1     Running                  0              41s
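
As a final sanity check before heading back to the browser, you can ping Redis directly (a sketch; assumes the pod image ships redis-cli, as the stock redis image does):

kubectl -n kubesphere-system exec deploy/redis -- redis-cli ping
# Expected reply: PONG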

And that's the fix. Since the redis pod died because the disk filled up, I still need to expand the disks later, and ideally move Kubesphere's Redis dependency out of the cluster altogether, but that's a story for another time.