A record of a KubeSphere login failure. The symptom: clicking the login button did nothing, with no error shown anywhere in the UI.
Pressing F12 and watching the network tab showed that clicking login made the login endpoint respond with:
{"status":500,"reason":"Internal Server Error"}
No more useful detail there, so the only option was to log in to the machines and dig deeper.
The KubeSphere version is 3.4.1.
It runs on a K3s cluster with 3 masters and 6 agents; version info:
[root@k3s-master-01 ~]# kubectl version
Client Version: v1.29.3+k3s1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.3+k3s1
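Before pulling logs, it helps to confirm the console pods are actually there; in a 3.x install they carry the app=ks-console label, the same selector used below (output omitted):

kubectl -n kubesphere-system get pod -l app=ks-console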
First, let's see what error is actually being reported:
[root@k3s-master-01 ~]# kubectl -n kubesphere-system logs -l app=ks-console
unable to retrieve container logs for containerd://eb35**********35b6 <-- POST /login 2024/06/20T17:22:49.459
{
code: 500,
error: 'server_error',
error_description: 'dial tcp 10.**.**.88:6379: connect: connection refused',
statusText: 'Internal Server Error'
}
--> POST /login 200 1,123ms 47b 2024/06/20T17:22:50.581
<-- GET /login 2024/06/20T19:02:54.937
--> GET /login 200 160ms 17.79kb 2024/06/20T19:02:55.097
}
--> POST /login 200 1,122ms 47b 2024/06/20T18:50:38.581
<-- POST /login 2024/06/20T19:03:15.057
{
code: 500,
error: 'server_error',
error_description: 'dial tcp 10.**.**.88:6379: connect: connection refused',
statusText: 'Internal Server Error'
}
--> POST /login 200 1,119ms 47b 2024/06/20T19:03:16.175
at getCurrentUser (/opt/kubesphere/console/server/server.js:6992:14)
at renderView (/opt/kubesphere/console/server/server.js:72141:7)
at runMicrotasks (<anonymous>)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async /opt/kubesphere/console/server/server.js:40154:7
at async logger (/opt/kubesphere/console/server/server.js:37098:7)
at async /opt/kubesphere/console/server/server.js:31861:26
at async /opt/kubesphere/console/server/server.js:31861:26
at async /opt/kubesphere/console/server/server.js:31861:26
--> GET / 302 131ms 43b 2024/06/20T19:02:54.907
The culprit is clearly Redis. Per the docs (https://www.kubesphere.io/zh/docs/v3.4/faq/access-control/cannot-login/#redis-%E5%BC%82%E5%B8%B8), ks-console and ks-apiserver rely on Redis to share data across replicas. The Redis-related configuration was never changed when KubeSphere was installed, so let's follow the official guidance and track down the problem.
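The 10.**.**.88:6379 in the error message should be the ClusterIP of that in-cluster Redis. Assuming the default service name redis, a quick sanity check is:

kubectl -n kubesphere-system get svc redis -o wide

If the address matches and connections are still refused, the Redis pod itself is the next suspect.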
[root@k3s-master-01 ~]# kubectl get pod -A | grep redis
kubesphere-system redis-5596f89c8f-csvmg 0/1 Pending 0 16m
Sure enough, Redis never came up. Let's find out why:
[root@k3s-master-01 ~]# kubectl describe pod redis-5596f89c8f-csvmg -n kubesphere-system
# ...
# irrelevant output omitted
# ...
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 17m default-scheduler 0/9 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 8 node is filtered out by the prefilter result. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
Warning FailedScheduling 7m21s (x2 over 12m) default-scheduler 0/9 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }, 8 node is filtered out by the prefilter result. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling.
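To see which node actually carries the disk-pressure taint, one option (any equivalent works) is to print each node's taints:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'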
So a node's disk is full: that node is tainted with disk-pressure and nothing can schedule there, while the other 8 nodes are filtered out by the prefilter, presumably because the pod's local-path PV has node affinity for the full node. This Redis is the default-installed one with no other consumers, so the simplest fix is to reschedule it from scratch. KubeSphere's Redis is managed by a Deployment (through its ReplicaSet) and mounts a PVC, so first scale it down to 0 replicas so the PVC is released:
[root@k3s-master-01 ~]# kubectl scale deploy redis --replicas=0 -n kubesphere-system
deployment.apps/redis scaled
Next, manually delete the now-useless PVC so that after scaling back up the pod won't land on that node again (the local-path PV behind it is bound to that specific node). I had previously found the PVC's name to be redis-pvc; if you're worried about getting it wrong, you can also check with describe before destroying the pod. The output is long so I won't paste it here, and on a default installation the name should be redis-pvc anyway; a shorter check is sketched below.
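For reference, here is that shorter check: a one-liner reading the claim name straight off the Deployment spec (assuming the default Deployment name redis):

kubectl -n kubesphere-system get deploy redis \
  -o jsonpath='{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}'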
[root@k3s-master-01 ~]# kubectl delete pvc redis-pvc -n kubesphere-system
persistentvolumeclaim "redis-pvc" deleted
Don't forget the PV. Find it first:
[root@k3s-master-01 ~]# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-cefb1990-dede-4205-b847-f070201f78b3 2Gi RWO Delete Released kubesphere-system/redis-pvc local-path <unset> 58d
Then delete it:
[root@k3s-master-01 ~]# kubectl delete pv pvc-cefb1990-dede-4205-b847-f070201f78b3
persistentvolume "pvc-cefb1990-dede-4205-b847-f070201f78b3" deleted
Now rebuild the PVC. First create a redis-pvc.yaml locally with the following content:
cat > redis-pvc.yaml << EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: redis-pvc
  namespace: kubesphere-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: local-path
EOF
Then recreate it:
[root@k3s-master-01 ~]# kubectl apply -f redis-pvc.yaml
persistentvolumeclaim/redis-pvc created
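Note that local-path's default WaitForFirstConsumer binding mode means the new PVC will sit in Pending until a pod actually mounts it, so it not being Bound yet is expected:

kubectl -n kubesphere-system get pvc redis-pvc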
Then scale the Redis Deployment back up:
[root@k3s-master-01 ~]# kubectl scale deploy redis --replicas=1 -n kubesphere-system
deployment.apps/redis scaled
The default KubeSphere installation runs a single-replica Redis, so there is no point in requesting more replicas than one.
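If you'd rather block than poll, rollout status waits until the new replica is ready:

kubectl -n kubesphere-system rollout status deploy/redis --timeout=120s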
Wait a little while; once the Redis replica is up, logging in to the console works again:
[root@k3s-master-01 ~]# kubectl get pod -A | grep redis
kubesphere-system redis-5596f89c8f-nlhz2 1/1 Running 0 41s
That's the fix. Since the Redis pod died because the node ran out of disk, I still need to update the setup to grow the disks, and ideally move KubeSphere's Redis dependency outside the cluster altogether, but that's for another day.
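As a stopgap before any disk expansion, pruning unused container images on the affected node can be enough to clear the disk-pressure taint. K3s bundles crictl, so something like the following, run on that node, should help (assuming a crictl new enough to have --prune):

k3s crictl rmi --prune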