1. Namespace cannot be deleted

1. Check whether finalizers are blocking the deletion
root@ubuntu:~# kubectl get ns nebula -o json | jq '.spec.finalizers'
[
  "kubernetes"
]

2. The nebula namespace is stuck in the Terminating state because its finalizer ("kubernetes") is blocking deletion. Finalizers are a Kubernetes mechanism for making sure a resource is cleaned up properly before it is removed, but they sometimes get stuck. You can strip the finalizers from the namespace object and submit it through the finalize subresource so the namespace can be deleted:
root@ubuntu:~# kubectl get ns nebula -o json | \
  jq 'del(.spec.finalizers)' | \
  kubectl replace --raw "/api/v1/namespaces/nebula/finalize" -f -
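
Before force-clearing the finalizer, it is worth checking what is still left in the namespace and whether an aggregated API is unreachable, since that is the usual reason the deletion hangs. A minimal check, assuming the namespace is still named nebula:

# List every namespaced resource type that still has objects in the namespace
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -n 1 kubectl get --show-kind --ignore-not-found -n nebula

# An unavailable aggregated API (e.g. a metrics or webhook service) also blocks deletion
kubectl get apiservices | grep False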

2. Evicted Pods

The blunt fix is to simply delete all of these failed Pods first:

kubectl get pods -A | grep Evicted
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed

Each Evicted Pod records the concrete reason in its Events, for example: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.

The message means the container was using 524036Ki (about 512Mi) of ephemeral storage while declaring no ephemeral-storage request, so Kubernetes assumes it "needs none". When the node runs short of ephemeral storage, containers without a declared request are evicted first. Declaring a request and limit avoids this:

# Goes under the container's resources in the workload spec
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"

The root cause is that the worker node's disk was nearly full; log on to that node and free up some space.
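
A few node-side commands commonly used to find and reclaim the space (a sketch assuming a containerd runtime with crictl available; adjust paths and tools to your environment):

# Usage of kubelet's data dir (emptyDir, logs) and the runtime's image/layer store
df -h /var/lib/kubelet /var/lib/containerd
du -sh /var/log/pods/* 2>/dev/null | sort -rh | head

# Remove container images that are no longer referenced
crictl rmi --prune

# Trim old systemd journal logs
journalctl --vacuum-size=200M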


Detection script

This can be run as a scheduled script; with a bit more work the results can be pushed out as notifications.

kubectl get pods -A --field-selector=status.phase=Failed -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeName}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}{end}' | \
grep Evicted | \
while IFS=$'\t' read -r namespace pod node reason message; do
  echo "容器: $namespace/$pod"
  echo "节点: $node"
  echo "详情: $message"
  echo ""
done


Pod: fssc/ems-base-application-5dbf9bd7-74tdp
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container ems-base-application was using 780640Ki, which exceeds its request of 0.

Pod: kubesphere-monitoring-system/notification-manager-deployment-798fdfc9b-fbbqr
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container notification-manager was using 16Ki, which exceeds its request of 0. Container tenant was using 754680Ki, which exceeds its request of 0.

Pod: zizhu/zzoms-service-d7754cbd7-9nwg9
Node: hybxvuca05
Details: The node was low on resource: ephemeral-storage. Container zzoms-service was using 524036Ki, which exceeds its request of 0.
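
To run the script on a schedule, save the loop above to a file and add a cron entry; the path and interval below are just an example:

# e.g. hourly: 0 * * * * /opt/scripts/check-evicted.sh >> /var/log/check-evicted.log 2>&1
crontab -l 2>/dev/null | { cat; echo '0 * * * * /opt/scripts/check-evicted.sh >> /var/log/check-evicted.log 2>&1'; } | crontab -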

3. Headless Service cannot be pinged

Hit a pitfall today and almost thought the cluster was broken: the Headless Service simply could not be reached. The backing Pods were running fine, yet names like kafka-0.kafka.default.svc.cluster.local would not resolve at all, failing with "bad address".

After comparing things carefully, it turned out to be the StatefulSet + Headless Service combination: the StatefulSet's serviceName must point at the headless Service that governs it. That name cannot be changed arbitrarily, nor can you create a separate Service with a different name and expect it to work; if serviceName does not match, the per-Pod DNS records are never created and the Pods cannot reach each other. A minimal sketch of a matching pair is shown below.
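
A minimal, illustrative sketch assuming a kafka StatefulSet in the default namespace (names and port are assumptions):

# Headless Service: clusterIP must be None and the name must equal the StatefulSet's serviceName
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: kafka
  namespace: default
spec:
  clusterIP: None            # headless: DNS returns the Pod IPs directly
  selector:
    app: kafka
  ports:
    - port: 9092
EOF

# The StatefulSet must reference the Service above:
#   spec:
#     serviceName: kafka

# Verify the per-Pod DNS record from inside the cluster
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup kafka-0.kafka.default.svc.cluster.local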

4. PV, PVC, and claim concepts

In Kubernetes, a PV is like a large apartment offered by a landlord (the storage resource), a PVC is the tenant's rental application (a request to use storage), and "claim" refers to that act of requesting. A PVC is matched against a suitable PV based on its requirements; once the two are bound, the tenant (the Pod) gets exclusive use of that piece of storage. A minimal example of the claim side follows.
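
An illustrative sketch: a PVC asking for 1Gi, relying on the cluster's default StorageClass or a pre-created matching PV (the names here are assumptions):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF

# A Pod then consumes the bound storage by referencing the claim:
#   volumes:
#     - name: data
#       persistentVolumeClaim:
#         claimName: data-claim

# STATUS should turn to Bound once a matching PV is found
kubectl get pvc data-claim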

5. Get details for all Pending Pods

kubectl get pods --all-namespaces --field-selector=status.phase=Pending | \
  awk 'NR>1 {print $1, $2}' | \
  xargs -n2 -I{} sh -c 'echo Namespace: $(echo {} | cut -d" " -f1), Pod: $(echo {} | cut -d" " -f2); kubectl describe pod -n $(echo {} | cut -d" " -f1) $(echo {} | cut -d" " -f2) | grep -A10 -E "Events|Warning|Failed|Error|FailedScheduling"; echo "--------------------------"'


Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-0
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-0"
--------------------------
Namespace: kubesphere-monitoring-system, Pod: prometheus-k8s-1
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: failed to check provisioning pvc: could not find v1.PersistentVolumeClaim "kubesphere-monitoring-system/prometheus-k8s-db-prometheus-k8s-1"
--------------------------
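
An alternative that avoids describing every Pod is to query the scheduling events directly; events are only retained for a limited time (one hour by default), so this only covers recent failures:

kubectl get events -A --field-selector reason=FailedScheduling --sort-by=.lastTimestamp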

6. endpoints

When a Service is defined without a selector, Kubernetes associates it with the Endpoints object of the same name in the same namespace, which lets you bind external addresses to the Service manually.

[root@hybxvuka01 harbor-svc]# cat harbor-endpoints.yaml 
apiVersion: v1
kind: Service
metadata:
  name: harbor
  namespace: bx
spec:
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: harbor
  namespace: bx
subsets:
  - addresses:
      - ip: 172.31.0.99
    ports:
      - port: 80
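
To confirm the manual binding works after applying the manifest (the curl image below is just an example):

kubectl apply -f harbor-endpoints.yaml
kubectl get endpoints harbor -n bx        # should list 172.31.0.99:80
kubectl run -i --rm curl-test -n bx --image=curlimages/curl --restart=Never \
  --command -- curl -sI http://harbor.bx.svc.cluster.local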

7. Certificate renewal

# Check certificate expiry
[root@hybxvpka01 ~]# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Nov 19, 2026 09:51 UTC   357d            ca                      no      
apiserver                  Nov 19, 2026 09:51 UTC   357d            ca                      no      
apiserver-etcd-client      Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
apiserver-kubelet-client   Nov 19, 2026 09:51 UTC   357d            ca                      no      
controller-manager.conf    Nov 19, 2026 09:51 UTC   357d            ca                      no      
etcd-healthcheck-client    Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
etcd-peer                  Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
etcd-server                Nov 19, 2026 09:51 UTC   357d            etcd-ca                 no      
front-proxy-client         Nov 19, 2026 09:51 UTC   357d            front-proxy-ca          no      
scheduler.conf             Nov 19, 2026 09:51 UTC   357d            ca                      no      
super-admin.conf           Nov 19, 2026 09:51 UTC   357d            ca                      no      

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Nov 17, 2035 09:51 UTC   9y              no      
etcd-ca                 Nov 17, 2035 09:51 UTC   9y              no      
front-proxy-ca          Nov 17, 2035 09:51 UTC   9y              no      



# Start the renewal; if you have multiple masters, run this on every control-plane node
[root@hybxvpka01 ~]# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
certificate embedded in the kubeconfig file for the super-admin renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

# Restart kubelet on every master
[root@hybxvpka01 ~]# systemctl restart kubelet
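
Note that restarting kubelet alone does not necessarily recreate the running static Pods, while the kubeadm output above explicitly asks for kube-apiserver, kube-controller-manager, kube-scheduler and etcd to be restarted. One common approach on a standard kubeadm layout (a sketch, adjust paths to your setup) is to move the static Pod manifests away and back so kubelet restarts them, and to refresh the local kubeconfig since admin.conf was renewed:

# Move the static Pod manifests out so kubelet stops the control-plane Pods...
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 30
# ...then move them back so the Pods come up with the renewed certificates
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/

# admin.conf was renewed as well; refresh the kubeconfig used by kubectl
cp /etc/kubernetes/admin.conf ~/.kube/config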

8. Get the IP pool

# Calico: list the IP pools
kubectl get ippools.crd.projectcalico.org

# Inspect a specific pool
kubectl describe ippools.crd.projectcalico.org default-ipv4-ippool
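
Two complementary read-only checks (the field paths below assume the Calico IPPool CRD and standard Node objects):

# Just the pool CIDR
kubectl get ippools.crd.projectcalico.org default-ipv4-ippool -o jsonpath='{.spec.cidr}{"\n"}'

# Pod CIDR allocated to each node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'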

9. Automatically attaching imagePullSecrets via ServiceAccounts

In some environments, for security reasons, Harbor forbids making any registry project public. But a Kubernetes cluster often has a large number of Namespaces, and manually configuring imagePullSecrets on every Deployment is not realistic: both the maintenance cost and the chance of mistakes are high.

A better approach is to rely on the ServiceAccount mechanism: Pods that run under a ServiceAccount automatically inherit its imagePullSecrets. A script can loop over a given list of Namespaces, read the ServiceAccounts that already exist in each, and bind the pull secret to all of them, so the credential never has to be repeated in every Deployment.

#!/bin/bash
# File: patch-harbor-sa-selective.sh
# Usage: run ./patch-harbor-sa-selective.sh directly
# Adjust the variables below to control its behaviour

# ==================== variables to edit ====================
# Space-separated list of namespaces (only these are processed)
NAMESPACES="test kube-system default"

# Namespace that holds the source secret (usually default)
SOURCE_SECRET_NS="default"

# Name of the secret (change it to whatever you use)
SECRET_NAME="harbor-dev"
# ======================================================

echo "开始处理指定的 namespace,为其所有 ServiceAccount 添加 imagePullSecrets: $SECRET_NAME"
echo "处理的 namespace: $NAMESPACES"
echo "源 secret 来自: $SOURCE_SECRET_NS/$SECRET_NAME"
echo "=================================================="

for ns in $NAMESPACES; do
  echo "=== 处理 namespace: $ns ==="

  # 1. Copy the secret if the target namespace does not have it yet
  if ! kubectl get secret "$SECRET_NAME" -n "$ns" >/dev/null 2>&1; then
    echo "  → secret $SECRET_NAME 不存在,正在从 $SOURCE_SECRET_NS 复制..."
    kubectl get secret "$SECRET_NAME" -n "$SOURCE_SECRET_NS" -o yaml | \
      sed "s/namespace: $SOURCE_SECRET_NS/namespace: $ns/" | \
      kubectl apply -n "$ns" -f -
  else
    echo "  → secret $SECRET_NAME 已存在,跳过复制"
  fi

  # 2. Collect every ServiceAccount that actually exists in the namespace
  sas=$(kubectl get serviceaccount -n "$ns" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null)

  if [ -z "$sas" ]; then
    echo "  警告:namespace $ns 中没有 ServiceAccount,跳过 patch"
    continue
  fi

  echo "  发现 ServiceAccount: $sas"

  # 3. Set imagePullSecrets on each ServiceAccount (note: a merge patch replaces the whole list)
  for sa in $sas; do
    echo "  → Patching ServiceAccount: $sa"
    kubectl patch serviceaccount "$sa" -n "$ns" \
      --type=merge \
      -p "{\"imagePullSecrets\": [{\"name\": \"$SECRET_NAME\"}]}" || \
      echo "    警告:patch $sa 失败(可能权限不足)"
  done
done

echo "=================================================="
echo "全部完成!指定 namespace 中的所有 ServiceAccount 已添加 $SECRET_NAME 的 imagePullSecrets。"