Hi, were you ever able to fix this? I have the same problem in my AKS cluster. I tested some node downtime and failover scenarios.
After a node comes back up / recovers, the portworx-api and cluster pods on this node are in CrashLoopBackOff state.
I think the cluster pod is waiting for the api pod, which in turn is waiting for the px-csi-ext pod. It's trying to connect via a Unix socket, but the problem is that the third px-csi-ext pod has already been scheduled on another node. It's running fine there, but now my 3-node cluster has two CSI pods on one node. If I delete the surplus pod on that node and it gets scheduled on the recovered node, everything works fine.
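To make the workaround concrete, here is a minimal Python sketch of how the surplus pod could be picked out from a pod-to-node listing (for example, copied from `kubectl get pods -o wide`). The pod and node names in the example are hypothetical placeholders, not names from my cluster.

```python
from collections import defaultdict


def surplus_pods(placements):
    """Return every pod beyond the first one found on each node.

    placements: iterable of (pod_name, node_name) pairs.
    """
    by_node = defaultdict(list)
    for pod, node in placements:
        by_node[node].append(pod)

    surplus = []
    for pods in by_node.values():
        # Keep the first pod on each node, flag the rest as surplus.
        surplus.extend(pods[1:])
    return surplus


if __name__ == "__main__":
    # Hypothetical placement after a failover: two CSI pods share node-2.
    placements = [
        ("px-csi-ext-a", "node-1"),
        ("px-csi-ext-b", "node-2"),
        ("px-csi-ext-c", "node-2"),
    ]
    print(surplus_pods(placements))
```

Deleting the pod this returns is the manual step I described; whether the replacement then lands on the recovered node is up to the scheduler.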
It could be fixed if the operator created podAntiAffinity rules, or better, topologySpreadConstraints on the CSI pods.
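As a rough illustration of what I mean, this Python sketch builds a patch body that adds a per-node spread constraint to a Deployment's pod template. The `app: px-csi-driver` label is an assumption on my side (check the actual labels on your CSI pods), and `spread_patch` is just a hypothetical helper name.

```python
import json


def spread_patch(app_label, max_skew=1):
    """Build a patch body that spreads matching pods across nodes.

    app_label is the value of the pods' "app" label (an assumption here).
    """
    constraint = {
        "maxSkew": max_skew,
        # One topology domain per node.
        "topologyKey": "kubernetes.io/hostname",
        # Leave a pod Pending instead of stacking it on a busy node.
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }
    return {
        "spec": {
            "template": {
                "spec": {"topologySpreadConstraints": [constraint]}
            }
        }
    }


if __name__ == "__main__":
    # Assumed label value; verify it against the running px-csi-ext pods.
    print(json.dumps(spread_patch("px-csi-driver"), indent=2))
```

The printed JSON could be passed to `kubectl patch deployment px-csi-ext` with `--type merge`, but the operator may well reconcile the Deployment and revert a manual change, so this only shows the shape of the constraint I would like the operator to set itself.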
Here are some logs:
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:06Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:08.851Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:08Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 10 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:14.911152 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:15.244209 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:15.559750 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:16Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:16.352Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:16Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 11 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:23.853Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:23Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 12 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:24.910456 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:25.243267 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:25.559777 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:26Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:31.354Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:31Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 13 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:34.910692 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v E0430 09:32:34.910764 1 csi-provisioner.go:215] context deadline exceeded
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: cri-containerd-05115797d210396a85851a4b9acb7159154d16b61b217fde19f976f6eed0612e.scope: Deactivated successfully.
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-05115797d210396a85851a4b9acb7159154d16b61b217fde19f976f6eed0612e-rootfs.mount: Deactivated successfully.
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:35.243328 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-csi-ext-664b9dbf76-b2v8v E0430 09:32:35.243462 1 main.go:175] error connecting to CSI driver: context deadline exceeded
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: cri-containerd-a4df84f9ff04d667e9f190ab571fa782392a7b054db9148628754d5f77ba5c1a.scope: Deactivated successfully.
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-a4df84f9ff04d667e9f190ab571fa782392a7b054db9148628754d5f77ba5c1a-rootfs.mount: Deactivated successfully.
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:35.559625 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-csi-ext-664b9dbf76-b2v8v E0430 09:32:35.559940 1 main.go:153] "Failed to create CSI client" err="failed to connect to CSI driver: context deadline exceeded"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: cri-containerd-daf38ab9e9dafffec341accff538d1ff94328654d45ba62d74a393f5bc1207d4.scope: Deactivated successfully.
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-daf38ab9e9dafffec341accff538d1ff94328654d45ba62d74a393f5bc1207d4-rootfs.mount: Deactivated successfully.
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:35.745422 1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:35.745501 1 feature_gate.go:249] feature gates: &{map[Topology:true]}
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:35.745533 1 csi-provisioner.go:154] Version: v3.6.1
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:35.745538 1 csi-provisioner.go:177] Building kube configs for running in cluster...
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: Started libcontainer container 7ae43a0ef5b5341660740defe1635dcc8793ebfb01c96cd80b333d1effb8ec26.
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:36.056164 1 main.go:108] Version: v8.1.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: Started libcontainer container fdadd0271e23b7193c02ae4a0c6f93556b1ab566a8b87be77536a18103313f17.
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:36Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:36.817893 1 main.go:108] "Version" version="v1.12.0"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:36.817962 1 feature_gate.go:387] feature gates: {map[]}
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 systemd[1]: Started libcontainer container 32ef94a3a4fafed92faea2d4d69e80d50377023798e5b21091eb3e9ea2414ccc.
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:38.856Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:38Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 14 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:45.746820 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:46.058352 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:46.357Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:46Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 15 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:46Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:46.819547 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:32:53.860Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:32:53Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 16 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:32:55.746354 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:56.057359 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj time="2025-04-30T09:32:56Z" level=warning msg="Could not retrieve PX node status" error="Get \"http://127.0.0.1:9001/v1/cluster/nodehealth\": dial tcp 127.0.0.1:9001: connect: connection refused"
px-csi-ext-664b9dbf76-b2v8v I0430 09:32:56.819758 1 connection.go:253] "Still connecting" address="unix:///csi/csi.sock"
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: {"level":"warn","ts":"2025-04-30T09:33:01.361Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-5a2e911b-5e51-4538-a972-e6a858285c54/10.224.0
px-cnpg-cluster-1-tkpvj @aks-portworx-39265852-vmss000004 portworx[862]: time="2025-04-30T09:33:01Z" level=error msg="[set: testConnection] kvdb error: rpc error: code = Unknown desc = context deadline exceeded, retry count 17 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
px-csi-ext-664b9dbf76-b2v8v W0430 09:33:05.746475 1 connection.go:183] Still connecting to unix:///csi/csi.sock
px-csi-ext-664b9dbf76-b2v8v E0430 09:33:05.746541 1 csi-provisioner.go:215] context deadline exceeded