Hello,
I’ve been trying for several days to install Portworx into our k8s cluster. I first started with Portworx 3.1.6 without success.
Then I decided to try the Portworx 3.2 release, but I’m still unable to get a successful setup.
In my opinion, the Portworx documentation is a bit light and opaque when it comes to debugging this kind of problem.
Status of deployment
- 1 Portworx node OK, with the internal kvdb installed
- 2 other Portworx nodes refuse to join the kvdb cluster
k8s resources
$ kubectl -n portworx get storageclusters.core.libopenstorage.org px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
NAME CLUSTER UUID STATUS VERSION AGE
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526 c9549e55-bab2-4ce4-a3fc-d7e811282c51 Initializing 3.2.0 41m
$ kubectl -n portworx get storageclusters.core.libopenstorage.org px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526 -oyaml
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
annotations:
portworx.io/install-source: https://install.portworx.com/?operator=true&mc=false&kbver=1.29.6&ns=portworx&oem=esse&user=7f9c9026-63df-4998-b4b3-b0229e7cd11a&b=true&iop=6&f=true&m=ens192&d=ens224&c=px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526&stork=false&csi=true&tel=false&st=k8s&reg=dockerh.bbaa1.sbm.admin%2Fportworx&rsec=px-image-repository&pp=Always
portworx.io/misc-args: --oem esse
portworx.io/preflight-check: "false"
creationTimestamp: "2024-11-18T13:47:39Z"
finalizers:
- operator.libopenstorage.org/delete
generation: 3
labels:
kustomize.toolkit.fluxcd.io/name: tstt9-portworx
kustomize.toolkit.fluxcd.io/namespace: flux-system
name: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
namespace: portworx
resourceVersion: "52830717"
uid: 165c47c5-e17a-4ac1-87f2-8106ccb010eb
spec:
autopilot:
enabled: false
csi:
enabled: true
installSnapshotController: true
topology:
enabled: true
customImageRegistry: <my_private_registry>/portworx
deleteStrategy:
type: UninstallAndWipe
env:
- name: PURE_FLASHARRAY_SAN_TYPE
value: ISCSI
image: portworx/oci-monitor:3.2.0
imagePullPolicy: Always
imagePullSecret: px-image-repository
kvdb:
internal: true
monitoring:
telemetry:
enabled: false
network:
dataInterface: ens224
mgmtInterface: ens192
placement:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: px/enabled
operator: NotIn
values:
- "false"
- key: node-role.kubernetes.io/master
operator: DoesNotExist
- key: node-role.kubernetes.io/control-plane
operator: DoesNotExist
- matchExpressions:
- key: px/enabled
operator: NotIn
values:
- "false"
- key: node-role.kubernetes.io/master
operator: Exists
- key: node-role.kubernetes.io/worker
operator: Exists
- matchExpressions:
- key: px/enabled
operator: NotIn
values:
- "false"
- key: node-role.kubernetes.io/control-plane
operator: Exists
- key: node-role.kubernetes.io/worker
operator: Exists
revisionHistoryLimit: 10
runtimeOptions:
default-io-profile: "6"
secretsProvider: k8s
startPort: 9001
storage:
forceUseDisks: true
useAllWithPartitions: true
stork:
enabled: false
updateStrategy:
rollingUpdate:
maxUnavailable: 20%
type: RollingUpdate
version: 3.2.0
status:
clusterName: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
clusterUid: c9549e55-bab2-4ce4-a3fc-d7e811282c51
conditions:
- lastTransitionTime: "2024-11-18T13:49:48Z"
message: Portworx installation completed on 1/3 nodes, 2 nodes remaining
source: Portworx
status: InProgress
type: Install
- lastTransitionTime: "2024-11-18T13:49:11Z"
source: Portworx
status: Online
type: RuntimeState
desiredImages:
csiNodeDriverRegistrar: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.12.0
csiProvisioner: registry.k8s.io/sig-storage/csi-provisioner:v3.6.1
csiResizer: registry.k8s.io/sig-storage/csi-resizer:v1.12.0
csiSnapshotController: registry.k8s.io/sig-storage/snapshot-controller:v8.1.0
csiSnapshotter: registry.k8s.io/sig-storage/csi-snapshotter:v8.1.0
dynamicPlugin: portworx/portworx-dynamic-plugin:1.1.1
dynamicPluginProxy: nginxinc/nginx-unprivileged:1.25
kubeControllerManager: registry.k8s.io/kube-controller-manager-amd64:v1.29.6
pause: registry.k8s.io/pause:3.1
phase: Initializing
storage: {}
version: 3.2.0
$ kubectl -n portworx get pods -owide -l name=portworx
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-5fjxr 0/1 Running 2 (13m ago) 43m 10.128.41.212 tstt9-d1-co-k8sw2 <none> <none>
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph 0/1 Running 2 (13m ago) 43m 10.128.41.213 tstt9-d1-co-k8sw3 <none> <none>
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 1/1 Running 0 43m 10.128.41.211 tstt9-d1-co-k8sw1 <none> <none>
pxctl status
- node1 (tstt9-d1-co-k8sw1)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 -- /usr/local/bin/pxctl status
Status: PX is operational
Telemetry: Disabled or Unhealthy
Metering: Healthy
License: Portworx CSI for FA/FB (lease renewal in 23h, 31m)
Node ID: 2c1c67b3-4f64-4bfc-8653-ae7451c4f068
IP: 10.128.41.211
Local Storage Pool: 1 pool
POOL IO_PRIORITY RAID_LEVEL USABLE USED STATUS ZONE REGION
0 HIGH raid0 50 GiB 4.4 GiB Online casino monaco
Local Storage Devices: 1 device
Device Path Media Type Size Last-Scan
0:1 /dev/sdb STORAGE_MEDIUM_SSD 50 GiB 18 Nov 24 13:49 UTC
* Internal kvdb on this node is sharing this storage device /dev/sdb to store its data.
total - 50 GiB
Cache Devices:
* No cache devices
Cluster Summary
Cluster ID: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
Cluster UUID: c9549e55-bab2-4ce4-a3fc-d7e811282c51
Scheduler: kubernetes
Total Nodes: 1 node(s) with storage (1 online)
IP ID SchedulerNodeName Auth StorageNode Used Capacity Status StorageStatus Version Kernel OS
10.128.67.21 2c1c67b3-4f64-4bfc-8653-ae7451c4f068 tstt9-d1-co-k8sw1 Disabled Yes 4.4 GiB 50 GiB Online Up (This node) 3.2.0.0-2ded0fe 5.14.0-427.40.1.el9_4.x86_64 Rocky Linux 9.4 (Blue Onyx)
Warnings:
WARNING: Internal Kvdb is not using dedicated drive on nodes [10.128.67.21]. This configuration is not recommended for production clusters.
Global Storage Pool
Total Used : 4.4 GiB
Total Capacity : 50 GiB
Collected at: 2024-11-18 14:17:38 UTC
- node2 (tstt9-d1-co-k8sw2)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-5fjxr -- /usr/local/bin/pxctl status
ERRO[0000] Failed to get /status: Local storage spec not initialized
KVDB is unreachable or misconfigured: kvdb instance not initialized
List of last known failures:
Type ID Resource Severity Count LastSeen FirstSeen Description
NODE ClusterManagerFailure d8877676-7142-44e5-bad2-fd62e9beebec ALARM 1 Nov 18 14:16:14 UTC 2024 Nov 18 14:16:14 UTC 2024 Failed to start cluster manager on node [10.128.67.22]: Storage initialization failed
NODE NodeInitFailure d8877676-7142-44e5-bad2-fd62e9beebec ALARM 1 Nov 18 14:16:14 UTC 2024 Nov 18 14:16:14 UTC 2024 Failed to create kvdb spec on the node 10.128.67.22: rpc error: code = Unavailable desc = transport is closing
Collected at: 2024-11-18 14:18:02 UTC
command terminated with exit code 1
- node3 (tstt9-d1-co-k8sw3)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph -- /usr/local/bin/pxctl status
ERRO[0000] Failed to get /status: Local storage spec not initialized
KVDB is unreachable or misconfigured: kvdb instance not initialized
List of last known failures:
Type ID Resource Severity Count LastSeen FirstSeen Description
NODE ClusterManagerFailure 798ac086-be3d-496e-83c1-fa733e5c2174 ALARM 1 Nov 18 14:15:40 UTC 2024 Nov 18 14:15:40 UTC 2024 Failed to start cluster manager on node [10.128.67.23]: Storage initialization failed
NODE NodeInitFailure 798ac086-be3d-496e-83c1-fa733e5c2174 ALARM 1 Nov 18 14:15:40 UTC 2024 Nov 18 14:15:40 UTC 2024 Failed to create kvdb spec on the node 10.128.67.23: rpc error: code = Unavailable desc = transport is closing
Collected at: 2024-11-18 14:18:11 UTC
command terminated with exit code 1
Context:
- k8s cluster v1.29.6
- Rocky Linux release 9.4 (Blue Onyx) nodes
- 5.14.0-427.40.1.el9_4.x86_64 kernel
- Portworx v3.2
- Air-gapped cluster deployed on-premises on VMs, with a private registry for the Portworx images (works well)
- We only need the CSI drivers for Pure Storage FlashArray; we do not need the advanced storage cluster deployed by Portworx.
- /dev/sdb is the dedicated disk intended for the kvdb (see the spec snippet right after this list)
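As a side note on that last point: pxctl status on node1 warns that the internal kvdb is sharing /dev/sdb with the storage pool, so I’m not sure my spec actually declares the dedicated kvdb drive. A minimal sketch of what I think the StorageCluster spec would need (the kvdbDevice field name is my understanding of the operator CRD, not something I have validated on 3.2):
spec:
  storage:
    # assumption: kvdbDevice is the field that pins the internal kvdb to a dedicated device
    kvdbDevice: /dev/sdb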
Documentation used:
Possible cause of the problem
I’ve noticed in the PX cluster logs that node2 and node3 continuously try to reach the etcd endpoint on node1, without success. See the errors in the logs below.
I’ve checked network communication between the etcd nodes successfully.
TCP port 9011 is reachable from each node.
So I don’t think it’s a network communication problem.
The etcd cluster can’t initialize, and as a consequence the Portworx cluster can’t work.
Logs from node3 (same on node2):
$ kubectl -n portworx logs px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:15.313745Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:15Z" level=error msg="[enumerate: pwx/px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526/storage/spec/] kvdb error: context deadline exceeded, retry count 50 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:15.55621Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:15Z" level=error msg="[set: lic_temp] kvdb error: context deadline exceeded, retry count 0 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079404Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079471Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:18Z" level=error msg="[set: lic_temp] kvdb error: rpc error: code = Unavailable desc = transport is closing, retry count 1 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079508Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
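One thing I still want to check from the healthy node is whether the internal kvdb on node1 actually formed a single-member cluster and is accepting client connections. Something like the following (a sketch; I have not captured this output yet, and I’m assuming the pxctl service kvdb subcommands are available for the internal kvdb):
# run against the pod on node1 (tstt9-d1-co-k8sw1)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 -- /usr/local/bin/pxctl service kvdb members
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 -- /usr/local/bin/pxctl service kvdb endpoints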
Listening TCP ports on each node
- node1
# netstat -nap | grep -w LISTEN |grep px
tcp 0 0 10.128.67.21:9018 0.0.0.0:* LISTEN 1559004/px-etcd
tcp 0 0 10.128.67.21:9002 0.0.0.0:* LISTEN 1559760/px
tcp 0 0 10.128.67.21:9003 0.0.0.0:* LISTEN 1559730/px-storage
tcp 0 0 127.0.0.1:9015 0.0.0.0:* LISTEN 1557157/px-oci-mon
tcp 0 0 0.0.0.0:9004 0.0.0.0:* LISTEN 1559032/px-ns
tcp6 0 0 :::9001 :::* LISTEN 1559760/px
tcp6 0 0 :::9009 :::* LISTEN 1559730/px-storage
tcp6 0 0 :::9008 :::* LISTEN 1559760/px
tcp6 0 0 :::9011 :::* LISTEN 1559004/px-etcd
tcp6 0 0 :::9013 :::* LISTEN 1559032/px-ns
tcp6 0 0 :::9012 :::* LISTEN 1559760/px
tcp6 0 0 :::9014 :::* LISTEN 1559016/px-diag
tcp6 0 0 :::9017 :::* LISTEN 1559760/px
tcp6 0 0 :::9019 :::* LISTEN 1559004/px-etcd
tcp6 0 0 :::9021 :::* LISTEN 1559760/px
tcp6 0 0 :::9020 :::* LISTEN 1559760/px
tcp6 0 0 :::9022 :::* LISTEN 1559025/px-healthmo
- node2
# netstat -nap | grep -w LISTEN |grep px
tcp 0 0 0.0.0.0:9004 0.0.0.0:* LISTEN 2530366/px-ns
tcp 0 0 127.0.0.1:9015 0.0.0.0:* LISTEN 2561485/px-oci-mon
tcp6 0 0 :::9009 :::* LISTEN 2550167/px-storage
tcp6 0 0 :::9011 :::* LISTEN 2530334/px-etcd
tcp6 0 0 :::9013 :::* LISTEN 2530366/px-ns
tcp6 0 0 :::9014 :::* LISTEN 2530352/px-diag
tcp6 0 0 :::9017 :::* LISTEN 2550181/px
tcp6 0 0 :::9020 :::* LISTEN 2550181/px
tcp6 0 0 :::9021 :::* LISTEN 2550181/px
tcp6 0 0 :::9022 :::* LISTEN 2530360/px-healthmo
tcp6 0 0 :::9001 :::* LISTEN 2550181/px
- node3
# netstat -nap | grep -w LISTEN |grep px
tcp 0 0 0.0.0.0:9004 0.0.0.0:* LISTEN 1489630/px-ns
tcp 0 0 127.0.0.1:9015 0.0.0.0:* LISTEN 1514328/px-oci-mon
tcp6 0 0 :::9001 :::* LISTEN 1505656/px
tcp6 0 0 :::9009 :::* LISTEN 1505643/px-storage
tcp6 0 0 :::9011 :::* LISTEN 1489592/px-etcd
tcp6 0 0 :::9013 :::* LISTEN 1489630/px-ns
tcp6 0 0 :::9014 :::* LISTEN 1489615/px-diag
tcp6 0 0 :::9017 :::* LISTEN 1505656/px
tcp6 0 0 :::9020 :::* LISTEN 1505656/px
tcp6 0 0 :::9021 :::* LISTEN 1505656/px
tcp6 0 0 :::9022 :::* LISTEN 1489622/px-healthmo
kvdb connectivity check from each node
Some TCP ports (9018 & 9019) are not open on node2 and node3. I think this is normal since the etcd cluster is not up yet. TCP port 9018 is used for etcd cluster data transit, and 9019 is used for the etcd service.
I don’t know what TCP port 9011 is used for.
TCP port 9011 is open and reachable from all nodes to all nodes (see also the nc sketch after the curl output below).
- node1
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
- node2
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
- node3
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
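Since ports 9018/9019 speak gRPC rather than plain HTTP, the curl output above is a bit ambiguous, so a raw TCP check may be clearer. A sketch of what I have in mind (assuming nc is installed on the hosts), to be run from node2 and node3 toward node1:
# plain TCP reachability check toward the kvdb ports on node1
for port in 9011 9018 9019; do
  nc -zv -w 3 10.128.67.21 $port    # -z: connect only, -w 3: 3-second timeout
done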
I would appreciate your help solving this problem. Portworx is mandatory for us to consume our Pure Storage arrays from our k8s cluster.
We also use NetApp Trident for our NetApp storage arrays, and its setup and installation is done in 30 minutes. I do not understand why it is so complicated and opaque with Portworx.
Regards,
Jax