Unable to install Portworx 3.2

Hello,

I have been trying to install Portworx into our Kubernetes cluster for several days. I first started with Portworx 3.1.6 without success.
I then moved to the Portworx 3.2 release, but I am still unable to get a successful setup.

In my opinion, the Portworx documentation is a bit light and opaque when it comes to debugging this kind of problem.

Status of deployment

  • 1 Portworx node OK, with the internal kvdb installed
  • 2 other Portworx nodes refuse to join the kvdb cluster

k8s resources

$ kubectl -n portworx get storageclusters.core.libopenstorage.org px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526 
NAME                                              CLUSTER UUID                           STATUS         VERSION   AGE
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526   c9549e55-bab2-4ce4-a3fc-d7e811282c51   Initializing   3.2.0     41m


$ kubectl -n portworx get storageclusters.core.libopenstorage.org px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526 -oyaml
apiVersion: core.libopenstorage.org/v1
kind: StorageCluster
metadata:
  annotations:
    portworx.io/install-source: https://install.portworx.com/?operator=true&mc=false&kbver=1.29.6&ns=portworx&oem=esse&user=7f9c9026-63df-4998-b4b3-b0229e7cd11a&b=true&iop=6&f=true&m=ens192&d=ens224&c=px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526&stork=false&csi=true&tel=false&st=k8s&reg=dockerh.bbaa1.sbm.admin%2Fportworx&rsec=px-image-repository&pp=Always
    portworx.io/misc-args: --oem esse
    portworx.io/preflight-check: "false"
  creationTimestamp: "2024-11-18T13:47:39Z"
  finalizers:
  - operator.libopenstorage.org/delete
  generation: 3
  labels:
    kustomize.toolkit.fluxcd.io/name: tstt9-portworx
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
  namespace: portworx
  resourceVersion: "52830717"
  uid: 165c47c5-e17a-4ac1-87f2-8106ccb010eb
spec:
  autopilot:
    enabled: false
  csi:
    enabled: true
    installSnapshotController: true
    topology:
      enabled: true
  customImageRegistry: <my_private_registry>/portworx
  deleteStrategy:
    type: UninstallAndWipe
  env:
  - name: PURE_FLASHARRAY_SAN_TYPE
    value: ISCSI
  image: portworx/oci-monitor:3.2.0
  imagePullPolicy: Always
  imagePullSecret: px-image-repository
  kvdb:
    internal: true
  monitoring:
    telemetry:
      enabled: false
  network:
    dataInterface: ens224
    mgmtInterface: ens192
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: px/enabled
            operator: NotIn
            values:
            - "false"
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
        - matchExpressions:
          - key: px/enabled
            operator: NotIn
            values:
            - "false"
          - key: node-role.kubernetes.io/master
            operator: Exists
          - key: node-role.kubernetes.io/worker
            operator: Exists
        - matchExpressions:
          - key: px/enabled
            operator: NotIn
            values:
            - "false"
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
          - key: node-role.kubernetes.io/worker
            operator: Exists
  revisionHistoryLimit: 10
  runtimeOptions:
    default-io-profile: "6"
  secretsProvider: k8s
  startPort: 9001
  storage:
    forceUseDisks: true
    useAllWithPartitions: true
  stork:
    enabled: false
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 20%
    type: RollingUpdate
  version: 3.2.0
status:
  clusterName: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
  clusterUid: c9549e55-bab2-4ce4-a3fc-d7e811282c51
  conditions:
  - lastTransitionTime: "2024-11-18T13:49:48Z"
    message: Portworx installation completed on 1/3 nodes, 2 nodes remaining
    source: Portworx
    status: InProgress
    type: Install
  - lastTransitionTime: "2024-11-18T13:49:11Z"
    source: Portworx
    status: Online
    type: RuntimeState
  desiredImages:
    csiNodeDriverRegistrar: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.12.0
    csiProvisioner: registry.k8s.io/sig-storage/csi-provisioner:v3.6.1
    csiResizer: registry.k8s.io/sig-storage/csi-resizer:v1.12.0
    csiSnapshotController: registry.k8s.io/sig-storage/snapshot-controller:v8.1.0
    csiSnapshotter: registry.k8s.io/sig-storage/csi-snapshotter:v8.1.0
    dynamicPlugin: portworx/portworx-dynamic-plugin:1.1.1
    dynamicPluginProxy: nginxinc/nginx-unprivileged:1.25
    kubeControllerManager: registry.k8s.io/kube-controller-manager-amd64:v1.29.6
    pause: registry.k8s.io/pause:3.1
  phase: Initializing
  storage: {}
  version: 3.2.0


$ kubectl -n portworx get pods -owide -l name=portworx
NAME                                                    READY   STATUS    RESTARTS      AGE   IP              NODE                NOMINATED NODE   READINESS GATES
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-5fjxr   0/1     Running   2 (13m ago)   43m   10.128.41.212   tstt9-d1-co-k8sw2   <none>           <none>
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph   0/1     Running   2 (13m ago)   43m   10.128.41.213   tstt9-d1-co-k8sw3   <none>           <none>
px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8   1/1     Running   0             43m   10.128.41.211   tstt9-d1-co-k8sw1   <none>           <none>

pxctl status

  • node1 (tstt9-d1-co-k8sw1)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 -- /usr/local/bin/pxctl status
Status: PX is operational
Telemetry: Disabled or Unhealthy
Metering: Healthy
License: Portworx CSI for FA/FB (lease renewal in 23h, 31m)
Node ID: 2c1c67b3-4f64-4bfc-8653-ae7451c4f068
    IP: 10.128.41.211 
     Local Storage Pool: 1 pool
    POOL    IO_PRIORITY    RAID_LEVEL    USABLE    USED    STATUS    ZONE    REGION
    0    HIGH        raid0        50 GiB    4.4 GiB    Online    casino    monaco
    Local Storage Devices: 1 device
    Device    Path        Media Type        Size        Last-Scan
    0:1    /dev/sdb    STORAGE_MEDIUM_SSD    50 GiB        18 Nov 24 13:49 UTC
    * Internal kvdb on this node is sharing this storage device /dev/sdb  to store its data.
    total        -    50 GiB
    Cache Devices:
     * No cache devices
Cluster Summary
    Cluster ID: px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526
    Cluster UUID: c9549e55-bab2-4ce4-a3fc-d7e811282c51
    Scheduler: kubernetes
    Total Nodes: 1 node(s) with storage (1 online)
    IP        ID                    SchedulerNodeName    Auth        StorageNode    Used    Capacity    Status    StorageStatus    Version        Kernel                OS
    10.128.67.21    2c1c67b3-4f64-4bfc-8653-ae7451c4f068    tstt9-d1-co-k8sw1    Disabled    Yes        4.4 GiB    50 GiB        Online    Up (This node)    3.2.0.0-2ded0fe    5.14.0-427.40.1.el9_4.x86_64    Rocky Linux 9.4 (Blue Onyx)
    Warnings: 
         WARNING: Internal Kvdb is not using dedicated drive on nodes [10.128.67.21]. This configuration is not recommended for production clusters.
Global Storage Pool
    Total Used        :  4.4 GiB
    Total Capacity    :  50 GiB
Collected at: 2024-11-18 14:17:38 UTC
  • node2 (tstt9-d1-co-k8sw2)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-5fjxr -- /usr/local/bin/pxctl status
ERRO[0000] Failed to get /status: Local storage spec not initialized 
KVDB is unreachable or misconfigured: kvdb instance not initialized


List of last known failures:

Type    ID            Resource                Severity    Count    LastSeen            FirstSeen            Description                        
NODE    ClusterManagerFailure    d8877676-7142-44e5-bad2-fd62e9beebec    ALARM        1    Nov 18 14:16:14 UTC 2024    Nov 18 14:16:14 UTC 2024    Failed to start cluster manager on node [10.128.67.22]: Storage initialization failed                
NODE    NodeInitFailure        d8877676-7142-44e5-bad2-fd62e9beebec    ALARM        1    Nov 18 14:16:14 UTC 2024    Nov 18 14:16:14 UTC 2024    Failed to create kvdb spec on the node 10.128.67.22: rpc error: code = Unavailable desc = transport is closing    
Collected at: 2024-11-18 14:18:02 UTC
command terminated with exit code 1
  • node3 (tstt9-d1-co-k8sw3)
$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph -- /usr/local/bin/pxctl status
ERRO[0000] Failed to get /status: Local storage spec not initialized 
KVDB is unreachable or misconfigured: kvdb instance not initialized


List of last known failures:

Type    ID            Resource                Severity    Count    LastSeen            FirstSeen            Description                        
NODE    ClusterManagerFailure    798ac086-be3d-496e-83c1-fa733e5c2174    ALARM        1    Nov 18 14:15:40 UTC 2024    Nov 18 14:15:40 UTC 2024    Failed to start cluster manager on node [10.128.67.23]: Storage initialization failed                
NODE    NodeInitFailure        798ac086-be3d-496e-83c1-fa733e5c2174    ALARM        1    Nov 18 14:15:40 UTC 2024    Nov 18 14:15:40 UTC 2024    Failed to create kvdb spec on the node 10.128.67.23: rpc error: code = Unavailable desc = transport is closing    
Collected at: 2024-11-18 14:18:11 UTC
command terminated with exit code 1

Context:

  • k8s cluster v1.29.6
  • Rocky Linux release 9.4 (Blue Onyx) nodes
  • 5.14.0-427.40.1.el9_4.x86_64 kernel
  • Portworx v3.2
  • Air-gapped cluster deployed on-premises on VMs, with a private registry for the Portworx images (works well)
  • We only need the CSI drivers for Pure Storage FlashArray; no need for the advanced storage cluster deployed by Portworx
  • /dev/sdb is the dedicated disk for the kvdb (see the spec snippet just below)
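
For reference, this is how I understood a dedicated internal kvdb disk would be declared in the StorageCluster (a minimal sketch, assuming spec.storage.kvdbDevice is the right field; the spec above does not set it yet):

spec:
  kvdb:
    internal: true
  storage:
    # hypothetical: give the internal kvdb its own disk instead of sharing the storage pool device
    kvdbDevice: /dev/sdb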

Documentation used:

Possible cause of the problem
I noticed in the PX cluster logs that node2 and node3 continuously try to reach the etcd endpoint on node1, without success. See the errors in the logs below.

I checked network communication between the etcd nodes successfully.

TCP port 9011 is reachable from each node.

So I do not think this is a network communication problem.

The etcd cluster cannot initialize and, as a consequence, the Portworx cluster cannot work.
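
For completeness, this is the kind of health probe I can run from node2/node3 against node1's etcd client port (a quick sketch, assuming the internal kvdb exposes the standard etcd HTTP /health endpoint on 9019):

# from node2 or node3: probe node1's internal kvdb (etcd) client endpoint
curl -sS --max-time 5 http://10.128.67.21:9019/health; echo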

Logs from node3 (same on node2):

$ kubectl -n portworx logs px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-fphph

@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:15.313745Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:15Z" level=error msg="[enumerate: pwx/px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526/storage/spec/] kvdb error: context deadline exceeded, retry count 50 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:15.55621Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:15Z" level=error msg="[set: lic_temp] kvdb error: context deadline exceeded, retry count 0 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079404Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079471Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}
@tstt9-d1-co-k8sw3 portworx[1488034]: time="2024-11-18T14:25:18Z" level=error msg="[set: lic_temp] kvdb error: rpc error: code = Unavailable desc = transport is closing, retry count 1 \n" file="kv_etcd.go:1793" component=kvdb/etcd/v3
@tstt9-d1-co-k8sw3 portworx[1488034]: {"level":"warn","ts":"2024-11-18T14:25:18.079508Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2f91eb07-9026-4f58-9e18-a0484b9e08fd/10.128.67.21:9019","attempt":0,"error":"rpc error: code = Unavailable desc = transport is closing"}

Open TCP ports on each node

  • node1
# netstat -nap | grep -w LISTEN |grep px
tcp        0      0 10.128.67.21:9018       0.0.0.0:*               LISTEN      1559004/px-etcd     
tcp        0      0 10.128.67.21:9002       0.0.0.0:*               LISTEN      1559760/px          
tcp        0      0 10.128.67.21:9003       0.0.0.0:*               LISTEN      1559730/px-storage  
tcp        0      0 127.0.0.1:9015          0.0.0.0:*               LISTEN      1557157/px-oci-mon  
tcp        0      0 0.0.0.0:9004            0.0.0.0:*               LISTEN      1559032/px-ns       
tcp6       0      0 :::9001                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9009                 :::*                    LISTEN      1559730/px-storage  
tcp6       0      0 :::9008                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9011                 :::*                    LISTEN      1559004/px-etcd     
tcp6       0      0 :::9013                 :::*                    LISTEN      1559032/px-ns       
tcp6       0      0 :::9012                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9014                 :::*                    LISTEN      1559016/px-diag     
tcp6       0      0 :::9017                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9019                 :::*                    LISTEN      1559004/px-etcd     
tcp6       0      0 :::9021                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9020                 :::*                    LISTEN      1559760/px          
tcp6       0      0 :::9022                 :::*                    LISTEN      1559025/px-healthmo 
  • node2
# netstat -nap | grep -w LISTEN |grep px
tcp        0      0 0.0.0.0:9004            0.0.0.0:*               LISTEN      2530366/px-ns       
tcp        0      0 127.0.0.1:9015          0.0.0.0:*               LISTEN      2561485/px-oci-mon  
tcp6       0      0 :::9009                 :::*                    LISTEN      2550167/px-storage  
tcp6       0      0 :::9011                 :::*                    LISTEN      2530334/px-etcd     
tcp6       0      0 :::9013                 :::*                    LISTEN      2530366/px-ns       
tcp6       0      0 :::9014                 :::*                    LISTEN      2530352/px-diag     
tcp6       0      0 :::9017                 :::*                    LISTEN      2550181/px          
tcp6       0      0 :::9020                 :::*                    LISTEN      2550181/px          
tcp6       0      0 :::9021                 :::*                    LISTEN      2550181/px          
tcp6       0      0 :::9022                 :::*                    LISTEN      2530360/px-healthmo 
tcp6       0      0 :::9001                 :::*                    LISTEN      2550181/px     
  • node3
# netstat -nap | grep -w LISTEN |grep px
tcp        0      0 0.0.0.0:9004            0.0.0.0:*               LISTEN      1489630/px-ns       
tcp        0      0 127.0.0.1:9015          0.0.0.0:*               LISTEN      1514328/px-oci-mon  
tcp6       0      0 :::9001                 :::*                    LISTEN      1505656/px          
tcp6       0      0 :::9009                 :::*                    LISTEN      1505643/px-storage  
tcp6       0      0 :::9011                 :::*                    LISTEN      1489592/px-etcd     
tcp6       0      0 :::9013                 :::*                    LISTEN      1489630/px-ns       
tcp6       0      0 :::9014                 :::*                    LISTEN      1489615/px-diag     
tcp6       0      0 :::9017                 :::*                    LISTEN      1505656/px          
tcp6       0      0 :::9020                 :::*                    LISTEN      1505656/px          
tcp6       0      0 :::9021                 :::*                    LISTEN      1505656/px          
tcp6       0      0 :::9022                 :::*                    LISTEN      1489622/px-healthmo

kvdb connectivity check from each node

Some TCP ports (9018 and 9019) are not open on node2 and node3. I think this is normal since the etcd cluster is not up yet: TCP port 9018 is used for etcd cluster data transit and TCP port 9019 for the etcd service.
I do not know what TCP port 9011 is used for.

TCP port 9011 is open and reachable from every node to every node.

  • node1
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
  • node2
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
  • node3
# for node in 10.128.67.21 10.128.67.22 10.128.67.23; do echo "## $node"; for port in 9011 9018 9019; do echo "## $port"; curl http://$node:$port; done; done
## 10.128.67.21
## 9011
404 page not found
## 9018
404 page not found
## 9019
404 page not found
## 10.128.67.22
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.22 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.22 port 9019: Connection refused
## 10.128.67.23
## 9011
404 page not found
## 9018
curl: (7) Failed to connect to 10.128.67.23 port 9018: Connection refused
## 9019
curl: (7) Failed to connect to 10.128.67.23 port 9019: Connection refused
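
If it helps, I can also share the internal kvdb member list as seen from node1 (assuming pxctl service kvdb members is the right subcommand for the internal kvdb):

$ kubectl -n portworx exec -it px-cluster-7d05886f-4a59-4a38-9273-2a9e2ca93526-wcdf8 -- /usr/local/bin/pxctl service kvdb members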

I would appreciate your help in solving this problem. Portworx is mandatory for us to consume our Pure Storage arrays from our Kubernetes cluster.

We also use NetApp Trident for our NetApp storage arrays, and its setup and installation is done in 30 minutes. I do not understand why it is so complicated and opaque with Portworx.

Regards,

Jax