V2.7 install failing on IBM Cloud ROKS (OpenShift 4.6)

Been at this for a while and can’t get Portworx Essentials installed on IBM Cloud ROKS (OpenShift 4.6).

Here is my spec:

# SOURCE: https://install.portworx.com/?operator=true&mc=false&kbver=1.19.0%2B263ee0d&oem=esse&user=0eeca59c-c7ab-11ea-a2c5-c24e499c7467&b=true&s=%2Fdev%2Fdm-1&m=eth0&d=eth0&c=px-cluster-6604ffbb-4120-4f25-8fbc-b4d803a96530&osft=true&stork=true&st=k8s&rsec=px-essential
kind: StorageCluster
apiVersion: core.libopenstorage.org/v1
metadata:
  name: px-cluster-6604ffbb-4120-4f25-8fbc-b4d803a96530
  namespace: kube-system
  annotations:
    portworx.io/install-source: "https://install.portworx.com/?operator=true&mc=false&kbver=1.19.0%2B263ee0d&oem=esse&user=0eeca59c-c7ab-11ea-a2c5-c24e499c7467&b=true&s=%2Fdev%2Fdm-1&m=eth0&d=eth0&c=px-cluster-6604ffbb-4120-4f25-8fbc-b4d803a96530&osft=true&stork=true&st=k8s&rsec=px-essential"
    portworx.io/is-openshift: "true"
    portworx.io/misc-args: "--oem esse"
spec:
  image: portworx/oci-monitor:2.7.0
  imagePullPolicy: Always
  imagePullSecret: px-essential
  kvdb:
    internal: true
  storage:
    devices:
    - /dev/dm-1
  network:
    dataInterface: eth0
    mgmtInterface: eth0
  secretsProvider: k8s
  stork:
    enabled: true
    args:
      webhook-controller: "false"
  autopilot:
    enabled: true

The current issue is that the px-cluster-* pods are running but not ready. Here is an event from one of the pods:

Warning Unhealthy 14s (x114 over 19m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503

Here are the last 50 lines of the pod log:

@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:08:57Z" level=info msg="Made 1 pools"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:08:57Z" level=info msg="Benchmarking drive  /dev/sda"
time="2021-05-12T21:09:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:07Z" level=info msg="fio: test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128\nfio-2.2.10\nStarting 1 process\n\ntest: (groupid=0, jobs=1): err= 0: pid=1877: Wed May 12 21:09:07 2021\n read : io=129528KB, bw=12900KB/s, iops=3224, runt= 10041msec\n    slat (usec): min=0, max=2647, avg=11.00, stdev=24.61\n    clat (msec): min=1, max=379, avg=39.66, stdev=22.42\n     lat (msec): min=1, max=379, avg=39.67, stdev=22.42\n    clat percentiles (msec):\n     |  1.00th=[    4],  5.00th=[   11], 10.00th=[  19], 20.00th=[   21],\n     | 30.00th=[   30], 40.00th=[   31], 50.00th=[   40], 60.00th=[   41],\n     | 70.00th=[   50], 80.00th=[   60], 90.00th=[   61],95.00th=[   70],\n     | 99.00th=[  110], 99.50th=[  159], 99.90th=[  221], 99.95th=[  269],\n     | 99.99th=[  330]\n    bw (KB  /s): min=11402, max=24120, per=99.68%, avg=12858.35, stdev=2660.10\n    lat (msec) : 2=0.07%, 4=1.46%, 10=3.46%, 20=12.86%, 50=52.58%\n    lat (msec) : 100=28.47%, 250=1.04%, 500=0.06%\n  cpu          : usr=2.07%, sys=5.48%, ctx=4011, majf=0, minf=166\n  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.8%\n     submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%\n     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%\n     issued    : total=r=32382/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0\n     latency   : target=0, window=0, percentile=100.00%, depth=128\n\nRun status group 0(all jobs):\n   READ: io=129528KB, aggrb=12899KB/s, minb=12899KB/s, maxb=12899KB/s, mint=10041msec, maxt=10041msec\n\nDisk stats (read/write):\n  sda: ios=31946/0, merge=32/0, ticks=1257442/0, in_queue=1260582, util=99.07%\n"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:07Z" level=info msg="Storage pool WriteThroughput 12MB/s"
time="2021-05-12T21:09:11Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
time="2021-05-12T21:09:21Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:22Z" level=error msg="Unable to start internal kvdb on this node" err="failed in initializing drives on this node: Failed to format [-f --nodiscard /dev/sda]: ERROR: unable to open /dev/sda: Device or resource busy" fn=kvdb-provisioner.ProvisionKvdbWithoutLock id=5081f13f-81c2-4082-b5f5-f8b7b17a4776
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:22Z" level=error msg="failed to setup internal kvdb:failed to provision internal kvdb: failed in initializing drives on this node: Failed to format [-f --nodiscard /dev/sda]: ERROR: unable to open /dev/sda: Device or resource busy" func=InitAndBoot package=boot
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:22Z" level=error msg="Could not init boot manager" error="failed to setup internal kvdb: failed to provision internal kvdb: failed in initializing drives on this node: Failed to format [-f --nodiscard /dev/sda]: ERROR: unable to open /dev/sda: Device or resource busy"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PXPROCS[INFO]: px daemon exited with code: 1
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12 21:09:23,628 INFO exited: pxdaemon (exit status 1; not expected)
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12 21:09:23,630 INFO spawned: 'pxdaemon' with pid 1888
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PX_STORAGE_IO_FLUSHER=yes
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: Starting as an IOFlusher process : /usr/local/bin/start_pxcontroller_pxstorage.py
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: Process with PID 1888, is a IO Flusher
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12 21:09:23,654 INFO reaped unknown pid 1768
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PXPROCS[INFO]: Started px-storage with pid 1922
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: bash: connect: Connection refused
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: bash: /dev/tcp/localhost/17006: Connection refused
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PXPROCS[INFO]: px-storage not started yet...sleeping
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12 21:09:25,215 INFO reaped unknown pid 1863
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PXPROCS[INFO]: Started px with pid 1939
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: PXPROCS[INFO]: Started watchdog with pid 1940
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12_21:09:26: PX-Watchdog: Starting watcher
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12_21:09:26: PX-Watchdog: Waiting for px process to start
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12_21:09:26: PX-Watchdog: (pid 1939): Begin monitoring
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Registering [kernel] as a volume driver"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Registered the Usage based Metering Agent...."
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Setting log level to info(4)"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=error msg="Cannot listen on UNIX socket: listen unix /run/docker/plugins/pxd.sock: bind: no such file or directory"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=warning msg="Failed to start pxd-dummy: failed to listen on pxd.sock, ingnoring and continuing..."
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="read config from env var" func=init package=boot
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="read config from config.json" func=init package=boot
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Alerts initialized successfullyfor this cluster"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Node is not yet initialized" func=setNodeInfo package=boot
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Generated a new NodeID: 244996ec-254d-4389-b6aa-7abf1aad486c"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Found mgmt interface device:[eth0]"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:27Z" level=info msg="Found data interface device:[eth0]"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:28Z" level=info msg="Using interface device:[eth0] for management..."
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:28Z" level=info msg="Using interface device:[eth0] for data..."
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:28Z" level=info msg="Detected Machine Hardware Type as: xen (Virtual Machine)"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:28Z" level=info msg="Bootstraping internal kvdb service." fn=kv-store.New id=244996ec-254d-4389-b6aa-7abf1aad486c
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: 2021-05-12 21:09:29,109 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
time="2021-05-12T21:09:31Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
time="2021-05-12T21:09:41Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:09:45Z" level=warning msg="Locked for 15 seconds" Error="ConfigMap is locked" Function=Lock Module=ConfigMap Name=px-bootstrap-pxstoragecluster Owner=6f388ee6-818a-4301-8eb6-be34bdba973b
time="2021-05-12T21:09:51Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"
@kube-c2cose7w02fpblcs6i10-roksaiops05-default-00000378.iks.ibm portworx[283510]: time="2021-05-12T21:10:01Z" level=warning msg="Locked for 30 seconds" Error="ConfigMap is locked" Function=Lock Module=ConfigMap Name=px-bootstrap-pxstoragecluster Owner=5fc3109e-e137-40d3-8f42-4ca76bf98acf
time="2021-05-12T21:10:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:17001/v1/cluster/nodehealth: dial tcp 127.0.0.1:17001: connect: connection refused"

Hi… I’ve struggled several times to install Portworx in my lab, and it finally worked with a couple of considerations:
1 - To reinstall (that is, to try again), you should wipe the previous installation with this command:
curl -fsL "https://install.portworx.com/px-wipe" | bash
2 - When you wipe the nodes, you should remove the spec and create a new one.
3 - Review the resources. I had to add more vCPU: Prerequisites
4 - The problem could be the KVDB deployed inside the pods… I decided to define specific devices (vdisks) and put them in the spec (no auto; check the size, KVDB devices should be 64 GB minimum). See the example after this list.
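For reference, a dedicated KVDB device can be declared in the StorageCluster spec roughly like this (a sketch only; /dev/sdc is a placeholder device name, so verify the field names against the Portworx StorageCluster reference for your operator version):

spec:
  kvdb:
    internal: true
  storage:
    devices:
    - /dev/dm-1
    # Placeholder: an unmounted, unformatted vdisk dedicated to the internal KVDB
    kvdbDevice: /dev/sdc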
I hope this helps, Marcelo

@Mat_Davis as @Marcelo_Soria mentioned, the error shows it is failing to initialize the internal KVDB because it cannot open /dev/sda (Device or resource busy).

I think you need to add vdisks for Portworx. These drives should not be mounted or formatted.
Can you please get the output of lsblk from all the worker nodes?
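One way to collect that on ROKS is with a node debug pod (a sketch, assuming cluster-admin access and the oc CLI; adjust to your environment):

for node in $(oc get nodes -o name); do
  echo "== ${node} =="
  # Run lsblk against the host filesystem from a debug pod on each worker
  oc debug "${node}" -- chroot /host lsblk -f
done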

@sensre One major issue that I’ve discovered is that the portworx-pvc-controller-* pods never get created during the operator install. You have to manually define them using the following manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: portworx-pvc-controller-account
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: portworx-pvc-controller-role
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["create","delete","get","list","update","watch"]
- apiGroups: [""]
  resources: ["persistentvolumes/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "update", "watch"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "delete", "get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["endpoints", "services"]
  verbs: ["create", "delete", "get", "update"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["watch", "create", "patch", "update"]
- apiGroups: [""]
  resources: ["serviceaccounts"]
  verbs: ["get", "create"]
- apiGroups: [""]
  resources: ["serviceaccounts/token"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "create", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: portworx-pvc-controller-role-binding
subjects:
- kind: ServiceAccount
  name: portworx-pvc-controller-account
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: portworx-pvc-controller-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  labels:
    tier: control-plane
  name: portworx-pvc-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: portworx-pvc-controller
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: portworx-pvc-controller
        tier: control-plane
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "name"
                    operator: In
                    values:
                    - portworx-pvc-controller
              topologyKey: "kubernetes.io/hostname"
      hostNetwork: true
      containers:
      - command:
        - kube-controller-manager
        - --leader-elect=true
        - --address=0.0.0.0
        - --controllers=persistentvolume-binder,persistentvolume-expander
        - --use-service-account-credentials=true
        - --leader-elect-resource-lock=endpoints
        image: gcr.io/google_containers/kube-controller-manager-amd64:v1.16.2
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 8
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 10252
            scheme: HTTP
          initialDelaySeconds: 15
          timeoutSeconds: 15
        name: portworx-pvc-controller-manager
        resources:
          requests:
            cpu: 200m
      serviceAccountName: portworx-pvc-controller-account
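Once defined, the manifest can be applied and checked with something like this (the file name is just an example):

oc apply -f portworx-pvc-controller.yaml
oc -n kube-system get pods -l name=portworx-pvc-controller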

@sensre is this a known issue or is there some mechanism for me to open a support ticket for this as an Essentials customer?

@Mat_Davis Yes, this is a known issue for cloud-based Kubernetes service providers, which need this PVC controller to function properly. Some clusters like EKS, GKE, etc. have the PVC controller by default in this PX 2.7.x version, but I am not sure about IBM Cloud ROKS. I will check internally and confirm.

Essentials customers can’t open a support case, but you can use the general Slack or this forum for discussions or queries.