Portworx pods fail to come up after DaemonSet installation on an on-prem k8s cluster

I have been trying to install Portworx on an on-prem Kubernetes cluster. I generated the spec file from the PX-Central console and applied it against my cluster, but the portworx-api pods fail to come up. I then tried to collect the Portworx logs using the command listed in Troubleshoot Portworx on Kubernetes:
kubectl logs -n kube-system -l name=portworx -c portworx --tail=99999
The command fails with the error: Error from server: Get "https://172.23.105.137:10250/containerLogs/kube-system/portworx-5mxmj/portworx?tailLines=99999": dial tcp 172.23.105.137:10250: connect: no route to host.
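
The "no route to host" on the kubelet port (10250) makes me suspect a firewall between the nodes, since CentOS 7 ships with firewalld enabled by default. A rough sketch of what I plan to check, assuming firewalld is the firewall in use on the workers:
    # on the worker node from the error (172.23.105.137)
    sudo firewall-cmd --state
    sudo firewall-cmd --list-ports
    # open the kubelet port if it is not already listed
    sudo firewall-cmd --permanent --add-port=10250/tcp && sudo firewall-cmd --reload
    # from the machine where kubectl runs, confirm the port is now reachable
    nc -zv 172.23.105.137 10250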

Env:
K8s version: 1.21.7
Base machine: CentOS 7
Portworx version: 2.8

Process of installing Portworx on Kubernetes:

  1. Generated the spec file from PX-Central; did not specify etcd or KVDB values, leaving all params at their defaults.
  2. Applied the px-spec file to the k8s cluster (the exact commands are sketched below the pod list).
  3. Status of the portworx pods in kube-system:
    NAME READY STATUS RESTARTS AGE
    portworx-4zrk9 1/2 Running 86 22h
    portworx-5mxmj 1/2 Running 86 22h
    portworx-api-2kvkt 0/1 Running 0 22h
    portworx-api-br254 0/1 Running 0 22h
    portworx-api-tkp4b 0/1 Running 0 22h
    portworx-slbsd 1/2 Running 86 22h
    px-csi-ext-577876dcb8-hmblh 4/4 Running 0 22h
    px-csi-ext-577876dcb8-hzjnp 4/4 Running 0 22h
    px-csi-ext-577876dcb8-jfd6m 4/4 Running 0 22h
    stork-59dfbd5f89-4w7jq 1/1 Running 0 22h
    stork-59dfbd5f89-g4l75 1/1 Running 0 22h
    stork-59dfbd5f89-qjct8 1/1 Running 0 22h
    stork-scheduler-6c5b979799-45vbq 1/1 Running 0 22h
    stork-scheduler-6c5b979799-mr62s 1/1 Running 0 22h
    stork-scheduler-6c5b979799-r8bts 1/1 Running 0 22h
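
For reference, steps 2 and 3 were roughly the following (px-spec.yaml is simply the name I saved the generated spec under):
    kubectl apply -f px-spec.yaml
    kubectl get pods -n kube-system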

How do I start troubleshooting or collecting logs for Portworx? I can share the output of lsblk or blkid if required.

Describe of one of the portworx-api pods:

[root@localhost ~]# kubectl describe pods/portworx-api-2kvkt  -n kube-system
Name:         portworx-api-2kvkt
Namespace:    kube-system
Priority:     0
Node:         workernode2.localdomain/172.23.105.1
Start Time:   Thu, 27 Jan 2022 06:46:55 -0800
Labels:       controller-revision-hash=db477d449
              name=portworx-api
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           172.23.105.1
IPs:
  IP:           172.23.105.1
Controlled By:  DaemonSet/portworx-api
Containers:
  portworx-api:
    Container ID:   docker://8be2997ab551cde34449d56becb8f2e3211f0887e6501c14db5f81713d3ae564
    Image:          k8s.gcr.io/pause:3.1
    Image ID:       docker-pullable://k8s.gcr.io/pause@sha256:f78411e19d84a252e53bff71a4407a5686c46983a2c2eeed83929b888179acea
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Thu, 27 Jan 2022 06:47:10 -0800
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://127.0.0.1:9001/status delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vkbv9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-vkbv9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m55s (x8101 over 22h)  kubelet  Readiness probe failed: Get "http://127.0.0.1:9001/status": dial tcp 127.0.0.1:9001: connect: connection refused

FYI: I did label the nodes with px/metadata-node=true. I also provided the KVDB device value as /dev/xvdb, since xvdb is present in my lsblk output, but I am still hitting the same error.
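
In case it helps narrow things down, here is roughly what I can check directly on a worker node, assuming the DaemonSet install sets up the usual portworx systemd unit and the default /opt/pwx paths:
    # confirm the intended KVDB device is visible and unmounted
    lsblk /dev/xvdb
    # state of the Portworx service started by the portworx (oci-monitor) container
    sudo systemctl status portworx
    # Portworx CLI status, if the node got far enough to install pxctl
    sudo /opt/pwx/bin/pxctl status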

Output from one of the worker nodes is attached: journalctl -lu portworx* > node.logs

Spec file: https://install.portworx.com/2.8?mc=false&kbver=1.21.7&oem=esse&user=4906ceae-ba0a-11eb-a2c5-c24e499c7467&b=true&kd=%2Fdev%2Fxvdb&c=px-cluster-46cef63f-291b-4f3b-8d59-06c6442c78be&stork=true&csi=true&mon=true&tel=false&st=k8s

Looking at the provided spec file, you are telling Portworx to start and consume all unmounted disks. You also mention labeling the nodes and providing a path for the KVDB device.

However, the start command in the log file you attached looks like it is trying to create disks for GKE on GCP. This line is the command telling Portworx what to use:

Jan 27 06:42:26 workernode.localdomain portworx[8857]: time="2022-01-27T06:42:26-08:00" level=info msg="PX-RunC arguments: -b -c px-cluster-e0ab5f47-bb74-47cd-acf7-070b0d677ae3 -kvdb_dev type=pd-standard,size=150 -s type=pd-ssd,size=50 -secret_type k8s -x kubernetes"

The type=pd-standard and pd-ssd are GCP volume types.
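
For comparison, the spec you linked (kd=/dev/xvdb plus auto-discovered storage disks) would normally translate into arguments along these lines (this is an illustration, not a line from your logs; -a simply tells Portworx to use all available unmounted drives):
    PX-RunC arguments: -b -c <your-cluster-name> -kvdb_dev /dev/xvdb -a -secret_type k8s -x kubernetes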

So it appears the manifest you linked and the logs do not match up.

I would suggest running kubectl delete -f <your-px-specfile.yaml>, then going to Central and generating a new cluster configuration. If /dev/xvdb is the only additional device connected, do not dedicate it to the KVDB. Instead, leave the KVDB device unspecified and Portworx will use a portion of that drive for the KVDB and the rest for persistent storage.
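
Roughly, the clean-up and re-install cycle looks like this (the filenames are placeholders, and the px-wipe step is only needed if Portworx partially initialized the drives on the nodes):
    kubectl delete -f <your-px-specfile.yaml>
    # optional: wipe any leftover Portworx state from the nodes before re-installing
    curl -fsL https://install.portworx.com/px-wipe | bash
    # generate a fresh spec in Central, then apply it and watch the pods come up
    kubectl apply -f <your-new-px-specfile.yaml>
    kubectl get pods -n kube-system -l name=portworx -w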

It may also be necessary to click the star in the upper right of the Central page and disassociate the cluster ID of this failed deployment if it made it far enough to check in with our license servers to validate PX-Essentials.

Good Luck!

Thanks for the response; it looks like I wasn’t disassociating the cluster. When I unlinked my old cluster and generated a spec for a fresh one, it worked.
