Issues with on-prem installation

Portworx Essentials version: 2.8
Kubernetes version: 1.22.3 (built using kubeadm)

When I run the installation of Portworx Essentials using the operator, not all of the pods start correctly. Can you advise on this?

e.g. the portworx-api pods are in the Running state but have 0 ready containers:

portworx-api-bnr88                                      0/1     Running                 0                 42h
portworx-api-qjhjw                                      0/1     Running                 0                 42h

I’m not able to view any logs from these pods.
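
For reference, this is roughly how I’ve been trying to inspect them (pod name taken from the list above):

kubectl -n kube-system describe pod portworx-api-bnr88
kubectl -n kube-system logs portworx-api-bnr88 --previous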

The px-cluster pods are similarly missing a ready container:

px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-bkpbn   1/2     Running                 164 (57s ago)     42h
px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-q8jmc   1/2     Running                 164 (31s ago)     42h

and Lighthouse is failing to init because it can’t reach portworx-service (presumably because the API isn’t up):

kubectl -n kube-system logs px-lighthouse-656b55cfdd-htc6c -c config-init -p
time="2021-11-17T08:10:12Z" level=info msg="Creating new default config for lighthouse"
2021/11/17 08:10:42 Get http://portworx-service:9001/config: dial tcp 10.106.131.105:9001: i/o timeout Next retry in: 10s
time="2021-11-17T08:10:52Z" level=info msg="Creating new default config for lighthouse"
2021/11/17 08:11:22 Get http://portworx-service:9001/config: dial tcp 10.106.131.105:9001: i/o timeout Next retry in: 10s
time="2021-11-17T08:11:32Z" level=info msg="Creating new default config for lighthouse"
2021/11/17 08:12:02 Get http://portworx-service:9001/config: dial tcp 10.106.131.105:9001: i/o timeout Next retry in: 10s
time="2021-11-17T08:12:12Z" level=fatal msg="Error initializing lighthouse config. timed out performing task"

Logs from the csi-node-driver-registrar container in the px-cluster pod show that it cannot reach the CSI socket either:

kubectl -n kube-system logs px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-bkpbn -c csi-node-driver-registrar
...{repeated every few secs}
W1117 13:36:23.403311       1 connection.go:172] Still connecting to unix:///csi/csi.sock

The output of the portworx container is large, so I put it in a pastebin: portworx.log - Pastebin.com

I think the issue comes down to the csi-node-driver-registrar container not being able to connect to the CSI socket:

I1125 16:07:28.538410       1 main.go:137] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1125 16:07:28.538448       1 connection.go:153] Connecting to unix:///csi/csi.sock
W1125 16:07:38.538609       1 connection.go:172] Still connecting to unix:///csi/csi.sock
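
To see where that socket is supposed to land on the host, I can dump the hostPath volumes from one of the px-cluster pods (this is just a generic kubectl query; pod name from my cluster):

kubectl -n kube-system get pod px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-bkpbn -o yaml | grep -B1 -A2 hostPath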

What can cause this?

Updated to version 2.9, but no change; I’m still seeing pods in the “Running” state with containers that aren’t ready:

portworx-api-6b6bw                                      0/1     Running    0              112s
portworx-api-d9m5x                                      0/1     Running    0              89s
portworx-api-pv8gd                                      0/1     Running    0              89s
portworx-operator-84488c55c5-5mhzn                      1/1     Running    0              10d
px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-gslgv   1/2     Running    0              57s
px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-j2hjn   1/2     Running    0              84s
px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-wfqm9   1/2     Running    0              57s
px-csi-ext-5fb4cc4bff-ld5pk                             3/3     Running    0              2m2s
px-csi-ext-5fb4cc4bff-w6t2n                             3/3     Running    0              99s
px-csi-ext-5fb4cc4bff-xc6kd                             3/3     Running    0              99s
px-lighthouse-656b55cfdd-jphgm                          0/3     Init:0/1   1 (38s ago)    2m49s
stork-969ff57d5-htq67                                   1/1     Running    0              3m42s
stork-969ff57d5-hwdht                                   1/1     Running    0              3m19s
stork-969ff57d5-wmp6c                                   1/1     Running    0              3m19s
stork-scheduler-8699d795f9-8sm5j                        1/1     Running    0              3m10s
stork-scheduler-8699d795f9-hlpgw                        1/1     Running    0              3m33s
stork-scheduler-8699d795f9-xr9jl                        1/1     Running    0              3m10s

I still get the warning about the csi-node-driver-registrar connecting to the CSI socket.
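
I checked with the same command as before, just against one of the new pods:

kubectl -n kube-system logs px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-gslgv -c csi-node-driver-registrar --tail=20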

I don’t see the socket open on the node when using ss -x | grep csi. Is there some other component that needs to be installed on the node?
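
For completeness, this is the host-side check I ran, assuming the CSI socket should show up under the kubelet plugins directories (the exact pxd.portworx.com subdirectory is my guess based on the CSI driver name):

sudo ls -la /var/lib/kubelet/plugins/
sudo ls -la /var/lib/kubelet/plugins_registry/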

open-iscsi and multipath-tools are up to date

~$ sudo apt-cache policy open-iscsi
open-iscsi:
  Installed: 2.0.874-7.1ubuntu6.2
  Candidate: 2.0.874-7.1ubuntu6.2
  Version table:
 *** 2.0.874-7.1ubuntu6.2 500
        500 http://gb.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.0.874-7.1ubuntu6 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages

~$ sudo apt-cache policy multipath-tools
multipath-tools:
  Installed: 0.8.3-1ubuntu2
  Candidate: 0.8.3-1ubuntu2
  Version table:
 *** 0.8.3-1ubuntu2 500
        500 http://gb.archive.ubuntu.com/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status

Events from the StorageCluster object:

Events:
  Type     Reason            Age                   From                       Message
  ----     ------            ----                  ----                       -------
  Normal   SuccessfulCreate  20m                   storagecluster-controller  Created pod: px-cluster-c41aa9d0-1398-415c-b3cc-5de9cc451fe8-zmqql
  Normal   SuccessfulCreate  20m                   storagecluster-controller  Created pod: px-cluster-c41aa9d0-1398-415c-b3cc-5de9cc451fe8-4sgkw
  Normal   SuccessfulCreate  20m                   storagecluster-controller  Created pod: px-cluster-c41aa9d0-1398-415c-b3cc-5de9cc451fe8-9mblx
  Normal   SuccessfulCreate  11m                   storagecluster-controller  Created pod: px-cluster-c41aa9d0-1398-415c-b3cc-5de9cc451fe8-6t4kr
  Warning  FailedComponent   6m28s (x13 over 20m)  storagecluster-controller  Failed to setup Monitoring. Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/mutate?timeout=10s": dial tcp 10.106.178.125:443: i/o timeout
  Warning  FailedComponent   81s (x35 over 21m)    storagecluster-controller  Failed to setup Monitoring. Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-operator.monitoring.svc:443/admission-prometheusrules/mutate?timeout=10s": context deadline exceeded

This is odd, as I have spec.monitoring.prometheus.enabled: false set in my operator spec.
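
To double-check what the operator actually sees, this is the query I can run against the live StorageCluster (assuming a single StorageCluster object in kube-system):

kubectl -n kube-system get storagecluster -o jsonpath='{.items[0].spec.monitoring}'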

Two warnings on the px-cluster pods:

  Warning  Unhealthy                          3m19s (x72 over 13m)  kubelet   Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  NodeStartFailure                   25s (x9 over 11m)     portworx  Failed to start Portworx: error loading node identity: Cause: ProviderInternal Error: failed to attach volume 8be366ddb6: volume does not exist

It’s complaining that the volume doesn’t exist. What is supposed to create this volume, and how?
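
In case it’s useful, these are the extra checks I can run and post output from (storagenodes is the operator’s per-node CRD; pxctl lives inside the portworx container, so the status call may itself fail while the node is stuck):

kubectl -n kube-system get storagenodes -o wide
kubectl -n kube-system exec px-cluster-e3249027-e17e-4cd6-bba2-48690ca7326f-gslgv -c portworx -- /opt/pwx/bin/pxctl status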