Portworx Essentials 2.5 failing to install on OCP 4.5 on AWS

I am trying to install Portworx Essentials 2.5 on an OCP 4.5 cluster on AWS. It is failing with the following error:

$ oc exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status
Defaulting container name to portworx.
Use 'oc describe pod/portworx-8fvf9 -n kube-system' to see all of the containers in this pod.
PX is not running on this host: Could not reach 'HealthMonitor'

List of last known failures:

Type ID Resource Severity Count LastSeen FirstSeen Description
NODE FileSystemDependency ip-10-0-171-59 ALARM 1 Oct 21 07:33:43 UTC 2020 Oct 21 07:33:43 UTC 2020 Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting...
NODE PortworxMonitorSchedulerInitializationFailed ip-10-0-171-59 ALARM 1 Oct 21 07:02:59 UTC 2020 Oct 21 07:02:59 UTC 2020 Could not init scheduler 'kubernetes': Could not find my node in Kubernetes cluster: Get https://172.30.0.1:443/api/v1/nodes: dial tcp 172.30.0.1:443: connect: no route to host
command terminated with exit code 1

Can you share the complete logs from one of the worker/compute nodes?
journalctl -lu portworx* > /tmp/px-fail.log
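If you cannot SSH to the node directly, one OCP-friendly way to run the same command is through a debug pod (a sketch; substitute your worker node name):

oc debug node/<worker-node> -- chroot /host journalctl -lu 'portworx*' > /tmp/px-fail.log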

-- Logs begin at Wed 2020-10-21 09:44:12 UTC, end at Wed 2020-10-21 13:31:48 UTC. --
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Listening on Portworx logging FIFO.
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Started Portworx FIFO logging reader.
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Starting Portworx OCI Container…
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Started Portworx OCI Container.
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Rootfs found at /var/opt/pwx/oci/rootfs"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX binaries found at /var/opt/pwx/bin/px-runc"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Initializing as version 2.5.7.0-686a134 (OCI)"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="SPEC READ [1ef01fb2ba5288a806834cfb1d7c4304 /var/opt/pwx/oci/config.json]"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Enabling Sharedv4 NFS support ..."
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Setting up NFS service"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="> Initialized service controls via DBus{type:dbus,svc:nfs-server.service,id:0xc4201d7180}"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Checking mountpoints for following shared directories: [/var/lib/kubelet /var/lib/osd]"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Found following mountpoints for shared dirs: map[/var/lib/kubelet:{isMP=f,Opts=shared:3,Pare>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC arguments: -b -c px-cluster-d547b53f-7253-4336-b058-41e389fdda2b -kvdb_dev type=gp2,>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC mounts: /dev:/dev /etc/exports:/etc/exports /opt/pwx/oci/mounts/etc/hosts:/etc/hosts>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC env: ALERTMANAGER_PORTWORX_PORT=tcp://172.30.75.232:9093 ALERTMANAGER_PORTWORX_PORT_>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Switched '/var/opt/pwx/oci' to PRIVATE mount propagation"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Found 2 usable runc binaries: /var/opt/pwx/bin/runc and /var/opt/pwx/bin/runc-fb"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Detected kernel release 4.18.0-193"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Exec: ["/var/opt/pwx/bin/runc" "run" "-b" "/var/opt/pwx/oci" "--no-new-keyring" ">
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: Executing with arguments: -b -c px-cluster-d547b53f-7253-4336-b058-41e389fdda2b -kvdb_dev type=gp2,size=150 -s type=gp2,size=1000 -secre>
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Wed Oct 21 13:25:51 UTC 2020 : Running version 2.5.7.0-686a134 on Linux ip-10-0-132-7 4.18.0-193.24.1.el8_2.dt1.x86_64 #1 SMP Thu Sep 24>
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Version: Linux version 4.18.0-193.24.1.el8_2.dt1.x86_64 (mockbuild@x86-vm-08.build.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red >
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:51Z" level=error msg="Cannot listen on UNIX socket: listen unix /run/docker/plugins/pxd.sock: bind: no such file >
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Error: failed to listen on pxd.sock
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /hostusr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /hostusr/src/linux-headers-4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: ln: failed to create symbolic link 'compiler-gcc5.h': Read-only file system
Oct 21 13:25:53 ip-10-0-132-7 portworx[605595]: Using /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64 with defines CC=gcc-8
Oct 21 13:25:53 ip-10-0-132-7 portworx[605595]: Creating px fs…
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using cluster: px-cluster-d547b53f-7253-4336-b058-41e389fdda2b
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using kvdb device: type=gp2,size=150
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Warning skipping device exists check for: type=gp2,size=150.
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using storage device: type=gp2,size=1000
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using scheduler: kubernetes
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Warning skipping device validation check for: type=gp2,size=1000.
Oct 21 13:25:57 ip-10-0-132-7 portworx[605595]: checking remote download site, please wait...
Oct 21 13:25:57 ip-10-0-132-7 portworx[605595]: Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting...
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Main process exited, code=exited, status=10/n/a
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Failed with result 'exit-code'.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Consumed 6.432s CPU time
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Stopping Portworx FIFO logging reader…
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Stopped Portworx FIFO logging reader.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx-output.service: Consumed 33ms CPU time
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Closed Portworx logging FIFO.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.socket: Consumed 0 CPU time

Can you attach the complete log file? Please don't just share snippets of the content; I will need to go through the complete log to understand it.

Hi Sanjay,

I don't see an option to attach a text file here. I tried the upload button, but it says: "Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, heic, heif)."

You can use https://gist.github.com/ or Pastebin, or save the file with one of the above extensions and upload it; I will rename and check it.
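For example, a throwaway copy with an allowed extension (the filename is just an example):

cp /tmp/px-fail.log /tmp/px-fail-log.png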

Satya,

Can you install the latest version, 2.6? It looks like 2.5.7 is failing to detect this newer kernel. Can you give the latest 2.6.1.3 release a try? It has support for the latest kernel.
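If you are on the operator, one way is to point the StorageCluster at the newer image and let it roll the pods. This is only a sketch: the StorageCluster name below is assumed from the cluster ID in your logs, and the exact image tag should be verified, so please check both with oc get storagecluster -n kube-system first.

oc -n kube-system patch storagecluster px-cluster-d547b53f-7253-4336-b058-41e389fdda2b \
  --type merge -p '{"spec":{"image":"portworx/oci-monitor:2.6.1.3"}}'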

Sure, I will try with version 2.6 and update you with the results.

It is still failing with 2.6 as well. Is it some port issue?
https://gist.github.com/satyamodi/784bff332b5a4e55c009447251e72c98

No, it's not a port issue; it's a kernel dependency issue:
Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting..
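You can confirm the kernel the node is actually running, for example (node name taken from the alert above):

oc debug node/ip-10-0-171-59 -- chroot /host uname -r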

Can you tell me which operator version you have installed?
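For example, assuming the operator runs as the portworx-operator deployment in kube-system (adjust the namespace and name if yours differ):

oc -n kube-system get deployment portworx-operator -o jsonpath='{.spec.template.spec.containers[0].image}'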

It's Portworx Operator 1.4.0.

Any update on this issue?

Can you do the following:

And let me know how it goes; if anything fails, please share the logs again.

It is still failing after I copied px.ko to all three worker nodes and restarted the Portworx service.

[core@ip-10-0-132-7 ~]$ ls -ltr /var/lib/osd/pxfs/latest/8.px.ko
-rw-r--r--. 1 root root 2391816 Oct 22 08:17 /var/lib/osd/pxfs/latest/8.px.ko
[core@ip-10-0-132-7 ~]$ date
Thu Oct 22 08:45:44 UTC 2020
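A quick sanity check I can also run on the node if useful (the module name px is an assumption based on the px.ko filename):

lsmod | grep -w px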

Attaching the logs for the same; the file contains yesterday's logs as well.

Can you restart the Portworx service again and check? Some changes have been made and the repo has been updated.
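For example, restart the unit on each worker (the unit name matches the journalctl output above) and then check status from the cluster:

sudo systemctl restart portworx
oc exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status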

Now PX volumes are getting created, but it is not stable; the volumes detach instantly. Attaching all the logs.

PX_POD=$(oc get pods -l name=portworx -n kube-system -o jsonpath='{.items[0].metadata.name}')
$ oc exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status
Defaulting container name to portworx.
Use 'oc describe pod/portworx-7pprp -n kube-system' to see all of the containers in this pod.
PX is not running on this host: Could not reach 'HealthMonitor'

List of last known failures:

Type ID Resource Severity Count LastSeen FirstSeen Description
NODE NodeStartFailure 7d9de820-b8b6-4171-a94a-6f2d812e9029 ALARM 1 Oct 23 09:10:56 UTC 2020 Oct 23 09:10:56 UTC 2020 Failed to start Portworx: failed in internal kvdb setup: failed to create a kvdb connection to peer internal kvdb nodes dial tcp 10.0.135.69:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.
NODE KvdbConnectionFailed 7d9de820-b8b6-4171-a94a-6f2d812e9029 ALARM 1 Oct 23 09:10:50 UTC 2020 Oct 23 09:10:50 UTC 2020 Internal Kvdb: failed to create a kvdb connection to peer internal kvdb nodes dial tcp 10.0.135.69:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.
NODE PortworxMonitorInstallFailed ip-10-0-151-22 ALARM 1 Oct 23 08:59:40 UTC 2020 Oct 23 08:59:40 UTC 2020 Could not finalize OCI install: Timeout
NODE PortworxMonitorSchedulerInitializationFailed ip-10-0-151-22 ALARM 1 Oct 23 08:54:35 UTC 2020 Oct 23 08:54:35 UTC 2020 Could not init scheduler 'kubernetes': Could not find my node in Kubernetes cluster: Get https://172.30.0.1:443/api/v1/nodes: dial tcp 172.30.0.1:443: connect: no route to host
command terminated with exit code 1

Satya,

I see the following error: "Failed to attach DriveSet (3b897f20-9729-4dc0-a633-0d00709130f8): Availability zone misma>

Can you share the output of oc get nodes --show-labels and the spec file you used for the install?
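For example, to see the zone labels directly (which of the two keys is populated depends on the Kubernetes version; both are shown here):

oc get nodes -L failure-domain.beta.kubernetes.io/zone -L topology.kubernetes.io/zone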

Updated the link with the output at the end:

# The Portworx Operator will install pods only on nodes that have the label node-role.kubernetes.io/compute=true

I am labeling the nodes using a shell script:
WORKER_NODES=$(oc get nodes | grep worker | awk '{print $1}')
for wnode in $WORKER_NODES; do
  oc label nodes $wnode node-role.kubernetes.io/compute=true
done
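To verify the label was applied, for example:

oc get nodes -l node-role.kubernetes.io/compute=true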

The following ports are open for the master and worker security groups:
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $MASTER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $WORKER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $MASTER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $WORKER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $MASTER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $WORKER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $MASTER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $WORKER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $MASTER_GROUP_ID
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $WORKER_GROUP_ID
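To double-check that the rules actually landed on the worker security group, for example:

aws ec2 describe-security-groups --group-ids $WORKER_GROUP_ID \
  --query 'SecurityGroups[0].IpPermissions[].{Proto:IpProtocol,From:FromPort,To:ToPort}' --output table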

Hi,
Any update on this issue?