Portworx essential 2.5 failing to install on OCP 4.5 on AWS

I am trying to install Portworx essential 2.5 on an OCP 4.5 cluster on AWS. It is failing with the following error:

$ oc exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status
Defaulting container name to portworx.
Use 'oc describe pod/portworx-8fvf9 -n kube-system' to see all of the containers in this pod.
PX is not running on this host: Could not reach 'HealthMonitor'

List of last known failures:

Type ID Resource Severity Count LastSeen FirstSeen Description
NODE FileSystemDependency ip-10-0-171-59 ALARM 1 Oct 21 07:33:43 UTC 2020 Oct 21 07:33:43 UTC 2020 Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting…
NODE PortworxMonitorSchedulerInitializationFailed ip-10-0-171-59 ALARM 1 Oct 21 07:02:59 UTC 2020 Oct 21 07:02:59 UTC 2020 Could not init scheduler 'kubernetes': Could not find my node in Kubernetes cluster: Get https://172.30.0.1:443/api/v1/nodes: dial tcp 172.30.0.1:443: connect: no route to host
command terminated with exit code 1

Can you share the complete logs from one of the worker/compute nodes?
journalctl -lu portworx* > /tmp/px-fail.log

-- Logs begin at Wed 2020-10-21 09:44:12 UTC, end at Wed 2020-10-21 13:31:48 UTC. --
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Listening on Portworx logging FIFO.
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Started Portworx FIFO logging reader.
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Starting Portworx OCI Container…
Oct 21 13:25:50 ip-10-0-132-7 systemd[1]: Started Portworx OCI Container.
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Rootfs found at /var/opt/pwx/oci/rootfs"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX binaries found at /var/opt/pwx/bin/px-runc"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Initializing as version 2.5.7.0-686a134 (OCI)"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="SPEC READ [1ef01fb2ba5288a806834cfb1d7c4304 /var/opt/pwx/oci/config.json]"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Enabling Sharedv4 NFS support …"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Setting up NFS service"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="> Initialized service controls via DBus{type:dbus,svc:nfs-server.service,id:0xc4201d7180}"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Checking mountpoints for following shared directories: [/var/lib/kubelet /var/lib/osd]"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Found following mountpoints for shared dirs: map[/var/lib/kubelet:{isMP=f,Opts=shared:3,Pare>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC arguments: -b -c px-cluster-d547b53f-7253-4336-b058-41e389fdda2b -kvdb_dev type=gp2,>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC mounts: /dev:/dev /etc/exports:/etc/exports /opt/pwx/oci/mounts/etc/hosts:/etc/hosts>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="PX-RunC env: ALERTMANAGER_PORTWORX_PORT=tcp://172.30.75.232:9093 ALERTMANAGER_PORTWORX_PORT_>
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Switched '/var/opt/pwx/oci' to PRIVATE mount propagation"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Found 2 usable runc binaries: /var/opt/pwx/bin/runc and /var/opt/pwx/bin/runc-fb"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Detected kernel release 4.18.0-193"
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:50Z" level=info msg="Exec: ["/var/opt/pwx/bin/runc" "run" "-b" "/var/opt/pwx/oci" "--no-new-keyring" ">
Oct 21 13:25:50 ip-10-0-132-7 portworx[605595]: Executing with arguments: -b -c px-cluster-d547b53f-7253-4336-b058-41e389fdda2b -kvdb_dev type=gp2,size=150 -s type=gp2,size=1000 -secre>
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Wed Oct 21 13:25:51 UTC 2020 : Running version 2.5.7.0-686a134 on Linux ip-10-0-132-7 4.18.0-193.24.1.el8_2.dt1.x86_64 #1 SMP Thu Sep 24>
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Version: Linux version 4.18.0-193.24.1.el8_2.dt1.x86_64 (mockbuild@x86-vm-08.build.eng.bos.redhat.com) (gcc version 8.3.1 20191121 (Red >
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: time="2020-10-21T13:25:51Z" level=error msg="Cannot listen on UNIX socket: listen unix /run/docker/plugins/pxd.sock: bind: no such file >
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: Error: failed to listen on pxd.sock
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /hostusr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /hostusr/src/linux-headers-4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: checking /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: found /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64
Oct 21 13:25:51 ip-10-0-132-7 portworx[605595]: ln: failed to create symbolic link ‘compiler-gcc5.h’: Read-only file system
Oct 21 13:25:53 ip-10-0-132-7 portworx[605595]: Using /usr/src/kernels/4.18.0-193.24.1.el8_2.dt1.x86_64 with defines CC=gcc-8
Oct 21 13:25:53 ip-10-0-132-7 portworx[605595]: Creating px fs…
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using cluster: px-cluster-d547b53f-7253-4336-b058-41e389fdda2b
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using kvdb device: type=gp2,size=150
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Warning skipping device exists check for: type=gp2,size=150.
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using storage device: type=gp2,size=1000
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Using scheduler: kubernetes
Oct 21 13:25:56 ip-10-0-132-7 portworx[605595]: Warning skipping device validation check for: type=gp2,size=1000.
Oct 21 13:25:57 ip-10-0-132-7 portworx[605595]: checking remote download site, please wait…
Oct 21 13:25:57 ip-10-0-132-7 portworx[605595]: Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting…
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Main process exited, code=exited, status=10/n/a
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Failed with result 'exit-code'.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.service: Consumed 6.432s CPU time
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Stopping Portworx FIFO logging reader…
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Stopped Portworx FIFO logging reader.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx-output.service: Consumed 33ms CPU time
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: Closed Portworx logging FIFO.
Oct 21 13:25:57 ip-10-0-132-7 systemd[1]: portworx.socket: Consumed 0 CPU time

Can you attach the complete log file? Please don’t share only an excerpt; I need to go through the complete log to understand the issue.

Hi Sanjay,

I don’t see an option to attach a text file here. I tried the upload button, but it says: “Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif, heic, heif).”

You can use https://gist.github.com/ or Pastebin, or save the file with one of the above extensions and upload it; I will rename it and check.

Satya,

Can you install the latest version, 2.6? It looks like 2.5.7 is failing to detect the latest kernel version. Please try the latest release, 2.6.1.3, which has support for the latest kernel.

Sure, I will try with 2.6 version and will update you with the results.

It is still failing with 2.6 as well. Is it a port issue?

No, it’s not a port issue; it’s a kernel dependency issue:
Failed to find patch fs dependency on remote site for kernel 4.18.0-193.24.1.el8_2.dt1.x86_64, exiting..

Can you tell me which operator version you have installed?

It’s Portworx 1.4.0.

Any update on this issue?

Can you do the following:

Let me know how it goes; if anything fails, please share the logs again.

It is still failing after I copied px.ko to all three worker nodes and restarted the Portworx service.

[core@ip-10-0-132-7 ~]$ ls -ltr /var/lib/osd/pxfs/latest/8.px.ko
-rw-r--r--. 1 root root 2391816 Oct 22 08:17 /var/lib/osd/pxfs/latest/8.px.ko
[core@ip-10-0-132-7 ~]$ date
Thu Oct 22 08:45:44 UTC 2020

Attaching the logs for the same; they contain yesterday’s logs as well.
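
For anyone repeating the check above on each worker, here is a minimal sketch that prints the running kernel release and reports whether a px module file is present. The directory is the one from the ls output in this post; the *.px.ko glob and the function names are my own assumptions for illustration, not something Portworx documents.

```shell
# Sketch only: PX_MODULE_DIR is taken from the path shown in this thread;
# the "*.px.ko" pattern and function names are assumptions.
PX_MODULE_DIR="${PX_MODULE_DIR:-/var/lib/osd/pxfs/latest}"

# Succeeds if at least one *.px.ko file exists in the given directory.
px_module_present() {
    for f in "$1"/*.px.ko; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Prints the kernel release and whether a module file was found.
check_px_kernel_module() {
    echo "kernel release: $(uname -r)"
    if px_module_present "$PX_MODULE_DIR"; then
        echo "px module found in $PX_MODULE_DIR"
    else
        echo "no px module in $PX_MODULE_DIR"
    fi
}
```

Run check_px_kernel_module on each worker node. It only confirms that a file exists; it does not verify that the module was built for the kernel reported by uname -r.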

Can you restart the Portworx service again and check? Some changes have been made and the repo has been updated.

Now PX volumes are getting created, but they are not stable: they detach instantly. Attaching all the logs.

$ PX_POD=$(oc get pods -l name=portworx -n kube-system -o jsonpath='{.items[0].metadata.name}')
$ oc exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status
Defaulting container name to portworx.
Use 'oc describe pod/portworx-7pprp -n kube-system' to see all of the containers in this pod.
PX is not running on this host: Could not reach 'HealthMonitor'

List of last known failures:

Type ID Resource Severity Count LastSeen FirstSeen Description
NODE NodeStartFailure 7d9de820-b8b6-4171-a94a-6f2d812e9029 ALARM 1 Oct 23 09:10:56 UTC 2020 Oct 23 09:10:56 UTC 2020 Failed to start Portworx: failed in internal kvdb setup: failed to create a kvdb connection to peer internal kvdb nodes dial tcp 10.0.135.69:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.
NODE KvdbConnectionFailed 7d9de820-b8b6-4171-a94a-6f2d812e9029 ALARM 1 Oct 23 09:10:50 UTC 2020 Oct 23 09:10:50 UTC 2020 Internal Kvdb: failed to create a kvdb connection to peer internal kvdb nodes dial tcp 10.0.135.69:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.
NODE PortworxMonitorInstallFailed ip-10-0-151-22 ALARM 1 Oct 23 08:59:40 UTC 2020 Oct 23 08:59:40 UTC 2020 Could not finalize OCI install: Timeout
NODE PortworxMonitorSchedulerInitializationFailed ip-10-0-151-22 ALARM 1 Oct 23 08:54:35 UTC 2020 Oct 23 08:54:35 UTC 2020 Could not init scheduler 'kubernetes': Could not find my node in Kubernetes cluster: Get https://172.30.0.1:443/api/v1/nodes: dial tcp 172.30.0.1:443: connect: no route to host
command terminated with exit code 1
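
The KvdbConnectionFailed alarm above names a peer and port (10.0.135.69:9019). A quick way to test that from another worker is sketched below. It relies on bash's /dev/tcp redirection and the coreutils timeout command; the function names are hypothetical. Note that "connection refused" usually means nothing is listening on the peer yet, which is a different symptom from a blocked port, where the connection tends to hang until it times out.

```shell
# Sketch only: function names are made up for illustration.
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Prints a one-line verdict for host:port.
report_port() {
    if port_open "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
    fi
}

# Example, using the peer and port from the alarm text:
#   report_port 10.0.135.69 9019
```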

Satya,

I see the following error: "Failed to attach DriveSet (3b897f20-9729-4dc0-a633-0d00709130f8): Availability zone misma>

Can you share the output of oc get nodes --show-labels and the spec file you used for the install?
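
Since the error mentions an availability zone mismatch, it may help to reduce the --show-labels output to one zone per node. The sketch below assumes the zone is carried in either the topology.kubernetes.io/zone or the failure-domain.beta.kubernetes.io/zone label and that LABELS is the last column; the function name is hypothetical.

```shell
# Sketch only: reads `oc get nodes --show-labels` output on stdin and
# prints "<node> <zone>" per node ("unknown" if no zone label is found).
nodes_with_zone() {
    awk 'NR > 1 {
        zone = "unknown"
        n = split($NF, labels, ",")   # LABELS is assumed to be the last column
        for (i = 1; i <= n; i++) {
            if (labels[i] ~ /^(topology\.kubernetes\.io|failure-domain\.beta\.kubernetes\.io)\/zone=/) {
                split(labels[i], kv, "=")
                zone = kv[2]
            }
        }
        print $1, zone
    }'
}

# Usage:
#   oc get nodes --show-labels | nodes_with_zone
```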

Updated the link with the output at the end:

#Portworx Operator will install pods only on nodes that have the label node-role.kubernetes.io/compute=true

I am labeling the nodes using a shell script:
WORKER_NODES=$(oc get nodes | grep worker | awk '{print $1}')
for wnode in $WORKER_NODES; do
    oc label nodes $wnode node-role.kubernetes.io/compute=true
done

The following ports are open for the master and worker security groups:
aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $WORKER_GROUP_ID
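
The ten commands above follow one pattern, so they can be written as a loop, which makes the port list easier to audit. This is a sketch under the same assumptions as the commands above (WORKER_GROUP_ID and MASTER_GROUP_ID already set); the DRY_RUN switch and the function names are my own additions.

```shell
# Sketch only: same ports and flags as the commands listed above.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
PX_PORTS="17001-17020 111 2049 20048 9001-9022"

# Echoes the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Opens each port range on the worker group for both source groups.
open_px_ports() {
    for port in $PX_PORTS; do
        for src in "$MASTER_GROUP_ID" "$WORKER_GROUP_ID"; do
            run aws ec2 authorize-security-group-ingress \
                --group-id "$WORKER_GROUP_ID" --protocol tcp \
                --port "$port" --source-group "$src"
        done
    done
}
```

Calling open_px_ports prints the ten commands for review; DRY_RUN=0 open_px_ports applies them.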

Hi,
Any update on this issue?