Portworx Essentials 2.5 failing to install on OCP 4.5 on AWS

Can you share your StorageCluster spec?

I am using this px-spec file to create it: https://gist.github.com/satyamodi/e8b4e41a2a2133494b211d78a74f41cf

So you're not installing from Operator Hub?

I am using a Terraform script to install the cluster and Portworx. I am installing the operator from the CLI using the YAML below:


apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: kube-system-operatorgroup
  namespace: kube-system
spec:
  serviceAccount:
    metadata:
      creationTimestamp: null
  targetNamespaces:
    - kube-system
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generation: 1
  name: portworx-certified
  namespace: kube-system
spec:
  channel: stable
  installPlanApproval: Automatic
  name: portworx-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: portworx-operator.v1.4.0
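
From the CLI this is applied with something like the following (the file name is just a placeholder; in my setup Terraform applies it):

# Apply the OperatorGroup and Subscription defined above
oc apply -f portworx-operator-install.yaml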

You're mixing the installs; you can use either the Operator or the spec (DaemonSet) method, not both.

Can you uninstall and try again:

  • Uninstall Portworx: curl -fsL https://install.portworx.com/px-wipe | bash
  • Uninstall the Portworx Operator.

Once everything is cleaned up, just apply your spec file again to install; a sketch of the full sequence is below.
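
A minimal sketch of the clean-up and reinstall, assuming the operator was subscribed into kube-system as in the YAML above and that the generated spec file is named px-spec.yaml (both names are assumptions):

# 1. Wipe the existing Portworx installation
curl -fsL https://install.portworx.com/px-wipe | bash

# 2. Remove the operator objects created earlier (names taken from the Subscription above)
oc delete subscription portworx-certified -n kube-system
oc delete csv portworx-operator.v1.4.0 -n kube-system
oc delete operatorgroup kube-system-operatorgroup -n kube-system

# 3. Re-apply the DaemonSet spec (assumed file name)
oc apply -f px-spec.yaml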

Hi Sanjay,

It's not about mixing the installs; it does not work on an OCP 4.5 cluster. We released the Terraform and AWS Quick Start with OCP 4.3 and Portworx earlier. Even when I do everything manually, Portworx does not come up on a 4.5 cluster. We have this issue with the AWS Quick Start as well: Portworx works with OCP 4.3 but not with 4.5.

Can you please try the installation on a 4.5 cluster from your side?

Can you find the entry in the log file starting with “Availability zone misma …”? Can you send me that line in full? It looks like the volume was created in one zone while the instance trying to claim it is in a different zone.

I have copied the lines hidden from the screen: https://gist.github.com/satyamodi/1421e456f68e008f60f54b7b29354522

This is the error:
Availability zone mismatch for volume vol-0e65d466e1463c2db: (Volume Zone: ap-south-1b Instance Zone: ap-south-1a)

Can you check which zone your instance is launched in? Are your worker nodes distributed properly across zones A/B/C? From the error above, the instance is running in zone A while the volume was created in zone B.
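
You could compare the two from the AWS CLI with something like this (the volume ID is from the error above; the instance ID is a placeholder to substitute):

# Zone the volume lives in
aws ec2 describe-volumes --volume-ids vol-0e65d466e1463c2db \
    --query 'Volumes[].AvailabilityZone' --output text

# Zone the worker instance is running in (replace i-xxxxxxxx with your instance ID)
aws ec2 describe-instances --instance-ids i-xxxxxxxx \
    --query 'Reservations[].Instances[].Placement.AvailabilityZone' --output text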

If the zone A/B/C distribution is correct, then there is a high chance that a previous uninstall was not cleaned up properly and left orphan volumes behind. If that is the case, you can simply uninstall Portworx and delete the volumes from the AWS console.

Then install Portworx again. Let me know how it goes, or I can open a remote session just to look at this.
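
If you prefer the CLI to the console, a rough way to spot leftovers is to list unattached volumes and then delete only the ones you have confirmed are orphans from the previous install (the commands below are generic, not Portworx-specific):

# List unattached EBS volumes in the region, with their zone, size and creation time
aws ec2 describe-volumes --filters Name=status,Values=available \
    --query 'Volumes[].[VolumeId,AvailabilityZone,Size,CreateTime]' --output table

# After confirming a volume is an orphan from the previous install, delete it
aws ec2 delete-volume --volume-id vol-xxxxxxxxxxxxxxxxx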

I have created a new cluster. Nodes and volumes are distributed properly in all three zones. Attaching the screenshots.

Is the install still failing on the new cluster?

Yes, and I am creating a new cluster every time. Earlier the volumes were not coming up at all; after you made some changes to your repo they attach, but then detach instantly. We had a similar issue with detaching volumes on OCP 4.3 a long time back, and it was resolved after opening the ports below on your team's suggestion.

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 17001-17020 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 111 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 2049 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 20048 --source-group $WORKER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $MASTER_GROUP_ID

aws ec2 authorize-security-group-ingress --group-id $WORKER_GROUP_ID --protocol tcp --port 9001-9022 --source-group $WORKER_GROUP_ID

The earlier problem was with kernel version detection; it was failing there, not at volume attaching/detaching. Do you have some time in the next 30-40 minutes to get on a call so I can take a look?

Sure we can connect.

Can you join https://us02web.zoom.us/j/85195402362 ?

I joined the zoom call.

Try joining the same link again.

What was the resolution for this, please? I'm having the exact same issue.

This is due to RHCOS updating the kernel much more frequently than the Portworx version: Cloud Pak for Data standardizes on a fixed (usually trailing/older) version of Portworx, and a given PX version only supports the kernels available at the time of its release.

We have improved handling of this in a newer version of Portworx (2.6.1.5), but for now we have been using a workaround script to get past the issue.
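
To confirm you are hitting the same mismatch before running the script, you can compare the node's running kernel with the kernels listed in the bundled fslibs manifest (same path the script below uses); if your kernel does not appear in the manifest, the workaround applies:

# Running kernel on the node
uname -r

# Kernels the shipped btrfs fslibs were built for
grep "$(uname -m)" /opt/pwx/oci/rootfs/pxlib_data/px-fslibs/btrfs.all.el8.kos.manifest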

Here are the contents (name the file el8-fslib-fix.sh, then chmod +x it and execute it on the problem node):

#!/bin/bash
# Workaround: add btrfs fslib modules for a newer RHCOS kernel that the
# bundled Portworx px-fslibs archive does not yet know about.

CURKERNEL=$(uname -r)
ARCH=$(uname -m)

EL8_FSLIBS_NM=btrfs.all.el8.kos
EL8_FSLIBS_ARCHIVE=${EL8_FSLIBS_NM}.xz
EL8_FSLIBS_MANIFEST=${EL8_FSLIBS_NM}.manifest

FSLIB_DIR=/opt/pwx/oci/rootfs/pxlib_data/px-fslibs
[ ! -d ${FSLIB_DIR} ] && echo "Unable to find 'px-fslibs' directory" && exit 1
[ ! -e ${FSLIB_DIR}/${EL8_FSLIBS_ARCHIVE} ] && echo "Unable to find 'px-fslibs' archive" && exit 1

cd ${FSLIB_DIR}

# If the manifest already lists the running kernel, nothing needs to change.
if [ -e ${EL8_FSLIBS_MANIFEST} ]; then
    egrep -q "${ARCH}/${CURKERNEL}/" ${EL8_FSLIBS_MANIFEST}
    [ $? -eq 0 ] && echo "No modification needed, 'px-fslibs' exists for this kernel." && exit 0
fi

# Need to update the archive: unpack it, then reuse the closest available
# kernel's modules for the currently running kernel.
mkdir -p Unpack && cd Unpack
cp ../${EL8_FSLIBS_NM}.* .
tar xJf ./${EL8_FSLIBS_ARCHIVE}
cd x86_64/
# Split e.g. "4.18.0-193.24.1.el8_2.x86_64" into base version and patch level.
KVER=$(echo ${CURKERNEL} | awk -F- '{print $1}')
PATCH=$(echo ${CURKERNEL} | awk -F- '{print $2}' | awk -F. '{print $1}')
KERN=${KVER}-${PATCH}

# Pick the newest packaged kernel from the same base-patch series.
NEXT_FSLIB=$(egrep ${KERN}.[0-9] ../${EL8_FSLIBS_MANIFEST} | awk -F'/' '{print $2}' | sort -r | head -n 1)
[ -z "${NEXT_FSLIB}" ] && echo "Unable to find next available 'px-fslibs' for this kernel." && exit 1
cp -a ${NEXT_FSLIB} ${CURKERNEL}
cd ../
# Repack the archive, rebuild the manifest, then replace the originals.
XZ_OPTS=-9 tar -Jhcf btrfs.new.xz x86_64
[ $? -ne 0 ] && echo "Failed to create updated 'px-fslibs' archive." && exit 1
tar -tJf ./btrfs.new.xz | egrep btrfs.ko$ > btrfs.new.manifest
[ $? -ne 0 ] && echo "Failed to create updated 'px-fslibs' manifest." && exit 1
cp ./btrfs.new.manifest ../${EL8_FSLIBS_MANIFEST}
cp ./btrfs.new.xz ../${EL8_FSLIBS_ARCHIVE}
cd
echo "Done updating 'px-fslibs' archive.."


#systemctl start portworx; journalctl -lfu portworx\*
# copy new files to other nodes
# /opt/pwx/oci/rootfs/pxlib_data/px-fslibs/btrfs.all.el8.kos.manifest
# /opt/pwx/oci/rootfs/pxlib_data/px-fslibs/btrfs.all.el8.kos.xz

This needs to be run on one node and Portworx restarted via systemctl (as mentioned in the comments above); then, for any other node, copy the two files referenced at the end to the same location and restart Portworx there as well. A sketch of that distribution step is below.
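
A rough sketch of distributing the updated files, assuming SSH access as the core user and hypothetical node names (worker-1, worker-2) that you would replace with your own:

# Run from the node where el8-fslib-fix.sh was executed.
FSLIB_DIR=/opt/pwx/oci/rootfs/pxlib_data/px-fslibs
for NODE in worker-1 worker-2; do
    scp ${FSLIB_DIR}/btrfs.all.el8.kos.manifest ${FSLIB_DIR}/btrfs.all.el8.kos.xz core@${NODE}:/tmp/
    ssh core@${NODE} "sudo cp /tmp/btrfs.all.el8.kos.* ${FSLIB_DIR}/ && sudo systemctl restart portworx"
done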