Pod failed to mount a Portworx volume - connection refused error

I am deploying Portworx on Kubernetes using the spec YAML generated at install.portworx.com. I can create a PVC using a Portworx StorageClass, but when I try to mount the volume it fails on some environments.
The logs show it is trying to connect to an IP that is wrong, but I am not sure where it is getting that IP from.
For example:

MountVolume.SetUp failed for volume "pvc-2cb7697b-4bdf-11e9-a7f6-000c29e98f02" : Get http://node1:9001/v1/osd-volumes/versions: dial tcp 10.0.0.10:9001: connect: connection refused

This usually happens when DNS is not properly configured to resolve the node name.
In the example above, node1 resolves to 10.0.0.10, which is likely not the correct IP in this case.
To resolve this problem, fix DNS resolution in your Kubernetes cluster so that node1 resolves to the correct IP, or, as a quicker fix, update /etc/hosts on all nodes with the correct hostnames and IPs.
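If you want to confirm what the hostname resolves to on the affected node, a quick check (assuming standard Linux tooling; node1 and 10.0.0.11 below are just placeholders for your node name and its real IP) would be:

getent hosts node1
nslookup node1

and, for the quicker /etc/hosts workaround, an entry along these lines on each node:

echo "10.0.0.11 node1" | sudo tee -a /etc/hosts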

This can be caused by a range of things too:

In my case, when looking at the node in question, the Portworx pod would return:

time="2020-04-20T02:10:39Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"

@node2[8381]: time="2020-04-20T02:10:43Z" level=error msg="Failed to start internal kvdb." err="Kvdb took too long to start" fn=kv-utils.StartKvdb id=************-****-****-****-************

@node2[8381]: time="2020-04-20T02:10:43Z" level=error msg="failed in internal kvdb setup: Kvdb took too long to start" func=InitAndBoot package=boot

 Failed to reinitialize internal kvdb: Kvdb took too long to start

To get the error specifically:

kubectl exec -it portworx-tsa22 -n kube-system -- /opt/pwx/bin/pxctl status
PX is not running on this host

List of last known failures:

Type    ID                      Resource                                Severity        Age     Description

NODE    NodeStartFailure        ******-****-****-****-************    ALARM           56s     Failed to start Portworx: failed in internal kvdb setup: Kvdb took too long to start
NODE    InternalKvdbSetupFailed ******-****-****-****-************   ALARM           56s     Failed to reinitialize internal kvdb: Kvdb took too long to start
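For reference, the kind of checks that show the internal kvdb state on a node look like this (assuming a default internal-kvdb install; the port numbers and exact pxctl syntax may differ on your Portworx version):

ss -lntp | grep -E '9001|9019'
/opt/pwx/bin/pxctl service kvdb members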

I’ve contacted Portworx for support, but they haven’t helped me out on this at this point. Anyone got any ideas?

I’ve already added tags to the node and ensured that the kvdb can

@tschirmer this seems to be a different issue. The problem I posted above is a pod failing to mount/attach a Portworx volume; in your case it seems Portworx is not even running.

What version of Portworx are you using? Could you please post the Portworx pod logs here so I can try to better understand what is happening?
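For collecting the logs, something along these lines should work (assuming the default install in kube-system and the name=portworx pod label; adjust if your deployment differs):

kubectl -n kube-system get pods -l name=portworx -o wide
kubectl -n kube-system logs -l name=portworx --tail=200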

I ended up deleting Portworx and reinstalling for a third time. I wanted to run it on EC2 spot instances (where we regularly cycle nodes) and expected that Portworx would reconnect to the existing underlying data.

That doesn’t seem to be the case, and we needed to provision it ONLY on our on-demand EC2 instances. It’s a bit of a shame really, because it locks us out of using Portworx where we really need it.

Right now, as a result, we’re running a lot of HA services on EC2 on-demand instead of the much cheaper EC2 spot. We’ve found an alternative to Portworx and will be testing it over the next few months, but the only way we solved this was with a reinstallation.