Failed in internal kvdb setup: Kvdb took too long to start

Hey guys, I'm running this on EKS 1.14.9 with Portworx Essentials 2.5 (with Calico installed). Has anyone seen this before?

```
kubectl exec -it portworx-***** -n kube-system /opt/pwx/bin/pxctl status
PX is not running on this host

List of last known failures:

Type    ID                      Resource                                Severity        Age     Description
NODE    NodeStartFailure        ceaf76cc-bc5d-4ee0-8050-379a49da0a88    ALARM           12s     Failed to start Portworx: failed in internal kvdb setup: Kvdb took too long to start
NODE    InternalKvdbSetupFailed ceaf76cc-bc5d-4ee0-8050-379a49da0a88    ALARM           12s     Failed to reinitialize internal kvdb: Kvdb took too long to start
```

So far, I’ve added the following tags to my nodes:
px/enabled=true
px/metadata-node=true
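
For reference, assuming these are applied as standard Kubernetes node labels (the node name below is a placeholder), the command looks something like:

```sh
# Apply the Portworx node labels; <node-name> stands in for each EKS worker node
kubectl label nodes <node-name> px/enabled=true px/metadata-node=true
```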

Has anyone experienced this before? When I installed it, it worked perfectly, but after one round of autoscaling (we're using spot instances, so all of the nodes got replaced), Portworx started throwing the error above and stopped fulfilling PVCs.

The error `Kvdb took too long to start` usually indicates a problem starting up the key-value datastore that Portworx uses to store its metadata.

Are you letting Portworx provision the EBS volumes as its backing disks? What are the specs of those EBS volumes?

If you could post the input configuration for Portworx, that would provide more info:

```
kubectl exec -it <portworx-pod> -n kube-system -- cat /etc/pwx/config.json
```
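
For the EBS specs, one way to check is via the AWS CLI (a sketch, assuming the CLI is configured and the volume IDs below are replaced with the real ones):

```sh
# Inspect the EBS volumes backing the Portworx nodes; vol-xxxxxxxx/vol-yyyyyyyy are placeholders.
# Look at VolumeType, Size and Iops in the output.
aws ec2 describe-volumes --volume-ids vol-xxxxxxxx vol-yyyyyyyy \
  --query 'Volumes[].{ID:VolumeId,Type:VolumeType,Size:Size,Iops:Iops}'
```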

This is running on EC2 spot instances, but it has been able to create the volumes in the past. I installed Calico and added some tags to the EC2 Auto Scaling group (which caused all running instances to be terminated), and then I got this error.

I ended up deleting the entire spec and re-issuing a new one, and now I'm getting a licensing error.

You will need to unlink your old PX-Essentials cluster in that case.
Follow this post - PX-Essentials license is expired

OK, so I reinstalled and it was working as expected. However, we're running on EC2 spot instances and they all cycled overnight (all the Kubernetes services that weren't running on Portworx drives stayed up).

But now we’re getting the following:
```
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb IP: 10.0.3.101 Kvdb PeerPort: 9018 ClientPort: 9019" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Name: 26b197b0-1846-43a6-996d-2f41d085e254" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Cluster State: existing" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Peer Domain Name: portworx-2.internal.kvdb" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16 00:44:50,045 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
time="2020-04-16T00:44:51Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:11Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16 00:45:18,361 INFO reaped unknown pid 1466
time="2020-04-16T00:45:21Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:31Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:41Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:45:50: PX-Watchdog: (pid 1534): PX REST server died or did not started. return code 7. Timeout 480
time="2020-04-16T00:45:51Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:11Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:46:21: PX-Watchdog: Waiting for px process to start
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:46:21: PX-Watchdog: (pid 1534): Begin monitoring
time="2020-04-16T00:46:21Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:31Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:41Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="Failed to start internal kvdb." err="Kvdb took too long to start" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="failed in internal kvdb setup: Kvdb took too long to start" func=InitAndBoot package=boot
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="Could not init boot manager" error="failed in internal kvdb setup: Kvdb took too long to start"
```

Another update on this: it seems that if the volumes start to get attached before the Portworx pod is fully initialised, we get a failure to attach the volume to the pod.

The Portworx pods eventually came up and passed health checks, and then we needed to delete the pods within the StatefulSet that were attempting to attach the PVC. Once the pods were deleted, they got re-created and they all came up as expected.
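
Roughly, the manual workaround was just this (pod and namespace names below are placeholders):

```sh
# Delete the pods stuck on Attach/Mount; the StatefulSet controller recreates them
# and the attach goes through once Portworx is healthy on the node.
kubectl -n <namespace> delete pod <statefulset-pod-0> <statefulset-pod-1>
```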

Do we need to add an initialisation pod to each StatefulSet / Deployment that checks that Portworx is live on that node before starting?

In a 3-node cluster, Portworx requires a quorum number of nodes (at least 2) to be up and running before it can process any PVC request. So you can expect a StatefulSet using a Portworx PVC to get Attach/Mount errors while quorum is being restored. However, Kubernetes will retry these requests and they will eventually succeed.
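
If you want to verify whether the internal kvdb has quorum while nodes are cycling, something like the following should list the kvdb members and their state (assuming a PX version that exposes this subcommand):

```sh
# List internal kvdb members from a node where PX is running
kubectl exec -it <portworx-pod> -n kube-system -- /opt/pwx/bin/pxctl service kvdb members
```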

You can add a dependency to your StatefulSet using an init container that checks whether Portworx is up and running on the node.
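
As a rough sketch (not an official manifest), the init container could simply poll the node-health endpoint that shows up in your logs, assuming the node's IP is injected into the pod (e.g. via the downward API as NODE_IP):

```sh
# Poll the Portworx REST endpoint on the local node until it responds,
# then let the main container start. Port 9001 and the nodehealth path are
# taken from the logs above; NODE_IP is assumed to be set via the downward API.
until curl -sf "http://${NODE_IP}:9001/v1/cluster/nodehealth"; do
  echo "waiting for Portworx on ${NODE_IP}..."
  sleep 5
done
```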

I understand, but Portworx is still failing kvdb initialisation without any indication as to why.

I found a reference architecture here

under Approach B: Hyperconverged Storage and Compute clusters
“Scaling up and scaling down of the cluster is not that frequent”

How I read that is: Portworx is slow to adapt to changes on nodes where there's a disk attached.

So something like AWS EC2 spot instances, where you're spinning nodes up and down within 5 minutes at a time, can't be used with Portworx Essentials (you'd need to run storageless nodes, which is an enterprise-only feature).

To get around this we're having to use on-demand EC2 instances for Portworx and share the drives, but we've had similar issues spinning those up.

@tschirmer You should be able to run storageless nodes with PX-Essentials as well. You can do that by setting the following field while generating the spec on central.portworx.com, on the Cloud/AWS page (Step 2):

Max storage nodes per availability zone

If you set this field to 1 and, say, your EC2 instances are spread across 3 zones, you will have 3 storage nodes in your cluster and the rest of the nodes will be storageless.
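
If you want to confirm how an existing cluster was deployed, that field normally ends up as an argument on the Portworx container, so a quick check (the exact flag spelling and DaemonSet name may vary by spec version, so treat this as a sketch) is:

```sh
# Look for the max-storage-nodes-per-zone setting in the deployed DaemonSet
kubectl -n kube-system get ds portworx -o yaml | grep -i max_storage_nodes
```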

Having said that, I would like to know why PX is not coming up after you tear down and restart the cluster. Can you provide the following details:

  1. Can you share your Portworx DaemonSet spec? I'd like to see the input arguments given to the container (see the command sketch after this list).
  2. What is the cluster size?
  3. Are the EC2 instances getting terminated at the same time, or in a staggered fashion?
  4. When you say “on-demand EC2 instances for Portworx and share the drives”, what do you mean by sharing the drives?
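
For item 1, something along these lines should capture the container arguments (assuming the default DaemonSet name and namespace):

```sh
# Dump the arguments passed to the Portworx container in the DaemonSet
kubectl -n kube-system get ds portworx \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="portworx")].args}'
```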