Failed in internal kvdb setup: Kvdb took too long to start

Hey guys, I'm running this on EKS 1.14.9 with Portworx Essentials 2.5 (with Calico installed). Has anyone seen this before?

```
kubectl exec -it portworx-***** -n kube-system /opt/pwx/bin/pxctl status
PX is not running on this host

List of last known failures:

Type    ID                      Resource                                Severity        Age     Description
NODE    NodeStartFailure        ceaf76cc-bc5d-4ee0-8050-379a49da0a88    ALARM           12s     Failed to start Portworx: failed in internal kvdb setup: Kvdb took too long to start
NODE    InternalKvdbSetupFailed ceaf76cc-bc5d-4ee0-8050-379a49da0a88    ALARM           12s     Failed to reinitialize internal kvdb: Kvdb took too long to start
```

So far, I’ve added the following tags to my nodes:
px/enabled=true
px/metadata-node=true
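
For reference, assuming these are applied as standard Kubernetes node labels (the node name below is a placeholder), the command looks something like:

```sh
# Apply the Portworx node labels; <node-name> stands in for each EKS worker node
kubectl label nodes <node-name> px/enabled=true px/metadata-node=true
```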

Has anyone experienced this before? When I installed it, it worked perfectly, but after one round of autoscaling (we're using spot instances, so all of the nodes got replaced), Portworx started throwing the error above and stopped fulfilling PVCs.

The error `Kvdb took too long to start` usually indicates a problem starting up the key-value datastore that Portworx uses to store its metadata.

Are you letting Portworx provision the EBS volumes as its backing disks? What are the specs of those EBS volumes?

If you could post the input configuration for Portworx, that would provide more info:

```
kubectl exec -it <portworx-pod> -n kube-system -- cat /etc/pwx/config.json
```
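
For the EBS specs, one way to check is via the AWS CLI (a sketch, assuming the CLI is configured and the volume IDs below are replaced with the real ones):

```sh
# Inspect the EBS volumes backing the Portworx nodes; vol-xxxxxxxx/vol-yyyyyyyy are placeholders.
# Look at VolumeType, Size and Iops in the output.
aws ec2 describe-volumes --volume-ids vol-xxxxxxxx vol-yyyyyyyy \
  --query 'Volumes[].{ID:VolumeId,Type:VolumeType,Size:Size,Iops:Iops}'
```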

This is running on EC2 spot instances, but it has been able to create the volumes in the past. I installed Calico and added some tags to the EC2 Auto Scaling group (which caused all running instances to be terminated), and then I got this error.

I ended up deleting the entire spec and re-issuing a new one, and now I'm getting a licensing error.

You will need to unlink your old PX-Essentials cluster in that case.
Follow this post - PX-Essentials license is expired

OK, so I reinstalled and it was working as expected. However, we're running on EC2 spot instances and they all cycled overnight (all the Kubernetes services that weren't running on Portworx drives stayed up).

But now we’re getting the following:
```
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb IP: 10.0.3.101 Kvdb PeerPort: 9018 ClientPort: 9019" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Name: 26b197b0-1846-43a6-996d-2f41d085e254" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Cluster State: existing" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:44:49Z" level=info msg="Kvdb Peer Domain Name: portworx-2.internal.kvdb" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16 00:44:50,045 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
time="2020-04-16T00:44:51Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:11Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16 00:45:18,361 INFO reaped unknown pid 1466
time="2020-04-16T00:45:21Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:31Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:45:41Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:45:50: PX-Watchdog: (pid 1534): PX REST server died or did not started. return code 7. Timeout 480
time="2020-04-16T00:45:51Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:01Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:11Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:46:21: PX-Watchdog: Waiting for px process to start
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: 2020-04-16_ 0:46:21: PX-Watchdog: (pid 1534): Begin monitoring
time="2020-04-16T00:46:21Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:31Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
time="2020-04-16T00:46:41Z" level=warning msg="Could not retrieve PX node status" error="Get http://127.0.0.1:9001/v1/cluster/nodehealth: dial tcp 127.0.0.1:9001: connect: connection refused"
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="Failed to start internal kvdb." err="Kvdb took too long to start" fn=kv-utils.StartKvdb id=26b197b0-1846-43a6-996d-2f41d085e254
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="failed in internal kvdb setup: Kvdb took too long to start" func=InitAndBoot package=boot
@ip-**-**-**-**.**-***********.compute.internal portworx[20061]: time="2020-04-16T00:46:49Z" level=error msg="Could not init boot manager" error="failed in internal kvdb setup: Kvdb took too long to start"
```

Another update on this: it seems that if the volumes start to get attached before the Portworx pod is fully initialised, we get a failure to attach the volume to the pod.

The Portworx pods eventually came up and passed health checks, and then we needed to delete the pods within the StatefulSet that were attempting to attach the PVC. Once the pods were deleted, they got re-created and they all came up as expected.
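
Roughly, the manual workaround was just this (pod and namespace names below are placeholders):

```sh
# Delete the pods stuck on Attach/Mount; the StatefulSet controller recreates them
# and the attach goes through once Portworx is healthy on the node.
kubectl -n <namespace> delete pod <statefulset-pod-0> <statefulset-pod-1>
```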

Do we need to add an initialisation pod to each StatefulSet / Deployment that checks that Portworx is live on that node before starting?

In a 3-node cluster, Portworx requires a quorum number of nodes (at least 2) to be up and running before it can process any PVC request. So you can expect a StatefulSet using a Portworx PVC to get Attach/Mount errors while quorum is being restored. However, Kubernetes will retry these requests and they will eventually succeed.
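
If you want to verify whether the internal kvdb has quorum while nodes are cycling, something like the following should list the kvdb members and their state (assuming a PX version that exposes this subcommand):

```sh
# List internal kvdb members from a node where PX is running
kubectl exec -it <portworx-pod> -n kube-system -- /opt/pwx/bin/pxctl service kvdb members
```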

You can add a dependency to your StatefulSet using an init container that checks whether Portworx is up and running on the node.
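
As a rough sketch (not an official manifest), the init container could simply poll the node-health endpoint that shows up in your logs, assuming the node's IP is injected into the pod (e.g. via the downward API as NODE_IP):

```sh
# Poll the Portworx REST endpoint on the local node until it responds,
# then let the main container start. Port 9001 and the nodehealth path are
# taken from the logs above; NODE_IP is assumed to be set via the downward API.
until curl -sf "http://${NODE_IP}:9001/v1/cluster/nodehealth"; do
  echo "waiting for Portworx on ${NODE_IP}..."
  sleep 5
done
```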

I understand, but Portworx is still failing kvdb initialisation without any indication as to why.

I found a reference architecture here

under Approach B: Hyperconverged Storage and Compute clusters
“Scaling up and scaling down of the cluster is not that frequent”

How I read that is: Portworx is slow to adapt to changes on nodes where there's a disk attached.

So something like AWS EC2 spot instances, where you're spinning nodes up and down within 5 minutes at a time, can't be used with Portworx Essentials (you'd need to run storageless nodes, which is an enterprise-only feature).

To get around this we're having to use on-demand EC2 instances for Portworx and share the drives, but we've had similar issues spinning those up.

@tschirmer You should be able to run storageless nodes with PX-Essentials as well. You can do that by setting the following field while generating the spec on central.portworx.com, on the Cloud/AWS page (Step 2):

Max storage nodes per availability zone

If you set this field to 1 and, say, your EC2 instances are spread across 3 zones, you will have 3 storage nodes in your cluster and the rest of the nodes will be storageless.
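
If you want to confirm how an existing cluster was deployed, that field normally ends up as an argument on the Portworx container, so a quick check (the exact flag spelling and DaemonSet name may vary by spec version, so treat this as a sketch) is:

```sh
# Look for the max-storage-nodes-per-zone setting in the deployed DaemonSet
kubectl -n kube-system get ds portworx -o yaml | grep -i max_storage_nodes
```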

Having said that, I would like to know why PX is not coming up after you tear down and restart the cluster. Can you provide the following details:

  1. Can you share your Portworx DaemonSet spec? I'd like to see the input arguments given to the container (see the command sketch after this list).
  2. What is the cluster size?
  3. Are the EC2 instances getting terminated at the same time, or in a staggered fashion?
  4. When you say “on-demand EC2 instances for Portworx and share the drives”, what do you mean by sharing the drives?
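
For item 1, something along these lines should capture the container arguments (assuming the default DaemonSet name and namespace):

```sh
# Dump the arguments passed to the Portworx container in the DaemonSet
kubectl -n kube-system get ds portworx \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="portworx")].args}'
```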