OnPrem Internal KVDB installation problem, connection refused on node's port 9019

Hi,
I tried installing Portworx Essentials OnPrem without internal kvdb, and it deployed successfully. Since that setup isn't recommended, I then tried installing with internal kvdb on a separate device (/dev/sdb). However, it couldn't start the service responsible for kvdb. All ports (9001-9022) are open. What could prevent the internal kvdb from starting?

Can we get more details about the failure?
Could you post the output of the following commands?

/opt/pwx/bin/pxctl status
/opt/pwx/bin/pxctl alerts show

Just to clarify, you first installed Portworx without internal kvdb? Did you provide an external etcd endpoint to Portworx in that case?

Thank you for your reply. Here is the output:
Status

  1. Failed to start Portworx: failed in internal kvdb setup: failed to create a kvdb connection to peer internal kvdb nodes [[http://10.1.11.29:9019]]: dial tcp 10.1.11.29:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.
  2. Internal Kvdb: failed to create a kvdb connection to peer internal kvdb nodes [[http://10.1.11.29:9019]]: dial tcp 10.1.11.29:9019: connect: connection refused. Make sure peer kvdb nodes are healthy.

Alerts
PX is not running on this host

By "without internal kvdb" I mean I skipped specifying a separate device (/dev/sdb), so kvdb shared the same drive as the Portworx volume IO, which is not recommended.

Extra output from px-log-tail:
PXPROCS: Started watchdog with pid 17526
PX-Watchdog: Waiting for px process to start
PX-Watchdog: (pid 17429): Begin monitoring
level=info msg="Registered the Usage based Metering Agent…"
level=info msg="Registering [kernel] as a volume driver"
level=info msg="Setting log level to info(4)"
level=info msg="read config from env var" func=init package=boot
level=info msg="read config from config.json" func=init package=boot
level=info msg="Alerts initialized successfully for this cluster"
level=error msg="Could not init boot manager" error="failed in internal kvdb setup: failed to create a kvdb connection to peer internal kvdb nodes [[http://10.1.11.29:9019]]: dial tcp 10.1.11.29:9019: connect: connection refused. Make sure peer kvdb nodes are healthy."
PXPROCS: px daemon exited with code: 1
INFO exited: pxdaemon (exit status 1; not expected)
INFO spawned: 'pxdaemon' with pid 17475
INFO reaped unknown pid 17418
PXPROCS: Started px-storage with pid 17506
bash: connect: Connection refused
bash: /dev/tcp/localhost/9009: Connection refused
PXPROCS: px-storage not started yet…sleeping

Environment: Hyper-V VMs with MAC address spoofing enabled, Ubuntu Server 18.04, Kubernetes deployed by Kubespray.

Between the two installations did you wipe the Portworx cluster?
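
In the meantime, a quick check from another node can tell you whether anything is actually listening on the internal kvdb peer port. This is just a sketch using standard Linux tools (nc and ss), nothing Portworx-specific:

# From any other node: is the kvdb peer port on 10.1.11.29 reachable?
nc -zv 10.1.11.29 9019

# On node 10.1.11.29 itself: is anything listening on the PX API/kvdb ports?
sudo ss -tlnp | grep -E ':(9009|9019)'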

For a configuration change like adding a separate device for the internal kvdb, you will need to do the following:

  1. Uninstall Portworx. For uninstalling, follow this doc. Make sure you unlink your old PX-Essentials cluster.
  2. Re-install the Portworx cluster with the correct spec (see the sketch after this list for one way to confirm the kvdb device made it into the applied spec).
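
As a rough sanity check that the new spec actually carries the kvdb device, you can grep the applied spec for it. This assumes Portworx runs as a DaemonSet named portworx in kube-system, and that the generator's kd=/dev/sdb parameter ends up as a kvdb_dev argument; adjust names to your setup:

kubectl -n kube-system get ds portworx -o yaml | grep -i kvdb_dev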

Yes, I ran into that problem and have already solved it. The issue now is that there is no log output indicating it has successfully deployed the internal kvdb on the specified device (/dev/sdb). All three nodes' output is the same.
The spec: https://install.portworx.com/2.5?mc=false&kbver=1.18.3&oem=esse&user=1085a1d0-a328-11ea-97e6-f6e09c7a4e5e&b=true&s=%2Fdev%2Fsdc&j=auto&kd=%2Fdev%2Fsdb&c=px-cluster-02b69ee3-d853-4230-b41c-4fb82d2de69e&stork=true&lh=true&st=k8s

You are right. We have improved the CLI to display the currently used kvdb drive in an upcoming PX release. You will be able to see the device being used as part of

/opt/pwx/bin/pxctl status

Until then, to verify that the internal kvdb is using your device, you can run the following command:

kubectl exec -it <portworx-pod> -- blkid | grep kvdb

The device you provided should be listed in the above command's output. Portworx fingerprints the drives you provide, and in this case you should see a label called kvdbvol on the kvdb drive you specified (/dev/sdb).
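
If you have shell access to the node, an equivalent check directly on the host (plain blkid, filtering for the kvdbvol label mentioned above) would be something like:

sudo blkid | grep -i kvdbvol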

Unfortunately there was no output. The blkid output for the device is:
/dev/sdb: PTUUID="dc94207e-8466-9047-8049-17b8b2ec8d4e" PTTYPE="gpt"
Not assigned, as you can see. Maybe there is a manual way to solve this problem.

Do you have ssh access to this node? Can you post the complete blkid output here?

Here you go:
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/sda1: UUID="280C-F93F" TYPE="vfat" PARTUUID="42e325da-da6b-4074-baf9-b0bffffba21a"
/dev/sda2: UUID="48a82b21-79cc-461e-819d-c6a25ded2733" TYPE="ext4" PARTUUID="9e9b5dcc-a792-4049-92d1-9468380428d5"
/dev/sda3: UUID="24563c14-cc8d-4f3d-8d07-fac27929d74d" TYPE="ext4" PARTUUID="d0752743-fe76-3649-b431-a5f3ad02f6e2"
/dev/sdc: LABEL="mdpoolid=0,pxpool=0,mdvol" UUID="31d1aeae-85ff-47f9-82b4-39a2d21aafa0" UUID_SUB="8629adf2-9ad3-4840-b37d-23ea846ef18a" TYPE="btrfs"
/dev/sda4: PARTUUID="90a443c6-4d8e-6142-9468-62449e40caf7"
/dev/sdb: PTUUID="dc94207e-8466-9047-8049-17b8b2ec8d4e" PTTYPE="gpt"

The above output indicates that this node is not completely initialized. Did you go through the uninstall process? Once you run the uninstall process you should not see any Portworx labels. The following command will clean up your Portworx installation:

curl -fsL https://install.portworx.com/px-wipe | bash

The following label should be removed after the uninstall:

/dev/sdc: LABEL="mdpoolid=0,pxpool=0,mdvol" UUID="31d1aeae-85ff-47f9-82b4-39a2d21aafa0" UUID_SUB="8629adf2-9ad3-4840-b37d-23ea846ef18a" TYPE="btrfs"
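
After the wipe, a quick sanity check on each node (plain blkid against the labels seen above) should come back empty, roughly like:

sudo blkid | grep -iE 'pxpool|mdvol|kvdbvol' || echo "no Portworx labels found"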

I did wipe the cluster between reinstallations. /dev/sdc is for Portworx volume IO; it is passed as

-s /dev/sdc

in the spec file.

If the cluster is still in the same state after the wipe and re-install, can you provide the logs from node 10.1.11.29? You can get the logs by running the following command on the node:

journalctl -lau portworx-output > px.log
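
If the file is large, a quick filter for the kvdb-related lines (plain grep, nothing Portworx-specific) can help narrow it down:

grep -iE 'kvdb|9019' px.log | tail -n 50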

I've wiped the px-cluster again, and this time it complained that partitions were still there from the previous install. I formatted /dev/sdb and /dev/sdc. Again, no effect, so I did:

sudo dd if=/dev/zero of=/dev/sdb bs=1M

sudo dd if=/dev/zero of=/dev/sdc bs=1M

After that I wiped the px-cluster again and generated a spec with no auto journaling on the devices. Finally, after the deploy, it came up and works like a charm. So the problem was that the px-wipe.sh script wasn't enough; an extra wipe of the disks was needed. Thank you for your help.
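
Glad it is working. For future readers: zeroing the whole device is one way to do it, but clearing just the leftover partition-table and filesystem signatures should also be enough. wipefs is a standard Linux utility, not a Portworx tool; a rough sketch, to be run only on devices you intend to hand to Portworx:

# Remove all filesystem and partition-table signatures from the leftover devices
sudo wipefs -a /dev/sdb
sudo wipefs -a /dev/sdc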