Recover Data from a failed cluster

Hello all. I have a PX Essentials storage cluster that I use for training and a home lab. It was running as part of a Kubernetes cluster and all was well with the world. One day the Kubernetes etcd instance lost quorum, and I have not been able to get it back up easily. The easiest path with that flavor of k8s is to just create a new cluster and deploy all of my manifests again. Easy peasy, but I am still missing the data from Portworx.

I still have full access to the nodes and all disks and drives are fully accessible.

So here is my question: is there a way to access the volumes that were provisioned by PX and extract the stored data so I can re-provision? All of my searches turn up instructions for using pxctl to extract a snapshot or mount a volume, but I keep getting a "Portworx is starting" error.

Can I a) mount a volume without Portworx, or b) disconnect the Portworx process from the k8s instance so it can start on its own?

Examples for reference:
node3 / # opt/pwx/bin/pxctl status
PX stopped working 2.6s ago. Last status: Could not init boot manager (error="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out")

also:

node3 / # opt/pwx/bin/pxctl status
PX is starting (pid3540678), please see alerts below

Anyone know the answer?

Can you clarify a few things?

  1. When you lost etcd quorum, did you delete the etcd cluster, Kubernetes cluster, and existing Portworx cluster and reinstall?

  2. Can you provide logs from your Portworx nodes?
    kubectl -n kube-system logs -l name=portworx

So far I’ve kept the systems as-is. The only things I’ve done are these:

  • Brought down all etcd instances, created a new single-node cluster, and restored the DB onto it (a rough sketch of that restore is below this list)
  • (Accidentally) removed the local configuration for kubelet and etcd from one of the nodes (that node is 10.0.0.176, which is listed in the logs below; I can shut it down if it helps and try more troubleshooting)
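
For reference, here is roughly what that single-node restore looked like on my side (just a sketch; the snapshot path, data dir, and <etcd-ip> are placeholders, and flags can differ slightly between etcd versions):

 # restore the snapshot into a fresh data dir for a one-member cluster
 ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
   --name etcd-single \
   --data-dir /var/lib/etcd-restored \
   --initial-cluster etcd-single=http://<etcd-ip>:2380 \
   --initial-advertise-peer-urls http://<etcd-ip>:2380
 # then point the etcd service at the restored --data-dir and start it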

There is no load balancing set up on the kube-api side of the cluster, and it’s pointed at a different master.

All software is still intact, all partitions are still intact.

dgrekov$ kubectl -n kube-system logs portworx-6hkbn -c portworx
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)


dgrekov$ kubetail portworx -n kube-system
Will tail 15 logs...
....
portworx-api-bl5bf
....
Error from server: Get https://10.0.0.176:10250/containerLogs/kube-system/alertmanager-portworx-0/alertmanager?follow=true&sinceSeconds=10: dial tcp 10.0.0.176:10250: connect: connection refused
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server: Get https://10.0.0.176:10250/containerLogs/kube-system/alertmanager-portworx-0/config-reloader?follow=true&sinceSeconds=10: dial tcp 10.0.0.176:10250: connect: connection refused
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver, verb=get, resource=nodes, subresource=proxy)
Error from server: Get https://10.0.0.176:10250/containerLogs/kube-system/portworx-jl8pl/csi-node-driver-registrar?follow=true&sinceSeconds=10: dial tcp 10.0.0.176:10250: connect: connection refused
Error from server: Get https://10.0.0.176:10250/containerLogs/kube-system/portworx-api-bl5bf/portworx-api?follow=true&sinceSeconds=10: dial tcp 10.0.0.176:10250: connect: connection refused
Error from server: Get https://10.0.0.176:10250/containerLogs/kube-system/portworx-jl8pl/portworx?follow=true&sinceSeconds=10: dial tcp 10.0.0.176:10250: connect: connection refused

Would love to get your insight on what to try. Ultimately I would like to mount the PX volumes somewhere so that I can pull the data off and rebuild everything from scratch, put better redundancy on the Kubernetes side, and use the operator on the PX side.

Also, set up PX-Backup…

Is Kubernetes running? It looks like your client is having issues talking to the API server.

Run: kubectl get no

Since you can’t get logs using the client, can you please log on to one of the Portworx storage nodes, run the following, and attach the logs here?

journalctl -lu portworx*

kubectl replies with nodes and such, but I’m not able to start or stop pods. What I really want to do is blow it all away and build a new cluster with better components. But before that, I would like to pull the data off the PX volumes. Is there any way to mount them on a host without using Kubernetes?

Here is what I get:

**dgrekov@node3** **/ $** sudo /opt/pwx/bin/pxctl status
PX is starting (pid1185763), please see alerts below

And for Kubernetes:

kubectl get no
NAME                STATUS     ROLES    AGE    VERSION
node0.dimagre.com   Ready      <none>   225d   v1.18.3
node1.dimagre.com   Ready      <none>   224d   v1.18.3
node2.dimagre.com   NotReady   <none>   225d   v1.18.3
node3.dimagre.com   Ready      <none>   200d   v1.18.3
nodev.dimagre.com   Ready      <none>   218d   v1.18.3

As for the journal log entries:

**dgrekov@node3** **/ $** sudo journalctl -lu portworx* | tail -n 25
Jan 11 15:48:58 node3.dimagre.com portworx[47700]: 2021-01-11 15:48:58,437 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
Jan 11 15:48:59 node3.dimagre.com portworx[47700]: time="2021-01-11T15:48:59Z" level=info msg="Alerts initialized successfully for this cluster"
Jan 11 15:49:03 node3.dimagre.com portworx[47700]: 2021-01-11 15:49:03,758 INFO reaped unknown pid 401499
Jan 11 15:49:06 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:06Z" level=error msg="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out" func=InitAndBoot package=boot
Jan 11 15:49:06 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:06Z" level=error msg="Could not init boot manager" error="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out"
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: PXPROCS[INFO]: px daemon exited with code: 1
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: 2021-01-11 15:49:07,570 INFO exited: pxdaemon (exit status 1; not expected)
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: 2021-01-11 15:49:07,591 INFO spawned: 'pxdaemon' with pid 401845
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: 2021-01-11 15:49:07,592 INFO reaped unknown pid 401800
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: PXPROCS[INFO]: Started px-storage with pid 401884
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: bash: connect: Connection refused
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: bash: /dev/tcp/localhost/9009: Connection refused
Jan 11 15:49:07 node3.dimagre.com portworx[47700]: PXPROCS[INFO]: px-storage not started yet...sleeping
Jan 11 15:49:10 node3.dimagre.com portworx[47700]: PXPROCS[INFO]: Started px with pid 401895
Jan 11 15:49:10 node3.dimagre.com portworx[47700]: PXPROCS[INFO]: Started watchdog with pid 401896
Jan 11 15:49:10 node3.dimagre.com portworx[47700]: 2021-01-11_15:49:10: PX-Watchdog: Starting watcher
Jan 11 15:49:10 node3.dimagre.com portworx[47700]: 2021-01-11_15:49:10: PX-Watchdog: Waiting for px process to start
Jan 11 15:49:10 node3.dimagre.com portworx[47700]: 2021-01-11_15:49:10: PX-Watchdog: (pid 401895): Begin monitoring
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="Registering [kernel] as a volume driver"
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="Registered the Usage based Metering Agent...."
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="Setting log level to info(4)"
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="read config from env var" func=init package=boot
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="read config from config.json" func=init package=boot
Jan 11 15:49:11 node3.dimagre.com portworx[47700]: time="2021-01-11T15:49:11Z" level=info msg="Alerts initialized successfully for this cluster"
Jan 11 15:49:13 node3.dimagre.com portworx[47700]: 2021-01-11 15:49:13,081 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

also:

**dgrekov@node1** **~ $** sudo journalctl -lu portworx* | tail -n 25
Jan 11 15:51:13 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:13Z" level=info msg="Alerts initialized successfully for this cluster"
Jan 11 15:51:15 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:15,067 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
Jan 11 15:51:18 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:18,575 INFO reaped unknown pid 3108787
Jan 11 15:51:20 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:20Z" level=error msg="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out" func=InitAndBoot package=boot
Jan 11 15:51:20 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:20Z" level=error msg="Could not init boot manager" error="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out"
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: PXPROCS[INFO]: px daemon exited with code: 1
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:21,099 INFO exited: pxdaemon (exit status 1; not expected)
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:21,127 INFO spawned: 'pxdaemon' with pid 3109285
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:21,129 INFO reaped unknown pid 3109225
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: PXPROCS[INFO]: Started px-storage with pid 3109323
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: bash: connect: Connection refused
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: bash: /dev/tcp/localhost/9009: Connection refused
Jan 11 15:51:21 node1.dimagre.com portworx[418677]: PXPROCS[INFO]: px-storage not started yet...sleeping
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: PXPROCS[INFO]: Started px with pid 3109335
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: PXPROCS[INFO]: Started watchdog with pid 3109336
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: 2021-01-11_15:51:24: PX-Watchdog: Starting watcher
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: 2021-01-11_15:51:24: PX-Watchdog: Waiting for px process to start
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: 2021-01-11_15:51:24: PX-Watchdog: (pid 3109335): Begin monitoring
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="Registering [kernel] as a volume driver"
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="Registered the Usage based Metering Agent...."
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="Setting log level to info(4)"
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="read config from env var" func=init package=boot
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="read config from config.json" func=init package=boot
Jan 11 15:51:24 node1.dimagre.com portworx[418677]: time="2021-01-11T15:51:24Z" level=info msg="Alerts initialized successfully for this cluster"
Jan 11 15:51:26 node1.dimagre.com portworx[418677]: 2021-01-11 15:51:26,551 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

and finally:

**dgrekov@node0** **~ $** sudo journalctl -lu portworx* | tail -n 25
Jan 11 15:51:53 node0.dimagre.com portworx[800]: 2021-01-11 15:51:53,614 INFO reaped unknown pid 192821
Jan 11 15:52:03 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:03Z" level=info msg="Alerts initialized successfully for this cluster"
Jan 11 15:52:33 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:33Z" level=error msg="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: Post https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps: dial tcp 10.3.0.1:443: i/o timeout" func=InitAndBoot package=boot
Jan 11 15:52:33 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:33Z" level=error msg="Could not init boot manager" error="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: Post https://10.3.0.1:443/api/v1/namespaces/kube-system/configmaps: dial tcp 10.3.0.1:443: i/o timeout"
Jan 11 15:52:36 node0.dimagre.com portworx[800]: 2021-01-11_15:52:36: PX-Watchdog: (pid 192878): PX REST server died or did not started. return code 7. Timeout 60
Jan 11 15:52:43 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:43Z" level=error msg="Failed to flush alerts: timed out in flushing alerts, 6 alerts still in queue"
Jan 11 15:52:43 node0.dimagre.com portworx[800]: PXPROCS[INFO]: px daemon exited with code: 1
Jan 11 15:52:43 node0.dimagre.com portworx[800]: 2021-01-11 15:52:43,442 INFO exited: pxdaemon (exit status 1; not expected)
Jan 11 15:52:43 node0.dimagre.com portworx[800]: 2021-01-11 15:52:43,471 INFO spawned: 'pxdaemon' with pid 192920
Jan 11 15:52:43 node0.dimagre.com portworx[800]: 2021-01-11 15:52:43,472 INFO reaped unknown pid 192867
Jan 11 15:52:43 node0.dimagre.com portworx[800]: PXPROCS[INFO]: Started px-storage with pid 192959
Jan 11 15:52:43 node0.dimagre.com portworx[800]: bash: connect: Connection refused
Jan 11 15:52:43 node0.dimagre.com portworx[800]: bash: /dev/tcp/localhost/9009: Connection refused
Jan 11 15:52:43 node0.dimagre.com portworx[800]: PXPROCS[INFO]: px-storage not started yet...sleeping
Jan 11 15:52:46 node0.dimagre.com portworx[800]: PXPROCS[INFO]: Started px with pid 192970
Jan 11 15:52:46 node0.dimagre.com portworx[800]: PXPROCS[INFO]: Started watchdog with pid 192971
Jan 11 15:52:46 node0.dimagre.com portworx[800]: 2021-01-11_15:52:46: PX-Watchdog: Starting watcher
Jan 11 15:52:46 node0.dimagre.com portworx[800]: 2021-01-11_15:52:46: PX-Watchdog: Waiting for px process to start
Jan 11 15:52:46 node0.dimagre.com portworx[800]: 2021-01-11_15:52:46: PX-Watchdog: (pid 192970): Begin monitoring
Jan 11 15:52:46 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:46Z" level=info msg="Registering [kernel] as a volume driver"
Jan 11 15:52:46 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:46Z" level=info msg="Registered the Usage based Metering Agent...."
Jan 11 15:52:46 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:46Z" level=info msg="Setting log level to info(4)"
Jan 11 15:52:46 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:46Z" level=info msg="read config from env var" func=init package=boot
Jan 11 15:52:46 node0.dimagre.com portworx[800]: time="2021-01-11T15:52:46Z" level=info msg="read config from config.json" func=init package=boot
Jan 11 15:52:48 node0.dimagre.com portworx[800]: 2021-01-11 15:52:48,892 INFO success: pxdaemon entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

Thanks, it looks like the etcd server you configured Portworx with is having issues and Portworx cannot communicate with it.

The etcd server should be listed in /etc/pwx/config.json. Can you please post the output from this file?

@Ryan_Wallner thank you very much for your help and patience btw.

I used a default configuration from the Portworx portal. I think I set it to use its own etcd server and not share one with Kubernetes.

Here is the config.json:

{
  "alertingurl": "",
  "clusterid": "px-cluster-19235250-e738-4df6-b6e3-be7a12a535b7",
  "dataiface": "",
  "bootstrap": true,
  "mgtiface": "",
  "scheduler": "kubernetes",
  "secret": {
    "secret_type": "k8s",
    "cluster_secret_key": ""
  },
  "storage": {
    "devices": [
      "/dev/sda10"
    ],
    "cache": [],
    "rt_opts": {},
    "max_storage_nodes_per_zone": 0,
    "journal_dev": "auto",
    "system_metadata_dev": "",
    "kvdb_dev": ""
  },
  "version": "1.0"
}

Thanks.

Okay, Essentials will use the internal KVDB by default. Once we can get etcd for PX back up, you should be able to start PX and recover the data. It looks like you ran tail on the logs. Can you provide the full logs? Instead of -n 25, can you output them to a file and attach it?

(example)
journalctl -lu portworx* > /tmp/w1.log

It’s mainly the same error repeating, but please find the full logs here: iCloud.

Thanks.

So I just confirmed with our team that PX Essentials relies on Kubernetes for the configmap to bootstrap your Portworx cluster back to a running state.

Jan 09 09:35:56 node3.dimagre.com portworx[1361046]: time="2021-01-09T09:35:56Z" level=error msg="Could not init boot manager" error="Failed to initialize k8s bootstrap: Failed to create configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7: etcdserver: request timed out"

So this means we need to get your Kubernetes cluster into a healthy state again so we can get that configmap. You said your original k8s etcd cluster had issues and you replaced it. I would start by checking the logs of the various Kubernetes services first to get a sense of what’s failing. How did you install your Kubernetes cluster?

On kube nodes

 journalctl -u kubelet

On master(s)

 journalctl -u kube-apiserver

The logs may also be located under /var/log/, depending on the OS and k8s distro.

So I didn’t replace the cluster yet; I’ve kept the servers in the same half-limping state so as not to affect PX negatively in any new way. Is there any way to avoid using this configmap, or to stub it out somewhere else?

But to clarify, the k8s cluster is NOT in a healthy state as of now.

No, unfortunately not. Since Portworx is not in a running state, Essentials depends on Kubernetes being healthy in order to bootstrap again. The failure is with the configmap, and that needs to be fetched from Kubernetes.
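
As a quick sanity check once your API server is responding again, you can ask Kubernetes for the bootstrap configmap directly (the name below is copied from your logs). A "NotFound" response is fine, since PX is the one that creates it; another "etcdserver: request timed out" means the control plane is still unhealthy:

 kubectl -n kube-system get configmap px-bootstrap-pxcluster19235250e7384df6b6e3be7a12a535b7 -o yaml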

Is there a way to get at the data after reinstalling portworx or from a different instance somehow? Can I mount my partition manually?

The way to do this would be to recover the etcd data and the disks. Etcd cannot start because Kubernetes is down.

I can ask internally whether you can recover your cluster by restoring the etcd backup (internal KVDB) and mounting the same disks on new hosts, but I am not sure.

The easiest way would be to fix your Kubernetes cluster to the point where pods can run, so Portworx can start, and then you can pull the data off.

I will ask internally if there is another way.

Just to clarify, does anything from the old cluster’s k8s side need to be recovered, or can I just blow that etcd away?

Try the following.

Start a single-node external etcd just for Portworx.
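
For example, something like this on a host the PX nodes can reach (just a sketch, assuming Docker is available; the image tag and <etcd-ip> are placeholders to adjust):

 docker run -d --name px-etcd -p 2379:2379 -p 2380:2380 \
   quay.io/coreos/etcd:v3.4.13 \
   /usr/local/bin/etcd \
   --name px-etcd \
   --data-dir /etcd-data \
   --listen-client-urls http://0.0.0.0:2379 \
   --advertise-client-urls http://<etcd-ip>:2379 \
   --listen-peer-urls http://0.0.0.0:2380 \
   --initial-advertise-peer-urls http://<etcd-ip>:2380 \
   --initial-cluster px-etcd=http://<etcd-ip>:2380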

Then, find the latest Portworx internal KVDB dump on one of the Portworx nodes’ filesystems; the latest one will exist on only one Portworx node. When you find it, rename it to pwx_kvdb_disaster_recovery_golden.dump. See here for more details: Internal KVDB
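
The internal KVDB dumps are normally written under /var/lib/osd/kvdb_backup/ on the storage nodes (the exact path and file names can vary by install), so something along these lines on each node should turn up the newest one:

 # check each Portworx node for the most recent internal KVDB dump
 ls -lt /var/lib/osd/kvdb_backup/
 # on the node holding the newest dump, copy it to the recovery name
 sudo cp /var/lib/osd/kvdb_backup/<latest-dump-file> \
   /var/lib/osd/kvdb_backup/pwx_kvdb_disaster_recovery_golden.dump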

Then update your /etc/pwx/config.json on every Portworx node to use the external etcd.

"bootstrap": true,
  "kvdb": [
    "etcd:http://<etcd-ip>"
  ],

This should allow you to bootstrap without k8s. To be on the safe side, do not delete any etcd cluster or k8s cluster until you have recovered PX.

Awesome, will try this now.

To clarify, should I remove the rest of the configuration or just add the kvdb key? I’ve tried adding that key, but it’s removed from the file each time I start Portworx.

No, you should not need to remove the rest of the configuration. Just make sure bootstrap is true and the kvdb section points to your external etcd. Based on your config above, it should be:

{
  "alertingurl": "",
  "clusterid": "px-cluster-19235250-e738-4df6-b6e3-be7a12a535b7",
  "dataiface": "",
  "bootstrap": true,
  "kvdb": [
    "etcd:http://<etcd-ip>"
  ],
  "mgtiface": "",
  "scheduler": "kubernetes",
  "secret": {
    "secret_type": "k8s",
    "cluster_secret_key": ""
  },
  "storage": {
    "devices": [
      "/dev/sda10"
    ],
    "cache": [],
    "rt_opts": {},
    "max_storage_nodes_per_zone": 0,
    "journal_dev": "auto",
    "system_metadata_dev": "",
    "kvdb_dev": ""
  },
  "version": "1.0"
}

Make sure to start Portworx using systemctl, not Kubernetes.
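
For example, on each node after editing /etc/pwx/config.json (the unit name matches the journalctl output you posted above):

 sudo systemctl restart portworx
 # watch it come up and confirm it is no longer stuck in "PX is starting"
 sudo journalctl -fu portworx*
 sudo /opt/pwx/bin/pxctl status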