Failed to start Portworx: error loading node identity: Cause: ProviderInternal Error: InvalidVolume.NotFound: The volume 'vol-xxxxx' does not exist

Hi,

I’m unable to bring the Portworx cluster in EKS back up after removing two of the nodes (EC2 instances). With pxctl status I get:

Failed to start Portworx: error loading node identity: Cause: ProviderInternal Error: InvalidVolume.NotFound: The volume 'vol-09483313007e2265c' does not exist.

The volume vol-09483313007e2265c does not exist, but there are 4 unattached volumes with the label PX-DO-NOT-DELETE-....: vol-0c68a2f28a9e34e75, vol-015b23bd166e8ef4a, vol-051ad7992f4924428 and vol-08a48d5aafa885abc.
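
For reference, the leftover unattached drives can be listed with something like the AWS CLI call below (a sketch; it assumes the PX-DO-NOT-DELETE-... label is stored in the volume's Name tag, which may differ in your setup):

# List unattached ("available") EBS volumes whose Name tag starts with PX-DO-NOT-DELETE
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" "Name=tag:Name,Values=PX-DO-NOT-DELETE-*" \
  --query "Volumes[].VolumeId" \
  --output text
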
I have added two new nodes to try to recreate the cluster.

The newly added nodes do not have the PX-DO-NOT-DELETE-.... volumes attached.

What should I do to remove the old volume and let Portworx regenerate the cluster?

@ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: time="2022-02-22T16:37:15Z" level=info msg="Alerts initialized successfully for this cluster"                                                    │
│ @ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: time="2022-02-22T16:37:16Z" level=warning msg="Failed to find locally attached drive set: Cause: ProviderInternal Error: InvalidVolume.NotFound: │
│  The volume 'vol-09483313007e2265c' does not exist.\n\tstatus code: 400, request id: c4e274a2-879e-4de4-851d-6bf73b7fdddf"                                                                                      │
│ @ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: time="2022-02-22T16:37:16Z" level=error msg="Error loading identity: Cause: ProviderInternal Error: InvalidVolume.NotFound: The volume 'vol-0948 │
│ 3313007e2265c' does not exist.\n\tstatus code: 400, request id: c4e274a2-879e-4de4-851d-6bf73b7fdddf" func=setNodeInfo package=boot                                                                             │
│ @ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: time="2022-02-22T16:37:16Z" level=error msg="error loading node identity: Cause: ProviderInternal Error: InvalidVolume.NotFound: The volume 'vol │
│ -09483313007e2265c' does not exist.\n\tstatus code: 400, request id: c4e274a2-879e-4de4-851d-6bf73b7fdddf" func=InitAndBoot package=boot                                                                        │
│ @ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: time="2022-02-22T16:37:16Z" level=error msg="Could not init boot manager" error="error loading node identity: Cause: ProviderInternal Error: Inv │
│ alidVolume.NotFound: The volume 'vol-09483313007e2265c' does not exist.\n\tstatus code: 400, request id: c4e274a2-879e-4de4-851d-6bf73b7fdddf"                                                                  │
│ @ip-10-31-42-13.eu-central-1.compute.internal portworx[10241]: PXPROCS[INFO]: px daemon exited with code: 1                                     

I have discovered a ConfigMap named px-cloud-drive-... with a list of volumes and node IDs, so I replaced the NodeId values with the new systemUUIDs found on the Kubernetes nodes, and now the EBS volumes are correctly attached to the nodes. But now I get this error:

failed to setup internal kvdb: failed to create a kvdb connection to peer internal kvdb nodes [[http://10.31.6.80:9019 http://10.31.12.162:9019 http://10.31.38.245:9019]]: context deadline exceeded. Make sure peer kvdb nodes are healthy.

There is another ConfigMap named px-bootstrap-..., where I have updated the ID with the new NodeId and the IP with the node IP. Now the only message I get is: Node status not OK (STATUS_INIT)

level=error msg="License watch stopped, re-subscribing..." error="Watch Stopped"                           │
level=error msg="Watch on key pwx/prep-eks-xxxxxxxxxxxxx/lic/.csum2 closed without  a Cancel response."                                                                                                                                                                                              level=info msg="Global license watcher [re]installed." startIdx=193496                                     │
level=info msg="Watch cb for key pwx/prep-eks-xxxxxxxxxxxxx/lic/.csum2 returned err : Watch Stopped"                                                                                                                                                                                                level=warning msg="Could not retrieve PX node status" error="Node status not OK (STATUS_INIT)\n"          

Hello @HugoFreire It looks like you removed the 2 KVDB nodes from the cluster.
Before removal, you should check the number of PX nodes and the number of KVDB nodes, as we maintain 2 quorums here (PX cluster and KVDB).
pxctl sv kvdb members - lists the 3 nodes that are taking part in the KVDB.
pxctl clouddrive list - lists the disks being used by each PX node, including the disks used for KVDB.
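
If pxctl is not on the host PATH, one way to run these is from inside one of the Portworx pods, for example (a sketch; it assumes the kube-system namespace and the usual name=portworx pod label):

# Pick one running Portworx pod and run pxctl inside it
PX_POD=$(kubectl get pods -n kube-system -l name=portworx -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$PX_POD" -- /opt/pwx/bin/pxctl service kvdb members
kubectl exec -n kube-system "$PX_POD" -- /opt/pwx/bin/pxctl clouddrive list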

Since you removed the 2 nodes, both the KVDB quorum and the PX cluster node quorum were broken.
If you replace any nodes in the cluster, you need to make sure you do it 1 by 1 so the KVDB membership can be shifted to other available nodes.
The replaced node ideally picks up the original drive set (the disks used by the previous instance) and comes up with the same identity and data.
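
A rough one-node-at-a-time replacement flow would look like this (a sketch; <node-name> is a placeholder, and it assumes pxctl is reachable as described above):

# 1. Check the current internal KVDB members and which drive set each PX node owns
pxctl service kvdb members
pxctl clouddrive list

# 2. Cordon and drain ONE node in Kubernetes, then terminate that EC2 instance
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 3. Wait until the KVDB reports a healthy quorum again (membership moves to another
#    available PX node) before replacing the next node
pxctl service kvdb members

# 4. The replacement instance should pick up the old PX-DO-NOT-DELETE drive set and
#    rejoin with the same identity; verify with:
pxctl status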

Thanks @varunjain, I will try these steps.