How to recover from a node full (no space left on the node) state?

When one or more nodes in a cluster have gone partially offline because they have run out of space, there are several remedies to consider, depending on what caused the condition.

Common causes of this node state include, but are not limited to, the following:-

1. A single volume whose size is greater than the total node capacity.
2. Over-provisioning of Portworx volumes beyond the node's actual total capacity.
3. Accumulation of Portworx volume replicas and local snapshots on the nodes.
4. Unwanted volumes created during tests and left in the detached state.
5. Portworx volumes themselves filling up, for example with application logs.

Possible Solutions:-

1.> Portworx Volume House-Keeping:-

  • This requires the user to perform house-keeping (clean-up) on the Portworx volumes by deleting unwanted logs, files and reports generated by the applications using them.
    OR
  • Implement data compression / purging techniques at the application level.

Go to the node where the volume is mounted and do the following:

a. Find the volume mount point and change directory (cd) to it :-

mount | grep px
/dev/pxd/pxd349498214087081554 on /var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18

cd /var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18

[root@worker-node-2] # ls -lrth
total 0
-rw-r--r-- 1 root root 3G Apr 19 07:25 abc
-rw-r--r-- 1 root root 10G Apr 19 07:26 db.log
-rw-r--r-- 1 root root 5G Apr 19 07:26 front-end.log

b. Identify and delete the unwanted log files or backups within the volume.
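
To help with step b, the largest files under the mount point can be listed before deciding what to delete. The following is a minimal sketch using only the standard find, du, sort and head tools; the function name largest_files and the default of 10 results are illustrative choices, not part of Portworx.

```shell
# List the N largest regular files under a directory, biggest first.
# Usage: largest_files <directory> [count]
largest_files() {
  dir="$1"
  count="${2:-10}"
  # -xdev keeps find on this one filesystem (the mounted volume).
  # du -k prints "<disk-usage-in-KiB><TAB><path>" for each file.
  find "$dir" -xdev -type f -exec du -k {} + | sort -rn | head -n "$count"
}
```

From inside the mount point, `largest_files . 5` would print the five files that use the most disk space. Review the list with the application owner before removing anything.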

2.> Identify unwanted volumes / local snapshots (Created for Testing)

  • Identify volumes or local snapshots that were created in error or for testing and are no longer in use, and delete them.
  • Prefer Cloud Snapshots over local snapshots where possible, since Cloud Snapshots are not stored on the nodes and so do not consume node capacity.

Note 1:- Remember that a snapshot of a repl-3 volume is also a repl-3 snapshot. This means that the snapshots of that volume, whether in use or not, also consume disk space on the other nodes.

Note 2:- Be careful when deleting scheduled applicationBackup / CloudSnap volumes and incremental snapshots.

  • Listing only the volumes that have a replica on a given node :-
    pxctl volume list -v --node-id <node-id>

  • Listing only the local snapshots that have a replica on a given node :-
    pxctl volume list -s --node-id <node-id>
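
The two listings above can be combined into a small helper that prints both views for one node. This is a sketch only: it uses just the pxctl volume list flags shown above, the function name list_node_usage and the PXCTL variable are hypothetical, and the node ID is assumed to be copied from the output of pxctl status.

```shell
# Path to the pxctl binary; override PXCTL if it is not on PATH (assumption).
PXCTL="${PXCTL:-pxctl}"

# Print the volumes and the local snapshots that have a replica on one node.
# Usage: list_node_usage <node-id>
list_node_usage() {
  node_id="$1"
  if [ -z "$node_id" ]; then
    echo "usage: list_node_usage <node-id>" >&2
    return 1
  fi
  echo "== Volumes on node ${node_id} =="
  "$PXCTL" volume list -v --node-id "$node_id"
  echo "== Local snapshots on node ${node_id} =="
  "$PXCTL" volume list -s --node-id "$node_id"
}
```

Anything in either list that is detached and no longer needed is a candidate for deletion, subject to Note 2 above.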

3.> Move / Reduce the Portworx Volume Replicas

  • You can free a lot of space on an affected node by reducing the replication level of volumes, or by moving their replicas off that node.

  • Modify the StorageClass to include fg: true. This parameter applies to the group of volumes created from the same storage class, and ensures that the replicas of a volume are not created on a node where replicas of other volumes in the group reside.
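
As a sketch of the first point, the replication level of a volume can be lowered with pxctl volume ha-update, naming the node that should give up its replica. The helper below only prints the command so that it can be reviewed before it is run. The function name repl_reduce_cmd is hypothetical, and the volume name, replication level and node ID are placeholders that would come from pxctl volume inspect and pxctl status. Please check the flags against pxctl volume ha-update --help for your Portworx version.

```shell
# Build (but do not run) the command that lowers a volume's replication
# level and drops the replica held by the given node.
# Usage: repl_reduce_cmd <volume> <new-repl-level> <node-id>
repl_reduce_cmd() {
  volume="$1"
  repl="$2"
  node_id="$3"
  # Portworx replication levels range from 1 to 3.
  case "$repl" in
    1|2|3) ;;
    *) echo "replication level must be 1, 2 or 3" >&2; return 1 ;;
  esac
  echo "pxctl volume ha-update --repl=${repl} --node ${node_id} ${volume}"
}
```

To move a replica rather than remove it, a common approach is to first raise the level with --node set to a node that has free space, wait for the new replica to sync, and then lower it again naming the full node.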

4.> Plan / calculate optimal sizing / re-sizing of a Portworx volume:

  • Be sure to stay within the node capacity limits.
  • By default, data written to Portworx volumes is striped only across the disks available to the node. It is not striped across the cluster nodes.
  • Portworx does allow the creation of volumes with a higher capacity than the node, or even the global cluster capacity, because volumes are thin-provisioned.
  • It is often observed that applications continue to write data to a thin-provisioned volume until it far exceeds the node capacity.
  • This not only leads to an application outage but can also cause the node to enter a FULL / unavailable state.
  • An alternative is to use aggregated volumes, where the volume data is striped across the nodes in the cluster, leading to lower usage per node. Aggregated volumes require more nodes in the cluster if replication is also desired.
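
Because volumes are thin-provisioned, it helps to compare the total provisioned size of the volumes that have a replica on a node with that node's capacity. The arithmetic can be sketched as below. The function name and the figures in the usage note are made up for illustration, and all sizes are assumed to be whole GiB.

```shell
# Print the size provisioned on a node as a percentage of its capacity.
# Usage: provisioned_pct <node-capacity-GiB> <volume-size-GiB>...
provisioned_pct() {
  capacity="$1"
  shift
  if [ "$capacity" -le 0 ]; then
    echo "capacity must be greater than 0" >&2
    return 1
  fi
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  # Integer arithmetic: the result is rounded down.
  echo $((total * 100 / capacity))
}
```

For example, `provisioned_pct 500 200 300 250` prints 150, meaning the three volumes could grow to one and a half times what a 500 GiB node can hold. A value above 100 is allowed by thin provisioning but is a signal to monitor actual usage closely.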

5.> Implement Volume Placement Strategy:

6.> Expand the Storage Pool on the nodes :

  • If you have reviewed all of the above and expect your cluster to grow in future, you can expand the node capacity (Vertical Scaling).
  • Depending upon your infrastructure, there are 2 methods of expanding your existing storage pools:
  1. Expand the Backing Disk Drives from AWS / Azure / VMware.
  2. Add New Backing Disk Drives to the existing pool.
    a. With Auto provisioned Disks
    b. With Manually provisioned Disks
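
With auto provisioned (cloud) disks, both methods can typically be requested through pxctl service pool expand. The sketch below prints the command for review rather than running it. The function name pool_expand_cmd is hypothetical, the pool UID and target size are placeholders (the UID is assumed to come from pxctl service pool show), and the --operation values should be confirmed with pxctl service pool expand --help on your cluster.

```shell
# Build (but do not run) a pool expansion command.
# Usage: pool_expand_cmd <resize-disk|add-disk> <pool-uid> <target-size-GiB>
pool_expand_cmd() {
  operation="$1"
  pool_uid="$2"
  size_gib="$3"
  # resize-disk grows the existing backing drives; add-disk adds new ones.
  case "$operation" in
    resize-disk|add-disk) ;;
    *) echo "operation must be resize-disk or add-disk" >&2; return 1 ;;
  esac
  echo "pxctl service pool expand --operation ${operation} --uid ${pool_uid} --size ${size_gib}"
}
```

The size passed here is the new total size of the pool in GiB, not the amount to add.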

For more information, please visit: LINK

7.> Add more nodes to your Portworx cluster:

  • You can also add more nodes to the existing cluster to gain not just disk capacity but also compute capacity.
  • With more nodes in the cluster, more compute capacity and disk space are available to your application workloads.
  • You can also move Volume Replicas to the newly added nodes to free up space on other nodes.

For more information, please visit LINK

This is good to know when working with Px on K8s, nice article!

Thanks for the feedback, Gerardo. We update the content on these posts regularly.
Do watch this space for more options to be added in the future.