How to recover from node full state?

varunjain · April 19, 2020, 11:00am

How do you recover from a node full (no space left on the node) state?

When one or more nodes in a cluster have gone partially offline because they have run out of space, there are several options to consider, depending on what caused it.

Common reasons for this node state include, but are not limited to, the following:-

1. A single volume whose size is greater than the total node capacity.
2. Over-provisioning of Portworx volumes beyond the node's actual total capacity.
3. Accumulation of PX volume replicas and local snapshots on the nodes.
4. Unwanted volumes created during tests and left in the detached state.
5. Portworx volumes themselves filling up because of application logs.
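
For reason 5, one way to spot volumes that are filling up is to filter df output for Portworx devices. This is only a sketch: the function name pxd_usage_over and the 80% threshold are made up for illustration, and it assumes the volumes show up under /dev/pxd/ as in the mount output further down in this post.

```shell
# Hypothetical helper (not part of Portworx): reads `df -P` output on stdin
# and prints device, use% and mount point for every /dev/pxd/ device whose
# usage is at or above the given percentage.
pxd_usage_over() {
  threshold="$1"
  awk -v t="$threshold" '
    NR > 1 && index($1, "/dev/pxd/") == 1 {
      pct = $5
      sub(/%/, "", pct)          # "95%" -> "95"
      if (pct + 0 >= t) print $1, $5, $6
    }'
}
```

On a node it could be run as: df -P | pxd_usage_over 80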

Possible Solutions:-

1.> Portworx Volume House-Keeping:-

This requires the user to perform house-keeping (clean-up) on the Portworx volumes by deleting unwanted logs, files and reports generated by the applications using them.
OR
Implement data compression / purging techniques at the application level.

Go to the node where the volume is mounted and do the following:

a. Find the volume mount point and change directory (cd) to it :-

mount | grep px
/dev/pxd/pxd349498214087081554 on /var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18

cd /var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18

[root@worker-node-2] # ls -lrth
total 0
-rw-r--r-- 1 root root 3G Apr 19 07:25 abc
-rw-r--r-- 1 root root 10G Apr 19 07:26 db.log
-rw-r--r-- 1 root root 5G Apr 19 07:26 front-end.log

b. Identify and delete the unwanted log files or backups within the volume.
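
When the volume holds many files, ls alone may not make the biggest consumers obvious. A minimal sketch for step b, assuming a standard find / du / sort on the node (largest_files is a made-up helper name, not a Portworx tool):

```shell
# Hypothetical helper: list the N files using the most disk space (in KiB)
# under a directory, largest first, without crossing into other filesystems.
largest_files() {
  dir="$1"
  count="${2:-10}"     # default to the top 10 if no count is given
  find "$dir" -xdev -type f -exec du -k {} + | sort -rn | head -n "$count"
}
```

From inside the mount point, largest_files . 5 would print the five largest files. Review the list before deleting anything, since the application may still need some of them.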

2.> Identify unwanted volumes / local snapshots (Created for Testing)

Identify volumes or local snapshots that were created in error or for testing and are no longer in use, and delete them.
Cloud Snapshots are generally preferable to local snapshots, because they are not stored on the nodes and so do not consume node capacity.

Note 1:- Remember that a snapshot of a repl-3 volume is also a repl-3 snapshot. This means that snapshots of that volume, whether in use or not, also consume disk space on other nodes.

Note 2:- Be careful about deleting scheduled applicationBackup / CloudSnap volumes and incremental snapshots.

Listing only the volumes node-wise :-
pxctl volume list -v --node-id <node-id>
Listing only the local snapshots node-wise:-
pxctl volume list -s --node-id <node-id>

3.> Moving / Reduce-Remove the Portworx Volume Replicas

You can save a lot of space on an affected node by reducing the replication level of volumes or by moving their replicas off that node.
Modify the Storage Class to include fg: true. This parameter applies to the group of volumes created from the same storage class, and ensures that replicas of a volume are not created on a node where replicas of other volumes in the group reside.

For more information, please refer to the forum post "How to reduce / move / increase portworx volume replicas from one node to another or specific node?"
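
As a rough illustration of the fg: true setting, the sketch below writes a StorageClass manifest to a file. The class name px-repl2-fg, the group name app-group and the repl value are placeholders made up for this example, and pairing fg with a group parameter is my understanding of how the setting is meant to be used, so check it against the Portworx documentation for your version.

```shell
# Writes an example StorageClass manifest to the path given as $1.
# Assumptions: name, group and repl are illustrative placeholders; the
# provisioner is the in-tree Portworx one seen in the mount path above.
write_fg_storageclass() {
  cat > "$1" <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-repl2-fg
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"
  group: "app-group"
  fg: "true"
EOF
}
```

For example: write_fg_storageclass px-fg-sc.yaml, then kubectl apply -f px-fg-sc.yaml. Kubernetes does not allow the parameters of an existing StorageClass to be edited, so in practice this usually means creating a new class (or deleting and recreating the old one), and it only affects volumes provisioned afterwards.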

4.> Plan / calculate optimal sizing / re-sizing of a Portworx volume:

Be sure to stay within the node capacity limits.
By default, data written to a Portworx volume is striped only across the disks available to the node. It is not striped across the cluster nodes.
Portworx does allow creation of volumes with a higher capacity than the node, or even the global capacity, because the volumes are thin-provisioned.
Applications are often observed to keep writing data to a thin-provisioned volume until it far exceeds the node capacity.
This not only leads to an application outage but can also put the node into a FULL / unavailable state.
An alternative is to use aggregated volumes, where the volume data is striped across nodes in the cluster, leading to lower usage per node. Aggregated volumes require more nodes in the cluster if replication is also desired.
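
To make the thin-provisioning risk concrete, here is a small sketch that adds up the provisioned sizes of the volume replicas placed on one node and compares the total with the node capacity. The helper name provisioned_pct and all the numbers are made up for illustration.

```shell
# Hypothetical helper: $1 is the node capacity, the remaining arguments are
# the provisioned sizes of the volume replicas on that node (all whole GiB).
# Prints "<total provisioned> <provisioned as % of capacity>".
provisioned_pct() {
  capacity="$1"
  shift
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  echo "$total $((total * 100 / capacity))"
}
```

For example, provisioned_pct 200 100 150 250 prints "500 250": three volumes totalling 500 GiB on a 200 GiB node, i.e. 250% of capacity. Nothing breaks while the applications write little, but the node fills up long before the volumes do.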

5.> Implement Volume Placement Strategy:

You can control volume and replica provisioning more explicitly by creating VolumePlacementStrategy CRDs.
Within a VolumePlacementStrategy CRD, you can specify a series of rules that control volume and volume replica provisioning on nodes and pools in the cluster, based on the labels they have.
You can read more about this at https://docs.portworx.com/portworx-install-with-kubernetes/storage-operations/create-pvcs/volume-placement-strategies/
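
A rough sketch of what such a rule might look like is below. The field names (replicaAffinity, enforcement, matchExpressions), the API version and the media_type label are written from my recollection of the documentation linked above, and the name ssd-pool-placement is a placeholder, so please verify everything against the docs for your Portworx version before applying it.

```shell
# Writes an example VolumePlacementStrategy manifest to the path given as $1.
# Assumptions: API version, field names and the media_type label are taken
# from memory of the Portworx docs; the metadata name is a placeholder.
write_placement_strategy() {
  cat > "$1" <<'EOF'
apiVersion: portworx.io/v1beta2
kind: VolumePlacementStrategy
metadata:
  name: ssd-pool-placement
spec:
  replicaAffinity:
  - enforcement: required
    matchExpressions:
    - key: media_type
      operator: In
      values:
      - "SSD"
EOF
}
```

After write_placement_strategy vps.yaml and kubectl apply -f vps.yaml, the strategy would normally be referenced from a StorageClass (as far as I recall, through a placement_strategy parameter), so that new volumes from that class follow the rule.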

6.> Expand the Storage Pool on the nodes :

If you have reviewed all of the above and expect your cluster to grow in future, you can expand the node capacity (vertical scaling).
Depending upon your infrastructure, you have 2 methods to expand your existing storage pools:

1. Expand the backing disk drives from AWS / Azure / VMware.
2. Add new backing disk drives to the existing pool.
a. With auto-provisioned disks
b. With manually provisioned disks

For more information, please visit: LINK

7.> Add more nodes to your portworx cluster:

You can also add more nodes to the existing cluster. This gives your application workloads more compute capacity as well as more disk space.
You can also move volume replicas to the newly added nodes to free up space on other nodes.

For more information, please visit LINK

Gerardo_Hernandez · April 24, 2020, 10:27pm

This is good to know when working with Px on K8s, nice article!

varunjain · April 26, 2020, 5:36am

Thanks for the feedback Gerardo, we update the content regularly on these posts.
Do watch this space for more options added in future.

LudwigAugren · May 27, 2023, 6:53am

This is really helpful since I am using Px. Thank you very much!
Regards, Team mybakersfieldrealtor

DeanChandler · August 17, 2023, 4:19am

Hey, checking up on this thread again, did this work for you? It did not do the trick for me.
Hoping for a positive response
Team Mybakersfieldrealtor
