How to recover from a node-full (no space left on the node) state?
In situations like this, where one or more nodes in a cluster have gone partially offline because they ran out of space, there are several things you can consider, depending on what caused it.
The general reasons for this node status include, but are not limited to, the following:
1. The size of a single volume being greater than the total node capacity.
2. Over-provisioning of Portworx volumes beyond the node's actual total capacity.
3. Accumulation of Portworx volume replicas and local snapshots on the nodes.
4. Unwanted volumes created during tests and left in the detached state.
5. Portworx volumes themselves getting full because of application logs.
1.> Portworx Volume Housekeeping:
- This requires the user to perform housekeeping (clean-up) on the Portworx volumes by deleting unwanted logs, files, and reports generated by the application using them.
- Implement data compression / purging techniques at the application level.
Go to the node where the volume is mounted and do the following:
a. Find the volume mount point and change directory (cd) into it:
mount | grep px
/dev/pxd/pxd349498214087081554 on /var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18
[root@worker-node-2] # ls -lrth
-rw-r--r-- 1 root root 3G Apr 19 07:25 abc
-rw-r--r-- 1 root root 10G Apr 19 07:26 db.log
-rw-r--r-- 1 root root 5G Apr 19 07:26 front-end.log
b. Identify and delete the unwanted log files or backups within the volume.
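When the volume holds many nested directories, a `du` one-liner can surface the biggest space consumers faster than browsing with `ls`. A minimal sketch, reusing the example mount path from step (a):

```shell
# List the 10 largest files and directories under the volume mount
# point, sorted largest-first by human-readable size.
MOUNT_POINT=/var/lib/kubelet/pods/c9b2c0f7-ed98-4412-b3f2-c2bc6e31f883/volumes/kubernetes.io~portworx-volume/pvc-b0be31e4-a114-4993-8278-78ca74ad4b18
du -ah "$MOUNT_POINT" 2>/dev/null | sort -rh | head -n 10
```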
2.> Identify unwanted volumes / local snapshots (created for testing)
- The user should identify volumes or local snapshots that were created in error or for test purposes and are no longer in use, and delete them.
- It is always better to use cloud snapshots rather than local snapshots, which prevents nodes from being used to store them locally.
Note 1: Remember that the snapshot of a repl-3 volume is also a repl-3 snapshot, which means that the snapshots of that volume, in use or not, are also consuming disk space on other nodes.
Note 2: Be careful about deleting scheduled ApplicationBackup / CloudSnap volumes and incremental snapshots.
Listing only the volumes node-wise:
pxctl volume list -v --node-id <node-id>
Listing only the local snapshots node-wise:
pxctl volume list -s --node-id <node-id>
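Once an unused test volume or local snapshot has been identified, the cleanup itself is a `pxctl volume delete`. A minimal sketch (the volume and snapshot names below are hypothetical; always inspect first to confirm the volume is detached and safe to remove):

```shell
# Confirm the volume is detached and not in use before removing it.
pxctl volume inspect test-vol-1

# Delete the unwanted test volume (prompts for confirmation).
pxctl volume delete test-vol-1

# Local snapshots are deleted the same way, by snapshot name.
pxctl volume delete snap-test-vol-1
```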
3.> Moving / Reducing / Removing Portworx Volume Replicas
You can save a lot of space on the nodes by reducing or removing volume replicas from an affected node.
Modify the StorageClass to include fg: true. This parameter applies to a group of volumes created from the same StorageClass, and ensures that the replicas of a volume are not created on the same node where replicas of other volumes in the group reside.
- For more information, please refer to the forum post "How to reduce / move / increase volume replicas from one node to another or a specific node?"
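As a sketch of the replica-reduction step (the volume name reuses the earlier example; the node ID is a placeholder, and the exact flags should be confirmed against the `pxctl` reference for your version), `pxctl volume ha-update` lowers the replication level, and removing the replica that lives on the full node frees its space there:

```shell
# Check the current replication level and replica placement.
pxctl volume inspect pvc-b0be31e4-a114-4993-8278-78ca74ad4b18

# Reduce the replication level from 3 to 2, removing the replica
# held on the affected node (identified by its node ID).
pxctl volume ha-update --repl 2 --node <affected-node-id> pvc-b0be31e4-a114-4993-8278-78ca74ad4b18
```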
4.> Plan / calculate optimal sizing / resizing of a Portworx volume:
- Be sure to stay within the node capacity limits.
- By default, data written to a Portworx volume is striped only across the disks available to the node; it is not striped across the cluster nodes.
- Portworx does allow the creation of volumes with a capacity higher than the node, or even the global, capacity because the volumes are thin-provisioned.
- It is often observed that applications continue to write data to a thin-provisioned volume, far exceeding the node capacity.
- This not only leads to an application outage but can also cause the node to enter a FULL / unavailable state.
- An alternative is to use aggregated volumes, where the volume data can be striped across the nodes in the cluster, leading to lower usage per node. Aggregated volumes require more nodes in the cluster if replication is also desired.
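To spot this kind of over-commitment before a node fills up, Portworx can report provisioned capacity against actual physical capacity:

```shell
# Show, per node, how much capacity has been provisioned to volumes
# versus how much physical storage is actually available.
pxctl cluster provision-status
```

If the provisioned total far exceeds the physical capacity of a node, that node is a candidate for the resizing, replica, or pool-expansion steps described in this post.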
5.> Implement a Volume Placement Strategy:
Portworx lets you control volume and replica provisioning more explicitly. You can do this by creating VolumePlacementStrategy CRDs.
Within a VolumePlacementStrategy CRD, you can specify a series of rules that control volume and volume replica provisioning on nodes and pools in the cluster based on the labels they carry.
You can read about this more at https://docs.portworx.com/portworx-install-with-kubernetes/storage-operations/create-pvcs/volume-placement-strategies/
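As a minimal sketch of such a CRD (the strategy name and the `media_type=SSD` label are illustrative assumptions; see the linked docs for the full schema), this rule restricts replicas to labelled pools:

```yaml
apiVersion: portworx.io/v1beta2
kind: VolumePlacementStrategy
metadata:
  name: ssd-only-placement      # hypothetical name
spec:
  replicaAffinity:
  - enforcement: required
    # Place volume replicas only on pools labelled media_type=SSD.
    matchExpressions:
    - key: media_type
      operator: In
      values:
      - "SSD"
```

A StorageClass can then reference the strategy (via its `placement_strategy` parameter) so that all PVCs created from it follow these placement rules.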
6.> Expand the storage pool on the nodes:
- If all of the above has been reviewed and you predict that your cluster will grow in the future, you can expand the node capacity (vertical scaling).
- Depending upon your infrastructure, you have two methods of expanding your existing storage pools:
- Expanding the backing disk drives from AWS / Azure / VMware.
- Adding new backing disk drives to the existing pool:
a. With Auto provisioned Disks
b. With Manually provisioned Disks
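As a sketch of what a pool expansion looks like (the pool UID and target size are placeholders; confirm the exact flags against the `pxctl` reference for your version):

```shell
# List the storage pools on this node along with their UIDs.
pxctl service pool show

# Expand the pool: 'resize-disk' grows the backing cloud drives in
# place, while 'add-disk' attaches a new drive to the pool
# (auto-provisioned disk setups).
pxctl service pool expand --uid <pool-uid> --size <new-total-size-in-GiB> --operation resize-disk
```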
For more information, please visit: LINK
7.> Add more nodes to your Portworx cluster:
- You can also add more nodes to the existing cluster, which gives your application workloads not only more disk capacity but also more compute capacity.
- You can then move volume replicas to the newly added nodes to free up space on the other nodes.
For more information, please visit LINK