Bad write performance on pwx volume

I’ve probably done something totally wrong.

I’m running a px-dev cluster of 3 nodes on Proxmox VMs. When I run dd if=/dev/zero of=test bs=2048 count=1048576 on a locally mounted pwx volume I get a throughput of ~6 MB/s, while the same command on a local disk (the disk configs are all the same) gives ~250 MB/s.
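
For reference, the two test runs looked roughly like this (the mount paths are just illustrative):

# on the locally mounted pwx volume (~6 MB/s)
cd /mnt/testvl && dd if=/dev/zero of=test bs=2048 count=1048576

# on a local disk with the same config (~250 MB/s)
cd /root && dd if=/dev/zero of=test bs=2048 count=1048576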

My cluster configuration is:

root@ydev1:/mnt/infra/starts/pwx# pxctl status
Status: PX is operational
License: PX-Developer
Node ID: e93a6044-d5e8-417e-b858-919b2028328b
        IP: 10.3.0.110
        Local Storage Pool: 1 pool
        POOL    IO_PRIORITY     RAID_LEVEL      USABLE  USED    STATUS  ZONE    REGION
        0       HIGH            raid0           200 GiB 10 GiB  Online  default default
        Local Storage Devices: 1 device
        Device  Path            Media Type              Size            Last-Scan
        0:1     /dev/sdb        STORAGE_MEDIUM_SSD      200 GiB         14 Oct 20 11:50 UTC
        total                   -                       200 GiB
        Cache Devices:
         * No cache devices
        Journal Device:
        1       /dev/sdd1       STORAGE_MEDIUM_SSD
        Metadata Device:
        1       /dev/sdc        STORAGE_MEDIUM_SSD
        * Internal kvdb on this node is using this dedicated metadata device to store its data.
Cluster Summary
        Cluster ID: pwx.testrepl.ygdev
        Cluster UUID: ae0c7c03-d569-4ca3-8145-7789bac04935
        Scheduler: swarm
        Nodes: 3 node(s) with storage (3 online)
        IP              ID                                      SchedulerNodeName               StorageNode     Used    Capacity        Status  StorageStatus   Version         Kernel    OS
        10.88.0.2       e93a6044-d5e8-417e-b858-919b2028328b    7635l43biffytgaff38t6j763       Yes             10 GiB  200 GiB         Online  Up (This node)  2.6.1.2-669fb0c 5.4.0-48-generic   Ubuntu 20.04.1 LTS
        10.88.0.1       78bf22f2-bb50-40e2-8455-8b964b4096e0    t3swt67sbrzw95wvczr6zbbvv       Yes             10 GiB  200 GiB         Online  Up              2.6.1.2-669fb0c 5.4.0-48-generic   Ubuntu 20.04.1 LTS
        10.88.0.3       785f9144-93d9-477c-b3ef-813adc5365ed    r1ezu9t1tni9ojkiwl2hhgfb3       Yes             10 GiB  200 GiB         Online  Up              2.6.1.2-669fb0c 5.4.0-48-generic   Ubuntu 20.04.1 LTS
        Warnings:
                 WARNING: Swap is enabled on this node.
Global Storage Pool
        Total Used      :  30 GiB
        Total Capacity  :  600 GiB

The network between the nodes is configured on a local bridge, so it is not limited by the physical NIC speed.

The volume was created with the command:
pxctl volume create testvl --shared --async_io --early_ack -s 30 -r 3 --io_priority high

I’ll be grateful for any advice.

We recommend that people use sharedv4 instead of shared (the latter is considered legacy and is being phased out), as sharedv4 has superior performance. Additionally, you can set --io_profile db_remote and see if that results in even better performance (this requires a replication level of 3).
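
For example, something along these lines (adjust the name and size to your setup):

pxctl volume create testvl --sharedv4 -s 30 -r 3 --io_priority high --io_profile db_remote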

Thanks a lot for the advice! Using the --sharedv4 switch brings the performance back to native levels, but I’m losing the ability to attach the volume on multiple hosts, despite what is mentioned here: https://docs.portworx.com/concepts/shared-volumes/

Can you please advise on this as well?
My current volume creation command is:

pxctl volume create testvl --sharedv4 -s 30 -r 3 --io_priority high

And another small question, if I may: the daemon sporadically replaces my /etc/exports file with an empty one. How can I prevent that?

Thanks in advance.

Can you elaborate on what you mean by “losing the ability to attach the volume on multiple hosts”? It’s unclear from the information provided so far what your use case looks like in more detail.

You definitely should be able to utilize the volume from other nodes, but can you specify whether you are always using pxctl directly on the other hosts?

Typically for Kubernetes-based environments, we have this action performed via the orchestrator control plane through a StorageClass object (where you can specify parameters such as sharedv4: true and repl: 3). The volume attachment then happens transparently when you have a PersistentVolumeClaim that references that StorageClass, and the generated PersistentVolume is attached/mounted into the node/pod when that workload needs it.
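
Purely as an illustration, a minimal StorageClass and claim for that would look roughly like this (the object names are hypothetical):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-sharedv4
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
  sharedv4: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvl-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: px-sharedv4
  resources:
    requests:
      storage: 30Gi
EOF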

I’m using Docker swarm now.
Indeed, I’ve just checked that the volume is shared between multiple container instances on different nodes.

Also, I have use cases where I mount volumes at the host level for NFS sharing. I’m not sure the local mount will work in case of a failover, so I prefer to mount on all nodes and use keepalived to fail over based on mounted-volume availability.

But when I run it on node 1 I get OK and it works:

root@ydev0:/mnt/infra/starts# pxctl host attach testvl
Volume successfully attached at: /dev/pxd/pxd897951024453320486
root@ydev0:/mnt/infra/starts# ll /dev/pxd/pxd897951024453320486
brw-rw---- 1 root disk 252, 1 Oct 14 15:54 /dev/pxd/pxd897951024453320486

and on node 2 I also get OK, but it doesn’t work:

root@ydev1:/mnt/infra/starts/pwx#  pxctl host attach testvl
Volume successfully attached at: /dev/pxd/pxd897951024453320486
root@ydev1:/mnt/infra/starts/pwx# ll /dev/pxd/pxd897951024453320486
ls: cannot access '/dev/pxd/pxd897951024453320486': No such file or directory

Let’s elaborate a bit more on the attach vs. mount process.

Attaching a volume can only happen on one node at a time, and this is what creates the /dev/pxd/pxdNNNNN device, a virtual block device that gives you access to the Portworx volume. This typically happens on a node where the volume has replicas, but it can also happen on nodes that don’t have replicas (in which case the data is simply accessed over the network from the nodes that do have replicas), such as when you have a workload on one node (possibly storageless) but the volume’s storage on another node.

Mounting the volume, on the other hand, uses the /sbin/mount command, either on the /dev/pxd/pxdNNNNN device file (if this is a ReadWriteOnce, i.e. non-shared, access-mode volume) or on the first node where a shared volume is attached (which also exports the volume via NFS); any OTHER node will then mount the NFS-exported volume from that first node. In a Kubernetes-based environment (the only one where Portworx Essentials is supported as of today), the mount point on the filesystem will be within the kubelet’s pod working directory.
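
To make that concrete, a rough sketch with pxctl (the mount path is just an example, and I’m assuming the pxctl host mount --path syntax here):

# attach: creates the virtual block device on exactly one node
pxctl host attach testvl
ls -l /dev/pxd/pxd897951024453320486

# mount on that same node: uses the block device directly;
# for a sharedv4 volume this node also exports the data via NFS,
# and mounts on any other node are served from that export
pxctl host mount testvl --path /mnt/testvl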

We can’t say whether this approach of yours is the best way, but if you give us a higher-level view of what you’re trying to achieve, we can comment on whether it is actually the best one or whether there’s another approach you may not have considered.

Sorry, I was probably confused by the “--shared” flag, which provides the ability to create block devices on different nodes from the same volume.
My goals are as follows:

  1. I’m good with sharing the same volume inside Docker containers running as swarm service instances.
  2. For host-level usage, I’m using a simple scenario to achieve high availability for sharing a directory via NFS. My plan was to mount the volume and run an NFS server on multiple nodes. The HA is provided by the keepalived service, which switches a virtual IP between nodes based on the volume’s accessibility (a rough sketch of the check script I have in mind is just after this list). That works just fine with the “--shared” flag used at volume creation, except for the very poor performance that was my initial question.
    If Portworx utilizes NFS for internal sharing, can I just mount a volume directly from Portworx through NFS, instead of mounting the volume locally and creating another NFS export?
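
The keepalived check is roughly a small script like the following, called from a vrrp_script block, that verifies the volume is mounted and writable on the node (the path and file name are only illustrative):

#!/bin/bash
# keepalived health check - exit non-zero if the pwx volume is not usable on this node
MOUNTPOINT=/mnt/testvl

mountpoint -q "$MOUNTPOINT" || exit 1
touch "$MOUNTPOINT/.keepalived_check" 2>/dev/null || exit 1
exit 0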

We support accessing the exported volume via NFS outside of Portworx if you use one of the following options:

  • to allow it to be consumed by any IP, add allow_all_ips=true to the StorageClass (or as a label when using pxctl volume create/update) - requires version 2.3.2 or higher
  • alternatively, to specify exactly which IPs can access it, use allow_ips=ip1;ip2 - requires version 2.6.0 or higher

The exact share name can be queried on the node where the volume is attached, from its /etc/exports (the standard NFS file for this configuration), based on the (numeric) volume ID.
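
For example (the IPs below are placeholders, and the exact label flag may vary slightly by version):

# allow specific client IPs to mount the NFS export directly (PX 2.6.0+)
pxctl volume update testvl --label allow_ips="10.3.0.50;10.3.0.51"

# on the node where the volume is attached, look up the share by numeric volume ID
grep 897951024453320486 /etc/exports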

There is also a documentation page about this: https://docs.portworx.com/reference/knowledge-base/faqs/#can-i-access-my-data-outside-of-portworx-volume-or-is-it-only-for-containers

However, I am not sure whether your keepalived service will work well. Remember, a volume can only be attached on one node at a time (and that node is the one that does the NFS exporting in the case of sharedv4 volumes).
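
You can check which node currently has the volume attached, for example with something like:

pxctl volume inspect testvl
# the output shows the replication status and whether/where the volume is currently attached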

Thank you a lot for your suggestion. Your help is very valuable!

The last concern you mentioned has a workaround: running a dummy container that mounts the volume in the usual way on each of the Portworx cluster nodes. In this case I have the volume available on the host machine in the /var/lib/osd/mounts directory as well.
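
For reference, that dummy container is roughly a global swarm service like this (the image and service name are just illustrative):

# keep a no-op container running on every node so the volume stays mounted there
docker service create --name testvl-holder --mode global \
  --mount type=volume,volume-driver=pxd,source=testvl,target=/data \
  alpine tail -f /dev/null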

The last question: after installing the Portworx OCI bundle, creating the cluster, and starting the service, my /etc/exports file on the host machine is overwritten from time to time with an empty one. Is there a way to prevent this without setting the immutable attribute on the file?

Thanks, YG