04-02-2014 08:27 AM - edited 04-02-2014 08:34 AM
I am having the write latency issue on one production cluster and one lab cluster. The production cluster is using round robin with two paths to each datastore.
The lab cluster is currently using the fixed path policy and has only one iSCSI NIC and one available path.
Write latency is very poor on both clusters (esxtop shows 200ms+ for write latency), so I don't think changing the path policy will help in my situation.
I feel compelled to contribute since we have experienced the same latency issues using VSA 11.5 / ESXi 5.5 u2. Similar to other contributors' experiences in this discussion, the latency seems to occur whenever the VSA cluster is accessed via a local gateway VSA node, which requires iSCSI traffic to pass through the local ESXi vSwitch network stack. In contrast, accessing the cluster via a remote VSA gateway on another host shows good performance. The issue seems to be that when the VSA node shares a local vSwitch with your iSCSI vmk ports, latency is introduced whenever you access a VSA-presented datastore that the cluster has decided to present through that same local VSA node.
This implies that it is as likely to be a hypervisor network stack performance issue as a VSA cluster issue.
2 x HP DL380 Gen8s; local 15K SAS HDD Storage Array; vSphere ESXi 5.5 u2 (HP Build)
2 x HP VSA 10TB v11.5; Software iSCSI Adapter; Standard twin path iSCSI Initiator configuration.
Network is 10GbE with Jumbo Frames (MTU 9000). Throughput to the non-local VSA node is around 300-400MB/s at <20ms latency. Throughput to the local VSA node is around 100-200MB/s with >1000ms latency spikes.
The VSA paths tend to settle on a pattern where one particular volume / datastore presented by the cluster VIP is always mapped to a local VSA on a particular host. This is desirable since it offers load balancing between VSAs. However, it often means that VSA Datastore 1 is accessed by ESXi Host 1 via its local VSA, and VSA Datastore 2 is accessed by ESXi Host 2 via its local VSA. Storage degradation is then experienced by ESXi Host 1 on VSA Datastore 1 (local) but not on VSA Datastore 2 (accessed remotely via pNIC / switch), and vice versa.
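To make that pattern concrete, here is a rough Python sketch of which host / datastore combinations end up on the slow local path. The `gateway_for` table and the host and datastore names are purely illustrative assumptions on my part; nothing here is read from the VSA or from ESXi.

```python
# Hypothetical mapping (assumption): which host's local VSA currently
# acts as the gateway for each datastore presented by the cluster VIP.
gateway_for = {
    "VSA Datastore 1": "ESXi Host 1",
    "VSA Datastore 2": "ESXi Host 2",
}

def access_path(host, datastore):
    """Return 'local' when the host reaches the datastore through its own
    VSA (iSCSI traffic stays inside the local vSwitch), otherwise 'remote'
    (traffic leaves via pNIC / physical switch)."""
    return "local" if gateway_for[datastore] == host else "remote"

def path_table(hosts):
    """Classify every host / datastore combination."""
    return {(h, d): access_path(h, d) for h in hosts for d in gateway_for}

if __name__ == "__main__":
    table = path_table(["ESXi Host 1", "ESXi Host 2"])
    for (host, ds), path in sorted(table.items()):
        print(f"{host} -> {ds}: {path}")
```

In our case the combinations that come out as `local` are the ones showing the latency spikes.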
Running various storage performance tools, it seems that the throughput / latency to the local VSA node begins acceptably, but as you ramp up the test data it suddenly seems to become saturated, whereby latency goes through the roof. Using Round Robin Path Policy at iops 1 or the default 1000 gives very good storage performance on the non-local VSA, but abysmal performance on the local VSA. Defaulting to Most Recently Used Path Policy gives poorer but acceptable performance on the non-local VSA and poor performance on the local VSA; however, latency seems to remain just within acceptable tolerances, still spiking occasionally to several hundred ms but averaging 20-30ms. The inference, perhaps, is that the lower throughput / reduced path switching lowers the frequency with which the local hypervisor network stack is saturated by iSCSI traffic passing between a local target and initiator.
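For anyone wanting to compare path policies the same way, this is a minimal Python sketch of how latency samples could be summarized into an average, a median, and a spike count. The function name, the 100ms spike threshold, and the sample values are all assumptions made up for illustration; they are not our measured data, and the sketch does not parse esxtop output.

```python
from statistics import mean, median

def summarize_latency(samples_ms, spike_threshold_ms=100.0):
    """Summarize write latency samples given in milliseconds.

    spike_threshold_ms is an arbitrary cut-off chosen for illustration.
    """
    if not samples_ms:
        raise ValueError("no latency samples supplied")
    spikes = [s for s in samples_ms if s > spike_threshold_ms]
    return {
        "avg_ms": mean(samples_ms),
        "median_ms": median(samples_ms),
        "max_ms": max(samples_ms),
        "spike_count": len(spikes),
        "spike_fraction": len(spikes) / len(samples_ms),
    }

# Made-up samples: mostly 20-30ms with two spikes of several hundred ms.
samples = [22, 25, 28, 24, 310, 26, 23, 27, 450, 25]
summary = summarize_latency(samples)

if __name__ == "__main__":
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Looking at the median alongside the average is useful here, because a handful of large spikes can drag the average well above what most I/Os actually see.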
As already suggested in this discussion, the solution would seem to be to separate the VSA from the iSCSI Software Initiator vmk ports; however, we have no more pNICs to offer each ESXi node at the moment, and 10GbE cards and switch modules are expensive!
Hope all this helps someone.
I agree with your conclusions, but getting VMware to resolve the bug will only happen if a very large customer of theirs complains about this. Any customer large enough to have the clout needed will probably not be using HP VSA.