02-18-2013 07:01 AM
I'm having an issue whereby the VCM loses connectivity with the OA, up to several times a day. It has now happened 40+ times since Dec 2012.
The chassis has a standard config:
- C7000 Enclosure
- 2x Flex-10 in Bay1/2
- 2x FC 8 Gb in Bay3/4
- 8 blades, all BL460c G7 except one BL460c G8
- FW level is 3.60 on OA and VC
- FW/driver levels on servers as per SPP 2012.06
- Mix of ESXi 5.0 U2 and Windows 2008 R2
From the VC log, I can see that after the VCM reconnects to the OA (usually 3 minutes later), the VCM seems to "re-discover" the content of the enclosure, including how many blades are present, what their power state is and whether they have a profile assigned.
95% of the time, the blades do not suffer any I/O interruption, but sometimes during that "discovery" phase one or two blades are not identified although they are running, and as a result these servers immediately lose all NIC/FC connections. The VC profile of these servers will show as <Not Present>. At this point, your only solution is to force a power off from the iLO; upon reboot the VC profile loads just fine and the box regains NIC/FC connectivity.
- Has anyone seen this before?
- Is the VCM-OA communication happening inside the chassis, or does it go out to the external switch and back in?
So far HP tells me they can't spot anything wrong, so any help would be much appreciated :-)
02-18-2013 01:10 PM
I am running OA 3.60 and VC 3.70. I am seeing the same behavior on multiple domains, but fortunately not the outages you are, and not as often: the logs show this happening once per domain at random times. A couple of the domains are in the same row and connected to the same switches, yet the drops happen at separate times.
So that would lead me to guess that it's not something on the switch.
Do you have the OA connected to a 1GbE switch? Is there a lot of traffic (ISO load to a server or something) when you see those drops?
02-19-2013 05:21 AM
On the BL460c G7 I would recommend updating the LOM firmware to the latest version, as IMHO many horrible bugs have been fixed.
Easiest way to update is probably with the offline or online ISOs:
02-20-2013 12:21 AM
Thanks for your reply. In a way I'm glad to hear we're not alone with this problem. Like I said, 95% of the time we don't see I/O disconnections when it happens, but this behavior is not normal, and a "VCM-OA Communication Down" event is clearly marked as Critical in the VCM logs.
I've logged a call with HP, but so far they refuse to admit there's something wrong with VCM and they're going the route of replacing the motherboard on the BL460c G8. Since Monday I have switched the management of the VCM from interconnect #1 to #2 using the "vcm reset -failover" command and I haven't seen any errors yet, but it might be too soon.
The chassis has 1x 10 Gbit connection per Flex-10 module (bays 1-2) and some other 1 Gbit links on the pass-through modules (bays 5-6). There should be plenty of bandwidth available for the blades to use but I have to admit I didn't measure the load on the uplinks to prove my point.
The OAs are connected at 1 Gbit and seem fine; they're constantly monitored by HP SIM and I've never seen them go down. Additionally, the HP technician I've talked to tells me that VC-OA communication happens intra-chassis, so it does not need to go out and back in through an external network switch.
I'll move this G8 blade to a different chassis this week and I'll see if the problem follows the blade...
02-20-2013 12:30 AM
Thanks for your reply. All the blades were updated using HP SUM and the SPP 2012.06 bundle to make sure that firmware and drivers for all OS are part of the same family. With regards to ESXi, I've aligned the NIC/HBA/Smart Array drivers with the "HP-VMware Software Recipe" doc from June 2012 (http://vibsdepot.hp.com/hpq/recipes/June2012VMware
02-20-2013 02:06 AM
There is this one issue that I've had with the old BL460c G7 NIC firmware, but only after upgrading VC: HP NC55X Adapters - FIRMWARE UPDATE RECOMMENDED: Device Control Channel (DCC) May Be Unavailable wit...
Maybe your affected G7 blades are having these DCC issues? It's just a guess...
02-20-2013 05:51 AM
Interestingly, I noticed that HP just released the Feb 2013 Software Recipe doc, in which they recommend Emulex NIC FW 4.2.401.6 together with VC FW 3.75.
02-20-2013 01:32 PM
My firmware is newer: we are running 4.1.450.16 and we are at SPP 2012.02 for the servers (G7s). I would suggest updating your firmware; hopefully that fixes your problem.
It still makes no sense that we are seeing the OA and VC lose communication. If it's internal, how is that possible?
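Side note for anyone lining up the firmware versions quoted in this thread: dotted strings like these are easy to misorder by eye or with a plain string sort. Below is a minimal Python sketch, assuming every component of the version is a plain integer (my assumption, not something taken from HP documentation). The names parse_version and is_older are purely illustrative.

```python
def parse_version(text):
    """Split a dotted version such as '4.1.450.16' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older(installed, reference):
    """Return True when `installed` is strictly older than `reference`.

    Tuples compare component by component, so 450.16 ranks above 450.7,
    which a plain string comparison would get wrong.
    """
    return parse_version(installed) < parse_version(reference)


if __name__ == "__main__":
    # Version strings as quoted in this thread.
    installed = "4.1.450.16"
    recipe = "4.2.401.6"
    print(installed, "older than", recipe, "->", is_older(installed, recipe))
```

This only orders the numbers; whether a given build is actually the recommended one still has to come from the matching SPP or recipe document.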
02-20-2013 10:41 PM
Hmm another wild guess: maybe there are duplicate OA IPs assigned on the network where the OAs/VCs are? Which might lead to VC sometimes not getting the expected reply and losing the communication. Or your OAs are failing over all the time because of a flaky OA cable?
So when the VC domain gets re-imported after the NO_COMM to OA, some blades might lose I/O if at the same time their iLOs are hung or not reachable. But I have only had this happen with multi-bay Integrity servers; not sure if this also applies to ProLiants.
02-21-2013 05:17 AM
Switching the VC management to interconnect module #2 did nothing for us, as I just spotted 2x NO_COMM errors during the night (no impact on the servers this time). I have moved the G8 blade to another chassis today to see if the disconnections follow the server. If the problem happens again, I'll definitely upgrade the Emulex FW.
HP is asking me to upgrade to VC FW 3.70 and now admits there's a "DSS (direct) communication problem between OA and VCM". They now have 2 support engineers from the BladeSystem + Networking depts looking at the issue. I've also told them the same errors appear on FW 3.70.
If this happens to anyone out there, you can always reference my case #4693499393.
Oh, by the way, I've checked all our OA and VC IPs: they're all unique and statically assigned, no conflict on the horizon.
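For anyone who wants to repeat that uniqueness check against their own address list, here is a minimal Python sketch. The device names and addresses are placeholders from the documentation range, not our real ones, and find_duplicate_ips is just an illustrative helper. It only checks a static inventory; it will not catch a rogue device answering on the wire.

```python
import ipaddress


def find_duplicate_ips(assignments):
    """Return {ip: [device names]} for every address assigned more than once.

    `assignments` maps a device name to its IP address string.
    """
    by_ip = {}
    for device, ip in assignments.items():
        # ip_address() validates the string and gives a normalised key.
        key = ipaddress.ip_address(ip.strip())
        by_ip.setdefault(key, []).append(device)
    return {str(ip): devices for ip, devices in by_ip.items() if len(devices) > 1}


if __name__ == "__main__":
    # Hypothetical inventory of OA and VC management addresses.
    inventory = {
        "enc1-oa1": "192.0.2.10",
        "enc1-oa2": "192.0.2.11",
        "enc1-vc-bay1": "192.0.2.12",
        "enc1-vc-bay2": "192.0.2.11",
    }
    for ip, devices in find_duplicate_ips(inventory).items():
        print("duplicate", ip, "used by", ", ".join(devices))
```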
03-14-2013 06:19 AM
Please check all the VC downlink port statistics from Bay1/Bay2 and look for Dot3InPauseFrames. If you notice a high number of Dot3InPauseFrames, refer to the advisory below:
Advisory: (Revision) HP Virtual Connect - Servers Transmitting Excessive Pause Frames May Cause Virtual Connect Uplink Port To Experience Channel Flapping, Stacking Link Removal and Server Communication Loss:
Get the NIC firmware/driver updated to the recommended version listed in the advisory. You can also check the advisories below.
Advisory: HP ProLiant Servers - HP NC55x Network Adapters Using Firmware Older Than Version 4.1.450.7 May Experience Continuous Pause Frames :
Notice: BladeSystem c-Class Server Blades: NC532i/NC532m Network Adapters - Optimizing Network Performance Under Windows Server When the MTU is Set to 5000 Bytes (or More) and PAUSE Frames Are Enabled :
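For the Dot3InPauseFrames check described at the top of this post, a rough way to spot the noisy downlinks is to take two samples of the counter per port and look at the increase between them. The Python sketch below assumes you have already collected the counters into two dictionaries by whatever means you use; the port names, the sample values and the threshold are all made up for illustration, and the advisories do not give a specific number here. Counter wrap-around is ignored.

```python
def pause_frame_deltas(before, after):
    """Per-port increase in Dot3InPauseFrames between two samples.

    Ports missing from `before` are treated as starting at zero.
    """
    return {port: after[port] - before.get(port, 0) for port in after}


def flag_ports(before, after, threshold):
    """Return ports whose increase exceeds `threshold`, largest increase first."""
    deltas = pause_frame_deltas(before, after)
    noisy = [port for port, delta in deltas.items() if delta > threshold]
    return sorted(noisy, key=lambda port: deltas[port], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken a few minutes apart.
    sample_1 = {"bay1:d1": 100, "bay1:d2": 5, "bay2:d1": 0}
    sample_2 = {"bay1:d1": 100, "bay1:d2": 5005, "bay2:d1": 20000}
    # Hypothetical threshold; pick one that suits your environment.
    for port in flag_ports(sample_1, sample_2, threshold=1000):
        print("check NIC firmware/driver on the server behind", port)
```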
08-22-2013 07:54 AM - edited 08-22-2013 08:16 AM
Wondering if any of the folks experiencing this have found a resolution? We're experiencing the same problem with VC FW 3.70/OA FW 3.60 in an environment with 3 linked chassis. Interestingly enough, we have a test environment with 2 chassis and the same FW versions that has never experienced the issue. Both have inter-chassis stacking links set up per the best practices guides, etc.
I was thinking of upgrading both OA/VC FW, but it sounds like VC FW 3.75 still exhibits the issue. We've got an open case with HP at the moment and yesterday they swapped an OA (problem still occurred last night, after the swap.)
Has anyone tried VC FW 4.01? This seems a bit too early for a production release, but we need this issue resolved. As stated earlier in the thread, the majority of the time this has no impact on the blades. The one time it did impact things, it tore down an entire Oracle RAC cluster :(
08-22-2013 06:36 PM
What fixed the problem for us was to upgrade all chassis to v3.70 and all the blades (FW + Drivers) to the corresponding SPP (2012.10) version. For ESX servers I have looked at the Solution Recipe documents from HP-VMware to make sure all firmware/driver versions are aligned.
I have to say it's been very stable for months now.
10-18-2013 07:26 AM
We are still seeing this as well on different platforms. We see this on a VC 1/10 platform that is updated to 3.60 as well as on a 3.70 Flex-10 platform. On my 1/10 platform all of the VCs disconnect and reconnect from the network. So far HP and Cisco have found nothing to suggest a cause. No pause frames, and NLP is disabled. No duplicate IPs either.
Any suggestions would be appreciated.