02-18-2010 12:26 PM
In both VC Flex-10 Ethernet Module firmware 2.30 and 2.31, Smart Link is not working correctly in our configuration.
C7000 chassis using BL460c G6 and BL685c G6 blades.
Bay 1 and bay 2 each hold a VC Flex-10 module.
Bay 1 is connected via five 1 Gb links to a primary Cisco 6500 in EtherChannel mode with VLAN tagging.
Bay 2 is connected via five 1 Gb links to a secondary Cisco 6500 in EtherChannel mode with VLAN tagging.
Both VC modules are configured for Smart Link failover.
The blades are configured with Linux bonding in failover mode, and the blades do the actual VLAN breakout (these are actually Xen hypervisors).
If I power off or reset a VC module, the blades notice the link down and fail over to their secondary link.
However, if I shut down all five of those ports on the Cisco switch, the GUI reports all external links down, but the host never sees the link drop, so bonding never kicks in.
This has two major issues:
1) If I lose a Cisco switch or need to do maintenance on it, I lose connectivity to all my blades (and all their guests).
2) Doing a code upgrade of the VC Flex-10 modules has always caused a major outage to the blades during the last reset of the active VC module.
Any ideas on what I can do to address this?
02-19-2010 08:53 AM
The latest Broadcom driver that supports DCC was just published to the VMware site. Please refer to the following URL:
Remember to upgrade the NIC firmware first.
Look to upgrade to VC 2.32, as that has been released.
02-21-2010 01:46 PM
Thanks for the comments folks, but this is not VMware, nor does it have much to do with Linux drivers on the blades. The VC module itself needs to drop link to the blades, which it does not. Page 57 of the VC 2.30 user guide:
"Enabling Smart Link configures the network so that if all external links lose their link to external switches, Virtual Connect drops the Ethernet link on all local server blade Ethernet ports connected to that network. This feature can be useful when using certain server network teaming (bonding) configurations."
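To make the quoted rule concrete, here is a minimal Python sketch of the decision it describes. It is only an illustration of the documented behaviour, not VC code; the function name and the list-of-booleans input are hypothetical.

```python
def server_ports_should_be_up(uplink_states):
    """Apply the quoted Smart Link rule to one VC network.

    uplink_states: list of booleans, one per external uplink of the
    network (True = link up). Hypothetical representation.
    The server-facing ports stay up only while at least one uplink is up.
    """
    return any(uplink_states)


# All five EtherChannel members shut down on the upstream switch:
print(server_ports_should_be_up([False] * 5))            # False
# One member still up, so the blades should keep link:
print(server_ports_should_be_up([False] * 4 + [True]))   # True
```

In the setup described in the first post, the first case is the one where the blades should lose link but do not.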
I encourage you to test this in your own configurations.
02-21-2010 03:05 PM
The initial release of Flex-10 did not support SmartLink. VC firmware 2.30 enabled SmartLink through a Device Control Channel (DCC). DCC requires the latest drivers and firmware to work. Make sure you have the latest firmware and drivers for your OS.
02-22-2010 09:18 AM - edited 02-22-2010 12:23 PM
bnx2x versions that I'm using:
version: 1.48.107 (Oracle VM 2.2.0)
version: 1.50.13 (Oracle VM 2.1.5 - we installed the latest driver to get the OS working before 2.2 was released)
Name: HP ProLiant BL685c G6 (A17)
Software Version: 12/09/2009
Can anyone tell me how to read the NIC firmware level from Linux without watching for it during POST?
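One common way to get this on Linux is `ethtool -i <interface>`, which prints the driver name, driver version and a `firmware-version` field. The Python sketch below parses that output. The helper names and the sample text are hypothetical, and the exact format of the `firmware-version` value depends on the driver.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_info(iface):
    """Run `ethtool -i <iface>` and return the parsed fields."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_info(result.stdout)


# Hypothetical sample output, for illustration only.
SAMPLE = """driver: bnx2x
version: 1.52.12
firmware-version: bc 5.2.7
bus-info: 0000:02:00.0
"""

info = parse_ethtool_info(SAMPLE)
print(info["driver"], info["version"], info["firmware-version"])
```

On a real host, `nic_info("eth0")` would return the same kind of dict, assuming `ethtool` is installed and `eth0` exists.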
It could be that we have this configured incorrectly. However, if this is a software feature that tells the NIC the link is down while the physical connection is still up, I can see how the drivers and firmware play a large role.
May I ask which firmware version and driver are known to work?
I ran the NIC updater to get the versions installed:
Found HP NC532i Dual Port 10GbE Multifunction BL-c Adapter
Update Boot Code 4.8.0 to 5.2.7 - y/n/q/c (y):n
Update BRCM_ISCSI 3.1.0 to 3.1.5 - y/n/q/c (y):n
On another 685:
Found HP NC532i Dual Port 10GbE Multifunction BL-c Adapter
Update Boot Code 5.0.11 to 5.2.7 - y/n/q/c (y):n
*** WARNING *** - Installed BRCM_ISCSI is the same version as selected BRCM_ISCSI.
Update BRCM_ISCSI 3.1.5 to 3.1.5 - y/n/q/c (n):n
Also, the latest bnx2x driver from HP is 1.45.19-2 (7 Oct 2008), and I'm running a much newer one. Update: the latest multifunction driver is newer than the 1.50.13 I have installed. Which version works? 1.48 is the stock supported driver from Oracle and Red Hat on 5.3.
02-23-2010 09:43 AM
Does Smart Link only apply to a Shared Uplink Set (SUS)? My network configurations only consist of external uplink ports in LACP. They also do VLAN tagging, but the VC doesn't see that at all, as the blades themselves deal with the tagging.
03-04-2010 11:44 AM
I'm going to open a service ticket about this. Even using bnx2x driver 1.52.12 and code 5.2.7/3.1.5, I have the same issue. Doing a code upgrade to VC 2.32 made boxes with the latest driver and firmware unreachable at various stages of the activation on both the primary and secondary VC, and those were not the times at which Linux saw the link drop.
03-15-2010 08:04 PM
I have the same problem.
The problem appears to surface almost exactly 60 seconds after the Flex-10 is powered on.
It appears to resolve about 20 seconds after the outage begins.
During the outage, the upstream switch doesn't see link.
During the outage, the NIC inside the chassis is suddenly presented with link.
It looks like the Flex-10 is presenting itself as a valid path to the hosts, even though it doesn't have a valid forwarding plane.
If I shut down the external links during the reboot process, the link still presents itself to the host 60 seconds after I power on the Flex-10, but then SmartLink kicks in and disables that link after another 20 seconds.
It seems like a case of the Flex-10 presenting itself to the host before SmartLink is initialized, or any sort of data forwarding, for that matter.
Removing the Flex-10 works great.
Shutting down the external links and causing a non-reboot flap inside the chassis via SmartLink works great.
Re-presenting the external links and load-balancing back onto these links works great.
It's that 60-80 seconds after a reboot of the Flex-10 that is causing me untold amounts of grief.
If somebody could confirm similar behavior, I'd really appreciate it. There are a couple of ways around the situation that I can see, but I'm not in a position to fix it.
1: Have SmartLink on the Flex-10 come online PRIOR to whatever happens at 60 seconds. (Optimal! Initialize your data plane before telling the blades you're good to go.)
2: Have the guest OS not blindly forward down a link that just came up. Give the link 30-60 seconds to present itself before trying to forward down it. (I imagine this could cause problems in some environments. But if a link is flapping, it's better to ignore it completely than to try to use it between bounces. Still, 60 seconds can seem like forever to wait when you're in a failure scenario.)
3: Modify SmartLink to not present a flapping link to the host, similar to option 2. Minor variation.
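For option 2 on Linux bonding hosts, the bonding driver's `updelay` parameter is the existing knob: it delays re-enabling a slave after its link comes back, and the driver rounds it down to a multiple of `miimon`. The Python sketch below only builds an option string to show how the values relate. The helper name and the 30-second default are assumptions, not tested recommendations for this problem.

```python
def bonding_opts(mode="active-backup", miimon_ms=100, updelay_ms=30000):
    """Build a bonding option string with a hold-off on recovered links.

    miimon_ms:  link monitoring interval in milliseconds.
    updelay_ms: time to wait before using a slave whose link came back.
    The values used here are illustrative assumptions.
    """
    if miimon_ms <= 0:
        raise ValueError("updelay only takes effect when miimon is enabled")
    # Mirror the driver's behaviour: round updelay down to a multiple of miimon.
    effective_updelay = (updelay_ms // miimon_ms) * miimon_ms
    return f"mode={mode} miimon={miimon_ms} updelay={effective_updelay}"


print(bonding_opts())
# mode=active-backup miimon=100 updelay=30000
```

The trade-off is the one noted in option 2: a longer `updelay` rides out the false link-up after a module reboot, but also delays legitimate fail-back by the same amount.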
04-12-2010 07:27 AM - edited 04-12-2010 07:30 AM
Ha. Nice to know. This looks like a major programming bug in HP VC.
HP Virtual Connect should not bring up its downstream interfaces before the upstream interfaces are ready to go. The sequence should be:
* Boot the switch. Initialise code. Keep downstream interfaces down.
* Wait until fully synchronised with the slave switch. Wait until ACTIVE/STANDBY and any needed pre-empt have been determined.
* Bring up upstream interfaces. Wait until they are fully LACP negotiated.
* Then detect which VLANs are configured on the uplinks that are ready to go.
* Then bring up the upstream interfaces. Do dummy MAC address flooding if pre-empt is needed, so MAC addresses are learned on the new link.
* Then bring up the PHYSICAL downstream interface.
* Then bring up only those logical interfaces that have VLANs in UP state. Keep the other logical interfaces down (for this last step you need DCC and compatible drivers on the host system).
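The ordering proposed above can be sketched as a simple gate: downstream ports are only raised once every upstream prerequisite holds. This Python sketch illustrates the idea, not how VC firmware is structured; all names, port labels and VLAN IDs are hypothetical.

```python
# Conditions that must all hold before any downstream port is raised
# (hypothetical names for the steps listed above).
PREREQUISITES = [
    "code_initialised",
    "peer_synchronised",
    "active_standby_resolved",
    "uplinks_lacp_negotiated",
    "uplink_vlans_detected",
]


def may_raise_physical_downlinks(state):
    """True only when every prerequisite is satisfied."""
    return all(state.get(name, False) for name in PREREQUISITES)


def logical_ports_to_raise(state, port_vlans, live_vlans):
    """Return the logical downstream ports whose VLAN is up on an uplink.

    port_vlans: dict mapping logical port name -> VLAN ID.
    live_vlans: set of VLAN IDs currently reachable via ready uplinks.
    """
    if not may_raise_physical_downlinks(state):
        return []
    return sorted(p for p, vlan in port_vlans.items() if vlan in live_vlans)


ready = {name: True for name in PREREQUISITES}
booting = dict(ready, uplinks_lacp_negotiated=False)
ports = {"1a": 10, "1b": 20}

print(logical_ports_to_raise(booting, ports, {10, 20}))  # []
print(logical_ports_to_raise(ready, ports, {10}))        # ['1a']
```

With a gate like this, the window described earlier in the thread, where hosts see link before the module can forward, would not open.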