11-07-2011 04:41 AM - edited 11-07-2011 05:07 AM
I have some problems with the configuration poll of some Cisco VSS clusters.
The first sign that something was wrong was occasional false alarms (node down) from those devices.
While investigating the cause, I saw that the configuration poll of a VSS cluster hangs at the point where NNMi retrieves the FDB table from the switch. After 600 seconds it times out and I get the false alarm.
It looks like the SNMP walk brings down the whole SNMP service on the VSS, so the State Poller gets no answer and reports the node as down.
And now the device has been in the state "Rediscovery in Progress" for about three hours. That can't be normal behavior, can it?
Is there a known problem with such devices? I don't know whether this is a problem on the NNMi side or the VSS side.
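One way to reproduce the hang outside NNMi is to walk the same FDB table with the net-snmp tools and watch where it stalls. This is only a sketch: the target address and community string below are placeholders, not values from this environment.

```shell
# BRIDGE-MIB dot1dTpFdbTable -- the table NNMi retrieves when the poll hangs
FDB_OID=".1.3.6.1.2.1.17.4.3"
TARGET="10.100.102.1"   # placeholder: substitute a VSS member address
CMD="snmpwalk -v 2c -c public -t 10 -r 1 $TARGET $FDB_OID"
echo "$CMD"
# To actually test, run it under 'time' and see whether the walk
# completes, stalls partway through, or times out:
#   time $CMD
```

If the walk itself stalls, the problem is on the device side; if it completes quickly, the issue is more likely in how NNMi queries or processes the table.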
11-07-2011 05:09 AM
I did some more tests with other devices. All configuration polls hang and never finish.
For example, on a Cisco 2960 switch it breaks after the FdbAnalyzer has completed.
This looks like a bigger NNMi problem now.
11-07-2011 05:28 AM
Community configuration --> specific node configuration.
Check each community string that NNMi uses per node; sometimes NNMi picks the wrong string in the specific node configuration.
Check it and retry the configuration poll.
11-07-2011 05:47 AM
I checked the communities; they are all okay.
State polling is working at the moment, and the communication configuration of each device is correct.
The configuration poll starts fine and breaks at a specific point, so I don't think it's a community problem.
Is it possible that NNMi has an internal problem with the data from the polling?
In the system health report I saw a lot of late responses in the SNMP Health Agent from two of our VSS clusters.
11-07-2011 05:53 AM
Oh, OK, sorry. I faced a similar problem with a community-string mismatch in the specific node configuration, where I too was only able to poll through the command line.
11-08-2011 01:59 AM
No problem. Thanks for your reply, Bharath.
Today I have a Minor error on the Disco Health Agent.
There are 199 devices that have been stuck in the discovery process for 24 hours.
That means the configuration poll of the VSS cluster yesterday at 10:37 AM caused the whole discovery agent to get stuck.
Since then I have been seeing messages like these in jbossServer.log:
2011-11-07 10:37:12,859 WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_58] - Abort of action id 41a020f:b3f1:4eb65149:19cb80f invoked while multiple threads active within it.
2011-11-07 10:37:12,863 WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.CheckedAction_2] - CheckedAction::check - atomic action 41a020f:b3f1:4eb65149:19cb80f aborting with 1 threads active!
2011-11-07 10:37:13,085 WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_58] - Abort of action id 41a020f:b3f1:4eb65149:19cbb6b invoked while multiple threads active within it.
2011-11-07 10:37:13,085 WARN [com.arjuna.ats.arjuna.logging.arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.CheckedAction_2] - CheckedAction::check - atomic action 41a020f:b3f1:4eb65149:19cbb6b aborting with 1 threads active!
In another thread I read about a bug with ID QCCR1B49172 in 9.01 Patch 2.
But I have Patch 4 installed, where this defect should already be fixed.
11-08-2011 08:57 AM
If your Cisco devices reply normally to an snmpwalk, then maybe they are unhappy replying to SNMP GetBulk requests. By default NNMi uses GetBulk, since in general it's a more efficient way of retrieving information.
As a test, can I suggest that you open "Communication Configuration" and select the "Specific Node Settings" tab? Add an entry for one of the problem devices, fill in the settings you wish for it, but make sure you uncheck the box "Enable SNMP GetBulk". Save and close those windows, then select the device and check its communication settings just to be sure that it has picked them up.
Then try a configuration poll a couple more times to see if there is any difference.
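You can also observe the GetNext vs GetBulk difference directly with the net-snmp tools: snmpwalk issues one GETNEXT per row, while snmpbulkwalk uses GETBULK to fetch many rows per request. A sketch, with placeholder host and community:

```shell
# Same subtree retrieved two ways; -Cr sets max-repetitions for GETBULK
OID=".1.3.6.1.2.1.17.4.3"   # BRIDGE-MIB dot1dTpFdbTable
HOST="10.100.102.1"          # placeholder address
WALK="snmpwalk -v 2c -c public $HOST $OID"
BULK="snmpbulkwalk -v 2c -c public -Cr25 $HOST $OID"
echo "$WALK"
echo "$BULK"
```

If the plain walk succeeds where the bulk walk stalls or returns truncated data, that points to a GetBulk problem in the device's SNMP agent, which is exactly what unchecking "Enable SNMP GetBulk" works around.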
You might also consider that some devices may work better using a specific SNMP version. For example, maybe SNMPv2 communication is an issue with certain devices, in which case it might be worth testing the various versions using the same method as above (in the specific node settings).
If the problem appears to lie only in the processing of bridge/FDB tables, then it's possible that the issue only shows itself when processing per-VLAN bridge tables. NNMi sends directed requests to the device using the special community strings derived from the VLAN information that the device reports.
You could also try running snmpwalks against the device for the bridge tables, using the various special community strings (commstr@vlanid), to see whether any of them cause the hang you describe.
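The per-VLAN strings are simple to generate: Cisco's community-string indexing appends "@" and the VLAN ID to the base read community. A sketch, with a placeholder community and VLAN list (in practice, take the VLAN IDs the device actually reports):

```shell
# Build the commstr@vlanid variants and (optionally) walk the bridge
# table with each one to find the VLAN context that hangs.
COMMUNITY="public"   # placeholder base read community
VLANS="10 20 30"     # placeholder VLAN IDs
for v in $VLANS; do
    echo "${COMMUNITY}@${v}"
    # snmpwalk -v 2c -c "${COMMUNITY}@${v}" <device> .1.3.6.1.2.1.17.4.3
done
```

Whichever VLAN context causes the walk to stall is the one NNMi's FDB retrieval would also get stuck in.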
Following on from this, 9.01 P3 (which you already have) introduced a new file, disco.NoVLANIndexing. This lets you switch off per-VLAN indexing for a device. It does mean that you will not get connections derived from the bridge tables, but that tends not to be an issue for most Cisco devices, since they use CDP or LLDP instead. You can see the details of how to use this file in the 9.01 deployment guide, in the section entitled "Suppressing the Use of VLAN-indexing for Large Switches". Let me know if you have problems getting your hands on the manual (you can find them all at http://support.openview.hp.com).
So, a few things to try there; I hope this helps.
If you find that this or any post resolves your issue, please be sure to mark it as an accepted solution.
06-22-2012 07:40 AM
We are seeing the same issues here with our production and test NNMi 9.11 Patch 3 environments. We disabled SNMP GetBulk as suggested in this thread and also tried the NoVLANIndexing approach in test. So far nothing seems to be helping. Did you ever find a solution for the issue you were seeing? Thanks in advance for any help you can offer!
08-17-2012 03:08 AM
No, I have no final solution for that, sorry!
I changed the polling cycles a little and extended the rediscovery cycle from 1 day to 5 days. That worked better until last week, but now I see the same issue again.
I will try the solution with suppressing VLAN-indexing and post a short reply about the results.
My feeling, though, is that I need to upgrade to 9.1x or 9.2x to get better discovery performance.
But that is not so easy for me to do at the moment.
08-21-2012 04:27 AM
I tried to implement the disco.NoVLANIndexing file with two IP ranges of large Cisco 6500 switches to exclude them from the FDB analysis.
I created the file under /var/opt/OV/shared/nnm/conf/disco/
and added entries for the two subnets I want to exclude: 10.100.102.0/24 and 10.100.96.0/22.
Is that the right notation for the file? And which owner should the file have? bin:bin, like the other files under the "disco" folder?
After restarting the services, a configuration poll still returns FDB data from one of the devices in those subnets. So it seems it is not working...
Thanks a lot for a short reply!
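For reference, the deployment guide's examples appear to use NNMi's wildcard range notation rather than CIDR prefixes, so a /24 and a /22 might look like the sketch below. This is an assumption based on the notation NNMi uses elsewhere for address ranges; the "Suppressing the Use of VLAN-indexing for Large Switches" section of the deployment guide is the authoritative source for the exact syntax.

```
10.100.102.*
10.100.96-99.*
```

(10.100.96.0/22 spans 10.100.96.0 through 10.100.99.255, hence the 96-99 range in the third octet.)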
08-21-2012 06:35 AM
I found the reference pages in another thread here in the forum.
I didn't use the right notation for the IP addresses.
I will try it again!
08-22-2012 03:54 AM
Running nnmnoderediscover.ovpl with -all, or against the specific node, is the only solution I have found for this issue whenever it gets stuck in the middle or stays pending in "rediscovery in progress". Just check it out!
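For anyone searching later, the two invocations mentioned above would look roughly like this. The -node form and the node name are assumptions on my part; check the tool's reference page on your own NNMi system.

```shell
# Queue a rediscovery of every node, or of one stuck node only.
NODE="vss-cluster-1"   # placeholder node name
ALL_CMD="nnmnoderediscover.ovpl -all"
ONE_CMD="nnmnoderediscover.ovpl -node $NODE"
echo "$ALL_CMD"
echo "$ONE_CMD"
```

Rediscovering a single node is the gentler option when only one device is wedged, since -all re-queues the entire inventory.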
08-24-2012 12:45 AM
The FDB filter is working now.
Before the implementation I got more than 4500 entries on some switches; now I get only about 120.
That looks okay at first glance. Now I have to wait and see whether discovery and SNMP polling are more stable than before.
I'll keep this thread updated.