08-07-2013 06:28 AM
We have NNMi 9.1 installed in our environment on Windows 2008. We recently had a power outage in a region with routers and multiple switches. While the entire site was down for some time, only one device [Secondary Router] was alerted in the NNMi Incident view. Ideally, all devices in the region should be alerted as down, right?
Can anyone please provide a solution?
08-07-2013 07:27 AM
NNMi should create an incident for each device that is down.
A new incident is only created after the device has been polled (default 5 minutes) and after the dampening period (default 6 minutes) has passed.
If the outage is short, it may not last long enough for a new incident to be created.
If that is your case, you can lower the dampening value to detect node down incidents more quickly.
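To make the timing concrete, here is a rough sketch of the arithmetic, using the 5 minute polling and 6 minute dampening defaults quoted above. It is a simplified model and not NNMi's actual scheduler: it assumes a failure is noticed at the next poll and that the incident only becomes visible once the dampening period has passed. The function names are made up for illustration.

```python
from datetime import timedelta

# Defaults as quoted in this thread; check them against your own NNMi configuration.
POLL_INTERVAL = timedelta(minutes=5)
DAMPENING = timedelta(minutes=6)


def incident_window(poll_interval=POLL_INTERVAL, dampening=DAMPENING):
    """Return (best_case, worst_case) delay between a failure and a visible incident.

    Simplified model (an assumption): the failure is noticed at the next poll,
    then the incident is held back for the dampening period.
    """
    best = dampening                   # device fails just before a poll
    worst = poll_interval + dampening  # device fails just after a poll
    return best, worst


def outage_guaranteed_visible(outage, poll_interval=POLL_INTERVAL, dampening=DAMPENING):
    """True if the outage outlasts the worst-case delay in this model."""
    _, worst = incident_window(poll_interval, dampening)
    return outage >= worst


best, worst = incident_window()
print(f"best case: {best}, worst case: {worst}")
```

Under these assumptions, an outage shorter than about 11 minutes may never show up as an incident, and lowering the dampening value shrinks that window.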
08-07-2013 07:45 AM
That assumption is incorrect; it is working as designed. Root cause analysis determined that the single device that was alerted on was the cause of the remaining devices becoming isolated. The isolated devices would have turned blue, meaning unknown.
Unless you put devices into the "Important nodes" node group, topological connectivity is used to determine the status of every node.
Andy Kemp, CISSP
08-07-2013 08:34 AM
I also thought it might be due to RCA done on the network and, as you mentioned, all the other devices were in Unknown status.
But the issue was not caused by the router being down; it was a power failure across the entire region. So it is probably not the case that only that router lost its power supply and the other devices went down as a result. If the power supply for the entire region goes down, all devices must alert, right?
And please explain a bit about the Important Nodes group.
08-07-2013 09:00 AM - edited 08-07-2013 09:01 AM
When it uses topology to determine whether a node is isolated, it is because it has identified how devices are connected via L2 and L3 sources. It keeps track of this and updates its model every time discovery or spiral discovery collects information. Essentially, it matches the information available in table form on each device with the tables from all the other devices.
Logically, when the power went out and all of those devices became unreachable, NNMi was telling you that it knew something was wrong and that the most likely point to investigate was the last logical hop it could reach. It does not assume that the nodes past that point are down, because it cannot poll them directly.
The Important nodes group sidesteps this logic: if any node in that group becomes unreachable, it will alarm as node down.
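As a rough illustration of that logic, here is a small sketch. It is not NNMi's actual causal engine, only a toy model of the idea: walk the topology from the management server through devices that still respond, report an unresponsive device as down if it sits directly behind a reachable one, and report everything further out as unknown unless it is in the important set. The device names, the `classify` function and the status strings are all made up for the example.

```python
from collections import deque


def classify(topology, manager, responding, important=frozenset()):
    """Toy root cause analysis over an undirected adjacency map.

    topology:   dict mapping each node to a list of its neighbours
    manager:    the node the polling is done from
    responding: set of nodes that still answer polls
    important:  nodes that must be reported as down whenever unreachable
    """
    # Breadth-first walk from the manager, only through responding nodes.
    reachable = {manager}
    queue = deque([manager])
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, ()):
            if neighbour in responding and neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)

    status = {}
    for node, neighbours in topology.items():
        if node == manager:
            continue
        if node in reachable:
            status[node] = "up"
        elif node in important or any(n in reachable for n in neighbours):
            # First unreachable hop behind a reachable device, or an important node.
            status[node] = "down"
        else:
            # Isolated behind a down device: cannot be polled, so state is unknown.
            status[node] = "unknown"
    return status


# Hypothetical region: everything behind "region-router" loses power.
topology = {
    "nnmi": ["core"],
    "core": ["nnmi", "region-router"],
    "region-router": ["core", "switch-1", "switch-2"],
    "switch-1": ["region-router"],
    "switch-2": ["region-router"],
}
responding = {"core"}

default_view = classify(topology, "nnmi", responding)
with_important = classify(topology, "nnmi", responding, important={"switch-1"})
```

In this model `default_view` reports only `region-router` as down and both switches as unknown, while `with_important` also reports `switch-1` as down, which mirrors what the Important nodes group does.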
Andy Kemp, CISSP
08-08-2013 08:25 AM
Let me clarify something here.
Is the router that alerted the one that connects all devices in this region?
08-09-2013 04:18 AM
Thanks a lot for the explanation, AndyKemp.
Let me rephrase: NNMi checks the availability of devices by following the topology and communicating with the next connected device to determine its availability.
When the entire site went down, NNMi checked the region's availability by going through the primary router in the region. When it found that the router was down, it was unable to check the other devices connected to it, so they turned to Unknown status. That is the reason only one device alerted and the others didn't.
@pafreire: Yes, it was the router that connects to all devices in that region.