12-05-2011 05:57 AM
NNMi Advanced 9.1 in 2 node app failover configuration.
Traffic Extension and Leaf Collector are installed on the NNMi management servers.
The Master Collector is installed on a dedicated NPS.
Netflow is v5.
I am having an issue where the following iSPI Traffic ETL jobs fail, sporadically:
This causes a buildup of Interface_Traffic_*.gz files in the \\.....\perfSpi\datafiles\metric\final directory. If left unchecked over the weekend, we're talking hundreds of thousands of files. The current file count is around 322,000.
When I restart the NNM iSPI Performance ETL Server service on the NPS, the ETL jobs restart and the file count in the above directory starts to decrease. However, it seems the bulk load only processes 12 files per run. I did a little math, and at the current rate of load it would take over 20 days to process all these files.
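For anyone wanting to redo that rough math with their own numbers, here is a minimal sketch. The 322,000 file count and the 12 files per run come from this post; the time between ETL runs is not stated anywhere in the thread, so `minutes_per_run` is a hypothetical parameter you would need to measure on your own NPS. The estimate also ignores new files arriving while the backlog drains.

```python
import math

def estimate_drain_days(backlog_files, files_per_run, minutes_per_run):
    """Estimate days needed to clear a backlog, ignoring newly arriving files."""
    # Number of ETL runs required if each run loads at most files_per_run files.
    runs_needed = math.ceil(backlog_files / files_per_run)
    # Convert total minutes to days.
    return runs_needed * minutes_per_run / (60 * 24)

# 322,000 files and 12 files per run are from the post;
# the 1.0 minute run interval is an assumed value, not a documented one.
default_days = estimate_drain_days(322_000, 12, 1.0)
print(f"Estimated days at 12 files per run: {default_days:.1f}")
```

Raising the batch size in the same call shows how strongly the files-per-run value drives the total time.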
I need to know if there is a way to increase the number of files processed per ETL execution or whether there's any other way to get these files processed quicker.
Also, if anyone has experience with iSPI Traffic ETL jobs failing from time to time and has found a resolution, I'd be happy to hear about it.
I have iSPI Metrics installed and it runs flawlessly. Is anyone else experiencing stability issues with iSPI Traffic 9.1?
12-07-2011 07:28 AM
NNMPerformanceSPI.cfg in D:\ProgramData\HP\HP BTO Software\NNMPerformanceSPI\rconfig contains the settings that helped me process 320,000 files very quickly (around 8 hours... default settings would have been about 20 days).
The settings I changed are:
ETL_MaxChildProcs = 30 (default was 5)
ETL_MaxMetricsFilesPerBatch = 1000 (default was 12)
I also changed the flush record limit on the Master Collector from 10,000 to 50,000, but I'm not sure whether that was truly needed.
I incremented these settings until I reached these final values. So for ETL_MaxMetricsFilesPerBatch I went from 12 to 48 to 100 to 500 to 1000 and was watching CPU, Memory, Disk and Network. Disk spiked to 100% utilization but only for short durations. The master collector seems to be very scalable if you have the right hardware.
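If you want to watch the queue shrink while stepping the values up like this, a small file counter can help. This is only a sketch: `count_backlog` is a hypothetical helper, not part of any HP tooling, and you would pass it the full path of your own perfSpi datafiles\metric\final directory.

```python
from pathlib import Path

def count_backlog(directory, pattern="Interface_Traffic_*.gz"):
    """Count files directly inside `directory` whose names match `pattern`."""
    # glob() is non-recursive here, so only the top level of the directory is scanned.
    return sum(1 for p in Path(directory).glob(pattern) if p.is_file())
```

Calling it every few minutes and comparing successive counts gives a rough files-per-minute drain rate for each setting you try.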
In case anyone is curious about the hardware:
NNM and NPS servers are HP DL380 G7's w/:
2 Intel Xeon 6-Cores @ 3GHz
24GB of RAM
Data Drive: 4 x 1TB 15K RPM SCSI Disks (RAID 10)
1GB SCSI Controller Cache
Windows 2008 R2 Enterprise SP1
12-28-2011 07:02 AM
If you have iSPI Metrics installed, setting ETL_MaxMetricsFilesPerBatch = 1000 will cause the ETL Interface_Health bulk load process to fail with the following error if there is a large queue of Interface_Health files in the PerfSpi share:
ETL.Interface_Health Failed command system("cygpath "D:\ProgramData\HP\HP BTO Software\NNMPerformanceSPI\Interface_Health\temp\U
I discovered this when, for whatever reason, the ETL stopped processing these Metrics files and they backed up for a few days. When I restarted the ETL service on the NPS there were 1,280 Interface_Metric files in the PerfSpi share and ETL tried loading 1000 of the files. I had to set ETL_MaxMetricsFilesPerBatch to 48 for the error to go away.
That said, IMO setting ETL_MaxMetricsFilesPerBatch to 1000 is OK temporarily if you need to clear a large queue of Interface_Traffic* files.
07-15-2012 08:28 PM
Sorry for reviving an old thread, but it looks like we're seeing the same problem - Interface_Traffic files stop getting processed, and they quickly build up. Not quite as many as you've got, but still a lot. At first it's just the Interface_Traffic files, then if it goes on too long, it will stop processing the other files too.
Did you ever find out what was making the jobs fail? Think I'll have to log a case about it.
08-05-2012 05:02 PM
I've made similar changes to my configuration, and it looks like that's resolved the issue with files not being consumed.
The only difference is that I now see the CPU consumption being more "choppy." Previously it was a smoother graph, now it spikes up more every 5 minutes when new files appear in the directory. But then it drops down to a lower level. I'm happy with that, as it's actually consuming all the files now, instead of periodically stopping.