10-14-2011 01:26 AM
We are working with OpenVMS 8.4 on an HP BL870c. We are having sporadic but painful I/O problems and, following the Performance Tuning manual and since we have plenty of free memory, we are considering modifying the ACP SYSGEN parameters in order to increase the hit percentage of some caches.
We have tested different values online and we have realized that, although there is no fixed maximum limit documented for most of these parameters, in practice there are upper limits for some of them. We executed AUTOGEN, changed the PAGEDYN parameter as suggested by AUTOGEN, and rebooted, but the result is the same. Look at the ACP_HDRCACHE, ACP_DINDXCACHE, ACP_FIDCACHE and ACP_EXTCACHE parameter values after reboot:
$ sear sys$system:setparams.dat "acp_"
set ACP_MULTIPLE 0
set ACP_MAPCACHE 20360
set ACP_HDRCACHE 89920
set ACP_DIRCACHE 16288
set ACP_DINDXCACHE 20360
set ACP_FIDCACHE 81920
set ACP_EXTCACHE 81920
set ACP_QUOCACHE 652
set ACP_SYSACC 29
set ACP_SWAPFLGS 14
sysgen> show /acp
Parameter Name Current Default Min. Max. Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
ACP_XQP_RES 1 1 0 1 Boolean
ACP_REBLDSYSD 1 1 0 1 Boolean
ACP_MULTIPLE 0 0 0 1 Boolean D
ACP_SHARE 1 1 0 1 Boolean
ACP_MAPCACHE 20360 9 2 -1 Blocks D
ACP_HDRCACHE 24384 36 8 -1 Blocks D
ACP_DIRCACHE 16288 22 4 -1 Blocks D
ACP_DINDXCACHE 20360 26 2 -1 Blocks D
ACP_WORKSET 0 0 0 -1 Pagelets D
ACP_FIDCACHE 16384 64 0 -1 File-Ids D
ACP_EXTCACHE 16384 64 0 -1 Extents D
ACP_EXTLIMIT 100 100 0 1000 Percent/10 D
ACP_QUOCACHE 652 64 0 2337 Users D
ACP_SYSACC 29 8 0 -1 Directorie D
ACP_MAXREAD 32 32 1 64 Blocks D
ACP_WINDOW 7 7 1 -1 Pointers D
ACP_WRITEBACK 1 1 0 1 Boolean D
ACP_DATACHECK 2 2 0 99 Bitmask D
ACP_BASEPRIO 8 8 4 31 Priority D
ACP_SWAPFLGS 14 15 0 15 Bitmask D
Are we doing anything wrong? What is the reason it's not possible to raise those four parameters? The AUTOGEN report doesn't mention any problem; in fact, it seems to accept those higher values, but after booting they don't appear.
Any help will be much appreciated.
Thank you very much in advance.
10-14-2011 02:08 AM
I didn't check all of these parameters, but the system cells for them are VMS words, as you can see from the name ACP$GW_HDRCACHE (the GW prefix denotes a global word).
10-14-2011 03:56 AM
The system parameter is limited by the size of the system cell where it is stored. A VMS word consists of two bytes, so for unsigned integers you have a maximum of %XFFFF, which is 65535. When setting a bigger value, for example 89920 (%X15F40), only the two lower bytes are stored into the system cell: %X5F40, which in turn is 24384. That's the value you see for ACP_HDRCACHE, and it's the same value you get from the expression 89920 .AND. %XFFFF.
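The truncation is easy to reproduce with DCL integer arithmetic (just a quick sanity check at any terminal; no privileges needed):

```
$ ! 89920 is %X15F40; a 16-bit word cell keeps only the low word, %X5F40.
$ value = 89920 .AND. %XFFFF
$ WRITE SYS$OUTPUT value                 ! prints 24384
$ WRITE SYS$OUTPUT F$FAO("!XL", value)   ! longword hex: 00005F40
```

The same arithmetic explains the other three observed values: each is the requested value masked to its cell width.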
10-14-2011 04:57 AM
Documentation beyond the system parameter help text (SYSMAN> HELP SYS_PARAMETERS, etc.), the related published documentation, and what's available via Google queries is found only in the OpenVMS source listings; that is, a review of the operating system source code.
I would not automatically assume that the performance manual has been updated to reflect the XFC changes, either. XFC is built on the existing structures, but it has its own controls and displays. (Given that there are all of five (5) references to XFC within that document, there isn't a whole lot of coverage in that manual.)
As for the "sporadic but painful I/O problems", you will want to look at what your application is doing in some detail (and describe to us what you are observing), and particularly characterize what's happening when the performance goes sideways. VMS I/O is very slow and very cautious, and many of the I/O subsystems I've encountered on VMS (like most of the SAN hardware) are far from current speeds and feeds.
With hit rates and I/O caching, if you're thrashing your caches due to file opens or piles of file accesses, or if you're generating logical I/O requests for instance, then it's not at all difficult to produce poor hit rates and ineffective caching. The remedies are faster I/O, reduced I/O, application I/O changes, and potentially implementing changes such as enabling RMS global buffers. But without some idea of what's occurring during these "sporadic but painful I/O problems", all of this is sheer speculation.
http://labs.hoffmanlabs.com/node/632 (finding hot files)
And various other documents and searches.
And if you need more performance-related assistance with this issue, then get somebody familiar with tracing VMS I/O, or ring up HP support and see what they might suggest.
10-14-2011 06:44 AM
Ana, there are many factors related to tuning OpenVMS, and AUTOGEN alone is a conservative tool. If there are system parameters that are "out of tune" (so to speak), AUTOGEN will make adjustments as it sees fit, unless you "tell it" to be more aggressive by adjusting the rules with additions to MODPARAMS.DAT. There are other tools besides AUTOGEN that can tell you whether those ACP parameters are really underallocated. I'd suggest reviewing the MONITOR utility, as there are several "classes" of MONITOR data you can use to watch how the ACP caches are being used. That would show when your system's "hit rate" for those caches is low, which will usually trigger AUTOGEN (with feedback) to increase them conservatively as needed.
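For the XQP caches specifically, the relevant MONITOR class is FILE_SYSTEM_CACHE; a sketch of watching it live while the problem is occurring, and recording it for later analysis (the interval and file name are just examples):

```
$ ! Watch the file system cache hit rates live, sampling every 5 seconds;
$ ! low hit percentages here point at the ACP_* cache sizes.
$ MONITOR FILE_SYSTEM_CACHE /INTERVAL=5
$ ! Or record to a file instead, for replay after the incident:
$ MONITOR FILE_SYSTEM_CACHE /INTERVAL=5 /RECORD=FSCACHE.DAT /NODISPLAY
```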
If your system or cluster is new or your workload has changed dramatically or you've migrated to new hardware it might be beneficial to engage someone with tuning skills to help your situation. There are several tuning tools available (from HP and other vendors) that can be used to generate tuning suggestions. Based on all the factors involved you might find that you're already seeing relatively optimal performance OR that you have room for improvement. The factors related to tuning OpenVMS for your circumstances can be very complex and usually involve more than adjusting one or two system parameters or classes of parameters. The main consideration is that there is NO "fast switch" for OpenVMS and any tuning endeavor becomes a regular, sustained effort at achieving the best balance of combined factors that hopefully achieve the most positive results.
10-14-2011 03:01 PM
>>> when booting, they don't appear
Review the SYSGEN help for WRITE. Note the difference between SYSGEN> WRITE ACTIVE and SYSGEN> WRITE CURRENT.
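Roughly: WRITE ACTIVE changes the running system (dynamic parameters only), while WRITE CURRENT writes the on-disk copy used at the next boot. A sketch (the parameter value is illustrative only):

```
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE                ! start from the running system's values
SYSGEN> SET ACP_DIRCACHE 16288    ! must fit the parameter's word-sized cell
SYSGEN> WRITE ACTIVE              ! dynamic parameters take effect now...
SYSGEN> WRITE CURRENT             ! ...and this makes them survive a reboot
SYSGEN> EXIT
```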
>>> having sporadic but painful I/O problems
It depends. What's the storage behind this system, what's the application, what's the ratio of read to write? A VMS system can easily generate enough I/O to bottleneck a single (raid/mirror) disk. Work out what the application is doing and then plan to resolve the problems. Changing sysgen parameters might be a start. You may be better off relocating files or reconfiguring storage. T4 can be a major help. It depends.
10-14-2011 03:52 PM
> >>> when booting, they don't appear
> Review the help sysgen help for write. ...
And y'all probably shouldn't be making system parameter changes at SYSBOOT, in general. That's what MODPARAMS.DAT and AUTOGEN are intended and used for. Making changes at SYSBOOT or directly in SYSGEN can destabilize a VMS system, as it's AUTOGEN that knows the relationships and the limits among the parameters, not SYSBOOT or SYSGEN. And changes made via SYSBOOT or SYSGEN can get lost if the changes are not (also) reflected within the MODPARAMS.DAT file.
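For completeness, a sketch of how such overrides normally go through MODPARAMS.DAT (the parameter values here are illustrative, not recommendations):

```
! Additions to SYS$SYSTEM:MODPARAMS.DAT (values illustrative only)
MIN_ACP_HDRCACHE = 24000        ! MIN_ sets a floor; AUTOGEN may go higher
MIN_ACP_DINDXCACHE = 20000
ACP_FIDCACHE = 16000            ! a bare assignment is a hard override
```

Followed by something like `$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT FEEDBACK`, so that AUTOGEN recomputes the dependent parameters such as PAGEDYN rather than having them set by hand.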
10-17-2011 01:40 AM
Thanks all for the information provided.
The problematic situation arises when accessing directories with a lot of files and multiple users accessing them, and when performing deletions in them. We have seen very high values in the I/O operation rate (about 1500) on the affected disk. So we are trying to reduce the disk I/O by increasing the main caches' hit ratios.
We collect MONITOR information at 5-minute intervals and, as the problem is sporadic, the average values are, in general, good. The same happens with the feedback information in AUTOGEN, which says that cache hits are at 100%. But the figures for these caches at the moment of the problem are not as good.
According to the ACP parameters help, the only negative effect of increasing these parameters is the memory consumption. As we have a system with plenty of free memory, I want to use part of that memory to increase these parameters to a reasonable amount, and I thought that AUTOGEN could tell me that information. As a test exercise, on a test system, I tried to assign the maximum value for those parameters in MODPARAMS.DAT and ran AUTOGEN to see the changes, or the warnings in case this didn't work. After the AUTOGEN calculations, the only change was, logically, to the PAGEDYN parameter (a great increase, in fact). After changing this value and running AUTOGEN again, there were no more changes. I rebooted, and the message at boot was "Insufficient dynamic memory". Clearly, there were more parameters involved that AUTOGEN didn't report on.
Until now, I have been able to increase these parameters somewhat (to the values shown in my first post), with some improvement in performance. But my question is: is there an easy way to raise these parameters to the highest value that is not painful for my system, without the risk of not being able to boot?
I know this is only a possible solution or improvement to my problem. Of course, I don't rule out reducing the number of files and accesses to the disk, as you have suggested.
Thank you very much in advance.
10-17-2011 01:45 AM
How big is the directory file (xxx.DIR)? Deleting files in a big directory file is a very costly operation, as the .DIR file contents need to be shuffled around. And all other users of that directory will see 'bad performance'.
10-17-2011 03:04 AM
The directory held 163000 files. I know that is a lot of files and that deletion operations are costly but, in our system, this was a special case (a cache directory of a web site with many files). The problem was that, after realizing that the deletion operation was the cause of the performance degradation, we stopped the process that was running the deletion, yet the situation persisted for some time, affecting the overall system.
Anyway, as we have, in general terms, good performance on our systems (it's a cluster behaving, basically, as a web server) and the problem described is difficult to evaluate and sporadic (except for the disk with all the web logs, which has a high I/O rate compared to the others), I want to focus on increasing these parameters to better handle these special cases where the I/O rate of a disk increases sporadically.
10-17-2011 04:34 AM
There are no particular ACP parameters for altering the performance of what OpenVMS now considers large directories; there are only environment and application changes, and OpenVMS upgrades. Your OpenVMS version already has the large-directory fixes that were implemented in V7.2 (and various other RMS and XQP performance tweaks), so an upgrade isn't an option here.
The usual approach for rapidly deleting what OpenVMS considers to be large numbers of files in these directories is the reverse-delete hack. Code your own delete, and delete the files in reverse alphanumeric order, or scrounge up one of the freeware tools that does this. The default forward-delete behavior of the DELETE command is pessimal, given the OpenVMS directory structures.
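A minimal sketch of the reverse-delete idea in DCL (the device and directory specification is a placeholder, and a purpose-built tool or program will be faster still):

```
$ ! List the files, sort the list into reverse order, then delete in
$ ! that order, so each delete removes the *last* directory entry
$ ! instead of forcing the rest of the .DIR file to be shuffled down.
$ DIRECTORY/COLUMNS=1/NOHEADING/NOTRAILING -
      DKA100:[WEB.CACHE]*.*;* /OUTPUT=FILES.TMP
$ SORT FILES.TMP /KEY=(POSITION:1,SIZE:80,DESCENDING) FILES.REV
$ OPEN/READ list FILES.REV
$ loop:
$   READ/END_OF_FILE=done list file
$   DELETE 'file'
$   GOTO loop
$ done:
$ CLOSE list
$ DELETE FILES.TMP;*, FILES.REV;*
```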
Put another way, you're not getting hammered so much on the ACP parameter settings here; you're getting hammered on fundamental design decisions within the application, and on how those decisions interact with fundamental design decisions within OpenVMS and its XQP processing, particularly around directory I/O, and around how VMS doesn't cache this directory data in memory; VMS really, really wants to write the data to disk. When one of these directories gets rebuilt by a mass delete with cluster locking, other hosts can see delays writing to the directories, and I can see SCS locking getting busy with the directory locks. Where you can, split your write processing and avoid using the shared storage, and start work to split up this directory.
Which in aggregate means a move to faster storage hardware (SSD will completely obliterate HDD performance) or a move to a RAM-based pseudo disk, or related I/O changes, or application design changes, or a move to a platform that better fits the needs of these sorts of application designs.
I'd also run a check for hardware and network errors, given the transient nature of this report. That probably isn't the case here, but heavy I/O activity can behave erratically in the presence of lower-level and hardware errors.
10-17-2011 09:45 AM - edited 10-17-2011 09:48 AM
As noted, delete operations on a large directory can be expensive. I supported an application with similar issues. Some options to mitigate this behavior, depending on your budget:
- RAMdisk. Configure your report files on a virtual disk. A DECram license and memory can resolve all sorts of I/O issues.
- Use a search-list logical and spread the I/O across multiple disks and directories. Updating the disk sequence on the fly lets you schedule delete operations on aged directories with minimal user impact.
- Same option, but spread the directories on a single disk. Less cost, and less benefit, but it keeps the directory sizes small. I used a batch job to shuffle directories every 10 minutes. One site scheduled daily deletes after midnight and kept writing report files to new directories.
- Review file allocations and bump the cluster size to something that allows most of these files to fit within one allocation. Faster create and delete times.
- If you have anything else generating I/O on the cache disk, move it.
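The search-list approach above could be set up along these lines (the logical name, device, and directory names are hypothetical):

```
$ ! With a search list, new files are created in the FIRST element,
$ ! but lookups fall through the whole list, so older files stay
$ ! reachable while new ones land elsewhere.
$ DEFINE/SYSTEM REPORT_DIR DKA200:[REPORTS_A], DKA300:[REPORTS_B]
$ ! Rotating the list later moves new file creation to the other disk,
$ ! freeing [REPORTS_A] for an off-hours mass delete:
$ DEFINE/SYSTEM REPORT_DIR DKA300:[REPORTS_B], DKA200:[REPORTS_A]
```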
The 2.x version of our web app skipped the temporary report file and just sent data to the users. More database overhead if they wanted to go back to an old report, but better performance, since the disk bottleneck was relieved.
Don't expect tuning to have a significant impact. Complete this step and get caught up but be ready to move on to the next phase.
10-25-2011 07:25 AM
Hoff>> There are no particular ACP parameters around altering the performance of what OpenVMS now considers large directories
I beg to differ. Yes, there is: ACP_MAXREAD.
That parameter defines the I/O size, in blocks, used during the directory shuffle.
The current default is 32.
You could set it to 64 for temporary relief while working towards a real solution (cleaning out that directory).
(Set it to 1 for the pre-V7.2 behaviour. :-)
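Since ACP_MAXREAD is a dynamic parameter, it can be changed on the running system; a sketch (remember to mirror the change in MODPARAMS.DAT so AUTOGEN doesn't later revert it):

```
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET ACP_MAXREAD 64      ! larger I/Os during the directory shuffle
SYSGEN> WRITE ACTIVE            ! dynamic: takes effect immediately
SYSGEN> EXIT
```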