Are the tools that you are using to test your application affecting the results of those tests?

There is a growing number of LoadRunner and Performance Center users who want to take advantage of the reduced footprint and (possibly) reduced costs of virtual servers acting as Load Generators for their load tests. Support for this was added to LR and PC a few years ago, but the question is: how can we use virtual load generators in a way that doesn't affect the test?

 

How you answer this next question will determine whether running virtual users on virtual servers is a good idea.

 

Question:

Are the tools that you are using to test your application affecting the results of those tests?

 

It doesn't matter if you are using VMs or physical machines. If the answer to that simple question is "yes" then your performance testing environment is not set up correctly. The results you get from that environment may be misleading and you need to re-evaluate your plans.

 

Below are some edited comments on this subject that I recently posted on the Performance Center forums.

 

There are two factors that no one seems to be discussing, and both are related to how controlled your testing environment is (physical or virtual) and how reproducible your tests are.

The first factor is "noise". Noise is defined as any kind of activity (network, CPU, memory, or disk) that is either

a) from an undetermined source, or

b) not easily reproducible.

 

When you write a script you are defining specific steps to be taken in a specific order so that you can see how your application's servers respond. When you run those scripts, the results represent the response times of the servers during that test run. If anything interferes with the process, the results could be tainted. If we can't reproduce those exact same interferences, there is a gap in our testing process and there is a chance we will miss a performance-related issue.

 

When we first started doing load testing, noise would come from things like jobs that ran on the server in the middle of our test, or a garbage collection process that kicked in once a server hit a certain threshold (whether our load tests were the cause of hitting those thresholds was something that needed to be determined). We would recommend running load tests after hours to minimize the noise that other users generated on the network during the day. We would also recommend suspending server backup or data copy processes that might interfere with the results we collected during the official test runs that were performed overnight. Since we were rarely given a pure load testing environment where we could completely control the noise, we had to do what we could to minimize it.

 

In the world of physical load generators we have to worry about network and server noise, but we rarely have to consider the noise on the individual load generator system (beyond monitoring the CPU and memory utilization consumed by our virtual users). To minimize the noise we make our LGs dedicated, give them powerful CPUs with lots of RAM, and kick off anyone else who wants to use the system during our tests. The key is making sure that our tools collect the data but do not have an effect on that data.
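As a sanity check on that principle, here is a minimal sketch (Python, assuming the psutil package is installed on the generator) of the kind of watchdog we would run on a dedicated LG during a test to confirm the tool is collecting the data rather than becoming part of it. The thresholds are illustrative assumptions, not LoadRunner defaults:

```python
# Minimal LG self-check: log the generator's own CPU and memory during a
# run and flag any sample where the tool itself is under pressure.
import time
import psutil

def watch_generator(duration_s=300, interval_s=5, cpu_alarm=80.0, mem_alarm=80.0):
    for _ in range(duration_s // interval_s):
        cpu = psutil.cpu_percent(interval=interval_s)  # blocks for one interval
        mem = psutil.virtual_memory().percent
        flag = "  <-- generator under pressure" if cpu > cpu_alarm or mem > mem_alarm else ""
        print(f"{time.strftime('%H:%M:%S')} cpu={cpu:.0f}% mem={mem:.0f}%{flag}")

if __name__ == "__main__":
    watch_generator()
```

If the flag trips during an official run, the generator itself has become a source of noise and the results from that run are suspect.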

 

In the world of virtual load generators, external noise is still something that needs to be considered. But now noise can also come from the server that is hosting your virtual load generator. Every other virtual machine running on an ESX server is generating CPU, memory, network, and disk noise. Reproducing any of this noise is going to be extremely difficult, if not impossible. The more noise there is in a load test, the less likely that load test will represent the results you will see in production. It becomes much more likely that our tools didn't just collect the data but actually affected it.

 

The more noise there is (regardless of its origin) in a load test, the less likely you will be able to re-run a test and see the same results, which makes narrowing down problems more difficult. As load testers, we are scientists: we come up with a hypothesis (my server can handle 1,000 users) and then set up experiments (scripts and environments) to either prove or disprove that hypothesis. The more our lab environment is tainted, the more it is polluted with noise, slow servers, or incorrect methods, the more suspect the findings gathered in that environment will be. It isn't a matter of whether you can use virtual load generators; of course you can. It is a matter of whether you should.

 

If your customer understands the differences and the potential consequences, and ultimately decides to go with virtual load generators, they can make their tests more accurate by doing whatever they can to minimize the noise, wherever that noise originates.

 

The second factor that we have to concern ourselves with is extrapolation.

 

We are all familiar with the concept that our test environments are rarely mirrors of the production environment. We don't have the same number of servers, and the servers in the testing environment are not as powerful.

 

The more differences you have between your production and test environments, the more difficult the process of extrapolation is. Extrapolation is made even more difficult when you have many different layers of differences. Part of the process of developing a good load test is documenting all of these differences at every level. Then you need to develop a load test whose results can be accurately extrapolated to represent the performance that will be seen on the production system.

 

Until the advent of virtualization, we never really needed to consider extrapolating our performance testing environment. But now you must consider that a virtual Windows 2008 R2 server with 4GB of RAM is not nearly the same as a physical Windows 2008 R2 server with 4GB of RAM, so you must determine at what number of virtual users the tools interfere with the results. Some say it is a blanket 50 percent fewer users (if the physical LG can run 1,000 virtual users, the virtual LG will be able to run 500). I suggest that this is an unknown percentage and that you will need to test to determine each individual virtual server's percentage, as sketched below. Remember that this number could be a moving target when multiple LGs are hosted on the same physical ESX server, and it could be further complicated by other virtual machines running on that same host and their level of usage at the time you are testing each VM's capacity.
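To make that per-server testing concrete, here is one possible shape for it. This is a hypothetical sketch in Python; run_test_at() stands in for whatever mechanism you use to kick off a run at a given user count and read back the LG's CPU and a response-time percentile. It is an assumption for illustration, not a LoadRunner or Performance Center API:

```python
# Step the user count up on a single load generator and record the highest
# count it sustained without the tool distorting the measurements.
def find_lg_capacity(run_test_at, max_users=1000, step=100,
                     cpu_limit=80.0, baseline_p90=None, slowdown=1.10):
    """run_test_at(users) -> (lg_cpu_percent, p90_response_seconds)."""
    last_good = 0
    for users in range(step, max_users + 1, step):
        cpu_pct, p90 = run_test_at(users)
        if cpu_pct > cpu_limit:
            break  # the generator itself is saturating
        if baseline_p90 is not None and p90 > baseline_p90 * slowdown:
            break  # response times drifted beyond tolerance vs. the baseline
        last_good = users
    return last_good
```

Run the same procedure against a physical LG first to establish the baseline, then against each virtual LG; the ratio between the two results is that server's real percentage, not a rule of thumb. And because the host's load mix changes, repeat it whenever the neighboring VMs do.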

 

Virtualization has made the job of predicting performance much more difficult. If you are not careful, you can completely invalidate your entire testing effort. The question was asked, "What are the best practices?"

1) Do not use virtual load generators.

2) Use an isolated network.

3) Have your testing environment exactly match your production environment.

 

Since it is unlikely you can implement even one of these, let alone all three, we have to make do with what we are given. That means taking into consideration things like noise, taking the time to properly determine the capabilities of our testing environment, and then accurately extrapolating the results to predict the performance of the production systems.

 

There is a publicly available video that was put together by Mark Tomlinson, a leader in the performance testing space by any measure. The video (which can be viewed here: http://h20621.www2.hp.com/video-gallery/us/en/382f043d0c0ed471856dadcb17de423511fec645/r/video ) should be considered required viewing for anyone who is thinking of using VMs as load generators. The video is the best explanation of a very specific problem with using virtual load generators known as "clock drift". However, because of the video's singular focus, it could be inferred that this is the only issue customers might face, so I would add the following:

 

This is a great video discussing the issue of clock drift, and I believe it actually fully supports what I was saying about testing your environment's abilities and making sure that the tools do not affect the results of the test. There are, however, some assumptions made in this video that are not explicitly mentioned and that can affect the results of a test.

 

Shared vs. dedicated network interfaces: This video seems to assume dedicated NICs. A much more common configuration is shared NICs. If you have an ESX server with four NICs and you run 20 virtual machines on that ESX server, you have an average of five virtual machines per NIC. Large-payload tests can cause collisions not just at the switch level but also at the NIC level, as the quick calculation below illustrates.
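A back-of-the-envelope check makes the point; every figure below is an assumption for illustration only:

```python
# Average uplink share per VM on a shared-NIC ESX host (illustrative numbers).
nics = 4                 # physical NICs on the ESX server
nic_mbit = 1000          # 1 Gbit per NIC
vms = 20                 # virtual machines sharing those uplinks
share_mbit = nics * nic_mbit / vms
print(f"Average uplink share per VM: {share_mbit:.0f} Mbit/s")  # -> 200 Mbit/s

# A large-payload test pushing, say, 300 Mbit/s from one LG VM is already
# past its average share; the resulting contention is noise the test
# itself is creating.
```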

 

ESX server CPU: This was briefly mentioned in the video, but it is at the core of my concerns about using virtual load generators. While ensuring that the CPU on your virtual machines doesn't exceed 85 percent (something that you shouldn't do with physical servers either), you must also consider the overall CPU usage on the ESX server (monitoring of which is not automatic in LR or PC). Just adding more VMs will not solve the problem of an ESX server that is overloaded.
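Since LR and PC won't gather those host-level numbers for you, you have to collect them yourself and correlate them with the test timeline. Here is a minimal sketch using the pyVmomi library; the host name and credentials are placeholders, and the unverified-SSL context is a lab shortcut. Treat it as a starting point, not a finished monitor:

```python
# Poll the overall CPU of an ESX host alongside a load test, via the
# vSphere API (pyVmomi). Prints a utilization percentage every 10 seconds.
import ssl
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()              # lab only; use real certs
si = SmartConnect(host="esx01.example.com",         # placeholder host
                  user="monitor", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = view.view[0]                              # first (or only) ESX host
    hw = host.summary.hardware
    capacity_mhz = hw.cpuMhz * hw.numCpuCores        # total host CPU capacity
    for _ in range(60):                              # ~10 minutes of samples
        used_mhz = host.summary.quickStats.overallCpuUsage
        print(f"{time.strftime('%H:%M:%S')} host CPU "
              f"{100 * used_mhz / capacity_mhz:.1f}%")
        time.sleep(10)
finally:
    Disconnect(si)
```

If the host sits near saturation while each individual LG VM looks healthy, the 85 percent rule on the VMs has told you nothing.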

 

Shared disk IO: All of your virtual servers are most likely writing to a common, or at least similar, location. While I admit it would take a very large test to exceed the IO constraints of a modern ESX server, this scenario isn't out of the realm of possibility. This is especially true when you consider all the IO needs of the VMs that are not doing load testing.

 

Here is a possible scenario that I could see small to mid-sized companies running into. You decide to purchase a 25-50K ESX server that has six processors, has 60GB of RAM, and supports 25 VMs with 2GB of RAM each. Right away you load up 20 VMs, thinking that will solve all of your issues and allow you to retire 20 physical machines. Now, if five of those old physical machines were your old load testing environment and you believe that you can just transplant the existing physical performance testing environment onto the virtual environment, my theory is that you will see dramatically different testing results and be displeased with your new configuration. You may also be upset that you spent so much money for bad results.

 

Your first option would be to watch Mark's very good explanation of the problem, then go back and test the new environment, making sure that the CPU of the individual VMs is not exceeding 85 percent. Then, per the instructions, you add more VMs to compensate. Now your old one controller and four generators become one controller and eight generators, and all of a sudden you are at the 25-VM max (time to drop another 25-50K on another ESX server? That is maybe possible for big companies, but it is going to be a hard sell for smaller and mid-sized ones).

Even when the clock drift issue is resolved, the concept of noise is still a concern. You have 15 other servers running out there competing for the ESX server's complement of processors, RAM, NICs, and disk IO, and you cannot easily replicate the activities on those servers through the load testing process.

 

Can you have a dedicated or nearly dedicated ESX server that houses your Load Testing tools? Great! Then I see very little concern for issues that cannot be addressed by balancing the load and monitoring the overall performance of the ESX server.  But how many performance testing teams get their own ESX server?

 

Ultimately, I believe that this comes down to one maxim that has been a part of the scientific community for hundreds, maybe even thousands, of years.

 

You must ensure that the tools used to conduct experiments do not affect the results of those experiments; otherwise the tests are not pure, and any results or conclusions arrived at based on those tests are suspect.

 

Download LoadRunner to test up to 50 virtual users for an unlimited time and be part of the largest pe...

 
