Case Study: Using HP Diagnostics to analyze complex Spring/REST call interactions

Guest post by Ramy Asselin, HP R&D Java Software Engineer

 

This is a true story.

 

I recently started work on a large Spring/REST project involving numerous web application archives (WARs) deployed across several Java virtual machines (JVMs) and developed by different teams.  My Spring REST code was developed, and JUnit test cases were passing (using plenty of Mockito mocks). Next step, let’s do some integration testing.
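For context, the unit tests at that stage looked roughly like the sketch below. It is not the actual project code: RegistrationService and ServiceComponentClient are hypothetical stand-ins, included only to show the JUnit-plus-Mockito pattern (mock the downstream REST client, stub its response, verify the interaction).

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.Test;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;

public class RegistrationServiceTest {

    // Hypothetical wrapper around the downstream REST call
    interface ServiceComponentClient {
        ResponseEntity<String> updateServiceComponent(String id);
    }

    // Hypothetical class under test: delegates registration work to the client
    static class RegistrationService {
        private final ServiceComponentClient client;
        RegistrationService(ServiceComponentClient client) { this.client = client; }
        ResponseEntity<String> register(String id) {
            return client.updateServiceComponent(id);
        }
    }

    @Test
    public void registrationDelegatesToServiceComponent() {
        // Mock the REST client so the test never leaves the JVM
        ServiceComponentClient client = mock(ServiceComponentClient.class);
        when(client.updateServiceComponent("sc-1"))
                .thenReturn(new ResponseEntity<>("ok", HttpStatus.OK));

        RegistrationService service = new RegistrationService(client);

        assertEquals(HttpStatus.OK, service.register("sc-1").getStatusCode());
        verify(client).updateServiceComponent("sc-1");
    }
}

Tests like these pass happily because every remote call is mocked away; none of the timing behavior of the real services is exercised.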

 

After ironing out issues missed by my test cases and fixing integration assumptions, I started getting HTTP 500 errors telling me that something timed out and not much more. Not very useful, is it? So why did the timeout occur?

 

At this point, I’d either need to start sifting through log files or step through the code in a remote debug session. Neither option is very appealing or productive. First, log files contain too much information, much of it irrelevant to the problem you’re trying to solve, while what you actually need often isn’t logged at all. Second, stepping through code involving multiple webapps on multiple JVMs to solve a timing issue is not only tedious, it won’t work, since you’re slowing down the execution.

 

Wouldn’t it be better if a tool could somehow record what happened and show a picture that tells the story?

 

How HP Diagnostics helped

Fortunately, I had experience with HP Diagnostics, a very powerful Java profiling tool with overhead low enough to be used in production. You can download HP Diagnostics here.

 

I installed it and had it instrument the key JVMs, reproduced the timeout issues, and out came the picture I was looking for:

 

 

[Image: HP Diagnostics 1.png]

On the top left you see “Server Requests” with the URL of what I was testing: registration. Below that are several tabs: the Call Profile, along with a few more tabs showing the captured exception stack traces. In the Call Profile tab, you see a “wedding cake” view of what actually happened. The first thing you notice is the colors: red, blue, and green. These represent the different JVMs and/or webapps. Mine is the one in red, which invokes the “Service Component” webapp (blue), which in turn calls the “Management Instance” webapp (green).

 

Identifying the timeout

Looking through the red portion, I started with the yellow-highlighted layer, ServiceComponentResourceManagerClient.updateServiceComponent().

 

The yellow represents an exception. Its stack trace and message are shown in the first exception tab, ReadTimeoutException. So why did it time out? I noticed that the timeout is set to about 2 seconds. (You get a more accurate time in the tree view, which is not captured here.) And how long did the service actually take? Thanks to this cross-VM call profile, that’s an easy question to answer: about 5 seconds. (Good luck figuring that out with log messages or remote debugging!)
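For context, that 2-second limit is the client-side read timeout. The real clients may wire up their HTTP stack differently (which would explain the ReadTimeoutException name), but with a plain Spring RestTemplate the setting looks roughly like this sketch:

import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class ServiceComponentClientConfig {

    // Builds the REST client used for the updateServiceComponent() call.
    // With a ~2-second read timeout, a service call that actually takes
    // ~5 seconds is abandoned by the client long before it completes.
    public RestTemplate serviceComponentRestTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(2000); // ms allowed to establish the connection
        factory.setReadTimeout(2000);    // ms allowed to wait for the response
        return new RestTemplate(factory);
    }
}

Raising setReadTimeout() is the one-line change I experiment with below.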

 

Digging deeper

OK, so why did the service take 5 seconds? Let’s take a look at the blue layers.

 

Notice the three consecutive RestTemplate.doExecute() calls into the green. The third has a ReadTimeoutException, also after about 2 seconds. The Management Instance service call took about 3 seconds and spent most of its time in Object.wait(). Notice that these calls are shaded. This indicates that the methods were not instrumented but were detected using a low-overhead technique called “Stack Trace Sampling”, which finds long-running method invocations. Without looking at the code, I can get a pretty good idea of what’s going on: the Management Instance service is calling OS-level scripts to generate certificates and users.
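I haven’t seen the Management Instance code, but time spent in Object.wait() is exactly what you would expect if it launches an external script and blocks on Process.waitFor(), which parks the calling thread until the child process exits. A hypothetical sketch (the class name and script path are made up):

import java.io.IOException;

public class CertificateGenerator {

    // Launches an OS-level script and blocks until it finishes.
    // The blocking happens inside Process.waitFor(), which is what shows up
    // as Object.wait() in a sampled call profile.
    public int generateCertificate(String user) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/opt/scripts/generate_cert.sh", user);
        pb.inheritIO();               // forward the script's output to our console/log
        Process process = pb.start();
        return process.waitFor();     // waits for the script to exit
    }
}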

 

With this single picture, I have a much better idea of what’s going on: Services are taking longer to process than their respective clients allow them to.  It’s not immediately clear why, but it is possible the VM (or its ESX host) is overloaded, causing these services to take longer than usual to process. There are many possible reasons, and more data is required. 

 

Exploring timeout changes

Out of curiosity, I decided to look at increasing the client timeouts. After all, that’s not much effort and could hopefully provide at least a temporary workaround. Using the latency values from the first call profile, I increased the Service Component Client timeout to 5 seconds and the Management Instance Client timeout to 3 seconds. I got this implemented and deployed in about 2 hours of clock time (based on the call profile timestamps). Here’s the resulting call profile; notice that it looks very different:

 

 

[Image: HP Diagnostics 2.png]

 

A new issue revealed

Unfortunately, this didn’t provide the workaround I had hoped for. First, the red layer completely covers the blue, and the blue covers the green, indicating that the Service Component and Management Instance client calls did not time out, as expected. However, the SCRM Service still had more work to do after calling the Management Instance services: it had to call the Credential Service using the Credential Client, which timed out after two seconds. (Note: time scale not shown.) The Credential Service runs in a different JVM, which I did not instrument, so you don’t see what happened there (it would have been shown in green). On top of that, the code for this service and its client are not in my code repository, so adjusting that timeout value would have been complicated.

 

Interestingly enough, this call profile exposed a new issue: subsequent retries returned HTTP 409 Conflict errors. Why? I don’t know. But send the above call profile to the developer of the Service Component, and they can easily write a unit test for the ServiceComponentService.updateServiceComponent() method that mocks this sequence of client calls. They should have the issue reproduced and resolved in the comfort of their IDE. Doesn’t that make it easier to follow Test-Driven Development?
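Such a test might look like the sketch below. It is only an illustration: the client interfaces, method names, and the use of HttpClientErrorException to represent the 409 are assumptions based on what the call profile shows, not the real code.

import static org.mockito.Mockito.*;

import org.junit.Test;
import org.springframework.http.HttpStatus;
import org.springframework.web.client.HttpClientErrorException;

public class ServiceComponentServiceUpdateTest {

    // Hypothetical clients mirroring the call sequence in the profile
    interface ManagementInstanceClient { void provision(String id); }
    interface CredentialClient { void createCredentials(String id); }

    // Simplified stand-in for the real ServiceComponentService
    static class ServiceComponentService {
        private final ManagementInstanceClient mi;
        private final CredentialClient cred;
        ServiceComponentService(ManagementInstanceClient mi, CredentialClient cred) {
            this.mi = mi;
            this.cred = cred;
        }
        void updateServiceComponent(String id) {
            mi.provision(id);            // Management Instance calls succeed
            cred.createCredentials(id);  // Credential Service call comes back 409
        }
    }

    @Test(expected = HttpClientErrorException.class)
    public void updateFailsWhenCredentialServiceReturnsConflict() {
        ManagementInstanceClient mi = mock(ManagementInstanceClient.class);
        CredentialClient cred = mock(CredentialClient.class);

        // Mock the client-call sequence observed in the call profile:
        // the Credential Client call fails with HTTP 409 Conflict
        doThrow(new HttpClientErrorException(HttpStatus.CONFLICT))
                .when(cred).createCredentials("sc-1");

        new ServiceComponentService(mi, cred).updateServiceComponent("sc-1");
    }
}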

 

Making the job easier

At this point, I still didn’t have the final answer for why the timeouts were occurring. But at least I could productively investigate the issue and discuss it constructively with other developers, QA, architects, and managers so that the root cause could be identified and resolved.

 

HP Diagnostics makes my job easier in three ways:

  1. It automatically captures what’s going on across multiple webapps and JVMs in a single picture that’s easy to understand.
  2. It shows long-running methods without over-instrumentation that would negatively impact the performance I am trying to analyze in the first place.
  3. It visually shows me what happened when a defect occurs, so that I can more easily reproduce the issue with unit tests and mocks before fixing it.

Download a free profiler version of HP Diagnostics for yourself.

 

 

 
