Using Eclipse MAT to track down a ColdFusion server Out-of-Memory Error

Monday, January 11, 2010

Recently, this came across my desk:  “This process is causing ColdFusion to crash on the DEV server. It hasn’t crashed the developer’s local machine. Can you take a look?”

In this post, I’ll show how I came to the answer in this specific case. I’m putting this out there because the solution was both surprising and simple. I’m also writing about it because I want to show how I came to the answer very quickly, rather than spending hours going down a bunch of dark alleys.

The Short Answer

For the impatient, or the people who got here from Google, here’s the conclusion: ColdFusion Server Monitor can, indeed, cause your memory to spike, and crash your server, with “Profiling” turned on. It’s common knowledge that keeping the “Memory Tracking” turned on can crash CF; however, it’s rare that you hear about “Profiling” causing troubles.  In my case, I believe the reason was setting the Database “Slowest Queries” and “Most Frequently Run” Queries to be very aggressive. For Slowest Queries, I had it set to give me the last couple hundred queries of 0 seconds. For Most Frequently run, I changed the default from 20 to at least 100.

To solve the server crash, I simply turned Profiling off.

 

The Journey

For those interested in seeing how I concluded Profiling (or, rather, my aggressive settings) was to blame, read on.

The Usual Suspects

I started by checking the usual OutOfMemory culprits:

  • Debugging on?
  • Memory Monitoring on?

Nope.

I went to server monitor and looked at the database stats: it confirmed my gut reaction, which was that the process was running thousands of queries. This isn’t necessarily a server killer in and of itself, but in my experience, OutOfMemory problems usually lead back to database access. So I went to the source code and checked out the queries that server monitor indicated were frequently run queries. I was looking for the tell-tale server killer: a couple of queries in a loop, and the loop having the possibility of running dozens of thousands of times. What I saw was perhaps questionable, but nothing shocking or easily identifiable as the offender. So, I put this in the back of my head and went to the next step in my process: get a heap dump.

By now, I’m about 10 minutes into the investigation. The 2 most obvious suspects weren’t implicated, and the 3rd most obvious was suspicious but not definitive. Now, at this point, JRun was up over 1GB of RAM, so a heap dump and parsing the dump would take a long time.

Getting a Heap Dump

I RDP’d into the dev server, using “mstsc /admin”. You might need to use /console instead. In my experience, using jmap over RDP usually requires one of these options.

I always have a JDK on the CF Server, so these instructions assume you either have a JDK installed or can install one.

  • Open a cmd, and type: jmap –dump:format=b,file=c:\jrun4\heapdump.bin <pid>, where pid is the windows Process ID of jrun.exe

this will put a heap dump file at c:\jrun4\heapdump.bin. If you want it somewhere else, just change the file param to the jmap command.  Now, we need something to parse this file and show us what’s sucking up all the memory.

Analyzing a Heap Dump with Eclipse MAT

Before I describe the few steps I took, I want to point out Brian Ghidinelli’s excellent set of posts on analyzing memory problems in ColdFusion. My post is meant to be  a quick and dirty description of how I found an answer to a memory problem quickly using Eclipse MAT. Brian’s articles are exhaustive, and I strongly encourage you to check them out if you want to *really* learn how to do this stuff.

Now then: I opened the MAT perspective in Eclipse (this is on 3.5), and I Opened the heapdump.bin file that jmap spit out. It spun for quite some time, and then I got an Out Of Memory error in Eclipse. Ironic, eh? This one was easy to identify, since I was running Eclipse with only 512MB of RAM, and the file I was parsing was over a gig. So I set Eclipse to run with 1350MB of ram by using the-Xmx option in eclipse.ini, and tried to open the file again. This time, it got pretty far, gave an error, but it still had generated some data. It didn’t give the full information one would need for an exhaustive analysis, but it gave me enough. Had I needed more information, I would’ve followed Brian’s instructions in this post for using Eclipse MAT to parse very large files.

Once I had some information on screen, I clicked the “Dominator Tree” link, and this is what I saw:

bigheap

Go ahead. Open that file. Do you see that line at the top… the one with 22 million instances of a single class name consuming almost 1GB of Memory? Yeah… that’s the problem. His name is coldfusion.monitor.stack.CFStackFrame, and he was my huckleberry.

At this point in my investigation, I do not know exactly what that class is, but judging by its name, I’m pretty sure it’s related to the CF server monitor. I open server monitor, turn off just the “Monitoring” option, and re-run the application. It crashes within minutes.

I restart CF, open server monitor, and turn Profiling off. I re-run the application, and it successfully completes. I then turn Monitoring back on – to be sure that it’s isolated to Profiling – and re-run the application. It completes.

If there’s a lesson, it’s:

After you spend a few minutes investigating and ruling out the usual suspects, stop speculating. Obtain data, and let it guide you.

2 comments:

Wim said...

The question is: is this NORMAL behaviour? I mean, this is a bug in CF, isn't it?

Marc Esher said...

Wim, it's a good question, and I don't know the answer. I kind of chalked it up to "you get what you ask for", since I had the profiler settings set so aggressively. Basically, I was asking profiler to record information about all database access. And with there being thousands and thousands of queries, I suppose it shouldn't be surprising.