We recently upgraded the OracleAS environments at one of our biggest clients here in Brazil from version 10.1.3.0.0 to version 10.1.3.3.0. To do that, we applied patchset #6148874 (which can be downloaded from Oracle Metalink), carefully following all the instructions in its installation guide. As expected, the patchset applied cleanly, all of our configured containers and applications were still there, up and running, and everybody was happy... until we noticed that one of our oldest and most annoying problems was still there: using ASControl to manage our clustered production environment was sometimes simply impossible, because it became unacceptably slow whenever we tried to access anything in the cluster topology context.
Unfortunately, the only workaround we knew was to restart the ASControl application on the cluster's "master node" (the node elected to run the ascontrol application). Fortunately, today I found both the cause of the problem and a solution.
There's a documented bug on Metalink (bug #6601697) explaining that this problem is caused by the underlying RMI communication between ASControl 10.1.3.3.0 and the other components of the cluster. ASControl uses the RMI protocol to connect to the other nodes, sends a request and then waits for a response. The problem is that this wait has no timeout. If one or more components are under heavy load, for example, ASControl will keep waiting for their responses, which can take a very long time and gives the impression that ASControl has stopped working.
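To make the "waiting without a timeout" idea concrete, here is a minimal sketch in Python. It is only an illustration of the general behaviour, using plain TCP sockets on localhost rather than RMI, and it is not ASControl's actual code. The helper names (start_silent_server, request_with_timeout) are made up for this example. The "server" accepts a connection but never answers, much like an overloaded node, and the client gives up after the timeout instead of blocking forever.

```python
import socket
import threading


def start_silent_server():
    """Start a local TCP server that accepts one connection but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    accepted = []  # keep the accepted socket referenced so the connection stays open

    def accept_once():
        conn, _ = server.accept()
        accepted.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return server, accepted


def request_with_timeout(port, timeout_seconds):
    """Send a request and wait for a reply for at most timeout_seconds."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        # With settimeout(None) the recv() below would block indefinitely.
        client.settimeout(timeout_seconds)
        client.sendall(b"status?")
        try:
            return client.recv(1024)
        except socket.timeout:
            return None  # the peer did not answer in time


server, _accepted = start_silent_server()
reply = request_with_timeout(server.getsockname()[1], 0.5)
print("reply:", reply)
server.close()
```

With a timeout in place the caller gets control back and can report the node as unresponsive. Without one, a single slow peer is enough to freeze the caller.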
In fact, this problem has been known since version 10.1.3.1.0, but it seems that patchset 10.1.3.3.0 does not include the fix as it should. So, for now, the solution is to apply one-off patch #6124143, also found on Metalink. This patch was originally targeted at OracleAS 10.1.3.1.0 but, according to bug #6601697, it's OK to apply it to version 10.1.3.3.0. We just have to follow these simple instructions, found in the bug, before actually applying it:
- Edit the patch file etc/config/actions and replace 10.1.3.1.0 with 10.1.3.3.0
- Edit the patch file etc/config/inventory and replace 10.1.3.1.0 with 10.1.3.3.0
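If you'd rather not edit the two files by hand, the replacement can be scripted. The sketch below is just one way to do it, under a few assumptions of mine: the patch has already been unzipped into a directory, both files are plain text, and every occurrence of the old version string should be replaced. The function name retarget_patch and the ".orig" backup suffix are my own choices and not part of the patch or the bug text.

```python
from pathlib import Path

OLD_VERSION = "10.1.3.1.0"
NEW_VERSION = "10.1.3.3.0"
# Paths relative to the unzipped patch directory, as listed in the steps above
FILES_TO_EDIT = ("etc/config/actions", "etc/config/inventory")


def retarget_patch(patch_dir):
    """Replace OLD_VERSION with NEW_VERSION in the patch metadata files.

    Returns a dict mapping each relative path to the number of
    occurrences that were replaced.
    """
    counts = {}
    for relative in FILES_TO_EDIT:
        path = Path(patch_dir) / relative
        text = path.read_text()
        counts[relative] = text.count(OLD_VERSION)
        if counts[relative]:
            # Keep a backup copy next to the file (assumption: ".orig" suffix)
            path.with_name(path.name + ".orig").write_text(text)
            path.write_text(text.replace(OLD_VERSION, NEW_VERSION))
    return counts


# Example call (hypothetical location of the unzipped patch):
# print(retarget_patch("/tmp/6124143"))
```

A count of zero for either file probably means the patch layout differs from what the bug describes, so it is worth checking the returned numbers before applying the patch.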
Once the patch is successfully applied, we must use ASControl to navigate to the OC4J instance where the ascontrol application is running (typically the "home" instance). Click on Administration -> Server Properties -> Command Line Options -> Start-parameters: Java Options and add "-Drmi.client.connection.timeout=X", where "X" is the number of seconds to wait for an RMI response before timing out. The bug suggests 5 seconds, that is, "-Drmi.client.connection.timeout=5".
We've successfully applied this one-off patch in our OracleAS 10.1.3.3.0 environment and, so far, it seems to have solved the problem.
I suspect many other OracleAS 10.1.3 users face this same problem, so I hope this post can help them.
Cheers and... keep reading!