I spent the whole day today stuck on an issue with jclouds, and I thought it would be a good idea to blog about it so that others don't have to lose as many hours to it.
My goal was to write an integration test that starts a node on Amazon EC2 and installs a service for the test to use. So I created a script that uses curl to download the tarball of the service, unpacks it, and runs the service. So far so good. The problem I encountered was that my invocations of the method runScriptOnNode (a jclouds method for invoking scripts on remote nodes) timed out after waiting for 10 minutes, even though the script needed only about 1 minute and had executed successfully.
Diving into jclouds run script methods
After spending some time making sure that no network issues, such as firewalls, were involved, I decided to examine in depth how the runScriptOnNode method works.
Jclouds uses an initialization script, which installs the target script on the node and invokes it. The initialization script keeps track of the target script's PID and can tell whether the target script has completed its execution. So runScriptOnNode will block for as long as the initialization script reports that the target script is still running.
Where's the catch?
The initialization script keeps track of the target script's PID by executing findPid, which runs ps and grep with a pattern that matches the execution path. That's not a problem by itself, but if you install your service and run it inside the same folder, the initialization script will get confused and won't be able to tell when the target script has finished its execution. As a result, the runScriptOnNode method will block until it times out.
The figure above displays a setup that can have problems. In this setup the init script queries the status of the target script by running ps and using jc1234 to filter the processes. However, if the target script starts a new process under that folder, say in a subfolder named service, the init script will not be able to properly detect when the target script has finished. That's because findPid will now return the PID of the service.
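To make the failure mode concrete, here is a minimal sketch in Java that imitates a ps-and-grep style lookup. It is not the actual jclouds findPid implementation, only a simulation of the matching idea described above; the folder /tmp/jc1234, the PIDs, and the command lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FindPidSketch {

    // Imitates "ps | grep <pattern>": returns the PIDs of all processes
    // whose command line contains the pattern. This is an assumption about
    // how the matching behaves, not the real jclouds code.
    public static List<Integer> findPids(Map<Integer, String> processTable, String pattern) {
        List<Integer> pids = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : processTable.entrySet()) {
            if (entry.getValue().contains(pattern)) {
                pids.add(entry.getKey());
            }
        }
        return pids;
    }

    public static void main(String[] args) {
        String pattern = "/tmp/jc1234"; // hypothetical folder of the target script

        // Problematic setup: the service runs from a subfolder of jc1234.
        Map<Integer, String> ps = new LinkedHashMap<>();
        ps.put(4001, "/bin/bash /tmp/jc1234/target.sh");
        ps.put(4002, "java -jar /tmp/jc1234/service/service.jar");
        System.out.println("target script running:  " + findPids(ps, pattern));

        // The target script exits, but the service still matches the pattern,
        // so the lookup never comes back empty.
        ps.remove(4001);
        System.out.println("target script finished: " + findPids(ps, pattern));

        // Safe setup: the service runs from an unrelated folder.
        ps.put(4002, "java -jar /opt/service/service.jar");
        System.out.println("service moved elsewhere: " + findPids(ps, pattern));
    }
}
```

In the second case the lookup still returns the service's PID, which is why the init script keeps reporting the target script as running. Once the service lives outside the matched folder, the lookup comes back empty as soon as the target script exits.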
Never start a service inside the same folder where the target script is executed; make sure you unpack and run your service from another folder. Even better, use a framework (e.g. Apache Whirr) for installing mainstream services, and only do it by hand if you really have to.
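As a rough sketch of the first recommendation, the script text can be built so that everything happens in a dedicated folder. The helper below is hypothetical, as are the tarball URL, the install folder /opt/service, and the ./bin/start launcher; the resulting string is the kind of script you would hand to runScriptOnNode.

```java
public class InstallScriptSketch {

    // Builds a shell script that downloads, unpacks, and starts the service
    // from installDir rather than from the folder the script itself runs in.
    // The launcher name ./bin/start is an assumption about the tarball layout.
    public static String buildScript(String tarballUrl, String installDir) {
        return String.join("\n",
                "mkdir -p " + installDir,
                "cd " + installDir,
                "curl -L -o service.tar.gz " + tarballUrl,
                "tar xzf service.tar.gz",
                "nohup ./bin/start > service.log 2>&1 &");
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(buildScript("http://example.com/service.tar.gz", "/opt/service"));
    }
}
```

Because the script changes into the install folder before unpacking and starting anything, the service's command line no longer contains the path of the folder the target script runs from.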