LAMPlights Personal anecdotes from my experiences using the LAMP stack

11 Apr 2011, 5 comments

Retrying Failed Gearman Jobs

The gearman job queue is great for farming out work.  After reading a great post about Poison Jobs, I limited the number of times the gearman daemon will retry a job.  This seemed fairly straightforward to me: if a job fails, the gearman daemon retries it up to the specified number of times.  I learned the hard way that it is not that simple.  The gearman daemon follows specific criteria when deciding whether to retry a job.

This all came about when I noticed a particular gearman worker was throwing an uncaught exception under certain conditions.  I assumed that an uncaught exception would cause the gearman daemon to retry the job.  I found out that not only did gearman not retry the job, the client was receiving a return code of GEARMAN_SUCCESS.  In other words, the client had no idea the worker was blowing up.

The GearmanJob class provides some methods to inform the gearman daemon of the result of a job.  They are primarily used for synchronous jobs.  The sendComplete method will cause the gearman daemon to send a return code of GEARMAN_SUCCESS to the client and can also be used to pass data back to the client.  The sendFail method will cause the gearman daemon to send a return code of GEARMAN_WORK_FAIL.  This may seem fairly obvious, but it is important to note that calling sendFail will not cause the job to be automatically retried.  The client code has to recognize a return code of GEARMAN_WORK_FAIL and decide whether or not to submit the job again.
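Since the daemon will not retry after a sendFail, the retry has to live in the client.  Here is a minimal sketch of what that could look like.  It assumes pecl/gearman 1.0 or later, where GearmanClient::doNormal() replaced the older do() method; retryOnFail() and $maxAttempts are names made up for this example, not part of the extension.

```php
<?php
/**
 * Run a job synchronously and resubmit it when the worker reports failure.
 * retryOnFail() and $maxAttempts are hypothetical names for this sketch.
 */
function retryOnFail(GearmanClient $client, string $function, string $workload, int $maxAttempts = 3)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $client->doNormal($function, $workload);
        $code = $client->returnCode();

        if ($code === GEARMAN_SUCCESS) {
            return $result;
        }
        if ($code !== GEARMAN_WORK_FAIL) {
            // Something other than sendFail() happened; do not retry blindly.
            throw new RuntimeException("Job $function returned code $code");
        }
        // GEARMAN_WORK_FAIL: the daemon will not retry for us, so loop again.
    }

    throw new RuntimeException("Job $function failed after $maxAttempts attempts");
}

// Usage (needs the gearman extension and a running gearmand):
// $client = new GearmanClient();
// $client->addServer();
// echo retryOnFail($client, 'reverse', 'hello');
```

Capping the attempts on the client side plays the same role as the daemon's retry limit: a poison job cannot loop forever.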

Then there is the sendException method, which will cause the gearman daemon to send a return code of GEARMAN_WORK_EXCEPTION to the client.  Do not make the mistake I did by thinking this will implicitly be called if a worker throws an uncaught exception.  The main difference between sendFail and sendException is that sendException accepts a string detailing the exception.  If you wrap a worker in a try/catch block, you can catch exceptions and call sendException with the exception error message.  The sendFail method does not take any parameters and leaves the client guessing as to why the failure occurred.
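The try/catch wrapping described above can be sketched roughly as follows.  reportExceptions() is a hypothetical helper name; the only extension calls used are GearmanJob::workload(), sendComplete() and sendException().  Treat it as an illustration, since how the exception reaches the client can depend on the libgearman version in use.

```php
<?php
/**
 * Wrap a job handler so a thrown exception is reported to the client
 * instead of being silently lost. reportExceptions() is a hypothetical name.
 */
function reportExceptions(callable $handler): callable
{
    return function ($job) use ($handler) {
        try {
            // Hand the handler's result back to the client.
            $job->sendComplete((string) $handler($job->workload()));
        } catch (Exception $e) {
            // Pass the message along so the client knows why the job failed.
            $job->sendException($e->getMessage());
        }
    };
}

// Usage (needs the gearman extension and a running gearmand):
// $worker = new GearmanWorker();
// $worker->addServer();
// $worker->addFunction('reverse', reportExceptions('strrev'));
// while ($worker->work());
```

Note that this only helps synchronous callers; it reports the failure but does nothing to make the daemon retry the job.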

The worker does not know if it was called synchronously or not.  If the worker was called synchronously, using any of the aforementioned methods will allow the client to determine the status of the job.  The client can then decide whether or not to retry a failed job.  If the worker was called asynchronously, sending back the job status falls on deaf ears.  Nothing is listening for the job status and the gearman daemon will not log failed jobs.

We still have no idea what criteria must be met in order for the gearman daemon to retry a job.  I read some gearman mailing lists and perused the daemon source code and I think I have found a definitive answer.  The worker must exit with a non-zero code during a job in order for the gearman daemon to retry the job.  The strange thing is that an uncaught exception also causes a PHP script to exit with a code of 255.  Explicitly calling exit(255) will force a retry, but an uncaught exception will not.  In fact, an uncaught exception will not even cause a GEARMAN_WORK_FAIL or GEARMAN_WORK_EXCEPTION return code.
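The exit code claim is easy to check outside of gearman.  The sketch below runs two one-line scripts in child PHP processes and prints their exit statuses; exitStatusOf() is a hypothetical helper, and it assumes the PHP CLI binary is available and exec() is not disabled.

```php
<?php
/**
 * Run a snippet in a child PHP process and return that process's exit status.
 * exitStatusOf() is a hypothetical helper for this demonstration.
 */
function exitStatusOf(string $snippet): int
{
    // Redirect stderr so the child's error message does not clutter the output.
    $command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($snippet) . ' 2>&1';
    exec($command, $output, $status);

    return $status;
}

echo exitStatusOf('throw new Exception("boom");'), "\n"; // uncaught exception
echo exitStatusOf('exit(255);'), "\n";                   // explicit exit
```

Both lines should print 255, which is what makes the difference in retry behavior inside a worker so surprising.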

After reviewing the pecl/gearman code, I found that the worker code does not check for an exception before returning GEARMAN_SUCCESS.  I have submitted a bug report with a patch to at least return GEARMAN_WORK_FAIL when an exception is encountered instead of GEARMAN_SUCCESS.  I do think there is an argument to be made that an uncaught exception should force a retry of the job, but I will leave that discussion for another day.

The best way to force a job retry on an uncaught exception is to simply use the exit() function.

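function func($job)
{
    try {
        // work
    } catch (Exception $e) {
        // Log the exception (message and trace), then exit with a non-zero
        // code so the gearman daemon retries the job.
        syslog(LOG_ERR, (string) $e);
        exit(255);
    }
}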

This will cause your worker to stop running, but so will an uncaught exception.  Most gearman architectures have a monitor that will restart fallen workers.  If you don't, get one and have it send out alerts if any worker exits with a non-zero status.
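If you do not have a monitor yet, the idea can be sketched in a few lines of PHP.  This is only an illustration: superviseWorker(), $alert and $maxRestarts are hypothetical names, and a real deployment would more likely rely on a dedicated process supervisor such as supervisord.

```php
<?php
/**
 * Run a worker command, restart it whenever it exits with a non-zero status,
 * and call $alert each time that happens. Returns the number of restarts.
 * All names here are hypothetical; this is a sketch, not a production monitor.
 */
function superviseWorker(string $command, callable $alert, int $maxRestarts = 5): int
{
    for ($restarts = 0; ; $restarts++) {
        passthru($command, $status); // blocks until the worker process exits

        if ($status === 0) {
            return $restarts; // clean shutdown, nothing to restart
        }

        $alert("worker exited with status $status");

        if ($restarts >= $maxRestarts) {
            return $restarts; // give up so a broken worker cannot loop forever
        }
    }
}

// Usage: restart worker.php up to 5 times, logging every crash to syslog.
// superviseWorker('php worker.php', function ($message) {
//     syslog(LOG_ALERT, $message);
// });
```

The restart cap matters for the same reason as the retry limit on jobs: a worker that dies on startup would otherwise be restarted in a tight loop.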

  • Craig Lumley

    Hey, I’ve left a comment on your bug report – Surely to allow us to work with exceptions properly the pecl extension should be making use of the exception callback functionality rather than always assuming an exception is a failure?

    I’d be interested to know what you think.
    Craig

  • Herman Radtke

    Thanks for following up. I responded in the bug ticket with my original thoughts (http://pecl.php.net/bugs/bug.php?id=22636). Now that you mention it, using the callback for exceptions would be nice. My only concern is that handling exceptions automatically seems like a slippery slope. Do we send back just the message or include the trace as well? The protocol leaves it ambiguous.

  • Craig Lumley

    Yes, I think discussing it gives everyone involved an opportunity to get the best out of and for the project itself – I’m more than happy to bounce some ideas around.

    In reality I think it’s up to the person using the client to deal with any exceptions that are thrown – that’s surely the point of the GearmanClient::setExceptionCallback, if nothing has been set the most sensible route would be to fail as you have suggested.

    In relation to the exceptions, it would be nice to return a serialized version of the exception returned, although that leaves the question of which job threw the exception and what do we do with that information? So this information would be good to add.

    I think we’re both on the same page really, just that you’re in a better position to push this forward however I am more than happy to offer any help and support that I can. Please feel free to email me directly if you like.

    I have mentioned this on the bug report – where would you like to discuss this further: here, or on the bug report?

    Thanks for obliging me,
    Craig

  • Herman Radtke

    I have a patch ready for catching the exception, but this causes some problems with other parts of the code. I know that libgearman 0.14 does not handle GEARMAN_WORK_EXCEPTION correctly. Looking at the newer 0.20 libgearman version to see if this is still the case.

  • Herman Radtke

    Finally managed to fix that bug. Should be pretty close to what people expect now. May have to work with Brian Aker a bit on libgearman to get it perfect.