Monday, May 06, 2013

Scale if you must, but don't forget reliability

My home sensor base station uses STOMP to transfer sensor node messages into an AWS cloud instance (running RabbitMQ).  I am using STOMP because I can't find an AMQP binding that supports SSL (outside of Java and Erlang).  Or, more to the point: the sensor base station runs Lua, my Lua AMQP binding is built on amqplib (which doesn't support SSL), and my pure-Lua STOMP binding runs over SSL.  But I digress...
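STOMP is a simple enough text protocol that speaking it over SSL is easy in almost any language.  Here's a minimal sketch of the idea in Python (my actual client is Lua, and the credentials and virtual host below are just placeholders): frames are built by hand and written down a TLS-wrapped socket.

```python
import socket
import ssl

def stomp_frame(command, headers, body=""):
    # A STOMP frame is plain text: the command, header lines, a blank
    # line, the body, and a NUL terminator.
    lines = [command] + ["%s:%s" % (k, v) for k, v in headers.items()]
    return ("\n".join(lines) + "\n\n" + body + "\x00").encode("utf-8")

def connect(host, port, login, passcode):
    # Wrap a plain TCP connection in TLS, then send a STOMP CONNECT frame.
    ctx = ssl.create_default_context()
    sock = ctx.wrap_socket(socket.create_connection((host, port)),
                           server_hostname=host)
    sock.sendall(stomp_frame("CONNECT",
                             {"accept-version": "1.1",
                              "host": "/",           # placeholder vhost
                              "login": login,
                              "passcode": passcode}))
    return sock
```

Once connected, publishing a sensor reading is just another frame (`SEND` with a `destination` header).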

The STOMP client runs 24x7, pumping messages into the cloud-hosted RabbitMQ at a rate of 2-3 messages per minute.  This is not a lot of traffic, but I expect it to scale.

To develop (and test) the "server/logic" side of the system, I am running the consumer of messages on my laptop (it connects to the AWS RabbitMQ as a consumer).  The consumer also talks STOMP.

However, when my laptop lid is closed, all consumption stops, so overnight I can accumulate a few thousand sensor messages.  Firing up the consumer in the morning should just suck all those messages down and pick up where it left off.  Unfortunately, there is a glitch (in RabbitMQ?) where, after a few hundred messages (all ACK-based -- fault tolerance, baby!), the consumer stops receiving new messages and the remaining messages are marked by RabbitMQ as "unacked".  Restarting the consumer happily consumes a few hundred more messages before the same glitch recurs.
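The ACK-based part is what makes this recoverable at all.  The consumption loop looks roughly like this (a Python sketch of the pattern, not my actual Lua code; the frame-reading and frame-sending functions are assumed to exist): a message is ACKed only after it has been handled, so anything in flight when the consumer dies stays with the broker and gets redelivered.

```python
def consume(read_frame, send_frame, handle):
    # read_frame() blocks until a MESSAGE frame arrives (or returns None
    # when the connection is done).  Each message carries a message-id
    # header; we ACK only after the handler succeeds, so an unacked
    # message is requeued by the broker and redelivered later.
    while True:
        frame = read_frame()
        if frame is None:
            break
        handle(frame["body"])
        send_frame("ACK", {"message-id": frame["headers"]["message-id"]})
```

A crashed or wedged consumer therefore loses nothing: the broker still owns every message the consumer never ACKed.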

This is a serious problem, and I need to figure out whether RabbitMQ is the culprit.  Interestingly, I found that slowing down my consumption rate (read/respond every 100ms) fixes the problem.  Argh.  Well, that won't scale, now will it?

So, until I figure out what the real problem is, I'll keep consuming as fast as I can.  But what about the unacked messages?  Well, this is where "reliability" comes in.  By default, all of my consumers do timed reads: if they don't receive a message in 60 seconds, they terminate.  (I don't use RabbitMQ heartbeats.  I have an "application" level heartbeat that makes sure the whole system flow is working, from base station to cloud.  This heartbeat fires every 30 seconds.)  The consumers are written in Lua, but I did learn one thing from Erlang: failure is okay, just plan your recovery.
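In sketch form (again Python, with the blocking read left abstract), the timed-read loop is just:

```python
def run_consumer(read_with_timeout, handle, timeout=60):
    # Timed reads: a healthy system delivers *something* at least every
    # 30 seconds (the application heartbeat), so 60 seconds of silence
    # means the flow is wedged.  Exit and let the supervisor restart us
    # with a fresh connection.
    while True:
        msg = read_with_timeout(timeout)
        if msg is None:              # read timed out
            return "restarting"      # in practice: process exit
        handle(msg)
```

Because the heartbeat fires twice per timeout window, a silent 60 seconds can only mean a broken flow, never an idle-but-healthy one.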

So, once the process terminates, Ubuntu Upstart restarts it.  The result: the system recovers from this bug on its own and keeps running.  The unacked messages are requeued and delivered.
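The Upstart job is nothing fancy.  The job name and path below are made up for illustration, but `respawn` is the stanza that does the work:

```
# /etc/init/sensor-consumer.conf  (file name is illustrative)
description "sensor message consumer"

start on runlevel [2345]
stop on runlevel [016]

# Restart the consumer whenever it exits -- including our deliberate
# exit after a 60-second read timeout.
respawn

exec /usr/local/bin/consumer.lua
```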

Scaling is great, but don't forget reliability!
