Sunday, April 22, 2012

Does the Roku need Erlang/OTP?

My Roku streaming player locks up every once in a while (every couple of months). It seems to do this in two different ways: sometimes it gets confused about the wi-fi connection, and sometimes it just freezes in the middle of a movie. We are heavy Roku users in my household, and it works flawlessly most of the time, so it is hard to pin down why this is happening (there is no consistent scenario).

So, here is an embedded device that is supposed to work 24/7, and my only recourse when it locks up is to unplug the power supply and plug it back in. There are no buttons on the unit itself to "reboot" it.

This kind of device is ripe for the "Erlang" approach. Now, of course, if you've been doing embedded firmware for any decent length of time, you'd realize that you don't "need" Erlang to fix this. Watchdog timers, process monitors, a soft-reboot button, etc. are the first things that come to mind.
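To make the watchdog idea concrete, here is a minimal sketch in Python of a software watchdog. The monitored task has to "kick" it periodically, and a monitor treats a missed deadline as a hang. The class name `SoftwareWatchdog`, the five-second timeout, and the injectable clock are illustrative assumptions, not anything the Roku firmware is known to do; a real device would normally back this with a hardware watchdog that forces a reset.

```python
import time
from typing import Callable


class SoftwareWatchdog:
    """Hypothetical software watchdog: reports a hang if not kicked in time."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the monitored task to signal that it is still alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if the task has not kicked within the timeout window."""
        return self._clock() - self._last_kick > self._timeout


# Usage with a fake clock (an assumption for illustration only).
fake_now = [0.0]
dog = SoftwareWatchdog(timeout_seconds=5.0, clock=lambda: fake_now[0])

fake_now[0] = 3.0
ok_at_3 = dog.expired()    # 3 seconds since start: still within the window
dog.kick()                 # task reports in at t = 3

fake_now[0] = 9.0
hung_at_9 = dog.expired()  # 6 seconds since last kick: treat as a hang
```

When `expired()` turns true, the monitor would restart the stuck task or reboot the box, which is exactly the step the user currently performs by pulling the plug.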

But a lot of the software that runs on these set-top boxes seems to come from the "other side" (non-embedded developers). I have no proof of this, but the lock-ups smell of that kind of development mentality. When I used to build highly available internet server systems, I always made sure that I had terminal access so that I could log in and kill/restart stuff to fix problems. I knew that my stuff needed to run standalone, but I also knew that I could log into a running system, look around at what the problem was, restart stuff and take the fix back to development for the next release.

You don't get any of that with "appliance" devices.  So, the more I did embedded work, the more I developed a "no login; no logs" mindset.  Stuff needs to run, damn the logs.

This year I had to do some Cloud apps. I used Erlang/OTP. I thought I had caught all of the failure conditions, but some third-party code would fail every once in a while. The system would run for a week or two, but then mysteriously crash. Thank goodness for the logs. I logged in and reviewed about 1MB of logging to find the problem. I fixed it, uploaded new code and restarted the servers.

This doesn't work for a Roku. Once it ships, there is no developer login. The device must never lock up. The user must never lose control. Even when a full reset is needed, the user should be able to do it from the couch. All processes (and devices) must be monitored.
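A rough sketch of the "monitor everything and restart it" idea, written in Python rather than Erlang and only loosely modeled on what an OTP supervisor does. The names `supervise` and `flaky_player`, and the restart limit of 5, are hypothetical and chosen just for illustration.

```python
import time
from typing import Callable, List


def supervise(worker: Callable[[], None], max_restarts: int = 5,
              backoff_seconds: float = 0.0) -> int:
    """Run worker, restarting it each time it crashes.

    Returns the number of restarts performed. Once max_restarts is
    exceeded the last error is re-raised, so a higher-level monitor
    (or a full device reset) can take over.
    """
    restarts = 0
    while True:
        try:
            worker()
            return restarts            # worker finished normally
        except Exception:
            if restarts >= max_restarts:
                raise                  # give up and escalate
            restarts += 1
            time.sleep(backoff_seconds)


# Hypothetical worker that crashes twice before running cleanly.
attempts: List[int] = []


def flaky_player() -> None:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("stream stalled")


restart_count = supervise(flaky_player)
```

Here `flaky_player` is started three times and restarted twice, with no developer logging in. Note that this only covers crashes; a task that hangs without raising an error would also need a timeout or watchdog on top.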

My home monitoring system base station is currently using Erlang/OTP -- not because Erlang solves these problems, but because Erlang/OTP was designed to solve these problems.
