Tuesday, August 28, 2012

Measuring CPU usage of Node.js from inside the program

or: Learning the hard way that you have to watch your CPU load yourself sometimes

These days, it feels like we are stretching the limits of what can (or should) be done with Node.js. While running a distributed crawler on EC2, we noticed that the central control server got into trouble working off the events it received. At least, that was how it seemed. The problem manifested itself in queries not coming through to the MySQL instance. After ramping up the load for a while, the Node process was continuously running at 100% CPU and finally got kicked because sequelize was repeatedly disconnected while trying to reconnect to the database.

Googling around suggested that this is usually caused by a flaky network connection. That felt strange, because the network was heavily loaded but still had bandwidth left. The rest of the communication was doing fine.

We're using socket.io for communication between the modules, and react to inter-module messages with custom event handlers. After a while, we narrowed the problem down to one central function that did some unoptimized computation whenever a page was to be analyzed after being downloaded.

It turns out that when the CPU is blocked by one function running continuously, Node's asynchronicity comes to a forced end - no timers are called again before the function exits. Here is an example:
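
A minimal sketch of such a program, with the delays shortened so that it finishes quickly. The names `running`, `timerFired` and `busyWait` are chosen here for illustration, and the `maxMs` safety limit is an addition so that the demo terminates at all:

```javascript
// Sketch: a timer cannot interrupt a synchronous loop.
// `running`, `timerFired` and busyWait() are illustrative names.
let running = true;
let timerFired = false;

// Ask the event loop to clear the flag after 10 ms.
setTimeout(function () {
  timerFired = true;
  running = false;
}, 10);

// Spin on the CPU while the flag is set. The safety limit (maxMs) exists
// only so that this demo terminates; without it the loop would never end.
function busyWait(maxMs) {
  const start = Date.now();
  while (running && Date.now() - start < maxMs) {
    // burn CPU
  }
  return Date.now() - start;
}

const elapsed = busyWait(100);

// The timer was due after 10 ms, but it has not run yet: the loop never
// handed control back to the event loop.
console.log('elapsed: ' + elapsed + ' ms, timer fired: ' + timerFired);
```

The timer callback only runs once the script has returned to the event loop, long after it was due.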


Look at this for a minute. It starts a CPU-hogging function and sets a timer to stop it after a second. At least, I was certain that the timer would kick in and end the busyWait procedure. But it doesn't. Because it can't. If you think about it, this makes sense - having only one thread, there is no way for the timer to be fired. I fell into the trap of thinking about Node like Java - where a timer usually runs in another thread and is able to fire regardless of what other threads are doing. It might get slow, but it will fire eventually. Not so with Node.js, in extreme cases.

So, the lesson is this: Node.js timers make things asynchronous, but CPU-intensive functions can severely interfere with this. Write your functions so that they yield from time to time. If necessary, factor out the work and process it in batches, to give the rest of the program some time to catch up with new events coming in. If you do this, Node can once again run with the amazing speed it usually does.
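
One way to process work in batches is sketched below. It assumes a Node version that provides `setImmediate`; the helper name `processInBatches` and its parameters are made up for this example and are not taken from our crawler:

```javascript
// Sketch: process a list in small batches and yield to the event loop
// between batches. processInBatches() is a hypothetical helper.
function processInBatches(items, batchSize, handleItem, done) {
  let index = 0;

  function runBatch() {
    const end = Math.min(index + batchSize, items.length);
    for (; index < end; index++) {
      handleItem(items[index]);
    }
    if (index < items.length) {
      // Let pending timers and I/O callbacks run before the next batch.
      setImmediate(runBatch);
    } else {
      done();
    }
  }

  runBatch();
}

// Example: double ten numbers, three at a time.
const processed = [];
let finished = false;

processInBatches(
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  3,
  function (n) { processed.push(n * 2); },
  function () { finished = true; }
);
```

Only the first batch runs synchronously. The remaining batches are scheduled one by one, so timers and incoming events get a chance to run in between.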

Apart from applying the necessary optimization to the function in question, we introduced a throttler that feeds new messages into the system only if the current CPU load isn't too high. If the system is too busy, the message gets queued. I found a great way to measure the current CPU usage on Linux on StackOverflow and wrapped it into a module that is easy to require in your script:
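
A rough sketch of this kind of measurement, for Linux only. The function names `parseCpuTicks` and `measureCpuUsage` are invented for this example, and the number of ticks per second is deliberately left as a parameter because it needs calibrating:

```javascript
const fs = require('fs');

// Extract user + system CPU ticks from the contents of /proc/[pid]/stat.
function parseCpuTicks(statText) {
  // The second field (the command name) is wrapped in parentheses and may
  // contain spaces, so start splitting after the last ')'.
  const rest = statText.slice(statText.lastIndexOf(')') + 2).split(' ');
  const utime = parseInt(rest[11], 10); // field 14: time in user mode
  const stime = parseInt(rest[12], 10); // field 15: time in kernel mode
  return utime + stime;
}

// Sample the counters twice, intervalMs apart, and report the usage of
// this process in percent of one CPU via callback(err, percent).
function measureCpuUsage(intervalMs, ticksPerSecond, callback) {
  const statPath = '/proc/' + process.pid + '/stat';
  let before;
  try {
    before = parseCpuTicks(fs.readFileSync(statPath, 'utf8'));
  } catch (err) {
    return callback(err); // e.g. not running on Linux
  }
  setTimeout(function () {
    let after;
    try {
      after = parseCpuTicks(fs.readFileSync(statPath, 'utf8'));
    } catch (err) {
      return callback(err);
    }
    const seconds = intervalMs / 1000;
    callback(null, 100 * (after - before) / (ticksPerSecond * seconds));
  }, intervalMs);
}

// Usage (ticksPerSecond is an assumption to calibrate on your machine):
// measureCpuUsage(1000, ticksPerSecond, function (err, percent) {
//   if (err) throw err;
//   console.log('CPU usage: ' + percent.toFixed(1) + '%');
// });
```

Note that the measurement itself relies on a timer, so if the process is completely blocked, the result will arrive late.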


It works by reading /proc/[pid]/stat, where Linux reports the status of process [pid]. The two fields read in the script report the user and system CPU time spent on the process, counted in clock ticks. I had to experiment a bit to find out how many ticks are counted per second - the StackOverflow answer implies 1000 per second and processor, but for me, 500 worked. If you'd like to verify that it calculates correctly on your machine, you can use this little script to generate some load:
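
A load generator of this kind could look roughly like the sketch below. It alternates busy and idle phases to approximate a target load. The names `burn` and `generateLoad` are made up for this example:

```javascript
// Burn CPU for busyMs milliseconds, blocking the event loop meanwhile.
// Returns the number of loop iterations, just to have a result.
function burn(busyMs) {
  const end = Date.now() + busyMs;
  let iterations = 0;
  while (Date.now() < end) {
    iterations++;
  }
  return iterations;
}

// Alternate busy and idle phases: with fraction = 0.5 and periodMs = 1000
// the process is busy for about 500 ms of every second.
function generateLoad(fraction, periodMs, cycles) {
  const busyMs = Math.round(periodMs * fraction);
  let remaining = cycles;

  function tick() {
    if (remaining <= 0) {
      return;
    }
    remaining--;
    burn(busyMs);
    setTimeout(tick, periodMs - busyMs);
  }

  tick();
}

// Usage: roughly 50% load for about a minute.
// generateLoad(0.5, 1000, 60);
```

With a known target load like this, the measured value can be compared against the target and the ticks-per-second divisor adjusted until they agree.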


If you look at top while running this, you can easily verify the values and adjust accordingly.
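
For completeness, the throttler described above can be sketched along these lines. This is not our production code: `createThrottler`, its parameters and the injected `getLoad` function are assumptions made for the example, with `getLoad` expected to return the most recently measured CPU usage in percent:

```javascript
// Sketch of a load-aware throttler: messages are handled right away while
// the load is acceptable and queued (in order) while it is too high.
function createThrottler(getLoad, maxLoad, handle) {
  const queue = [];

  // Hand over queued messages for as long as the load permits.
  function drain() {
    while (queue.length > 0 && getLoad() <= maxLoad) {
      handle(queue.shift());
    }
  }

  return {
    push: function (message) {
      queue.push(message);
      drain();
    },
    drain: drain,
    pending: function () {
      return queue.length;
    }
  };
}

// Example with a fake load value instead of a real measurement.
let fakeLoad = 90;
const handled = [];
const throttler = createThrottler(
  function () { return fakeLoad; },
  80,
  function (message) { handled.push(message); }
);

throttler.push('a'); // load too high: 'a' stays in the queue
fakeLoad = 10;
throttler.push('b'); // load fine again: 'a' and 'b' are handled in order
```

In a real program, `drain` would also be called periodically, for example from a `setInterval` callback, so that queued messages are not stuck until the next message arrives.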