Thursday, October 11, 2012

Quick and Dirty Node.js Process (Job) Queue



I recently began experimenting with Node.js for a small web scraping project. I wrote a tiny script that goes out to lots of URLs and downloads files to disk. The simple solution was to iterate through the list and, for each URL, shell out to wget with exec() to download the page.

Too Many Open Files

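Unfortunately, there is a limit on the number of simultaneous exec() calls you can make. Each call spawns a child process and opens file descriptors for it, and because exec() is non-blocking, a loop of back-to-back calls starts them all at once. Once the process runs out of file descriptors (EMFILE, "too many open files"), you get the following: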
node.js:201
        throw e; // process.nextTick error, or error event on first tick
              ^
Error: spawn EMFILE
    at errnoException (child_process.js:481:11)
    at ChildProcess.spawn (child_process.js:444:11)
    at child_process.js:342:9
    at Object.execFile (child_process.js:252:15)
    at child_process.js:220:18

Maximum Simultaneous Calls

To solve this, I implemented something like the following:
var exec = require('child_process').exec;

var queue = [];  // urls waiting for a free slot
var MAX = 20;  // only allow 20 simultaneous exec calls
var count = 0;  // holds how many execs are running
var urls = [...] // long list of urls

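// our callback for each exec call
function wget_callback(err, stdout, stderr) {
  count -= 1;  // this exec has finished, so free its slot
  if (err) console.error('wget failed: ' + err.message);  // report the failure, keep going

  if (queue.length > 0 && count < MAX) {  // get next item in the queue!
    count += 1;
    var url = queue.shift();
    exec('wget '+url, wget_callback);
  }
}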

urls.forEach( function(url) {
  if (count < MAX) {  // go get the file!
    count += 1;
    exec('wget '+url, wget_callback);
  } else {  // queue it up...
    queue.push(url);
  }
});
This caps the number of exec() calls running at once at MAX. The remaining URLs wait in a queue until a slot becomes available. The queue is checked (and shifted) in the callback wget_callback(): each time a download finishes, it takes the next URL off the queue, but only if fewer than MAX exec() calls are running. count tracks how many calls are in flight and is incremented or decremented accordingly.
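
If you want to reuse the same counter-and-queue idea for something other than wget, it could be wrapped in a small helper along these lines. This is only a sketch: createLimiter, push, running and pending are names I made up for illustration, not part of Node.js or any library, and each task is assumed to be a function that takes a done callback and calls it when its work is finished.

```javascript
// Hypothetical helper (illustration only): keeps at most `max` tasks
// in flight and parks the rest in a queue, like count/queue above.
function createLimiter(max) {
  var queue = [];   // tasks waiting for a free slot
  var running = 0;  // tasks currently in flight

  function start(task) {
    running += 1;
    var finished = false;
    task(function done() {
      if (finished) return;  // ignore accidental double calls to done()
      finished = true;
      running -= 1;          // free this task's slot
      // Hand the freed slot to the next waiting task, if any.
      // Note: tasks that call done() synchronously will recurse here.
      if (queue.length > 0) start(queue.shift());
    });
  }

  return {
    // Run the task now if a slot is free, otherwise queue it up.
    push: function (task) {
      if (running < max) start(task);
      else queue.push(task);
    },
    running: function () { return running; },
    pending: function () { return queue.length; }
  };
}
```

With something like this, the download loop would push one task per URL, and each task would call exec('wget '+url, ...) and invoke done() inside the exec callback.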

I'm sure there are tons of libraries that do this, but I decided to implement a quick and dirty solution to this problem and thought I'd share!

19 comments:

  2. I love how clean this is. I like how the loop pre-loads the processes, which process the queue with exactly the right number of callbacks. I'm not sure I would've thought of doing it this cleanly.

    Could there be a race condition between multiple calls to the callback modifying the enclosed variables (queue and count)?
