Downloading top X sites’ data with ZombieJS

June 6th, 2014. Tagged: CSS, HTTP, tools

Update: Easier way to get top X URLs: //httparchive.org/urls.php, thanks @souders

Update: found and commented out an offensive try{}catch(e){throw e;} in zombie.js (q.js, line 126); now the script doesn't fatal as often

Say you want to experiment or do some research with what's out there on the web. You need data, real data from the web's state of the art.

In the past I've scripted IE with the help of HTTPWatch's API and exported data from Fiddler. I've also fetched stuff from HTTPArchive. It doesn't have everything, but you can still manage to refetch the missing pieces.

Today I played with something called Zombie.js and thought I should share.

Task: fetch all HTML for Alexa top 1000 sites

Alexa

Buried in an FAQ is a link to download the top 1 million sites according to Alexa. Fetch, unzip, parse csv for the first 1000:

$ curl http://s3.amazonaws.com/alexa-static/top-1m.csv.zip > top1m.zip
$ unzip top1m.zip
$ head -1000 top-1m.csv | awk 'BEGIN { FS = "," } ; { print $2 }' > top1000.txt

Zombie.js

Zombie is a headless browser written in JS. It's not WebKit or any other real engine, but for the purposes of fetching stuff it's perfectly fine. Is it any better than curl? Yup, for one it executes JavaScript. It also provides a DOM API and Sizzle selectors to hunt for stuff on the page.

$ npm install zombie

Taking Zombie for a spin

You can just fiddle with it in the Node console:

$ node

> var Zombie = require("zombie");
undefined
> var browser = new Zombie();
undefined
> browser.visit("//phpied.com/", function() {console.log(browser.html())})
[object Object]
> <html><head>
<meta charset="UTF-8" />....

The html() method returns the generated HTML after JS has had a chance to run, which is pretty cool (there may be problems with document.write though, but, hell, document-stinking-write!?).

You can also get a part of the HTML using a CSS selector, e.g. browser.html('head') or browser.html('#sidebar'). More APIs...
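As a quick sketch of both (same phpied.com example as above; the output file names are made up for illustration):

var Zombie = require("zombie");
var write = require('fs').writeFileSync;

var browser = new Zombie();
browser.visit("http://phpied.com/", function() {
  // the whole generated document...
  write('full.html', browser.html());
  // ...or just the part matching a CSS selector
  write('head.html', browser.html('head'));
});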

The script

var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;
 
// one Zombie instance per URL: visit the page, let JS run,
// then write the generated HTML to fetched/<rank>.html
read('top1000.txt', 'utf8').trim().split('\n').forEach(function(url, idx) {
  var browser = new Zombie();
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      return; // skip failed loads
    }
    write('fetched/' + (idx + 1) + '.html', browser.html());
  });
});
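One gotcha not spelled out above: writeFileSync won't create the fetched/ directory for you, so it has to exist before the script runs. A minimal sketch (or just create the directory by hand):

var fs = require('fs');
// create the output directory if it doesn't exist yet
if (!fs.existsSync('fetched')) {
  fs.mkdirSync('fetched');
}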

Next challenge: fetch all CSS and strip JS

The API provides browser.resources, which is an array of the JS resources the page loaded and is really useful - HTTP requests/responses/headers and everything.
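A minimal sketch of poking at it, using only the request.url and response.body fields that the script below also relies on (example.com is a placeholder; other fields may differ between Zombie versions):

var Zombie = require("zombie");
var browser = new Zombie();

browser.visit('http://www.example.com/', function () {
  browser.resources.forEach(function (r, i) {
    // every resource carries its request and (if it arrived) its response
    var size = r.response ? r.response.body.length + ' bytes' : 'no response';
    console.log(i + ': ' + r.request.url + ' (' + size + ')');
  });
});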

Since I'll need all this HTML and CSS for a visual test later on, I don't want the page to change from one run to the next, e.g. to include different ads. So let's strip all JavaScript. While at it, also get rid of other possibly changing content, like iframes, embeds, objects. Finally all images should be spacer gifs, so there will be no surprises there either.

The bad

After some experimentation, I think I can conclude that Zombie.js is not (yet) suitable for this kind of "in the wild" downloading of sites. It's a jungle out there.

  • It could be that my script is not good enough, but I couldn't find a way to catch errors gracefully (wrapping browser.visit() in a try-catch didn't help).
  • All JS errors show up; some cause the script to hang.
  • The script hangs on 404s.
  • Sometimes it just exits.
  • Relative URLs (e.g. to fetch a CSS file) don't seem to work.
  • I couldn't get the CSS resources included (probably a bug in the version I'm using), so I had to hack the defaults to enable CSS.
  • I had to kill the script when it hangs and restart it pretty often; that's why I write a file (idx.txt) with the index of the next URL to fetch (see the watchdog sketch after this list).
  • All in all I had something like 25% errors fetching the HTML and CSS.
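There's no clean way (that I found) to catch the hangs from inside Zombie, but the kill-and-restart routine could at least be automated with a crude deadline around each visit. This is only a sketch of that workaround, not part of the original script: the 30-second timeout is an arbitrary number, and it relies on the idx.txt bookkeeping from the script below to resume at the next URL.

// give each visit a deadline; if the callback never fires, bail out
// and let whatever restarts the script pick up from idx.txt
var TIMEOUT_MS = 30000; // arbitrary

function visitWithDeadline(browser, url, callback) {
  var done = false;
  var timer = setTimeout(function () {
    if (!done) {
      console.log('looks like a hang, exiting; idx.txt points at the next URL');
      process.exit(1);
    }
  }, TIMEOUT_MS);

  browser.visit(url, function () {
    done = true;
    clearTimeout(timer);
    callback();
  });
}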

The good

Loading a page in a fast headless browser with DOM access - priceless!

And it worked! (Well, in 75% of the sites out there)

I'm sure that in a more controlled environment, with error-free JS and no 404s, it will behave way better.

The script

var Zombie = require("zombie");
var read = require('fs').readFileSync;
var write = require('fs').writeFileSync;
 
var urls = read('top1000.txt', 'utf8').trim().split('\n');
 
// where are we in the list of URLs
var idx = parseInt(read('idx.txt', 'utf8'), 10);
console.log(idx);
 
// go do it
download(idx);
 
function download(idx) {
  // remember the place in the likely scenario that
  // the script crashes
  write('idx.txt', String(idx + 1));
 
  var url = urls[idx];
 
  if (!url) {
    // we're done!
    console.log('yo!');
    process.exit();
  }
 
  var browser = new Zombie();
 
  browser.visit('http://www.' + url, function () {
    if (!browser.success) {
      return;
    }
 
    var map = {}; // need to match link hrefs to resource indices
    browser.resources.forEach(function(r, i) {
      map[r.request.url] = i;
    });
 
    // collect all CSS, external and inline
    var css = [];
    var sss = 'link[rel=stylesheet], style'; // Select me Some Styles
    [].slice.call(browser.querySelectorAll(sss)).forEach(function(e) {
      if (e.nodeName.toLowerCase() === 'style') {
        css.push('/**** inline ****/');
        css.push(e.textContent);
      } else {
        var i = map[e.href];
        if (i && browser.resources[i].response) {
          css.push('/**** ' + e.href + ' ****/');
          css.push(browser.resources[i].response.body);
        }
      }
    });
 
    // remove style and these nodes that may cause the UI to change
    // from one run to the next
    var stripem = 'style, iframe, object, embed, link, script';
    [].slice.call(browser.querySelectorAll(stripem)).forEach(function(node) {
      if (node.parentNode) {
        node.parentNode.removeChild(node);
      }
    });
 
    // neutralize images so they can't change from one run to the next
    [].slice.call(browser.querySelectorAll('img')).forEach(function(node) {
      node.src = 'spacer.gif';
    });
 
    // placeholder, probably useless
    browser.body.appendChild(
      browser.document.createComment('specialcommenthere'));
 
    // we got the stuffs!
    var html = browser.html();
    css = css.join('\n');
 
    if (html && css) { // do we? ... got? ... the stuffs?
      write('fetched/' + (idx + 1) + '.html', html);
      write('fetched/' + (idx + 1) + '.css', css);
      console.log(idx + " [OK]");
    } else {
      console.log(idx + " [SKIP]");
    }
 
    download(++idx); // next!
 
  });
 
}
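One more note on running it: read('idx.txt') will throw if the file doesn't exist, so seed it before the very first run (and reset it to start over). For example, a tiny one-off (the seed.js name is just for illustration):

// seed.js - run once before the first crawl, or to start from the top again
require('fs').writeFileSync('idx.txt', '0');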

Sorry, comments disabled and hidden due to excessive spam.

Meanwhile, hit me up on twitter @stoyanstefanov

