Post-JavaScript DOM with Aptana Jaxer
It is hard to imagine the web as we know it today without AJAX. Within just a couple of years it has transformed how we think about web applications, how we build them, and the technology stack underneath. Prototype, jQuery, YUI, and dozens of other frameworks let us build rich, interactive applications, while JavaScript and rendering engines are progressing in leaps and bounds when it comes to performance. Exciting stuff. However, this client-side functionality comes at a price: the AJAX testing experience is still painful (even with solutions like Selenium, Webrat, etc.), and more generally, working with, or even getting access to, the post-JavaScript DOM remains a challenge. Want to build a crawler that sees the page as the author intended? That's a tough problem.
JavaScript is (Almost) Everywhere
Like it or not, disabling JavaScript in your browser today is likely to render a large portion of the web completely unusable, and that is unlikely to change anytime soon. For that reason, rumors have recurred over the years that the Googlebot renders JavaScript when it visits your page. Google denies this, but there are also theories that the V8 engine is a direct result of such initiatives, which, I think, are not completely unfounded.
In either case, if you have ever attempted this yourself, you quickly realized just how non-trivial the problem actually is. A typical spider just downloads the HTML of the page; it has no knowledge of the DOM or of JavaScript. So, at the end of the day, what you need is a browser: the content must be downloaded, the HTML needs to be parsed into a DOM model, and the result finally handed off to the JavaScript runtime for further processing. That's a lot of work.
Jaxer, the AJAX Server
Having attempted this problem many times with Rhino, SpiderMonkey, and even raw WebKit bindings, I finally stumbled on Jaxer. Developed at Aptana, it is based directly on Gecko (Firefox's rendering engine), minus the graphical rendering, which means that the DOM is already there and JavaScript comes free of charge. I never bought into the server-side JavaScript movement, but seeing Jaxer in this light definitely opened my eyes to a lot of opportunities. Installation is as easy as unpacking the Apache-based runtime and then proxying your requests through the server. For example, to get the full post-JavaScript DOM of any web page, just drop this template into your Jaxer directory:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Post JavaScript DOM</title>
  </head>
  <body>
    <!-- Render post-JS DOM: http://jaxer/post-js.html?www.google.com -->
    <script runat='server'>
      // The query string carries the target host, e.g. ?www.google.com
      var url_query = Jaxer.request.queryString;

      // Load the page in a server-side Gecko sandbox: the HTML is parsed
      // into a DOM and its JavaScript is executed, just as in Firefox
      var sb = new Jaxer.Sandbox("http://" + url_query);

      // Serialize the post-JavaScript DOM and return it as our response
      Jaxer.response.setContents(sb.toHTML());
    </script>
  </body>
</html>
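Jaxer also ships a plain HTTP client, Jaxer.Web, which makes it easy to see what the sandbox buys you. Here is a sketch (assuming Jaxer.Web.get returns the raw response body as a string) that fetches the same page twice, once as raw bytes and once through the sandbox; the delta between the two outputs is precisely the work your browser normally does for you:

<script runat='server'>
  var url = "http://" + Jaxer.request.queryString;

  // Raw fetch: the bytes on the wire, no DOM and no JavaScript execution
  var raw = Jaxer.Web.get(url);

  // Sandboxed fetch: Gecko parses the HTML and runs the page's scripts
  var rendered = new Jaxer.Sandbox(url).toHTML();

  Jaxer.response.setContents("Raw: " + raw.length +
    " chars, post-JS: " + rendered.length + " chars");
</script>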
Accessing the Server-Side DOM
The first thing you'll notice is just how slowly most pages come back when proxied through Jaxer. This has nothing to do with Jaxer, however, and everything to do with the dozens of JavaScript files and slow remote calls our browsers make on our behalf; their asynchronous nature just happens to mask the abysmal performance of most pages. Which, incidentally, is also likely the reason why our search engines do not look at the post-JS DOM today.
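If you want to see where the time goes, you can wrap the sandbox load in a simple timer. A minimal sketch, assuming the Sandbox constructor blocks until the page and its scripts have finished loading:

<script runat='server'>
  var url = "http://" + Jaxer.request.queryString;

  var start = new Date();
  var sb = new Jaxer.Sandbox(url); // returns once the page has loaded
  var elapsed = new Date() - start;

  // Record the wall-clock cost of rendering the page server-side
  Jaxer.Log.info("Rendered " + url + " in " + elapsed + " ms");
  Jaxer.response.setContents(sb.toHTML());
</script>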
If you haven't already, take a look at the Jaxer tutorials. You have full access to the DOM on the server, which means that you can use jQuery to alter the response, connect to a database, test your JavaScript on the fly, and so on. Having said that, I'm still secretly hoping that one day the curl command-line client will simply have a flag to return the final post-JavaScript DOM.
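As a taste of what that looks like, here is a minimal sketch of server-side DOM manipulation, assuming a local copy of jquery.js sits next to the page: the script runs inside Gecko on the server, so jQuery rewrites the document before a single byte reaches the client.

<script src="jquery.js" runat='server'></script>
<script runat='server'>
  // Runs on the server: the selector executes against the server-side DOM,
  // and the modified document is what gets serialized to the client
  $('h1').text('Rewritten before the response left the server');
</script>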