Let's put the Research into R&D. I guess it should be called Shit Research to go along with the name of this series (p1) of articles (p2) I started one year ago. It takes a lot of time to read over hundreds of thousands of lines of undocumented C++ and Java code, so part three took much longer than expected.
Another benefit of researching each implementation is that we can find all the different tools they use, for example the different test suites.
Make sure to read the URLs for each implementation to find out more about each one.
There is a list of ECMAScript implementations at Wikipedia. We will not cover all of the ones listed there. If you are like me, you may spend a few hours or even days reading through the links from there.
It is a good implementation to study, since it is fairly small and is meant to be easy to read.
It uses a hand-written parser, not a generated one.
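To make "hand-written parser" concrete, here is a minimal sketch of the technique in Python: a small scanner plus a recursive-descent parser for arithmetic expressions. The grammar, the token format and the tuple-shaped AST are all assumptions made up for illustration. None of it is taken from any of the implementations discussed here.

```python
import re

# Hypothetical token pattern for a tiny expression language.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")


def tokenize(src):
    """Hand-written scanner: turn source text into (kind, value) tuples."""
    tokens = []
    for number, op in TOKEN_RE.findall(src):
        if number:
            tokens.append(("num", int(number)))
        elif op.strip():
            tokens.append(("op", op))
    tokens.append(("eof", None))
    return tokens


class Parser:
    """Recursive descent: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expression(self):
        # expression := term (('+' | '-') term)*
        node = self.parse_term()
        while self.peek() in (("op", "+"), ("op", "-")):
            op = self.next()[1]
            node = (op, node, self.parse_term())
        return node

    def parse_term(self):
        # term := factor (('*' | '/') factor)*
        node = self.parse_factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            op = self.next()[1]
            node = (op, node, self.parse_factor())
        return node

    def parse_factor(self):
        # factor := NUMBER | '(' expression ')'
        kind, value = self.next()
        if kind == "num":
            return value
        if (kind, value) == ("op", "("):
            node = self.parse_expression()
            if self.next() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


def parse(src):
    parser = Parser(tokenize(src))
    ast = parser.parse_expression()
    if parser.peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return ast
```

Calling parse("1 + 2 * 3") gives a nested tuple with the multiplication below the addition, because operator precedence is encoded in which method calls which. That is the main appeal of writing the parser by hand: the code mirrors the grammar and is easy to step through.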
The original Narcissus source repository.
The new repository of Narcissus, and two articles about it.
It also has excellent documentation, especially the js/src/README.html file, which includes a design walkthrough.
"""The compiler consists of a recursive-descent parser and a random-logic rather than table-driven lexical scanner. Semantic and lexical feedback are used to disambiguate hard cases such as missing semicolons, assignable expressions ("lvalues" in C parlance), etc. The parser generates bytecode as it parses, using fixup lists for downward branches and code buffering and rewriting for exceptional cases such as for loops. It attempts no error recovery. The interpreter executes the bytecode of top-level scripts, and calls itself indirectly to interpret function bodies (which are also scripts). All state associated with an interpreter instance is passed through formal parameters to the interpreter entry point; most implicit state is collected in a type named JSContext. Therefore, all API and almost all other functions in JSRef take a JSContext pointer as their first argument.
The decompiler translates postfix bytecode into infix source by consulting a separate byte-sized code, called source notes, to disambiguate bytecodes that result from more than one grammatical production."""
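The "fixup lists for downward branches" idea from the quote can be sketched in a few lines of Python. When the compiler emits a forward jump it does not yet know the target address, so it leaves a hole and patches it once the target is reached. The opcode names (PUSH, JUMP, JUMP_IF_FALSE) and the list-based instruction format are hypothetical and chosen only for this illustration. This is not the real bytecode of the engine described above.

```python
def emit_if(code, cond_ops, then_ops, else_ops):
    """Emit bytecode for `if (cond) then else`, patching forward jumps."""
    code.extend(cond_ops)
    jump_if_false = len(code)
    code.append(["JUMP_IF_FALSE", None])  # target unknown yet, fix up later
    code.extend(then_ops)
    jump_to_end = len(code)
    code.append(["JUMP", None])           # skip over the else branch
    code[jump_if_false][1] = len(code)    # fixup: else branch starts here
    code.extend(else_ops)
    code[jump_to_end][1] = len(code)      # fixup: end of the whole statement
    return code


def run(code):
    """A tiny stack interpreter for the hypothetical opcodes above."""
    stack, pc = [], 0
    while pc < len(code):
        op, arg = code[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "JUMP":
            pc = arg
        elif op == "JUMP_IF_FALSE":
            if not stack.pop():
                pc = arg
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack


program = emit_if([], [["PUSH", 0]], [["PUSH", "yes"]], [["PUSH", "no"]])
result = run(program)
```

Backward branches (loops) do not need this trick, since the target has already been emitted by the time the jump is written.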
The parser is hand-written in C++. It's not generated.
It first 'preparses', which creates tokens, then creates an AST by parsing, and finally compiles the AST. parser.cc is where all the parsing happens.
The project's documentation is quite limited, so reading the source is the best way to get in there. There are some videos which describe the architecture and the engineering decisions behind it.
A hand-written scanner they call TokenStream creates tokens, which are then parsed into an AST. Despite being mostly hand-written, some parts are generated. Specifically, the stringToKeyword method, which detects keywords, is generated somehow.
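I don't know how stringToKeyword is actually generated, but the general idea of generating a keyword matcher is easy to sketch. The Python below writes out the source of a function that first dispatches on string length and then compares against the few keywords of that length. The keyword list, the numeric ids and the function names are all assumptions for illustration.

```python
# A hypothetical, incomplete keyword list; ids are just 1-based positions.
KEYWORDS = ["break", "case", "else", "for", "function",
            "if", "return", "var", "while"]


def generate_keyword_matcher(keywords, name="string_to_keyword"):
    """Return Python source for a function mapping a string to a keyword id.

    The generated function returns 0 when the string is not a keyword.
    """
    by_length = {}
    for index, word in enumerate(keywords, start=1):
        by_length.setdefault(len(word), []).append((word, index))

    lines = [f"def {name}(s):", "    n = len(s)"]
    for length in sorted(by_length):
        lines.append(f"    if n == {length}:")
        for word, index in by_length[length]:
            lines.append(f"        if s == {word!r}:")
            lines.append(f"            return {index}")
        lines.append("        return 0")
    lines.append("    return 0")
    return "\n".join(lines) + "\n"


# Generate the source, then load it so it can be called like normal code.
source = generate_keyword_matcher(KEYWORDS)
namespace = {}
exec(source, namespace)
string_to_keyword = namespace["string_to_keyword"]
```

The point of generating this kind of code is that it is tedious and error-prone to write by hand, yet fully determined by the keyword list.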
The documentation of the architecture of the project is limited. There is, however, some API documentation. With a couple of modifications to some Ant build files I was able to build it, and even make a few small changes of my own.
This uses a hand-written lexer (tokeniser) and a hand-written parser. The code structure of the parser and lexer looked eerily familiar. The code base is mostly written in C++ and is quite massive: 140,837 SLOC.
There is lots of platform-specific code, but it also has a JIT, and uses bytecode and an interpreter. There is also lots of development code in there for things like debuggers and profilers.
They have however published papers and blog posts about their implementations. I won't cover them any more, because not as much can be learned without the source code.
The description of the project mentions it's currently using the SpiderMonkey parser, but it appears to generate one from an EBNF grammar file, using a parser generator provided by PyPy. It also creates an AST.
It's written in RPython (a restricted subset of Python) and Python. Running on top of PyPy, it should theoretically be able to take advantage of that platform's JIT and garbage collector.
It weighs in at 5452 Source Lines of Code (SLOC), which is much smaller, but the implementation is also not complete, so that is to be expected.
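For anyone who wants to compare code bases the same way, here is a rough sketch of an SLOC counter in Python. It only skips blank lines and whole-line comments, so it is an approximation. Proper counting tools also handle block comments, strings and many languages, and will give somewhat different numbers. The function names and defaults are made up for this example.

```python
from pathlib import Path


def count_sloc(text, line_comment="#"):
    """Count lines that are neither blank nor pure line comments."""
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(line_comment):
            count += 1
    return count


def count_tree(root, suffix=".py", line_comment="#"):
    """Sum the rough SLOC of every file under root with the given suffix."""
    total = 0
    for path in Path(root).rglob(f"*{suffix}"):
        text = path.read_text(encoding="utf-8", errors="replace")
        total += count_sloc(text, line_comment)
    return total
```

For a C++ tree you could call count_tree with suffix=".cpp" and line_comment="//", keeping in mind that block comments would still be counted as code.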
So what have we learned then?
We see that most of the implementations use a hand-written parser. We also see that the implementations in JavaScript and Python are much smaller. So despite them being incomplete, I think it shows that it should be feasible to make our shit interpreter in Python. We don't need half a million lines of C++ to do our project.
We have also learned that there are test suites available, which should help us out a lot. In fact, many of the implementations share the test suites. Having a test suite already available makes it way easier to write an implementation of something yourself. It acts as a guide to development, and also reduces the time for testing since a lot of it can be automated.
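As a sketch of what "a lot of it can be automated" might look like, the Python below runs every test file in a directory through an interpreter command and records which ones pass. It assumes a convention where exit code 0 means pass. Real test suites come with their own harnesses and conventions, so treat the function name, the parameters and the pass rule as hypothetical.

```python
import subprocess
from pathlib import Path


def run_suite(interpreter_cmd, test_dir, pattern="*.js", timeout=10):
    """Run each test file through the interpreter.

    interpreter_cmd is a list such as ["./my-interpreter"]; the test file
    path is appended as the last argument. Exit code 0 counts as a pass,
    which is an assumption of this sketch.
    """
    results = {}
    for test_file in sorted(Path(test_dir).rglob(pattern)):
        try:
            proc = subprocess.run(
                [*interpreter_cmd, str(test_file)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            results[test_file.name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            # A test that hangs is counted as a failure.
            results[test_file.name] = False
    return results


def summarize(results):
    """Return (passed, total) for a results dict from run_suite."""
    return sum(results.values()), len(results)
```

Running this after every change gives a single passed/total number to watch, which is a handy guide while an implementation is still incomplete.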
Exercise for next time
Choose one (1) of the implementations, build it, run it, and modify it slightly to do something different. Try to run the tests that come with it.
Further reading
This whole article is "further reading", but we can never have too much to read. Can we!?
This time, instead of reading it on the train or in the bath tub - may I suggest reading these on a couch?