Parsing JavaScript with SpiderMonkey

SpiderMonkey is intended to be used as an interpreter for executing JavaScript, but it is also possible to use SpiderMonkey simply as a parser for the language. This can be useful if you are writing a tool that needs to work with JavaScript code. For example, the JSCoverage tool uses this technique.

This document demonstrates how to use SpiderMonkey to parse JavaScript. Note that some of the SpiderMonkey APIs used in this document are not considered public and are subject to change at any time; this document applies to version 1.60 of SpiderMonkey.

Header files

The following header files must be included:

#include <jsapi.h>
#include <jsparse.h>
#include <jsscan.h>

Initialization

The first step is to initialize the JavaScript engine:

JSRuntime * runtime;
JSContext * context;
JSObject * global;

runtime = JS_NewRuntime(8L * 1024L * 1024L);
if (runtime == NULL) {
  fprintf(stderr, "cannot create runtime");
  exit(EXIT_FAILURE);
}

context = JS_NewContext(runtime, 8192);
if (context == NULL) {
  fprintf(stderr, "cannot create context");
  exit(EXIT_FAILURE);
}

global = JS_NewObject(context, NULL, NULL, NULL);
if (global == NULL) {
  fprintf(stderr, "cannot create global object");
  exit(EXIT_FAILURE);
}

if (! JS_InitStandardClasses(context, global)) {
  fprintf(stderr, "cannot initialize standard classes");
  exit(EXIT_FAILURE);
}

Parsing

The SpiderMonkey parser can receive JavaScript code from a FILE * input stream:

JSTokenStream * token_stream;
JSParseNode * node;

token_stream = js_NewFileTokenStream(context, NULL, stdin);
if (token_stream == NULL) {
  fprintf(stderr, "cannot create token stream from file\n");
  exit(EXIT_FAILURE);
}

node = js_ParseTokenStream(context, global, token_stream);
if (node == NULL) {
  fprintf(stderr, "parse error in file\n");
  exit(EXIT_FAILURE);
}

The result of the call to js_ParseTokenStream is a parse tree.

The parse tree

The format of the parse tree is documented in jsparse.h. The nodes of the tree are struct JSParseNode objects. This structure has the following fields:

pn_type

This field is an integer code representing the type of the node: TOK_FUNCTION for function definitions, TOK_IF for if statements, TOK_WHILE for while statements, and so on. All possible values are defined in jsscan.h.

pn_u

This is a union with the following fields:

unary: This field is valid for nodes which can have a single child node. It is a struct which has a JSParseNode * field named kid.
binary: This field is valid for nodes which can have two child nodes. It is a struct which has two JSParseNode * fields named left and right.
ternary: This field is valid for nodes which can have three child nodes. It is a struct which has three JSParseNode * fields named kid1, kid2 and kid3.
list: This field is valid for nodes which can have an arbitrary number of child nodes. It is a struct which has a JSParseNode * field named head which points to a linked list of JSParseNode objects. For each element of the list, the pn_next field points to the next element of the list or NULL for the tail of the list.
func: This field is valid for TOK_FUNCTION nodes (function definitions).
name: This field is valid for nodes which represent identifiers (variable names, statement labels, and so on.) It is a struct which has a JSAtom * field named atom and a JSParseNode * field named expr.
dval: This field is valid for numeric literals. It is a double value.

pn_arity

This field acts as a union discriminant specifying which field of pn_u is valid (although in most cases this can be determined from the pn_type field). The value is one of the following:

PN_UNARY
PN_BINARY
PN_TERNARY
PN_LIST
PN_FUNC
PN_NAME
PN_NULLARY

pn_next

This field is used for nodes which are members of a linked list, to point to the next node in the list.

pn_op

This field contains extra information for certain node types.

pn_pos

This field describes the location of the node in the input stream:

node->pn_pos.begin.lineno gives the line number of the first character in the node.
node->pn_pos.end.lineno gives the line number of the last character in the node.
node->pn_pos.begin.index gives the column number of the first character in the node.
node->pn_pos.end.index gives the column number of the last character in the node.

Lines are numbered starting with 1. Columns are numbered starting with 0.

For example, suppose that the variable node points to a node representing a while statement; then the node->pn_type field is TOK_WHILE, the node->pn_arity field is PN_BINARY, and the node->pn_u.binary field is used. The node->pn_u.binary.left field points to a JSParseNode which represents the condition of the while statement, and the node->pn_u.binary.right field points to a JSParseNode which represents the body of the while statement.

It is possible to refer to fields in the pn_u union using a shorter form, thanks to some macros defined in jsparse.h. For example, the node->pn_u.binary.left field has the shorter form node->pn_left.

We now have sufficient information to write code to display an outline of a parse tree:

const char * TOKENS[81] = {
  "EOF", "EOL", "SEMI", "COMMA", "ASSIGN", "HOOK", "COLON", "OR", "AND",
  "BITOR", "BITXOR", "BITAND", "EQOP", "RELOP", "SHOP", "PLUS", "MINUS", "STAR",
  "DIVOP", "UNARYOP", "INC", "DEC", "DOT", "LB", "RB", "LC", "RC", "LP", "RP",
  "NAME", "NUMBER", "STRING", "OBJECT", "PRIMARY", "FUNCTION", "EXPORT",
  "IMPORT", "IF", "ELSE", "SWITCH", "CASE", "DEFAULT", "WHILE", "DO", "FOR",
  "BREAK", "CONTINUE", "IN", "VAR", "WITH", "RETURN", "NEW", "DELETE",
  "DEFSHARP", "USESHARP", "TRY", "CATCH", "FINALLY", "THROW", "INSTANCEOF",
  "DEBUGGER", "XMLSTAGO", "XMLETAGO", "XMLPTAGC", "XMLTAGC", "XMLNAME",
  "XMLATTR", "XMLSPACE", "XMLTEXT", "XMLCOMMENT", "XMLCDATA", "XMLPI", "AT",
  "DBLCOLON", "ANYNAME", "DBLDOT", "FILTER", "XMLELEM", "XMLLIST", "RESERVED",
  "LIMIT",
};

const int NUM_TOKENS = sizeof(TOKENS) / sizeof(TOKENS[0]);

void print_tree(JSParseNode * root, int indent) {
  if (root == NULL) {
    return;
  }
  printf("%*s", indent, "");
  if (root->pn_type >= NUM_TOKENS) {
    printf("UNKNOWN");
  }
  else {
    printf("%s: starts at line %d, column %d, ends at line %d, column %d",
           TOKENS[root->pn_type],
           root->pn_pos.begin.lineno, root->pn_pos.begin.index,
           root->pn_pos.end.lineno, root->pn_pos.end.index);
  }
  printf("\n");
  switch (root->pn_arity) {
  case PN_UNARY:
    print_tree(root->pn_kid, indent + 2);
    break;
  case PN_BINARY:
    print_tree(root->pn_left, indent + 2);
    print_tree(root->pn_right, indent + 2);
    break;
  case PN_TERNARY:
    print_tree(root->pn_kid1, indent + 2);
    print_tree(root->pn_kid2, indent + 2);
    print_tree(root->pn_kid3, indent + 2);
    break;
  case PN_LIST:
    {
      JSParseNode * p;
      for (p = root->pn_head; p != NULL; p = p->pn_next) {
        print_tree(p, indent + 2);
      }
    }
    break;
  case PN_FUNC:
  case PN_NAME:
  case PN_NULLARY:
    break;
  default:
    fprintf(stderr, "Unknown node type\n");
    exit(EXIT_FAILURE);
    break;
  }
}

For example, consider this JavaScript program:

var x, y;
x = 1;
y = x * 2;

The program has the following parse tree:

LC: starts at line 0, column 0, ends at line 3, column 10
  VAR: starts at line 1, column 0, ends at line 1, column 8
    NAME: starts at line 1, column 4, ends at line 1, column 5
    NAME: starts at line 1, column 7, ends at line 1, column 8
  SEMI: starts at line 2, column 0, ends at line 2, column 5
    ASSIGN: starts at line 2, column 0, ends at line 2, column 5
      NAME: starts at line 2, column 0, ends at line 2, column 1
      NUMBER: starts at line 2, column 4, ends at line 2, column 5
  SEMI: starts at line 3, column 0, ends at line 3, column 9
    ASSIGN: starts at line 3, column 0, ends at line 3, column 9
      NAME: starts at line 3, column 0, ends at line 3, column 1
      STAR: starts at line 3, column 4, ends at line 3, column 9
        NAME: starts at line 3, column 4, ends at line 3, column 5
        NUMBER: starts at line 3, column 8, ends at line 3, column 9

Cleanup

Memory used by SpiderMonkey can be freed when no longer needed:

JS_DestroyContext(context);
JS_DestroyRuntime(runtime);

Summary

You can download the complete code for displaying the parse tree. You will requires SpiderMonkey version 1.60 to compile the code.

To compile the example on Linux, using GCC:

gcc -DXP_UNIX -I/path/to/spidermonkey/headers -o example example.c -L/path/to/spidermonkey/library -ljs -lm

To compile the example on Windows, using Microsoft's command line compiler:

cl.exe /DXP_WIN /I\path\to\spidermonkey\headers example.c /MD /link /libpath:\path\to\spidermonkey\library js32.lib

Appendix: Parse tree nodes

Detailed documentation for all the different types of parse nodes is located in jsparse.h and will not be repeated here. The following sections merely attempt to clarify the details for some types of nodes.

`PN_NAME` nodes

Nodes with a pn_arity of PN_NAME have a field node->pn_atom of type JSAtom *. This data type can converted to JSString *. The JSString data type is defined in jsstr.h to represent an array of 16-bit jschar characters.

#include <jsatom.h>

...

JSAtom * atom = node->pn_atom;
JSString * string = ATOM_TO_STRING(atom);
for (int i = 0; i < string->length; i++) {
  /* assuming that this atom contains only ASCII characters */
  char c = string->chars[i];
  putc(c);
}

`PN_FUNC` (`TOK_FUNC`) nodes

If you need to obtain the name of a function or its parameters, see the js_DecompileFunction function in jsopcode.c for an example. The following code shows roughly what you need to do:

    JSObject * object = ATOM_TO_OBJECT(node->pn_funAtom);
    JSFunction * function = (JSFunction *) JS_GetPrivate(context, object);

    /* function name */
    if (function->atom) {
      /* function->atom is a JSAtom * containing the function name */
      ...
    }

    /* function parameters */
    JSAtom ** params = malloc(function->nargs * sizeof(JSAtom *));
    for (int i = 0; i < function->nargs; i++) {
      /* initialize to NULL for sanity check */
      params[i] = NULL;
    }
    JSScope * scope = OBJ_SCOPE(object);
    for (JSScopeProperty * scope_property = SCOPE_LAST_PROP(scope);
         scope_property != NULL;
         scope_property = scope_property->parent) {
      if (scope_property->getter != js_GetArgument) {
        continue;
      }
      params[(uint16) scope_property->shortid] = JSID_TO_ATOM(scope_property->id);
    }
    for (int i = 0; i < function->nargs; i++) {
      assert(params[i] != NULL);
      /* params[i] is a JSAtom * containing the function name */
      ...
    }
    free(params);

`TOK_ASSIGN` nodes

A TOK_ASSIGN is not a simple assignment if pn_op has one of the following values:

`pn_op`	Operator
`JSOP_ADD`	`+=`
`JSOP_SUB`	`-=`
`JSOP_MUL`	`*=`
`JSOP_DIV`	`/=`
`JSOP_MOD`	`%=`
`JSOP_LSH`	`<<=`
`JSOP_RSH`	`>>=`
`JSOP_URSH`	`>>>=`
`JSOP_BITAND`	`&=`
`JSOP_BITOR`	`\|=`
`JSOP_BITXOR`	`^=`

`TOK_BITOR`, `TOK_BITXOR`, `TOK_BITAND`, `TOK_EQOP`, `TOK_RELOP`, `TOK_SHOP`, `TOK_PLUS`, `TOK_MINUS`, `TOK_STAR`, `TOK_DIVOP` nodes

These nodes usually have a pn_arity of PN_BINARY, but an expression like 1 + 2 + 3 will be represented as a single node with a pn_arity of PN_LIST.

`TOK_INC`, `TOK_DEC` nodes

The pn_op value is one of the following:

`pn_op`	Operator
`JSOP_INCNAME`, `JSOP_INCPROP`, `JSOP_INCELEM`	prefix increment
`JSOP_DECNAME`, `JSOP_DECPROP`, `JSOP_DECELEM`	prefix decrement
`JSOP_NAMEINC`, `JSOP_PROPINC`, `JSOP_ELEMINC`	postfix increment
`JSOP_NAMEDEC`, `JSOP_PROPDEC`, `JSOP_ELEMDEC`	postfix decrement

`TOK_STRING` nodes

For a node of type TOK_STRING, pn_atom contains the text of the string, without quotes and with any escape characters already applied. For example, the string literal "1\\2\n" is represented by a node whose pn_atom contains four characters: 1, \, 2, and a newline character.

`TOK_OBJECT` nodes

Nodes with a pn_op value of JSOP_REGEXP represent regular expression literals. The text of the regular expression can be obtained with the following code:

JSObject * object = ATOM_TO_OBJECT(node->pn_atom);
jsval result;
js_regexp_toString(context, object, 0, NULL, &result);
JSString * string = JSVAL_TO_STRING(result);

`TOK_INSTANCEOF` nodes

These nodes have a pn_arity of PN_BINARY, with pn_left representing the expression to the left of instanceof and pn_right representing the expression to the right of instanceof.

`TOK_IN` nodes

These nodes have a pn_arity of PN_BINARY, with pn_left representing the expression to the left of in and pn_right representing the expression to the right of in.

siliconforks.com
Last updated June 14, 2007
siliconforks@siliconforks.com

» Documentation » Parsing JavaScript with SpiderMonkey

Header files

Initialization

Parsing

The parse tree

Cleanup

Summary

Appendix: Parse tree nodes

PN_NAME nodes

PN_FUNC (TOK_FUNC) nodes

TOK_ASSIGN nodes

TOK_BITOR, TOK_BITXOR, TOK_BITAND, TOK_EQOP, TOK_RELOP, TOK_SHOP, TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_DIVOP nodes

TOK_INC, TOK_DEC nodes

TOK_STRING nodes

TOK_OBJECT nodes

TOK_INSTANCEOF nodes

TOK_IN nodes