Silicon Forks » Documentation » Parsing JavaScript with SpiderMonkey

SpiderMonkey is intended to be used as an interpreter for executing JavaScript, but it is also possible to use SpiderMonkey simply as a parser for the language. This can be useful if you are writing a tool that needs to work with JavaScript code. For example, the JSCoverage tool uses this technique.

This document demonstrates how to use SpiderMonkey to parse JavaScript. Note that some of the SpiderMonkey APIs used in this document are not considered public and are subject to change at any time; this document applies to version 1.60 of SpiderMonkey.

Header files

The following header files must be included:

#include <jsapi.h>
#include <jsparse.h>
#include <jsscan.h>

Initialization

The first step is to initialize the JavaScript engine:

JSRuntime * runtime;
JSContext * context;
JSObject * global;

runtime = JS_NewRuntime(8L * 1024L * 1024L);
if (runtime == NULL) {
  fprintf(stderr, "cannot create runtime");
  exit(EXIT_FAILURE);
}

context = JS_NewContext(runtime, 8192);
if (context == NULL) {
  fprintf(stderr, "cannot create context");
  exit(EXIT_FAILURE);
}

global = JS_NewObject(context, NULL, NULL, NULL);
if (global == NULL) {
  fprintf(stderr, "cannot create global object");
  exit(EXIT_FAILURE);
}

if (! JS_InitStandardClasses(context, global)) {
  fprintf(stderr, "cannot initialize standard classes");
  exit(EXIT_FAILURE);
}

Parsing

The SpiderMonkey parser can receive JavaScript code from a FILE * input stream:

JSTokenStream * token_stream;
JSParseNode * node;

token_stream = js_NewFileTokenStream(context, NULL, stdin);
if (token_stream == NULL) {
  fprintf(stderr, "cannot create token stream from file\n");
  exit(EXIT_FAILURE);
}

node = js_ParseTokenStream(context, global, token_stream);
if (node == NULL) {
  fprintf(stderr, "parse error in file\n");
  exit(EXIT_FAILURE);
}

The result of the call to js_ParseTokenStream is a parse tree.

The parse tree

The format of the parse tree is documented in jsparse.h. The nodes of the tree are struct JSParseNode objects. This structure has the following fields:

pn_type
This field is an integer code representing the type of the node: TOK_FUNCTION for function definitions, TOK_IF for if statements, TOK_WHILE for while statements, and so on. All possible values are defined in jsscan.h.
pn_u
This is a union with the following fields:
unary
This field is valid for nodes which can have a single child node. It is a struct which has a JSParseNode * field named kid.
binary
This field is valid for nodes which can have two child nodes. It is a struct which has two JSParseNode * fields named left and right.
ternary
This field is valid for nodes which can have three child nodes. It is a struct which has three JSParseNode * fields named kid1, kid2 and kid3.
list
This field is valid for nodes which can have an arbitrary number of child nodes. It is a struct which has a JSParseNode * field named head which points to a linked list of JSParseNode objects. For each element of the list, the pn_next field points to the next element of the list or NULL for the tail of the list.
func
This field is valid for TOK_FUNCTION nodes (function definitions).
name
This field is valid for nodes which represent identifiers (variable names, statement labels, and so on.) It is a struct which has a JSAtom * field named atom and a JSParseNode * field named expr.
dval
This field is valid for numeric literals. It is a double value.
pn_arity
This field acts as a union discriminant specifying which field of pn_u is valid (although in most cases this can be determined from the pn_type field). The value is one of the following:
  • PN_UNARY
  • PN_BINARY
  • PN_TERNARY
  • PN_LIST
  • PN_FUNC
  • PN_NAME
  • PN_NULLARY
pn_next
This field is used for nodes which are members of a linked list, to point to the next node in the list.
pn_op
This field contains extra information for certain node types.
pn_pos
This field describes the location of the node in the input stream:
  • node->pn_pos.begin.lineno gives the line number of the first character in the node.
  • node->pn_pos.end.lineno gives the line number of the last character in the node.
  • node->pn_pos.begin.index gives the column number of the first character in the node.
  • node->pn_pos.end.index gives the column number of the last character in the node.
Lines are numbered starting with 1. Columns are numbered starting with 0.

For example, suppose that the variable node points to a node representing a while statement; then the node->pn_type field is TOK_WHILE, the node->pn_arity field is PN_BINARY, and the node->pn_u.binary field is used. The node->pn_u.binary.left field points to a JSParseNode which represents the condition of the while statement, and the node->pn_u.binary.right field points to a JSParseNode which represents the body of the while statement.

It is possible to refer to fields in the pn_u union using a shorter form, thanks to some macros defined in jsparse.h. For example, the node->pn_u.binary.left field has the shorter form node->pn_left.

We now have sufficient information to write code to display an outline of a parse tree:

const char * TOKENS[81] = {
  "EOF", "EOL", "SEMI", "COMMA", "ASSIGN", "HOOK", "COLON", "OR", "AND",
  "BITOR", "BITXOR", "BITAND", "EQOP", "RELOP", "SHOP", "PLUS", "MINUS", "STAR",
  "DIVOP", "UNARYOP", "INC", "DEC", "DOT", "LB", "RB", "LC", "RC", "LP", "RP",
  "NAME", "NUMBER", "STRING", "OBJECT", "PRIMARY", "FUNCTION", "EXPORT",
  "IMPORT", "IF", "ELSE", "SWITCH", "CASE", "DEFAULT", "WHILE", "DO", "FOR",
  "BREAK", "CONTINUE", "IN", "VAR", "WITH", "RETURN", "NEW", "DELETE",
  "DEFSHARP", "USESHARP", "TRY", "CATCH", "FINALLY", "THROW", "INSTANCEOF",
  "DEBUGGER", "XMLSTAGO", "XMLETAGO", "XMLPTAGC", "XMLTAGC", "XMLNAME",
  "XMLATTR", "XMLSPACE", "XMLTEXT", "XMLCOMMENT", "XMLCDATA", "XMLPI", "AT",
  "DBLCOLON", "ANYNAME", "DBLDOT", "FILTER", "XMLELEM", "XMLLIST", "RESERVED",
  "LIMIT",
};

const int NUM_TOKENS = sizeof(TOKENS) / sizeof(TOKENS[0]);

void print_tree(JSParseNode * root, int indent) {
  if (root == NULL) {
    return;
  }
  printf("%*s", indent, "");
  if (root->pn_type >= NUM_TOKENS) {
    printf("UNKNOWN");
  }
  else {
    printf("%s: starts at line %d, column %d, ends at line %d, column %d",
           TOKENS[root->pn_type],
           root->pn_pos.begin.lineno, root->pn_pos.begin.index,
           root->pn_pos.end.lineno, root->pn_pos.end.index);
  }
  printf("\n");
  switch (root->pn_arity) {
  case PN_UNARY:
    print_tree(root->pn_kid, indent + 2);
    break;
  case PN_BINARY:
    print_tree(root->pn_left, indent + 2);
    print_tree(root->pn_right, indent + 2);
    break;
  case PN_TERNARY:
    print_tree(root->pn_kid1, indent + 2);
    print_tree(root->pn_kid2, indent + 2);
    print_tree(root->pn_kid3, indent + 2);
    break;
  case PN_LIST:
    {
      JSParseNode * p;
      for (p = root->pn_head; p != NULL; p = p->pn_next) {
        print_tree(p, indent + 2);
      }
    }
    break;
  case PN_FUNC:
  case PN_NAME:
  case PN_NULLARY:
    break;
  default:
    fprintf(stderr, "Unknown node type\n");
    exit(EXIT_FAILURE);
    break;
  }
}
For example, consider this JavaScript program:
var x, y;
x = 1;
y = x * 2;
The program has the following parse tree:
LC: starts at line 0, column 0, ends at line 3, column 10
  VAR: starts at line 1, column 0, ends at line 1, column 8
    NAME: starts at line 1, column 4, ends at line 1, column 5
    NAME: starts at line 1, column 7, ends at line 1, column 8
  SEMI: starts at line 2, column 0, ends at line 2, column 5
    ASSIGN: starts at line 2, column 0, ends at line 2, column 5
      NAME: starts at line 2, column 0, ends at line 2, column 1
      NUMBER: starts at line 2, column 4, ends at line 2, column 5
  SEMI: starts at line 3, column 0, ends at line 3, column 9
    ASSIGN: starts at line 3, column 0, ends at line 3, column 9
      NAME: starts at line 3, column 0, ends at line 3, column 1
      STAR: starts at line 3, column 4, ends at line 3, column 9
        NAME: starts at line 3, column 4, ends at line 3, column 5
        NUMBER: starts at line 3, column 8, ends at line 3, column 9

Cleanup

Memory used by SpiderMonkey can be freed when no longer needed:

JS_DestroyContext(context);
JS_DestroyRuntime(runtime);

Summary

You can download the complete code for displaying the parse tree. You will requires SpiderMonkey version 1.60 to compile the code.

To compile the example on Linux, using GCC:

gcc -DXP_UNIX -I/path/to/spidermonkey/headers -o example example.c -L/path/to/spidermonkey/library -ljs -lm

To compile the example on Windows, using Microsoft's command line compiler:

cl.exe /DXP_WIN /I\path\to\spidermonkey\headers example.c /MD /link /libpath:\path\to\spidermonkey\library js32.lib

Appendix: Parse tree nodes

Detailed documentation for all the different types of parse nodes is located in jsparse.h and will not be repeated here. The following sections merely attempt to clarify the details for some types of nodes.

PN_NAME nodes

Nodes with a pn_arity of PN_NAME have a field node->pn_atom of type JSAtom *. This data type can converted to JSString *. The JSString data type is defined in jsstr.h to represent an array of 16-bit jschar characters.

#include <jsatom.h>

...

JSAtom * atom = node->pn_atom;
JSString * string = ATOM_TO_STRING(atom);
for (int i = 0; i < string->length; i++) {
  /* assuming that this atom contains only ASCII characters */
  char c = string->chars[i];
  putc(c);
}

PN_FUNC (TOK_FUNC) nodes

If you need to obtain the name of a function or its parameters, see the js_DecompileFunction function in jsopcode.c for an example. The following code shows roughly what you need to do:

    JSObject * object = ATOM_TO_OBJECT(node->pn_funAtom);
    JSFunction * function = (JSFunction *) JS_GetPrivate(context, object);

    /* function name */
    if (function->atom) {
      /* function->atom is a JSAtom * containing the function name */
      ...
    }

    /* function parameters */
    JSAtom ** params = malloc(function->nargs * sizeof(JSAtom *));
    for (int i = 0; i < function->nargs; i++) {
      /* initialize to NULL for sanity check */
      params[i] = NULL;
    }
    JSScope * scope = OBJ_SCOPE(object);
    for (JSScopeProperty * scope_property = SCOPE_LAST_PROP(scope);
         scope_property != NULL;
         scope_property = scope_property->parent) {
      if (scope_property->getter != js_GetArgument) {
        continue;
      }
      params[(uint16) scope_property->shortid] = JSID_TO_ATOM(scope_property->id);
    }
    for (int i = 0; i < function->nargs; i++) {
      assert(params[i] != NULL);
      /* params[i] is a JSAtom * containing the function name */
      ...
    }
    free(params);

TOK_ASSIGN nodes

A TOK_ASSIGN is not a simple assignment if pn_op has one of the following values:

pn_op Operator
JSOP_ADD +=
JSOP_SUB -=
JSOP_MUL *=
JSOP_DIV /=
JSOP_MOD %=
JSOP_LSH <<=
JSOP_RSH >>=
JSOP_URSH >>>=
JSOP_BITAND &=
JSOP_BITOR |=
JSOP_BITXOR ^=

TOK_BITOR, TOK_BITXOR, TOK_BITAND, TOK_EQOP, TOK_RELOP, TOK_SHOP, TOK_PLUS, TOK_MINUS, TOK_STAR, TOK_DIVOP nodes

These nodes usually have a pn_arity of PN_BINARY, but an expression like 1 + 2 + 3 will be represented as a single node with a pn_arity of PN_LIST.

TOK_INC, TOK_DEC nodes

The pn_op value is one of the following:

pn_op Operator
JSOP_INCNAME, JSOP_INCPROP, JSOP_INCELEM prefix increment
JSOP_DECNAME, JSOP_DECPROP, JSOP_DECELEM prefix decrement
JSOP_NAMEINC, JSOP_PROPINC, JSOP_ELEMINC postfix increment
JSOP_NAMEDEC, JSOP_PROPDEC, JSOP_ELEMDEC postfix decrement

TOK_STRING nodes

For a node of type TOK_STRING, pn_atom contains the text of the string, without quotes and with any escape characters already applied. For example, the string literal "1\\2\n" is represented by a node whose pn_atom contains four characters: 1, \, 2, and a newline character.

TOK_OBJECT nodes

Nodes with a pn_op value of JSOP_REGEXP represent regular expression literals. The text of the regular expression can be obtained with the following code:

JSObject * object = ATOM_TO_OBJECT(node->pn_atom);
jsval result;
js_regexp_toString(context, object, 0, NULL, &result);
JSString * string = JSVAL_TO_STRING(result);

TOK_INSTANCEOF nodes

These nodes have a pn_arity of PN_BINARY, with pn_left representing the expression to the left of instanceof and pn_right representing the expression to the right of instanceof.

TOK_IN nodes

These nodes have a pn_arity of PN_BINARY, with pn_left representing the expression to the left of in and pn_right representing the expression to the right of in.

Copyright © 2007 Silicon Forks siliconforks.com
Last updated June 14, 2007
siliconforks@siliconforks.com