Saturday, October 1, 2016

Parsing invalid JSON


When developing Citation.js, I needed to turn a string representing a JavaScript object into an actual object. There are several methods for this, and the one that comes to mind first is JSON.parse(). However, this didn't work. Consider the following input:

{
  type: 'article',
  author: ['Chadwick D. Rittenhouse','Tony W. Mong','Thomas Hart'],
  editor: ['Stuart Pimm'],
  year: 2015,
  title: 'Weather conditions associated with autumn migration by mule deer in Wyoming',
  journal: 'PeerJ',
  volume: 3,
  pages: [1,21],
  doi: '10.7717/peerj.1045',
  publisher: 'PeerJ Inc.'
}

It contains data in the standard format Citation.js uses, but it's written as a JavaScript object literal, not JSON. Valid JSON requires double quotes (") around every string, and property names have to be wrapped in double quotes as well. Valid JSON is valid JavaScript too, but I prefer to write it like this. To accommodate myself and other people who prefer the simpler syntax, I had to come up with something else.
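To make the difference concrete, here is a minimal sketch of what JSON.parse() does with each style. The helper name tryParse and the two sample strings are made up for illustration; they are not part of Citation.js.

```javascript
// Hypothetical sample inputs: the same data in JavaScript style and in JSON style.
const jsStyle = "{ type: 'article', year: 2015 }"
const jsonStyle = '{ "type": "article", "year": 2015 }'

// Hypothetical helper: report whether JSON.parse() accepts the text.
function tryParse (text) {
  try {
    return { ok: true, value: JSON.parse(text) }
  } catch (error) {
    // JSON.parse() throws a SyntaxError on anything that is not valid JSON
    return { ok: false, error: error.name }
  }
}

console.log(tryParse(jsStyle))   // rejected: unquoted keys, single-quoted string
console.log(tryParse(jsonStyle)) // accepted
```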

Option two is eval(), a function that parses JavaScript in a string and executes it on the fly. However, using eval() is strongly discouraged for multiple reasons, one being code injection. Consider two strings. Both are valid JavaScript objects when pasted directly into a script, but only the second is valid JSON.

When the first is processed by eval(), it alerts "Foo" (the alert may be suppressed when the demo runs inside an iframe). Any code could be put where the alert() function is called. The first can't be processed by JSON.parse() at all, so we move on to the second. Processing the second with eval() does not alert "Bar", unlike the first, which alerted "Foo". JSON.parse() can process the second as well, and when it does, it outputs the expected data. As you can see, only JSON.parse() never permits code injection: it throws an error when the input isn't valid JSON, and valid JSON can't contain code.
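The comparison can be sketched roughly as follows. The two strings here are stand-ins I made up to match the description, not the originals, and a hypothetical note() function replaces alert() so the sketch also runs outside a browser.

```javascript
// Hypothetical stand-in for alert(): records every call so the effect is visible.
const calls = []
function note (message) {
  calls.push(message)
  return message
}

// Assumed example strings: the first calls a function, the second only
// contains the call as text inside a quoted string.
const invalidString = "{ foo: note('Foo') }"
const validString = '{ "foo": "note(\'Bar\')" }'

// eval() executes whatever the string contains, so note('Foo') runs.
const evalFirst = eval('(' + invalidString + ')')
// The second string holds only data, so nothing is executed.
const evalSecond = eval('(' + validString + ')')

// JSON.parse() refuses the first string outright...
let parseError = null
try {
  JSON.parse(invalidString)
} catch (error) {
  parseError = error.name
}
// ...and returns plain data for the second.
const parsedSecond = JSON.parse(validString)

console.log(calls)        // only the call injected through eval() shows up
console.log(parseError)
console.log(parsedSecond)
```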

Better use JSON.parse() then. But how are we going to parse invalid JSON without code injection? I hate to say it, but with regex. I know you shouldn't parse anything with regex, but I'm not really parsing with it, only rewriting quotes, and if the rewrite fails, JSON.parse() will throw an error anyway. I use the following regex patterns (in this order):

  1. /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
    Changes single-quoted strings to double-quoted ones. Explanation and example on Regex101
  2. /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g
    Wraps property names in double quotes. Explanation and example on Regex101
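As a rough walk-through, the two patterns can be applied one after the other to a small multi-line input. The replacement strings are the ones used in the Citation.js excerpt in this post; the variable names and the shortened sample input are assumptions for this sketch.

```javascript
// The two patterns from the list above (variable names are hypothetical).
const singleQuotedStrings = /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
const propertyNames = /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g

// A shortened version of the example object, written in JavaScript style.
const input = [
  '{',
  "  type: 'article',",
  "  author: ['Tony W. Mong','Thomas Hart'],",
  '  year: 2015,',
  '  pages: [1,21]',
  '}'
].join('\n')

// Step 1: single-quoted strings become double-quoted strings.
const step1 = input.replace(singleQuotedStrings, '$1"$2"')

// Step 2: property names get wrapped in double quotes.
// Only one of $2, $3 and $4 is filled per match; the others are empty.
const step2 = step1.replace(propertyNames, '$1"$2$3$4"$5:')

console.log(step1)
console.log(step2)
console.log(JSON.parse(step2))
```

Note that the property-name pattern excludes newlines from names, so this sketch keeps each property on its own line, like the example object earlier in the post.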

As I said, this doesn't work perfectly, but it does the trick and it doesn't seem to be dangerous. Applied to the invalidString, it produces invalid JSON, the parser throws an error, and the user is kindly asked to input valid JSON. But for normal JavaScript with somewhat normal string content, it works just fine. And you can still use normal JSON if you want, of course: the code tries plain JSON.parse() first and only falls back to the regex when that fails, as you can see in the source code here:

case '{':case '[':
  // JSON string (probably)
  var obj;
  try       { obj = JSON.parse(data) }
  catch (e) {
    console.warn('Input was not valid JSON, switching to experimental parser for invalid JSON')
    try {
      obj = JSON.parse(data.replace(this._rgx.json[0],'$1"$2"').replace(this._rgx.json[1],'$1"$2$3$4"$5:'))
    } catch (e) {
      console.warn('Experimental parser failed. Please improve the JSON. If this is not JSON, please re-read the supported formats.')
    }
  }
  var res = new Cite(obj);
  inputFormat = 'string/' + res._input.format;
  formatData = res.data;
  break;
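The same fallback idea can be sketched as a standalone function, outside the Citation.js switch statement. The name parseLooseJson and the variable names are hypothetical, and unlike the excerpt above, this version lets the second JSON.parse() error propagate instead of logging a warning, so the caller never continues with an undefined object.

```javascript
// Patterns from this post; the names are made up for this sketch.
const quoteStrings = /((?:\[|:|,)\s*)'((?:\\'|[^'])*?[^\\])?'(?=\s*(?:\]|}|,))/g
const quoteNames = /((?:(?:"|]|}|\/[gmi]|\.|(?:\d|\.|-)*\d)\s*,|{)\s*)(?:"([^":\n]+?)"|'([^":\n]+?)'|([^":\n]+?))(\s*):/g

// Hypothetical helper: strict JSON first, regex repair only as a fallback.
function parseLooseJson (data) {
  try {
    return JSON.parse(data)
  } catch (e) {
    const repaired = data
      .replace(quoteStrings, '$1"$2"')
      .replace(quoteNames, '$1"$2$3$4"$5:')
    // Still invalid after the repair? Then this throws a SyntaxError.
    return JSON.parse(repaired)
  }
}

console.log(parseLooseJson('{"year": 2015}'))                // valid JSON passes straight through
console.log(parseLooseJson("{type: 'article', year: 2015}")) // repaired, then parsed

try {
  parseLooseJson("{a: alert('x')}")                          // code is never executed
} catch (error) {
  console.log(error.name)
}
```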
