2016-06-04 by Vladimir Schneider

Progress on pegdown replacement parser

I set out to replace the pegdown parser with one based on commonmark-java which I picked for two main reasons: speed and ease of extending. The details in the last blog: pegdown - Achilles heel of the Markdown Navigator plugin

As happens often, the battle plan does not survive beyond the first shot. Once I got into the commonmark-java code I realized that it is a major change not just to the parser but to everything that it touches.

The parser does all its text processing during the parsing phase so the AST contains only what is needed for HTML generation. Effectively, the AST is geared to HTML generation and does not reflect the original file elements. For example, ref links are converted to links. All links become Link nodes regardless of whether they are from references, inline, auto links or mail links.

However, the advantages far outweighed the potential effort so I decided to make the changes and see where the results land.

Having to keep track of original source position is not an easy task, especially if you want to keep the parsing code as close as possible to original string search, compare and extract. I opted out to use a special CharSequence that I used in plugin formatting that keeps track of original source positions but in a simplified form so it does not impact performance.

The resulting BasedSequence is a wrapper around the original CharSequence which has additional functions to map its characters back to the base sequence. It makes all manipulations nothing more than using a CharSequence and getting the source position tracking for almost free via:

  • getStartOffset()
  • getEndOffset()
  • getSourceRange()
  • getIndexOffset(int index)

I had to remove all the pre-processing from the parser so that the AST reflects the source and not the to be generated HTML. If you don’t keep track of source then all * are the same, but if you do they are all unique based on their position in the source. All locations where characters were appended as part of processing had to be changed to use a subSequence() from the actual location of that character in the source. That turned out to be practically everywhere in the parser, duh.

Some flies in the ointment came from processing partially used up leading tabs in the file. The parser simply replaces them with the number of spaces yet to be processed. The other is normalizing or removing the EOLs in the file during parsing. Since the plugin has to work with the file as is and preserve all characters this too was moved to the HTML generating phase.

Delimiter processing had to be made a tad more complex because there is a need to use actual character from the base sequence when a delimiter turns out to be just text. The original implementation has no such constraints and can just append the character to the text.

Overall the effort of conversion was less work than I feared and more than I hoped.

Four days later, I have the parser converted and passing all tests. Nothing is optimized in the new code but I decided to run the primitive benchmark to see how much performance was lost.

The new parser is only 1.6 times slower overall from the original commonmark-java parser and still about 7 times faster than intellij-markdown which is why I chose to work with commonmark-java in the first place.

Now I have an easily extensible, in theory, parser that tracks original source position with no great effort. To map a part of a node to the original source only requires that a subSequence() of the text making it up be stored with or extracted from the node.

I still have to implement tests to validate that all nodes store the correct source position. The original tests are only concerned with the right characters being generated. I also have to implement a pegdown compatible parser extensions but that is just effort of writing the code and porting the pegdown tests to this project.

Overall, I am very pleased with the results and my initial choice of basing the new parser on commonmark-java. The flexmark-java project is still in its initial development mode. I did not get around to renaming the directories or the extensions but this will happen soon.

File commonmark-java flexmark-java intellij-markdown pegdown
commonMarkSpec 42.343ms 72.332ms 588.993ms 622.279ms
hang-pegdown 0.012ms 0.024ms 0.028ms 653.111ms
table-format 1.324ms 1.658ms 3.419ms 25.505ms
VERSION 1.234ms 1.625ms 3.582ms 49.494ms
README-SLOW 0.702ms 0.906ms 1.634ms 17.338ms
hang-pegdown2 0.013ms 0.016ms 0.027ms 1300.878ms
wrap 4.479ms 9.088ms 14.649ms 95.989ms
markdown_example 21.056ms 26.393ms 210.310ms 1085.997ms
spec 8.703ms 12.056ms 35.188ms 332.175ms
table 0.125ms 0.144ms 0.676ms 4.169ms

Ratio of performance:

File commonmark-java flexmark-java intellij-markdown pegdown
commonMarkSpec 1.00 1.71 13.91 14.70
hang-pegdown 1.00 1.98 2.30 53102.74
table-format 1.00 1.25 2.58 19.27
VERSION 1.00 1.32 2.90 40.12
README-SLOW 1.00 1.29 2.33 24.70
hang-pegdown2 1.00 1.28 2.15 102738.72
wrap 1.00 2.03 3.27 21.43
markdown_example 1.00 1.25 9.99 51.58
spec 1.00 1.39 4.04 38.17
table 1.00 1.15 5.42 33.47
–––––––– ————— ———–– —————– ———
overall 1.00 1.55 10.73 52.34
  • VERSION.md is the version log file I use for Markdown Navigator
  • commonMarkSpec.md is a 33k line file used in intellij-markdown test suite for performance evaluation.
  • spec.txt commonmark spec markdown file in the commonmark-java project
  • hang-pegdown.md is a file containing a single line of 17 characters [[[[[[[[[[[[[[[[[ which causes pegdown to go into a hyper-exponential parse time.
  • hang-pegdown2.md a file containing a single line of 18 characters [[[[[[[[[[[[[[[[[[ which causes pegdown to go into a hyper-exponential parse time.
  • wrap.md is a file I was using to test wrap on typing performance only to discover that it has nothing to do with the wrap on typing code when 0.1 seconds is taken by pegdown to parse the file. In the plugin the parsing may happen more than once: syntax highlighter pass, psi tree building pass, external annotator.
  • markdown_example.md a file with 10,000+ lines containing 500kB+ of text.

This is not an exhaustive performance test but combined with other advantages of commonmark-java over intellij-markdown I have enough data to convince me to make commonmark-java as my choice for the parser to use for the Markdown Navigator plugin.


 



Constructive comments and suggestions are welcome. I can also be reached me at vladimir@vladsch.com