I set out to replace the pegdown parser with one based on commonmark-java which I picked for two main reasons: speed and ease of extending. The details in the last blog: pegdown - Achilles heel of the Markdown Navigator plugin
As happens often, the battle plan does not survive beyond the first shot. Once I got into the commonmark-java code I realized that it is a major change not just to the parser but to everything that it touches.
The parser does all its text processing during the parsing phase so the AST contains only what is needed for HTML generation. Effectively, the AST is geared to HTML generation and does not reflect the original file elements. For example, ref links are converted to links. All links become Link nodes regardless of whether they are from references, inline, auto links or mail links.
However, the advantages far outweighed the potential effort so I decided to make the changes and see where the results land.
Having to keep track of original source position is not an easy task, especially if you want to
keep the parsing code as close as possible to original string search, compare and extract. I
opted out to use a special
CharSequence that I used in plugin formatting that keeps track of
original source positions but in a simplified form so it does not impact performance.
BasedSequence is a wrapper around the original
CharSequence which has
additional functions to map its characters back to the
base sequence. It makes all
manipulations nothing more than using a
CharSequence and getting the source position tracking
for almost free via:
I had to remove all the pre-processing from the parser so that the AST reflects the source and
not the to be generated HTML. If you don’t keep track of source then all
* are the same, but
if you do they are all unique based on their position in the source. All locations where
characters were appended as part of processing had to be changed to use a
the actual location of that character in the source. That turned out to be practically
everywhere in the parser, duh.
Some flies in the ointment came from processing partially used up leading tabs in the file. The parser simply replaces them with the number of spaces yet to be processed. The other is normalizing or removing the EOLs in the file during parsing. Since the plugin has to work with the file as is and preserve all characters this too was moved to the HTML generating phase.
Delimiter processing had to be made a tad more complex because there is a need to use actual character from the base sequence when a delimiter turns out to be just text. The original implementation has no such constraints and can just append the character to the text.
Overall the effort of conversion was less work than I feared and more than I hoped.
Four days later, I have the parser converted and passing all tests. Nothing is optimized in the new code but I decided to run the primitive benchmark to see how much performance was lost.
The new parser is only 1.6 times slower overall from the original commonmark-java parser and still about 7 times faster than intellij-markdown which is why I chose to work with commonmark-java in the first place.
Now I have an easily extensible, in theory, parser that tracks original source position with no
great effort. To map a part of a node to the original source only requires that a
subSequence() of the text making it up be stored with or extracted from the node.
I still have to implement tests to validate that all nodes store the correct source position. The original tests are only concerned with the right characters being generated. I also have to implement a pegdown compatible parser extensions but that is just effort of writing the code and porting the pegdown tests to this project.
Overall, I am very pleased with the results and my initial choice of basing the new parser on commonmark-java. The flexmark-java project is still in its initial development mode. I did not get around to renaming the directories or the extensions but this will happen soon.
Ratio of performance:
- VERSION.md is the version log file I use for Markdown Navigator
- commonMarkSpec.md is a 33k line file used in intellij-markdown test suite for performance evaluation.
- spec.txt commonmark spec markdown file in the commonmark-java project
hang-pegdown.md is a file containing a single line of 17 characters
[[[[[[[[[[[[[[[[[which causes pegdown to go into a hyper-exponential parse time.
hang-pegdown2.md a file containing a single line of 18 characters
[[[[[[[[[[[[[[[[[[which causes pegdown to go into a hyper-exponential parse time.
- wrap.md is a file I was using to test wrap on typing performance only to discover that it has nothing to do with the wrap on typing code when 0.1 seconds is taken by pegdown to parse the file. In the plugin the parsing may happen more than once: syntax highlighter pass, psi tree building pass, external annotator.
- markdown_example.md a file with 10,000+ lines containing 500kB+ of text.
This is not an exhaustive performance test but combined with other advantages of commonmark-java over intellij-markdown I have enough data to convince me to make commonmark-java as my choice for the parser to use for the Markdown Navigator plugin.
Constructive comments and suggestions are welcome. I can also be reached me at firstname.lastname@example.org
This work is licensed under a Creative Commons Attribution 4.0 International License.