Lua word count filter

Frederik Aust

2020-11-25

Using the word count filter

The aim of the rmdfiltr word count filter is to provide a more accurate estimate of the number of words in a document than can be gleaned from the R Markdown source document. When the source document is analyzed, output from R chunks (including inline chunks) as well as formatted citations and references cannot enter the word count. Hence, the word count filter is applied after the document has been knitted and while it is being processed by pandoc. At this stage, the document is represented as an abstract syntax tree (AST), a semantic nested list, and can be manipulated by applying so-called filters.
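To make the idea of a "semantic nested list" concrete, here is a minimal sketch in R. The nested list is a simplified, hand-written imitation of how pandoc represents a single paragraph, and count_str() is a hypothetical helper that is not part of rmdfiltr or pandoc. The actual filter is written in Lua and handles more element types than this sketch does.

```r
# Simplified imitation of the AST for the paragraph "Hello world":
# every node has a type `t` and, optionally, content `c`.
para <- list(
  t = "Para",
  c = list(
    list(t = "Str", c = "Hello"),
    list(t = "Space"),
    list(t = "Str", c = "world")
  )
)

# Hypothetical helper: recursively count the "Str" nodes in a nested list.
count_str <- function(node) {
  if (!is.list(node)) {
    return(0L)                       # plain values carry no nodes
  }
  if (identical(node[["t"]], "Str")) {
    return(1L)                       # one "Str" node is one word-like token
  }
  sum(vapply(node, count_str, integer(1)))
}

n_words <- count_str(para)
print(n_words)
#> [1] 2
```

Because the count is taken on the tree rather than on the source text, anything that only exists after knitting and citation processing (chunk output, formatted citations) is simply more nodes to walk.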

One of the filters that is applied to R Markdown documents by default is citeproc (previously pandoc-citeproc), which formats citations and inserts references. To obtain an accurate estimate, the word count filter should therefore be applied after citeproc. Because citeproc is always applied last by default, this requires disabling its default application by adding the following to the document's YAML front matter:

citeproc: no

To manually apply citeproc and subsequently the rmdfiltr word count filter, add the corresponding pandoc arguments to the output format of your R Markdown document as pandoc_args. Each filter function returns a vector of command line arguments; it takes the previously collected arguments as args and appends to them. Hence, the calls to add filters can be nested:

library("rmdfiltr")
add_citeproc_filter(args = NULL)
#> [1] "--citeproc"
add_wordcount_filter(add_citeproc_filter(args = NULL))
#> [1] "--citeproc"                                                                                                 
#> [2] "--lua-filter"                                                                                               
#> [3] "/private/var/folders/nv/mz4ffsbn045101ngdd_mx0th0000gn/T/Rtmp7tDXDt/Rinst21297336bb2/rmdfiltr/wordcount.lua"
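The chaining convention can be sketched with two stand-in functions. Both functions and the filter path below are hypothetical placeholders chosen for illustration; they only mimic the calling convention described above and are not the rmdfiltr implementations.

```r
# Hypothetical stand-ins: each takes the arguments collected so far
# and appends its own command line arguments.
sketch_add_citeproc <- function(args = NULL) {
  c(args, "--citeproc")
}

sketch_add_lua_filter <- function(args = NULL, path = "path/to/filter.lua") {
  c(args, "--lua-filter", path)
}

# Nested call: the inner function runs first, so citeproc comes first
nested <- sketch_add_lua_filter(sketch_add_citeproc(args = NULL))

# Equivalent step-by-step construction
args <- NULL
args <- sketch_add_citeproc(args)
args <- sketch_add_lua_filter(args)

print(nested)
print(identical(nested, args))
```

The order of the resulting vector matters: pandoc applies filters in the order in which they appear on the command line, so the innermost call determines which filter runs first.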

When adding the filters to pandoc_args, the R code needs to be preceded by !expr to mark it as an expression to be evaluated.

output:
  html_document:
    pandoc_args: !expr rmdfiltr::add_wordcount_filter(rmdfiltr::add_citeproc_filter(args = NULL))

The word count filter reports the word counts in the console or, when working in RStudio, in the R Markdown tab:

285 words in text body
23 words in reference section

Word count filter performance

The rmdfiltr filter is an adapted combination of two other Lua filters by John MacFarlane and contributors.

Although word counting appears to be a trivial matter, the counts of different methods often disagree. The magnitude of those disagreements depends on the complexity of the document.

To get a feeling for the performance of the word count filter, I briefly compared the estimates for two documents across several common methods. The first document, a paper by Stahl & Aust (2018), is rather simple, consisting only of text with citations and a reference section. The second document is more complicated: it contains math, code, verbatim output, etc.

The word counts for the text body exclude tables, images (and their captions), and the reference section. Excluding these required some manual labor in Word, Pages, and wordcounter.net.