md2arxiv

OCaml License: MIT Status: alpha Theoretical Computer Science Automata Theory Formal Semantics

⚠️ Alpha version. Software in an early stage of development.

The accepted syntax and the output format may change without notice. Not yet tested on a wide variety of documents always check the generated .tex before a real submission.

A Markdown to LaTeX transpiler written in OCaml, designed to generate packages ready for submission on arXiv. It is not a clone of Pandoc but it is focused on a precise workflow, taking a .md with academic frontmatter (title, authors, ORCID, abstract) and producing a conforming .tex, with author block, affiliations and citations already in the right place. But it is also a didactic project that shows the whole chain of a compiler in miniature, AST, lexer, parser, semantics, target, on an understandable domain.

For the supported Markdown syntax with input/output examples for each construct, see GUIDE.md.

Una versione italiana di questo README Γ¨ disponibile in README.it.md.

Scope (important)

This tool handles the subset of Markdown used in academic writing, NOT full CommonMark.

Supported: heading (#..###), paragraphs, bold, italic, inline code, citations [@key], links [t](url), images ![alt](path), unordered lists, code blocks ```, tables with pipe |, ## Bibliografia section, YAML frontmatter.

NOT supported: raw HTML, nested lists, nested, inline (**_x_**), ordered lists, footnotes, math (the $ in text get escaped).

Build and run

Requires OCaml (β‰₯ 4.14) and dune.

dune build
dune test                      # runs the test suite
dune exec bin/main.exe -- test/sample.md -o paper.tex
# or
dune exec bin/main.exe < test/sample.md > paper.tex
# submission-ready package (paper.tex + figures/ copied):
dune exec bin/main.exe -- test/sample.md --package submission
# strict mode: warnings become fatal (useful in CI):
dune exec bin/main.exe -- test/sample.md --package submission --strict

Architecture

One module, one responsibility. The dependency is one-way.

ast.ml      the abstract syntax (the contract)
diag.ml     collector of warnings/errors with line numbers
lexer.ml    string -> inline list   (char-by-char scan, a DFA by hand)
parser.ml   string list -> block list  (explicit context as a sum type)
meta.ml     YAML frontmatter -> meta record  (with ORCID validation)
validate.ml global semantic checks on the AST (orphan citations)
emit.ml     block -> LaTeX  (the denotational semantics)
arxiv.ml    meta + blocks -> complete .tex document (arXiv preamble)
package.ml  document -> submission-ready folder (.tex + figures/)
bin/main.ml CLI, diagnostics orchestration, exit code

About the diagnostics, a single Diag.t goes through all the phases that feed it without changing their own signatures. No phase halts at the first problem but everything is collected and reported together at the end, with line numbers. See GUIDE.md Β§13.

The pipeline:

raw lines
  β”‚  Meta.split_frontmatter
  β”œβ”€β”€ frontmatter ── Meta.parse ───────────► meta ───┐
  └── body ───────── Parser.parse_lines ───► blocks ──
                          β”‚                          β”‚
                          β–Ό                          β–Ό
                     Validate.run            Arxiv.emit_document
                  (orphan citations)            (uses Emit)
                          β”‚                          β”‚
                          └──────► Diag.t β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                          Diag.report -> stderr + exit code

Every phase writes into the same Diag.t; the .tex output comes out anyway, but the exit code is 1 if there are errors (or warnings in --strict).

Contracts and invariants

The code is annotated with contracts in formal style as is customary in Eiffel/JML/Frama-C, to document the assumptions of each function and give whoever rereads it something against which to verify the body:

Where the invariant is local and low-cost it is also an assert that runs, so in this way, during dune test (and in every build without -noassert) it self-checks.

Examples: the lexer cursor stays within bounds (read_until), a normalized table row has exactly cols cells (emit.ml), a heading has level β‰₯ 1 (parse_heading).

Where instead the invariant is structural or costly it stays only as a comment (e.g. the reverse order of the accumulators, the semantics of lineno in the parser). The checks on user input are NOT asserts but they are real diagnostics (see above) so an assert disappears with -noassert, an input validation must always hold.

Why no .bib + BibTeX

The arXiv TeX environment does not run BibTeX, and we generate directly \begin{thebibliography} inline from the ## Bibliografia section. This removes the manual step (running bibtex, copying the .bbl) that other workflows require so the .tex that we produce is already self-contained and ready.

Formal semantics

The companion document semantics.pdf provides a formal treatment of the core of md2arxiv in two styles, denotational and axiomatic. The denotational semantics defines meaning functions $\mathcal{I}\llbracket\cdot\rrbracket : \mathbf{Inline} \to \mathbf{String}$ and $\mathcal{B}\llbracket\cdot\rrbracket : \mathbf{Block} \to \mathbf{String}$ by structural induction on the AST, following the compositional approach of Strachey and Scott. The emit pipeline is then characterized as the least fixed point of a continuous operator on the CPO $(\mathbf{String}_\perp, \sqsubseteq)$ ordered by prefix, via the Knaster-Tarski theorem. The axiomatic semantics specifies the stateful functions of the lexer (read_until, try_delim) as Hoare triples, formalizing the @requires/@ensures contracts already present in the source.

Roadmap

References

OCaml

Semantics of languages

Automata and computability

Compilers

License

MIT β€” see LICENSE.