LaTeX encoding: Unicode and HTML in bibliographies and body text
Most users don't need to know any of this. You write UTF-8, you run texmark, the PDF renders correctly. This page exists for the cases where it doesn't — to explain why it didn't, and what knob fixes it.
The problem in 60 seconds
pdflatex is a 1990s engine. It reads UTF-8 input fine (since LaTeX 2018-04-01 the kernel handles UTF-8 natively), but its font stack is 8-bit: each font has 256 slots, and no slot exists for arbitrary Unicode codepoints. When inputenc's default mapping table doesn't know how to render a codepoint, you get:
! LaTeX Error: Unicode character … not set up for use with LaTeX
Under -interaction=nonstopmode (which texmark uses), pdflatex doesn't halt
— it drops the character and continues. The result is silent data loss:
the rendered PDF says "calibration of 18O" instead of "calibration of δ¹⁸O",
and you don't notice unless you read the reference list carefully.
The most common source of this is bibliographies, because:
- .bib files often contain Greek letters in titles (δ¹⁸O), primes in proxy names (uk′37), thin spaces around units (30 kyr), CrossRef-style HTML markup (<i>δ</i><sup>18</sup>O), and accented author names.
- bibtex copies field bytes from .bib straight into .bbl with no Unicode handling. So whatever was in the .bib ends up in the rendered bibliography, unchanged.
- The body text rarely triggers this because pandoc converts the most common typographic Unicode (em-dash, smart quotes, ellipsis) and most users write the rest as LaTeX commands ($\delta^{18}$O).
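To see which characters in a given .bib could trigger this, a quick scan is often enough. The following is a minimal sketch using only the standard library; the function name non_ascii_report is made up for this example and is not part of texmark.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text: str) -> Counter:
    """Count every non-ASCII character in text, keyed by 'U+XXXX NAME'."""
    counts: Counter = Counter()
    for ch in text:
        if ord(ch) > 127:
            # unicodedata.name() raises for unnamed codepoints unless a default is given.
            name = unicodedata.name(ch, "UNKNOWN")
            counts[f"U+{ord(ch):04X} {name}"] += 1
    return counts


sample = "title = {Calibration of δ¹⁸O at 25 °C}"
for label, count in non_ascii_report(sample).most_common():
    print(count, label)
```

Running it over the text of a references file lists the codepoints worth checking against the conversion table further down this page.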
What texmark does about it
When you build with --engine pdflatex (the default), texmark rewrites
non-ASCII codepoints to their LaTeX equivalents on the way into the build
directory. Two staging steps:
- Bibliography: when .bib is copied into build/, the copy is passed through texmark.unicode_bib.rewrite_text, which converts each non-ASCII character and each recognised HTML tag to a LaTeX command (δ→\ensuremath{\delta}, °→{\textdegree}, <sup>18</sup>→\textsuperscript{18}, etc.).
- Body .tex: the same rewrite runs over the pandoc-generated master .tex (and any embedded-chapter .tex chunks) after build_tex writes them. This catches the rare case of scientific Unicode written directly in markdown body text.
The source files on disk are never touched. Only the staged copies in
build/ are rewritten. You can keep your .bib and .md as readable
UTF-8.
The rewrite is a no-op for files that contain no non-ASCII characters.
The file isn't even rewritten, so mtime is preserved and latexmk's
fingerprint cache stays valid for incremental builds.
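The ASCII-only shortcut can be pictured roughly as follows. This is a sketch under stated assumptions, not texmark's actual staging code: stage_file and toy_rewrite are hypothetical names, and the real rewrite is texmark.unicode_bib.rewrite_text.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def stage_file(src: Path, dst: Path, rewrite: Callable[[str], str]) -> bool:
    """Stage src at dst; return True only if the content was rewritten."""
    data = src.read_bytes()
    dst.parent.mkdir(parents=True, exist_ok=True)
    if data.isascii():
        # Nothing to rewrite: plain copy, and copy2 carries the mtime over.
        shutil.copy2(src, dst)
        return False
    dst.write_text(rewrite(data.decode("utf-8")), encoding="utf-8")
    return True


def toy_rewrite(text: str) -> str:
    """Stand-in rewrite that handles a single character, for demonstration only."""
    return text.replace("δ", r"\ensuremath{\delta}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ascii.bib").write_text("title = {Plain}", encoding="utf-8")
    (root / "greek.bib").write_text("title = {δ}", encoding="utf-8")
    ascii_changed = stage_file(root / "ascii.bib", root / "build" / "ascii.bib", toy_rewrite)
    greek_changed = stage_file(root / "greek.bib", root / "build" / "greek.bib", toy_rewrite)
    staged_greek = (root / "build" / "greek.bib").read_text(encoding="utf-8")
    print(ascii_changed, greek_changed, staged_greek)
```

The source file is only ever read; all writes go to the staged path.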
What the conversions look like
| Input | Output | Source |
|---|---|---|
| δ (U+03B4) | \ensuremath{\delta} | pylatexenc |
| ° (U+00B0) | {\textdegree} | pylatexenc |
| ± (U+00B1) | \ensuremath{\pm} | pylatexenc |
| — (em-dash) | {\textemdash} | pylatexenc |
| “ ” (smart quotes) | `` '' | pylatexenc |
| ¹ ² ³ (Latin-1 sup) | \textsuperscript{1} …{2} …{3} | overrides |
| ⁰ ⁴–⁹ (Unicode sup) | \textsuperscript{N} | overrides |
| ₀–₉ (Unicode sub) | \textsubscript{N} | overrides |
| <i>X</i> | \textit{X} | HTML map |
| <sup>X</sup> | \textsuperscript{X} | HTML map |
| <sub>X</sub> | \textsubscript{X} | HTML map |
| <b>X</b> / <strong>X</strong> | \textbf{X} | HTML map |
| <em>X</em> | \emph{X} | HTML map |
Adjacent \textsuperscript/\textsubscript blocks are then merged:
\textsuperscript{1}\textsuperscript{8} collapses to
\textsuperscript{18}, so an isotope label like δ¹⁸O renders as one
typographically coherent piece.
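The merge step can be sketched with a regular expression applied until nothing changes. This is an illustration of the idea only; merge_adjacent_scripts is a hypothetical name, and the real implementation in texmark/unicode_bib.py may differ.

```python
import re

# Two directly adjacent blocks of the same kind (the backreference \1 prevents
# a superscript from being merged with a following subscript).
_PAIR = re.compile(
    r"\\text(super|sub)script\{([^{}]*)\}\\text\1script\{([^{}]*)\}"
)


def merge_adjacent_scripts(tex: str) -> str:
    """Collapse runs of adjacent \\textsuperscript / \\textsubscript blocks."""
    while True:
        merged = _PAIR.sub(
            lambda m: f"\\text{m.group(1)}script{{{m.group(2)}{m.group(3)}}}",
            tex,
        )
        if merged == tex:  # fixed point reached: no adjacent pair left
            return tex
        tex = merged


print(merge_adjacent_scripts(r"\ensuremath{\delta}\textsuperscript{1}\textsuperscript{8}O"))
```

Looping to a fixed point handles runs of three or more blocks, since each pass merges one pair per run.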
Real example: a CrossRef-exported entry
Input (references.bib):
@article{malevich_vetter2019,
title = {Global Core Top Calibration of <i>δ</i><sup>18</sup>O in
Planktic Foraminifera to Sea Surface Temperature},
...
}
Staged copy (build/references.bib):
@article{malevich_vetter2019,
title = {Global Core Top Calibration of \textit{\ensuremath{\delta}}\textsuperscript{18}O in
Planktic Foraminifera to Sea Surface Temperature},
...
}
Rendered in the PDF: Global Core Top Calibration of δ¹⁸O in Planktic Foraminifera…
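The HTML half of that transformation can be approximated in a few lines. The sketch below is an assumption about the general shape, not a copy of texmark's code: HTML_TAG_MAP and html_to_latex are illustrative names (the real table is _HTML_TAG_MAP in texmark/unicode_bib.py), and the mapping simply mirrors the conversion table above.

```python
import re

# Illustrative mapping, mirroring the "HTML map" rows of the conversion table.
HTML_TAG_MAP = {
    "i": r"\textit",
    "em": r"\emph",
    "b": r"\textbf",
    "strong": r"\textbf",
    "sup": r"\textsuperscript",
    "sub": r"\textsubscript",
}

# Matches <tag>...</tag> for known tags only; unknown tags never match.
_TAG = re.compile(
    r"<(" + "|".join(map(re.escape, HTML_TAG_MAP)) + r")>(.*?)</\1>",
    re.DOTALL,
)


def html_to_latex(text: str) -> str:
    """Replace known inline HTML tags with LaTeX commands, outermost first."""
    previous = None
    while previous != text:  # repeat so that nested tags are converted too
        previous = text
        text = _TAG.sub(
            lambda m: f"{HTML_TAG_MAP[m.group(1)]}{{{m.group(2)}}}", text
        )
    return text


print(html_to_latex("Calibration of <i>δ</i><sup>18</sup>O"))
```

Because only tags listed in the map are matched, anything else (for example <custom>) passes through unchanged.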
The optional dependency: pylatexenc
The Unicode → LaTeX mapping comes from
pylatexenc, a small pure-Python
package (~250 KB wheel, MIT-licensed) that ships a comprehensive table of
roughly 3000 Unicode codepoints with their LaTeX equivalents. It's listed in
requirements.txt and installs automatically with pip install texmark.
If pylatexenc is not installed (for example if you're using texmark
under a constrained sandbox), the staging step degrades gracefully: the
file is copied byte-for-byte and the build behaves exactly as it did before
the encoding feature existed. You can still hit the "Unicode character not
set up" failure mode, but only on chars that the underlying engine doesn't
already handle.
A small in-tree overrides table fills gaps in pylatexenc's coverage that
matter for scientific bibliographies, chiefly the Unicode superscript and
subscript blocks (U+2070, U+2074–U+2079, U+2080–U+208E) that pylatexenc
doesn't map.
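The fallback and the overrides table fit together roughly as in the sketch below. It assumes pylatexenc 2.x, whose pylatexenc.latexencode module provides unicode_to_latex with a non_ascii_only flag. The names OVERRIDES and rewrite_sketch are hypothetical, and the overrides shown are a two-entry illustration rather than texmark's real table.

```python
try:
    from pylatexenc.latexencode import unicode_to_latex
except ImportError:  # optional dependency missing: degrade to a plain copy
    unicode_to_latex = None

# Illustrative subset; the real table lives in texmark/unicode_bib.py.
OVERRIDES = {
    "⁴": r"\textsuperscript{4}",
    "₂": r"\textsubscript{2}",
}


def rewrite_sketch(text: str) -> str:
    """Rewrite non-ASCII characters, consulting the overrides first."""
    if unicode_to_latex is None:
        return text  # same bytes out as in
    pieces = []
    for ch in text:
        if ch.isascii():
            pieces.append(ch)  # existing LaTeX such as {\'e} is left alone
        elif ch in OVERRIDES:
            pieces.append(OVERRIDES[ch])
        else:
            pieces.append(unicode_to_latex(ch, non_ascii_only=True))
    return "".join(pieces)


print(rewrite_sketch("CO₂ at 25 °C"))
```

Checking the overrides before pylatexenc means a local entry always wins, which is what lets a small table patch gaps without forking the upstream mapping.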
Unmapped characters
A few rare codepoints have no clean LaTeX equivalent (private-use blocks, emoji, exotic mathematical symbols). When the rewriter encounters one, it:
- Leaves the character in place in the staged copy (silently deleting it would be worse than the existing pdflatex behavior).
- Emits a WARNING per offender with the file path, line number, and surrounding @entry{key} so you can hand-fix the entry, e.g.:
WARNING /path/refs.bib:117: U+1F4A9 ('💩') in @entry{joke_2024}
has no LaTeX replacement and will be dropped from the PDF under pdflatex.
Fix by editing the .bib entry or switching to engine: lualatex / xelatex.
The build continues, and pdflatex drops the offending character from the PDF exactly as it would have without this feature. The only difference is the warning, which you can use to triage.
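Locating the offending line and entry key is straightforward to sketch. The function below is hypothetical (unmapped_warnings and its is_mapped callback are names invented for this example), and it prints a shortened, single-line version of the message shown above.

```python
import re
from typing import Callable, List

# Start of a BibTeX entry, e.g. "@article{malevich_vetter2019,"
_ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)")


def unmapped_warnings(
    path: str, text: str, is_mapped: Callable[[str], bool]
) -> List[str]:
    """Return one warning per non-ASCII character that has no replacement."""
    warnings = []
    current_key = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = _ENTRY.search(line)
        if match:
            current_key = match.group(2)  # remember the enclosing entry
        for ch in line:
            if ord(ch) > 127 and not is_mapped(ch):
                warnings.append(
                    f"WARNING {path}:{lineno}: U+{ord(ch):04X} ({ch!r}) "
                    f"in @entry{{{current_key}}} has no LaTeX replacement"
                )
    return warnings


bib = "@misc{joke_2024,\n  title = {Oops 💩},\n}"
for message in unmapped_warnings("refs.bib", bib, lambda ch: False):
    print(message)
```

Passing the mapping check in as a callback keeps the scan independent of whichever table decides what counts as mapped.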
How the engine changes the picture
By default, texmark only runs the rewrite under engine: pdflatex. Under
lualatex or xelatex both the .bib and the body .tex stage as plain
copies, because those engines render arbitrary UTF-8 natively from OpenType
fonts and you may prefer raw codepoints there for proper font shaping.
| | pdflatex | lualatex / xelatex |
|---|---|---|
| Reads UTF-8 natively | yes | yes |
| Renders arbitrary UTF-8 directly | no (8-bit fonts) | yes (OpenType fonts) |
| .bib rewrite (default) | runs | skipped |
| Body .tex rewrite (default) | runs | skipped |
| Recommended for | speed; broadest package compatibility | full Unicode + system fonts |
Overriding the default: rewrite_unicode
The auto-from-engine behaviour is controlled by the rewrite_unicode knob.
It takes three values:
- auto (default): on for pdflatex, off for lualatex/xelatex.
- on: always rewrite, regardless of engine. Useful under lualatex/xelatex when your .bib is going through bibtex (rather than biber) and you want HTML-tag normalization or sorting safety.
- off: never rewrite. Useful under pdflatex when you've already pre-cleaned your .bib or want raw codepoints to surface as-is for debugging.
CLI:
texmark sources/main.md --pdf --rewrite-unicode auto # default
texmark sources/main.md --pdf --rewrite-unicode on # force on
texmark sources/main.md --pdf --rewrite-unicode off # force off
YAML:
rewrite_unicode: auto   # or: on, off
CLI wins over YAML, both win over the default. Companions are first-class
documents (like with engine:), so each companion's own YAML
rewrite_unicode is honoured for that companion's build.
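The precedence rule (CLI over YAML over default) and the auto behaviour can be summarised in a short function. This is a sketch, not texmark's resolver: resolve_rewrite and its parameters are hypothetical names. The boolean handling is an extra assumption about how the YAML might be loaded, since YAML 1.1 loaders such as PyYAML read bare on and off as True and False.

```python
from typing import Optional, Union

Setting = Union[str, bool]


def _normalise(value: Setting) -> str:
    """Map YAML 1.1 booleans back onto the documented on/off strings."""
    if isinstance(value, bool):
        return "on" if value else "off"
    return value


def resolve_rewrite(
    engine: str,
    cli: Optional[Setting] = None,
    yaml_value: Optional[Setting] = None,
) -> bool:
    """Decide whether the Unicode rewrite runs for this build."""
    setting = "auto"  # the default when neither source sets the knob
    for candidate in (cli, yaml_value):  # CLI is checked first, so it wins
        if candidate is not None:
            setting = _normalise(candidate)
            break
    if setting not in ("auto", "on", "off"):
        raise ValueError(f"invalid rewrite_unicode value: {setting!r}")
    if setting == "auto":
        return engine == "pdflatex"  # on for pdflatex, off for lualatex/xelatex
    return setting == "on"


print(resolve_rewrite("lualatex", yaml_value="on"))
```

Since each companion is built as its own document, a resolver like this would be called once per companion with that companion's YAML value.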
# Default for pdflatex — fast, 8-bit, texmark normalizes Unicode for you.
engine: pdflatex
# rewrite_unicode defaults to auto → on
# Native UTF-8 rendering, raw codepoints preserved by default.
engine: lualatex
# rewrite_unicode defaults to auto → off
# Force the rewrite back on if your .bib carries CrossRef HTML markup
# that bibtex still needs help with:
# rewrite_unicode: on
The rewriter's output (\ensuremath{\delta}, \textit{…}, etc.) is valid in
all three engines, so forcing rewrite_unicode: on under lualatex/xelatex is
always safe: the commands compile, they just override whatever native
OpenType shaping would have done for those codepoints.
How .bib formatting affects what you get
Things that work transparently
- Raw UTF-8 Greek letters, math symbols, primes, degrees, super/subscripts.
- CrossRef-exported entries with inline HTML markup (<i>, <sup>, <sub>, <em>, <b>, <strong>).
- Accented author names (é, ñ, ü, ç, etc.; pdflatex handles these via inputenc's default Latin-1 table even without our rewrite).
Things that need attention
- Private-use codepoints / emoji / very exotic math glyphs. You'll see the warning. Either escape the char in the .bib or switch engine.
- Tags that are not in our HTML map (e.g. <custom>). Left untouched so you don't get a surprise rewrite of something you actually meant. If you want one converted, add it to _HTML_TAG_MAP in texmark/unicode_bib.py (PRs welcome).
- Already-LaTeX content like {\'e} is ASCII, so the rewriter ignores it. Safe to mix LaTeX escapes with raw Unicode in the same .bib.
How main-text formatting affects what you get
Most scientific writing in markdown uses LaTeX-style notation for math
($\delta^{18}$O, $T_{\mathrm{sst}}$), which texmark and pandoc pass
through unchanged. The body rewrite only kicks in when all of the following hold:
- You wrote raw Unicode in the markdown (δ¹⁸O instead of $\delta^{18}$O).
- Pandoc passed it through to the .tex (true for all scientific Unicode, not just typographic).
- Engine is pdflatex.
In that case the rewrite turns your raw δ¹⁸O into
\ensuremath{\delta}\textsuperscript{18}O in the staged .tex. Same visual
result, no "not set up" error.
If you actively prefer raw Unicode in the body (e.g. to take advantage of
OpenType ligatures or kerning under lualatex), set engine: lualatex and
the body rewrite is skipped.
Backend interaction
The encoding work is upstream of any backend (latexmk, raw,
tectonic). By the time latexmk or tectonic sees the staged files,
they're already rewritten. So the backend choice doesn't change what the
encoding layer does — it only changes how the resulting .tex/.bib get
compiled.
One small interaction worth knowing: when texmark rewrites a .tex body
(under pdflatex), it changes the file's bytes, which invalidates latexmk's
fingerprint cache and forces a rebuild on the next pass. The mtime guard
keeps this from happening when nothing was rewritten (ASCII-only inputs
preserve their mtime), but the first build after introducing Unicode into
your source will cost one extra latexmk pass.
Cheat sheet
| Symptom | Likely cause | Fix |
|---|---|---|
| ! LaTeX Error: Unicode character … not set up for use with LaTeX | Engine is pdflatex; codepoint is unmapped or pylatexenc is missing | Install pylatexenc, or switch engine, or escape the char in the source |
| Reference shows literal ¡i¿…¡/i¿ | CrossRef-style HTML markup in the .bib, and you're on a texmark < v0.12.1 | Upgrade texmark |
| δ appears in PDF as a missing glyph | Same as above, fixed by rewrite | Upgrade texmark |
| WARNING … has no LaTeX replacement | Char is genuinely unmapped | Edit .bib entry; or switch engine; or extend unicode_bib._OVERRIDES |
| Reference superscripts look like two separate blocks (¹⁸ rendered as 1 then 8) | Pre-v0.12.1 texmark, no merge step | Upgrade texmark |
| Body Unicode in markdown not converted | Engine is lualatex or xelatex | Either let those engines handle UTF-8 natively (they do), or switch to engine: pdflatex |
Performance
Profiled on real inputs (avg of 10 runs):
| Input | Size | Rewrite cost |
|---|---|---|
| Small ASCII (.tex snippet) | 7 KB | 0.5 ms |
| Typical 10-page paper, ASCII | 70 KB | 4 ms |
| Book-length, ASCII | 700 KB | 44 ms |
| Sparse Unicode (realistic) | 30 KB | 2 ms |
| Unicode-heavy (worst case) | 25 KB | 19 ms |
All numbers are negligible compared to a single pdflatex pass (1–3 s on a typical paper). The ASCII-only path skips the write entirely.
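To check the cost on your own files, a small timing harness is enough. The sketch below is not the script that produced the table above; average_ms is a hypothetical helper, and str.upper stands in for whatever rewrite function you want to measure.

```python
import time
from typing import Callable


def average_ms(func: Callable[[str], str], text: str, runs: int = 10) -> float:
    """Average wall-clock cost of func(text) in milliseconds over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        func(text)
    return (time.perf_counter() - start) * 1000 / runs


# Stand-in workload: about 70 KB of ASCII, similar in size to a typical paper.
sample = "Plain ASCII line of text.\n" * 2700
print(f"{average_ms(str.upper, sample):.3f} ms")
```

Swapping str.upper for the real rewrite function and sample for the contents of your .bib gives a number comparable to the rows in the table.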
See also
- docs/preamble.md — when you need to inject your own LaTeX preamble (the encoding layer doesn't get in your way).
- texmark/unicode_bib.py — the implementation, with extensive inline comments on why each step exists.
- pylatexenc on PyPI — the underlying Unicode → LaTeX mapping table.