Making jieba-rs 2.4x faster

Back in 2019, Paul Meng wrote a post about getting jieba-rs 33% faster than cppjieba. That was a satisfying result, and the performance story stayed there for a long time. Six years, roughly. Nobody complained; the numbers were good enough.

Then starting around May 2025, a few contributors started poking at the hot paths, and I got nerd-sniped into joining. Over the course of about a year, seven PRs took the core segmentation from 2.85 µs down to 1.32 µs per call on the HMM path. That's 2.16x faster. The non-HMM path did even better: 2.21 µs to 0.94 µs, or 2.35x.

None of the individual changes were especially clever. Most of them are the kind of thing you look at and think "oh, right, obviously." But they add up.

How I measured

I checked out each commit, ran cargo bench, and wrote down the median. Same machine, same input, same allocator throughout:

Apple M4 Max
128 GB RAM
macOS 26.4.1
Rust nightly, jemalloc, criterion

The test sentence:

我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。

A pretty typical Chinese sentence with some punctuation and an English acronym thrown in. Not a stress test, but representative of what a search engine indexer would feed in.

Here's the full progression:

Commit    What changed                                    no_hmm    with_hmm   cut_for_search
v0.7.2    Baseline                                        2.21 µs   2.85 µs    3.33 µs
2f06908   lazy_static! → thread_local!, captures → find   2.17 µs   2.82 µs    3.35 µs
2ff9cca   Reusable HMM working memory                     2.11 µs   2.90 µs    3.37 µs
57b1a29   Borrow slices instead of copying                1.99 µs   2.78 µs    3.24 µs
46fdac7   thread_local HmmContext, capacity hints         2.02 µs   2.81 µs    3.27 µs
51cb0e1   Vec-backed DAG, packed word IDs, misc           1.45 µs   2.09 µs    2.59 µs
1f4e325   Kill the regex engine                           1.03 µs   1.49 µs    1.98 µs
9e3965b   char-keyed emit probs, precomputed ln()         0.94 µs   1.32 µs    1.81 µs

The first four commits moved the needle by maybe 10% total. Then #141 and #144 each chopped off 25-29%. The last one got another 9-13%. Classic Pareto distribution.

lazy_static! → thread_local!, captures → find (PR #122)

Runji Wang

jieba-rs had four compiled regexes stored in lazy_static! blocks. That means every access goes through an atomic load to check initialization. thread_local! compiles down to a much cheaper TLS lookup, no atomics.

Same PR also fixed a few places where the code used Regex::captures() but only needed Regex::find() or Regex::is_match(). The regex crate docs literally say "prefer in this order: is_match, find, captures" because each level does more bookkeeping.

Before:

lazy_static! {
    static ref RE_HAN_DEFAULT: Regex = Regex::new(r"...").unwrap();
}

// and then later
let splitter = SplitMatches::new(&RE_HAN_DEFAULT, sentence);

After:

thread_local! {
    static RE_HAN_DEFAULT: Regex = Regex::new(r"...").unwrap();
}

// accessed via closure
RE_HAN_DEFAULT.with(|re_han| {
    let splitter = SplitMatches::new(re_han, sentence);
    // ...
})

The benchmark difference was 1-2%, which is basically noise on this micro-benchmark. But it was the right thing to do regardless, and it cleaned up the code for later changes.

Reusable HMM working memory (PR #123)

Runji Wang

The Viterbi decoder allocates three vectors on every call: v (probability matrix), prev (backpointer matrix), and best_path. For a sentence with C characters and 4 HMM states, that's three allocations of size 4*C or C each. Not huge, but it happens per sentence.

The fix was to bundle these into an HmmContext struct and pass it in:

pub(crate) struct HmmContext {
    v: Vec<f64>,
    prev: Vec<Option<State>>,
    best_path: Vec<State>,
}

Instead of Vec::new() + push, the code does clear() + resize(). The allocator only gets called when the current sentence is longer than anything seen before on this context.
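The pattern in isolation looks like this. This is a simplified sketch, not the actual jieba-rs code: the field layout is reduced to two vectors and the prepare method is my name for the clear-and-resize step.

```rust
// Hypothetical, simplified version of the reuse pattern; the real
// HmmContext carries v, prev, and best_path.
#[derive(Default)]
struct HmmContext {
    v: Vec<f64>,        // C * 4 probability entries
    best_path: Vec<u8>, // one state per character
}

impl HmmContext {
    // Reset for a sentence of `c` characters without freeing memory:
    // clear() keeps the capacity, so resize() only hits the allocator
    // when `c` exceeds everything this context has seen before.
    fn prepare(&mut self, c: usize) {
        self.v.clear();
        self.v.resize(c * 4, f64::NEG_INFINITY);
        self.best_path.clear();
        self.best_path.resize(c, 0);
    }
}

fn main() {
    let mut ctx = HmmContext::default();
    ctx.prepare(10);
    let cap = ctx.v.capacity();
    ctx.prepare(5); // shorter sentence: no allocator call
    assert!(ctx.v.capacity() >= cap);
    assert_eq!(ctx.v.len(), 20);
}
```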

The micro-benchmark didn't show a difference because it's one sentence repeated. In a real pipeline processing thousands of sentences of varying lengths, the saved allocations matter. Anyway, this was a prerequisite for PR #127 which moved the context to thread-local storage.

Borrow slices instead of copying (PR #125)

aniaan

The SplitState enum had a method into_str() that in some code paths involved copying the matched text. After this refactor, all variants borrow directly from the input string, so iteration over the split results produces &str slices with zero copies.

The diff was only 30 lines changed in lib.rs, but it gave 5-6% on no_hmm and 3-4% on with_hmm. Turns out when your inner loop is slicing strings a thousand times, not copying matters.
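The shape of the fix, as a hedged sketch: the real enum is SplitState with into_str(); the names here are illustrative. The point is that every variant holds a &str borrowed from the input, so producing a split result is pointer-and-length bookkeeping, never a copy.

```rust
// Illustrative only; the real type is SplitState with into_str().
enum Block<'t> {
    Matched(&'t str),
    Unmatched(&'t str),
}

impl<'t> Block<'t> {
    fn as_str(&self) -> &'t str {
        match *self {
            Block::Matched(s) | Block::Unmatched(s) => s,
        }
    }
}

fn main() {
    let sentence = String::from("我是CEO。");
    let block = Block::Matched(&sentence[..6]); // "我是" is 6 bytes of UTF-8
    assert_eq!(block.as_str(), "我是");
    // The slice points into the original buffer; nothing was allocated.
    assert!(std::ptr::eq(block.as_str().as_ptr(), sentence.as_ptr()));
}
```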

thread_local HmmContext and capacity hints (PR #127)

Instead of allocating HmmContext on every cut() call (as PR #123 made possible), this moved it to thread_local! storage:

thread_local! {
    static HMM_CONTEXT: RefCell<HmmContext> = RefCell::new(HmmContext::default());
}

// In cut():
HMM_CONTEXT.with(|ctx| {
    let mut hmm_context = ctx.borrow_mut();
    self.cut_dag_hmm(block, &mut words, &mut route, &mut dag, &mut hmm_context);
});

Same PR added Vec::with_capacity() to a few places in TextRank and TfIdf that were doing Vec::new() followed by a known number of pushes. Also tweaked the StaticSparseDAG capacity multiplier from 5 to 4 and added a minimum capacity of 32.

Benchmark impact on cut was within noise. The capacity hints probably help more on keyword extraction, which this benchmark doesn't cover.

Vec-backed DAG, packed word IDs (PR #141)

This was the first change that really moved the numbers. 17-29% across all cut modes, in a single PR. It attacked four things at once.

The biggest win was replacing HashMap<usize, usize> with Vec<usize> in StaticSparseDAG. The DAG maps byte offsets to positions in an edge array. Byte offsets are sequential non-negative integers. Using a hash map for that is like using a sledgehammer to push a thumbtack. A Vec with direct indexing does the same job without any hashing:

// Before
start_pos: HashMap<usize, usize>,
// After
start_pos: Vec<usize>, // index = byte offset, value = position in edge array

Unused slots store usize::MAX as a sentinel. The vec gets fill(NO_ENTRY) on clear() instead of HashMap::clear().
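The idea in isolation, as a sketch rather than jieba-rs's actual StaticSparseDAG (the type and method names here are mine): direct indexing with a sentinel replaces hashing entirely.

```rust
const NO_ENTRY: usize = usize::MAX;

// Maps sequential byte offsets to positions in an edge array
// by direct indexing instead of hashing.
struct OffsetMap {
    start_pos: Vec<usize>,
}

impl OffsetMap {
    fn with_len(n: usize) -> Self {
        OffsetMap { start_pos: vec![NO_ENTRY; n] }
    }

    fn insert(&mut self, byte_offset: usize, edge_pos: usize) {
        self.start_pos[byte_offset] = edge_pos;
    }

    fn get(&self, byte_offset: usize) -> Option<usize> {
        match self.start_pos[byte_offset] {
            NO_ENTRY => None,
            pos => Some(pos),
        }
    }

    // Reuse across sentences: refill with the sentinel, keep the buffer.
    fn clear(&mut self) {
        self.start_pos.fill(NO_ENTRY);
    }
}

fn main() {
    let mut map = OffsetMap::with_len(8);
    map.insert(3, 42);
    assert_eq!(map.get(3), Some(42));
    assert_eq!(map.get(4), None);
    map.clear();
    assert_eq!(map.get(3), None);
}
```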

The second change was storing word IDs directly in the DAG edges. Previously, the code built the DAG during dag(), then during calc() it would re-lookup every edge in the cedar trie to get the word frequency:

// Before: redundant trie lookup in calc()
for byte_end in dag.iter_edges(byte_start) {
    let wfrag = &sentence[byte_start..byte_end];
    let freq = if let Some((word_id, _, _)) = self.cedar.exact_match_search(wfrag) {
        self.records[word_id as usize].freq
    } else {
        1
    };
}

After, the word ID gets packed into a u64 alongside the byte offset during construction, so calc() just reads it out:

// After: word_id comes from the edge, no second lookup
for (byte_end, word_id) in dag.iter_edges(byte_start) {
    let freq = if word_id != NO_MATCH {
        self.records[word_id as usize].freq
    } else {
        1
    };
}

The encoding packs byte_end + 1 in the upper 32 bits and word_id (as i32) in the lower 32 bits of a u64. Zero is the sentinel for "no more edges."
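The packing scheme can be sketched like this; the helper names are mine, not the PR's, but the bit layout follows the description above.

```rust
// Pack byte_end + 1 into the upper 32 bits and the word id (i32, with
// -1 meaning "no dictionary match") into the lower 32 bits. Zero stays
// free as the "no more edges" sentinel because byte_end + 1 is always
// at least 1 for a real edge.
fn pack_edge(byte_end: usize, word_id: i32) -> u64 {
    ((byte_end as u64 + 1) << 32) | (word_id as u32 as u64)
}

fn unpack_edge(packed: u64) -> (usize, i32) {
    let byte_end = (packed >> 32) as usize - 1;
    let word_id = packed as u32 as i32;
    (byte_end, word_id)
}

fn main() {
    let packed = pack_edge(6, 1234);
    assert_eq!(unpack_edge(packed), (6, 1234));

    // A "no match" edge round-trips too, and never collides with the sentinel.
    let no_match = pack_edge(0, -1);
    assert_ne!(no_match, 0);
    assert_eq!(unpack_edge(no_match), (0, -1));
}
```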

Two smaller fixes in the same PR: replacing chars().count() == 1 with chars().nth(1).is_none() (the former iterates the whole string, the latter stops after one character), and replacing a redundant re_han.is_match(block) call with state.is_matched() since the split iterator already classified each block.
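The chars().count() change in isolation, as a toy demonstration rather than the jieba-rs call site:

```rust
fn main() {
    let s = "拖拉机";
    // count() decodes every character; nth(1) stops after at most two.
    let single_old = s.chars().count() == 1;
    let single_new = s.chars().nth(1).is_none();
    assert_eq!(single_old, single_new); // both false here
    assert!("人".chars().nth(1).is_none());
    // Caveat: for the empty string the two differ (count() == 1 is false,
    // nth(1).is_none() is true), which is fine when the input is non-empty.
}
```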

Kill the regex engine (PR #144)

I profiled the code after PR #141 and the regex engine was still taking 29% of CPU time. That seemed absurd for what the regexes were actually doing.

Here's the thing: the four regexes in the hot path (RE_HAN_DEFAULT, RE_HAN_CUT_ALL, RE_SKIP_DEFAULT, RE_SKIP_CUT_ALL) are all just character class matchers. RE_HAN_DEFAULT matches runs of CJK characters plus ASCII alphanumeric plus a few punctuation characters. RE_SKIP_DEFAULT matches whitespace. There's no alternation beyond character classes, no captures, no lookahead, no backreferences. It's "is this codepoint in one of these ranges? yes/no."

The regex crate is fast -- BurntSushi's DFA implementation is genuinely impressive -- but you're still paying for the generality of the engine. A hand-written character classifier compiled with matches! is just a bunch of range comparisons that LLVM can turn into a branch table:

fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{3400}'..='\u{4DBF}'
            | '\u{4E00}'..='\u{9FFF}'
            | '\u{F900}'..='\u{FAFF}'
            | '\u{20000}'..='\u{2A6DF}'
            | '\u{2A700}'..='\u{2B73F}'
            | '\u{2B740}'..='\u{2B81F}'
            | '\u{2B820}'..='\u{2CEAF}'
            | '\u{2CEB0}'..='\u{2EBEF}'
            | '\u{2F800}'..='\u{2FA1F}'
    )
}

fn is_han_default(c: char) -> bool {
    is_cjk(c) || c.is_ascii_alphanumeric() || matches!(c, '+' | '#' | '&' | '.' | '_' | '%' | '-')
}

The SplitByCharacterClass iterator wraps one of these classifiers and splits text into maximal runs of matched/unmatched characters. It does the same thing the regex-based SplitMatches did, just without the regex engine in between:

pub(crate) struct SplitByCharacterClass<'t, F> {
    text: &'t str,
    pos: usize,
    classify: F,
}

The iterator walks the string one character at a time. If classify(c) is true, it extends the current matched run. If false, unmatched run. When the class flips, it yields the accumulated slice and starts a new run. Simple enough that the entire implementation is about 30 lines.
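A sketch of how such an iterator can work. This simplified version just yields each run together with its matched/unmatched flag; the real SplitByCharacterClass carries a bit more state, but the core is the same walk-until-the-class-flips loop.

```rust
// Simplified run splitter: yields maximal runs of characters that all
// agree on classify(c), plus whether the run matched.
struct Runs<'t, F> {
    text: &'t str,
    pos: usize,
    classify: F,
}

impl<'t, F: Fn(char) -> bool> Iterator for Runs<'t, F> {
    type Item = (&'t str, bool);

    fn next(&mut self) -> Option<Self::Item> {
        let rest = &self.text[self.pos..];
        let mut chars = rest.char_indices();
        let (_, first) = chars.next()?;
        let class = (self.classify)(first);
        // Extend the run until the class flips or the input ends.
        let end = chars
            .find(|&(_, c)| (self.classify)(c) != class)
            .map(|(i, _)| i)
            .unwrap_or(rest.len());
        let run = &rest[..end];
        self.pos += end;
        Some((run, class))
    }
}

fn main() {
    let runs: Vec<_> = Runs {
        text: "abc你好de",
        pos: 0,
        classify: |c: char| c.is_ascii_alphanumeric(),
    }
    .collect();
    assert_eq!(runs, vec![("abc", true), ("你好", false), ("de", true)]);
}
```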

For RE_SKIP_DEFAULT (whitespace matching), I didn't even need a splitter -- the unmatched handler in tokenize just iterates character-by-character and groups \r\n pairs. For RE_SKIP_CUT_ALL, same idea with is_skip_cut_all.

The only regex left in the hot path is RE_SKIP in hmm.rs, which matches [a-zA-Z0-9]+(?:\.\d+)?%?. That pattern has actual quantifiers and optional groups, so it earns its regex.

Result: 25-29% faster across all cut modes. no_hmm went from 1.45 µs to 1.03 µs. The regex crate went from 29% of CPU time to basically nothing.

char-keyed emit probs, precomputed ln() (PR #145)

After killing the regex engine, I profiled again to see what was left. Two things stood out.

First, the Viterbi inner loop. For each character in the sentence, for each of 4 HMM states, it looks up an emission probability in a phf::Map. The maps were keyed by &str -- string slices. But every single key is one character. A CJK character is 3 bytes in UTF-8. So every lookup was hashing 3 bytes as a string, comparing byte-by-byte, etc. Switching to phf::Map<char, f64> means the lookup hashes a u32 instead. Much cheaper.

This required a change in the proc macro that generates the HMM data at compile time:

// Before: keys are string slices
pub static EMIT_PROB_0: phf::Map<&'static str, f64> = ...;
// After: keys are chars
pub static EMIT_PROB_0: phf::Map<char, f64> = ...;
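The effect of the key-type change can be seen with any hash map, not just phf. A toy illustration with std's HashMap and made-up probabilities (phf::Map behaves analogously at lookup time):

```rust
use std::collections::HashMap;

fn main() {
    // A &str key hashes a length-prefixed byte slice (3 bytes per CJK
    // character); a char key hashes one fixed-width u32 scalar.
    // The probabilities here are invented for the example.
    let by_str: HashMap<&str, f64> = HashMap::from([("拖", -3.14), ("机", -2.71)]);
    let by_char: HashMap<char, f64> = HashMap::from([('拖', -3.14), ('机', -2.71)]);

    // Same answers, cheaper key on the right.
    assert_eq!(by_str.get("拖"), by_char.get(&'拖'));
    assert_eq!(by_char.get(&'机'), Some(&-2.71));
}
```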

The second thing was (freq as f64).ln() in calc(). This gets called for every edge in the DAG, every time you segment a sentence. Word frequencies don't change after you load the dictionary, so there's no reason to recompute the logarithm each time. I added a log_freq field to Record and precompute it on construction:

struct Record {
    freq: usize,
    log_freq: f64, // (freq as f64).ln(), computed once
    tag: Box<str>,
}

Then calc() just reads self.records[word_id].log_freq instead of calling ln(). Same for the fallback frequency of 1: I precompute log1 = 0.0f64 - logtotal once outside the loop.
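Put together, the precomputation looks roughly like this. A minimal sketch with invented frequencies, the Record trimmed down to the two relevant fields, and the scoring reduced to one subtraction:

```rust
// Sketch: precompute ln(freq) at dictionary-load time so the per-edge
// scoring loop never calls ln().
struct Record {
    freq: usize,
    log_freq: f64, // (freq as f64).ln(), computed once at load
}

fn main() {
    let records: Vec<Record> = [100usize, 2500, 7]
        .iter()
        .map(|&freq| Record { freq, log_freq: (freq as f64).ln() })
        .collect();
    let total: usize = records.iter().map(|r| r.freq).sum();
    let logtotal = (total as f64).ln();
    // Fallback for words not in the dictionary: ln(1) - ln(total).
    let log1 = 0.0f64 - logtotal;

    // Per-edge scoring now reads a field instead of calling ln().
    let score = records[0].log_freq - logtotal;
    assert!((score - (100f64 / total as f64).ln()).abs() < 1e-9);
    assert!(log1 < score);
}
```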

The Viterbi loop itself also got simpler. Instead of managing a Peekable<impl Iterator> over byte offsets to extract character substrings:

// Before
let mut curr = char_offsets.iter().copied().peekable();
let x1 = curr.next().unwrap();
let x2 = *curr.peek().unwrap();
// ... manual peek/next dance to get each character's bytes

It now collects char_indices() up front and uses a plain index loop:

// After
let chars: Vec<(usize, char)> = sentence.char_indices().collect();
for t in 1..C {
    let ch = chars[t].1;
    let em_prob = params.emit_prob(*y as usize, ch);
    // ...
}

The vec allocation is amortized because HmmContext reuse means we rarely actually allocate.

These changes together gave 9-13% improvement. with_hmm dropped from 1.49 µs to 1.32 µs.

Looking back

If I had to pick the two things that mattered most, it'd be "stop using HashMap for sequential integer keys" and "stop using regex for character class checks." Together those two account for most of the 2.4x speedup. The rest are small potatoes individually, but they compound.

The profiler told me where to look every time. Without it I would have guessed wrong about what was slow. I was surprised the regex engine was 29% of runtime -- the patterns look simple, and the regex crate is well-optimized. But "well-optimized for the general case" still loses to "three lines of matches!" when the general case is overkill.

I also keep learning the same lesson about allocation in Rust: Vec::new() inside a loop is a code smell in hot paths. Move it out, reuse it, resize it. The borrow checker will yell at you until you get the lifetimes right, but the allocator will thank you.

Thanks to Runji Wang and aniaan for the initial PRs that started this whole thing off.

Some rights reserved
Except where otherwise noted, content on this page is licensed under a Creative Commons Attribution-NonCommercial 4.0 International license.