<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/rss.xsl" type="text/xsl"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><generator>Typlog 3.1 (https://typlog.com)</generator><title><![CDATA[Messense Lyu]]></title><description><![CDATA[Yet another blog]]></description><link>https://messense.me/en/</link><copyright><![CDATA[Copyright 2017 Messense Lyu]]></copyright><atom:link href="https://messense.me/feed/en.xml" rel="self" type="application/rss+xml"/><atom:link href="https://pubsubhubbub.appspot.com/" rel="hub"/><pubDate>Sun, 19 Apr 2026 16:34:25 +0000</pubDate><item><title><![CDATA[Making jieba-rs 2.4x faster]]></title><guid>https://messense.me/making-jieba-rs-2-4x-faster</guid><link>https://messense.me/making-jieba-rs-2-4x-faster</link><dc:creator><![CDATA[messense]]></dc:creator><pubDate>Sun, 19 Apr 2026 14:56:57 +0000</pubDate><content:encoded><![CDATA[<p>Back in 2019, <a href="https://blog.paulme.ng/posts/2019-06-30-optimizing-jieba-rs-to-be-33percents-faster-than-cppjieba.html">Paul Meng wrote a post</a> about getting jieba-rs 33% faster than cppjieba. That was a satisfying result and the performance story stayed there for a long time. Six years, roughly. Nobody complained, the numbers were good enough.</p>
<p>Then starting around May 2025, a few contributors started poking at the hot paths, and I got nerd-sniped into joining. Over the course of about a year, seven PRs took the core segmentation from 2.85 µs down to 1.32 µs per call on the HMM path. That's 2.16x faster. The non-HMM path did even better: 2.21 µs to 0.94 µs, or 2.35x.</p>
<p>None of the individual changes were especially clever. Most of them are the kind of thing you look at and think &quot;oh, right, obviously.&quot; But they add up.</p>
<h2>How I measured</h2>
<p>I checked out each commit, ran <code>cargo bench</code>, and wrote down the median. Same machine, same input, same allocator throughout:</p>
<div class="block-code"><pre><code>Apple M4 Max
128 GB RAM
macOS 26.4.1
Rust nightly, jemalloc, criterion</code></pre></div>
<p>The test sentence:</p>
<div class="block-code"><pre><code>我是拖拉机学院手扶拖拉机专业的。不用多久，我就会升职加薪，当上CEO，走上人生巅峰。</code></pre></div>
<p>A pretty typical Chinese sentence with some punctuation and an English acronym thrown in. Not a stress test, but representative of what a search engine indexer would feed in.</p>
<p>Here's the full progression:</p>
<div class="block-table"><table><thead>
<tr>
  <th>Commit</th>
  <th>What changed</th>
  <th><code>no_hmm</code></th>
  <th><code>with_hmm</code></th>
  <th><code>cut_for_search</code></th>
</tr>
</thead>
<tbody>
<tr>
  <td>v0.7.2</td>
  <td>Baseline</td>
  <td>2.21 µs</td>
  <td>2.85 µs</td>
  <td>3.33 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/2f06908">2f06908</a></td>
  <td><code>lazy_static!</code> → <code>thread_local!</code>, captures → find</td>
  <td>2.17 µs</td>
  <td>2.82 µs</td>
  <td>3.35 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/2ff9cca">2ff9cca</a></td>
  <td>Reusable HMM working memory</td>
  <td>2.11 µs</td>
  <td>2.90 µs</td>
  <td>3.37 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/57b1a29">57b1a29</a></td>
  <td>Borrow slices instead of copying</td>
  <td>1.99 µs</td>
  <td>2.78 µs</td>
  <td>3.24 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/46fdac7">46fdac7</a></td>
  <td>thread_local HmmContext, capacity hints</td>
  <td>2.02 µs</td>
  <td>2.81 µs</td>
  <td>3.27 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/51cb0e1">51cb0e1</a></td>
  <td>Vec-backed DAG, packed word IDs, misc</td>
  <td>1.45 µs</td>
  <td>2.09 µs</td>
  <td>2.59 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/1f4e325">1f4e325</a></td>
  <td>Kill the regex engine</td>
  <td>1.03 µs</td>
  <td>1.49 µs</td>
  <td>1.98 µs</td>
</tr>
<tr>
  <td><a href="https://github.com/messense/jieba-rs/commit/9e3965b">9e3965b</a></td>
  <td>char-keyed emit probs, precomputed ln()</td>
  <td>0.94 µs</td>
  <td>1.32 µs</td>
  <td>1.81 µs</td>
</tr>
</tbody>
</table></div><p>The first four commits moved the needle by maybe 10% total. Then #141 and #144 each chopped off roughly 20-30%. The last one shaved off roughly 10% more. Classic Pareto distribution.</p>
<h2><code>lazy_static!</code> → <code>thread_local!</code>, captures → find (<a href="https://github.com/messense/jieba-rs/pull/122">PR #122</a>)</h2>
<p><em><a href="https://github.com/wrj">Runji Wang</a></em></p>
<p>jieba-rs had four compiled regexes stored in <code>lazy_static!</code> blocks. That means every access goes through an atomic load to check initialization. <code>thread_local!</code> compiles down to a much cheaper TLS lookup, no atomics.</p>
<p>Same PR also fixed a few places where the code used <code>Regex::captures()</code> but only needed <code>Regex::find()</code> or <code>Regex::is_match()</code>. The regex crate docs literally say &quot;prefer in this order: <code>is_match</code>, <code>find</code>, <code>captures</code>&quot; because each level does more bookkeeping.</p>
<p>Before:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="n">lazy_static</span><span class="o">!</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="k">ref</span><span class="w"> </span><span class="n">RE_HAN_DEFAULT</span>: <span class="nc">Regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Regex</span>::<span class="n">new</span><span class="p">(</span><span class="s">r&quot;...&quot;</span><span class="p">).</span><span class="n">unwrap</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div><div class="line"><span class="c1">// and then later</span>
</div><div class="line"><span class="kd">let</span><span class="w"> </span><span class="n">splitter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SplitMatches</span>::<span class="n">new</span><span class="p">(</span><span class="o">&amp;</span><span class="n">RE_HAN_DEFAULT</span><span class="p">,</span><span class="w"> </span><span class="n">sentence</span><span class="p">);</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>After:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="fm">thread_local!</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="n">RE_HAN_DEFAULT</span>: <span class="nc">Regex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Regex</span>::<span class="n">new</span><span class="p">(</span><span class="s">r&quot;...&quot;</span><span class="p">).</span><span class="n">unwrap</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div><div class="line"><span class="c1">// accessed via closure</span>
</div><div class="line"><span class="n">RE_HAN_DEFAULT</span><span class="p">.</span><span class="n">with</span><span class="p">(</span><span class="o">|</span><span class="n">re_han</span><span class="o">|</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">splitter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SplitMatches</span>::<span class="n">new</span><span class="p">(</span><span class="n">re_han</span><span class="p">,</span><span class="w"> </span><span class="n">sentence</span><span class="p">);</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="c1">// ...</span>
</div><div class="line"><span class="p">})</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The benchmark difference was 1-2%, which is basically noise on this micro-benchmark. But it was the right thing to do regardless, and it cleaned up the code for later changes.</p>
<h2>Reusable HMM working memory (<a href="https://github.com/messense/jieba-rs/pull/123">PR #123</a>)</h2>
<p><em><a href="https://github.com/wrj">Runji Wang</a></em></p>
<p>The Viterbi decoder allocates three vectors on every call: <code>v</code> (probability matrix), <code>prev</code> (backpointer matrix), and <code>best_path</code>. For a sentence with <code>C</code> characters and 4 HMM states, that's three allocations of size <code>4*C</code> or <code>C</code> each. Not huge, but it happens per sentence.</p>
<p>The fix was to bundle these into an <code>HmmContext</code> struct and pass it in:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">struct</span> <span class="nc">HmmContext</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">v</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">f64</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">prev</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">Option</span><span class="o">&lt;</span><span class="n">State</span><span class="o">&gt;&gt;</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">best_path</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">State</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>Instead of <code>Vec::new()</code> + <code>push</code>, the code does <code>clear()</code> + <code>resize()</code>. The allocator only gets called when the current sentence is longer than anything seen before on this context.</p>
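<p>A minimal std-only sketch of that reuse pattern (the <code>prepare()</code> method name is my invention, and <code>usize</code> stands in for the real <code>State</code> type): the buffers live on between calls, so <code>resize()</code> only touches the allocator when a sentence outgrows everything seen before:</p>

```rust
// Reusable working memory: clear() + resize() instead of fresh Vec::new().
// Field layout follows the post; prepare() is a hypothetical helper name.
#[derive(Default)]
struct HmmContext {
    v: Vec<f64>,              // probability matrix, num_chars * num_states
    prev: Vec<Option<usize>>, // backpointer matrix (usize stands in for State)
    best_path: Vec<usize>,
}

impl HmmContext {
    fn prepare(&mut self, num_chars: usize, num_states: usize) {
        self.v.clear();
        self.v.resize(num_chars * num_states, 0.0);
        self.prev.clear();
        self.prev.resize(num_chars * num_states, None);
        self.best_path.clear();
        self.best_path.resize(num_chars, 0);
    }
}

fn main() {
    let mut ctx = HmmContext::default();
    ctx.prepare(10, 4); // first call allocates
    let cap = ctx.v.capacity();
    ctx.prepare(8, 4); // shorter sentence: capacity retained, no reallocation
    assert!(ctx.v.capacity() >= cap);
    assert_eq!(ctx.v.len(), 32);
    assert_eq!(ctx.best_path.len(), 8);
}
```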
<p>The micro-benchmark didn't show a difference because it's one sentence repeated. In a real pipeline processing thousands of sentences of varying lengths, the saved allocations matter. Anyway, this was a prerequisite for PR #127, which moved the context to thread-local storage.</p>
<h2>Borrow slices instead of copying (<a href="https://github.com/messense/jieba-rs/pull/125">PR #125</a>)</h2>
<p><em><a href="https://github.com/aniaan">aniaan</a></em></p>
<p>The <code>SplitState</code> enum had a method <code>into_str()</code> that in some code paths involved copying the matched text. After this refactor, all variants borrow directly from the input string, so iteration over the split results produces <code>&amp;str</code> slices with zero copies.</p>
<p>The diff was only 30 lines changed in <code>lib.rs</code>, but it gave 5-6% on <code>no_hmm</code> and 3-4% on <code>with_hmm</code>. Turns out when your inner loop is slicing strings a thousand times, not copying matters.</p>
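<p>The shape of the change, as a std-only sketch (variant names are illustrative, not the crate's exact ones): every variant holds a slice of the caller's string, so consuming the state hands back a borrow instead of an owned copy:</p>

```rust
// Every variant borrows from the input; into_str() moves a slice out,
// never copies. (Simplified stand-in for the real SplitState enum.)
enum SplitState<'a> {
    Matched(&'a str),
    Unmatched(&'a str),
}

impl<'a> SplitState<'a> {
    fn into_str(self) -> &'a str {
        match self {
            SplitState::Matched(s) | SplitState::Unmatched(s) => s,
        }
    }
}

fn main() {
    let sentence = String::from("你好 world");
    let state = SplitState::Matched(&sentence[0..6]); // "你好" is 6 bytes
    let s = state.into_str();
    assert_eq!(s, "你好");
    // The result points into the original buffer: zero copies.
    assert_eq!(s.as_ptr(), sentence.as_ptr());
}
```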
<h2>thread_local HmmContext and capacity hints (<a href="https://github.com/messense/jieba-rs/pull/127">PR #127</a>)</h2>
<p>Instead of allocating <code>HmmContext</code> on every <code>cut()</code> call (as PR #123 made possible), this moved it to <code>thread_local!</code> storage:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="fm">thread_local!</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="k">static</span><span class="w"> </span><span class="n">HMM_CONTEXT</span>: <span class="nc">RefCell</span><span class="o">&lt;</span><span class="n">HmmContext</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">RefCell</span>::<span class="n">new</span><span class="p">(</span><span class="n">HmmContext</span>::<span class="n">default</span><span class="p">());</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div><div class="line">
</div><div class="line"><span class="c1">// In cut():</span>
</div><div class="line"><span class="n">HMM_CONTEXT</span><span class="p">.</span><span class="n">with</span><span class="p">(</span><span class="o">|</span><span class="n">ctx</span><span class="o">|</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">hmm_context</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctx</span><span class="p">.</span><span class="n">borrow_mut</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="bp">self</span><span class="p">.</span><span class="n">cut_dag_hmm</span><span class="p">(</span><span class="n">block</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="n">words</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="n">route</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="n">dag</span><span class="p">,</span><span class="w"> </span><span class="o">&amp;</span><span class="k">mut</span><span class="w"> </span><span class="n">hmm_context</span><span class="p">);</span><span class="w"></span>
</div><div class="line"><span class="p">});</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>Same PR added <code>Vec::with_capacity()</code> to a few places in TextRank and TfIdf that were doing <code>Vec::new()</code> followed by a known number of pushes. Also tweaked the <code>StaticSparseDAG</code> capacity multiplier from 5 to 4 and added a minimum capacity of 32.</p>
<p>Benchmark impact on <code>cut</code> was within noise. The capacity hints probably help more on keyword extraction, which this benchmark doesn't cover.</p>
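<p>For illustration, here's what a capacity hint buys in isolation (the push count below is made up for the demo; the multiplier and minimum are the ones described above):</p>

```rust
use std::cmp::max;

fn main() {
    // Pushing n items into Vec::new() triggers O(log n) grow-and-copy
    // steps; with_capacity(n) does a single allocation up front.
    let n = 1000;
    let mut v: Vec<usize> = Vec::with_capacity(n);
    let cap = v.capacity();
    for i in 0..n {
        v.push(i);
    }
    assert_eq!(v.capacity(), cap); // never reallocated while pushing

    // The StaticSparseDAG sizing described above: 4x the input length,
    // floored at 32. (My reading of the tweak, expressed as a closure.)
    let dag_capacity = |len: usize| max(len * 4, 32);
    assert_eq!(dag_capacity(5), 32);
    assert_eq!(dag_capacity(40), 160);
}
```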
<h2>Vec-backed DAG, packed word IDs (<a href="https://github.com/messense/jieba-rs/pull/141">PR #141</a>)</h2>
<p>This was the first change that really moved the numbers. 17-29% across all cut modes, in a single PR. It attacked four things at once.</p>
<p>The biggest win was replacing <code>HashMap&lt;usize, usize&gt;</code> with <code>Vec&lt;usize&gt;</code> in <code>StaticSparseDAG</code>. The DAG maps byte offsets to positions in an edge array. Byte offsets are sequential non-negative integers. Using a hash map for that is like using a sledgehammer to push a thumbtack. A <code>Vec</code> with direct indexing does the same job without any hashing:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// Before</span>
</div><div class="line"><span class="n">start_pos</span>: <span class="nc">HashMap</span><span class="o">&lt;</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="kt">usize</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span>
</div><div class="line">
</div><div class="line"><span class="c1">// After</span>
</div><div class="line"><span class="n">start_pos</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="kt">usize</span><span class="o">&gt;</span><span class="p">,</span><span class="w">  </span><span class="c1">// index = byte offset, value = position in edge array</span>
</div></code></pre></div>
</div>
<p>Unused slots store <code>usize::MAX</code> as a sentinel. The vec gets <code>fill(NO_ENTRY)</code> on <code>clear()</code> instead of <code>HashMap::clear()</code>.</p>
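<p>The core of the trick as a freestanding sketch (<code>OffsetIndex</code> is a simplified stand-in, not the crate's type; the real <code>StaticSparseDAG</code> has more going on):</p>

```rust
// A Vec indexed directly by byte offset replaces HashMap<usize, usize>.
// usize::MAX marks empty slots; clear() is a fill, not a rehash.
const NO_ENTRY: usize = usize::MAX;

struct OffsetIndex {
    start_pos: Vec<usize>, // index = byte offset, value = edge-array position
}

impl OffsetIndex {
    fn new(sentence_len: usize) -> Self {
        OffsetIndex { start_pos: vec![NO_ENTRY; sentence_len] }
    }
    fn insert(&mut self, byte_offset: usize, edge_pos: usize) {
        self.start_pos[byte_offset] = edge_pos;
    }
    fn get(&self, byte_offset: usize) -> Option<usize> {
        match self.start_pos[byte_offset] {
            NO_ENTRY => None,
            pos => Some(pos),
        }
    }
    fn clear(&mut self) {
        self.start_pos.fill(NO_ENTRY);
    }
}

fn main() {
    let mut idx = OffsetIndex::new(12);
    idx.insert(0, 7);
    assert_eq!(idx.get(0), Some(7));
    assert_eq!(idx.get(3), None); // untouched slot reads as empty
    idx.clear();
    assert_eq!(idx.get(0), None);
}
```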
<p>The second change was storing word IDs directly in the DAG edges. Previously, the code built the DAG during <code>dag()</code>, then during <code>calc()</code> it would re-lookup every edge in the cedar trie to get the word frequency:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// Before: redundant trie lookup in calc()</span>
</div><div class="line"><span class="k">for</span><span class="w"> </span><span class="n">byte_end</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">dag</span><span class="p">.</span><span class="n">iter_edges</span><span class="p">(</span><span class="n">byte_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">wfrag</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">&amp;</span><span class="n">sentence</span><span class="p">[</span><span class="n">byte_start</span><span class="o">..</span><span class="n">byte_end</span><span class="p">];</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">freq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="kd">let</span><span class="w"> </span><span class="nb">Some</span><span class="p">((</span><span class="n">word_id</span><span class="p">,</span><span class="w"> </span><span class="n">_</span><span class="p">,</span><span class="w"> </span><span class="n">_</span><span class="p">))</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">self</span><span class="p">.</span><span class="n">cedar</span><span class="p">.</span><span class="n">exact_match_search</span><span class="p">(</span><span class="n">wfrag</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="bp">self</span><span class="p">.</span><span class="n">records</span><span class="p">[</span><span class="n">word_id</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="p">].</span><span class="n">freq</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="mi">1</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="p">};</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>After, the word ID gets packed into a <code>u64</code> alongside the byte offset during construction, so <code>calc()</code> just reads it out:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// After: word_id comes from the edge, no second lookup</span>
</div><div class="line"><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">byte_end</span><span class="p">,</span><span class="w"> </span><span class="n">word_id</span><span class="p">)</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">dag</span><span class="p">.</span><span class="n">iter_edges</span><span class="p">(</span><span class="n">byte_start</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">freq</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">word_id</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="n">NO_MATCH</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="bp">self</span><span class="p">.</span><span class="n">records</span><span class="p">[</span><span class="n">word_id</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="p">].</span><span class="n">freq</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="mi">1</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="p">};</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The encoding packs <code>byte_end + 1</code> in the upper 32 bits and <code>word_id</code> (as <code>i32</code>) in the lower 32 bits of a <code>u64</code>. Zero is the sentinel for &quot;no more edges.&quot;</p>
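<p>The round trip looks roughly like this (I'm assuming <code>NO_MATCH</code> is <code>-1</code> here; the exact sentinel value is an implementation detail I'm glossing over):</p>

```rust
// Pack byte_end + 1 into the upper 32 bits and word_id (as i32) into the
// lower 32 bits. Since byte_end + 1 >= 1, a packed edge is never zero,
// which leaves zero free as the end-of-edges sentinel.
fn pack(byte_end: usize, word_id: i32) -> u64 {
    (((byte_end + 1) as u64) << 32) | (word_id as u32 as u64)
}

fn unpack(edge: u64) -> (usize, i32) {
    ((edge >> 32) as usize - 1, (edge & 0xFFFF_FFFF) as u32 as i32)
}

fn main() {
    let e = pack(6, 42);
    assert_eq!(unpack(e), (6, 42));
    let no_match = pack(3, -1); // NO_MATCH as -1 survives the round trip
    assert_eq!(unpack(no_match), (3, -1));
    assert_ne!(no_match, 0); // zero stays reserved for "no more edges"
}
```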
<p>Two smaller fixes in the same PR: replacing <code>chars().count() == 1</code> with <code>chars().nth(1).is_none()</code> (the former iterates the whole string, the latter stops after at most two characters), and replacing a redundant <code>re_han.is_match(block)</code> call with <code>state.is_matched()</code> since the split iterator already classified each block.</p>
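<p>The short-circuit is easy to see in isolation:</p>

```rust
fn main() {
    // count() must walk every char to produce a total; nth(1) can stop as
    // soon as it knows whether a second char exists.
    let single = "专";
    let many = "拖拉机学院";
    assert!(single.chars().nth(1).is_none());
    assert!(many.chars().nth(1).is_some());
    // Equivalent to the old check for non-empty strings:
    assert_eq!(single.chars().count() == 1, single.chars().nth(1).is_none());
}
```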
<h2>Kill the regex engine (<a href="https://github.com/messense/jieba-rs/pull/144">PR #144</a>)</h2>
<p>I profiled the code after PR #141 and the regex engine was still taking 29% of CPU time. That seemed absurd for what the regexes were actually doing.</p>
<p>Here's the thing: the four regexes in the hot path (<code>RE_HAN_DEFAULT</code>, <code>RE_HAN_CUT_ALL</code>, <code>RE_SKIP_DEFAULT</code>, <code>RE_SKIP_CUT_ALL</code>) are all just character class matchers. <code>RE_HAN_DEFAULT</code> matches runs of CJK characters plus ASCII alphanumeric plus a few punctuation characters. <code>RE_SKIP_DEFAULT</code> matches whitespace. There's no alternation beyond character classes, no captures, no lookahead, no backreferences. It's &quot;is this codepoint in one of these ranges? yes/no.&quot;</p>
<p>The regex crate is fast -- BurntSushi's DFA implementation is genuinely impressive -- but you're still paying for the generality of the engine. A hand-written classifier built on <code>matches!</code> compiles down to a handful of range comparisons that LLVM can turn into a branch table:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="k">fn</span> <span class="nf">is_cjk</span><span class="p">(</span><span class="n">c</span>: <span class="kt">char</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="kt">bool</span> <span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="fm">matches!</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="sc">&#39;\u{3400}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{4DBF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{4E00}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{9FFF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{F900}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{FAFF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{20000}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2A6DF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{2A700}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2B73F}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{2B740}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2B81F}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{2B820}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2CEAF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{2CEB0}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2EBEF}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">        </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;\u{2F800}&#39;</span><span class="o">..=</span><span class="sc">&#39;\u{2FA1F}&#39;</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="p">)</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div><div class="line">
</div><div class="line"><span class="k">fn</span> <span class="nf">is_han_default</span><span class="p">(</span><span class="n">c</span>: <span class="kt">char</span><span class="p">)</span><span class="w"> </span>-&gt; <span class="kt">bool</span> <span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">is_cjk</span><span class="p">(</span><span class="n">c</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">c</span><span class="p">.</span><span class="n">is_ascii_alphanumeric</span><span class="p">()</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="fm">matches!</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="w"> </span><span class="sc">&#39;+&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;#&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;&amp;&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;.&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;_&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;%&#39;</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="sc">&#39;-&#39;</span><span class="p">)</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The <code>SplitByCharacterClass</code> iterator wraps one of these classifiers and splits text into maximal runs of matched/unmatched characters. It does the same thing the regex-based <code>SplitMatches</code> did, just without the regex engine in between:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="k">pub</span><span class="p">(</span><span class="k">crate</span><span class="p">)</span><span class="w"> </span><span class="k">struct</span> <span class="nc">SplitByCharacterClass</span><span class="o">&lt;&#39;</span><span class="na">t</span><span class="p">,</span><span class="w"> </span><span class="n">F</span><span class="o">&gt;</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">text</span>: <span class="kp">&amp;</span><span class="o">&#39;</span><span class="na">t</span> <span class="kt">str</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">pos</span>: <span class="kt">usize</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">classify</span>: <span class="nc">F</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The iterator walks the string one character at a time. If <code>classify(c)</code> is true, it extends the current matched run. If false, unmatched run. When the class flips, it yields the accumulated slice and starts a new run. Simple enough that the entire implementation is about 30 lines.</p>
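<p>A condensed version of that state machine (collecting into a <code>Vec</code> here for clarity; the real type is a lazy iterator that also tags each run as matched or unmatched):</p>

```rust
// Split text into maximal runs where the classifier is uniformly true or
// uniformly false. Each run is tagged with its class.
fn split_runs<F: Fn(char) -> bool>(text: &str, classify: F) -> Vec<(bool, &str)> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let class = classify(c);
        match current {
            Some(prev) if prev == class => {} // extend the current run
            Some(prev) => {
                // Class flipped: yield the accumulated slice, start a new run.
                out.push((prev, &text[start..i]));
                start = i;
                current = Some(class);
            }
            None => current = Some(class),
        }
    }
    if let Some(class) = current {
        out.push((class, &text[start..]));
    }
    out
}

fn main() {
    let runs = split_runs("abc, def", |c| c.is_ascii_alphanumeric());
    assert_eq!(runs, vec![(true, "abc"), (false, ", "), (true, "def")]);
}
```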
<p>For <code>RE_SKIP_DEFAULT</code> (whitespace matching), I didn't even need a splitter -- the unmatched handler in <code>tokenize</code> just iterates character-by-character and groups <code>\r\n</code> pairs. For <code>RE_SKIP_CUT_ALL</code>, same idea with <code>is_skip_cut_all</code>.</p>
<p>The only regex left in the hot path is <code>RE_SKIP</code> in <code>hmm.rs</code>, which matches <code>[a-zA-Z0-9]+(?:\.\d+)?%?</code>. That pattern has actual quantifiers and optional groups, so it earns its regex.</p>
<p>Result: 25-29% faster across all cut modes. <code>no_hmm</code> went from 1.45 µs to 1.03 µs. The regex crate went from 29% of CPU time to basically nothing.</p>
<h2>char-keyed emit probs, precomputed ln() (<a href="https://github.com/messense/jieba-rs/pull/145">PR #145</a>)</h2>
<p>After killing the regex engine, I profiled again to see what was left. Two things stood out.</p>
<p>First, the Viterbi inner loop. For each character in the sentence, for each of 4 HMM states, it looks up an emission probability in a <code>phf::Map</code>. The maps were keyed by <code>&amp;str</code> -- string slices. But every single key is one character. A CJK character is 3 bytes in UTF-8. So every lookup was hashing 3 bytes as a string, comparing byte-by-byte, etc. Switching to <code>phf::Map&lt;char, f64&gt;</code> means the lookup hashes a <code>u32</code> instead. Much cheaper.</p>
<p>This required a change in the proc macro that generates the HMM data at compile time:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// Before: keys are string slices</span>
</div><div class="line"><span class="k">pub</span><span class="w"> </span><span class="k">static</span><span class="w"> </span><span class="n">EMIT_PROB_0</span>: <span class="nc">phf</span>::<span class="n">Map</span><span class="o">&lt;&amp;&#39;</span><span class="nb">static</span><span class="w"> </span><span class="kt">str</span><span class="p">,</span><span class="w"> </span><span class="kt">f64</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">..</span><span class="p">.;</span><span class="w"></span>
</div><div class="line">
</div><div class="line"><span class="c1">// After: keys are chars</span>
</div><div class="line"><span class="k">pub</span><span class="w"> </span><span class="k">static</span><span class="w"> </span><span class="n">EMIT_PROB_0</span>: <span class="nc">phf</span>::<span class="n">Map</span><span class="o">&lt;</span><span class="kt">char</span><span class="p">,</span><span class="w"> </span><span class="kt">f64</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">..</span><span class="p">.;</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The second thing was <code>(freq as f64).ln()</code> in <code>calc()</code>. This gets called for every edge in the DAG, every time you segment a sentence. Word frequencies don't change after you load the dictionary, so there's no reason to recompute the logarithm each time. I added a <code>log_freq</code> field to <code>Record</code> and precompute it on construction:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="k">struct</span> <span class="nc">Record</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">freq</span>: <span class="kt">usize</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="n">log_freq</span>: <span class="kt">f64</span><span class="p">,</span><span class="w">   </span><span class="c1">// (freq as f64).ln(), computed once</span>
</div><div class="line"><span class="w">    </span><span class="n">tag</span>: <span class="nb">Box</span><span class="o">&lt;</span><span class="kt">str</span><span class="o">&gt;</span><span class="p">,</span><span class="w"></span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>Then <code>calc()</code> just reads <code>self.records[word_id].log_freq</code> instead of calling <code>ln()</code>. Same for the fallback frequency of 1: I precompute <code>log1 = 0.0f64 - logtotal</code> once outside the loop.</p>
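<p>The construction-time precompute can be sketched like this (the constructor shape is an assumption; the real code builds records while loading the dictionary):</p>

```rust
// Sketch: pay for ln() once per dictionary entry, not once per DAG edge.
struct Record {
    freq: usize,
    log_freq: f64, // (freq as f64).ln(), computed once
    tag: Box<str>,
}

impl Record {
    fn new(freq: usize, tag: Box<str>) -> Self {
        Record {
            freq,
            log_freq: (freq as f64).ln(),
            tag,
        }
    }
}
```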
<p>The Viterbi loop itself also got simpler. Instead of managing a <code>Peekable&lt;impl Iterator&gt;</code> over byte offsets to extract character substrings:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// Before</span>
</div><div class="line"><span class="kd">let</span><span class="w"> </span><span class="k">mut</span><span class="w"> </span><span class="n">curr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">char_offsets</span><span class="p">.</span><span class="n">iter</span><span class="p">().</span><span class="n">copied</span><span class="p">().</span><span class="n">peekable</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="kd">let</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">curr</span><span class="p">.</span><span class="n">next</span><span class="p">().</span><span class="n">unwrap</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="kd">let</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">*</span><span class="n">curr</span><span class="p">.</span><span class="n">peek</span><span class="p">().</span><span class="n">unwrap</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="c1">// ... manual peek/next dance to get each character&#39;s bytes</span>
</div></code></pre></div>
</div>
<p>It now collects <code>char_indices()</code> up front and uses a plain index loop:</p>
<div class="block-code" data-language="rust"><div class="highlight"><pre><span></span><code><div class="line"><span class="c1">// After</span>
</div><div class="line"><span class="kd">let</span><span class="w"> </span><span class="n">chars</span>: <span class="nb">Vec</span><span class="o">&lt;</span><span class="p">(</span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="p">)</span><span class="o">&gt;</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sentence</span><span class="p">.</span><span class="n">char_indices</span><span class="p">().</span><span class="n">collect</span><span class="p">();</span><span class="w"></span>
</div><div class="line"><span class="k">for</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">1</span><span class="o">..</span><span class="n">C</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">ch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">chars</span><span class="p">[</span><span class="n">t</span><span class="p">].</span><span class="mi">1</span><span class="p">;</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="kd">let</span><span class="w"> </span><span class="n">em_prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">params</span><span class="p">.</span><span class="n">emit_prob</span><span class="p">(</span><span class="o">*</span><span class="n">y</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="kt">usize</span><span class="p">,</span><span class="w"> </span><span class="n">ch</span><span class="p">);</span><span class="w"></span>
</div><div class="line"><span class="w">    </span><span class="c1">// ...</span>
</div><div class="line"><span class="p">}</span><span class="w"></span>
</div></code></pre></div>
</div>
<p>The vec allocation is amortized: <code>HmmContext</code> is reused across calls, so the buffer keeps its capacity and we rarely actually allocate.</p>
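<p>The reuse pattern behind that amortization looks roughly like this (field and method names here are placeholders, not <code>HmmContext</code>'s actual layout):</p>

```rust
// Sketch of buffer reuse: clear() keeps the allocation, so after the
// first call a same-sized sentence triggers no new allocation.
struct HmmContext {
    chars: Vec<(usize, char)>,
}

impl HmmContext {
    fn prepare(&mut self, sentence: &str) {
        self.chars.clear();                         // drop contents, keep capacity
        self.chars.extend(sentence.char_indices()); // reuses the old buffer
    }
}
```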
<p>These changes together gave 9-13% improvement. <code>with_hmm</code> dropped from 1.49 µs to 1.32 µs.</p>
<h2>Looking back</h2>
<p>If I had to pick the two things that mattered most, it'd be &quot;stop using HashMap for sequential integer keys&quot; and &quot;stop using regex for character class checks.&quot; Together those two account for most of the 2.4x speedup. The rest are small potatoes individually, but they compound.</p>
<p>The profiler told me where to look every time. Without it I would have guessed wrong about what was slow. I was surprised the regex engine was 29% of runtime -- the patterns look simple, and the regex crate is well-optimized. But &quot;well-optimized for the general case&quot; still loses to &quot;three lines of <code>matches!</code>&quot; when the general case is overkill.</p>
<p>I also keep learning the same lesson about allocation in Rust: <code>Vec::new()</code> inside a loop is a code smell in hot paths. Move it out, reuse it, resize it. The borrow checker will yell at you until you get the lifetimes right, but the allocator will thank you.</p>
<p>Thanks to <a href="https://github.com/wrj">Runji Wang</a> and <a href="https://github.com/aniaan">aniaan</a> for the initial PRs that started this whole thing off.</p>
<hr /><p><a rel="payment" href="https://afdian.com/a/messense">爱发电上赞助</a></p>]]></content:encoded></item></channel></rss>