[Optimization Retrospective] What the YouDub translation module did all day, which directions worked, and which have been basically falsified
The main battlefield today actually wasn’t TTS, but the translation module.
The reason is straightforward: the dubbing module can later be swapped to a solution closer to the target timbre, but the most obvious shortcoming in the finished video is still that the subtitle translation itself doesn’t feel like a mature, polished Bilibili deliverable—especially:
- In rolling-caption scenarios, the previous line will “steal” information from the next line.
- The original English captions are fragmentary; once you rigidly translate line by line, the Chinese becomes long and awkward.
- If you over-optimize for “completing sentences,” you end up leaking later punchlines, nouns, and actions early.
- If you over-optimize for “preserving fragment boundaries,” the Chinese becomes too choppy and no longer looks like real finished subtitles.
So in this round of optimization, I shifted the focus from “fixing one specific video” to “building a generalizable translation evaluation and iteration framework”.
What I did today (specifically)
1. Set up the benchmark system first, and stop tuning by gut feel
I didn’t keep staring at a single video and fixing it line by line; instead I directly used the 90+ Bilibili/YouTube paired samples in C:\\Users\\1\\bili_yt_export\\bili_youtube_first100.csv as the benchmark.
Accordingly I did two things:
- Extended
scripts/benchmark_translation.py - Added
scripts/analyze_translation_artifact.py
The former runs the full pipeline in batch—translation + sentence splitting + dubbing-script rewrite—and outputs each case’s metrics and intermediate artifacts.
The latter lets me inspect each case in isolation, especially these layers:
source_rowsprepared_source_rowstranslated_rows_pre_splitpredicted_rowsreference_rows
This step is critical, because many of the later issues are not caused by post-processing; the LLM has already borrowed content from later lines at the translated_rows_pre_split stage.
2. The core problem is now clear: rolling-caption borrowing from later lines
The biggest gain today wasn’t some metric suddenly spiking, but nailing the main problem:
YouTube auto captions/official subtitles contain a lot of rolling-caption structures; many lines are inherently half-sentences, fragments, or cross-line continuations.
If you simply ask the model to “translate each line naturally,” it will strongly tend to borrow information from the next one or two lines into the current line, making the Chinese read smoother but spoiling the timeline by revealing things early.
This is most obvious on the two hard cases zwIqbrD6JX4 and o2V-JJpJH_I.
3. Promote fragment_guard to the default mainline
Based on the findings above, I made fragment_guard enabled by default.
Its core idea isn’t “forcing Chinese to be broken,” but explicitly constraining the model:
- The current id may only express semantics that already appear in the current source line.
- If the source line is clearly an unfinished rolling-caption fragment, it’s better for the Chinese to stay slightly open-ended than to prefill future content early.
This is currently the only change that has been stably proven effective and that I’m willing to ship as the default on the mainline.
Results confirmed to be effective so far
Mainline configuration
The current stable mainline is roughly:
- provider:
openai_context - api base:
http://192.168.10.88:8317/v1 - model:
gpt-5.4-mini - prompt profile:
auto_hybrid - temperature:
0 - fewshot:
8 fragment_guard=on- All other experimental toggles are off by default
Gains that have been confirmed
fragment_guard shows positive gains from small experiments through mid-sized samples:
- 4-case comparison:
52.432 -> 53.322 - 8-case comparison:
55.958 -> 56.058
Current 8-case mainline report:
- composite:
56.058 - chrF:
0.3707 - char F1:
0.7729 - density MAE ratio:
0.4272
This indicates that, at least on the current mainline, it’s less likely than earlier versions to pull in later text early, and the overall pacing is closer to the distribution of finished Bilibili subtitles.
Which directions have been basically falsified today
1. Enabling fragment_hints globally
It’s not that it’s useless; on the contrary, it’s very strong on some cases.
For example:
zwIqbrD6JX4in hard2 goes from54.439up to57.980VT6rLcVKhzgalso improves noticeably
But the problem is it’s unstable.
When applied to the 8-case set, the overall score drops from 56.058 to 55.296.
In other words, it’s more like a “strong medicine for specific structures,” not a mainline strategy that can be enabled by default right now.
2. auto_hybrid_v2
I made a more aggressive profile auto-selection logic, hoping to automatically switch between literal_context / bilibili_dub / bilibili_pacing for different videos.
As a result, the 8-case score fell straight to 54.375, worse than the current mainline 56.058.
The conclusion is simple: the gating logic isn’t accurate enough, so it can’t go on the mainline yet.
3. Forcing a larger full-context translation scope
I tried two directions:
- Raise the full-context threshold so more videos get translated in one shot
- Enlarge the chunk from a very small window directly to a much bigger one
It looks closer to “understand the whole piece and then translate,” but in practice there’s no stable gain.
The reason is that with more context, the model also more easily borrows content across ids, and the timeline can actually get messier.
4. Making chunk granularity extremely fine
For example, ideas like chunk_max_items=2 feel intuitively like they would reduce line bleeding, but in practice the gains are poor and it also gets noticeably slower.
The hard2 results didn’t improve quality, but latency jumped a lot—especially o2V-JJpJH_I, which drags badly.
5. Cranking “Bilibili-style prompt” to the max
I tested:
literal_contextbilibili_dubbilibili_pacingauto_hybrid
On mixed4, auto_hybrid is best, literal_context is second, and the other two more “heavy-style” profiles are actually worse.
This suggests that right now it’s not “the more Bilibili the prompt, the better,” but rather: first solve context boundaries, fragment lines, and timing alignment, and only then talk about stylized expression.
The most important shift in understanding today
I used to think the biggest problem was “the sentences aren’t translated idiomatically enough,” but later I realized that’s not it.
The more fundamental issue is:
- The prior input is fragmentary to begin with
- The fragments are also highly overlapping
- For Chinese to read naturally, you must add some tone and structure
- But once you add too much, you leak future content early
So the hardest part of the translation module isn’t “Chinese→English” or “English→Chinese” per se, but:
Under the premise of not crossing time boundaries, turn fragmented English into Chinese pacing that looks like real finished subtitles.
This is not the same problem as ordinary machine translation.
Things that still aren’t solved
Although the mainline is steadier than before, it’s still far from the target I want—especially it still hasn’t reached the level of polish of the Bilibili example you gave.
The points that are still clearly not well-solved:
- Some hard cases still borrow from later lines
- On some cases, the Chinese still feels “translationese”
- Length matching after splitting still isn’t stable enough
fragment_hintsstill hasn’t found stable gating conditions- The few-shot count and sample selection still aren’t fully tuned to optimal
Next, the directions I think are most worth continuing
What’s most worth pursuing now isn’t adding more black-magic prompts, but these three things:
1. Feature-gate fragment_hints instead of a global toggle
We already know it can be strongly effective on some cases.
Next, we should gate it based on these features:
- fragmentary source ratio
- overlap ratio
- punctuated source ratio
- short/tiny line ratio
That is, enable it only on videos with “high fragmentation and high rolling-captioning,” rather than applying it across the board.
2. Keep validating the few-shot count
A small signal that appeared at the end today: fewshot=4 showed a small net gain on hard2 for the first time:
- baseline hard2:
50.454 - fewshot=4 hard2:
50.600
The gain is small, but the direction is positive.
If mixed4 and mid8 also hold, it would suggest the current 8 few-shots may actually be a bit too noisy.
3. Continue with chunk context that “provides context only, without translation permission”
I’ve already added a version of a pre/post context window around the chunk, but it’s still experimental.
The value of this direction is:
- Give the model the ability to understand the whole segment
- While still requiring it to output only the ids of the target chunk
This is theoretically better suited than simply enlarging chunk for “understand the whole piece but don’t cross boundaries” translation.
Conclusion for the day
If I had to sum it up in one sentence:
Today’s biggest achievement wasn’t “finishing” the translation module, but clarifying the full picture of “why this problem is hard, where the main bottleneck is right now, which directions work, and which directions are no longer worth burning time on.”
At least, it’s now clear that:
- The translation problem in this project is fundamentally not ordinary MT
- Rolling-caption boundaries are the main contradiction
fragment_guardis currently the only stable positive gainfragment_hintshas potential, but must be gated- Few-shot and context strategy are still worth digging into
If we want to truly polish this tool toward “the world’s best foreign-language video translation and dubbing,” then going forward the translation module can no longer rely on shoot-from-the-hip prompt tuning; we must keep following the path of benchmark-driven, case attribution, then small-step A/B.
Today, at least, I paved that road.