A schematic diagram in 16:9 format with a white background, styled after common SCI paper schematics. The information flow is from left to right, divided into three horizontal functional bands: the semantic layer, the middle representation layer, and the acoustic layer. The semantic layer and the T2U module are located at the center of the diagram, with their size and visual weight significantly higher than other modules, emphasizing their importance in the model.

────────────────────────
Semantic Layer (Top, Core Component)

Core Module: "Chinese BERT Encoder + Adapter + LoRA"

Internal Text Description (Academic, Concise):
Input: Chinese text sequence X
Output: Context-aware semantic representation H′
Parameter-efficient fine-tuning using Adapter and LoRA.

Visual Design Highlights:
The module has a slightly thicker border and larger area, emphasizing its role as the core of semantic modeling.
The module focuses solely on abstract semantic representations, without involving phonemes, pronunciation, or target language text.

Connections (Key):
A single solid arrow originates from this module, pointing directly to the T2U module below, labeled as: "Semantic Representation → Discrete Speech Unit Prediction Space"
Representing the primary information flow during inference.
────────────────────────
T2U Module (Central Hub between Semantic and Acoustic Layers, Visual Center)

Module Name: "T2U: Text-to-Unit & Duration Mapping"

Module Positioning Description (Small text or annotation within the module):
"Intermediate interface connecting semantic space and speech space"

Functional Description (Not divided into sub-modules, expressed in text):
Input: Semantic representation H′
Output 1: Discrete speech unit sequence Ũ
Output 2: Speech unit duration sequence D̂

Modeling Meaning (SCI Style):
The T2U module learns a stable mapping from the Chinese semantic space to a language-agnostic discrete speech unit space, without relying on target language text or manual phonetic rules during inference.

Connections (Focus on answering "What is the intermediate connection?"):
1) Downward Solid Arrow → Acoustic Layer (indicating that the predicted units and durations are used during inference)
2) Gray Dashed Arrow from the Middle Representation Layer → T2U (indicating the supervision signal during training)

────────────────────────
Middle Representation Layer (Middle, Auxiliary but Critical)

Module Chain (Horizontal Arrangement):
"Raw Speech → HuBERT (Self-Supervised Speech Encoding) → k-means Clustering → Discrete Speech Units + Time Alignment"

Functional Positioning Description (Annotation):
This layer is only used during training.
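
For readers of the figure, the last two steps of the module chain above (k-means clustering, then discrete units with time alignment) can be illustrated with a minimal sketch. It rests on assumptions that the description does not state: frame-level features are assigned to their nearest centroid, and consecutive repeated unit IDs are collapsed so that each run length serves as the unit's duration. The 2-D "features", the centroids, and the function names `nearest_centroid` and `frames_to_units` are hypothetical toy stand-ins, not real HuBERT outputs or a trained codebook.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))


def frames_to_units(frames, centroids):
    """Quantize frames to unit IDs, then collapse repeats into (units, durations).

    Assumption: a unit's duration is the number of consecutive frames
    assigned to the same cluster.
    """
    frame_ids = [nearest_centroid(frame, centroids) for frame in frames]
    units, durations = [], []
    for unit, run in groupby(frame_ids):
        units.append(unit)
        durations.append(len(list(run)))
    return units, durations


# Toy data: three hypothetical centroids and six 2-D frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [
    [0.1, 0.0], [0.0, 0.2],              # closest to centroid 0
    [0.9, 1.1], [1.0, 0.8], [1.1, 1.0],  # closest to centroid 1
    [0.1, 1.9],                          # closest to centroid 2
]

units, durations = frames_to_units(frames, centroids)
print(units, durations)
```

Under these assumptions, the unit sequence and the run lengths play the role of the training targets for Ũ and D̂ that the gray dashed arrow delivers to the T2U module.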
Based on the research framework of the National Natural Scie...