The figure illustrates the model workflow. On the left, the input scene shows a road plane with the ego vehicle (red) and surrounding vehicles (blue/green); the ego vehicle's historical trajectory, the dynamic influence range of the surrounding vehicles, and static road structure such as lane lines are also drawn. The system input is therefore multi-modal observation data: the ego vehicle's historical path, the motion states of neighboring vehicles, and the static road topology.

The second stage is the feature extraction module, which consists of two parallel branches. In the spatio-temporal interaction branch, the ego vehicle's historical coordinates (x_t, y_t) are encoded by a GRU to produce the temporal interaction feature f_e^t, while the surrounding vehicles and road structure are processed by a graph attention network (GAT) to produce the spatial interaction feature f_e^s; the two are concatenated, Concat(f_e^t, f_e^s), to yield the external interaction feature f_e. In the driver intention branch, the historical trajectory X is fed to a destination decoder that outputs the predicted destination D̂; an MLP then extracts the history feature f_his and the destination feature f_des, and a QKV attention fusion module combines them into the driver intention feature f_in.

The third stage is the spatio-temporal feature fusion module: f_in and f_e are concatenated into the mixed feature h_fus and passed to a Mamba fusion module, which uses the Mamba architecture to model long-range dependencies efficiently and outputs the final fused feature f_fus.

The fourth stage is the condition-guided noise generation module: f_fus is fed to two parallel decoders, the mean decoder μ̄_θ and the standard-deviation decoder σ̄_θ, which jointly define the guided noise distribution.
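To make the data flow concrete, the sketch below walks through the four stages in PyTorch-style code. The layer sizes, the neighbor-state dimension, and the class and variable names are illustrative assumptions; standard multi-head attention stands in for the GAT layer and a GRU stands in for the Mamba block, so this is a minimal sketch of the described pipeline rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedNoiseEncoder(nn.Module):
    """Sketch: interaction encoding, intention fusion, sequence fusion,
    and condition-guided noise parameters (all sizes are assumptions)."""
    def __init__(self, d=64, neighbor_dim=5):
        super().__init__()
        # Stage 2a: temporal interaction feature f_e^t from the ego history.
        self.temporal_gru = nn.GRU(input_size=2, hidden_size=d, batch_first=True)
        # Spatial interaction feature f_e^s (a GAT in the paper; plain
        # multi-head attention over neighbor states is used here as a stand-in).
        self.neighbor_proj = nn.Linear(neighbor_dim, d)
        self.spatial_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Stage 2b: driver intention branch.
        self.dest_decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))
        self.his_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.des_mlp = nn.Sequential(nn.Linear(2, d), nn.ReLU())
        self.intent_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        # Stage 3: spatio-temporal fusion (Mamba in the paper; a GRU stand-in here).
        self.fusion = nn.GRU(input_size=3 * d, hidden_size=d, batch_first=True)
        # Stage 4: condition-guided noise parameters (mean and std decoders).
        self.mean_dec = nn.Linear(d, 2)
        self.std_dec = nn.Linear(d, 2)

    def forward(self, ego_hist, neighbors):
        # ego_hist: (B, T, 2) past positions (x_t, y_t); neighbors: (B, N, neighbor_dim).
        _, h = self.temporal_gru(ego_hist)
        f_t = h[-1].unsqueeze(1)                          # f_e^t, shape (B, 1, d)
        nb = self.neighbor_proj(neighbors)
        f_s, _ = self.spatial_attn(f_t, nb, nb)           # f_e^s, shape (B, 1, d)
        f_e = torch.cat([f_t, f_s], dim=-1)               # Concat(f_e^t, f_e^s)

        dest_hat = self.dest_decoder(h[-1])               # predicted destination
        f_his = self.his_mlp(h[-1]).unsqueeze(1)          # history feature
        f_des = self.des_mlp(dest_hat).unsqueeze(1)       # destination feature
        f_in, _ = self.intent_attn(f_his, f_des, f_des)   # QKV attention fusion

        h_fus = torch.cat([f_e, f_in], dim=-1)            # mixed feature h_fus
        _, f_fus = self.fusion(h_fus)                     # fused feature f_fus

        mu = self.mean_dec(f_fus[-1])                     # guided mean
        sigma = F.softplus(self.std_dec(f_fus[-1]))       # guided std, kept positive
        return mu, sigma

# Example: a batch of 8 ego histories (20 steps) with 6 neighbors each.
model = GuidedNoiseEncoder()
mu, sigma = model(torch.randn(8, 20, 2), torch.randn(8, 6, 5))

Under these assumptions, the returned mu and sigma play the roles of the mean decoder μ̄_θ and standard-deviation decoder σ̄_θ outputs that parameterize the guided noise distribution described above.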
With the rapid development of smart city emergency rescue te...