Bridging distributional discrepancy with temporal dynamics for video understanding