Dynamic Neural Koopman Distillation

Real-Time Robot Control Using Diffusion Models

Anonymous Author(s)

Affiliation withheld for double-blind review

Abstract

Diffusion models excel at generating diverse and multimodal trajectories for robotic planning, yet their iterative denoising process introduces latency that is incompatible with high-frequency closed-loop control.

To address this problem, we propose Dynamic Neural Koopman Distillation, a framework that distills multistep diffusion inference into a single forward pass while retaining the multimodal expressivity of the teacher model.

Specifically, we introduce a Factorized Dynamic Koopman layer that models the denoising process through a factorized latent transition with state-dependent modal gains.

We evaluate the proposed method on standard D4RL MuJoCo locomotion benchmarks and a physical Kinova manipulator, comparing against one-step baselines.

The results show that our method significantly outperforms existing one-step distillation approaches on the reported locomotion tasks and reduces inference latency from hundreds of milliseconds for the teacher policy to the millisecond regime. Hardware experiments further demonstrate that our method enables smooth and fast closed-loop execution while maintaining task success and comparable accuracy.

Problem Setting

We consider trajectory generation for high-frequency closed-loop robot control. Let \(\mathbf{s}_k \in \mathcal{S} \subseteq \mathbb{R}^{m}\) and \(\mathbf{a}_k \in \mathcal{A} \subseteq \mathbb{R}^{n}\) denote the robot state and action at discrete time \(k\). Over a prediction horizon \(H\), a future state–action trajectory is \[ \boldsymbol{\tau} = (\mathbf{s}_1,\mathbf{a}_1,\dots,\mathbf{s}_H,\mathbf{a}_H). \] At each control step, the planner models a conditional trajectory distribution \(\pi(\boldsymbol{\tau}\mid \mathbf{c})\), where \(\mathbf{c}\) summarizes the information available for planning (e.g., the current observation, short history, and task specification).

Let \(T_{\mathrm{inf}}\) denote the time required to produce usable trajectory samples at inference, and let \(T_{\mathrm{ctrl}}\) denote the control period. For closed-loop deployment, actions must be available within the control period, i.e., \(T_{\mathrm{inf}} \le T_{\mathrm{ctrl}}\). Our goal is a one-step student that reduces \(T_{\mathrm{inf}}\) relative to a multistep diffusion teacher while closely matching the teacher’s conditional trajectory distribution. The MuJoCo results below report return, latency, within-run episode variability \(\sigma_{\mathrm{ep}}\), and worst-case return.
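The deployment condition \(T_{\mathrm{inf}} \le T_{\mathrm{ctrl}}\) can be checked empirically. The following is a minimal sketch, not the authors' benchmarking code: the function names, the `policy(context)` callable, and the choice to compare the mean latency (rather than a tail percentile) are all assumptions made for illustration.

```python
import time
from statistics import mean


def control_period_ms(ctrl_hz: float) -> float:
    """Control period T_ctrl in milliseconds for a loop running at ctrl_hz."""
    return 1000.0 / ctrl_hz


def measure_latency_ms(policy, context, n_calls: int = 100) -> list[float]:
    """Wall-clock duration of each policy(context) call, in milliseconds.

    `policy` is a hypothetical callable standing in for teacher or student.
    """
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        policy(context)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def meets_deadline(latencies_ms: list[float], ctrl_hz: float) -> bool:
    """True if the mean measured T_inf satisfies T_inf <= T_ctrl.

    Assumption: the mean is used here; a stricter check could use a
    tail percentile of the latency samples instead.
    """
    return mean(latencies_ms) <= control_period_ms(ctrl_hz)


# Hypothetical latency samples against a hypothetical 100 Hz control loop.
fast_ok = meets_deadline([4.0, 4.2], ctrl_hz=100.0)
slow_ok = meets_deadline([150.0, 152.0], ctrl_hz=100.0)
```

With these illustrative numbers the few-millisecond samples fit inside the 10 ms period, while the 150 ms samples do not.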

Overall Framework

Figure: High-level view of the overall framework: diffusion teacher trajectories are distilled into a fast one-step Koopman student.

Simulation Environments and Results

Simulation Setup

We evaluate on halfcheetah-medium-expert-v2 and walker2d-medium-expert-v2 (D4RL MuJoCo; 17-dimensional observations, 6-dimensional actions). Teacher training and inference follow the public CleanDiffuser recipe: trajectory diffusion plus a return classifier for guided sampling and candidate ranking, with 20 denoising steps at rollout time.

Main Quantitative Results

Mean ± std over 5 seeds. \(\sigma_{\mathrm{ep}}\) = within-run episode variability; worst-case return = minimum seed-level episode mean. Steps = diffusion (teacher) or decision (student) steps at inference.

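| Env | Method | Return ↑ | Steps | Latency (ms) ↓ | \(\sigma_{\mathrm{ep}}\) | Worst-case Return ↑ |
|---|---|---|---|---|---|---|
| Walker2d | Diffusion Teacher | 1.0295 ± 0.0051 | 20 | 301.22 ± 2.41 | 0.0193 | 0.9759 |
| Walker2d | CD | 0.6144 ± 0.0176 | 1 | 1.09 ± 2.33 | 0.0577 | 0.4808 |
| Walker2d | CT | 0.4404 ± 0.0096 | 1 | 1.09 ± 2.37 | 0.0659 | 0.2591 |
| Walker2d | KDM | 0.6057 ± 0.0156 | 1 | 0.35 ± 0.04 | 0.0333 | 0.5925 |
| Walker2d | KDM-F | 0.3386 ± 0.0070 | 1 | 0.87 ± 0.03 | 0.0380 | 0.3267 |
| Walker2d | Ours | 1.0973 ± 0.0002 | 1 | 1.75 ± 0.36 | 0.0004 | 1.0963 |
| Walker2d | Ours (classifier selector) | 1.0285 ± 0.0022 | 1 | 4.57 ± 0.87 | 0.0242 | 0.9677 |
| HalfCheetah | Diffusion Teacher | 0.8518 ± 0.0032 | 20 | 116.51 ± 2.39 | 0.0114 | 0.8229 |
| HalfCheetah | CD | 0.4027 ± 0.0109 | 1 | 1.08 ± 2.32 | 0.0277 | 0.3217 |
| HalfCheetah | CT | 0.4089 ± 0.0047 | 1 | 1.08 ± 2.31 | 0.0198 | 0.3663 |
| HalfCheetah | KDM | 0.6618 ± 0.0045 | 1 | 0.34 ± 0.03 | 0.0349 | 0.6571 |
| HalfCheetah | KDM-F | 0.4852 ± 0.0098 | 1 | 0.86 ± 0.03 | 0.0216 | 0.4659 |
| HalfCheetah | Ours | 0.8885 ± 0.0042 | 1 | 0.82 ± 0.47 | 0.0134 | 0.8488 |
| HalfCheetah | Ours (classifier selector) | 0.8599 ± 0.0010 | 1 | 1.83 ± 0.80 | 0.0081 | 0.8417 |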

CD / CT: consistency-style one-step baselines. KDM / KDM-F: static Koopman distillation (factorized variant). Ours: proposed dynamic student.
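The summary statistics defined in the table caption can be computed from per-seed, per-episode returns. The sketch below is one possible reading and not the authors' evaluation script: it assumes that the reported mean ± std is taken over seed-level means, that \(\sigma_{\mathrm{ep}}\) is the across-episode standard deviation within each run averaged over seeds, and that population (not sample) standard deviations are used. The function name and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def summarize_returns(returns_by_seed: list[list[float]]) -> dict[str, float]:
    """Summarize normalized returns given one list of episode returns per seed."""
    # Seed-level episode mean: one number per training/evaluation seed.
    seed_means = [mean(episodes) for episodes in returns_by_seed]
    return {
        # Assumed: reported "Return" is mean and std over seed-level means.
        "return_mean": mean(seed_means),
        "return_std": pstdev(seed_means),
        # Assumed: sigma_ep is within-run episode std, averaged over seeds.
        "sigma_ep": mean(pstdev(episodes) for episodes in returns_by_seed),
        # Worst-case return: minimum seed-level episode mean (as in the caption).
        "worst_case": min(seed_means),
    }


# Toy data: 3 seeds with 2 episodes each (illustrative values only).
toy = [[1.0, 1.2], [0.8, 1.0], [1.1, 1.1]]
summary = summarize_returns(toy)
```

Under these assumptions a method can have a small seed-level std and still a large \(\sigma_{\mathrm{ep}}\), which is why the table reports both.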

Latency, Pareto curves, and PCA

Figure: Walker2d: latency vs return (Pareto-style view).
Figure: HalfCheetah: latency vs return.
Figure: Walker2d: PCA of teacher vs Ours.
Figure: HalfCheetah: PCA of teacher vs Ours.

Rollout videos

Teacher, Ours, KDM, KDM-F.

Walker2d-medium-expert-v2

Diffusion Teacher (20-step).
Ours (one-step).
KDM baseline.
KDM-F baseline.

HalfCheetah-medium-expert-v2

Diffusion Teacher (20-step).
Ours (one-step).
KDM baseline.
KDM-F baseline.

Latency-scaled replay

Playback speeds are scaled by measured mean decision latency so the teacher’s slower inference is visible next to the student.
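The exact scaling rule is not stated here. One plausible reading, sketched below under stated assumptions, is that each decision advances the simulation by a fixed step and additionally waits for one inference call, so the replay runs slower than real time by the ratio of the two. The function name and the simulation step `sim_dt_s = 0.01` are hypothetical; the latencies are the Walker2d means from the results table.

```python
def playback_speed(sim_dt_s: float, latency_ms: float) -> float:
    """Replay speed relative to real time (1.0 means real time).

    Assumption: every decision covers sim_dt_s seconds of simulated time
    and costs latency_ms of extra wall-clock time for inference.
    """
    latency_s = latency_ms / 1000.0
    return sim_dt_s / (sim_dt_s + latency_s)


SIM_DT_S = 0.01  # hypothetical simulated time per decision, in seconds

teacher_speed = playback_speed(SIM_DT_S, 301.22)  # Walker2d teacher mean latency
student_speed = playback_speed(SIM_DT_S, 1.75)    # Walker2d Ours mean latency
```

Under this reading the teacher replay is slowed far more than the student replay, which matches the qualitative intent of the latency-aware videos.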

Walker2d

Diffusion Teacher (latency-aware replay).
Ours (latency-aware replay).

HalfCheetah

Diffusion Teacher (latency-aware replay).
Ours (latency-aware replay).

Hardware evaluation (Kinova Gen3)

Fifty independent real-world trials per method (diffusion teacher vs Ours), with videos and deployment metrics.

Task and protocol

Hardware evaluation uses a Kinova Gen3 manipulator in a receding-horizon point-to-point obstacle-avoidance task. At each control step, the policy predicts over a horizon \(H=32\), samples \(N_{\mathrm{cand}}=64\) candidate trajectories, ranks them with a geometry-based score, applies the first action, and replans. The diffusion teacher and our one-step student share the same control interface, scene, and goal specification so latency and task performance are compared under matched conditions.
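The receding-horizon protocol above (sample candidates, rank, execute the first action, replan) can be sketched on a toy 2D point robot. Everything except \(H=32\) and \(N_{\mathrm{cand}}=64\) is an assumption: the noisy straight-line sampler stands in for the learned policy, and the score is a hypothetical geometry-based ranking, not the one used on the Kinova Gen3.

```python
import math
import random

H = 32       # prediction horizon (from the text)
N_CAND = 64  # candidate trajectories per control step (from the text)


def sample_candidates(pos, goal, rng, step=0.05):
    """Hypothetical policy stand-in: noisy straight-line action sequences."""
    dist = math.dist(pos, goal)
    cands = []
    for _ in range(N_CAND):
        heading = math.atan2(goal[1] - pos[1], goal[0] - pos[0]) + rng.gauss(0.0, 0.6)
        mag = min(step, dist)  # do not overshoot the goal on the first action
        cands.append([(mag * math.cos(heading), mag * math.sin(heading))] * H)
    return cands


def rollout(pos, actions):
    """Integrate displacement actions into a list of future positions."""
    x, y = pos
    traj = []
    for dx, dy in actions:
        x, y = x + dx, y + dy
        traj.append((x, y))
    return traj


def geometry_score(traj, goal, obstacle, radius):
    """Hypothetical score (lower is better): final goal distance plus a
    large penalty if any predicted position enters the obstacle."""
    clearance = min(math.dist(p, obstacle) for p in traj) - radius
    penalty = 1e3 if clearance < 0.0 else 0.0
    return math.dist(traj[-1], goal) + penalty


def control_step(pos, goal, obstacle, radius, rng):
    """Sample, rank, and return the position after the first action only."""
    cands = sample_candidates(pos, goal, rng)
    best = min(cands, key=lambda a: geometry_score(rollout(pos, a), goal, obstacle, radius))
    dx, dy = best[0]
    return (pos[0] + dx, pos[1] + dy)


def run(start, goal, obstacle, radius, max_steps=400, tol=0.05, seed=0):
    """Replan at every step until the goal is reached or max_steps elapse."""
    rng = random.Random(seed)
    pos, path = start, [start]
    for _ in range(max_steps):
        if math.dist(pos, goal) <= tol:
            break
        pos = control_step(pos, goal, obstacle, radius, rng)
        path.append(pos)
    return path


path = run(start=(0.0, 0.0), goal=(2.0, 0.0), obstacle=(1.0, 0.0), radius=0.3)
```

Because only the first action of the best candidate is executed, the cost of one sampling-and-ranking pass is paid at every control step, which is where the per-step inference latency of teacher and student enters.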

Aggregate metrics

| Method | Task completion ↑ | Task success ↑ | Completion time (s) ↓ | Executed steps | Inference mean (ms) ↓ | Inference p95 (ms) ↓ | Final goal error ↓ | Min obstacle clearance (m) ↑ | Measured control Hz ↑ | Overrun ratio ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Diffusion Teacher | 1.0000 ± 0.0000 | 100% (50/50) | 314.95 ± 3.54 | 25195.62 ± 283.24 | 151.00 ± 3.73 | 158.27 ± 3.78 | 0.003093 ± 0.001421 | 0.046289 ± 0.012908 | 80.005 ± 0.025 | 0.5013 ± 0.0251 |
| Ours | 1.0000 ± 0.0000 | 100% (50/50) | 85.43 ± 2.53 | 6834.24 ± 202.67 | 4.08 ± 0.20 | 4.73 ± 0.23 | 0.003458 ± 0.001745 | 0.048953 ± 0.012117 | 79.999 ± 0.019 | 0.5020 ± 0.0245 |

Mean ± std, 50 trials. Success = count/rate.

Videos

Kinova Gen3 hardware evaluation at 2× speed; paired clips support synchronized play and reset.

Diffusion Teacher (simulation).
Ours (simulation).
Diffusion Teacher (real-world side view, 2x replay).
Ours (real-world side view, 2x replay).

Front view (Kinova Gen3)

Second viewpoint for the same receding-horizon point-to-point obstacle-avoidance runs.

Diffusion Teacher (real-world front view, 2x replay).
Ours (real-world front view, 2x replay).

Still-frame comparison (Kinova Gen3)

Figure: Still-frame timeline from the Kinova Gen3 hardware evaluation (diffusion teacher vs Ours; teacher final frame t=318s).

Deployment metrics (50 trials)

Aggregates from the Kinova Gen3 hardware evaluation (same task and protocol as above).

Figure: Task success rate across 50 real-world trials (teacher and Ours).
Figure: Task completion time (mean ± std, 50 real-world trials).
Figure: Inference latency per control step (mean and p95 across trials, both methods).
Figure: Minimum obstacle-surface clearance (mean ± std safety margin, real-world trials).

Across the 50 trials summarized above, both methods complete the point-to-point obstacle-avoidance task successfully. Ours achieves terminal accuracy and obstacle clearance similar to the teacher while reducing mean per-step inference from about 151 ms to about 4 ms (see table and plots).
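The size of these gains follows directly from the aggregate table. The snippet below is a small worked calculation using the reported means only; the dictionary layout and function name are illustrative, and the ratios ignore the reported standard deviations.

```python
# (teacher mean, Ours mean) copied from the hardware aggregate table.
hardware_means = {
    "inference_mean_ms": (151.00, 4.08),
    "inference_p95_ms": (158.27, 4.73),
    "completion_time_s": (314.95, 85.43),
}


def teacher_over_student(table: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Ratio teacher / Ours for each metric (values above 1 favor Ours)."""
    return {name: teacher / ours for name, (teacher, ours) in table.items()}


ratios = teacher_over_student(hardware_means)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

On these means, per-step inference is roughly 37 times faster on average and roughly 33 times faster at the 95th percentile, while task completion time drops by a factor of about 3.7.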