Dynamic Neural Koopman Distillation

Abstract

Diffusion models excel at generating diverse and multimodal trajectories for robotic planning, yet their iterative denoising process introduces latency that is incompatible with high-frequency closed-loop control.

To address this problem, we propose Dynamic Neural Koopman Distillation, a framework that distills multistep diffusion inference into a single forward pass while retaining the multimodal expressivity of the teacher model.

Specifically, we introduce a Factorized Dynamic Koopman layer that models the denoising process through a factorized latent transition with state-dependent modal gains.

We evaluate the proposed method on standard D4RL MuJoCo locomotion benchmarks and a physical Kinova manipulator, comparing against one-step baselines.

The results show that our method significantly outperforms existing one-step distillation approaches on the reported locomotion tasks, and reduces the inference latency to the millisecond regime compared with the teacher policy. Hardware experiments further demonstrate that our method enables smooth and fast closed-loop execution while maintaining task success and comparable accuracy.

Problem Setting

We consider trajectory generation for high-frequency closed-loop robot control. Let \(\mathbf{s}_k \in \mathcal{S} \subseteq \mathbb{R}^{m}\) and \(\mathbf{a}_k \in \mathcal{A} \subseteq \mathbb{R}^{n}\) denote the robot state and action at discrete time \(k\). Over a prediction horizon \(H\), a future state–action trajectory is \[ \boldsymbol{\tau} = (\mathbf{s}_1,\mathbf{a}_1,\dots,\mathbf{s}_H,\mathbf{a}_H). \] At each control step, the planner models a conditional trajectory distribution \(\pi(\boldsymbol{\tau}\mid \mathbf{c})\), where \(\mathbf{c}\) summarizes the information available for planning (e.g., the current observation, short history, and task specification).

Let \(T_{\mathrm{inf}}\) denote the time required to produce usable trajectory samples at inference, and let \(T_{\mathrm{ctrl}}\) denote the control period. For closed-loop deployment, actions must be available within the control period, i.e., \(T_{\mathrm{inf}} \le T_{\mathrm{ctrl}}\). Our goal is a one-step student that cuts \(T_{\mathrm{inf}}\) versus a multistep diffusion teacher while tracking the teacher’s conditional trajectory distribution. MuJoCo results below report return, latency, \(\sigma_{\mathrm{ep}}\), and worst-case return.

Overall Framework

Simulation Environments and Results

halfcheetah-medium-expert-v2 and walker2d-medium-expert-v2 (D4RL MuJoCo; 17-dim observations, 6-dim actions). Teacher training and inference follow the public CleanDiffuser recipe: trajectory diffusion plus a return classifier for guided sampling and candidate ranking, with 20 denoising steps at rollout time.

Simulation Setup

Seeds: 5 random seeds (0–4) for main reported results.
Evaluation: 50 parallel environments, 10 episodes per seed, 1000-step episode cap; per-task planning horizon \(H=4\) (HalfCheetah) or \(H=32\) (Walker2d); geometry-based selection over \(N_{\mathrm{cand}}=64\) trajectories.
Return: D4RL normalized score, reported as mean ± std across seeds.
Latency: per-control-step decision time; we report mean, p95, and std (ms).
Runtime stack: Linux, PyTorch GPU inference, MuJoCo/D4RL simulation.

Main Quantitative Results

Mean ± std over 5 seeds. \(\sigma_{\mathrm{ep}}\) = within-run episode variability; worst-case return = minimum seed-level episode mean. Steps = diffusion (teacher) or decision (student) steps at inference.

Env	Method	Return ↑	Steps	Latency (ms) ↓	\(\sigma_{\mathrm{ep}}\) ↓	Worst-case Return ↑
Walker2d	Diffusion Teacher	1.0295 ± 0.0051	20	301.22 ± 2.41	0.0193	0.9759
	CD	0.6144 ± 0.0176	1	1.09 ± 2.33	0.0577	0.4808
	CT	0.4404 ± 0.0096	1	1.09 ± 2.37	0.0659	0.2591
	KDM	0.6057 ± 0.0156	1	0.35 ± 0.04	0.0333	0.5925
	KDM-F	0.3386 ± 0.0070	1	0.87 ± 0.03	0.0380	0.3267
	Ours	1.0973 ± 0.0002	1	1.75 ± 0.36	0.0004	1.0963
	Ours (classifier selector)	1.0285 ± 0.0022	1	4.57 ± 0.87	0.0242	0.9677
HalfCheetah	Diffusion Teacher	0.8518 ± 0.0032	20	116.51 ± 2.39	0.0114	0.8229
	CD	0.4027 ± 0.0109	1	1.08 ± 2.32	0.0277	0.3217
	CT	0.4089 ± 0.0047	1	1.08 ± 2.31	0.0198	0.3663
	KDM	0.6618 ± 0.0045	1	0.34 ± 0.03	0.0349	0.6571
	KDM-F	0.4852 ± 0.0098	1	0.86 ± 0.03	0.0216	0.4659
	Ours	0.8885 ± 0.0042	1	0.82 ± 0.47	0.0134	0.8488
	Ours (classifier selector)	0.8599 ± 0.0010	1	1.83 ± 0.80	0.0081	0.8417

CD / CT: consistency-style one-step baselines. KDM / KDM-F: static Koopman distillation (factorized variant). Ours: proposed dynamic student.

Latency, Pareto curves, and PCA

Walker2d latency-performance Pareto figure. — Walker2d: latency vs return (Pareto-style view).

HalfCheetah latency-performance Pareto figure. — HalfCheetah: latency vs return.

Walker2d PCA teacher vs Ours. — Walker2d: PCA of teacher vs Ours.

HalfCheetah PCA teacher vs Ours. — HalfCheetah: PCA of teacher vs Ours.

Rollout videos

Teacher, Ours, KDM, KDM-F.

Walker2d-medium-expert-v2

Diffusion Teacher (20-step).

Ours (one-step).

KDM baseline.

KDM-F baseline.

HalfCheetah-medium-expert-v2

Diffusion Teacher (20-step).

Ours (one-step).

KDM baseline.

KDM-F baseline.

Latency-scaled replay

Playback speeds are scaled by measured mean decision latency so the teacher’s slower inference is visible next to the student.

Walker2d

Diffusion Teacher (latency-aware replay).

Ours (latency-aware replay).

HalfCheetah

Diffusion Teacher (latency-aware replay).

Ours (latency-aware replay).

Hardware evaluation (Kinova Gen3)

Fifty independent real-world trials per method (diffusion teacher vs Ours), with videos and deployment metrics.

Task and protocol

Hardware evaluation uses a Kinova Gen3 manipulator in a receding-horizon point-to-point obstacle-avoidance task. At each control step, the policy predicts over a horizon \(H=32\), samples \(N_{\mathrm{cand}}=64\) candidate trajectories, ranks them with a geometry-based score, applies the first action, and replans. The diffusion teacher and our one-step student share the same control interface, scene, and goal specification so latency and task performance are compared under matched conditions.

Aggregate metrics

Method	Task completion ↑	Task success ↑	Completion time (s) ↓	Executed steps	Inference mean (ms) ↓	Inference p95 (ms) ↓	Final goal error ↓	Min obstacle clearance (m) ↑	Measured control Hz ↑	Overrun ratio ↓
Diffusion Teacher	1.0000 ± 0.0000	100% (50/50)	314.95 ± 3.54	25195.62 ± 283.24	151.00 ± 3.73	158.27 ± 3.78	0.003093 ± 0.001421	0.046289 ± 0.012908	80.005 ± 0.025	0.5013 ± 0.0251
Ours	1.0000 ± 0.0000	100% (50/50)	85.43 ± 2.53	6834.24 ± 202.67	4.08 ± 0.20	4.73 ± 0.23	0.003458 ± 0.001745	0.048953 ± 0.012117	79.999 ± 0.019	0.5020 ± 0.0245

Mean ± std, 50 trials. Success = count/rate.

Videos

Kinova Gen3 hardware evaluation at 2× speed; paired clips support synchronized play and reset.

Diffusion Teacher (simulation).

Ours (simulation).

2x Speed

Diffusion Teacher (real-world side view, 2x replay).

2x Speed

Ours (real-world side view, 2x replay).

Front view (Kinova Gen3)

Second viewpoint for the same receding-horizon point-to-point obstacle-avoidance runs.

2x Speed

Diffusion Teacher (real-world front view, 2x replay).

2x Speed

Ours (real-world front view, 2x replay).

Still-frame comparison (Kinova Gen3)

Still frames from Kinova Gen3 hardware evaluation: diffusion teacher vs Ours. — Still-frame timeline from the hardware evaluation (teacher vs Ours; teacher final frame *t=318s*).

Deployment metrics (50 trials)

Aggregates from the Kinova Gen3 hardware evaluation (same task and protocol as above).

Task success rate across 50 real-world trials for teacher and Ours. — Task success rate across 50 real-world trials.

Task completion time mean and standard deviation across real-world trials. — Task completion time (mean ± std, 50 real-world trials).

Inference latency per control step showing mean and p95 for both methods. — Inference latency per control step (mean and p95 across trials).

Minimum obstacle-surface clearance mean and standard deviation across real-world trials. — Minimum obstacle-surface clearance (mean ± std safety margin).

Across the 50 trials summarized above, both methods complete the point-to-point obstacle-avoidance task successfully; Ours matches similar terminal accuracy and clearance while reducing mean per-step inference to the few-millisecond range versus the teacher (see table and plots).