Policy runner
bar_policy.remote_policy_runner is the Python rclpy node that drives
bar::RemotePolicyController from an ONNX policy + (optionally) a LeRobot
reference dataset. This page documents the launch surface, the ONNX
metadata schema the runner reads, and the observation-term registry that
maps names from the ONNX to ObservationTerm instances.
Why "self-describing ONNX"
The training pipeline writes 13 fields into the ONNX file's
custom_metadata_map at export time — joint order, per-joint PD gains,
default pose, action scale, observation term names, body names, the
LeRobot dataset id, the policy tick period. The runner parses those
fields into a typed PolicyMetadata object and never relies on YAML to
restate them.
This matters because any mismatch between the training-time graph and the deployment-time YAML breaks the policy silently (wrong joint order or off-by-one observation slice will still produce well-shaped tensors that just behave terribly). Embedding the metadata in the ONNX file means training and deployment cannot drift independently.
Composition
| Module | Responsibility |
|---|---|
OnnxPolicyRunner | onnxruntime.InferenceSession wrapper. Single-step inference, shape contract float32 (1, N) -> float32 (1, M). |
PolicyMetadata | Frozen dataclass decoded from custom_metadata_map. Helpers like joint_stiffness_map(), default_joint_pos_map(), action_scale_map() return dict[str, float] keyed by joint name. |
term_builders.build_observation_terms | Resolves each name in observation_names to an ObservationTerm via three-stage lookup: built-in proprioception → reference-provider term → constant fallback. Unknown names raise. |
ObservationManager | Concatenates configured terms into the policy input vector each tick. Owns the reference-provider lifecycle (reset on activation, step after each record_action). |
TrackingReferenceProvider | Loads a LeRobot dataset (local path or HF repo id) and serves per-frame body-pos / body-ori / joint references on the manager's cadence. Mirrors the training-side pianist_tracking.loader.teleop_sequence layout exactly. Activated when task_type='tracking'. |
_build_reference_provider hook | RemotePolicyRunner extracts the provider-construction step as an overridable method. Base class handles tracking; downstream task packages (e.g. pianist_ros2's pianist_policy.PianoPolicyRunner) subclass and override only this hook, inheriting all ONNX loading / observation packing / action decoding / MITCommand publishing. See Task packages below. |
PolicyActionDecoder | target[name] = default[name] + action[name] * scale[name]. |
ActionMapper | Assembles a full-articulation MITCommand from any {joint_name: target_position} map. Undriven joints (those not in action_joint_names) are pinned to position=0 with the same K/D the policy ships, so the controller never holds an arbitrary pose. |
ONNX metadata schema
The 13 required keys (parsed from session.get_modelmeta().custom_metadata_map):
| Key | Type | Meaning |
|---|---|---|
task_type | "tracking" | "piano" | "locomotion" | Selects the reference provider class. |
joint_names | tuple[str, ...] | Full articulation list (17 for Lite). Indexes joint_stiffness, joint_damping, default_joint_pos. |
action_joint_names | tuple[str, ...] | The subset the policy actually emits actions for. Joints in joint_names but not here are "undriven" and get pinned to zero. |
joint_stiffness | tuple[float, ...] | Per-joint K_p, aligned to joint_names. |
joint_damping | tuple[float, ...] | Per-joint K_d, aligned to joint_names. |
default_joint_pos | tuple[float, ...] | Per-joint default position (the q_default in out = (q - q_default) * scale). |
observation_names | tuple[str, ...] | The flat observation vector, term by term. See Observation term registry. |
command_names | tuple[str, ...] | Future-use; runtime command terms (e.g. velocity targets) that aren't part of the proprio observation. |
action_scale | float | tuple[float, ...] | Per-action-joint scale. Scalar broadcasts across all joints. |
policy_dt | float | Policy tick period in seconds. The runner uses this for its create_timer unless policy_dt_override is set. |
body_names | tuple[str, ...] | Reference-tracked bodies (for motion_body_pos_b / motion_body_ori_b terms). |
dataset_repo_id | str | Default HF repo id for the reference dataset; overridable at launch. |
lookahead_steps | tuple[int, ...] | Piano-only: frame offsets the policy was trained to look ahead at. |
Observation term registry
observation_names is a CSV like
motion_body_pos_b, motion_body_ori_b, joint_pos, joint_vel, actions
(tracking task) or
target_keys, progress, key_pressed, joint_pos, joint_vel, actions
(piano task). Names match upstream pianist-tracking-mj actor terms
directly. Each name maps to one of three sources, resolved in this
order:
-
Built-in proprioception terms — instantiated directly from the metadata:
Name Term class Slice from joint_posJointPositionTerm(default, scale=1)state.joint_position - defaultjoint_velJointVelocityTerm(num_joints, scale=1)state.joint_velocityactionsLastActionTerm(action_dim)last action recorded via record_action()imu_quaternionImuFieldTerm("imu_quaternion")state.imu_quat(w, x, y, z)imu_angular_velocityImuFieldTerm(...)state.imu_gyroimu_linear_accelerationImuFieldTerm(...)state.imu_accel -
Reference-provider terms — delegated to the attached provider. The provider declares its own supported terms via
supports(name):TrackingReferenceProvider(bar_policy,task_type='tracking'):motion_body_pos_b,motion_body_ori_b.PianoCompositeProvider(pianist_policy,task_type='piano'):target_keys,target_keys_future,progress,key_pressed. (target_keys_futureis the K-frame lookahead used by the song provider; upstreampiano_env_cfgcurrently keeps it commented out, so its slot inobservation_namesis conditional on the trained policy.)
-
Constant terms — caller-supplied fallback. Useful for running without a dataset (zero-fill
motion_body_pos_b) during smoke tests.
Unknown names raise ValueError at runner startup — the build-time
guarantee is that every observation slot has a deterministic source.
Launch parameters
remote_policy_runner (entry point: remote_policy_runner = bar_policy.remote_policy_runner:main):
| Param | Default | Effect |
|---|---|---|
checkpoint_file | (empty) | Path to a local .onnx file. Mutually exclusive with wandb_run_path. |
wandb_run_path | (empty) | W&B run path (entity/project/run-id); the runner downloads the ONNX from the run. |
wandb_checkpoint_name | (empty) | Specific checkpoint filename inside the W&B run (e.g. model_2000.onnx). |
motion_file | '' | Piano-only: explicit local LeRobot directory. Wins over registry and metadata. |
registry_name | '' | Piano-only: HF repo id to use instead of the ONNX-baked one. Useful when iterating on dataset variants. |
episode_index | 0 | Reference-dataset episode index to track. |
command_topic | /remote_policy_controller/command | Where to publish MITCommand. Must match RemotePolicyController's subscription. |
joint_state_topic | /lite/joint_states | Source for proprio state (driven by joint_state_broadcaster, owner-prefixed at bringup). |
policy_dt_override | 0.0 | Force a different tick period than PolicyMetadata.policy_dt. 0.0 means use metadata. |
Argument names match mjlab's play CLI vocabulary so the same string
moves between training-time and deployment-time scripts.
Dataset resolution order
The runner picks the reference dataset source in this priority:
motion_file:=<local path>— air-gapped deployments pre-fetch viahf downloadand point here.registry_name:=<hf_repo_id>— for testing alternative datasets without re-exporting the ONNX.PolicyMetadata.dataset_repo_id— the default. First run will download from Hugging Face Hub.
Logged at startup so you always know which one was used.
Task-specific runner subclasses live downstream
RemotePolicyRunner is generic. The piano task's runner —
pianist_policy.PianoPolicyRunner in
T-K-233/pianist_ros2 —
subclasses it and overrides only _build_reference_provider() to
attach a PianoSongProvider that fetches the LeRobot dataset
declared in meta.dataset_repo_id from Hugging Face Hub. Every other
piece of the runner (ONNX load, observation pack, action decode,
MITCommand publish) is inherited; the subclass adds ~50 LOC.
New task families follow the same pattern: a sibling <task>_ros2
repo with its own <task>_policy package whose runner is a thin
RemotePolicyRunner subclass.
Per-tick flow
Any ReferenceProvider (the built-in TrackingReferenceProvider,
the downstream PianoSongProvider, or any other) is
interchangeable from the runner's perspective — the same per-tick
ordering applies regardless of which one is plugged in.
Each timer fire (at policy_dt):
- Build the observation: every
ObservationTerm.pack(state, slice)runs against the currentMITState(refreshed from/lite/joint_states) and the current reference frame (from the provider). - Run the ONNX forward pass on the flat observation vector.
- Decode the raw action via
target[name] = default[name] + action[name] * scale[name]. - Assemble the
MITCommandwith the configuredK_p/K_dand undriven joints pinned toposition=0. - Publish on
command_topic. obs_mgr.record_action(action)— refreshesLastActionTermand steps the reference provider so the dataset frame advances exactly once.
Steps 5 and 6 are intentionally ordered so the policy never observes the
post-action dataset frame on the same tick it commanded against — every
tick is (t, t+1) in the dataset, never (t+1, t+1).
See also
bar/RemotePolicyController— the controller-side half of this pipeline.- Architecture: Two parallel policy tiers — how the in-process and out-of-process tiers relate.