Skip to main content

Policy runner

bar_policy.remote_policy_runner is the Python rclpy node that drives bar::RemotePolicyController from an ONNX policy + (optionally) a LeRobot reference dataset. This page documents the launch surface, the ONNX metadata schema the runner reads, and the observation-term registry that maps names from the ONNX to ObservationTerm instances.

Why "self-describing ONNX"

The training pipeline writes 13 fields into the ONNX file's custom_metadata_map at export time — joint order, per-joint PD gains, default pose, action scale, observation term names, body names, the LeRobot dataset id, the policy tick period. The runner parses those fields into a typed PolicyMetadata object and never relies on YAML to restate them.

This matters because any mismatch between the training-time graph and the deployment-time YAML breaks the policy silently (wrong joint order or off-by-one observation slice will still produce well-shaped tensors that just behave terribly). Embedding the metadata in the ONNX file means training and deployment cannot drift independently.

Composition

ModuleResponsibility
OnnxPolicyRunneronnxruntime.InferenceSession wrapper. Single-step inference, shape contract float32 (1, N) -> float32 (1, M).
PolicyMetadataFrozen dataclass decoded from custom_metadata_map. Helpers like joint_stiffness_map(), default_joint_pos_map(), action_scale_map() return dict[str, float] keyed by joint name.
term_builders.build_observation_termsResolves each name in observation_names to an ObservationTerm via three-stage lookup: built-in proprioception → reference-provider term → constant fallback. Unknown names raise.
ObservationManagerConcatenates configured terms into the policy input vector each tick. Owns the reference-provider lifecycle (reset on activation, step after each record_action).
TrackingReferenceProviderLoads a LeRobot dataset (local path or HF repo id) and serves per-frame body-pos / body-ori / joint references on the manager's cadence. Mirrors the training-side pianist_tracking.loader.teleop_sequence layout exactly. Activated when task_type='tracking'.
_build_reference_provider hookRemotePolicyRunner extracts the provider-construction step as an overridable method. Base class handles tracking; downstream task packages (e.g. pianist_ros2's pianist_policy.PianoPolicyRunner) subclass and override only this hook, inheriting all ONNX loading / observation packing / action decoding / MITCommand publishing. See Task packages below.
PolicyActionDecodertarget[name] = default[name] + action[name] * scale[name].
ActionMapperAssembles a full-articulation MITCommand from any {joint_name: target_position} map. Undriven joints (those not in action_joint_names) are pinned to position=0 with the same K/D the policy ships, so the controller never holds an arbitrary pose.

ONNX metadata schema

The 13 required keys (parsed from session.get_modelmeta().custom_metadata_map):

KeyTypeMeaning
task_type"tracking" | "piano" | "locomotion"Selects the reference provider class.
joint_namestuple[str, ...]Full articulation list (17 for Lite). Indexes joint_stiffness, joint_damping, default_joint_pos.
action_joint_namestuple[str, ...]The subset the policy actually emits actions for. Joints in joint_names but not here are "undriven" and get pinned to zero.
joint_stiffnesstuple[float, ...]Per-joint K_p, aligned to joint_names.
joint_dampingtuple[float, ...]Per-joint K_d, aligned to joint_names.
default_joint_postuple[float, ...]Per-joint default position (the q_default in out = (q - q_default) * scale).
observation_namestuple[str, ...]The flat observation vector, term by term. See Observation term registry.
command_namestuple[str, ...]Future-use; runtime command terms (e.g. velocity targets) that aren't part of the proprio observation.
action_scalefloat | tuple[float, ...]Per-action-joint scale. Scalar broadcasts across all joints.
policy_dtfloatPolicy tick period in seconds. The runner uses this for its create_timer unless policy_dt_override is set.
body_namestuple[str, ...]Reference-tracked bodies (for motion_body_pos_b / motion_body_ori_b terms).
dataset_repo_idstrDefault HF repo id for the reference dataset; overridable at launch.
lookahead_stepstuple[int, ...]Piano-only: frame offsets the policy was trained to look ahead at.

Observation term registry

observation_names is a CSV like motion_body_pos_b, motion_body_ori_b, joint_pos, joint_vel, actions (tracking task) or target_keys, progress, key_pressed, joint_pos, joint_vel, actions (piano task). Names match upstream pianist-tracking-mj actor terms directly. Each name maps to one of three sources, resolved in this order:

  1. Built-in proprioception terms — instantiated directly from the metadata:

    NameTerm classSlice from
    joint_posJointPositionTerm(default, scale=1)state.joint_position - default
    joint_velJointVelocityTerm(num_joints, scale=1)state.joint_velocity
    actionsLastActionTerm(action_dim)last action recorded via record_action()
    imu_quaternionImuFieldTerm("imu_quaternion")state.imu_quat (w, x, y, z)
    imu_angular_velocityImuFieldTerm(...)state.imu_gyro
    imu_linear_accelerationImuFieldTerm(...)state.imu_accel
  2. Reference-provider terms — delegated to the attached provider. The provider declares its own supported terms via supports(name):

    • TrackingReferenceProvider (bar_policy, task_type='tracking'): motion_body_pos_b, motion_body_ori_b.
    • PianoCompositeProvider (pianist_policy, task_type='piano'): target_keys, target_keys_future, progress, key_pressed. (target_keys_future is the K-frame lookahead used by the song provider; upstream piano_env_cfg currently keeps it commented out, so its slot in observation_names is conditional on the trained policy.)
  3. Constant terms — caller-supplied fallback. Useful for running without a dataset (zero-fill motion_body_pos_b) during smoke tests.

Unknown names raise ValueError at runner startup — the build-time guarantee is that every observation slot has a deterministic source.

Launch parameters

remote_policy_runner (entry point: remote_policy_runner = bar_policy.remote_policy_runner:main):

ParamDefaultEffect
checkpoint_file(empty)Path to a local .onnx file. Mutually exclusive with wandb_run_path.
wandb_run_path(empty)W&B run path (entity/project/run-id); the runner downloads the ONNX from the run.
wandb_checkpoint_name(empty)Specific checkpoint filename inside the W&B run (e.g. model_2000.onnx).
motion_file''Piano-only: explicit local LeRobot directory. Wins over registry and metadata.
registry_name''Piano-only: HF repo id to use instead of the ONNX-baked one. Useful when iterating on dataset variants.
episode_index0Reference-dataset episode index to track.
command_topic/remote_policy_controller/commandWhere to publish MITCommand. Must match RemotePolicyController's subscription.
joint_state_topic/lite/joint_statesSource for proprio state (driven by joint_state_broadcaster, owner-prefixed at bringup).
policy_dt_override0.0Force a different tick period than PolicyMetadata.policy_dt. 0.0 means use metadata.

Argument names match mjlab's play CLI vocabulary so the same string moves between training-time and deployment-time scripts.

Dataset resolution order

The runner picks the reference dataset source in this priority:

  1. motion_file:=<local path> — air-gapped deployments pre-fetch via hf download and point here.
  2. registry_name:=<hf_repo_id> — for testing alternative datasets without re-exporting the ONNX.
  3. PolicyMetadata.dataset_repo_id — the default. First run will download from Hugging Face Hub.

Logged at startup so you always know which one was used.

Task-specific runner subclasses live downstream

RemotePolicyRunner is generic. The piano task's runner — pianist_policy.PianoPolicyRunner in T-K-233/pianist_ros2 — subclasses it and overrides only _build_reference_provider() to attach a PianoSongProvider that fetches the LeRobot dataset declared in meta.dataset_repo_id from Hugging Face Hub. Every other piece of the runner (ONNX load, observation pack, action decode, MITCommand publish) is inherited; the subclass adds ~50 LOC.

New task families follow the same pattern: a sibling <task>_ros2 repo with its own <task>_policy package whose runner is a thin RemotePolicyRunner subclass.

Per-tick flow

Any ReferenceProvider (the built-in TrackingReferenceProvider, the downstream PianoSongProvider, or any other) is interchangeable from the runner's perspective — the same per-tick ordering applies regardless of which one is plugged in.

Each timer fire (at policy_dt):

  1. Build the observation: every ObservationTerm.pack(state, slice) runs against the current MITState (refreshed from /lite/joint_states) and the current reference frame (from the provider).
  2. Run the ONNX forward pass on the flat observation vector.
  3. Decode the raw action via target[name] = default[name] + action[name] * scale[name].
  4. Assemble the MITCommand with the configured K_p / K_d and undriven joints pinned to position=0.
  5. Publish on command_topic.
  6. obs_mgr.record_action(action) — refreshes LastActionTerm and steps the reference provider so the dataset frame advances exactly once.

Steps 5 and 6 are intentionally ordered so the policy never observes the post-action dataset frame on the same tick it commanded against — every tick is (t, t+1) in the dataset, never (t+1, t+1).

See also