Safety pipeline

bar_ros2's safety layer is layered rather than concentrated. Three subsystems each enforce one piece of the contract; together they make sure that "the robot is in DAMPING within a tick of a fault" is the worst case, never "the robot is doing something unexpected and we didn't notice".

Safety pipeline: fault to DAMPING in ≤1 tick

Layer 1 — Hardware plugins detect transport faults

RobstrideSystem, SitoSystem, and the EtherCAT plugin observe transport-level conditions at every read() tick:

Condition	Detection
`BUS_OFF`	The kernel CAN socket couldn't be opened, or returned `ENETDOWN`. Sticky — set in `on_configure`, only cleared on the next `on_activate`.
`RX_TIMEOUT`	One or more joints haven't reported an `OperationStatus` frame in > `rx_timeout_ms` (default 200 ms ≈ 10 ticks at 50 Hz).
`TX_QUEUE_OVERRUN`	The bus library's outbound SPSC ring overflowed (RT producer faster than the I/O thread can drain to the kernel).
`MOTOR_FAULT`	A Robstride status / fault-report frame indicated a non-OK motor state.
`TEMPERATURE_LIMIT`	A specific overtemperature bit was set in a motor's fault frame.
`INVALID_FRAME`	A frame on the bus had the wrong comm-type code or DLC for the protocol.

The plugin publishes bar_msgs/SafetyStatus on /safety_status — TRANSIENT_LOCAL durability so late-joining subscribers (like rqt or a freshly-started mode_manager) immediately see the most recent value. The source field carries the bus interface name (bar_robstride/can0, etc.), so an operator can tell which bus flagged.

Each tick, the plugin rebuilds flags from currently observed conditions, not accumulated history. The exception is BUS_OFF — which can't self-recover without a configure round-trip — which sticks until activate. That choice avoids two bad failure modes:

No-sticky-anywhere: a single EMI glitch would condemn the robot to FAULT for the rest of the activation.
Sticky-everywhere: even transient drops would require an operator reset to clear, masking when the bus is actually healthy now.

Per-bit publish only happens on change, so the topic stays quiet (level=0, flags=0) for a healthy robot and emits exactly one message per state transition.

Layer 2 — Controllers validate their own commands

The hardware plugins are not the only ones who can refuse to do something. Each controller's update() returns a controller_interface::return_type that the controller_manager inspects:

Controller	Reason it might return `ERROR`
`RLPolicyController`	NaN / non-finite observation, wrong tensor size, action outside configured limits
`RemotePolicyController`	`MITCommand` joint_names don't match claimed order, array length mismatch, stale command (configurable policy)
`StandbyController`	`pose_segment_N` malformed (caught at `on_configure`, not `update`)

A non-OK return_type triggers the controller_manager's fallback_controllers mechanism — see Layer 3.

RLPolicyController and RemotePolicyController additionally have stale-command policies: if the policy's MITCommand hasn't arrived within stale_command_timeout_ms (default 100 ms = 5 ticks at 50 Hz, measured against arrival time at the subscription callback — not against MITCommand.header.stamp — so publisher clock skew is irrelevant), the controller writes a fallback pattern rather than re-using the last command. Default passive → zero stiffness / damping → motors go limp. Alternative hold → freeze at the last commanded pose. Either way the controller stays alive and active; the choice is whether to "fail compliant" or "fail rigid".

Layer 3 — controller_manager's `fallback_controllers`

Every active-policy controller is configured with fallback_controllers: [damping_controller] in bar_lite_controllers.yaml. The controller_manager interprets this as "if this controller returns ERROR, automatically deactivate it and activate the fallback".

The hierarchy is conservative to most-conservative:

RLPolicyController     → damping_controller
RemotePolicyController → damping_controller
StandbyController      → damping_controller
DampingController      → zero_torque_controller
ZeroTorqueController   → (no fallback — final fall-back)

zero_torque_controller is the unique safer-than-damping option, reserved for cases where DAMPING itself can't be applied (state interface unavailable, hardware plugin dead). It writes 0 to every interface — no risk of unintended motion regardless of state.

Layer 4 — `mode_manager` reacts to `/safety_status`

mode_manager subscribes to /safety_status. On any non-OK level:

SafetyStatus.level != OK  →  request_mode(DAMPING)

The transition is STRICT — if it fails because the command interfaces are unavailable, mode_manager requests ZERO_TORQUE instead and writes the failure reason into /control_mode.status_message.

This is belt-and-suspenders on top of Layer 3: even if a controller failed to detect its own bad command, the plugin's safety publish path triggers the FSM-level fallback. And even if the plugin missed an issue, the controller's own validation triggers the controller-manager-level fallback.

Layer 5 — RT update() discipline

A subtler "safety" layer that's worth naming: the RT update() paths follow the standard RT-safety rules.

No allocations on the tick. Every controller / hardware plugin pre-allocates buffers in on_init / on_configure. The realtime_tools::RealtimeBuffer and realtime_tools::RealtimePublisher primitives are the path for any RT-to-non-RT data movement.
No DDS-blocking calls. Publishers go through RealtimePublisher's trylock pattern — drop the message if the non-RT thread is mid-publish, rather than blocking the tick.
No exceptions across the RT boundary. A throw inside update() would unwind into the controller_manager's RT thread, which is generally not safe under PREEMPT_RT.
No logging at tick rate. Use RCLCPP_*_THROTTLE or buffer the message into a non-RT publisher.

Violating these doesn't (directly) cause a safety incident, but it causes scheduler jitter that can make the higher layers slow to react — RX_TIMEOUT trips spuriously because the read thread missed its slot, etc.

Summary

Layer	Owner	Triggers
1. Transport-level	hardware plugin	Bus / motor faults → `SafetyStatus.flags`
2. Command-validity	controller	NaN / size mismatch / stale → `return_type::ERROR`
3. Controller-manager fallback	controller_manager	`ERROR` → activate the controller's `fallback_controllers`
4. FSM auto-DAMP	`mode_manager`	`SafetyStatus.level != OK` → `request_mode(DAMPING)`
5. RT discipline	every controller / plugin	(preventative — keeps the other layers responsive)

Safety pipeline

Layer 1 — Hardware plugins detect transport faults​

Layer 2 — Controllers validate their own commands​

Layer 3 — controller_manager's fallback_controllers​

Layer 4 — mode_manager reacts to /safety_status​

Layer 5 — RT update() discipline​

Summary​

See also​