
Safety pipeline

bar_ros2's safety is layered rather than concentrated in one component. Five layers each enforce one piece of the contract; together they ensure that the worst case is "the robot is in DAMPING within a tick of a fault", never "the robot is doing something unexpected and we didn't notice".

Safety pipeline: fault to DAMPING in ≤1 tick

Layer 1 — Hardware plugins detect transport faults

RobstrideSystem, SitoSystem, and the EtherCAT plugin observe transport-level conditions at every read() tick:

| Condition | Detection |
| --- | --- |
| BUS_OFF | The kernel CAN socket couldn't be opened, or returned ENETDOWN. Sticky: set in on_configure, cleared only on the next on_activate. |
| RX_TIMEOUT | One or more joints haven't reported an OperationStatus frame in > rx_timeout_ms (default 200 ms = 10 ticks at 50 Hz). |
| TX_QUEUE_OVERRUN | The bus library's outbound SPSC ring overflowed (RT producer faster than the I/O thread can drain to the kernel). |
| MOTOR_FAULT | A Robstride status / fault-report frame indicated a non-OK motor state. |
| TEMPERATURE_LIMIT | A specific overtemperature bit was set in a motor's fault frame. |
| INVALID_FRAME | A frame on the bus had the wrong comm-type code or DLC for the protocol. |
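
The RX_TIMEOUT condition above amounts to a per-joint staleness check. The following is a minimal sketch of that idea, not the plugins' actual code: `RxWatchdog`, its method names, and the use of `std::chrono::steady_clock` are assumptions made for illustration, and timestamps are passed in explicitly so the logic can be exercised without a bus.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: tracks when each joint last reported a status frame.
class RxWatchdog {
 public:
  RxWatchdog(std::size_t joint_count, std::chrono::milliseconds timeout,
             Clock::time_point start)
      : timeout_(timeout), last_seen_(joint_count, start) {}

  // Called whenever a status frame for `joint` is received.
  void on_status_frame(std::size_t joint, Clock::time_point now) {
    if (joint < last_seen_.size()) last_seen_[joint] = now;
  }

  // True if one or more joints have been silent for longer than the timeout.
  bool timed_out(Clock::time_point now) const {
    for (const auto& seen : last_seen_) {
      if (now - seen > timeout_) return true;
    }
    return false;
  }

 private:
  std::chrono::milliseconds timeout_;
  std::vector<Clock::time_point> last_seen_;
};
```

With the default of 200 ms and a 50 Hz loop, a joint has to miss ten consecutive ticks before `timed_out` returns true.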

The plugin publishes bar_msgs/SafetyStatus on /safety_status with TRANSIENT_LOCAL durability, so late-joining subscribers (like rqt or a freshly-started mode_manager) immediately see the most recent value. The source field carries the bus interface name (bar_robstride/can0, etc.), so an operator can tell which bus raised the flag.

Each tick, the plugin rebuilds the flags from the conditions it currently observes, not from accumulated history. The one exception is BUS_OFF: the bus can't recover without a configure round-trip, so the flag sticks until the next on_activate. That choice avoids two bad failure modes:

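  • No-sticky-anywhere: BUS_OFF, which is set in on_configure rather than observed on the tick, would vanish from the flags at the first rebuild even though the bus is still down.
  • Sticky-everywhere: a single EMI glitch would condemn the robot to FAULT for the rest of the activation. Even transient drops would require an operator reset to clear, hiding the fact that the bus is healthy again.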

Per-bit publish only happens on change, so the topic stays quiet (level=0, flags=0) for a healthy robot and emits exactly one message per state transition.
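
The rebuild-each-tick rule, the sticky BUS_OFF exception, and publish-on-change can be sketched together as follows. This is an illustration under assumptions, not the plugin source: the bit positions, `TickObservation`, `PublishDecision`, and `FlagTracker` are hypothetical names, and the real bar_msgs/SafetyStatus constants may differ.

```cpp
#include <cstdint>

// Hypothetical bit layout; the real SafetyStatus constants may differ.
enum SafetyFlag : std::uint32_t {
  BUS_OFF = 1u << 0,
  RX_TIMEOUT = 1u << 1,
  TX_QUEUE_OVERRUN = 1u << 2,
};

// Conditions observed during one read() tick.
struct TickObservation {
  bool rx_timeout = false;
  bool tx_queue_overrun = false;
};

struct PublishDecision {
  bool publish;         // true only when the flags changed
  std::uint32_t flags;  // flags rebuilt for this tick
};

class FlagTracker {
 public:
  // on_configure path: the socket could not be opened.
  void latch_bus_off() { bus_off_latched_ = true; }

  // on_activate path: the only place the sticky bit is cleared.
  void on_activate() { bus_off_latched_ = false; }

  // Rebuild the flags from scratch; no history except the BUS_OFF latch.
  PublishDecision tick(const TickObservation& obs) {
    std::uint32_t flags = 0;
    if (bus_off_latched_) flags |= BUS_OFF;
    if (obs.rx_timeout) flags |= RX_TIMEOUT;
    if (obs.tx_queue_overrun) flags |= TX_QUEUE_OVERRUN;
    const bool changed = flags != last_published_;
    last_published_ = flags;
    return {changed, flags};
  }

 private:
  bool bus_off_latched_ = false;
  std::uint32_t last_published_ = 0;
};
```

In this sketch a one-tick glitch produces exactly two publishes (flag set, flag cleared), while a latched BUS_OFF produces one publish and then stays silent until on_activate clears it.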

Layer 2 — Controllers validate their own commands

The hardware plugins are not the only ones who can refuse to do something. Each controller's update() returns a controller_interface::return_type that the controller_manager inspects:

| Controller | Reason it might return ERROR |
| --- | --- |
| RLPolicyController | NaN / non-finite observation, wrong tensor size, action outside configured limits |
| RemotePolicyController | MITCommand joint_names don't match claimed order, array length mismatch, stale command (configurable policy) |
| StandbyController | pose_segment_N malformed (caught at on_configure, not update) |

A non-OK return_type triggers the controller_manager's fallback_controllers mechanism — see Layer 3.
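
As a rough sketch of the kind of check an update() might run before writing commands, the function below rejects wrong sizes, non-finite values, and out-of-limit actions. The local `return_type` enum is a stand-in for controller_interface::return_type so the example builds without ROS; `ActionLimits` and `validate_step` are hypothetical names, not the controllers' real API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for controller_interface::return_type.
enum class return_type { OK, ERROR };

// Hypothetical per-joint action bounds.
struct ActionLimits {
  double lower;
  double upper;
};

return_type validate_step(const std::vector<double>& observation,
                          std::size_t expected_obs_size,
                          const std::vector<double>& action,
                          const std::vector<ActionLimits>& limits) {
  // Wrong tensor size.
  if (observation.size() != expected_obs_size) return return_type::ERROR;
  // NaN or infinite observation.
  for (double v : observation) {
    if (!std::isfinite(v)) return return_type::ERROR;
  }
  // One action per configured joint.
  if (action.size() != limits.size()) return return_type::ERROR;
  // Non-finite action or action outside configured limits.
  for (std::size_t i = 0; i < action.size(); ++i) {
    if (!std::isfinite(action[i])) return return_type::ERROR;
    if (action[i] < limits[i].lower || action[i] > limits[i].upper) {
      return return_type::ERROR;
    }
  }
  return return_type::OK;
}
```

Note that the checks only read pre-sized vectors and never allocate, which keeps them compatible with the RT rules described in Layer 5.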

RLPolicyController and RemotePolicyController additionally have a stale-command policy. If the policy's MITCommand hasn't arrived within stale_command_timeout_ms (default 100 ms = 5 ticks at 50 Hz), the controller writes a fallback pattern rather than re-using the last command. The age is measured against arrival time at the subscription callback, not against MITCommand.header.stamp, so publisher clock skew is irrelevant. The default, passive, writes zero stiffness and zero damping, so the motors go limp. The alternative, hold, freezes the joints at the last commanded pose. Either way the controller stays alive and active; the choice is whether to "fail compliant" or "fail rigid".
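
The two stale-command policies can be sketched as below. The field names of `MitSetpoint` are assumptions about what an MIT-style command carries, and the interpretation of hold (keep the last position and gains, zero the velocity and feed-forward torque) is one plausible reading of "freeze at the last commanded pose", not a confirmed description of the controllers.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

enum class StalePolicy { Passive, Hold };

// Hypothetical per-joint MIT-style setpoint; field names are assumptions.
struct MitSetpoint {
  double position;
  double velocity;
  double kp;         // stiffness
  double kd;         // damping
  double torque_ff;  // feed-forward torque
};

// Age is measured from local arrival time, not from a message header stamp.
bool is_stale(Clock::time_point last_arrival, Clock::time_point now,
              std::chrono::milliseconds timeout) {
  return now - last_arrival > timeout;
}

MitSetpoint select_setpoint(const MitSetpoint& last, bool stale,
                            StalePolicy policy) {
  if (!stale) return last;  // fresh command: pass it through
  MitSetpoint out = last;
  out.velocity = 0.0;
  out.torque_ff = 0.0;
  if (policy == StalePolicy::Passive) {
    // Fail compliant: zero stiffness and damping, joints go limp.
    out.kp = 0.0;
    out.kd = 0.0;
  }
  // Hold (fail rigid): keep the last position target and gains.
  return out;
}
```

Recording the arrival time with a monotonic local clock in the subscription callback is what makes publisher clock skew irrelevant in this scheme.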

Layer 3 — controller_manager's fallback_controllers

Every active-policy controller is configured with fallback_controllers: [damping_controller] in bar_lite_controllers.yaml. The controller_manager interprets this as "if this controller returns ERROR, automatically deactivate it and activate the fallback".

The chain runs from conservative to most conservative:

RLPolicyController     → damping_controller
RemotePolicyController → damping_controller
StandbyController      → damping_controller
DampingController      → zero_torque_controller
ZeroTorqueController   → (no fallback: final fall-back)

zero_torque_controller is the only option safer than damping, reserved for cases where DAMPING itself can't be applied (state interface unavailable, hardware plugin dead). It writes 0 to every interface, so there is no risk of unintended motion regardless of state.
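
To make the chain concrete, the sketch below models it as a lookup table and walks it from any starting controller. It only illustrates the configured hierarchy; it is not how controller_manager implements fallback_controllers. The snake_case names for the policy and standby controllers are hypothetical instance names; only damping_controller and zero_torque_controller appear in the text.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using FallbackTable = std::map<std::string, std::string>;

// Mirrors the chain listed above; zero_torque_controller has no entry
// because it is the end of the chain.
FallbackTable default_fallbacks() {
  return {
      {"rl_policy_controller", "damping_controller"},
      {"remote_policy_controller", "damping_controller"},
      {"standby_controller", "damping_controller"},
      {"damping_controller", "zero_torque_controller"},
  };
}

// Returns the controllers that would be tried in order if each one failed.
std::vector<std::string> fallback_chain(const FallbackTable& table,
                                        const std::string& start) {
  std::vector<std::string> chain{start};
  // Bound the walk so a misconfigured cycle cannot loop forever.
  for (std::size_t hops = 0; hops < table.size(); ++hops) {
    auto it = table.find(chain.back());
    if (it == table.end()) break;
    chain.push_back(it->second);
  }
  return chain;
}
```

Starting from the hypothetical rl_policy_controller, the walk yields damping_controller and then zero_torque_controller, so every path ends at the same last resort.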

Layer 4 — mode_manager reacts to /safety_status

mode_manager subscribes to /safety_status. On any non-OK level:

SafetyStatus.level != OK  →  request_mode(DAMPING)

The transition is STRICT — if it fails because the command interfaces are unavailable, mode_manager requests ZERO_TORQUE instead and writes the failure reason into /control_mode.status_message.

This is belt-and-suspenders on top of Layer 3: even if a controller failed to detect its own bad command, the plugin's safety publish path triggers the FSM-level fallback. And even if the plugin missed an issue, the controller's own validation triggers the controller-manager-level fallback.
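
The reaction described in this layer can be sketched as a small function: request DAMPING on any non-OK level, and escalate to ZERO_TORQUE with the failure reason if that request fails. `Mode`, `SwitchResult`, `Reaction`, and `on_safety_status` are hypothetical names, and the mode switch is injected as a callback so the logic can run without ROS. The only detail taken from the text is that a healthy status has level 0.

```cpp
#include <functional>
#include <string>

enum class Mode { NONE, DAMPING, ZERO_TORQUE };

// Outcome of one mode-switch attempt.
struct SwitchResult {
  bool ok;
  std::string reason;
};

// What the handler ended up requesting, plus the status text it would
// report (the text describes this as /control_mode.status_message).
struct Reaction {
  Mode mode = Mode::NONE;
  std::string status_message;
};

constexpr int kLevelOk = 0;

Reaction on_safety_status(
    int level, const std::function<SwitchResult(Mode)>& request_mode) {
  Reaction reaction;
  if (level == kLevelOk) return reaction;  // healthy: nothing to do

  const SwitchResult damping = request_mode(Mode::DAMPING);
  if (damping.ok) {
    reaction.mode = Mode::DAMPING;
    return reaction;
  }

  // DAMPING could not be applied: escalate and record why.
  reaction.mode = Mode::ZERO_TORQUE;
  reaction.status_message = damping.reason;
  request_mode(Mode::ZERO_TORQUE);
  return reaction;
}
```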

Layer 5 — RT update() discipline

A subtler "safety" layer that's worth naming: the RT update() paths follow the standard RT-safety rules.

  • No allocations on the tick. Every controller / hardware plugin pre-allocates buffers in on_init / on_configure. The realtime_tools::RealtimeBuffer and realtime_tools::RealtimePublisher primitives are the path for any RT-to-non-RT data movement.
  • No DDS-blocking calls. Publishers go through RealtimePublisher's trylock pattern — drop the message if the non-RT thread is mid-publish, rather than blocking the tick.
  • No exceptions across the RT boundary. A throw inside update() would unwind into the controller_manager's RT thread, which is generally not safe under PREEMPT_RT.
  • No logging at tick rate. Use RCLCPP_*_THROTTLE or buffer the message into a non-RT publisher.

Violating these doesn't (directly) cause a safety incident, but it causes scheduler jitter that can make the higher layers slow to react — RX_TIMEOUT trips spuriously because the read thread missed its slot, etc.
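
The try-lock-and-drop idea behind the "no DDS-blocking calls" rule can be illustrated with a single-slot mailbox. This is a simplified stand-in written with `std::atomic<bool>`, not the realtime_tools::RealtimePublisher implementation; `TryLockSlot` and its methods are hypothetical, and `Msg` is assumed to be a trivially copyable type so that copying it never allocates.

```cpp
#include <atomic>

template <typename Msg>
class TryLockSlot {
 public:
  // RT side: never waits. If the non-RT side holds the slot, the message
  // is dropped and the tick continues.
  bool try_publish(const Msg& msg) {
    if (busy_.exchange(true, std::memory_order_acquire)) return false;
    msg_ = msg;
    has_msg_ = true;
    busy_.store(false, std::memory_order_release);
    return true;
  }

  // Non-RT side: allowed to wait briefly, then hands any pending message
  // to fn while holding the slot. Returns whether a message was pending.
  template <typename Fn>
  bool drain(Fn&& fn) {
    while (busy_.exchange(true, std::memory_order_acquire)) {
      // Spin: the RT side only holds the slot for one small copy.
    }
    const bool had = has_msg_;
    if (had) {
      fn(msg_);
      has_msg_ = false;
    }
    busy_.store(false, std::memory_order_release);
    return had;
  }

 private:
  std::atomic<bool> busy_{false};
  Msg msg_{};
  bool has_msg_ = false;
};
```

The design choice is that losing one status message is cheaper than delaying a tick: the RT caller pays at most one atomic exchange and one copy, whatever the non-RT thread is doing.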

Summary

| Layer | Owner | Triggers |
| --- | --- | --- |
| 1. Transport-level | hardware plugin | Bus / motor faults → SafetyStatus.flags |
| 2. Command-validity | controller | NaN / size mismatch / stale → return_type::ERROR |
| 3. Controller-manager fallback | controller_manager | ERROR → activate the controller's fallback_controllers |
| 4. FSM auto-DAMP | mode_manager | SafetyStatus.level != OK → request_mode(DAMPING) |
| 5. RT discipline | every controller / plugin | (preventative: keeps the other layers responsive) |

See also