Recover from a fault

/safety_status reports a non-OK level and mode_manager auto-DAMPs. What to do depends on which flag tripped. This page is the operator's runbook.

Read the status first

Open a shell into the workspace so ros2 is on PATH:

cd bar_ws
pixi shell
ros2 topic echo /safety_status
# level: 2     # 0=OK, 1=WARNING, 2=FAULT, 3=CRITICAL
# flags: 8     # bitmask
# source: bar_robstride/can0
# message: ""

Translate flags:

Bit	Constant	Hex
0	`FLAG_BUS_OFF`	`0x01`
1	`FLAG_RX_TIMEOUT`	`0x02`
2	`FLAG_TX_QUEUE_OVERRUN`	`0x04`
3	`FLAG_MOTOR_FAULT`	`0x08`
4	`FLAG_TEMPERATURE_LIMIT`	`0x10`
5	`FLAG_INVALID_FRAME`	`0x20`

Multiple bits can be set simultaneously. The numeric value above (8 = 0x08) means MOTOR_FAULT alone tripped.

Per-flag recovery

`FLAG_BUS_OFF` (sticky, level=CRITICAL)

Cause: kernel CAN socket open failed, or the kernel reported ENETDOWN mid-operation.

Recovery:

Check the adapter: ip -d link show <iface>. If BUS-OFF or missing, unplug + replug the USB-to-CAN cable.
Bring the bus back up: sudo ip link set <iface> up type can bitrate 1000000.
Restart the launch — BUS_OFF is the one sticky flag in the plugin; it doesn't self-clear without a configure round-trip.

`FLAG_RX_TIMEOUT` (level=FAULT)

Cause: one or more joints went silent for longer than rx_timeout_ms (default 200 ms). Usually a motor that lost power, a cable that came loose, or a bus the kernel briefly stopped reading from.

Recovery:

Confirm motors are powered: bar bus discover --iface canN
Confirm wiring — wiggle the daisy chain connector at each motor.
Restart the launch. RX_TIMEOUT clears on on_activate so a relaunch is sufficient; no power-cycle needed.

If one specific joint repeatedly times out, that motor or its section of the bus has an issue. Substitute a known-good motor at that ID and confirm.

`FLAG_TX_QUEUE_OVERRUN` (level=WARNING)

Cause: kernel TX qdisc filling. Almost always motors-not-ACKing or USB-adapter bottlenecking.

Recovery: see Diagnose ENOBUFS for the full runbook.

`FLAG_MOTOR_FAULT` (level=FAULT)

Cause: a Robstride status / fault-report frame indicated a non-OK internal motor state. The specific sub-cause is encoded in the motor firmware's fault_bits byte, which the plugin folds into this generic flag.

Recovery:

Reconfigure (relaunch). Some motor faults clear with a re-Enable. The plugin's on_activate issues a fresh Enable to every joint.
If the fault persists, single-step diagnosis:
```
bar bus ping --iface canN --id <X> --read-status
```
The reply's fault_bits byte tells you which specific sub-cause:
- bit 0 = overtemperature (also raised as FLAG_TEMPERATURE_LIMIT)
- bit 1 = gate driver fault (usually means the motor went into overcurrent — under-spec actuator, mechanical jam)
- bit 2 = undervoltage (supply sag)
- bit 3 = overvoltage (regen during fast deceleration; can happen with a tiny PSU)
Vendor's MotorControlGUI may be needed if the motor latched and doesn't clear with a re-Enable.

`FLAG_TEMPERATURE_LIMIT` (level=FAULT)

Cause: a motor exceeded its overtemp threshold. Surfaced separately from MOTOR_FAULT because it's the most common motor fault and has a specific operator response.

Recovery:

Stop driving the joint immediately. Let it cool for several minutes.
Reduce the operating point — lower K_p, lower torque_limit, or run less load on the joint.
If the motor reaches overtemp during STANDBY interpolation, the target pose may be reaching beyond what the actuator can hold against gravity — verify the URDF lower/upper limits are correct and the standby pose YAML doesn't put the joint past them.

`FLAG_INVALID_FRAME` (level=WARNING)

Cause: a frame on the bus didn't match the Robstride protocol (wrong comm-type code, wrong DLC). Single occurrences are usually EMI glitches; recurring means another device is sharing the bus.

Recovery:

Single occurrence — ignore. The flag clears at the next tick if no further bad frames arrive.
Recurring — check for another device on the bus. CAN buses are shared; if another controller is connected the frames may collide. Disconnect the other device.

After fixing — what to do

mode_manager doesn't auto-recover from DAMPING. Even after the fault clears, the FSM stays in DAMPING until the operator manually walks back up:

# /safety_status is now OK (level=0)
# But mode_manager is still in DAMPING.
# Walk back via the gamepad:
#  L1+A or L1+B   →   STANDBY
#  R1+A           →   REMOTE
#  R1+B           →   LOCOMOTION
# Or via the std_srvs/Trigger services:
ros2 service call /bar/mode/load          std_srvs/srv/Trigger
ros2 service call /bar/mode/start_remote  std_srvs/srv/Trigger

Or via ros2 control switch_controllers directly.

The reason for not auto-recovering: a fault that just cleared may re-occur. The operator should observe the robot in DAMPING for at least a few seconds, confirm it's actually steady, then proceed.

Soft-restart in code

If you suspect transient flags are stuck:

# Force a switch through ZERO_TORQUE to reset accumulated state.
ros2 control switch_controllers \
    --deactivate damping_controller \
    --activate   zero_torque_controller

# Then activate as usual:
ros2 control switch_controllers \
    --deactivate zero_torque_controller \
    --activate   damping_controller

on_activate clears the accumulated fault_flags_ (except sticky BUS_OFF), so toggling through ZERO_TORQUE is the soft reset for non-sticky flags.

Recover from a fault

Read the status first​

Per-flag recovery​

FLAG_BUS_OFF (sticky, level=CRITICAL)​

FLAG_RX_TIMEOUT (level=FAULT)​

FLAG_TX_QUEUE_OVERRUN (level=WARNING)​

FLAG_MOTOR_FAULT (level=FAULT)​

FLAG_TEMPERATURE_LIMIT (level=FAULT)​

FLAG_INVALID_FRAME (level=WARNING)​

After fixing — what to do​

Soft-restart in code​

See also​

Read the status first

Per-flag recovery

`FLAG_BUS_OFF` (sticky, level=CRITICAL)

`FLAG_RX_TIMEOUT` (level=FAULT)

`FLAG_TX_QUEUE_OVERRUN` (level=WARNING)

`FLAG_MOTOR_FAULT` (level=FAULT)

`FLAG_TEMPERATURE_LIMIT` (level=FAULT)

`FLAG_INVALID_FRAME` (level=WARNING)

After fixing — what to do

Soft-restart in code

See also