For the complete workspace, check out my GitHub.
Vision-Based Policy Learning for Human-Following with a GoBilda Mobile Robot
Giovanni Dal Lago
1 Introduction
This project focuses on designing and implementing a vision-based human-following system for a GoBilda mobile robot. The objective was to build a complete, end-to-end pipeline that allows the robot to perceive a person using onboard vision, learn a following behavior from demonstrations, and execute that behavior autonomously in real time.
The system integrates:
1. a computer-vision pipeline (DepthAI + ROS2) for real-time person detection,
2. a custom data-logging node for collecting teleoperated demonstrations,
3. dataset preprocessing and feature engineering,
4. supervised imitation learning using a neural network policy,
5. a ROS2 execution node that deploys the learned policy on the robot.
Figure 1: Block diagram of the full human-following pipeline, including perception, data logging, preprocessing, neural network training, and ROS2 deployment.
2 Computer Vision System

2.1 Software Stack
Perception is handled through the DepthAI ROS interface, which publishes detections from a MobileNet-based object detector on:
/color/mobilenet_detections
The corresponding message type is:
vision_msgs/Detection2DArray
Each detection provides bounding box information:
• bbox.center.x: horizontal pixel coordinate of the detected person,
• bbox.size_y: height of the bounding box, used as a proxy for distance.
2.2 Detection Selection Logic
When multiple detections are present, the bounding box with the largest area is selected under the assumption that it corresponds to the closest and most relevant human target. Two scalar values are extracted:
• cx: lateral error signal,
• s: bounding box height, used as a distance proxy.
These two signals form the complete observation space used by the learned control policy.
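The selection logic above can be sketched as a small helper. This is a minimal illustration, not the project's actual code; the field paths follow the bbox.center.x / bbox.size_y names used in this report, and may differ slightly between vision_msgs versions.

```python
def select_target(detections):
    """Pick the detection with the largest bounding-box area and return
    the two observation signals (cx, s), or None if no person is visible.

    Each element of `detections` is assumed to expose bbox.center.x,
    bbox.size_x and bbox.size_y, as in vision_msgs/Detection2DArray.
    """
    if not detections:
        return None  # no person in frame
    # Largest area is assumed to be the closest, most relevant target.
    best = max(detections, key=lambda d: d.bbox.size_x * d.bbox.size_y)
    cx = best.bbox.center.x   # lateral error signal
    s = best.bbox.size_y      # bounding-box height as distance proxy
    return cx, s
```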
3 Data Logging Node

3.1 Motivation
Imitation learning requires high-quality demonstrations. Each logged sample consists of perception inputs and corresponding control commands:

inputs = (cx, s, history),    outputs = (v, w)

3.2 Implementation

A custom ROS2 node was implemented:

mobilenet_logger.py
The node:
• subscribes to /color/mobilenet_detections and /gobilda/cmd_vel,
• selects the largest bounding box,
• extracts detection features,
• logs synchronized data at 10 Hz into a CSV file.
The CSV format:
timestamp, cx, s, v, w, has_detection
3.3 Design Considerations

• Avoiding file I/O inside callbacks to prevent blocking,
• timer-based logging for consistent sampling,
• frequent flushing for data integrity,
• explicit handling of frames without detections.
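These considerations can be factored so that the ROS2 callbacks only buffer state and a 10 Hz timer performs the file I/O. The sketch below shows one way, with the rclpy subscription wiring omitted; the class name and the exact field handling are illustrative, not taken from mobilenet_logger.py.

```python
import csv
import time

class FollowLogger:
    """Buffers the latest perception features and teleop command, and
    writes one synchronized CSV row per timer tick (10 Hz in the report).
    Callbacks do no file I/O; only the timer touches the file."""

    FIELDS = ["timestamp", "cx", "s", "v", "w", "has_detection"]

    def __init__(self, path):
        self._f = open(path, "w", newline="")
        self._writer = csv.DictWriter(self._f, fieldnames=self.FIELDS)
        self._writer.writeheader()
        self.cx = None          # latest detection features (None = no detection)
        self.s = None
        self.v = 0.0            # latest teleop command
        self.w = 0.0

    def on_detection(self, cx, s):
        """Detection callback: store only, never block on I/O."""
        self.cx, self.s = cx, s

    def on_cmd_vel(self, v, w):
        """cmd_vel callback: store only."""
        self.v, self.w = v, w

    def on_timer(self, now=None):
        """Timer callback (10 Hz): write one synchronized row."""
        has_det = self.cx is not None
        self._writer.writerow({
            "timestamp": now if now is not None else time.time(),
            "cx": self.cx if has_det else "",
            "s": self.s if has_det else "",
            "v": self.v,
            "w": self.w,
            "has_detection": int(has_det),
        })
        self._f.flush()                 # frequent flushing for data integrity
        self.cx = self.s = None         # mark this frame's detection consumed
```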
4 Dataset Collection and Storage
Teleoperated demonstrations were recorded to capture following, turning, stopping, and distance-regulation behaviors. The final dataset was stored as:
follow_data.csv
5 Data Preprocessing
5.1 Motivation
Raw image-space values (cx, s) are poorly suited for learning:

• they depend on camera resolution,
• they have inconsistent scaling,
• they are noisy,
• they are not directly aligned with control semantics.
5.2 Error Normalization

Normalized control errors are computed as:

e_x = cx − cx_ref,    e_s = s_ref − s,
e_x_norm = e_x / image_half_width,    e_s_norm = e_s / s_ref,

where:

• cx_ref is the centered reference,
• image_half_width is the maximum observed deviation,
• s_ref is the median bounding box height.
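The normalization can be written as a small pure function. The reference values are dataset statistics, passed in as parameters; the function name is illustrative.

```python
def normalize_errors(cx, s, cx_ref, half_width, s_ref):
    """Map raw image-space signals (cx, s) to normalized control errors.

    cx_ref: centered lateral reference (pixels)
    half_width: maximum observed lateral deviation (pixels)
    s_ref: median bounding-box height, the distance setpoint (pixels)
    """
    e_x = cx - cx_ref          # lateral error: positive = target to the right
    e_s = s_ref - s            # distance error: positive = target too far
    return e_x / half_width, e_s / s_ref
```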
5.3 Temporal History
To capture short-term dynamics, a three-step temporal window (t0, t−1, t−2) is used. Final features:
e_x_norm_t0, e_s_norm_t0,
e_x_norm_t-1, e_s_norm_t-1,
e_x_norm_t-2, e_s_norm_t-2
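One way to maintain this window at runtime is a fixed-length buffer that emits the six features newest-first, matching the ordering above. The class name is illustrative.

```python
from collections import deque

class HistoryBuffer:
    """Keep the three most recent (e_x_norm, e_s_norm) pairs and emit
    them as the six-element feature vector [t0, t-1, t-2] the policy expects."""

    def __init__(self, steps=3):
        self._buf = deque(maxlen=steps)

    def push(self, e_x_norm, e_s_norm):
        self._buf.append((e_x_norm, e_s_norm))

    def features(self):
        if len(self._buf) < self._buf.maxlen:
            return None  # not enough history yet to form a full window
        out = []
        for ex, es in reversed(self._buf):  # newest first: t0, t-1, t-2
            out.extend([ex, es])
        return out
```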
5.4 Output Normalization
Teleoperation commands are discretized:

v_norm ∈ {0, 1},    w_norm ∈ {−1, 0, +1}

This reframes the problem as discrete action selection.
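The discretization could be implemented by thresholding the continuous teleop commands. The cut-off values below are illustrative assumptions; the report does not state the exact thresholds used.

```python
def discretize(v, w, v_thresh=0.05, w_thresh=0.05):
    """Threshold continuous teleop commands (v, w) into the normalized
    classes v_norm ∈ {0, 1} and w_norm ∈ {-1, 0, +1}.

    v_thresh and w_thresh are assumed values, chosen only to ignore
    joystick noise near zero.
    """
    v_norm = 1 if v > v_thresh else 0
    w_norm = 0 if abs(w) <= w_thresh else (1 if w > 0 else -1)
    return v_norm, w_norm
```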
5.5 Final Dataset
The processed dataset is stored as:
nn_dataset.csv
6 Neural Network Training
6.1 Objective: Discrete Policy Learning
Actions are defined as:
Action ID   v_norm   w_norm   Meaning
0           0        −1       Rotate left
1           0        0        Stop
2           0        +1       Rotate right
3           1        0        Move forward
6.2 Model Architectures
Two multilayer perceptrons were evaluated using scikit-learn.
Model A
• Hidden layer: (32)
• Parameters: ≈ 300
Model B
• Hidden layers: (64, 32)
• Parameters: ≈ 3–5k
Both models use ReLU activations, Adam optimization, feature standardization, and an 80/20 stratified split.
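The training setup can be sketched with scikit-learn's MLPClassifier. This is a minimal reconstruction from the description above (ReLU, Adam, standardization, 80/20 stratified split), not the project's actual training script; the function name and max_iter value are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_policy(X, y, hidden=(64, 32), seed=0):
    """Train a discrete-action policy (Model B by default) and return
    the fitted pipeline plus its held-out accuracy.

    X: (n_samples, 6) array of normalized error history features
    y: (n_samples,) array of action IDs in {0, 1, 2, 3}
    """
    # Stratified 80/20 split, as in the report.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Feature standardization feeding a ReLU MLP optimized with Adam.
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                      solver="adam", max_iter=1000, random_state=seed))
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```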
6.3 Model Selection
The deeper network provided better separation between ambiguous states, particularly between forward motion and turning, and was selected as the final policy.
7 ROS2 Deployment
The trained model is loaded at runtime by a ROS2 execution node that:

1. processes live detections,
2. maintains temporal feature history,
3. predicts a discrete action,
4. maps actions to velocity commands,
5. publishes safe, saturated control signals.
A fallback spin behavior is triggered when detections are temporarily lost, ensuring robust operation.
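Steps 4 and 5, together with the fallback behavior, amount to a small lookup-and-saturate mapping. The sketch below follows the action table from Section 6.1; the velocity limits and spin rate are illustrative assumptions, as the report does not give the exact values.

```python
# Action ID -> (v_norm, w_norm), matching the table in Section 6.1.
ACTION_TO_CMD = {
    0: (0.0, -1.0),  # rotate left
    1: (0.0,  0.0),  # stop
    2: (0.0, +1.0),  # rotate right
    3: (1.0,  0.0),  # move forward
}

def action_to_twist(action, has_detection, v_max=0.3, w_max=0.8,
                    spin_rate=0.4):
    """Map a predicted action ID to saturated (v, w) velocity commands.

    v_max, w_max and spin_rate are hypothetical limits. When the target
    is lost, a slow search spin is commanded to reacquire it.
    """
    if not has_detection:
        return 0.0, spin_rate          # fallback spin behavior
    v_norm, w_norm = ACTION_TO_CMD[action]
    # Scaling by fixed limits keeps every published command saturated.
    return v_norm * v_max, w_norm * w_max
```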
8 Results
The learned policy was tested in a variety of indoor and outdoor environments. The robot demonstrated stable following behavior, smooth turning, and consistent distance regulation. Performance closely matched that of a hand-tuned proportional controller, confirming that the imitation-learning pipeline successfully captured the underlying control strategy.
9 Discussion and Future Work
Key challenges included dataset consistency and sensitivity to visual noise. Future improvements include richer sensory inputs, continuous action prediction, reinforcement learning fine-tuning, and hybrid control strategies.
10 Conclusion
This project resulted in a fully autonomous, vision-based human-following system built from the ground up. From real-time perception and data logging to neural-network training and deployment on hardware, every component was designed, implemented, and integrated into a working robotic system. The close match between the learned policy and a deterministic baseline highlights the effectiveness of the pipeline and demonstrates the practical viability of imitation learning for mobile-robot control.