For the complete workspace, check out my GitHub.
Vision-Based Policy Learning for Human-Following with a GoBilda Mobile Robot
Giovanni Dal Lago
1 Introduction
This project focuses on designing and implementing a vision-based human-following system for a GoBilda mobile robot. The objective was to build a complete, end-to-end pipeline that allows the robot to perceive a person using onboard vision, learn a following behavior from demonstrations, and execute that behavior autonomously in real time.
The system integrates:
1. a computer-vision pipeline (DepthAI + ROS2) for real-time person detection,
2. a custom data-logging node for collecting teleoperated demonstrations,
3. dataset preprocessing and feature engineering,
4. supervised imitation learning using a neural network policy,
5. a ROS2 execution node that deploys the learned policy on the robot.
Figure 1: Block diagram of the full human-following pipeline, including perception, data logging, preprocessing, neural network training, and ROS2 deployment.
2 Computer Vision System

2.1 Software Stack
Perception is handled through the DepthAI ROS interface, which publishes detections from a MobileNet-based object detector on:
/color/mobilenet_detections
The corresponding message type is:
vision_msgs/Detection2DArray
Each detection provides bounding box information:
• bbox.center.x: horizontal pixel coordinate of the detected person,
• bbox.size_y: height of the bounding box, used as a proxy for distance.
2.2 Detection Selection Logic
When multiple detections are present, the bounding box with the largest area is selected under the assumption that it corresponds to the closest and most relevant human target. Two scalar values are extracted:
• cx: lateral error signal,
• s: bounding box height, used as a distance proxy.
These two signals form the complete observation space used by the learned control policy.
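The selection logic above can be sketched as a small helper. This is a minimal illustration, not the project's actual code; the field paths follow the bbox.center.x / bbox.size_y names used in this report, and may differ slightly between vision_msgs versions.

```python
def select_target(detections):
    """Pick the detection with the largest bounding-box area and return
    the two observation signals (cx, s), or None if no person is visible.

    Each element of `detections` is assumed to expose bbox.center.x,
    bbox.size_x and bbox.size_y, as in vision_msgs/Detection2DArray.
    """
    if not detections:
        return None  # no person in frame
    # Largest area is assumed to be the closest, most relevant target.
    best = max(detections, key=lambda d: d.bbox.size_x * d.bbox.size_y)
    cx = best.bbox.center.x   # lateral error signal
    s = best.bbox.size_y      # bounding-box height as distance proxy
    return cx, s
```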
3 Data Logging Node

3.1 Motivation
Imitation learning requires high-quality demonstrations. Each logged sample consists of perception inputs and corresponding control commands:

inputs = (cx, s, history),    outputs = (v, w)

3.2 Implementation

A custom ROS2 node was implemented:

mobilenet_logger.py
The node:
• subscribes to /color/mobilenet_detections and /gobilda/cmd_vel,
• selects the largest bounding box,
• extracts detection features,
• logs synchronized data at 10 Hz into a CSV file.
The CSV format:
timestamp, cx, s, v, w, has_detection
3.3 Design Considerations

• Avoiding file I/O inside callbacks to prevent blocking,
• timer-based logging for consistent sampling,
• frequent flushing for data integrity,
• explicit handling of frames without detections.
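These considerations can be factored so that the ROS2 callbacks only buffer state and a 10 Hz timer performs the file I/O. The sketch below shows one way, with the rclpy subscription wiring omitted; the class name and the exact field handling are illustrative, not taken from mobilenet_logger.py.

```python
import csv
import time

class FollowLogger:
    """Buffers the latest perception features and teleop command, and
    writes one synchronized CSV row per timer tick (10 Hz in the report).
    Callbacks do no file I/O; only the timer touches the file."""

    FIELDS = ["timestamp", "cx", "s", "v", "w", "has_detection"]

    def __init__(self, path):
        self._f = open(path, "w", newline="")
        self._writer = csv.DictWriter(self._f, fieldnames=self.FIELDS)
        self._writer.writeheader()
        self.cx = None          # latest detection features (None = no detection)
        self.s = None
        self.v = 0.0            # latest teleop command
        self.w = 0.0

    def on_detection(self, cx, s):
        """Detection callback: store only, never block on I/O."""
        self.cx, self.s = cx, s

    def on_cmd_vel(self, v, w):
        """cmd_vel callback: store only."""
        self.v, self.w = v, w

    def on_timer(self, now=None):
        """Timer callback (10 Hz): write one synchronized row."""
        has_det = self.cx is not None
        self._writer.writerow({
            "timestamp": now if now is not None else time.time(),
            "cx": self.cx if has_det else "",
            "s": self.s if has_det else "",
            "v": self.v,
            "w": self.w,
            "has_detection": int(has_det),
        })
        self._f.flush()                 # frequent flushing for data integrity
        self.cx = self.s = None         # mark this frame's detection consumed
```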
4 Dataset Collection and Storage
Teleoperated demonstrations were recorded to capture following, turning, stopping, and distance-regulation behaviors. The final dataset was stored as:
follow_data.csv
5 Data Preprocessing
5.1 Motivation
Raw image-space values (cx, s) are poorly suited for learning:

• they depend on camera resolution,
• they have inconsistent scaling,
• they are noisy,
• they are not directly aligned with control semantics.
5.2 Error Normalization

Normalized control errors are computed as:

e_x = cx − cx_ref,    e_s = s_ref − s,
e_x_norm = e_x / image_half_width,    e_s_norm = e_s / s_ref,

where:

• cx_ref is the centered reference,
• image_half_width is the maximum observed deviation,
• s_ref is the median bounding box height.
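The normalization can be written as a small pure function. The reference values are dataset statistics, passed in as parameters; the function name is illustrative.

```python
def normalize_errors(cx, s, cx_ref, half_width, s_ref):
    """Map raw image-space signals (cx, s) to normalized control errors.

    cx_ref: centered lateral reference (pixels)
    half_width: maximum observed lateral deviation (pixels)
    s_ref: median bounding-box height, the distance setpoint (pixels)
    """
    e_x = cx - cx_ref          # lateral error: positive = target to the right
    e_s = s_ref - s            # distance error: positive = target too far
    return e_x / half_width, e_s / s_ref
```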
5.3 Temporal History
To capture short-term dynamics, a three-step temporal window (t0, t−1, t−2) is used. Final features:
e_x_norm_t0, e_s_norm_t0,
e_x_norm_t-1, e_s_norm_t-1,
e_x_norm_t-2, e_s_norm_t-2
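One way to maintain this window at runtime is a fixed-length buffer that emits the six features newest-first, matching the ordering above. The class name is illustrative.

```python
from collections import deque

class HistoryBuffer:
    """Keep the three most recent (e_x_norm, e_s_norm) pairs and emit
    them as the six-element feature vector [t0, t-1, t-2] the policy expects."""

    def __init__(self, steps=3):
        self._buf = deque(maxlen=steps)

    def push(self, e_x_norm, e_s_norm):
        self._buf.append((e_x_norm, e_s_norm))

    def features(self):
        if len(self._buf) < self._buf.maxlen:
            return None  # not enough history yet to form a full window
        out = []
        for ex, es in reversed(self._buf):  # newest first: t0, t-1, t-2
            out.extend([ex, es])
        return out
```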
5.4 Output Normalization
Teleoperation commands are discretized:

v_norm ∈ {0, 1},    w_norm ∈ {−1, 0, +1}

This reframes the problem as discrete action selection.
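The discretization could be implemented by thresholding the continuous teleop commands. The cut-off values below are illustrative assumptions; the report does not state the exact thresholds used.

```python
def discretize(v, w, v_thresh=0.05, w_thresh=0.05):
    """Threshold continuous teleop commands (v, w) into the normalized
    classes v_norm ∈ {0, 1} and w_norm ∈ {-1, 0, +1}.

    v_thresh and w_thresh are assumed values, chosen only to ignore
    joystick noise near zero.
    """
    v_norm = 1 if v > v_thresh else 0
    w_norm = 0 if abs(w) <= w_thresh else (1 if w > 0 else -1)
    return v_norm, w_norm
```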
5.5 Final Dataset
The processed dataset is stored as:
nn_dataset.csv
6 Neural Network Training
6.1 Objective: Discrete Policy Learning
Actions are defined as:
Action ID   v_norm   w_norm   Meaning
0           0        −1       Rotate left
1           0        0        Stop
2           0        +1       Rotate right
3           1        0        Move forward
6.2 Model Architectures
Two multilayer perceptrons were evaluated using scikit-learn.
Model A
• Hidden layer: (32)
• Parameters: ≈ 300
Model B
• Hidden layers: (64, 32)
• Parameters: ≈ 3–5k
Both models use ReLU activations, Adam optimization, feature standardization, and an 80/20 stratified split.
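The training setup can be sketched with scikit-learn's MLPClassifier. This is a minimal reconstruction from the description above (ReLU, Adam, standardization, 80/20 stratified split), not the project's actual training script; the function name and max_iter value are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_policy(X, y, hidden=(64, 32), seed=0):
    """Train a discrete-action policy (Model B by default) and return
    the fitted pipeline plus its held-out accuracy.

    X: (n_samples, 6) array of normalized error history features
    y: (n_samples,) array of action IDs in {0, 1, 2, 3}
    """
    # Stratified 80/20 split, as in the report.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Feature standardization feeding a ReLU MLP optimized with Adam.
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                      solver="adam", max_iter=1000, random_state=seed))
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```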
6.3 Model Selection
The deeper network provided better separation between ambiguous states, particularly between forward motion and turning, and was selected as the final policy.
7 ROS2 Deployment
The trained model is loaded at runtime by a ROS2 execution node that:

1. processes live detections,
2. maintains temporal feature history,
3. predicts a discrete action,
4. maps actions to velocity commands,
5. publishes safe, saturated control signals.
A fallback spin behavior is triggered when detections are temporarily lost, ensuring robust operation.
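Steps 4 and 5, together with the fallback behavior, amount to a small lookup-and-saturate mapping. The sketch below follows the action table from Section 6.1; the velocity limits and spin rate are illustrative assumptions, as the report does not give the exact values.

```python
# Action ID -> (v_norm, w_norm), matching the table in Section 6.1.
ACTION_TO_CMD = {
    0: (0.0, -1.0),  # rotate left
    1: (0.0,  0.0),  # stop
    2: (0.0, +1.0),  # rotate right
    3: (1.0,  0.0),  # move forward
}

def action_to_twist(action, has_detection, v_max=0.3, w_max=0.8,
                    spin_rate=0.4):
    """Map a predicted action ID to saturated (v, w) velocity commands.

    v_max, w_max and spin_rate are hypothetical limits. When the target
    is lost, a slow search spin is commanded to reacquire it.
    """
    if not has_detection:
        return 0.0, spin_rate          # fallback spin behavior
    v_norm, w_norm = ACTION_TO_CMD[action]
    # Scaling by fixed limits keeps every published command saturated.
    return v_norm * v_max, w_norm * w_max
```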
8 Results
The learned policy was tested in a variety of indoor and outdoor environments. The robot demonstrated stable following behavior, smooth turning, and consistent distance regulation. Performance closely matched that of a hand-tuned proportional controller, confirming that the imitation-learning pipeline successfully captured the underlying control strategy.
9 Discussion and Future Work
Key challenges included dataset consistency and sensitivity to visual noise. Future improvements include richer sensory inputs, continuous action prediction, reinforcement learning fine-tuning, and hybrid control strategies.
10 Conclusion
This project resulted in a fully autonomous, vision-based human-following system built from the ground up. From real-time perception and data logging to neural-network training and deployment on hardware, every component was designed, implemented, and integrated into a working robotic system. The close match between the learned policy and a deterministic baseline highlights the effectiveness of the pipeline and demonstrates the practical viability of imitation learning for mobile-robot control.