Agnostic Crowd Counting
16 Aug 2019 - Vincent
During the penultimate year of my engineering degree at ESIEE Paris, I interned at the Audio-visual Information Processing Lab (VIP Lab) as part of the CAPES/COFECUB research program entitled Hierarchical Graph-based Analysis of Image, Video and Multimedia Data.
My research consisted of studying the following paper from the Visual Geometry Group at the University of Oxford:
Lu, Erika et al. “Class-agnostic counting.” Asian Conference on Computer Vision. Springer, Cham, 2018.
I tested this new approach on two other crowd datasets (UCF-CC-50 and UCF-QNRF). The neural network is built with Keras on a TensorFlow backend.
Table of Contents
- Motivation
- Approaches
- Datasets
- The basics
- The network
Motivation
Crowd counting and crowd analysis are of significant importance from a safety perspective: accurate estimates of crowd size help monitor and manage large public gatherings.
Approaches
To estimate the number of people in an image, there are two main approaches: detection-based methods and regression-based methods.
Detection-based counting
A visual object detector is slid across the image to detect object instances, such as human faces; the final count is the number of detections.
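For illustration, here is a minimal detection-based counter using OpenCV's stock Haar-cascade face detector, a classic sliding-window detector. The file name is a placeholder:

```python
import cv2

# Load the image and convert to grayscale for the detector.
img = cv2.imread("crowd.jpg")  # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# OpenCV ships a pre-trained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# detectMultiScale slides the detector over the image at several scales.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"Count: {len(faces)} faces detected")
```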
Regression-based counting
The network directly regresses a scalar (the number of objects) as output, which is then compared to the ground-truth count.
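As a sketch, a regression-based counter can be as small as a convolutional trunk that collapses to a single scalar. The architecture below is purely illustrative, not the one used in this work:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),     # fixed input size for simplicity
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(4),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1),                      # single scalar: the predicted count
])
model.compile(optimizer="adam", loss="mae")  # loss against ground-truth counts
```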
Datasets
I used these two datasets:
- UCF-CC-50
- UCF-QNRF
Generating dot annotated images
The program takes dot-annotated files as input.
Each image in the UCF-CC-50 dataset comes with an associated .mat file containing the location of every person's head. To generate the dot-annotated files, I wrote a simple program that reads the (x, y) positions of the heads and renders them as an image of points.
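Below is a minimal sketch of such a generator. The .mat key name (annPoints) and the file names are assumptions about the dataset layout; adapt them to your copy of UCF-CC-50:

```python
# A sketch of the dot-annotation generator. The .mat key ('annPoints') and
# the output convention (one white pixel per head on a black background)
# are assumptions; adapt them to your copy of the dataset.
import numpy as np
from scipy.io import loadmat
from PIL import Image

def generate_dot_annotation(mat_path, image_size):
    """Render a binary dot map from a .mat annotation file."""
    width, height = image_size
    points = loadmat(mat_path)["annPoints"]  # shape (N, 2): (x, y) per head
    dot_map = np.zeros((height, width), dtype=np.uint8)
    for x, y in points:
        # Some annotations fall outside the image; clip instead of crashing.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        dot_map[row, col] = 255
    return Image.fromarray(dot_map)

img = Image.open("1.jpg")  # placeholder file names
generate_dot_annotation("1_ann.mat", img.size).save("1_dots.png")
```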
Issues with the dataset
Generating the dot-annotated files to produce the ground truth can be a tricky task. As you can see in the image below, some of the locations are wrong. The .mat file associated with the image should locate every person's head, but some locations are approximate (a shirt instead of the head), others do not locate the person at all, and worse, some points fall outside the image itself!
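A quick sanity check can flag the last problem automatically. The snippet below reports any head positions that fall outside the image bounds; the annPoints key and file names are the same assumptions as in the generator sketch:

```python
import numpy as np
from scipy.io import loadmat
from PIL import Image

def out_of_bounds_points(mat_path, image_path):
    """Return the annotation points lying outside the image."""
    width, height = Image.open(image_path).size
    points = loadmat(mat_path)["annPoints"]  # assumed key, shape (N, 2)
    bad = (points[:, 0] < 0) | (points[:, 0] >= width) | \
          (points[:, 1] < 0) | (points[:, 1] >= height)
    return points[bad]

print(out_of_bounds_points("1_ann.mat", "1.jpg"))  # placeholder file names
```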
The basics
During the training part: the original image I is mapped to a dot-annotated image, the ground truth. At this step we do not use the neural network, only image processing. The same image I is then fed into the neural network to produce a prediction that approximates the ground truth. The difference, or delta, between the ground truth and the prediction represents the prediction error, and the weights of the DNN (deep neural network) are updated to reduce it.
During the testing part: another image is fed into the DNN (using the previously learned weights) to produce a prediction. We compare the ground truth of this image with the prediction to compute the loss.
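In Keras terms, the two steps look roughly like the sketch below. The network and the placeholder arrays are illustrative, not the paper's architecture or real data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# An illustrative image-to-dot-map network (not the paper's architecture):
# a fully convolutional stack that predicts one value per pixel.
model = keras.Sequential([
    keras.Input(shape=(256, 256, 3)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.Conv2D(1, 1, activation="sigmoid"),   # predicted dot map
])
model.compile(optimizer="adam", loss="mse")      # pixel-wise regression loss

# Placeholder tensors standing in for real images and their dot annotations.
X_train = np.random.rand(4, 256, 256, 3)
Y_train = np.zeros((4, 256, 256, 1))

# Training: weights are updated to shrink the delta between the
# prediction and the ground-truth dot annotation.
model.fit(X_train, Y_train, batch_size=2, epochs=1)

# Testing: a held-out image passes through the learned weights; the loss
# against its ground truth measures generalization.
X_test, Y_test = np.random.rand(1, 256, 256, 3), np.zeros((1, 256, 256, 1))
print("test loss:", model.evaluate(X_test, Y_test))
```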
The network
The network consists of three modules:
- embedding
- matching
- adapting
In the embedding module, the exemplar image patch and the full-resolution image are encoded into a feature vector and a dense feature map, respectively.
In the matching module, we learn a discriminative classifier to densely match the exemplar patch to instances in the image.
In the adapting module, a small fraction of the parameters (about 3% of the network size) is trained to specialize the model to the target domain.
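To make the data flow concrete, here is a rough Keras sketch of the first two modules. The layer sizes, exemplar patch size, and the multiply-then-1x1-conv matching head are my illustrative assumptions, not the paper's exact architecture (which builds on a ResNet trunk with a learned matching head):

```python
from tensorflow import keras
from tensorflow.keras import layers

def embed(x, name):
    # Embedding trunk (depth/width are illustrative; the paper shares a
    # single ResNet trunk between the two inputs, which this sketch does not).
    for i, filters in enumerate([64, 128, 256]):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          name=f"{name}_conv{i}")(x)
        x = layers.MaxPooling2D(2, name=f"{name}_pool{i}")(x)
    return x

image = keras.Input(shape=(None, None, 3))   # full-resolution image
exemplar = keras.Input(shape=(64, 64, 3))    # exemplar patch (size assumed)

# Embedding module: a dense feature map for the image,
# a single feature vector for the exemplar patch.
feat_map = embed(image, "img")
feat_vec = layers.GlobalAveragePooling2D()(embed(exemplar, "ex"))
feat_vec = layers.Reshape((1, 1, 256))(feat_vec)

# Matching module: compare every location of the feature map with the
# exemplar vector; a 1x1 conv then acts as the discriminative classifier.
sim = layers.Lambda(lambda t: t[0] * t[1])([feat_map, feat_vec])
heatmap = layers.Conv2D(1, 1, activation="sigmoid")(sim)

model = keras.Model([image, exemplar], heatmap)
# Adapting module (not shown): freeze these weights and train only a small
# set of extra parameters (~3% of the network) on the target domain.
```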