The image classification models we have dealt with so far took an image and produced a categorical result, such as the digit class in the MNIST problem. However, in many cases we do not just want to know that a picture portrays objects - we want to be able to determine their precise location. This is exactly the point of object detection.
Image from YOLO v2 web site
Assuming we wanted to find a cat in a picture, a very naive approach to object detection would be the following:
Image from Exercise Notebook
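This naive approach amounts to sliding a fixed-size window across the image, classifying every patch, and keeping the one with the highest activation. A minimal sketch, assuming a stand-in `score_cat` classifier (in practice this would be a trained CNN):

```python
# Sketch of naive sliding-window detection. `score_cat` is a hypothetical
# stand-in for a real image classifier.

def score_cat(patch):
    # Dummy classifier: returns the mean pixel value as a "cat score".
    flat = [v for row in patch for v in row]
    return sum(flat) / len(flat)

def sliding_window_detect(image, win=2, stride=1):
    """Return (score, (row, col)) of the best-scoring window."""
    h, w = len(image), len(image[0])
    best = (float("-inf"), (0, 0))
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            patch = [row[c:c + win] for row in image[r:r + win]]
            best = max(best, (score_cat(patch), (r, c)))
    return best

image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
print(sliding_window_detect(image))  # -> (9.0, (1, 1))
```

Note that the window position only tells us roughly where the object is - it does not give us a tight bounding box, which motivates the regression approach below.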
However, this approach is far from ideal, because it only allows the algorithm to locate the object's bounding box very imprecisely. For more precise location, we need to run some sort of regression to predict the coordinates of bounding boxes - and for that, we need specific datasets.
This blog post has a great gentle introduction to detecting shapes.
You might run across the following datasets for this task:
While for image classification it is easy to measure how well the algorithm performs, for object detection we need to measure both the correctness of the predicted class and the precision of the inferred bounding box location. For the latter, we use the so-called Intersection over Union (IoU) metric, which measures how well two boxes (or two arbitrary areas) overlap.
Figure 2 from this excellent blog post on IoU
The idea is simple - we divide the area of intersection between two figures by the area of their union. For two identical areas, IoU is 1, while for completely disjoint areas it is 0; in all other cases it lies strictly between 0 and 1. We typically only consider those bounding boxes for which IoU exceeds a certain threshold.
Suppose we want to measure how well a given class of objects $C$ is recognized. To measure it, we use the Average Precision (AP) metric, which is calculated as follows:
Image from NeuroWorkshop
The Average Precision for a given class $C$ is the area under this curve. More precisely, the Recall axis is typically sampled at 11 evenly spaced points ($0, 0.1, \dots, 1.0$), and Precision is averaged over those points:
$$ AP = {1\over11}\sum_{i=0}^{10}\mbox{Precision}(\mbox{Recall}={i\over10}) $$
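The formula above can be sketched in code. In practice, the interpolated precision is used at each sample point - the maximum precision achieved at any recall level greater than or equal to it - which makes the metric robust to local wiggles in the curve:

```python
def ap_11_point(recalls, precisions):
    """11-point interpolated AP: average the interpolated precision
    (max precision at any recall >= r) at r = 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        r = i / 10
        # Interpolation: best precision achievable at recall >= r.
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11

# A toy precision-recall curve (hypothetical numbers):
recalls    = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 1.0, 0.8, 0.7, 0.5, 0.4]
print(round(ap_11_point(recalls, precisions), 3))  # -> 0.709
```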
We consider only those detections for which IoU is above a certain value. For example, the PASCAL VOC dataset typically assumes $\mbox{IoU Threshold} = 0.5$, while in COCO, AP is measured over a range of different values of $\mbox{IoU Threshold}$.
Image from NeuroWorkshop
The main metric for Object Detection is called Mean Average Precision, or mAP. It is the value of Average Precision averaged across all object classes, and sometimes also over $\mbox{IoU Threshold}$. The process of calculating mAP is described in more detail in this blog post, and also here with code samples.
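Given per-class AP values, mAP itself is just an average (the class names and numbers below are hypothetical):

```python
def mean_average_precision(ap_per_class):
    """mAP: Average Precision averaged over all object classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical per-class AP values at a fixed IoU threshold:
aps = {"cat": 0.71, "dog": 0.65, "car": 0.80}
print(round(mean_average_precision(aps), 3))  # -> 0.72
```

COCO-style evaluation additionally repeats this computation at several IoU thresholds (0.5 to 0.95 in steps of 0.05) and averages the results.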
There are two broad classes of object detection algorithms: region proposal (two-pass) networks, such as the R-CNN family, and one-pass networks, such as YOLO.
R-CNN uses Selective Search to generate a hierarchical structure of ROI (region of interest) proposals, which are then passed through a CNN feature extractor and SVM classifiers to determine the object class, and through a linear regressor to refine the bounding box coordinates. Official Paper
Image from van de Sande et al. ICCV’11
*Images from this blog
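The R-CNN pipeline described above can be sketched structurally with stub components. Every function here is a hypothetical stand-in (Selective Search, the CNN backbone, the SVMs, and the box regressor are all real, heavyweight components in the paper):

```python
# Structural sketch of the R-CNN pipeline; all components are stubs.

def selective_search(image):
    # Stub: would return candidate ROIs (x1, y1, x2, y2) from the image.
    return [(0, 0, 10, 10), (5, 5, 20, 20)]

def cnn_features(image, roi):
    # Stub: would warp the ROI to a fixed size and run a CNN backbone.
    x1, y1, x2, y2 = roi
    return [x2 - x1, y2 - y1]  # pretend box size is the feature vector

def svm_classify(features):
    # Stub: would score each class with a per-class SVM.
    return ("cat", 0.9) if features[0] > 10 else ("background", 0.6)

def bbox_regress(features, roi):
    # Stub: would refine the box coordinates with a linear regressor.
    return roi

def rcnn_detect(image):
    detections = []
    for roi in selective_search(image):
        f = cnn_features(image, roi)
        label, score = svm_classify(f)
        if label != "background":
            detections.append((label, score, bbox_regress(f, roi)))
    return detections

print(rcnn_detect(image=None))  # -> [('cat', 0.9, (5, 5, 20, 20))]
```

The key cost of this design is that the CNN runs once per proposal - often thousands of times per image - which is what the later variants below address.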
This approach is similar to R-CNN, but the regions of interest are defined after the convolutional layers have been applied, so feature extraction runs only once per image.
Image from the Official Paper, arXiv, 2015
The main idea of this approach is to use a neural network - the so-called Region Proposal Network - to predict ROIs, replacing the slow Selective Search step. Paper, 2016
Image from the official paper
This algorithm is even faster than Faster R-CNN. The main idea is the following:
Image from official paper
YOLO (You Only Look Once) is a real-time one-pass algorithm. The main idea is the following:
Image from official paper
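YOLO divides the image into an $S\times S$ grid of cells, each of which predicts bounding boxes (with a confidence score) plus class probabilities, all in a single forward pass. A toy sketch of decoding such an output grid - the dimensions, values, and class names here are invented for illustration, and each cell predicts just one box:

```python
# Sketch of decoding a YOLO-style output grid (toy values, no real model).
# Each of the S*S cells predicts one box (x, y, w, h, confidence) plus
# per-class probabilities; (x, y) is the box center relative to the cell.

S = 2
CLASSES = ["cat", "dog"]

# Toy network output: grid[row][col] = (x, y, w, h, conf, p_cat, p_dog)
grid = [
    [(0.5, 0.5, 0.2, 0.2, 0.1, 0.6, 0.4), (0.5, 0.5, 0.4, 0.4, 0.9, 0.8, 0.2)],
    [(0.5, 0.5, 0.3, 0.3, 0.05, 0.5, 0.5), (0.5, 0.5, 0.1, 0.1, 0.2, 0.3, 0.7)],
]

def decode(grid, conf_threshold=0.5):
    detections = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf, *probs = grid[row][col]
            if conf < conf_threshold:
                continue  # drop low-confidence boxes
            # Cell-relative center -> image-relative coordinates in [0, 1].
            cx, cy = (col + x) / S, (row + y) / S
            cls = CLASSES[max(range(len(probs)), key=lambda i: probs[i])]
            detections.append((cls, conf, (cx, cy, w, h)))
    return detections

print(decode(grid))  # -> [('cat', 0.9, (0.75, 0.25, 0.4, 0.4))]
```

Because the whole grid is produced in one forward pass, detection speed is bounded by a single network evaluation, which is what makes YOLO real-time. Real implementations also apply non-maximum suppression to merge overlapping boxes.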
Continue your learning in the following notebook:
In this lesson you took a whirlwind tour of all the various ways that object detection can be accomplished!
Read through these articles and notebooks about YOLO and try them for yourself