Understanding the popular dataset format for computer vision
COCO (official website), which stands for "Common Objects in Context", is a set of challenging, high-quality datasets for computer vision, used mostly to train state-of-the-art neural networks. The same name is also used for the annotation format these datasets share.
Quoting the creators of COCO:
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images (>200K labeled)
- 1.5 million object instances
- 80 object categories
Advanced neural network libraries understand this dataset format out of the box, e.g. Facebook's Detectron2 (link). There are even tools built specifically for working with datasets in COCO format, e.g. COCO-annotator and COCOapi. Understanding how this dataset is represented will help you use and modify existing datasets, and also help you create custom ones. Specifically, we are interested in the annotations file, since the complete dataset consists of an image directory and an annotations file, which provides the metadata used by machine learning algorithms.
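For example, with COCOapi (distributed on PyPI as pycocotools), loading a dataset and querying its annotations takes only a few lines; a minimal sketch, assuming the official instances_val2017.json annotation file has been downloaded:

from pycocotools.coco import COCO

# parse the annotation file and build lookup indices
coco = COCO("annotations/instances_val2017.json")

# find all images containing the "person" category
person_ids = coco.getCatIds(catNms=["person"])
image_ids = coco.getImgIds(catIds=person_ids)

# load the annotations of the first such image
annotation_ids = coco.getAnnIds(imgIds=image_ids[0])
annotations = coco.loadAnns(annotation_ids)
print(len(annotations), "annotations for image", image_ids[0])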
There are actually multiple COCO datasets, each created for a specific machine learning task, with additional data. The 3 most popular tasks are:
- object detection: the model has to produce bounding boxes for objects, i.e. return the list of object classes and the coordinates of the rectangles surrounding them; objects (also called "things") are discrete, separate objects, often with parts, like humans and cars; the official dataset for this task also contains additional data for object segmentation (see below)
- object/instance segmentation: the model has to produce not only bounding boxes for objects (instances/"things"), but also segmentation masks, i.e. the coordinates of polygons outlining the objects
- stuff segmentation: the model has to segment not discrete objects ("things"), but continuous background patterns ("stuff") like grass or sky
These tasks are widely used in computer vision, e.g. in autonomous vehicles (detecting people and other vehicles), AI-based security (person detection and/or segmentation), and object re-identification (where object segmentation, or background removal based on it, helps verify the identity of objects).
Basic structure and common elements
The file format used by COCO annotations is JSON, which has a dictionary (key-value pairs between curly braces, {…}) as its top-level value. It can also have lists (ordered collections of items between square brackets, […]) or dictionaries nested inside.
The basic structure is as follows:
{
"info": {…},
"licenses": […],
"images": […],
"categories": […],
"annotations": […]
}
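Since this is plain JSON, you can inspect an annotation file with nothing but the Python standard library; a minimal sketch (the file path is just an example):

import json

# load the whole annotation file into a regular Python dictionary
with open("annotations/instances_val2017.json") as f:
    dataset = json.load(f)

print(list(dataset.keys()))  # the five top-level keys described below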
Let's take a look at each of these values.
"information" section
This dictionary contains metadata about the dataset. For the official COCO datasets, it is as follows:
{
"description": "COCO 2017 Dataset",
"url": "http://cocodataset.org",
"version": "1.0",
"year": 2017,
"contributor": "COCO Consortium",
"date_created": "2017/09/01"
}
As you can see, it only contains basic information, with the "url" value pointing to the dataset's official website (e.g. the UCI repository page or a separate domain). It is common for machine learning datasets to point to their websites for additional information, e.g. how and when the data was acquired.
"licenses" section
Here are links to the licenses of the images in the dataset, e.g. Creative Commons licenses, with the following structure:
[
{
"url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
"identification": 1,
"name": "Attribution-NonCommercial-ShareAlike License"
},
{
"url": "http://creativecommons.org/licenses/by-nc/2.0/",
"identification": 2,
"name": "Attribution License - Non-Commercial"
},
…
]
The important thing to note here is the "id" field: each image in the "images" dictionary has to have its license "id" specified.
When using the images, make sure you don't infringe on their licenses; the full text of each license can be found at its URL.
If you decide to create your own dataset, please assign the appropriate license to each image; if you are not sure, it is better not to use the image.
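To verify licenses programmatically, you can build a lookup table from license "id" to license name; a short sketch, reusing the dataset dictionary loaded earlier:

# map each license "id" to its human-readable name
license_names = {lic["id"]: lic["name"] for lic in dataset["licenses"]}

# print the license of each image in the dataset
for image in dataset["images"]:
    print(image["file_name"], "-", license_names[image["license"]])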
"images" section
Possibly second most important, this dictionary contains metadata about images:
{
"license": 3,
"file_name": "000000391895.jpg",
"coco_url": "http://images.cocodataset.org/train2017/000000391895.jpg",
"height": 360,
"ancho": 640,
"date_captured": "2013–11–14 11:18:45",
"flickr_url": "http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg",
"id": 391895
}
Let's go through the fields one by one:
"license"
: the image license ID of the"licenses"
section"File name"
: the name of the file in the images directory"coco_url"
,"flickr_url"
: URLs to the copy of the image hosted online"height"
,"ancho"
: image size, very useful in low-level languages like C where getting array size is impossible or difficult"data_captured"
: when the photo was taken
The most important field is the "id" field. This is the number used in "annotations" to identify the image, so if you wanted e.g. to find the annotations for a given image file, you would have to look up the "id" of the matching image dictionary in "images" and then cross-reference it in "annotations".

In the official COCO dataset, the "id" is the same as the "file_name" (after removing the leading zeros). Note that this may not necessarily be the case for custom COCO datasets; it is not an enforced rule, e.g. datasets created from private photos may keep the original photo names, which have nothing in common with the "id".
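Here's how such a cross-reference might look in plain Python, reusing the dataset dictionary loaded earlier (the file name is just an example):

# find the "id" of the image with the given file name
file_name = "000000391895.jpg"
image_id = next(img["id"] for img in dataset["images"] if img["file_name"] == file_name)

# collect all annotations whose "image_id" points to that image
image_annotations = [ann for ann in dataset["annotations"] if ann["image_id"] == image_id]
print(f"{file_name}: {len(image_annotations)} annotations")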
"categories" section
This section differs slightly between the object detection/segmentation tasks and the stuff segmentation task.
Object detection / object segmentation:
[
{"supercategory": "person", "id": 1, "name": "person"},
{"supercategory": "vehicle", "id": 2, "name": "bicycle"},
{"supercategory": "vehicle", "id": 3, "name": "car"},
…
{"supercategory": "interior", "id": 90, "name": "toothbrush"}
]
These are the classes of objects that can be detected in images ("categories" is simply another name in COCO for the classes you may know from supervised machine learning).

Each category has a unique "id", which has to be in the range [1, number of categories]. Categories are also grouped into "supercategories", which you can use in your programs, e.g. to detect vehicles in general, when it doesn't matter whether it's a bicycle, a car or a truck.
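For instance, grouping category IDs by their supercategory makes the "vehicles in general" case easy; a minimal sketch, again reusing the dataset dictionary from earlier:

from collections import defaultdict

# group category IDs by their supercategory
supercategories = defaultdict(set)
for category in dataset["categories"]:
    supercategories[category["supercategory"]].add(category["id"])

# any annotation whose "category_id" is in this set is a vehicle,
# no matter if it's a bicycle, a car or a truck
vehicle_ids = supercategories["vehicle"]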
Stuff segmentation:
[
{"supercategory": "textile", "id": 92, "name": "banner"},
{"supercategory": "textile", "id": 93, "name": "cobertor"},
…
{"supercategory": "other", "id": 183, "name": "other"}
]
The category numbering starts high to avoid conflicts with object segmentation, since these tasks are sometimes performed together (the so-called panoptic segmentation task, which also has a very challenging COCO dataset). IDs 92-182 are the actual background elements, while ID 183 represents all other background textures that don't have their own classes.
"notes" section
This is the most important section of the dataset; it contains the vital information for the task the given COCO dataset was made for.
{
"segmentation":
[[
239.97,
260.24,
222.04,
…
]],
"area": 2765.1486500000005,
"iscrowd": 0,
"image_id": 558840,
"bbox":
[
199.84,
200.46,
77.71,
70.88
],
"category_id": 58,
"id": 156
}
"segmentation"
: a list of segmentation mask pixels; this is a flat list of pairs, so you should take the first and second values (x and y in the image), then the third and fourth, etc. get the coordinates; note that these are not image indices as they are float numbers - they are created and compressed by tools like the COCO annotator from raw pixel coordinates"area"
: number of pixels inside the segmentation mask"team"
: if the annotation is for a single object (value 0) or for several objects close to each other (value 1); for material segmentation, this field is always 0 and is ignored"image_id"
: the 'id' field of the 'images' dictionary;Notice:this value should be used to cross-reference the image with other dictionaries, not with the"I WENT"
¡campo!"bbox"
: bounding box, i. my. the coordinates (top left x, top left y, width, height) of the rectangle around the object; is very useful for extracting individual objects from images, as in many languages like Python this can be done by accessing the array of images likeobjeto_recortado = imagen[bbox[0]:bbox[0] + bbox[2], bbox[1]:bbox[1] + bbox[3]]
"ID from category"
: object class, corresponding to the"I WENT"
field in"categories"
"I WENT"
: unique identifier for the annotation;Notice:this is just an annotation ID, this does not point to the specific image in other dictionaries!
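Putting these fields together, extracting a single annotated object from its image might look as follows; this is a sketch assuming NumPy-style [y, x] indexing and that the image file is available locally under its "file_name":

import numpy as np
from PIL import Image

annotation = dataset["annotations"][0]  # any annotation with "iscrowd": 0

# cross-reference the image via "image_id" (not "id"!)
image_meta = next(img for img in dataset["images"] if img["id"] == annotation["image_id"])
image = np.asarray(Image.open(image_meta["file_name"]))

# "bbox" is (top left x, top left y, width, height), stored as floats
x, y, w, h = (int(round(value)) for value in annotation["bbox"])
cropped_object = image[y:y + h, x:x + w]

# turn the flat "segmentation" list into (x, y) coordinate pairs
polygon = annotation["segmentation"][0]
points = list(zip(polygon[0::2], polygon[1::2]))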
When working with crowd images ("iscrowd": 1), the "segmentation" part may look a little different:
"segmentation":
{
"accounts": [179,27,392,41,…,55,20],
"size": [426640]
}
This is because, for a large number of pixels, explicitly listing every pixel of the segmentation mask would take up too much space. Instead, COCO uses a custom run-length encoding (RLE) compression, which is very efficient, since segmentation masks are binary and RLE for just 0s and 1s can shrink them several times over.
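Decoding this uncompressed RLE by hand is simple: the counts alternate between runs of 0s and 1s, laid out in column-major (Fortran) order. A minimal sketch; for real work, pycocotools ships optimized mask utilities:

import numpy as np

def rle_decode(rle):
    # rle is a dict like {"counts": [...], "size": [height, width]}
    height, width = rle["size"]
    mask = np.zeros(height * width, dtype=np.uint8)
    position, value = 0, 0
    for run_length in rle["counts"]:
        mask[position:position + run_length] = value
        position += run_length
        value = 1 - value  # runs alternate between 0s and 1s
    # COCO stores masks in column-major (Fortran) order
    return mask.reshape((height, width), order="F")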
We explored the COCO dataset format for the most popular tasks: object detection, object segmentation, and stuff segmentation. The official COCO datasets are high quality, large, and suitable for beginner projects, production environments, and state-of-the-art research alike. I hope this article helped you understand how to interpret and work with this format for your machine learning applications.