iCubWorld
Welcome to iCubWorld
How many objects can iCub recognize? This question led us in 2013 to start the iCubWorld project, with the main goal of benchmarking the development of the visual recognition capabilities of the iCub robot. The project is the result of a collaboration between the Istituto Italiano di Tecnologia (IIT) - iCub Facility, the University of Genoa - DIBRIS - SlipGURU research group, and the Laboratory for Computational and Statistical Learning.
iCubWorld datasets are collections of images recording the visual experience of iCub while it observes objects in its typical environment, a laboratory or an office. The acquisition setting is devised to allow natural human-robot interaction: a teacher verbally provides the label of the object of interest and shows it to the robot by holding it in his/her hand; the iCub can either track the object while the teacher moves it, or take it in its own hand.
Since 2013, we have published four iCubWorld releases of increasing size, aimed at investigating complementary aspects of robotic visual recognition. These image collections allow for extensive analysis of the behaviour of recognition systems trained in different conditions, offering a faithful and reproducible benchmark for the performance that can be expected from the real system.
Acquisition Setup
Images in iCubWorld datasets are annotated with the label of the object represented and a bounding box around it. We developed a Human-Robot-Interaction application to acquire annotated images by exploiting the real-world context and the interaction with the robot. This setup allows us to build large annotated datasets in a fast and natural way.
The only external supervision is in the form of a human teacher verbally providing the label of the object to be acquired. The teacher approaches the robot and shows the object in his/her hand; during the acquisition, localization of the object in the robot's visual field is performed by exploiting self-supervision techniques.
Two acquisition modalities are possible: human or robot mode.
HUMAN MODE
The teacher moves the object while holding it in the hand, and the robot tracks it by exploiting either motion or depth cues.
ROBOT MODE
The robot takes the object in its hand and focuses on it by using knowledge of its own kinematics.
We record the incoming frames from the robot's cameras, together with the bounding box information.
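For illustration only, the following Python sketch shows the kind of processing involved: the closest blob in a depth map is segmented to obtain the object's bounding box and centroid, and each frame is saved together with its annotation. This is not the actual iCub application (which will be released with the code described below); the depth and RGB sources, file names, and annotation layout are assumptions.

    # Illustrative sketch, not the actual iCub acquisition application.
    # Assumes depth_map (float meters, 0 = invalid reading) and rgb_frame
    # arrays are already available from the robot's cameras.
    import cv2
    import numpy as np

    def closest_blob_annotation(depth_map, band=0.15):
        """Return (x, y, w, h, cx, cy) for the largest blob lying within
        `band` meters of the closest valid depth, or None if none is found."""
        valid = depth_map > 0
        if not np.any(valid):
            return None
        d_min = depth_map[valid].min()
        mask = (valid & (depth_map < d_min + band)).astype(np.uint8) * 255
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        blob = max(contours, key=cv2.contourArea)
        x, y, w, h = cv2.boundingRect(blob)
        cx, cy = x + w // 2, y + h // 2
        return x, y, w, h, cx, cy

    def log_frame(rgb_frame, annotation, frame_id, annotations_file, out_dir="frames"):
        """Save the frame to disk and append 'frame_id x y w h cx cy'
        to an open annotation file (hypothetical plain-text format)."""
        cv2.imwrite(f"{out_dir}/{frame_id:08d}.jpg", rgb_frame)
        annotations_file.write(" ".join(str(v) for v in (frame_id, *annotation)) + "\n")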
Datasets
iCubWorld is an ongoing project. The latest release is iCubWorld Transformations, described in detail in the following.
Follow this link for previous releases.
Code
This section is under construction. We are working to provide documentation and support for the following code as soon as possible:
- iCub application to acquire iCubWorld releases
- MATLAB code to automatically format the acquired images in a directory tree (similar to the one we released for iCubWorld Transformations)
- MATLAB functions providing utilities to train Caffe deep networks on arbitrary subsets of the acquired dataset, by setting model and back-propagation hyperparameters (e.g. layer learning rates, solver type) programmatically through configuration files
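As an informal example of what the training utilities will cover, back-propagation hyperparameters in Caffe are typically collected in a solver definition that can be generated programmatically; the sketch below (in Python rather than MATLAB) uses placeholder paths and values and is not the code that will be released.

    # Minimal sketch: generate a Caffe solver definition programmatically.
    # Paths and hyperparameter values are placeholders.
    # Per-layer learning rates would instead be set via lr_mult in the net definition.
    solver_params = {
        "net": '"train_val.prototxt"',   # network definition (placeholder path)
        "type": '"SGD"',                 # solver type, e.g. "SGD" or "Adam"
        "base_lr": 0.001,                # base learning rate
        "momentum": 0.9,
        "weight_decay": 0.0005,
        "lr_policy": '"step"',           # learning rate decay policy
        "stepsize": 2000,
        "gamma": 0.1,
        "max_iter": 10000,
        "snapshot": 2000,
        "snapshot_prefix": '"snapshots/icw"',
        "solver_mode": "GPU",            # enum value, unquoted in prototxt
    }

    with open("solver.prototxt", "w") as f:
        for key, value in solver_params.items():
            f.write(f"{key}: {value}\n")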
Publications
A list of publications related to iCubWorld.
People
The list of people who have worked on the iCubWorld project.
Credits
The list of people who have supported and contributed to the iCubWorld project.
iCubWorld Transformations
No. categories | 20 |
No. objects/category | 10 |
Acquisitions per object | 4 isolated transf. + mixed sequence, 2 different days |
No. frames/acquisition | 150-200 |
Tracking cue | Depth |
No. acquired cameras | 2 |
ACQUISITION DETAILS
Each object was acquired in Human Mode (see Acquisition Section) while undergoing 4 isolated viewpoint transformations:
2D ROTATION
The human rotated the object in front of the robot, parallel to the camera plane, keeping it at the same distance and position.
3D ROTATION
Like 2D ROTATION, but with the human applying out-of-plane rotations to the object and changing the face visible to the robot.
SCALING
The human moved the hand holding the object back and forth, thus changing the object's scale with respect to the cameras.
BACKGROUND CHANGE
The human moved in a semi-circle around the iCub, keeping the distance and pose of the in-hand object approximately constant with respect to the cameras. The background thus changes dramatically, while the object's appearance remains roughly the same.
A sequence called MIX was also acquired, with the human moving the object naturally in front of the robot, as a person would do when showing a new item to a child. In this sequence, all nuisances can appear, in any combination.
Two sessions were performed on different days, leading to two collections of images that differ by small changes in the environment and lighting conditions.
Each acquisition lasted 20 seconds (40 seconds for MIX), recording 640x480 images from iCub's left and right cameras at approximately 10 fps.
For each image, the object's centroid and enclosing bounding box, provided by a depth segmentation routine (see the Acquisition Section and the related paper for details on this procedure), were recorded and are made available with the images. See below for details on the annotation format.
Together with the full images, we release a dataset version where the images have been pre-cropped to 256x256 around the object's centroid.
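As a point of reference, a 256x256 crop centered on the annotated centroid can be obtained along these lines (a minimal sketch; the released pre-cropped images were generated by our own scripts, which may handle image borders differently):

    import numpy as np

    def crop_around_centroid(image, cx, cy, size=256):
        """Crop a size x size window centered on (cx, cy), shifting the window
        as needed so that it stays inside the image (border handling assumed)."""
        h, w = image.shape[:2]
        x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
        y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
        return image[y0:y0 + size, x0:x0 + size]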
DOWNLOAD - IMAGES AND AUTOMATIC ANNOTATIONS
The 20 categories in the dataset are 'body lotion', 'book', 'cellphone', 'flower', 'glass', 'hairbrush', 'hair clip', 'mouse', 'mug', 'oven glove', 'pencil case', 'perfume', 'remote', 'ring binder', 'soap dispenser', 'soda bottle', 'sprayer', 'squeezer', 'sunglasses', 'wallet'.
To facilitate download, we split the images into 4 parts:
- PART1: book, cellphone, mouse, pencilcase, ringbinder.
- PART2: hairbrush, hairclip, perfume, sunglasses, wallet.
- PART3: flower, glass, mug, remote, soapdispenser.
- PART4: bodylotion, ovenglove, sodabottle, sprayer, squeezer.
DOWNLOAD - IMAGES AND MANUAL ANNOTATIONS
We also make available manual annotations for a subset of around 6K images in the dataset, in order to provide a human-annotated test set for validating detection methods. While annotating, we adopted the policy that an object must be annotated if at least around 25-50% of its shape is visible (i.e., not cut off by the image border or occluded).
We randomly chose object instances from each category in the dataset and annotated 150 frames of the MIX sequence, acquired from iCub's left camera, on the first acquisition day available for each object.
The selected objects are: 'ringbinder4', 'ringbinder6', 'ringbinder5', 'ringbinder7', 'flower7', 'flower5', 'flower2', 'flower9', 'perfume1', 'hairclip2', 'hairclip6', 'hairclip8', 'hairclip9', 'hairbrush3', 'sunglasses7', 'sodabottle2', 'sodabottle3', 'sodabottle4', 'sodabottle5', 'soapdispenser5', 'ovenglove7', 'remote7', 'mug1', 'mug3', 'mug4', 'mug9', 'glass8', 'bodylotion8', 'bodylotion5', 'bodylotion2', 'bodylotion4', 'book6', 'book4', 'book9', 'book1', 'cellphone1', 'mouse9', 'pencilcase5', 'pencilcase3', 'pencilcase1', 'pencilcase6', 'wallet6', 'wallet7', 'wallet10', 'wallet2', 'sprayer6', 'sprayer8', 'sprayer9', 'sprayer2', 'squeezer5'.
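Since this subset is intended as a test set for detection methods, evaluation typically boils down to comparing predicted and annotated boxes with an intersection-over-union (IoU) criterion; a minimal sketch, assuming boxes in (x, y, w, h) pixel format:

    def iou(box_a, box_b):
        """Intersection over union of two boxes given as (x, y, w, h)."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ix0, iy0 = max(ax, bx), max(ay, by)
        ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    # A detection is commonly counted as correct when iou(pred, gt) >= 0.5
    # and the predicted label matches the annotated one.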
DOWNLOAD - ADDITIONAL SEQUENCES AND MANUAL ANNOTATIONS
As a further test set, we make available 4 image sequences depicting subsets of the objects in the dataset randomly positioned on a table (TABLE), on the floor (FLOOR1 and FLOOR2) and on a shelf (SHELF).
Overall, the sequences comprise around 300 images, which we manually annotated following the same policy used for the manually annotated images in the dataset (see the previous section).
The sequences come with various challenges: some (TABLE, SHELF) may contain objects that are not part of the dataset (like a laptop or a monitor) and hence are not to be detected, and every sequence depicts the objects on a different background and under different lighting conditions.
The objects contained in each sequence are listed below:
- TABLE: perfume1, sodabottle2, ovenglove7, mug1, flower7, sprayer6, ringbinder4, remote7, squeezer5, pencilcase5, soapdispenser5.
- FLOOR1: cellphone1, mouse9, ringbinder4, flower7, sprayer6, mug1, remote7, hairbrush3, hairclip2, wallet6, perfume1, soapdispenser5, sodabottle2, squeezer5.
- FLOOR2: cellphone1, mouse9, ringbinder4, flower7, sprayer6, mug1, hairbrush3, wallet6, perfume1, soapdispenser5, sodabottle2, squeezer5.
- SHELF: ovenglove7, mug1, squeezer5, ringbinder4, pencilcase5, flower7, sprayer6, soapdispenser5, sodabottle2, perfume1.
NEW RELEASE!! DOWNLOAD - TABLE-TOP SEQUENCES AND MANUAL ANNOTATIONS
The acquired data are split into 2 sets of sequences, each with a different tablecloth: (i) pink and white polka dots (POIS) and (ii) plain white (WHITE). For each set we split the 21 objects into 5 groups and acquired 2 sequences per group for the WHITE set and 1 sequence per group for the POIS set, gathering a total of 2K images for the WHITE set and 1K images for the POIS set.
For each sequence, the robot is placed in front of the objects and executes a set of pre-scripted exploratory movements to acquire images depicting the objects from different perspectives, scales, and viewpoints. We used a table-top segmentation procedure to gather the ground-truth object locations and labels, and we manually refined them following the same policy used for the manually annotated images described above.
This dataset was used for part of the experimental evaluation in a paper presented at Humanoids 2019.
Humanoids 2017 subset
A first subset of the manually annotated image set was released and used in a paper presented at Humanoids 2017. In particular, we randomly chose one object instance from each category in the dataset and annotated 150 frames of the MIX sequence, acquired from iCub's left camera, on the first acquisition day available for that object.
IROS 2016 subset
A first subset of this dataset was released and used in a paper presented at IROS 2016. In particular, it includes 15 categories and the in-plane transformations (2D ROTATION, SCALING, BACKGROUND CHANGE), with images cropped to 256x256. For convenience, we make this subset available for download below.
iCubWorld28
No. categories | 7 |
No. objects/category | 4 |
Acquisitions per object | 4 different days, train & test |
No. frames/acquisition | 150 |
Tracking cue | Independent Motion |
No. acquired cameras | 1 |
Each object was acquired in human mode (see Acquisition Section), repeating the procedure twice to obtain separate train and test sequences; 4 acquisition sessions were performed on 4 different days. Images acquired on different days differ by small changes in the environment and lighting conditions.
For each acquisition, around 150 images of size 320x240 were recorded from one iCub camera over 15-20 seconds. A square bounding box of size 128x128 was then cropped around the object. Both the full images and their crops are available.
We also provide a manually segmented version of the dataset acquired on Day 4. In this case the segmentation was performed by clicking on the image and dragging a rectangular shape enclosing the object. We used this version to measure the recognition performance gap between the automatic motion-based segmentation procedure and an 'ideal' segmentation.
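The click-and-drag procedure used for the Day 4 segmentation can be reproduced in spirit with a simple interactive tool; the sketch below relies on OpenCV's ROI selector and is only an illustration, not the tool we actually used:

    # Illustrative click-and-drag rectangle annotation (not the original tool).
    import cv2

    def annotate_rectangle(image_path):
        """Show the image, let the user drag a rectangle around the object,
        and return it as (x, y, w, h); (0, 0, 0, 0) means nothing was selected."""
        image = cv2.imread(image_path)
        box = cv2.selectROI("annotate", image, showCrosshair=True, fromCenter=False)
        cv2.destroyWindow("annotate")
        return box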
iCubWorld 1.0
No. categories | 10 |
No. objects/category | 4 |
Acquisitions per object | human mode (train), 4 test sets available |
No. frames/acquisition | 200 |
Tracking cue | Independent Motion |
No. acquired cameras | 1 |
For training, we collected 3 objects per category. Each object was acquired in human mode (see Acquisition Section), recording 200 images of size 640x480 from one iCub camera, subsequently cropped to a 160x160 bounding box around the object.
We provide the following 4 test sets:
DEMONSTRATOR
1 known instance per category, acquired in human mode by a different demonstrator than the one in the training set
CATEGORIZATION
1 new instance per category, acquired in the same setting as the training set
ROBOT
1 known or new instance per category acquired in robot mode (the bounding box in this case is 320x320)
BACKGROUND
10 images per category on which the classifiers achieve 99% accuracy, provided together with the objects' segmentation masks, to test whether the classifier recognizes the object or the background
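In practice, this test amounts to scoring the classifier on object-only and background-only versions of each image obtained from the provided masks; a minimal sketch, assuming a binary mask with nonzero pixels on the object:

    import numpy as np

    def split_object_background(image, mask, fill=0):
        """Return (object_only, background_only) images from a binary mask
        with nonzero pixels on the object; masked-out pixels are set to `fill`."""
        on_object = mask[..., None] > 0          # broadcast over color channels
        obj = np.where(on_object, image, fill).astype(image.dtype)
        bkg = np.where(on_object, fill, image).astype(image.dtype)
        return obj, bkg

    # If the classifier still answers correctly on the background-only images,
    # it is likely relying on context rather than on the object itself.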
Hello iCubWorld
No. categories | 7 |
No. objects/category | 1 |
Acquisitions per object | human & robot mode, train & test |
No. frames/acquisition | 500 |
Tracking cue | Independent Motion |
No. acquired cameras | 1 |
Each object was acquired in human and robot mode (see Acquisition Section), repeating the procedure twice to obtain separate train and test sequences. For each acquisition, 500 images of size 320x240 were recorded from one iCub camera. A square bounding box of size 80x80 (human mode) or 160x160 (robot mode) was then cropped around the object and saved.