Processing datasets

Some datasets require special treatment, or their download or extraction works only on Linux. These cases are listed below in the Processing requirements column. We also fixed labels in some datasets, as described in the Dataset modifications column.

Dataset | Publication | Processing requirements | Dataset modifications
AAUZebrafish | publication link | Kaggle required | -
AerialCattle2017 | publication link | - | -
AmvrakikosTurtles | publication link | Kaggle required | -
ATRW | publication link | - | -
BelugaID | - | - | -
BirdIndividualID | publication link | Manual download | Few images removed
CatIndividualImages | - | Kaggle required | -
Chicks4FreeID | - | - | -
CTai | publication link | - | Labels fixed
CZoo | publication link | - | -
CowDataset | - | - | -
Cows2021 | publication link | - | Labels fixed
DogFaceNet | publication link | - | -
Drosophila | publication link | - | Few images removed
ELPephants | publication link | Manual download | -
FriesianCattle2015 | publication link | - | Labels fixed
FriesianCattle2017 | publication link | - | -
Giraffes | publication link | Only Linux: downloading | -
GiraffeZebraID | publication link | - | -
HappyWhale | publication link | Kaggle required + terms | Species fixed
HumpbackWhaleID | - | Kaggle required + terms | Unknown animals renamed
HyenaID2022 | - | - | -
IPanda50 | publication link | - | Few images renamed
LeopardID2022 | - | - | Unknown animals renamed
LionData | publication link | - | -
MacaqueFaces | publication link | - | -
MPDD | - | - | -
NDD20 | publication link | - | -
NOAARightWhale | - | Kaggle required + terms | -
NyalaData | publication link | - | -
OpenCows2020 | publication link | - | -
PolarBearVidID | publication link | - | -
ReunionTurtles | publication link | Kaggle required | -
SealID | publication link | Download with single-use token | -
SeaStarReID2023 | publication link | - | -
SeaTurtleID2022 | publication link | Kaggle required | -
SMALST | publication link | Only Linux: extracting | -
SouthernProvinceTurtles | publication link | Kaggle required | -
StripeSpotter | publication link | Only Linux: extracting | -
WhaleSharkID | publication link | - | -
ZakynthosTurtles | publication link | Kaggle required | -
ZindiTurtleRecall | - | - | -

Manual download and extracting

Kaggle

Some datasets are stored on Kaggle. To use our automatic download method, follow the steps described in the Installation and Authentication sections.
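
Before starting an automatic download, it can help to check that Kaggle credentials are visible to Python. The sketch below assumes the usual Kaggle API conventions: either the KAGGLE_USERNAME and KAGGLE_KEY environment variables, or a kaggle.json file in ~/.kaggle (or in the directory named by KAGGLE_CONFIG_DIR). The function name kaggle_credentials_available is hypothetical and is not part of this package; your Kaggle client version may also look in other locations.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle API credentials appear to be configured."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials passed through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the configuration directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_available())
```

If the check returns False, revisit the Authentication steps before calling any of the Kaggle-backed dataset downloads.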

AAUZebrafish

Kaggle requirements need to be satisfied.

AmvrakikosTurtles

Kaggle requirements need to be satisfied.

BirdIndividualID

The dataset is stored on Google Drive but needs to be downloaded manually due to its size. After downloading it, place it into the folder BirdIndividualID and run

datasets.BirdIndividualID.extract('data/BirdIndividualID')

to extract it. Do not extract it manually because there is some postprocessing involved.

CatIndividualImages

Kaggle requirements need to be satisfied.

ELPephants

The dataset is stored privately, and you need to acknowledge the license terms before downloading it. After doing so, extract it manually or run

datasets.ELPephants.extract('data/ELPephants')
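
For manual extraction from Python, the standard library is usually enough. This is a minimal sketch that assumes the download is a single standard archive such as a .zip file. The helper name extract_archive and the archive file name in the example call are hypothetical, and this is not necessarily what datasets.ELPephants.extract does internally.

```python
import shutil
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a standard archive (zip, tar, ...) into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # shutil infers the archive format from the file extension.
    shutil.unpack_archive(str(archive_path), str(target))
    # Return the top-level entries so the caller can inspect the result.
    return sorted(p.name for p in target.iterdir())


# Example call; the archive name below is a placeholder.
# extract_archive("data/ELPephants/ELPephants.zip", "data/ELPephants")
```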

Giraffes

Automatic downloading works only on Linux. On other systems, download the dataset manually from the FTP server using a client such as FileZilla.

HappyWhale

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

HumpbackWhaleID

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

IPanda50

IPanda50 may fail to download files because of Google Drive quotas. If this happens, download the three zip files manually as described in this GitHub repository. Then either extract them manually or run

datasets.IPanda50.extract('data/IPanda50')

NOAARightWhale

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

ReunionTurtles

Kaggle requirements need to be satisfied.

SealID

SealID requires a one-time token for download. Go to their download website, click the Data tab, then the three dots next to the Download button, and copy the URL. Then use

url = '' # Paste the URL here
datasets.SealID.get_data('data/SealID', url=url)

SeaTurtleID2022

Kaggle requirements need to be satisfied.

SMALST

Automatic extraction works only on Linux. On other systems, use

datasets.SMALST.download('data/SMALST')

to download the dataset and then extract it manually.

SouthernProvinceTurtles

Kaggle requirements need to be satisfied.

StripeSpotter

Automatic extraction works only on Linux. On other systems, use

datasets.StripeSpotter.download('data/StripeSpotter')

to download the dataset and then extract it manually.

ZakynthosTurtles

Kaggle requirements need to be satisfied.

Dataset modifications

BirdIndividualID

We removed images containing multiple birds where it was not indicated which bird is which. We also removed a few cropped images because they used a different folder structure.

CTai

There were several misspelled labels, which we corrected. The main corrected forms are Akrouba (instead of Akouba), Fredy (Freddy), Ibrahim (Ibrahiim), Lilou (Liliou), Wapi (Wapii) and Woodstock (Woodstiock). In addition, the label Adult was not a proper identity, so we replaced it with self.unknown_name.
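
The corrections listed above can be written as a simple lookup. The following is an illustrative sketch only, not the package's own code: the names CTAI_LABEL_FIXES, UNKNOWN_NAME and fix_ctai_label are hypothetical, the string "unknown" is an assumed stand-in for self.unknown_name, and the mapping covers only the misspellings named in the text.

```python
# Misspelled label -> corrected label, as listed in the text.
CTAI_LABEL_FIXES = {
    "Akouba": "Akrouba",
    "Freddy": "Fredy",
    "Ibrahiim": "Ibrahim",
    "Liliou": "Lilou",
    "Wapii": "Wapi",
    "Woodstiock": "Woodstock",
}

UNKNOWN_NAME = "unknown"  # assumed placeholder for self.unknown_name


def fix_ctai_label(label):
    """Return the corrected identity for a raw CTai label."""
    # "Adult" is not an individual, so it maps to the unknown identity.
    if label == "Adult":
        return UNKNOWN_NAME
    # Labels that are not in the mapping are returned unchanged.
    return CTAI_LABEL_FIXES.get(label, label)


labels = ["Akouba", "Fredy", "Adult", "Wapii"]
print([fix_ctai_label(x) for x in labels])
```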

Cows2021

We ignored all images outside the Identification folder because they were focused on detection. We also ignored all images in the Identification/Train folder because some of its subfolders contained images of different individuals.

We merged identity 105 into 29 and 164 into 148 as they depicted the same individuals. We performed the following correction:

Image name | Old identity | New identity
image_0001226_2020-02-11_12-43-7_roi_001.jpg | 137 | 107
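
As a rough illustration, the identity merges and the single-image correction can be combined into one function. This sketch is not the package's implementation: the names MERGED_IDENTITIES, IMAGE_FIXES and fix_cows2021_identity are hypothetical, and it assumes identities are stored as integers.

```python
# Identities merged because they depicted the same individual (old -> new).
MERGED_IDENTITIES = {105: 29, 164: 148}

# Per-image corrections (file name -> corrected identity).
IMAGE_FIXES = {"image_0001226_2020-02-11_12-43-7_roi_001.jpg": 107}


def fix_cows2021_identity(file_name, identity):
    """Return the corrected identity for one image record."""
    # A per-image fix takes precedence over the identity-level merge.
    if file_name in IMAGE_FIXES:
        return IMAGE_FIXES[file_name]
    # Identities without a merge entry are returned unchanged.
    return MERGED_IDENTITIES.get(identity, identity)
```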

Drosophila

We removed a few images due to different folder structure.

FriesianCattle2015

Images in the folder Cows-training were removed as they are identical to images in the folder Cows-testing.

Multiple individuals were removed as the images are copies of images of different individuals. Namely we removed 19 (duplicate of 15), 20 (18), 21 (17), 22 (16), 23 (11), 24 (14), 25 (13), 26 (12), 27 (11), 28 (13), 29 (12), 31 (17), 32 (16), 33 (18) and 37 (30).
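
The removed identities can be kept as a mapping from each removed identity to the identity it duplicates, which also documents why it was dropped. This is a sketch under assumptions: the names DUPLICATE_OF and drop_duplicate_identities are hypothetical, identities are assumed to be integers, and records are assumed to be (file name, identity) pairs.

```python
# Removed identity -> identity it duplicates, as listed in the text.
DUPLICATE_OF = {
    19: 15, 20: 18, 21: 17, 22: 16, 23: 11, 24: 14, 25: 13, 26: 12,
    27: 11, 28: 13, 29: 12, 31: 17, 32: 16, 33: 18, 37: 30,
}


def drop_duplicate_identities(records):
    """Keep only (file_name, identity) records whose identity was not removed."""
    return [(name, ident) for name, ident in records if ident not in DUPLICATE_OF]
```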

HappyWhale

We fixed the typos in the species names bottlenose_dolpin and kiler_whale (now bottlenose_dolphin and killer_whale).

HumpbackWhaleID

We replaced the identity new_whale with self.unknown_name.

IPanda50

We renamed images whose file names contained non-ASCII characters.
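
One possible way to derive an ASCII-only file name is shown below. It is a sketch using only the Python standard library and is not necessarily the method used for this dataset; the function name to_ascii_name is hypothetical.

```python
import unicodedata


def to_ascii_name(file_name):
    """Return an ASCII-only version of file_name."""
    # Decompose accented characters into a base letter plus combining marks,
    # then drop every character that has no ASCII encoding.
    decomposed = unicodedata.normalize("NFKD", file_name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters without an ASCII decomposition are dropped entirely, so two different names can collapse to the same result. A real renaming step would need to check for such collisions.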

LeopardID2022

We replaced the identity ____ with self.unknown_name.