Processing datasets

Some datasets require special treatment, or their download or extraction works only on Linux. These requirements are listed below in the Processing requirements column. We also fixed labels in some datasets, as described in the Dataset modifications column.

Dataset	Publication	Processing requirements	Dataset modifications
AAUZebrafish		Kaggle required
AerialCattle2017
AmvrakikosTurtles		Kaggle required
ATRW
BelugaID
BirdIndividualID		Manual download	Few images removed
CatIndividualImages		Kaggle required
CzechLynx	TBD	Kaggle required
Chicks4FreeID
CTai			Labels fixed
CZoo
CowDataset
Cows2021			Labels fixed
DogFaceNet
Drosophila			Few images removed
ELPephants		Manual download
FriesianCattle2015			Labels fixed
FriesianCattle2017
Giraffes		Only Linux: downloading
GiraffeZebraID
HappyWhale		Kaggle required + terms	Species fixed
HumpbackWhaleID		Kaggle required + terms	Unknown animals renamed
HyenaID2022
IPanda50			Few images renamed
LeopardID2022			Unknown animals renamed
LionData
MacaqueFaces
MPDD
MultiCamCows2024
NDD20
NOAARightWhale		Kaggle required + terms
NyalaData
OpenCows2020
PolarBearVidID
PrimFace
ReunionTurtles		Kaggle required
SealID		Download with single use token
SeaStarReID2023
SeaTurtleID2022		Kaggle required
SMALST		Only Linux: extracting
SouthernProvinceTurtles		Kaggle required
StripeSpotter		Only Linux: extracting
WhaleSharkID
ZakynthosTurtles		Kaggle required
ZindiTurtleRecall
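
The Linux-only entries in the table above can be checked programmatically before starting a download. The following is a minimal sketch, not part of the library: the dictionary simply restates the three Linux-only rows of the table, and the helper name `manual_step_needed` is hypothetical.

```python
import platform

# Steps that the table above marks as working only on Linux.
LINUX_ONLY_STEPS = {
    "Giraffes": "downloading",
    "SMALST": "extracting",
    "StripeSpotter": "extracting",
}


def manual_step_needed(dataset_name, system=None):
    """Return the step that must be done by hand on this system, or None.

    `system` defaults to platform.system(), which reports 'Linux',
    'Windows' or 'Darwin' (macOS).
    """
    system = system or platform.system()
    if system == "Linux":
        return None
    return LINUX_ONLY_STEPS.get(dataset_name)
```

For example, `manual_step_needed('SMALST', 'Windows')` returns `'extracting'`, while any dataset on Linux returns `None`.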

Manual download and extraction

Kaggle

Some datasets are stored on Kaggle. To use our automatic download method, follow the steps described in the Installation and Authentication sections.
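
Before starting a Kaggle download, it can help to confirm that credentials are in place. The sketch below assumes the conventions of the official Kaggle API client: either the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the folder named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_found` is hypothetical and not part of any library.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Assumes the lookup locations used by the official Kaggle client;
    it only checks that they exist, not that the key is valid.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: credentials passed through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration folder.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()
```

A `False` result means that the Authentication steps still need to be completed.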

AAUZebrafish

Kaggle requirements need to be satisfied.

AmvrakikosTurtles

Kaggle requirements need to be satisfied.

BirdIndividualID

The dataset is stored on Google Drive but needs to be downloaded manually due to its size. After downloading it, place it into the folder 'BirdIndividualID' and run

from wildlife_datasets import datasets
datasets.BirdIndividualID.extract('data/BirdIndividualID')

to extract it. Do not extract it manually because there is some postprocessing involved.

CatIndividualImages

Kaggle requirements need to be satisfied.

CzechLynx

Kaggle requirements need to be satisfied.

ELPephants

The dataset is stored privately, and you need to acknowledge the license terms before downloading it. After doing so, extract it manually or run

datasets.ELPephants.extract('data/ELPephants')

Giraffes

Automatic downloading works only on Linux. On other systems, download the dataset manually from the FTP server using a client such as FileZilla.

HappyWhale

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

HumpbackWhaleID

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

IPanda50

IPanda50 may fail to download files because of Google Drive quotas. If this happens, download the three zip files manually as described in this GitHub repository. Then either extract them manually or run

datasets.IPanda50.extract('data/IPanda50')

NOAARightWhale

Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.

ReunionTurtles

Kaggle requirements need to be satisfied.

SealID

SealID requires a one-time token for download. Please go to their download website, click the Data tab, then the three dots next to the Download button, and copy the URL link. Then use

url = '' # Paste the URL here
datasets.SealID.get_data('data/SealID', url=url)

SeaTurtleID2022

Kaggle requirements need to be satisfied.

SMALST

Extracting works only on Linux. Use

datasets.SMALST.download('data/SMALST')

to download the dataset and then extract it manually.
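
The manual extraction can also be done from Python on any operating system. The following is a minimal sketch, assuming the downloaded files are ordinary archives in a format that `shutil.unpack_archive` recognizes (such as `.zip` or `.tar.gz`); split or multi-part archives are not handled. The helper name `extract_archives` is hypothetical, and the sketch does not reproduce any postprocessing the library may perform.

```python
import shutil
from pathlib import Path


def extract_archives(folder):
    """Unpack every archive in `folder` that shutil can handle.

    Archives are extracted into `folder` itself and are not deleted.
    Returns the names of the archives that were unpacked.
    """
    folder = Path(folder)
    # Collect the file extensions shutil knows how to unpack.
    known = tuple(
        ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts
    )
    extracted = []
    # sorted() lists the folder before anything new is written into it.
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name.endswith(known):
            shutil.unpack_archive(str(path), str(folder))
            extracted.append(path.name)
    return extracted
```

For example, `extract_archives('data/SMALST')` would unpack whatever supported archives the download step placed in that folder.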

SouthernProvinceTurtles

Kaggle requirements need to be satisfied.

StripeSpotter

Extracting works only on Linux. Use

datasets.StripeSpotter.download('data/StripeSpotter')

to download the dataset and then extract it manually.

ZakynthosTurtles

Kaggle requirements need to be satisfied.

Dataset modifications

BirdIndividualID

We removed images containing multiple birds where it was not indicated which bird is which. We also removed a few cropped images because they were stored in a different folder structure.

CTai

There were several misspelled labels, which we corrected. Mainly, the correct forms read Akrouba (instead of Akouba), Fredy (Freddy), Ibrahim (Ibrahiim), Lilou (Liliou), Wapi (Wapii) and Woodstock (Woodstiock). In addition, Adult was not a proper identity, so we replaced it with self.unknown_name.
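
The corrections listed above amount to a lookup table from the misspelled label to the corrected one. The sketch below illustrates the idea and is not the library's implementation: `UNKNOWN_NAME` is a hypothetical stand-in for self.unknown_name, the helper `fix_labels` is hypothetical, and the mapping covers only the labels named in this section.

```python
# Hypothetical stand-in for self.unknown_name; the real value may differ.
UNKNOWN_NAME = "unknown"

# Misspelled label -> corrected label, as listed in the text above.
CTAI_FIXES = {
    "Akouba": "Akrouba",
    "Freddy": "Fredy",
    "Ibrahiim": "Ibrahim",
    "Liliou": "Lilou",
    "Wapii": "Wapi",
    "Woodstiock": "Woodstock",
    "Adult": UNKNOWN_NAME,  # not a proper identity
}


def fix_labels(labels, fixes=CTAI_FIXES):
    """Return the labels with known misspellings replaced."""
    return [fixes.get(label, label) for label in labels]
```

Labels that are not in the table pass through unchanged, so the function can be applied to the whole identity column.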

Cows2021

We ignored all images outside the Identification folder because they were intended for detection. We also ignored all images in the Identification/Train folder because some of its subfolders contained images of different individuals.

We merged identity 105 into 29 and 164 into 148, as they depicted the same individuals. We also corrected the identity of the following image:

Image name	Old identity	New identity
image_0001226_2020-02-11_12-43-7_roi_001.jpg	137	107
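
The two kinds of correction above (merging whole identities and relabeling a single image) can be sketched as follows. This is an illustration under assumptions, not the library's code: the helper `corrected_identity` is hypothetical, and applying the per-image fix before the merges is our choice (the order does not matter for the values listed here).

```python
# Identities merged because they depict the same individual.
IDENTITY_MERGES = {105: 29, 164: 148}

# Per-image corrections: image name -> new identity.
IMAGE_FIXES = {"image_0001226_2020-02-11_12-43-7_roi_001.jpg": 107}


def corrected_identity(image_name, identity):
    """Return the identity after the per-image fix and the merges."""
    if image_name in IMAGE_FIXES:
        return IMAGE_FIXES[image_name]
    return IDENTITY_MERGES.get(identity, identity)
```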

Drosophila

We removed a few images because they were stored in a different folder structure.

FriesianCattle2015

Images in the folder Cows-training were removed as they are identical to images in the folder Cows-testing.

Multiple individuals were removed because their images are copies of images of other individuals. Namely, we removed 19 (duplicate of 15), 20 (18), 21 (17), 22 (16), 23 (11), 24 (14), 25 (13), 26 (12), 27 (11), 28 (13), 29 (12), 31 (17), 32 (16), 33 (18) and 37 (30).
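
The removal of duplicated individuals can be expressed as a filter over (image, identity) records. The following is a minimal sketch, not the library's implementation: the dictionary restates the list above, and the helper `drop_duplicate_individuals` and the record layout are hypothetical.

```python
# Removed identity -> identity it duplicates, as listed above.
DUPLICATE_OF = {
    19: 15, 20: 18, 21: 17, 22: 16, 23: 11,
    24: 14, 25: 13, 26: 12, 27: 11, 28: 13,
    29: 12, 31: 17, 32: 16, 33: 18, 37: 30,
}


def drop_duplicate_individuals(records):
    """Keep only records whose identity is not a known duplicate.

    `records` is an iterable of (image_path, identity) pairs.
    """
    return [
        (path, identity)
        for path, identity in records
        if identity not in DUPLICATE_OF
    ]
```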

HappyWhale

We fixed the typos in the species names bottlenose_dolpin and kiler_whale.

HumpbackWhaleID

We replaced the identity new_whale with self.unknown_name.

IPanda50

We renamed images whose names contained non-ASCII characters.
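
One possible way to produce ASCII-only file names is sketched below. The renaming scheme actually used for IPanda50 is not described here, so this is only an assumption for illustration: accents are stripped via Unicode decomposition, and any remaining non-ASCII character is replaced by an underscore. The helper name `ascii_name` is hypothetical.

```python
import unicodedata


def ascii_name(file_name):
    """Return an ASCII-only version of `file_name`.

    Accented letters lose their accents; other non-ASCII characters
    become '_'. Distinct names may collide after conversion.
    """
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", file_name)
    chars = []
    for ch in decomposed:
        if ch.isascii():
            chars.append(ch)
        elif not unicodedata.combining(ch):
            # Not an accent mark, so keep a placeholder for it.
            chars.append("_")
    return "".join(chars)
```

Because different names can map to the same result, a real renaming step would also need to check for collisions before moving files.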

LeopardID2022

We replaced the identity ____ with self.unknown_name.