Processing datasets
Some datasets require special treatment, or their download or extraction works only on Linux. These cases are listed in the Processing requirements column below. We also fixed labels in some datasets, as listed in the Dataset modifications column.
Dataset | Publication | Processing requirements | Dataset modifications |
---|---|---|---|
AAUZebrafish | | Kaggle required | |
AerialCattle2017 | | | |
AmvrakikosTurtles | | Kaggle required | |
ATRW | | | |
BelugaID | | | |
BirdIndividualID | | Manual download | Few images removed |
CatIndividualImages | | Kaggle required | |
Chicks4FreeID | | | |
CTai | | | Labels fixed |
CZoo | | | |
CowDataset | | | |
Cows2021 | | | Labels fixed |
DogFaceNet | | | |
Drosophila | | | Few images removed |
ELPephants | | Manual download | |
FriesianCattle2015 | | | Labels fixed |
FriesianCattle2017 | | | |
Giraffes | | Only Linux: downloading | |
GiraffeZebraID | | | |
HappyWhale | | Kaggle required + terms | Species fixed |
HumpbackWhaleID | | Kaggle required + terms | Unknown animals renamed |
HyenaID2022 | | | |
IPanda50 | | | Few images renamed |
LeopardID2022 | | | Unknown animals renamed |
LionData | | | |
MacaqueFaces | | | |
MPDD | | | |
NDD20 | | | |
NOAARightWhale | | Kaggle required + terms | |
NyalaData | | | |
OpenCows2020 | | | |
PolarBearVidID | | | |
ReunionTurtles | | Kaggle required | |
SealID | | Download with single-use token | |
SeaStarReID2023 | | | |
SeaTurtleID2022 | | Kaggle required | |
SMALST | | Only Linux: extracting | |
SouthernProvinceTurtles | | Kaggle required | |
StripeSpotter | | Only Linux: extracting | |
WhaleSharkID | | | |
ZakynthosTurtles | | Kaggle required | |
ZindiTurtleRecall | | | |
Manual download and extracting
Kaggle
Some datasets are stored on Kaggle. To use our automatic download method, follow the steps described in the Installation and Authentication sections.
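Before starting an automatic download, it can help to check that Kaggle credentials are visible on the machine. The sketch below assumes the usual Kaggle API conventions: a kaggle.json file in ~/.kaggle (or in the directory named by KAGGLE_CONFIG_DIR), or the environment variables KAGGLE_USERNAME and KAGGLE_KEY. The helper name kaggle_credentials_available is hypothetical and not part of this package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured."""
    env = os.environ if env is None else env
    # Option 1: credentials passed through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration directory.
    home = Path.home() if home is None else Path(home)
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    return (config_dir / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_available())
```

If the check fails, revisit the Installation and Authentication steps before downloading any of the Kaggle-hosted datasets.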
AAUZebrafish
Kaggle requirements need to be satisfied.
AmvrakikosTurtles
Kaggle requirements need to be satisfied.
BirdIndividualID
The dataset is stored on Google Drive but needs to be downloaded manually due to its size. After downloading it, place it into the folder BirdIndividualID and run
datasets.BirdIndividualID.extract('data/BirdIndividualID')
to extract it. Do not extract it manually because there is some postprocessing involved.
CatIndividualImages
Kaggle requirements need to be satisfied.
ELPephants
The dataset is stored privately, and you need to accept the license terms before downloading it. After doing so, extract it manually or run
datasets.ELPephants.extract('data/ELPephants')
Giraffes
Automatic downloading works only on Linux. On other systems, download the dataset manually from the FTP server using a client such as FileZilla.
HappyWhale
Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.
HumpbackWhaleID
Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.
IPanda50
IPanda50 may fail to download files because of Google Drive quotas. If this happens, download the three zip files manually as described in this GitHub repository. Then either extract them manually or run
datasets.IPanda50.extract('data/IPanda50')
NOAARightWhale
Kaggle requirements need to be satisfied. You also need to open the Data tab on the competition website and agree to the terms.
ReunionTurtles
Kaggle requirements need to be satisfied.
SealID
SealID requires a one-time token for download. Go to their download website, click the Data tab, then click the three dots next to the Download button and copy the URL link. Then use
url = '' # Paste the URL here
datasets.SealID.get_data('data/SealID', url=url)
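Because the token is single-use, a mistyped or empty URL wastes a download attempt. A small check before calling the downloader may help; the sketch below uses only the standard library, and the helper name is_plausible_url is hypothetical.

```python
from urllib.parse import urlparse


def is_plausible_url(url):
    """Return True if `url` has an http(s) scheme and a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


url = ''  # Paste the URL here
if not is_plausible_url(url):
    print("The URL looks empty or malformed; copy it again from the Data tab.")
```

The check only inspects the shape of the URL. It cannot tell whether the token has already been used.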
SeaTurtleID2022
Kaggle requirements need to be satisfied.
SMALST
Extracting works only on Linux. Use
datasets.SMALST.download('data/SMALST')
to download the dataset and then extract it manually.
SouthernProvinceTurtles
Kaggle requirements need to be satisfied.
StripeSpotter
Extracting works only on Linux. Use
datasets.StripeSpotter.download('data/StripeSpotter')
to download the dataset and then extract it manually.
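The Linux-only steps for Giraffes, SMALST and StripeSpotter can be guarded with a platform check. The following is a minimal sketch: the helper plan_manual_steps and the LINUX_ONLY mapping are hypothetical, and the function only reports which step needs manual work rather than calling the library.

```python
import platform

# Datasets whose download or extraction step is Linux-only,
# taken from the Processing requirements column.
LINUX_ONLY = {
    "Giraffes": "download",
    "SMALST": "extract",
    "StripeSpotter": "extract",
}


def plan_manual_steps(dataset, system=None):
    """Return the steps that need manual work for `dataset` on `system`."""
    system = platform.system() if system is None else system
    step = LINUX_ONLY.get(dataset)
    if step is None or system == "Linux":
        return []      # nothing Linux-specific has to be done by hand
    return [step]      # this step has to be done manually


if __name__ == "__main__":
    for name in LINUX_ONLY:
        print(name, plan_manual_steps(name))
```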
ZakynthosTurtles
Kaggle requirements need to be satisfied.
Dataset modifications
BirdIndividualID
We removed images containing multiple birds where it was not indicated which bird is which. We also removed a few cropped images due to a different folder structure.
CTai
There were several misspelled labels, which we corrected. The main corrected forms are Akrouba (instead of Akouba), Fredy (Freddy), Ibrahim (Ibrahiim), Lilou (Liliou), Wapi (Wapii) and Woodstock (Woodstiock). The identity Adult was also not a proper identity, so we replaced it with self.unknown_name.
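The CTai corrections amount to a lookup from misspelled to corrected labels. The sketch below is illustrative only: it covers just the spellings listed above (which may not be exhaustive), the function name fix_ctai_labels is hypothetical, and the string "unknown" is an assumed stand-in for the value of self.unknown_name.

```python
# Misspelled label -> corrected label, as listed in the text.
CTAI_FIXES = {
    "Akouba": "Akrouba",
    "Freddy": "Fredy",
    "Ibrahiim": "Ibrahim",
    "Liliou": "Lilou",
    "Wapii": "Wapi",
    "Woodstiock": "Woodstock",
}
UNKNOWN_NAME = "unknown"  # assumed placeholder for self.unknown_name


def fix_ctai_labels(labels):
    """Correct misspellings and map the non-identity 'Adult' to the unknown label."""
    fixed = []
    for label in labels:
        if label == "Adult":
            fixed.append(UNKNOWN_NAME)
        else:
            # Labels without a known correction are kept unchanged.
            fixed.append(CTAI_FIXES.get(label, label))
    return fixed


print(fix_ctai_labels(["Akouba", "Fredy", "Adult"]))
# ['Akrouba', 'Fredy', 'unknown']
```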
Cows2021
We ignored all images not in the Identification folder because they were focused on detection. We also ignored all images in the Identification/Train folder because some folders contained images of different individuals.
We merged identity 105 into 29 and 164 into 148 as they depicted the same individuals. We performed the following correction:
Image name | Old identity | New identity |
---|---|---|
image_0001226_2020-02-11_12-43-7_roi_001.jpg | 137 | 107 |
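As a rough illustration, the merges and the single per-image correction can be applied in two passes over (file name, identity) pairs. The function name fix_cows2021 and the use of integer identities are assumptions made for this sketch.

```python
# Identities merged because they depict the same individual.
MERGED = {105: 29, 164: 148}
# Per-image correction: file name -> new identity.
IMAGE_FIXES = {"image_0001226_2020-02-11_12-43-7_roi_001.jpg": 107}


def fix_cows2021(records):
    """Apply identity merges, then per-image corrections, to (name, identity) pairs."""
    fixed = []
    for name, identity in records:
        identity = MERGED.get(identity, identity)
        identity = IMAGE_FIXES.get(name, identity)
        fixed.append((name, identity))
    return fixed
```

Applying the merge first and the per-image fix second keeps the image-level correction authoritative if the two ever touch the same record.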
Drosophila
We removed a few images due to a different folder structure.
FriesianCattle2015
Images in the folder Cows-training were removed as they are identical to images in the folder Cows-testing.
Multiple individuals were removed because their images are copies of images of other individuals. Namely, we removed 19 (duplicate of 15), 20 (18), 21 (17), 22 (16), 23 (11), 24 (14), 25 (13), 26 (12), 27 (11), 28 (13), 29 (12), 31 (17), 32 (16), 33 (18) and 37 (30).
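The removal of duplicated individuals can be expressed as a simple filter. This is a sketch under assumptions: identities are treated as integers, and the function name drop_duplicate_identities is hypothetical.

```python
# Removed identity -> identity it duplicates, as listed in the text.
DUPLICATE_OF = {
    19: 15, 20: 18, 21: 17, 22: 16, 23: 11, 24: 14, 25: 13, 26: 12,
    27: 11, 28: 13, 29: 12, 31: 17, 32: 16, 33: 18, 37: 30,
}


def drop_duplicate_identities(records):
    """Keep only (file name, identity) pairs whose identity was not removed."""
    return [(name, identity) for name, identity in records
            if identity not in DUPLICATE_OF]
```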
HappyWhale
We fixed typos in the species names bottlenose_dolpin and kiler_whale.
HumpbackWhaleID
We replaced the label new_whale with self.unknown_name.
IPanda50
We renamed images whose file names contain non-ASCII characters.
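To see which files are affected after a manual download, file names can be scanned for non-ASCII characters. The sketch below only detects such names; it does not reproduce the renaming scheme used here, which the text does not specify. Both helper names are hypothetical.

```python
import os


def non_ascii_names(names):
    """Return the names that contain at least one non-ASCII character."""
    return [name for name in names if not name.isascii()]


def find_non_ascii_files(root):
    """Walk `root` and return paths of files with non-ASCII names."""
    found = []
    for folder, _, files in os.walk(root):
        found.extend(os.path.join(folder, name)
                     for name in non_ascii_names(files))
    return found


if __name__ == "__main__":
    for path in find_non_ascii_files("data/IPanda50"):
        print(path)
```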
LeopardID2022
We replaced the label ____ with self.unknown_name.