How to work with datasets
The library represents wildlife re-identification datasets and manages all operations on them, such as downloading, converting to dataframes, splitting into training and testing sets, and printing a dataset summary or its metadata. We first import the required modules
from wildlife_datasets import analysis, datasets, loader
Downloading datasets
The majority of datasets used in this library can be downloaded fully automatically.
datasets.MacaqueFaces.get_data('data/MacaqueFaces')
Some datasets require special handling, as described on a dedicated page.
Working with one dataset
When a dataset is already downloaded, it can be loaded
d = datasets.MacaqueFaces('data/MacaqueFaces')
Since this is a subclass of the DatasetFactory
parent class, it inherits all the methods and attributes listed in its documentation. Its main component is a pandas dataframe of all samples
d.df
image_id ... date
0 0 ... 2014-07-03
1 1 ... 2014-07-03
2 2 ... 2014-08-06
3 3 ... 2014-08-06
4 4 ... 2014-06-12
... ... ... ...
6275 6275 ... 2014-02-19
6276 6276 ... 2014-02-19
6277 6277 ... 2014-03-21
6278 6278 ... 2014-02-19
6279 6279 ... 2014-03-21
[6280 rows x 4 columns]
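Because d.df is a plain pandas dataframe, standard pandas operations apply to it. The snippet below uses a small stand-in dataframe mirroring the image_id and date columns shown above (the identity column name is an assumption for illustration) to demonstrate typical queries:

```python
import pandas as pd

# Stand-in for d.df; besides image_id and date, real column names may differ
df = pd.DataFrame({
    'image_id': [0, 1, 2, 3],
    'identity': ['Dan', 'Dan', 'Eve', 'Eve'],  # hypothetical column name
    'date': pd.to_datetime(['2014-07-03', '2014-07-03',
                            '2014-08-06', '2014-06-12']),
})

# Number of images per individual
counts = df['identity'].value_counts()

# Restrict to images taken after a given date
recent = df[df['date'] > '2014-07-01']
```

The same calls work unchanged on the full MacaqueFaces dataframe once the dataset is loaded.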
The dataset can be graphically visualized by the grid plot
d.plot_grid()
or its basic numerical statistics can be printed
analysis.display_statistics(d.df)
Number of identities 34
Number of all animals 6280
Number of animals with one image 0
Number of unidentified animals 0
Images span 1.4 years
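The statistics printed above can also be computed directly from the dataframe. A minimal sketch on a toy dataframe, assuming an identity column (the column name is an assumption, not the library's confirmed schema):

```python
import pandas as pd

# Toy stand-in for d.df
df = pd.DataFrame({
    'identity': ['Dan', 'Dan', 'Eve'],  # hypothetical column name
    'date': pd.to_datetime(['2014-02-19', '2014-03-21', '2014-07-03']),
})

# Number of distinct identities
n_identities = df['identity'].nunique()

# Number of identities represented by a single image
n_single = (df['identity'].value_counts() == 1).sum()

# Time span of the images in years
span_years = (df['date'].max() - df['date'].min()).days / 365.25
```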
or metadata displayed
d.summary
{
'licenses': 'Other'
'licenses_url': 'https://github.com/clwitham/MacaqueFaces/blob/master/license.md'
'url': 'https://github.com/clwitham/MacaqueFaces'
'publication_url': 'https://www.sciencedirect.com/science/article/pii/S0165027017302637'
'cite': 'witham2018automated'
'animals': {'rhesus macaque'}
'animals_simple': 'macaques'
'real_animals': True
'year': 2018
'reported_n_total': 6460.0
'reported_n_identified': 6460.0
'reported_n_photos': 6460.0
'reported_n_individuals': 34.0
'wild': False
'clear_photos': True
'pose': 'single'
'unique_pattern': False
'from_video': True
'cropped': True
'span': '1.4 years'
'size': 12.0
}
Working with multiple datasets
Since the above way of creating a dataset always recreates the dataframe, it can be slow for larger datasets. For this reason, we provide an alternative
d = loader.load_dataset(datasets.MacaqueFaces, 'data', 'dataframes')
This function first checks whether dataframes/MacaqueFaces.pkl
exists. If so, it loads the dataframe stored there; otherwise, it creates this file. Therefore, the first call of this function may be slow, but subsequent calls are fast.
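The caching behavior can be pictured as the following sketch. It is a hypothetical re-implementation for illustration only; the actual load_dataset function, its signature, and how a cached dataframe is passed back to the dataset class may differ:

```python
import os
import pandas as pd

def load_dataset_sketch(dataset_class, root, df_path):
    # Hypothetical illustration of the caching logic
    pkl_file = os.path.join(df_path, dataset_class.__name__ + '.pkl')
    if os.path.exists(pkl_file):
        # Fast path: reuse the cached dataframe instead of recreating it
        df = pd.read_pickle(pkl_file)
        return dataset_class(os.path.join(root, dataset_class.__name__), df)
    # Slow path: create the dataframe and cache it for the next call
    d = dataset_class(os.path.join(root, dataset_class.__name__))
    os.makedirs(df_path, exist_ok=True)
    d.df.to_pickle(pkl_file)
    return d
```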
All exported datasets can be listed by
datasets.names_all
[
wildlife_datasets.datasets.datasets.AAUZebraFish
wildlife_datasets.datasets.datasets.AerialCattle2017
wildlife_datasets.datasets.datasets.AmvrakikosTurtles
wildlife_datasets.datasets.datasets.ATRW
wildlife_datasets.datasets.datasets.BelugaIDv2
wildlife_datasets.datasets.datasets.BirdIndividualID
wildlife_datasets.datasets.datasets.BirdIndividualIDSegmented
wildlife_datasets.datasets.datasets.CatIndividualImages
wildlife_datasets.datasets.datasets.CTai
wildlife_datasets.datasets.datasets.CZoo
wildlife_datasets.datasets.datasets.Chicks4FreeID
wildlife_datasets.datasets.datasets.CowDataset
wildlife_datasets.datasets.datasets.Cows2021v2
wildlife_datasets.datasets.datasets.DogFaceNet
wildlife_datasets.datasets.datasets.Drosophila
wildlife_datasets.datasets.datasets.ELPephants
wildlife_datasets.datasets.datasets.FriesianCattle2015v2
wildlife_datasets.datasets.datasets.FriesianCattle2017
wildlife_datasets.datasets.datasets.GiraffeZebraID
wildlife_datasets.datasets.datasets.Giraffes
wildlife_datasets.datasets.datasets.HappyWhale
wildlife_datasets.datasets.datasets.HumpbackWhaleID
wildlife_datasets.datasets.datasets.HyenaID2022
wildlife_datasets.datasets.datasets.IPanda50
wildlife_datasets.datasets.datasets.LeopardID2022
wildlife_datasets.datasets.datasets.LionData
wildlife_datasets.datasets.datasets.MacaqueFaces
wildlife_datasets.datasets.datasets.MPDD
wildlife_datasets.datasets.datasets.NDD20v2
wildlife_datasets.datasets.datasets.NOAARightWhale
wildlife_datasets.datasets.datasets.NyalaData
wildlife_datasets.datasets.datasets.OpenCows2020
wildlife_datasets.datasets.datasets.PolarBearVidID
wildlife_datasets.datasets.datasets.ReunionTurtles
wildlife_datasets.datasets.datasets.SealID
wildlife_datasets.datasets.datasets.SealIDSegmented
wildlife_datasets.datasets.datasets.SeaStarReID2023
wildlife_datasets.datasets.datasets.SeaTurtleID2022
wildlife_datasets.datasets.datasets.SeaTurtleIDHeads
wildlife_datasets.datasets.datasets.SMALST
wildlife_datasets.datasets.datasets.SouthernProvinceTurtles
wildlife_datasets.datasets.datasets.StripeSpotter
wildlife_datasets.datasets.datasets.WhaleSharkID
wildlife_datasets.datasets.datasets.ZakynthosTurtles
wildlife_datasets.datasets.datasets.ZindiTurtleRecall
]
Warning
The following code requires all datasets to be downloaded. If you have downloaded only some of them, select the appropriate subset of datasets.names_all.
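One way to select such a subset is to filter datasets.names_all by which data folders already exist on disk. A sketch assuming each dataset was downloaded into a folder named after its class (as in the examples above):

```python
import os

def downloaded_subset(names_all, root='data'):
    # Keep only the dataset classes whose data folder already exists
    return [cls for cls in names_all
            if os.path.exists(os.path.join(root, cls.__name__))]
```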
To work with all provided datasets, we can easily put the load_dataset
call into a loop
ds = []
for dataset_name in datasets.names_all:
    d = loader.load_dataset(dataset_name, 'data', 'dataframes')
    ds.append(d)
or equivalently by
ds = loader.load_datasets(datasets.names_all, 'data', 'dataframes')
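Once loaded, the list ds can be processed uniformly, for example to tabulate how many samples each dataset contains. A minimal sketch, assuming only that each element exposes a df dataframe as shown earlier:

```python
def dataset_sizes(ds):
    # Map each dataset's class name to its number of samples
    return {type(d).__name__: len(d.df) for d in ds}
```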