How to work with datasets

The library represents wildlife re-identification datasets and manages operations on them, such as downloading, converting to dataframes, splitting into training and testing sets, and printing a dataset summary or its metadata. We first import the required modules

from wildlife_datasets import analysis, datasets, loader

Downloading datasets

The majority of datasets used in this library can be downloaded fully automatically.

datasets.MacaqueFaces.get_data('data/MacaqueFaces')

Some datasets require special handling, as described on a separate page.

Working with one dataset

When a dataset is already downloaded, it can be loaded

d = datasets.MacaqueFaces('data/MacaqueFaces')

Since this is a subclass of the DatasetFactory parent class, it inherits all the methods and attributes listed in its documentation. Its main component is the pandas dataframe of all samples

d.df
      image_id  ...        date
0            0  ...  2014-07-03
1            1  ...  2014-07-03
2            2  ...  2014-08-06
3            3  ...  2014-08-06
4            4  ...  2014-06-12
...        ...  ...         ...
6275      6275  ...  2014-02-19
6276      6276  ...  2014-02-19
6277      6277  ...  2014-03-21
6278      6278  ...  2014-02-19
6279      6279  ...  2014-03-21

[6280 rows x 4 columns]

The dataset can be graphically visualized by the grid plot

d.plot_grid()

or its basic numerical statistics can be printed

analysis.display_statistics(d.df)
Number of identitites            34
Number of all animals            6280
Number of animals with one image 0
Number of unidentified animals   0
Images span                      1.4 years

or metadata displayed

d.summary
{
 'licenses': 'Other'
 'licenses_url': 'https://github.com/clwitham/MacaqueFaces/blob/master/license.md'
 'url': 'https://github.com/clwitham/MacaqueFaces'
 'publication_url': 'https://www.sciencedirect.com/science/article/pii/S0165027017302637'
 'cite': 'witham2018automated'
 'animals': {'rhesus macaque'}
 'animals_simple': 'macaques'
 'real_animals': True
 'year': 2018
 'reported_n_total': 6460.0
 'reported_n_identified': 6460.0
 'reported_n_photos': 6460.0
 'reported_n_individuals': 34.0
 'wild': False
 'clear_photos': True
 'pose': 'single'
 'unique_pattern': False
 'from_video': True
 'cropped': True
 'span': '1.4 years'
 'size': 12.0
}
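
The figures printed by analysis.display_statistics earlier in this section can be read as simple counts over the samples. The sketch below shows one plausible reading on toy records using only the standard library. The function basic_statistics, the 'unknown' label and the record layout are assumptions made for illustration, not the library's implementation.

```python
from collections import Counter
from datetime import date


def basic_statistics(records, unknown='unknown'):
    """Compute summary counts from (identity, date) records.

    `unknown` is the label assumed to mark unidentified samples.
    """
    identities = [identity for identity, _ in records]
    # Count samples per identity, ignoring unidentified samples.
    known = Counter(i for i in identities if i != unknown)
    dates = [d for _, d in records]
    span_days = (max(dates) - min(dates)).days
    return {
        'n_identities': len(known),
        'n_images': len(records),
        'n_single_image': sum(1 for c in known.values() if c == 1),
        'n_unidentified': len(identities) - sum(known.values()),
        'span_years': round(span_days / 365.25, 1),
    }


# Toy records, not taken from MacaqueFaces.
records = [
    ('a', date(2014, 1, 1)),
    ('a', date(2014, 7, 1)),
    ('b', date(2015, 1, 1)),
    ('unknown', date(2014, 3, 1)),
]
stats = basic_statistics(records)
```

Under this reading, an identity with a single image cannot appear in both a training and a testing set, which is why that count is reported separately.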

Working with multiple datasets

Creating a dataset as above rebuilds the dataframe every time, which is slow for larger datasets. For this reason, we provide an alternative way

d = loader.load_dataset(datasets.MacaqueFaces, 'data', 'dataframes')

This function first checks whether dataframes/MacaqueFaces.pkl exists. If so, it loads the dataframe stored there; otherwise, it creates the dataframe and saves it to this file. Therefore, the first call of this function may be slow, but subsequent calls are fast.
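
The check-then-load behaviour described above is a common caching pattern. The following is a minimal sketch of that pattern using only the standard library. The names load_or_create and build are hypothetical, and the code is not the library's own implementation of loader.load_dataset.

```python
import pickle
from pathlib import Path


def load_or_create(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it.

    `build` is a hypothetical zero-argument callable that does the slow work.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result stored by an earlier call.
        with cache_path.open('rb') as f:
            return pickle.load(f)
    # Slow path: compute the result once and store it for later calls.
    result = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open('wb') as f:
        pickle.dump(result, f)
    return result
```

A cache of this kind is not refreshed when the underlying data changes, so deleting the stored file is the simplest way to force a rebuild.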

All exported datasets can be listed by

datasets.names_all
[
 wildlife_datasets.datasets.datasets.AAUZebraFish
 wildlife_datasets.datasets.datasets.AerialCattle2017
 wildlife_datasets.datasets.datasets.AmvrakikosTurtles
 wildlife_datasets.datasets.datasets.ATRW
 wildlife_datasets.datasets.datasets.BelugaIDv2
 wildlife_datasets.datasets.datasets.BirdIndividualID
 wildlife_datasets.datasets.datasets.BirdIndividualIDSegmented
 wildlife_datasets.datasets.datasets.CatIndividualImages
 wildlife_datasets.datasets.datasets.CTai
 wildlife_datasets.datasets.datasets.CZoo
 wildlife_datasets.datasets.datasets.Chicks4FreeID
 wildlife_datasets.datasets.datasets.CowDataset
 wildlife_datasets.datasets.datasets.Cows2021v2
 wildlife_datasets.datasets.datasets.DogFaceNet
 wildlife_datasets.datasets.datasets.Drosophila
 wildlife_datasets.datasets.datasets.ELPephants
 wildlife_datasets.datasets.datasets.FriesianCattle2015v2
 wildlife_datasets.datasets.datasets.FriesianCattle2017
 wildlife_datasets.datasets.datasets.GiraffeZebraID
 wildlife_datasets.datasets.datasets.Giraffes
 wildlife_datasets.datasets.datasets.HappyWhale
 wildlife_datasets.datasets.datasets.HumpbackWhaleID
 wildlife_datasets.datasets.datasets.HyenaID2022
 wildlife_datasets.datasets.datasets.IPanda50
 wildlife_datasets.datasets.datasets.LeopardID2022
 wildlife_datasets.datasets.datasets.LionData
 wildlife_datasets.datasets.datasets.MacaqueFaces
 wildlife_datasets.datasets.datasets.MPDD
 wildlife_datasets.datasets.datasets.NDD20v2
 wildlife_datasets.datasets.datasets.NOAARightWhale
 wildlife_datasets.datasets.datasets.NyalaData
 wildlife_datasets.datasets.datasets.OpenCows2020
 wildlife_datasets.datasets.datasets.PolarBearVidID
 wildlife_datasets.datasets.datasets.ReunionTurtles
 wildlife_datasets.datasets.datasets.SealID
 wildlife_datasets.datasets.datasets.SealIDSegmented
 wildlife_datasets.datasets.datasets.SeaStarReID2023
 wildlife_datasets.datasets.datasets.SeaTurtleID2022
 wildlife_datasets.datasets.datasets.SeaTurtleIDHeads
 wildlife_datasets.datasets.datasets.SMALST
 wildlife_datasets.datasets.datasets.SouthernProvinceTurtles
 wildlife_datasets.datasets.datasets.StripeSpotter
 wildlife_datasets.datasets.datasets.WhaleSharkID
 wildlife_datasets.datasets.datasets.ZakynthosTurtles
 wildlife_datasets.datasets.datasets.ZindiTurtleRecall
]

Warning

The following code requires all datasets to be downloaded. If you have downloaded only some of them, select the appropriate subset of datasets.names_all.
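
One way to select such a subset is to keep only the classes whose data folder exists. The sketch below assumes that each dataset is stored in a folder named after its class, as with data/MacaqueFaces above. The helper select_downloaded is hypothetical and not part of the library.

```python
import os


def select_downloaded(dataset_classes, root='data'):
    """Keep only the classes whose folder root/<ClassName> exists.

    Assumes each dataset lives in a folder named after its class,
    as in 'data/MacaqueFaces'.
    """
    return [
        cls for cls in dataset_classes
        if os.path.isdir(os.path.join(root, cls.__name__))
    ]


# Intended use (requires wildlife_datasets to be installed):
# names_downloaded = select_downloaded(datasets.names_all, 'data')
```

The resulting list can then be used wherever datasets.names_all would otherwise be passed.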

To work with all provided datasets, we can easily put the load_dataset call into a loop

ds = []
for dataset_class in datasets.names_all:
    d = loader.load_dataset(dataset_class, 'data', 'dataframes')
    ds.append(d)

or equivalently by

ds = loader.load_datasets(datasets.names_all, 'data', 'dataframes')