
How to add new datasets

Adding new datasets to WildlifeDatasets is easy. It is sufficient to create a subclass of WildlifeDataset with the create_catalogue method. A simple example is

import pandas as pd
from wildlife_datasets.datasets import WildlifeDataset

class Test(WildlifeDataset):
    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df

The dataset is then instantiated by Test('.'). The argument (here '.') should point to the location where the images are stored. The dataframe can then be accessed by

Test('.').df
   image_id identity      path
0         1    Lukas  img1.jpg
1         2    Vojta  img2.jpg
2         3    Lukas  img3.jpg
3         4    Vojta  img4.jpg
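For a real dataset, the rows would normally come from the files on disk rather than from hand-typed lists. The following is only a sketch under a stated assumption: images are stored as root/<identity>/<image file>. The helper name scan_identity_folders and the folder layout are hypothetical and not part of the package.

```python
import os

def scan_identity_folders(root):
    """Collect one row per image, assuming the layout root/<identity>/<image file>."""
    rows = []
    for identity in sorted(os.listdir(root)):
        folder = os.path.join(root, identity)
        if not os.path.isdir(folder):
            continue  # skip loose files lying next to the identity folders
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(('.jpg', '.jpeg', '.png')):
                rows.append({
                    'image_id': len(rows) + 1,   # running integer, unique per image
                    'identity': identity,        # folder name used as the identity
                    'path': os.path.join(identity, name),  # relative to root
                })
    return rows
```

Inside create_catalogue, the rows could then be wrapped as pd.DataFrame(rows), with root being whatever location was passed when the dataset was instantiated.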

The dataframe df must satisfy some requirements.

Info

Instead of returning df, it is better to return self.finalize_catalogue(df). This function performs multiple checks to verify the created dataframe. However, in this example, the checks would fail because the specified file paths do not exist.
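To give a feeling for what such checks may look like, here is a minimal standalone sketch. It is not the implementation of finalize_catalogue. The function name check_catalogue and the list of required columns are assumptions taken from the example above.

```python
import os
import pandas as pd

# Assumed from the example above, not taken from the package itself
REQUIRED_COLUMNS = ('image_id', 'identity', 'path')

def check_catalogue(df, root):
    """Return a list of problems found in the catalogue (empty if none)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
        return problems  # the remaining checks need these columns
    if not df['image_id'].is_unique:
        problems.append('image_id contains duplicates')
    # Paths are interpreted relative to the dataset root
    absent = [p for p in df['path'] if not os.path.exists(os.path.join(root, p))]
    if absent:
        problems.append(f'{len(absent)} files not found under {root}')
    return problems
```

Run on the example dataframe with root '.', this sketch would report the four missing files, which mirrors the reason given above for why the real check fails.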

To incorporate the new dataset into the list of all available datasets, the init script must be appropriately modified.

Optional: including metadata

Metadata can be included by defining a dictionary as a class attribute. Its description is in a separate file.

import pandas as pd
from wildlife_datasets.datasets import WildlifeDataset

summary = {
    'reported_n_total': 4,
    'reported_n_individuals': 2,
}

class Test(WildlifeDataset):
    summary = summary

    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df

The metadata can be accessed by

Test('.').summary
{'reported_n_total': 4, 'reported_n_individuals': 2}
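The reported numbers can be compared against the catalogue as a quick sanity check. The sketch below assumes that reported_n_total counts images and reported_n_individuals counts distinct identities. The helper compare_with_summary is hypothetical and not part of the package.

```python
def compare_with_summary(summary, identities):
    """Compare reported counts with counts derived from the identity column."""
    identities = list(identities)
    actual = {
        'reported_n_total': len(identities),
        'reported_n_individuals': len(set(identities)),
    }
    # Keep only the keys where the reported and the actual value differ
    return {
        key: (summary[key], value)
        for key, value in actual.items()
        if key in summary and summary[key] != value
    }

summary = {'reported_n_total': 4, 'reported_n_individuals': 2}
print(compare_with_summary(summary, ['Lukas', 'Vojta', 'Lukas', 'Vojta']))  # prints {}
```

The second argument can be any iterable of identities, for example Test('.').df['identity']. A non-empty result is not necessarily an error, since the numbers reported by the dataset authors may legitimately differ from the files that are actually available.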

Optional: including download

Download support is added by defining the class methods _download and _extract. The simplest way is to use the predefined classes DownloadKaggle, DownloadURL and DownloadHuggingFace. With multiple inheritance, WildlifeDataset must always be inherited last. Examples can be taken from existing datasets.
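The role of the inheritance order can be illustrated with stand-in classes. The sketch assumes that the base class ships a fallback _download which fails, so the helper class has to come first for its method to win in Python's method resolution order. BaseDataset, DownloadStub and get_data are hypothetical names used only for illustration; they are not the package's classes.

```python
class BaseDataset:
    """Stand-in for the base class; assumed to provide a failing default."""

    @classmethod
    def _download(cls):
        raise NotImplementedError('no download available')

    @classmethod
    def get_data(cls):
        # Calls whichever _download comes first in the method resolution order
        return cls._download()


class DownloadStub:
    """Stand-in for a download helper class."""

    @classmethod
    def _download(cls):
        return 'downloaded'


class Good(DownloadStub, BaseDataset):  # helper first, base class last
    pass


class Bad(BaseDataset, DownloadStub):   # base class first shadows the helper
    pass


print(Good.get_data())  # prints downloaded
# Bad.get_data() would raise NotImplementedError
```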

Kaggle offers free storage for competitions and datasets with fast downloads. The web address has two possible styles:

https://www.kaggle.com/competitions/animal-clef-2025
https://www.kaggle.com/datasets/wildlifedatasets/seaturtleid2022

It is sufficient to provide kaggle_url and kaggle_type as in the two examples below. The inherited class DownloadKaggle adds all the functions required for downloading, so only the create_catalogue function mentioned above needs to be written.

from wildlife_datasets.datasets import DownloadKaggle, WildlifeDataset

class AnimalCLEF2025(DownloadKaggle, WildlifeDataset):
    kaggle_url = 'animal-clef-2025'
    kaggle_type = 'competitions'

class SeaTurtleID2022(DownloadKaggle, WildlifeDataset):
    kaggle_url = 'wildlifedatasets/seaturtleid2022'
    kaggle_type = 'datasets'
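Judging from the two addresses listed above, the attributes correspond to the last parts of the web address. The helper below only illustrates this relation. The name kaggle_address is hypothetical, and how DownloadKaggle uses the attributes internally is not covered here.

```python
def kaggle_address(kaggle_type, kaggle_url):
    """Rebuild the Kaggle web address from the two class attributes."""
    return f'https://www.kaggle.com/{kaggle_type}/{kaggle_url}'

print(kaggle_address('competitions', 'animal-clef-2025'))
# prints https://www.kaggle.com/competitions/animal-clef-2025
print(kaggle_address('datasets', 'wildlifedatasets/seaturtleid2022'))
# prints https://www.kaggle.com/datasets/wildlifedatasets/seaturtleid2022
```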

Downloads from URL

Datasets may be stored on a private server. When there is a single file to download and extract, its address and file name are given in the url and archive attributes. The latter is usually the last part of the former. Whenever multiple files need to be downloaded, they are listed in the downloads attribute as pairs of address and file name, as the examples below show.

from wildlife_datasets.datasets import DownloadURL, WildlifeDataset

class CTai(DownloadURL, WildlifeDataset):
    url = 'https://github.com/cvjena/chimpanzee_faces/archive/refs/heads/master.zip'
    archive = 'master.zip'

class MacaqueFaces(DownloadURL, WildlifeDataset):
    downloads = [
        ('https://github.com/clwitham/MacaqueFaces/raw/master/ModelSet/MacaqueFaces.zip', 'MacaqueFaces.zip'),
        ('https://github.com/clwitham/MacaqueFaces/raw/master/ModelSet/MacaqueFaces_ImageInfo.csv', 'MacaqueFaces_ImageInfo.csv'),
    ]
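What such a download step may involve can be sketched with the standard library alone. This is an illustration under assumptions, not the code of DownloadURL: the helpers archive_name and fetch_all are hypothetical, and only zip files are unpacked.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

def archive_name(url):
    """Return the last path component of a URL."""
    return os.path.basename(urlparse(url).path)

def fetch_all(downloads, root):
    """Download every (url, file name) pair into root and unpack zip files."""
    os.makedirs(root, exist_ok=True)
    for url, filename in downloads:
        target = os.path.join(root, filename)
        urllib.request.urlretrieve(url, target)
        if filename.endswith('.zip'):
            shutil.unpack_archive(target, root)  # other files are kept as they are

print(archive_name('https://github.com/cvjena/chimpanzee_faces/archive/refs/heads/master.zip'))
# prints master.zip
```

A single url and archive pair fits the same function as a one-element list, for example fetch_all([(url, archive)], root).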

When the dataset is stored on HuggingFace, it is sufficient to provide the hf_url attribute. However, because images are stored differently in these datasets, it may be necessary to override the get_image method, as shown, for example, in this file.

from wildlife_datasets.datasets import DownloadHuggingFace, WildlifeDataset

class Chicks4FreeID(DownloadHuggingFace, WildlifeDataset):
    hf_url = 'dariakern/Chicks4FreeID'

Optional: integrating into package

New datasets may be integrated into the core package through pull requests to the GitHub repo. In such a case, the dataset should be freely downloadable, and both the download script and the metadata should be provided. The added dataset should be placed into this folder.