
How to add new datasets

Adding a new dataset is relatively easy: create a subclass of DatasetFactory that implements the create_catalogue method. A simple example is

import pandas as pd
from wildlife_datasets import datasets

class Test(datasets.DatasetFactory):
    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df

The class is then instantiated by Test('.'). The argument should point to the location where the images are stored. The dataframe can then be accessed by

Test('.').df
   image_id identity      path
0         1    Lukas  img1.jpg
1         2    Vojta  img2.jpg
2         3    Lukas  img3.jpg
3         4    Vojta  img4.jpg
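
In practice the catalogue is usually built from the files on disk rather than typed by hand. The following is a minimal sketch, assuming a hypothetical folder layout root/<identity>/<image> in which each subfolder name is the identity. The helper catalogue_from_folders and the constant IMAGE_EXTENSIONS are illustrative names and are not part of wildlife_datasets.

```python
import os
import pandas as pd

# Hypothetical list of file extensions treated as images.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def catalogue_from_folders(root: str) -> pd.DataFrame:
    """Build a catalogue from the assumed layout root/<identity>/<image>."""
    rows = []
    for identity in sorted(os.listdir(root)):
        folder = os.path.join(root, identity)
        if not os.path.isdir(folder):
            continue
        for file_name in sorted(os.listdir(folder)):
            if file_name.lower().endswith(IMAGE_EXTENSIONS):
                rows.append({
                    'identity': identity,
                    # Paths are stored relative to the dataset root.
                    'path': os.path.join(identity, file_name),
                })
    df = pd.DataFrame(rows, columns=['identity', 'path'])
    # Use a running index as a unique image identifier.
    df.insert(0, 'image_id', list(range(len(df))))
    return df
```

Inside create_catalogue, such a helper could be called with the folder passed to the constructor, provided the base class exposes that folder as an attribute (assumed here, not shown in this guide).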

The dataframe df must satisfy some requirements.

Info

Instead of returning df it is better to return self.finalize_catalogue(df). This function will perform multiple checks to verify the created dataframe. However, in this case, this check would fail because the specified file paths do not exist.
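
To illustrate the kind of checks involved, here is a minimal sketch of a stand-alone validator. It is not the implementation of finalize_catalogue: the function check_catalogue is a hypothetical helper, and the required columns and the uniqueness rule for image_id are assumptions based on the example above. Only the file-existence check is directly stated in this guide.

```python
import os
import pandas as pd

def check_catalogue(df: pd.DataFrame, root: str) -> None:
    """Illustrative checks only; not the library's finalize_catalogue."""
    # Assumed set of required columns, taken from the example above.
    required = ['image_id', 'identity', 'path']
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    # Assumed rule: every image has its own identifier.
    if df['image_id'].duplicated().any():
        raise ValueError('Column image_id must contain unique values.')
    # Every path must point to an existing file under root.
    for path in df['path']:
        if not os.path.exists(os.path.join(root, path)):
            raise FileNotFoundError(f'Image not found: {path}')
```

Calling check_catalogue(Test('.').df, '.') on the example above would raise FileNotFoundError unless img1.jpg to img4.jpg exist in the current folder.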

To incorporate the new dataset into the list of all available datasets, the init script must be appropriately modified.

Optional: including metadata

Metadata can be added by saving it in a CSV file (such as mysummary.csv). The full description of the metadata fields is in a separate file. The metadata can then be loaded in the class definition as a class attribute.

import pandas as pd
from wildlife_datasets import datasets

class Test(datasets.DatasetFactory):
    summary = datasets.Summary('docs/csv/mysummary.csv')['Test']

    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df

The metadata can be accessed by

Test('.').summary
{'licenses': 'Attribution 4.0 International (CC BY 4.0)', 'licenses_url': 'https://creativecommons.org/licenses/by/4.0/', 'animals': {'humans'}, 'real_animals': True, 'year': 2022, 'reported_n_total': 4, 'reported_n_identified': 4, 'reported_n_photos': 4, 'reported_n_individuals': 2, 'wild': True, 'clear_photos': True, 'pose': 'single', 'unique_pattern': False, 'from_video': False, 'cropped': False, 'span': '1 day'}

Optional: including download

Download support is added by implementing two class methods, _download and _extract. Examples can be taken from existing datasets.

import pandas as pd
from wildlife_datasets import datasets

class Test(datasets.DatasetFactory):
    summary = datasets.Summary('docs/csv/mysummary.csv')['Test']

    @classmethod
    def _download(cls):
        pass

    @classmethod
    def _extract(cls):
        pass

    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df
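
The bodies of _download and _extract are left empty above. As a rough sketch of what they might do, the following helpers use only the Python standard library. The names download_file and extract_archive are hypothetical and are not the utilities shipped with wildlife_datasets; existing datasets in the package remain the authoritative reference.

```python
import os
import shutil
import urllib.request

def download_file(url: str, file_name: str) -> None:
    """Download url and save it as file_name."""
    urllib.request.urlretrieve(url, file_name)

def extract_archive(file_name: str, delete: bool = False) -> None:
    """Unpack an archive into its own folder and optionally remove it."""
    target = os.path.dirname(os.path.abspath(file_name))
    # The archive format (zip, tar, ...) is inferred from the file extension.
    shutil.unpack_archive(file_name, target)
    if delete:
        os.remove(file_name)
```

With these helpers, _download could call download_file(url, 'test.zip') for some dataset URL, and _extract could call extract_archive('test.zip', delete=True). Both the URL and the archive name are placeholders.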

Optional: integrating into package

New datasets may be integrated into the core package by pull requests on the GitHub repo. In such a case, the dataset should be freely downloadable, and both the download script and the metadata should be provided. The functions should be included in the following files: