How to add new datasets
Adding a new dataset is relatively easy. It is sufficient to create a subclass of DatasetFactory that implements the create_catalogue method. A simple example is
import pandas as pd
from wildlife_datasets import datasets


class Test(datasets.DatasetFactory):
    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df
The class is then instantiated by Test('.'). The argument should point to the location where the images are stored; here '.' denotes the current directory. The dataframe can then be accessed by
Test('.').df
   image_id identity      path
0         1    Lukas  img1.jpg
1         2    Vojta  img2.jpg
2         3    Lukas  img3.jpg
3         4    Vojta  img4.jpg
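A real dataset would typically build its catalogue from the files stored at that location rather than from hand-typed lists. The sketch below shows one way to do this, assuming a hypothetical layout in which every identity has its own subfolder (root/identity/image). The helper name scan_images and the folder layout are illustrative assumptions, not part of the package.

```python
import os


def scan_images(root, extensions=('.jpg', '.jpeg', '.png')):
    """Collect one record per image, using the parent folder name as identity.

    Hypothetical helper: assumes images are stored as root/<identity>/<file>.
    """
    records = []
    for folder, _, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                full_path = os.path.join(folder, name)
                records.append({
                    'identity': os.path.basename(folder),
                    # Paths are stored relative to the root folder.
                    'path': os.path.relpath(full_path, root),
                })
    # Sort for a reproducible order, then assign unique image ids.
    records.sort(key=lambda record: record['path'])
    for image_id, record in enumerate(records):
        record['image_id'] = image_id
    return records
```

Inside create_catalogue, the result could be wrapped as pd.DataFrame(scan_images(root)), where root is the folder passed to the class.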
The dataframe df must satisfy some requirements.
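As an illustration of what such requirements may look like, the following sketch checks two plausible conditions: that the three columns used in the example are present and that image_id is unique. Both conditions and the helper name check_catalogue are assumptions made here for illustration; the authoritative list is the one in the requirements documentation.

```python
import pandas as pd

# Assumed for illustration; the package documentation defines the real requirements.
REQUIRED_COLUMNS = ['image_id', 'identity', 'path']


def check_catalogue(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in the catalogue."""
    problems = []
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    # Each image should be identified by exactly one row.
    if 'image_id' in df.columns and df['image_id'].duplicated().any():
        problems.append('image_id contains duplicates')
    return problems
```

An empty list means that none of the checked conditions was violated.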
Info
Instead of returning df, it is better to return self.finalize_catalogue(df). This function performs multiple checks to verify the created dataframe. In this example, however, the check would fail because the specified image files do not exist.
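To see in advance which entries would trip a file-existence check, the paths can be tested by hand. The helper below is a minimal sketch using only the standard library; the name missing_files is hypothetical and the code is not the package's own implementation of the check.

```python
from pathlib import Path


def missing_files(root, relative_paths):
    """Return the relative paths that do not point to an existing file under root."""
    root = Path(root)
    return [path for path in relative_paths if not (root / path).is_file()]
```

For the example above, missing_files('.', df['path']) would list all four file names unless the images actually exist in the current directory.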
To incorporate the new dataset into the list of all available datasets, the init script must be appropriately modified.
Optional: including metadata
Metadata can be added by saving it in a CSV file (such as mysummary.csv). The full description of the metadata fields is in a separate file. The metadata can then be loaded in the class definition as a class attribute.
import pandas as pd
from wildlife_datasets import datasets


class Test(datasets.DatasetFactory):
    summary = datasets.Summary('docs/csv/mysummary.csv')['Test']

    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df
The metadata can be accessed by
Test('.').summary
{'licenses': 'Attribution 4.0 International (CC BY 4.0)',
 'licenses_url': 'https://creativecommons.org/licenses/by/4.0/',
 'animals': {'humans'},
 'real_animals': True,
 'year': 2022,
 'reported_n_total': 4,
 'reported_n_identified': 4,
 'reported_n_photos': 4,
 'reported_n_individuals': 2,
 'wild': True,
 'clear_photos': True,
 'pose': 'single',
 'unique_pattern': False,
 'from_video': False,
 'cropped': False,
 'span': '1 day'}
Optional: including download
Download support is added by implementing two class methods, _download and _extract. Examples can be taken from existing datasets.
import pandas as pd
from wildlife_datasets import datasets


class Test(datasets.DatasetFactory):
    summary = datasets.Summary('docs/csv/mysummary.csv')['Test']

    @classmethod
    def _download(cls):
        # Placeholder: download the raw data here.
        pass

    @classmethod
    def _extract(cls):
        # Placeholder: extract the downloaded data here.
        pass

    def create_catalogue(self) -> pd.DataFrame:
        df = pd.DataFrame({
            'image_id': [1, 2, 3, 4],
            'identity': ['Lukas', 'Vojta', 'Lukas', 'Vojta'],
            'path': ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img4.jpg'],
        })
        return df
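The two placeholder methods could be filled in along the following lines. This is only a sketch using the Python standard library: the class name DownloadSketch, the url and archive attributes, and the example.com address are hypothetical, and the code assumes that both methods run with the working directory set to the folder where the data should end up. It is written as a plain class so that it runs without the package; in a real dataset the two methods would sit in the DatasetFactory subclass, and the existing datasets remain the best reference for the conventions actually used.

```python
import shutil
import urllib.request


class DownloadSketch:
    # Hypothetical values; replace them with the real location of the data.
    url = 'https://example.com/test_dataset.zip'
    archive = 'test_dataset.zip'

    @classmethod
    def _download(cls):
        # Fetch the archive into the current working directory.
        urllib.request.urlretrieve(cls.url, cls.archive)

    @classmethod
    def _extract(cls):
        # Unpack the archive in place; the format is inferred from the file extension.
        shutil.unpack_archive(cls.archive, '.')
```

Keeping the download and the extraction separate means that a failed extraction can be retried without fetching the archive again.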
Optional: integrating into package
New datasets may be integrated into the core package by pull requests on the GitHub repo. In such a case, the dataset should be freely downloadable, and both the download script and the metadata should be provided. The functions should be included in the following files:
- DatasetFactory definition in datasets.py.
- Metadata as an extension to the existing summary.csv.