Reference datasets

This file describes methods associated with dataset creation and metadata.

WildlifeDataset

Base class for creating datasets.

Attributes:

Name Type Description
df DataFrame

A full dataframe of the data.

summary dict

Summary of the dataset.

root str

Root directory for the data.

update_wrong_labels bool

Whether fix_labels should be called.

unknown_name str

Name of the unknown class.

outdated_dataset bool

Tracks whether dataset was replaced by a new version.

determined_by_df bool

Specifies whether dataset is completely determined by its dataframe.

saved_to_system_folder bool

Specifies whether dataset is saved to system (hidden) folders.

remove_columns bool

Specifies whether constant columns are removed in finalize_catalogue.

check_files bool

Specifies whether files should be checked for existence in finalize_catalogue.

transform Callable

Applied transform when loading the image.

img_load str

Specifies how the image is loaded, for example whether a segmentation or bounding box is applied (see apply_segmentation).

labels_string List[str]

List of labels in strings.

load_label bool

Whether dataset[k] should return only image or also identity.

factorize_label bool

Whether labels are returned factorized (integers) or original (possibly strings).

col_path str

Column name containing image paths.

col_label str

Column name containing individual animal names (labels).

__getitem__(idx)

Loads the image at positional index (iloc) idx, with self.transform and self.img_load applied.

Parameters:

Name Type Description Default
idx int

Index of the image.

required

Returns:

Type Description
Image | tuple[Image, int | str]

Loaded image, or a tuple of the loaded image and its label when load_label is True.
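The interaction between load_label and factorize_label can be sketched without the library. The class below is a hypothetical stand-in, not WildlifeDataset itself: it keeps rows as plain dictionaries, returns the path string in place of a loaded image, and assumes that integer codes are assigned in order of first appearance.

```python
class TinyDataset:
    """Hypothetical stand-in that mimics the return shape of dataset[k]."""

    def __init__(self, rows, load_label=False, factorize_label=False,
                 col_path="path", col_label="identity"):
        self.rows = rows
        self.load_label = load_label
        self.factorize_label = factorize_label
        self.col_path = col_path
        self.col_label = col_label
        # Assumption: codes follow the order in which labels first appear.
        unique_labels = list(dict.fromkeys(r[col_label] for r in rows))
        self._codes = {name: i for i, name in enumerate(unique_labels)}

    def __getitem__(self, idx):
        row = self.rows[idx]
        img = row[self.col_path]  # placeholder for the loaded image
        if not self.load_label:
            return img
        label = row[self.col_label]
        if self.factorize_label:
            label = self._codes[label]
        return img, label


rows = [
    {"path": "a.jpg", "identity": "lion_1"},
    {"path": "b.jpg", "identity": "lion_2"},
    {"path": "c.jpg", "identity": "lion_1"},
]
only_image = TinyDataset(rows)[2]
with_string = TinyDataset(rows, load_label=True)[2]
with_code = TinyDataset(rows, load_label=True, factorize_label=True)[2]
```

With load_label left at False the item is the image alone; with it set to True the item is a pair, and factorize_label decides whether the second element is the original label or its integer code.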

__init__(root=None, df=None, update_wrong_labels=True, transform=None, img_load='full', remove_unknown=False, remove_columns=False, check_files=True, load_label=False, factorize_label=False, col_path='path', col_label='identity', **kwargs)

Initializes the class.

If df is specified, it copies it. Otherwise, it creates it by the create_catalogue method.

Parameters:

Name Type Description Default
root Optional[str]

Root directory for the data.

None
df Optional[DataFrame]

A full dataframe of the data.

None
update_wrong_labels bool

Whether fix_labels should be called.

True
transform Optional[Callable]

Applied transform when loading the image.

None
img_load str

Specifies how the image is loaded, for example whether a segmentation or bounding box is applied (see apply_segmentation).

'full'
remove_unknown bool

Whether unknown identities should be removed.

False
remove_columns bool

Whether constant columns are removed in finalize_catalogue.

False
check_files bool

Whether files should be checked for existence in finalize_catalogue.

True
load_label bool

Whether dataset[k] should return only image or also identity.

False
factorize_label bool

Whether labels are returned factorized (integers) or original (possibly strings).

False
col_path str

Column name containing image paths.

'path'
col_label str

Column name containing individual animal names (labels).

'identity'

apply_segmentation(img, idx)

Applies segmentation or bounding box when loading an image.

Parameters:

Name Type Description Default
img Image

Loaded image.

required
idx int

Index of the image.

required

Returns:

Type Description
Image

Image with the segmentation or bounding box applied.

check_files_exist(col=None)

Checks if paths in a given column exist.

Parameters:

Name Type Description Default
col Optional[Series | str]

A column of a dataframe.

None

check_files_names(col=None)

Checks if paths contain characters which may cause issues.

Parameters:

Name Type Description Default
col Optional[Series | str]

A column of a dataframe.

None

check_required_columns(df=None)

Check if all required columns are present.

Parameters:

Name Type Description Default
df Optional[DataFrame]

A full dataframe of the data.

None

check_types_column(col, col_name, allowed_types)

Checks if the column col is in the format allowed_types.

Parameters:

Name Type Description Default
col Series

Column to be checked.

required
col_name str

Column name used only for raising exceptions.

required
allowed_types List[str]

List of strings with allowed values: int (all values must be integers), str (strings), list (lists), list_numeric (lists with numeric values), date (dates as tested by pd.to_datetime).

required

check_types_columns(df=None)

Checks if columns are in correct formats.

The formats are specified in requirements, which is a list of tuples. In each tuple, the first value is the name of the column and the second value is a list of formats. The column must match at least one of the formats.

Parameters:

Name Type Description Default
df Optional[DataFrame]

A full dataframe of the data.

None
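The format check described for check_types_column and check_types_columns can be sketched in plain Python. This is an illustration under assumptions, not the library's code: columns are plain lists rather than Series, check_column and the requirements shown are hypothetical, and ISO 8601 parsing stands in for pd.to_datetime.

```python
from datetime import datetime


def _is_date(value):
    # Stand-in for pd.to_datetime: accepts ISO 8601 strings only.
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# One predicate per format name listed for allowed_types.
CHECKS = {
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "str": lambda v: isinstance(v, str),
    "list": lambda v: isinstance(v, list),
    "list_numeric": lambda v: isinstance(v, list)
    and all(isinstance(x, (int, float)) for x in v),
    "date": _is_date,
}


def check_column(values, col_name, allowed_types):
    """Return the first matching format, or raise if none matches."""
    for fmt in allowed_types:
        if all(CHECKS[fmt](v) for v in values):
            return fmt
    raise TypeError(f"Column {col_name} has none of the formats {allowed_types}.")


# Hypothetical requirements: (column name, list of accepted formats).
requirements = [("identity", ["int", "str"]), ("bbox", ["list_numeric"])]
columns = {"identity": ["a", "b"], "bbox": [[0, 0, 10, 5], [1, 2, 3.5, 4]]}
matched = {name: check_column(columns[name], name, fmts)
           for name, fmts in requirements}
```

A column passes as soon as every value satisfies one of its allowed formats; a column mixing integers and strings would fail both the int and the str check.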

check_unique_id(df=None)

Checks if values in the id column are unique.

Parameters:

Name Type Description Default
df Optional[DataFrame]

A full dataframe of the data.

None

create_catalogue()

Creates the dataframe.

Raises:

Type Description
NotImplementedError

Needs to be implemented by subclasses.

display_name() classmethod

Returns name of the dataset without the v2 ending.

Returns:

Type Description
str

Name of the dataset.

download(root, force=False, **kwargs) classmethod

Downloads the data. Wrapper around cls._download.

Parameters:

Name Type Description Default
root str

Where the data should be stored.

required
force bool

Whether root should be overwritten if it already exists.

False

extract(root, **kwargs) classmethod

Extract the data. Wrapper around cls._extract.

Parameters:

Name Type Description Default
root str

Where the data should be stored.

required

finalize_catalogue(df=None)

Reorders the dataframe and checks file paths.

Reorders the columns and removes constant columns. Checks if columns are in correct formats. Checks if ids are unique and if all files exist.

Parameters:

Name Type Description Default
df Optional[DataFrame]

A full dataframe of the data.

None

Returns:

Type Description
DataFrame

A full dataframe of the data, slightly modified.

fix_labels(df)

Fixes labels in dataframe.

Automatically called in finalize_catalogue.

fix_labels_remove_identity(df, identities_to_remove, col='identity')

Removes all instances of identities.

Parameters:

Name Type Description Default
df DataFrame

A full dataframe of the data.

required
identities_to_remove List

List of identities to remove.

required
col str

Column to remove from.

'identity'

Returns:

Type Description
DataFrame

A full dataframe of the data.

fix_labels_replace_identity(df, replace_identity, col='identity')

Replaces all instances of identities.

Parameters:

Name Type Description Default
df DataFrame

A full dataframe of the data.

required
replace_identity List[Tuple]

List of (old_identity, new_identity)

required
col str

Column to replace in.

'identity'

Returns:

Type Description
DataFrame

A full dataframe of the data.

fix_labels_replace_images(df, replace_identity, col='identity')

Replaces specified images with specified identities.

It looks for image_name as a substring of the paths in df[self.col_path]. Because the match depends on the path separator, it may cause problems with os.path.sep.

Parameters:

Name Type Description Default
df DataFrame

A full dataframe of the data.

required
replace_identity List[Tuple]

List of (image_name, old_identity, new_identity).

required
col str

Column to replace in.

'identity'

Returns:

Type Description
DataFrame

A full dataframe of the data.
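The warning about os.path.sep can be illustrated without the library. The helper below is a hypothetical sketch that works on a list of dictionaries instead of a dataframe and assumes a plain substring test on the path; the actual method may match differently.

```python
def replace_images(rows, replace_identity, col_path="path", col="identity"):
    """Relabel rows whose path contains image_name and whose label is old_identity."""
    fixed = [dict(row) for row in rows]  # copy so the input is not mutated
    for image_name, old_identity, new_identity in replace_identity:
        for row in fixed:
            if image_name in row[col_path] and row[col] == old_identity:
                row[col] = new_identity
    return fixed


rows = [
    {"path": "images/zebra/001.jpg", "identity": "z1"},
    {"path": "images\\zebra\\002.jpg", "identity": "z1"},  # Windows-style path
]
# Both tuples use forward slashes, so only the first row can match.
fixes = [("zebra/001.jpg", "z1", "z7"), ("zebra/002.jpg", "z1", "z9")]
result = replace_images(rows, fixes)
```

The second row keeps its old identity because its path uses backslashes. Passing only the file name, without any folder component, avoids the separator issue as long as file names are unique.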

get_data(root, force=False, **kwargs) classmethod

Downloads and extracts the data. Wrapper around cls._download and cls._extract.

Parameters:

Name Type Description Default
root str

Where the data should be stored.

required
force bool

Whether root should be overwritten if it already exists.

False

get_image(idx)

Load an image with iloc idx.

Parameters:

Name Type Description Default
idx int

Index of the image.

required

Returns:

Type Description
Image

Loaded image.

get_subset(idx)

Returns a subset of the class.

Parameters:

Name Type Description Default
idx Union[List[int], List[bool]]

Indices in the dataframe of the subset.

required

Returns:

Type Description
WildlifeDataset

The subset class.
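The two accepted forms of idx, a boolean mask and a list of integer positions, select rows in different ways. The helper below is a hypothetical sketch of how the two forms relate; how the library itself distinguishes them is an assumption here.

```python
def to_positions(idx, n_rows):
    """Turn a boolean mask or a list of integer positions into positions."""
    idx = list(idx)
    if idx and all(isinstance(i, bool) for i in idx):
        if len(idx) != n_rows:
            raise ValueError("A boolean mask needs one entry per row.")
        return [pos for pos, keep in enumerate(idx) if keep]
    return idx


paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
by_mask = [paths[i] for i in to_positions([True, False, True, False], len(paths))]
by_position = [paths[i] for i in to_positions([0, 2], len(paths))]
```

Both calls pick the first and third rows. A mask must cover every row, while a list of positions can have any length.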

load_image(path)

Load an image with path.

Parameters:

Name Type Description Default
path str

Path to the image.

required

Returns:

Type Description
Image

Loaded image.

plot_grid(n_rows=5, n_cols=8, offset=10, img_min=100, rotate=True, keep_aspect_ratios=True, header_cols=None, idx=None, background_color=(0, 0, 0), keep_transform=False, **kwargs)

Plots a grid of size (n_rows, n_cols) with images from the dataframe.

Parameters:

Name Type Description Default
n_rows int

The number of rows in the grid.

5
n_cols int

The number of columns in the grid.

8
offset int

The offset between images.

10
img_min float

The minimal size of the plotted images.

100
rotate bool

Rotates the images to have the same orientation.

True
keep_aspect_ratios bool

Whether aspect ratios are kept for images.

True
header_cols Optional[Sequence[str]]

List of headers for each column.

None
idx Optional[Union[Sequence[bool], Sequence[int]]]

List of indices to plot. None plots random images. Index -1 plots an empty image.

None
background_color Tuple[int, int, int]

Background color of the grid.

(0, 0, 0)
keep_transform bool

Whether self.transform is applied.

False

remove_constant_columns(df=None)

Removes columns with a single unique value.

Parameters:

Name Type Description Default
df Optional[DataFrame]

A full dataframe of the data.

None

Returns:

Type Description
DataFrame

A full dataframe of the data, slightly modified.

reorder_df(df)

Reorders rows and columns in the dataframe.

Rows are sorted based on id. Columns are reordered based on the default_order list.

Parameters:

Name Type Description Default
df DataFrame

A full dataframe of the data.

required

Returns:

Type Description
DataFrame

A full dataframe of the data, slightly modified.

Utils

bbox_segmentation(bbox)

Convert bounding box to segmentation.

Parameters:

Name Type Description Default
bbox List[float]

Bounding box in the form [x, y, w, h].

required

Returns:

Type Description
list[float]

Segmentation mask in the form [x1, y1, x2, y2, ...].
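A minimal sketch of the conversion, assuming the segmentation lists the four corners of the box clockwise from the top-left. The function name bbox_to_polygon and the corner order are assumptions; the library may order the corners differently or repeat the first point to close the polygon.

```python
def bbox_to_polygon(bbox):
    """Return the four corners of [x, y, w, h] as a flat coordinate list."""
    x, y, w, h = bbox
    # Assumption: corners run top-left, top-right, bottom-right, bottom-left.
    return [x, y, x + w, y, x + w, y + h, x, y + h]


polygon = bbox_to_polygon([10, 20, 30, 40])
# polygon is [10, 20, 40, 20, 40, 60, 10, 60]
```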

create_id(string_col)

Creates unique ids from strings based on their MD5 hashes.

Parameters:

Name Type Description Default
string_col Series

List of ids.

required

Returns:

Type Description
Series

List of encoded ids.
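The idea of hash-based ids can be sketched with hashlib from the standard library. The helper md5_ids, the truncation to 16 hexadecimal characters, and the use of a plain list instead of a Series are all assumptions made for illustration.

```python
import hashlib


def md5_ids(strings, length=16):
    """Hash each string with MD5 and keep the first `length` hex characters."""
    ids = [hashlib.md5(s.encode("utf-8")).hexdigest()[:length] for s in strings]
    # Truncation can collide in principle, so verify uniqueness explicitly.
    if len(set(ids)) != len(ids):
        raise ValueError("Hashed ids are not unique.")
    return ids


ids = md5_ids(["images/a.jpg", "images/b.jpg"])
```

Because the hash depends only on the input string, the same path always yields the same id, which keeps ids stable when the dataframe is rebuilt.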

crop_black(img)

Crops black borders from an image.

Parameters:

Name Type Description Default
img Image

Image to be cropped.

required

Returns:

Type Description
Image

Cropped image.

crop_white(img)

Crops white borders from an image.

Parameters:

Name Type Description Default
img Image

Image to be cropped.

required

Returns:

Type Description
Image

Cropped image.

data_directory(dir)

Changes context such that data directory is used as current work directory. Data directory is created if it does not exist.
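A context manager of this kind can be sketched with contextlib. The name working_directory is hypothetical and the code is an illustration of the pattern, not the library's implementation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the working directory, creating it if needed."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # restore even if the body raises


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data")
    before = os.getcwd()
    with working_directory(target):
        inside = os.getcwd()
    after = os.getcwd()
```

Restoring the previous directory in a finally clause means a failed download or extraction does not leave the process in the data directory.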

find_file_types(root)

Counts all file extensions, converted to lowercase, in a folder and its subfolders.

Parameters:

Name Type Description Default
root str

The root folder where to look for data.

required

Returns:

Type Description
Series

Series of counts of the extensions.
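Counting extensions amounts to walking the folder tree and tallying the lowercased suffix of each file name. The sketch below uses a Counter instead of a Series, and the name count_extensions is hypothetical.

```python
import os
import tempfile
from collections import Counter


def count_extensions(root):
    """Count lowercased file extensions under root, including subfolders."""
    counts = Counter()
    for _, _, files in os.walk(root):
        for name in files:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.JPG", "b.jpg", os.path.join("sub", "c.png"), "notes.txt"):
        open(os.path.join(root, rel), "w").close()
    counts = count_extensions(root)
```

Lowercasing merges .JPG and .jpg into one entry, which makes it easier to spot unexpected file types in a freshly extracted dataset.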

find_images(root, img_extensions=('.png', '.jpg', '.jpeg', '.tiff', '.bmp'))

Finds all image files in folder and subfolders.

Parameters:

Name Type Description Default
root str

The root folder where to look for images.

required
img_extensions Tuple[str, ...]

Image extensions to look for.

('.png', '.jpg', '.jpeg', '.tiff', '.bmp')

Returns:

Type Description
DataFrame

Dataframe of relative paths of the images.
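A recursive image search can be sketched with os.walk. The helper list_images is hypothetical, returns a sorted list rather than a dataframe, and assumes that extensions are matched case-insensitively.

```python
import os
import tempfile


def list_images(root, img_extensions=(".png", ".jpg", ".jpeg", ".tiff", ".bmp")):
    """Return sorted paths, relative to root, of files with an image extension."""
    found = []
    for folder, _, files in os.walk(root):
        for name in files:
            # Assumption: the extension comparison ignores case.
            if name.lower().endswith(tuple(img_extensions)):
                full = os.path.join(folder, name)
                found.append(os.path.relpath(full, root))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "cam1"))
    for rel in ("a.jpg", os.path.join("cam1", "b.PNG"), "readme.md"):
        open(os.path.join(root, rel), "w").close()
    images = list_images(root)
```

Paths are kept relative to root so that the resulting catalogue stays valid when the data folder is moved.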

is_annotation_bbox(segmentation, bbox, tol=0)

Checks whether a segmentation is a bounding box.

Parameters:

Name Type Description Default
segmentation List[float]

Segmentation mask in the form [x1, y1, x2, y2, ...].

required
bbox List[float]

Bounding box in the form [x, y, w, h].

required
tol float

Tolerance for difference.

0

Returns:

Type Description
bool

True if the segmentation is a bounding box within the tolerance.
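One way to perform such a check is to compare the segmentation with the corners of the box coordinate by coordinate. The sketch below is an assumption about the approach, with a hypothetical name and a fixed corner order; a polygon that traces the same rectangle from a different starting corner would be rejected by it.

```python
def looks_like_bbox(segmentation, bbox, tol=0):
    """Check whether a flat polygon coincides with the corners of [x, y, w, h]."""
    x, y, w, h = bbox
    # Assumption: corners run top-left, top-right, bottom-right, bottom-left.
    corners = [x, y, x + w, y, x + w, y + h, x, y + h]
    if len(segmentation) != len(corners):
        return False
    return all(abs(a - b) <= tol for a, b in zip(segmentation, corners))


exact = looks_like_bbox([0, 0, 4, 0, 4, 3, 0, 3], [0, 0, 4, 3])
close = looks_like_bbox([0, 0, 4.4, 0, 4, 3, 0, 3], [0, 0, 4, 3], tol=0.5)
too_far = looks_like_bbox([0, 0, 4.4, 0, 4, 3, 0, 3], [0, 0, 4, 3])
```

With the default tolerance of 0 only an exact match passes; a positive tol absorbs small rounding differences between the two annotations.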

load_image(path, max_size=None)

Loads an image.

Parameters:

Name Type Description Default
path str

Path of the image.

required
max_size Optional[int]

Maximal size of the image or None (no restriction).

None

Returns:

Type Description
Image

Loaded image.

segmentation_bbox(segmentation)

Convert segmentation to bounding box.

Parameters:

Name Type Description Default
segmentation List[float]

Segmentation mask in the form [x1, y1, x2, y2, ...].

required

Returns:

Type Description
list[float]

Bounding box in the form [x, y, w, h].
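The enclosing box of a polygon follows from the minimum and maximum of its x and y coordinates. A minimal sketch, with the hypothetical name polygon_to_bbox:

```python
def polygon_to_bbox(segmentation):
    """Return [x, y, w, h] enclosing a flat list [x1, y1, x2, y2, ...]."""
    xs = segmentation[0::2]  # even positions hold x coordinates
    ys = segmentation[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


bbox = polygon_to_bbox([10, 20, 40, 25, 30, 60])
# bbox is [10, 20, 30, 40]
```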