Reference datasets
This file describes methods associated with dataset creation and metadata.
WildlifeDataset
Base class for creating datasets.
Attributes:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | A full dataframe of the data. |
| summary | dict | Summary of the dataset. |
| root | str | Root directory for the data. |
| update_wrong_labels | bool | Whether |
| unknown_name | str | Name of the unknown class. |
| outdated_dataset | bool | Tracks whether the dataset was replaced by a new version. |
| determined_by_df | bool | Specifies whether the dataset is completely determined by its dataframe. |
| saved_to_system_folder | bool | Specifies whether the dataset is saved to system (hidden) folders. |
| remove_columns | bool | Specifies whether constant columns are removed in finalize_catalogue. |
| check_files | bool | Specifies whether files should be checked for existence in finalize_catalogue. |
| transform | Callable | Applied transform when loading the image. |
| img_load | str | Applied transform when loading the image. |
| labels_string | List[str] | List of labels as strings. |
| load_label | bool | Whether dataset[k] should return only the image or also the identity. |
| factorize_label | bool | Whether labels are returned factorized (integers) or original (possibly strings). |
| col_path | str | Column name containing image paths. |
| col_label | str | Column name containing individual animal names (labels). |
__getitem__(idx)
Loads the image at position idx (iloc) with the transforms self.transform and self.img_load applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index of the image. | required |
Returns:
| Type | Description |
|---|---|
| Image \| tuple[Image, int \| str] | Loaded image, or the loaded image together with its label when load_label is set. |
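The interplay of load_label and factorize_label can be pictured with a small sketch that does not depend on the library. The helpers `factorize` and `get_item` are hypothetical, strings stand in for loaded images, and assigning integer codes in order of first appearance is an assumption rather than documented behavior.

```python
from typing import List, Tuple


def factorize(labels: List[str]) -> Tuple[List[int], List[str]]:
    """Map each distinct label to an integer code (assumed: order of first appearance)."""
    codes: List[int] = []
    uniques: List[str] = []
    lookup = {}
    for label in labels:
        if label not in lookup:
            lookup[label] = len(uniques)
            uniques.append(label)
        codes.append(lookup[label])
    return codes, uniques


def get_item(images, labels, idx, load_label=False, factorize_label=False):
    """Return only the image, or (image, label) when load_label is set."""
    if not load_label:
        return images[idx]
    if factorize_label:
        codes, _ = factorize(labels)
        return images[idx], codes[idx]
    return images[idx], labels[idx]


images = ["img_a", "img_b", "img_c"]  # stand-ins for loaded images
labels = ["turtle_7", "turtle_2", "turtle_7"]

only_image = get_item(images, labels, 2)
with_name = get_item(images, labels, 2, load_label=True)
with_code = get_item(images, labels, 2, load_label=True, factorize_label=True)
```

With factorization enabled, both images of `turtle_7` share one integer code, which is the form most training loops expect.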
__init__(root=None, df=None, update_wrong_labels=True, transform=None, img_load='full', remove_unknown=False, remove_columns=False, check_files=True, load_label=False, factorize_label=False, col_path='path', col_label='identity', **kwargs)
Initializes the class.
If df is specified, it is copied. Otherwise, the dataframe is created by the create_catalogue method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | Optional[str] | Root directory for the data. | None |
| df | Optional[DataFrame] | A full dataframe of the data. | None |
| update_wrong_labels | bool | Whether | True |
| transform | Optional[Callable] | Applied transform when loading the image. | None |
| img_load | str | Applied transform when loading the image. | 'full' |
| remove_unknown | bool | Whether unknown identities should be removed. | False |
| remove_columns | bool | Whether constant columns are removed in finalize_catalogue. | False |
| check_files | bool | Whether files should be checked for existence in finalize_catalogue. | True |
| load_label | bool | Whether dataset[k] should return only the image or also the identity. | False |
| factorize_label | bool | Whether labels are returned factorized (integers) or original (possibly strings). | False |
| col_path | str | Column name containing image paths. | 'path' |
| col_label | str | Column name containing individual animal names (labels). | 'identity' |
apply_segmentation(img, idx)
Applies segmentation or bounding box when loading an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| img | Image | Loaded image. | required |
| idx | int | Index of the image. | required |
Returns:
| Type | Description |
|---|---|
| Image | Loaded image. |
check_files_exist(col=None)
Checks if paths in a given column exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| col | Optional[Series \| str] | A column of a dataframe. | None |
check_files_names(col=None)
Checks if paths contain characters which may cause issues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| col | Optional[Series \| str] | A column of a dataframe. | None |
check_required_columns(df=None)
Check if all required columns are present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | A full dataframe of the data. | None |
check_types_column(col, col_name, allowed_types)
Checks if the column col is in the format allowed_types.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| col | Series | Column to be checked. | required |
| col_name | str | Column name used only for raising exceptions. | required |
| allowed_types | List[str] | List of strings with allowed values: | required |
check_types_columns(df=None)
Checks if columns are in correct formats.
The formats are specified in requirements, which is a list of tuples. The first value is the name of the column and the second value is a list of formats. The column must match at least one of the formats.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | A full dataframe of the data. | None |
check_unique_id(df=None)
Checks if values in the id column are unique.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | A full dataframe of the data. | None |
create_catalogue()
Creates the dataframe.
Raises:
| Type | Description |
|---|---|
| NotImplementedError | Needs to be implemented by subclasses. |
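The division of labor between __init__ and create_catalogue can be sketched without the library. The classes `BaseDataset` and `ToyDataset` below are hypothetical, a list of dictionaries stands in for the dataframe, and the column name `image_id` is only illustrative. This is a sketch of the pattern under those assumptions, not the library's implementation.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


class BaseDataset:
    """Hypothetical stand-in for a dataset base class (not the library's code)."""

    def __init__(self, root: Optional[str] = None, df: Optional[List[Row]] = None):
        self.root = root
        if df is None:
            # No table supplied: let the subclass build it.
            self.df = self.create_catalogue()
        else:
            # Table supplied: store a copy so the caller's data stays untouched.
            self.df = [dict(row) for row in df]

    def create_catalogue(self) -> List[Row]:
        raise NotImplementedError("Needs to be implemented by subclasses.")


class ToyDataset(BaseDataset):
    """Hypothetical subclass that lists two images."""

    def create_catalogue(self) -> List[Row]:
        return [
            {"image_id": "0", "path": "a.jpg", "identity": "turtle_1"},
            {"image_id": "1", "path": "b.jpg", "identity": "turtle_2"},
        ]


built = ToyDataset(root="data")  # catalogue created by the subclass
rows = [{"image_id": "9", "path": "z.jpg", "identity": "unknown"}]
copied = ToyDataset(df=rows)  # catalogue copied from the argument

try:
    BaseDataset()
    base_failed = False
except NotImplementedError:
    base_failed = True
```

Instantiating the base class without a dataframe fails, which mirrors the NotImplementedError listed above.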
display_name()
classmethod
Returns name of the dataset without the v2 ending.
Returns:
| Type | Description |
|---|---|
| str | Name of the dataset. |
download(root, force=False, **kwargs)
classmethod
Downloads the data. Wrapper around cls._download.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Where the data should be stored. | required |
| force | bool | If the root exists, whether it should be overwritten. | False |
extract(root, **kwargs)
classmethod
Extract the data. Wrapper around cls._extract.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Where the data should be stored. | required |
finalize_catalogue(df=None)
Reorders the dataframe and checks file paths.
Reorders the columns and removes constant columns. Checks if columns are in correct formats. Checks if ids are unique and if all files exist.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | A full dataframe of the data. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data, slightly modified. |
fix_labels(df)
Fixes labels in dataframe.
Automatically called in finalize_catalogue.
fix_labels_remove_identity(df, identities_to_remove, col='identity')
Removes all instances of identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A full dataframe of the data. | required |
| identities_to_remove | List | List of identities to remove. | required |
| col | str | Column to remove from. | 'identity' |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data. |
fix_labels_replace_identity(df, replace_identity, col='identity')
Replaces all instances of identities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A full dataframe of the data. | required |
| replace_identity | List[Tuple] | List of (old_identity, new_identity). | required |
| col | str | Column to replace in. | 'identity' |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data. |
fix_labels_replace_images(df, replace_identity, col='identity')
Replaces specified images with specified identities.
It looks for image_name as a substring of the entries in df[self.col_path].
This may cause problems with os.path.sep.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A full dataframe of the data. | required |
| replace_identity | List[Tuple] | List of (image_name, old_identity, new_identity). | required |
| col | str | Column to replace in. | 'identity' |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data. |
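The path separator caveat is easier to see on a small example. The sketch below uses a list of dictionaries in place of the dataframe and a hypothetical helper `replace_images`. It assumes a row is relabeled only when its path contains image_name and its current label equals old_identity, and the library's exact matching rule may differ.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def replace_images(
    rows: List[Row],
    replace_identity: List[Tuple[str, str, str]],
    col_path: str = "path",
    col: str = "identity",
) -> List[Row]:
    """Relabel rows whose path contains image_name and whose label is old_identity."""
    fixed = [dict(row) for row in rows]  # work on a copy
    for image_name, old_identity, new_identity in replace_identity:
        for row in fixed:
            if image_name in row[col_path] and row[col] == old_identity:
                row[col] = new_identity
    return fixed


rows = [
    {"path": "site1/IMG_001.jpg", "identity": "A"},
    {"path": "site1/IMG_002.jpg", "identity": "A"},
    {"path": "site2\\IMG_003.jpg", "identity": "B"},
]

# Matching on the bare file name does not depend on the path separator.
fixed = replace_images(rows, [("IMG_002", "A", "C"), ("IMG_003", "B", "D")])

# A pattern written with "/" does not match a path stored with "\\".
missed = replace_images(rows, [("site2/IMG_003", "B", "D")])
```

Keeping image_name free of directory separators, as in the first call, sidesteps the problem on any operating system.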
get_data(root, force=False, **kwargs)
classmethod
Downloads and extracts the data. Wrapper around cls._download and cls._extract.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | Where the data should be stored. | required |
| force | bool | If the root exists, whether it should be overwritten. | False |
get_image(idx)
Load an image with iloc idx.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | int | Index of the image. | required |
Returns:
| Type | Description |
|---|---|
| Image | Loaded image. |
get_subset(idx)
Returns a subset of the class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| idx | Union[List[int], List[bool]] | Indices in the dataframe of the subset. | required |
Returns:
| Type | Description |
|---|---|
| WildlifeDataset | The subset class. |
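The two accepted forms of idx, a list of integers or a boolean mask, can be illustrated with a library-independent sketch. The helper `take_rows` is hypothetical, and treating integers as positions (rather than dataframe index labels) is an assumption.

```python
from typing import List, Sequence, Union


def take_rows(rows: Sequence, idx: Union[List[int], List[bool]]) -> list:
    """Select rows by integer positions or by a boolean mask of the same length."""
    # bool is a subclass of int, so the mask case is tested for explicitly.
    if len(idx) > 0 and all(isinstance(i, bool) for i in idx):
        if len(idx) != len(rows):
            raise ValueError("Boolean mask must have one entry per row.")
        return [row for row, keep in zip(rows, idx) if keep]
    return [rows[i] for i in idx]


rows = ["row0", "row1", "row2", "row3"]
by_position = take_rows(rows, [3, 0])
by_mask = take_rows(rows, [True, False, True, False])
```

A mask keeps the original row order, whereas integer indices return rows in the order requested.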
load_image(path)
Load an image with path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path to the image. | required |
Returns:
| Type | Description |
|---|---|
| Image | Loaded image. |
plot_grid(n_rows=5, n_cols=8, offset=10, img_min=100, rotate=True, keep_aspect_ratios=True, header_cols=None, idx=None, background_color=(0, 0, 0), keep_transform=False, **kwargs)
Plots a grid of size (n_rows, n_cols) with images from the dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_rows | int | The number of rows in the grid. | 5 |
| n_cols | int | The number of columns in the grid. | 8 |
| offset | int | The offset between images. | 10 |
| img_min | float | The minimal size of the plotted images. | 100 |
| rotate | bool | Rotates the images to have the same orientation. | True |
| keep_aspect_ratios | bool | Whether aspect ratios are kept for images. | True |
| header_cols | Optional[Sequence[str]] | List of headers for each column. | None |
| idx | Optional[Union[Sequence[bool], Sequence[int]]] | List of indices to plot. None plots random images. Index -1 plots an empty image. | None |
| background_color | Tuple[int, int, int] | Background color of the grid. | (0, 0, 0) |
| keep_transform | bool | Whether | False |
remove_constant_columns(df=None)
Removes columns with a single unique value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Optional[DataFrame] | A full dataframe of the data. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data, slightly modified. |
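As an illustration of what removing columns with a single unique value means, here is a minimal sketch on a dictionary of column lists. The helper `drop_constant_columns` is hypothetical and does not reproduce the library's dataframe-based code.

```python
from typing import Dict, List


def drop_constant_columns(columns: Dict[str, List]) -> Dict[str, List]:
    """Keep only columns that hold more than one distinct value."""
    kept = {}
    for name, values in columns.items():
        # repr() makes unhashable values (such as lists) comparable.
        distinct = {repr(value) for value in values}
        if len(distinct) > 1:
            kept[name] = list(values)
    return kept


table = {
    "image_id": [0, 1, 2],
    "identity": ["A", "B", "A"],
    "species": ["turtle", "turtle", "turtle"],  # constant, will be dropped
}
reduced = drop_constant_columns(table)
```

The input table is left unchanged, and only the returned copy lacks the constant column.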
reorder_df(df)
Reorders rows and columns in the dataframe.
Rows are sorted based on id.
Columns are reordered based on the default_order list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A full dataframe of the data. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | A full dataframe of the data, slightly modified. |
Utils
bbox_segmentation(bbox)
Convert bounding box to segmentation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bbox | List[float] | Bounding box in the form [x, y, w, h]. | required |
Returns:
| Type | Description |
|---|---|
| list[float] | Segmentation mask in the form [x1, y1, x2, y2, ...]. |
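The conversion between the two formats can be sketched as follows. The function name `bbox_to_polygon` is hypothetical, and the corner order (starting at the top-left corner) is an assumption, since the page does not state which order the library uses.

```python
from typing import List


def bbox_to_polygon(bbox: List[float]) -> List[float]:
    """Turn [x, y, w, h] into a flat polygon [x1, y1, x2, y2, ...]."""
    x, y, w, h = bbox
    # Corners start at the top-left one and go clockwise in image coordinates (assumed order).
    return [x, y, x + w, y, x + w, y + h, x, y + h]


polygon = bbox_to_polygon([10, 20, 30, 40])
# polygon == [10, 20, 40, 20, 40, 60, 10, 60]
```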
create_id(string_col)
Creates unique ids from strings based on the MD5 hash.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| string_col | Series | List of ids. | required |
Returns:
| Type | Description |
|---|---|
| Series | List of encoded ids. |
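A minimal sketch of hash-based ids, using only the standard library: the helper `md5_ids` is hypothetical, works on a plain list instead of a Series, and keeps the full 32-character digest, whereas the library may shorten or post-process it.

```python
import hashlib
from typing import List


def md5_ids(strings: List[str]) -> List[str]:
    """Encode each string as the hexadecimal MD5 digest of its UTF-8 bytes."""
    return [hashlib.md5(s.encode("utf-8")).hexdigest() for s in strings]


names = ["site1/IMG_001.jpg", "site1/IMG_002.jpg", "site1/IMG_001.jpg"]
ids = md5_ids(names)
```

Equal strings always produce equal ids, so the resulting ids are unique only if the input strings are.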
crop_black(img)
Crops black borders from an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| img | Image | Image to be cropped. | required |
Returns:
| Type | Description |
|---|---|
| Image | Cropped image. |
crop_white(img)
Crops white borders from an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| img | Image | Image to be cropped. | required |
Returns:
| Type | Description |
|---|---|
| Image | Cropped image. |
data_directory(dir)
Changes the context so that the data directory is used as the current working directory. The data directory is created if it does not exist.
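A context manager with this behavior can be sketched with the standard library. The name `working_directory` is hypothetical, and restoring the previous directory on exit is an assumption about the intended behavior.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path: str):
    """Create path if needed, make it the working directory, then restore the old one."""
    previous = os.getcwd()
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "raw")  # does not exist yet
    with working_directory(target):
        inside = os.getcwd()
    created = os.path.isdir(target)
after = os.getcwd()
```

The try/finally ensures that the original working directory is restored even if the body of the with block raises.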
find_file_types(root)
Counts all file extensions, converted to lowercase, in a folder and its subfolders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | The root folder where to look for data. | required |
Returns:
| Type | Description |
|---|---|
| Series | Series of counts of the extensions. |
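Counting extensions in a folder tree can be sketched with os.walk. The helper `count_extensions` is hypothetical and returns a Counter rather than a pandas Series, so it shows the idea rather than the library's return type.

```python
import os
import tempfile
from collections import Counter


def count_extensions(root: str) -> Counter:
    """Count lowercase file extensions under root, including subfolders."""
    counts = Counter()
    for _, _, files in os.walk(root):
        for name in files:
            counts[os.path.splitext(name)[1].lower()] += 1
    return counts


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.JPG", "b.jpg", os.path.join("sub", "c.png"), "notes.txt"):
        open(os.path.join(root, rel), "w").close()
    counts = count_extensions(root)
```

Such a count is a quick way to discover which extensions to pass on when searching a new dataset for images.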
find_images(root, img_extensions=('.png', '.jpg', '.jpeg', '.tiff', '.bmp'))
Finds all image files in folder and subfolders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | str | The root folder where to look for images. | required |
| img_extensions | Tuple[str, ...] | Image extensions to look for. | ('.png', '.jpg', '.jpeg', '.tiff', '.bmp') |
Returns:
| Type | Description |
|---|---|
| DataFrame | Dataframe of relative paths of the images. |
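A recursive image search along these lines might look like the sketch below. The helper `list_images` is hypothetical and returns a sorted list of relative paths instead of a dataframe, and matching extensions case-insensitively is an assumption.

```python
import os
import tempfile
from typing import List, Tuple


def list_images(
    root: str,
    img_extensions: Tuple[str, ...] = (".png", ".jpg", ".jpeg", ".tiff", ".bmp"),
) -> List[str]:
    """Return sorted paths, relative to root, of files with an image extension."""
    found = []
    for folder, _, files in os.walk(root):
        for name in files:
            # str.endswith accepts a tuple; lowercasing makes the match case-insensitive.
            if name.lower().endswith(img_extensions):
                full_path = os.path.join(folder, name)
                found.append(os.path.relpath(full_path, start=root))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.JPG", "notes.txt", os.path.join("sub", "c.png")):
        open(os.path.join(root, rel), "w").close()
    images = list_images(root)
```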
is_annotation_bbox(segmentation, bbox, tol=0)
Checks whether the segmentation is a bounding box.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| segmentation | List[float] | Segmentation mask in the form [x1, y1, x2, y2, ...]. | required |
| bbox | List[float] | Bounding box in the form [x, y, w, h]. | required |
| tol | float | Tolerance for difference. | 0 |
Returns:
| Type | Description |
|---|---|
| bool | True if segmentation is bounding box within tolerance. |
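One way such a check could work is to compare the segmentation coordinate by coordinate with the rectangle spanned by the bounding box. The sketch below assumes the rectangle corners are listed from the top-left corner clockwise, which the page does not state. The function name `looks_like_bbox` is hypothetical.

```python
from typing import List


def looks_like_bbox(segmentation: List[float], bbox: List[float], tol: float = 0) -> bool:
    """Compare a flat polygon with the rectangle spanned by [x, y, w, h]."""
    x, y, w, h = bbox
    # Rectangle corners in an assumed order: top-left, then clockwise.
    rectangle = [x, y, x + w, y, x + w, y + h, x, y + h]
    if len(segmentation) != len(rectangle):
        return False
    return all(abs(s - r) <= tol for s, r in zip(segmentation, rectangle))


bbox = [10, 20, 30, 40]
exact = looks_like_bbox([10, 20, 40, 20, 40, 60, 10, 60], bbox)
noisy = looks_like_bbox([10.4, 20, 40, 19.7, 40, 60, 10, 60], bbox)
noisy_tol = looks_like_bbox([10.4, 20, 40, 19.7, 40, 60, 10, 60], bbox, tol=0.5)
triangle = looks_like_bbox([10, 20, 40, 20, 40, 60], bbox, tol=0.5)
```

With the default tolerance of 0 only an exact match passes, and a nonzero tolerance absorbs small annotation noise. A rectangle whose corners are listed in a different order would fail this particular sketch.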
load_image(path, max_size=None)
Loads an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Path of the image. | required |
| max_size | Optional[int] | Maximal size of the image or None (no restriction). | None |
Returns:
| Type | Description |
|---|---|
| Image | Loaded image. |
segmentation_bbox(segmentation)
Convert segmentation to bounding box.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| segmentation | List[float] | Segmentation mask in the form [x1, y1, x2, y2, ...]. | required |
Returns:
| Type | Description |
|---|---|
| list[float] | Bounding box in the form [x, y, w, h]. |
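The conversion amounts to taking the extremes of the x and y coordinates. A minimal sketch, with the hypothetical name `polygon_to_bbox` and assuming the flat list alternates x and y values:

```python
from typing import List


def polygon_to_bbox(segmentation: List[float]) -> List[float]:
    """Return [x, y, w, h] of the smallest axis-aligned box around a flat polygon."""
    xs = segmentation[0::2]  # even positions hold x coordinates
    ys = segmentation[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


bbox = polygon_to_bbox([12, 30, 40, 20, 35, 60, 10, 45])
# bbox == [10, 20, 30, 40]
```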