Reference splitting functions

This file describes methods associated with dataset splitting.

Balanced split

Base class for splitting datasets into training and testing sets.

Implements methods from this paper. Its subclasses need to implement the split method. It should perform balanced splits separately for all classes. Its children are IdentitySplit and TimeAwareSplit. IdentitySplit has children ClosedSetSplit, OpenSetSplit and DisjointSetSplit. TimeAwareSplit has children TimeProportionSplit and TimeCutoffSplit.

`compute_clusters(df, features, n_max_cluster=5, eps_min=0.01, eps_max=0.5, eps_step=0.01, min_samples=2)`

Computes clusters for a random re-split of an already existing split.

It runs DBSCAN with increasing eps (cluster radius), stopping before any cluster becomes larger than n_max_cluster.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required
`features`	`ndarray`	An array of features with the same length as `df`.	required
`n_max_cluster`	`int`	Maximal size of cluster before `eps` stops increasing.	`5`
`eps_min`	`float`	Lower bound for epsilon.	`0.01`
`eps_max`	`float`	Upper bound for epsilon.	`0.5`
`eps_step`	`float`	Step for epsilon.	`0.01`
`min_samples`	`int`	Minimal cluster size.	`2`

Returns:

Type	Description
`ndarray`	List of clusters.
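The search over `eps` can be sketched without any dependencies. This is an illustration only, not the library's code: `cluster_within_eps` and `clusters_with_growing_eps` are hypothetical names, and the inner clustering is a simplified stand-in for DBSCAN (connected components of the `eps`-neighbourhood graph, with isolated points labelled `-1`), which corresponds to the `min_samples=2` case. It also assumes that the function returns the last clustering whose largest cluster still fits within `n_max_cluster`.

```python
from itertools import combinations
from math import dist


def cluster_within_eps(points, eps):
    """Label connected components of the eps-neighbourhood graph; singletons get -1."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = {r: roots.count(r) for r in set(roots)}
    ids, labels = {}, []
    for r in roots:
        if sizes[r] < 2:
            labels.append(-1)  # isolated point, treated as noise
        else:
            labels.append(ids.setdefault(r, len(ids)))
    return labels


def clusters_with_growing_eps(points, n_max_cluster=5, eps_min=0.01,
                              eps_max=0.5, eps_step=0.01):
    """Increase eps step by step; stop before a cluster exceeds n_max_cluster."""
    best = [-1] * len(points)
    steps = int(round((eps_max - eps_min) / eps_step))
    for k in range(steps + 1):
        eps = eps_min + k * eps_step
        labels = cluster_within_eps(points, eps)
        largest = max((labels.count(c) for c in set(labels) if c != -1), default=0)
        if largest > n_max_cluster:
            break
        best = labels
    return best


points = [(0.0,), (0.105,), (0.21,), (1.0,), (1.035,), (5.0,)]
small = clusters_with_growing_eps(points, n_max_cluster=2)
large = clusters_with_growing_eps(points, n_max_cluster=5)
```

With `n_max_cluster=2` only the two closest points end up in a cluster, because the next merge would create a cluster of three; with `n_max_cluster=5` the first three points also form a cluster.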

`initialize_lcg()`

Returns the random number generator.

Returns:

Type	Description
`Lcg`	The random number generator.

`modify_df(df)`

Prepares dataframe for splits.

Removes identities specified in self.identity_skip (usually unknown identities).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain columns `identity` and `date`.	required

Returns:

Type	Description
`DataFrame`	Modified dataframe of the data.

`resplit_by_clusters(df, clusters, idx_train)`

Creates a random re-split of an already existing split.

The re-split is based on clusters that group similar images. It first puts clusters of similar images into the training set and then randomly splits the rest into the training and testing sets. The re-split mimics the original split: for each individual, the training set contains the same number of samples as before, and so does the testing set.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required
`clusters`	`ndarray`	An array of clusters with the same length as `df`.	required
`idx_train`	`ndarray`	Labels of the training set.	required

Returns:

Type	Description
`tuple[ndarray, ndarray]`	List of labels of the training and testing sets.

`resplit_by_features(df, features, idx_train, save_clusters_prefix=None, **kwargs)`

Creates a random re-split of an already existing split.

The re-split is based on similarity of features. It runs DBSCAN as described in compute_clusters and performs the re-split as described in resplit_by_clusters.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required
`features`	`ndarray`	An array of features with the same length as `df`.	required
`idx_train`	`ndarray`	Labels of the training set.	required
`save_clusters_prefix`	`Optional[str]`	File name prefix for saving clusters.	`None`
`**kwargs`	`type`	See kwargs in `compute_clusters`.	`{}`

Returns:

Type	Description
`tuple[ndarray, ndarray]`	List of labels of the training and testing sets.

`resplit_random(df, idx_train, idx_test)`

Creates a random re-split of an already existing split.

The re-split mimics the original split: for each individual, the training set contains the same number of samples as before, and so does the testing set.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain columns `identity` and `date`.	required
`idx_train`	`ndarray`	Labels of the training set.	required
`idx_test`	`ndarray`	Labels of the testing set.	required

Returns:

Type	Description
`tuple[ndarray, ndarray]`	List of labels of the training and testing sets.
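A minimal sketch of such a count-preserving re-split, under stated assumptions: `resplit_random_sketch` is a hypothetical name, a plain dictionary `identity` (label to identity) stands in for the `identity` column of `df`, and Python's `random.Random` stands in for the library's `Lcg` generator.

```python
import random
from collections import defaultdict


def resplit_random_sketch(identity, idx_train, idx_test, seed=0):
    """Re-split labels so each identity keeps its original train/test counts."""
    rng = random.Random(seed)  # stand-in for the library's Lcg generator
    train_set = set(idx_train)
    pooled = defaultdict(list)   # identity -> all of its labels
    n_train = defaultdict(int)   # identity -> size of its original training part
    for label in list(idx_train) + list(idx_test):
        pooled[identity[label]].append(label)
        n_train[identity[label]] += label in train_set

    new_train, new_test = [], []
    for name in sorted(pooled):
        labels = pooled[name][:]
        rng.shuffle(labels)
        new_train += labels[:n_train[name]]
        new_test += labels[n_train[name]:]
    return new_train, new_test


identity = {10: "a", 11: "a", 12: "a", 13: "b", 14: "b", 15: "c"}
new_train, new_test = resplit_random_sketch(identity, [10, 11, 13, 15], [12, 14], seed=1)
```

Whatever the seed, identity `a` keeps two training samples and one testing sample, `b` keeps one of each, and `c` stays in the training set only.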

`set_col_label(col_label)`

Sets col_label to the desired value.

Parameters:

Name	Type	Description	Default
`col_label`	`str`	Desired value for col_label.	required

`split(*args, **kwargs)`

Splitting method which needs to be implemented by subclasses.

It splits the dataframe df into labels idx_train and idx_test. The subdataset is obtained by df.loc[idx_train] (not iloc).

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.
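The difference between labels and positions matters whenever the dataframe index is not `0..n-1`. The toy example below uses a hand-written split in the documented return format (plain lists instead of arrays, for brevity) rather than a real splitter, so the index values and the split itself are made up for illustration.

```python
import pandas as pd

# A toy dataframe whose index labels differ from row positions.
df = pd.DataFrame(
    {"identity": ["a", "a", "b", "b"]},
    index=[10, 20, 30, 40],
)

# One split in the documented format: a list of (idx_train, idx_test) pairs.
splits = [([10, 30], [20, 40])]

for idx_train, idx_test in splits:
    df_train = df.loc[idx_train]  # label-based selection
    df_test = df.loc[idx_test]
```

Here `df.iloc[idx_train]` would raise an `IndexError`, because positions 10 and 30 do not exist in a four-row dataframe.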

Identity split

Bases: BalancedSplit

Base class for ClosedSetSplit, OpenSetSplit and DisjointSetSplit.

`general_split(df, individual_train, individual_test)`

General-purpose split into the training and testing sets.

It puts all samples of individual_train into the training set and all samples of individual_test into the testing set. The splitting is performed for each individual separately. The split will result in at least one sample in both the training and testing sets. If only one sample is available for an individual, it will be in the training set.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required
`individual_train`	`List[str]`	Individuals to be only in the training set.	required
`individual_test`	`List[str]`	Individuals to be only in the testing set.	required

Returns:

Type	Description
`tuple[ndarray, ndarray]`	List of labels of the training and testing sets.
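The rules above can be sketched as follows. This is a sketch under assumptions, not the library's implementation: `general_split_sketch` is a hypothetical name, the dictionary `identity` (label to individual) stands in for `df`, `ratio_train` is passed as an argument here for self-containment, `random.Random` replaces the `Lcg` generator, and the rounding rule is a guess that merely satisfies the "at least one sample on each side" guarantee.

```python
import random
from collections import defaultdict


def general_split_sketch(identity, individual_train, individual_test,
                         ratio_train=0.5, seed=0):
    """Split labels per individual; identity maps label -> individual."""
    rng = random.Random(seed)  # stand-in for the library's Lcg generator
    by_individual = defaultdict(list)
    for label, name in identity.items():
        by_individual[name].append(label)

    idx_train, idx_test = [], []
    for name in sorted(by_individual):
        labels = by_individual[name][:]
        if name in individual_train:
            idx_train += labels      # forced into the training set
        elif name in individual_test:
            idx_test += labels       # forced into the testing set
        elif len(labels) == 1:
            idx_train += labels      # a single sample always goes to training
        else:
            rng.shuffle(labels)
            # Assumed rounding rule: keep at least one sample on each side.
            n_train = min(max(round(ratio_train * len(labels)), 1), len(labels) - 1)
            idx_train += labels[:n_train]
            idx_test += labels[n_train:]
    return idx_train, idx_test


identity = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "c", 6: "c", 7: "d", 8: "d"}
idx_train, idx_test = general_split_sketch(identity, ["c"], ["d"], ratio_train=0.5)
```

In this toy run, `c` lands entirely in training, `d` entirely in testing, the single sample of `b` goes to training, and `a` is divided two to two.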

Closed-set split

Bases: IdentitySplit

Closed-set splitting method into training and testing sets.

All individuals are both in the training and testing set. The only exception is that individuals with only one sample are in the training set. Implementation of this paper.

`__init__(ratio_train, **kwargs)`

Initializes the class.

Parameters:

Name	Type	Description	Default
`ratio_train`	`float`	Approximate size of the training set.	required
`**kwargs`	`type`	See kwargs `seed`, `identity_skip` and `col_label` of the parent class.	`{}`

`split(df)`

Implementation of the base splitting method.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.

Open-set split

Bases: IdentitySplit

Open-set splitting method into training and testing sets.

Some individuals are in the testing but not in the training set. Implementation of this paper.

`__init__(ratio_train, ratio_class_test=None, n_class_test=None, open_in_test=True, **kwargs)`

Initializes the class.

The user must provide exactly one of ratio_class_test and n_class_test. The latter specifies the number of individuals to be only in the testing set. The former specifies the ratio of samples of individuals (not individuals themselves) to be only in the testing set.

Parameters:

Name	Type	Description	Default
`ratio_train`	`float`	Approximate size of the training set.	required
`ratio_class_test`	`float`	Approximate ratio of samples of individuals only in the testing set.	`None`
`n_class_test`	`int`	Number of individuals only in the testing set.	`None`
`open_in_test`	`bool`	Whether the individuals unique to one set will be in the testing set (default) or in the training set.	`True`
`**kwargs`	`type`	See kwargs `seed`, `identity_skip` and `col_label` of the parent class.	`{}`
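The difference between the two options can be illustrated with a small sketch. Everything here is an assumption made for illustration: `pick_test_individuals` is a hypothetical helper, `random.Random` stands in for the `Lcg` generator, and the rule for `ratio_class_test` (add shuffled individuals until their samples reach the requested share) is only one plausible reading of "approximate ratio of samples".

```python
import random
from collections import Counter


def pick_test_individuals(identities, ratio_class_test=None, n_class_test=None, seed=0):
    """Choose the individuals that appear only in the testing set.

    identities lists the identity of every sample (stand-in for df['identity']).
    """
    if (ratio_class_test is None) == (n_class_test is None):
        raise ValueError("Provide exactly one of ratio_class_test and n_class_test.")

    counts = Counter(identities)
    names = sorted(counts)
    random.Random(seed).shuffle(names)  # stand-in for the library's Lcg generator

    if n_class_test is not None:
        return names[:n_class_test]  # a fixed number of individuals

    # Assumed rule: add individuals until their samples reach the requested share.
    target = ratio_class_test * len(identities)
    chosen, n_samples = [], 0
    for name in names:
        if n_samples >= target:
            break
        chosen.append(name)
        n_samples += counts[name]
    return chosen


identities = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
by_number = pick_test_individuals(identities, n_class_test=2)
by_ratio = pick_test_individuals(identities, ratio_class_test=0.3)
```

`by_number` always contains exactly two individuals, whereas the size of `by_ratio` depends on how many samples the selected individuals have.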

`split(df)`

Implementation of the base splitting method.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.

Disjoint-set split

Bases: IdentitySplit

Disjoint-set splitting method into training and testing sets.

No individuals are in both the training and testing sets. Implementation of this paper.

`__init__(ratio_class_test=None, n_class_test=None, **kwargs)`

Initializes the class.

The user must provide exactly one of ratio_class_test and n_class_test. The latter specifies the number of individuals to be only in the testing set. The former specifies the ratio of samples of individuals (not individuals themselves) to be only in the testing set.

Parameters:

Name	Type	Description	Default
`ratio_class_test`	`float`	Approximate ratio of samples of individuals only in the testing set.	`None`
`n_class_test`	`int`	Number of individuals only in the testing set.	`None`
`**kwargs`	`type`	See kwargs `seed`, `identity_skip` and `col_label` of the parent class.	`{}`

`split(df)`

Implementation of the base splitting method.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain column `identity`.	required

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.

Time-aware split

Bases: BalancedSplit

Base class for TimeProportionSplit and TimeCutoffSplit.

`modify_df(df)`

Prepares dataframe for splits.

Removes identities specified in self.identity_skip (usually unknown identities). Converts the date column into a unified format. Adds the year column.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain columns `identity` and `date`.	required

Returns:

Type	Description
`DataFrame`	Modified dataframe of the data.

Time-proportion split

Bases: TimeAwareSplit

Time-proportion non-random splitting method into training and testing sets.

For each individual, it extracts unique observation dates and, with the default ratio, puts half of them into the training set and the other half into the testing set. Ignores individuals with only one observation date. Implementation of this paper.

`__init__(ratio=0.5, **kwargs)`

Initializes the class.

Parameters:

Name	Type	Description	Default
`ratio`	`float`	The fraction of dates going to the training set.	`0.5`
`**kwargs`	`type`	See kwargs `seed`, `identity_skip` and `col_label` of the parent class.	`{}`

`split(df)`

Implementation of the base splitting method.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain columns `identity` and `date`.	required

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.
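A date-based split of this kind might look like the sketch below. The name `time_proportion_sketch` and the `samples` dictionary (label to identity and date) are hypothetical stand-ins for the class and for `df`. The reference does not say which dates go to training, so the sketch assumes the earliest ones do.

```python
from collections import defaultdict


def time_proportion_sketch(samples, ratio=0.5):
    """samples maps label -> (identity, date); dates must be sortable, e.g. ISO strings."""
    dates = defaultdict(set)
    for name, date in samples.values():
        dates[name].add(date)

    train_dates = {}
    for name, unique_dates in dates.items():
        if len(unique_dates) < 2:
            continue  # individuals with a single observation date are ignored
        ordered = sorted(unique_dates)
        # Assumption: the earliest dates go to the training set.
        n_train = round(ratio * len(ordered))
        train_dates[name] = set(ordered[:n_train])

    idx_train, idx_test = [], []
    for label, (name, date) in samples.items():
        if name not in train_dates:
            continue
        (idx_train if date in train_dates[name] else idx_test).append(label)
    return idx_train, idx_test


samples = {
    0: ("a", "2019-01-01"), 1: ("a", "2019-01-01"), 2: ("a", "2020-06-01"),
    3: ("b", "2018-03-01"),
    4: ("c", "2017-01-01"), 5: ("c", "2018-01-01"),
    6: ("c", "2019-01-01"), 7: ("c", "2020-01-01"),
}
idx_train, idx_test = time_proportion_sketch(samples)
```

Note that the split is made over dates, not samples: both samples of `a` taken on the same day stay together, and `b` is dropped because it has only one observation date.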

Time-cutoff split

Bases: TimeAwareSplit

Time-cutoff non-random splitting method into training and testing sets.

Puts all individuals observed before year into the training set. Puts all individuals observed during year into the testing set. Ignores all individuals observed after year. Implementation of this paper.

`__init__(year, test_one_year_only=True, **kwargs)`

Initializes the class.

Parameters:

Name	Type	Description	Default
`year`	`int`	Splitting year.	required
`test_one_year_only`	`bool`	Whether the test set is `df['year'] == year` or `df['year'] >= year`.	`True`
`**kwargs`	`type`	See kwargs `seed`, `identity_skip` and `col_label` of the parent class.	`{}`

`split(df)`

Implementation of the base splitting method.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe of the data. It must contain columns `identity` and `date`.	required

Returns:

Type	Description
`list[tuple[ndarray, ndarray]]`	List of splits. Each split is a list of labels of the training and testing sets.
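The effect of `test_one_year_only` can be shown on a toy example. The function `time_cutoff_sketch` and the `years` dictionary (label to observation year, standing in for the `year` column of `df`) are hypothetical and only mirror the conditions given in the parameter table.

```python
def time_cutoff_sketch(years, year, test_one_year_only=True):
    """years maps label -> observation year (a stand-in for df['year'])."""
    idx_train = [label for label, y in years.items() if y < year]
    if test_one_year_only:
        idx_test = [label for label, y in years.items() if y == year]
    else:
        idx_test = [label for label, y in years.items() if y >= year]
    return idx_train, idx_test


years = {0: 2018, 1: 2019, 2: 2020, 3: 2021}
one_year = time_cutoff_sketch(years, 2020)
later_years = time_cutoff_sketch(years, 2020, test_one_year_only=False)
```

With the default setting the 2021 sample is ignored; with `test_one_year_only=False` it joins the testing set.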

Lcg

Linear congruential generator for generating random numbers.

Copied from StackOverflow. It is machine-, distribution- and package version-independent. It has some drawbacks (check the link above) but is perfectly sufficient for our application.

Attributes:

Name	Type	Description
`state`	`int`	Random state of the LCG.

`__init__(seed, iterate=0)`

Initialization function for LCG.

Parameters:

Name	Type	Description	Default
`seed`	`int`	Initial random seed.	required
`iterate`	`int`	Number of initial random iterations.	`0`

`random()`

Generates a new random integer from the current state.

Returns:

Type	Description
`int`	New random integer.

`random_permutation(n)`

Generates a random permutation of range(n).

Parameters:

Name	Type	Description	Default
`n`	`int`	Length of the sequence to be permuted.	required

Returns:

Type	Description
`ndarray`	Permuted sequence.

`random_shuffle(x)`

Generates a random shuffle of x.

Parameters:

Name	Type	Description	Default
`x`	`ndarray`	Array to be permuted.	required

Returns:

Type	Description
`ndarray`	Shuffled array.
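For readers unfamiliar with linear congruential generators, here is a minimal sketch of a class with the same interface. It is not the library's code: `LcgSketch` is a hypothetical name, the multiplier, increment and modulus are widely used textbook constants chosen as an assumption, and plain lists are returned instead of arrays.

```python
class LcgSketch:
    """A minimal linear congruential generator with illustrative constants."""

    # Assumed parameters; the library's actual constants are not stated
    # in this reference.
    A, C, M = 1103515245, 12345, 2**31

    def __init__(self, seed, iterate=0):
        self.state = seed
        for _ in range(iterate):  # optional warm-up iterations
            self.random()

    def random(self):
        # state_{k+1} = (A * state_k + C) mod M
        self.state = (self.A * self.state + self.C) % self.M
        return self.state

    def random_permutation(self, n):
        # Rank n fresh draws; ties are broken by position.
        draws = [self.random() for _ in range(n)]
        return sorted(range(n), key=lambda i: draws[i])

    def random_shuffle(self, x):
        perm = self.random_permutation(len(x))
        return [x[i] for i in perm]


first = LcgSketch(1).random()
warmed_up = LcgSketch(1, iterate=1)
permutation = LcgSketch(7).random_permutation(6)
shuffled = LcgSketch(7).random_shuffle(list("abc"))
```

Because the next state depends only on integer arithmetic, two generators created with the same seed produce the same sequence on any machine, which is the property the class description emphasises.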
