Reference splitting functions

This file describes methods associated with dataset splitting.

Balanced split

Base class for splitting datasets into training and testing sets.

Implements methods from this paper. Its subclasses need to implement the split method. It should perform balanced splits separately for all classes. Its children are IdentitySplit and TimeAwareSplit. IdentitySplit has children ClosedSetSplit, OpenSetSplit and DisjointSetSplit. TimeAwareSplit has children TimeProportionSplit and TimeCutoffSplit.

compute_clusters(df, features, n_max_cluster=5, eps_min=0.01, eps_max=0.5, eps_step=0.01, min_samples=2)

Computes clusters for a random re-split of an already existing split.

It runs DBSCAN with increasing eps (cluster radius) until the clusters are smaller than n_max_cluster.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required
features ndarray

An array of features with the same length as df.

required
n_max_cluster int

Maximal size of cluster before eps stops increasing.

5
eps_min float

Lower bound for epsilon.

0.01
eps_max float

Upper bound for epsilon.

0.5
eps_step float

Step for epsilon.

0.01
min_samples int

Minimal cluster size.

2

Returns:

Type Description
ndarray

List of clusters.
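
The eps sweep described above can be sketched as follows. This is a minimal illustration, not the library's implementation. The function name `sweep_eps`, the Euclidean metric (scikit-learn's default), and the stopping rule (stop once any cluster exceeds `n_max_cluster` and keep the previous labelling) are all assumptions, and any per-identity handling is omitted.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def sweep_eps(features, n_max_cluster=5, eps_min=0.01, eps_max=0.5,
              eps_step=0.01, min_samples=2):
    """Return DBSCAN labels for the largest eps whose clusters all fit the size cap."""
    features = np.asarray(features, dtype=float)
    # Start with every sample marked as noise (-1), the label DBSCAN uses for noise.
    best = -np.ones(len(features), dtype=int)
    n_steps = int(round((eps_max - eps_min) / eps_step))
    for i in range(n_steps + 1):
        eps = eps_min + i * eps_step
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(features).labels_
        sizes = np.bincount(labels[labels >= 0])
        # Assumed stopping rule: stop as soon as any cluster grows beyond the cap.
        if sizes.size > 0 and sizes.max() > n_max_cluster:
            break
        best = labels
    return best


# Two close points and three slightly more spread-out points.
features = [[0.0], [0.033], [5.0], [5.104], [5.208]]
labels = sweep_eps(features, n_max_cluster=2)
```

With `n_max_cluster=2`, the sweep stops before the three spread-out points merge into one cluster, so only the first two samples share a cluster label and the rest stay marked as noise.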

initialize_lcg()

Returns the random number generator.

Returns:

Type Description
Lcg

The random number generator.

modify_df(df)

Prepares dataframe for splits.

Removes identities specified in self.identity_skip (usually unknown identities).

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain columns identity and date.

required

Returns:

Type Description
DataFrame

Modified dataframe of the data.

resplit_by_clusters(df, clusters, idx_train)

Creates a random re-split of an already existing split.

The re-split is based on clusters that group similar images. It first puts clusters of similar images into the training set. The remaining samples are randomly split between the training and testing sets. The re-split mimics the original split: for each individual, the training set contains the same number of samples as before, and the same holds for the testing set.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required
clusters ndarray

An array of clusters with the same length as df.

required
idx_train ndarray

Labels of the training set.

required

Returns:

Type Description
tuple[ndarray, ndarray]

List of labels of the training and testing sets.

resplit_by_features(df, features, idx_train, save_clusters_prefix=None, **kwargs)

Creates a random re-split of an already existing split.

The re-split is based on similarity of features. It runs DBSCAN as described in compute_clusters and then performs the re-split as described in resplit_by_clusters.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required
features ndarray

An array of features with the same length as df.

required
idx_train ndarray

Labels of the training set.

required
save_clusters_prefix Optional[str]

File name prefix for saving clusters.

None
**kwargs type

See kwargs in compute_clusters.

{}

Returns:

Type Description
tuple[ndarray, ndarray]

List of labels of the training and testing sets.

resplit_random(df, idx_train, idx_test)

Creates a random re-split of an already existing split.

The re-split mimics the original split: for each individual, the training set contains the same number of samples as before, and the same holds for the testing set.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain columns identity and date.

required
idx_train ndarray

Labels of the training set.

required
idx_test ndarray

Labels of the testing set.

required

Returns:

Type Description
tuple[ndarray, ndarray]

List of labels of the training and testing sets.
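
A count-preserving re-split of this kind could look like the sketch below. It is an illustration under stated assumptions rather than the library's code: the name `resplit_random_sketch` and the `seed` parameter are hypothetical, and NumPy's generator stands in for the library's own random number generator.

```python
import numpy as np
import pandas as pd


def resplit_random_sketch(df, idx_train, idx_test, seed=0, col_label="identity"):
    """Reshuffle a split while keeping per-identity train/test counts."""
    rng = np.random.default_rng(seed)  # assumption: NumPy RNG, not the library's Lcg
    idx_train = pd.Index(idx_train)
    idx_test = pd.Index(idx_test)
    used = df.loc[idx_train.append(idx_test)]
    new_train, new_test = [], []
    for _, group in used.groupby(col_label):
        # Number of training samples this identity had in the original split.
        n_train = int(group.index.isin(idx_train).sum())
        shuffled = rng.permutation(group.index.to_numpy())
        new_train.extend(shuffled[:n_train])
        new_test.extend(shuffled[n_train:])
    return np.array(new_train), np.array(new_test)


df = pd.DataFrame({"identity": ["a", "a", "a", "b", "b"]}, index=[10, 11, 12, 13, 14])
new_train, new_test = resplit_random_sketch(df, [10, 11, 13], [12, 14])
```

Individual "a" still has two training samples and one testing sample after the re-split, and "b" still has one of each; only which samples land where may change.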

set_col_label(col_label)

Sets col_label to the desired value.

Parameters:

Name Type Description Default
col_label str

Desired value for col_label.

required

split(*args, **kwargs)

Splitting method which needs to be implemented by subclasses.

It splits the dataframe df into labels idx_train and idx_test. The subdataset is obtained by df.loc[idx_train] (not iloc).

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.
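
The difference between loc and iloc matters whenever the dataframe index is not 0, 1, 2, and so on. The short example below uses made-up index labels and a hypothetical split result to show why the returned labels must be used with loc.

```python
import numpy as np
import pandas as pd

# A dataframe whose index labels differ from row positions.
df = pd.DataFrame(
    {"identity": ["a", "b", "a", "b"]},
    index=[100, 101, 102, 103],
)

# Hypothetical output of a split: index labels, not positions.
idx_train = np.array([100, 103])
idx_test = np.array([101, 102])

df_train = df.loc[idx_train]  # label-based selection
df_test = df.loc[idx_test]

# df.iloc[idx_train] would raise an IndexError here, because 100 and 103
# are not valid row positions in a four-row dataframe.
```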

Identity split

Bases: BalancedSplit

Base class for ClosedSetSplit, OpenSetSplit and DisjointSetSplit.

general_split(df, individual_train, individual_test)

General-purpose split into the training and testing sets.

It puts all samples of individual_train into the training set and all samples of individual_test into the testing set. The splitting is performed for each individual separately. The split will result in at least one sample in both the training and testing sets. If only one sample is available for an individual, it will be in the training set.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required
individual_train List[str]

Individuals to be only in the training set.

required
individual_test List[str]

Individuals to be only in the testing set.

required

Returns:

Type Description
tuple[ndarray, ndarray]

List of labels of the training and testing sets.

Closed-set split

Bases: IdentitySplit

Closed-set splitting method into training and testing sets.

All individuals are both in the training and testing set. The only exception is that individuals with only one sample are in the training set. Implementation of this paper.

__init__(ratio_train, **kwargs)

Initializes the class.

Parameters:

Name Type Description Default
ratio_train float

Approximate fraction of the data assigned to the training set.

required
**kwargs type

See kwargs seed, identity_skip and col_label of the parent class.

{}

split(df)

Implementation of the base splitting method.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.
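
The closed-set rule (every individual in both sets, single-sample individuals in training only) can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the function name is hypothetical, the standard-library RNG replaces the library's generator, the rounding rule is an assumption, and positions in the input list stand in for dataframe index labels.

```python
import random
from collections import defaultdict


def closed_set_split_sketch(identities, ratio_train, seed=0):
    """Split sample positions per identity so that identities appear in both sets."""
    rng = random.Random(seed)  # assumption: stdlib RNG, not the library's Lcg
    by_identity = defaultdict(list)
    for position, identity in enumerate(identities):
        by_identity[identity].append(position)

    idx_train, idx_test = [], []
    for identity in sorted(by_identity):
        samples = by_identity[identity]
        rng.shuffle(samples)
        if len(samples) == 1:
            # A single sample cannot be shared, so it goes to training.
            idx_train.extend(samples)
            continue
        # Clamp so that both sets receive at least one sample.
        n_train = min(max(round(ratio_train * len(samples)), 1), len(samples) - 1)
        idx_train.extend(samples[:n_train])
        idx_test.extend(samples[n_train:])
    return sorted(idx_train), sorted(idx_test)


identities = ["a", "a", "a", "a", "b", "b", "c"]
idx_train, idx_test = closed_set_split_sketch(identities, ratio_train=0.5)
```

Here "a" and "b" end up in both sets, while the single sample of "c" is placed in the training set.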

Open-set split

Bases: IdentitySplit

Open-set splitting method into training and testing sets.

Some individuals are in the testing but not in the training set. Implementation of this paper.

__init__(ratio_train, ratio_class_test=None, n_class_test=None, open_in_test=True, **kwargs)

Initializes the class.

The user must provide exactly one of ratio_class_test and n_class_test. The latter specifies the number of individuals to be only in the testing set. The former specifies the ratio of samples of individuals (not of the individuals themselves) to be only in the testing set.

Parameters:

Name Type Description Default
ratio_train float

Approximate fraction of the data assigned to the training set.

required
ratio_class_test float

Approximate ratio of samples of individuals only in the testing set.

None
n_class_test int

Number of individuals only in the testing set.

None
open_in_test bool

Whether the unique identities will be in the testing set (default) or in the training set.

True
**kwargs type

See kwargs seed, identity_skip and col_label of the parent class.

{}

split(df)

Implementation of the base splitting method.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.

Disjoint-set split

Bases: IdentitySplit

Disjoint-set splitting method into training and testing sets.

No individuals are in both the training and testing sets. Implementation of this paper.

__init__(ratio_class_test=None, n_class_test=None, **kwargs)

Initializes the class.

The user must provide exactly one of ratio_class_test and n_class_test. The latter specifies the number of individuals to be only in the testing set. The former specifies the ratio of samples of individuals (not of the individuals themselves) to be only in the testing set.

Parameters:

Name Type Description Default
ratio_class_test float

Approximate ratio of samples of individuals only in the testing set.

None
n_class_test int

Number of individuals only in the testing set.

None
**kwargs type

See kwargs seed, identity_skip and col_label of the parent class.

{}

split(df)

Implementation of the base splitting method.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain column identity.

required

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.
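
The two ways of choosing test individuals (a fixed number of individuals or a share of samples) can be illustrated with the sketch below. The function name, the greedy accumulation used for ratio_class_test, and the standard-library RNG are assumptions made for illustration; the library may select individuals differently.

```python
import random
from collections import Counter


def disjoint_set_split_sketch(identities, ratio_class_test=None,
                              n_class_test=None, seed=0):
    """Assign whole identities to either the training or the testing set."""
    if (ratio_class_test is None) == (n_class_test is None):
        raise ValueError("Provide exactly one of ratio_class_test and n_class_test.")

    rng = random.Random(seed)  # assumption: stdlib RNG, not the library's Lcg
    counts = Counter(identities)
    order = sorted(counts)
    rng.shuffle(order)

    if n_class_test is not None:
        test_identities = set(order[:n_class_test])
    else:
        # Assumed rule: add identities until their samples reach the requested share.
        target = ratio_class_test * len(identities)
        test_identities, n_samples = set(), 0
        for identity in order:
            if n_samples >= target:
                break
            test_identities.add(identity)
            n_samples += counts[identity]

    idx_train = [i for i, x in enumerate(identities) if x not in test_identities]
    idx_test = [i for i, x in enumerate(identities) if x in test_identities]
    return idx_train, idx_test


identities = ["a", "a", "b", "b", "b", "c"]
idx_train, idx_test = disjoint_set_split_sketch(identities, n_class_test=1)
idx_train_r, idx_test_r = disjoint_set_split_sketch(identities, ratio_class_test=0.5)
```

In both calls no individual appears on both sides. The second call keeps adding individuals to the testing set until they account for at least half of the samples.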

Time-aware split

Bases: BalancedSplit

Base class for TimeProportionSplit and TimeCutoffSplit.

modify_df(df)

Prepares dataframe for splits.

Removes identities specified in self.identity_skip (usually unknown identities). Convert the date column into a unified format. Add the year column.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain columns identity and date.

required

Returns:

Type Description
DataFrame

Modified dataframe of the data.
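
The preparation steps above can be sketched with pandas. The function name is hypothetical, the use of a single string for the skipped identity is an assumption, and datetime64 is used here as the unified date format, which may differ from what the library produces.

```python
import pandas as pd


def modify_df_sketch(df, identity_skip="unknown"):
    """Drop skipped identities, normalise dates and add a year column."""
    df = df[df["identity"] != identity_skip].copy()
    df["date"] = pd.to_datetime(df["date"])  # assumed unified format: datetime64
    df["year"] = df["date"].dt.year
    return df


df = pd.DataFrame({
    "identity": ["a", "unknown", "b"],
    "date": ["2019-05-01", "2020-01-15", "2021-12-31"],
})
df_mod = modify_df_sketch(df)
```

The row with the skipped identity is removed, the original index labels of the remaining rows are preserved, and each remaining row gains a year value.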

Time-proportion split

Bases: TimeAwareSplit

Time-proportion non-random splitting method into training and testing sets.

For each individual, it extracts the unique observation dates and puts a fraction ratio of them (half by default) into the training set and the rest into the testing set. It ignores individuals with only one observation date. Implementation of this paper.

__init__(ratio=0.5, **kwargs)

Initializes the class.

Parameters:

Name Type Description Default
ratio float

The fraction of dates going to the training set.

0.5
**kwargs type

See kwargs seed, identity_skip and col_label of the parent class.

{}

split(df)

Implementation of the base splitting method.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain columns identity and date.

required

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.
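
A date-based split of this kind can be sketched in plain Python. The sketch assumes that the earliest dates of each individual go to the training set, which the text above does not state explicitly. The function name and the (identity, date) record layout are hypothetical, and positions in the list stand in for dataframe index labels.

```python
from collections import defaultdict


def time_proportion_split_sketch(records, ratio=0.5):
    """records: list of (identity, date) pairs; returns train and test positions."""
    dates_by_identity = defaultdict(set)
    for identity, date in records:
        dates_by_identity[identity].add(date)

    train_dates = {}
    for identity, dates in dates_by_identity.items():
        if len(dates) < 2:
            continue  # identities with a single observation date are ignored
        ordered = sorted(dates)  # ISO date strings sort chronologically
        n_train = round(ratio * len(ordered))
        # Assumption: the earliest dates go to the training set.
        train_dates[identity] = set(ordered[:n_train])

    idx_train, idx_test = [], []
    for position, (identity, date) in enumerate(records):
        if identity not in train_dates:
            continue
        if date in train_dates[identity]:
            idx_train.append(position)
        else:
            idx_test.append(position)
    return idx_train, idx_test


records = [
    ("a", "2020-01-01"), ("a", "2020-01-01"), ("a", "2020-06-01"),
    ("a", "2021-03-01"), ("a", "2021-09-01"),
    ("b", "2020-02-02"),
]
idx_train, idx_test = time_proportion_split_sketch(records)
```

Individual "a" has four unique dates, so all samples from the two earliest dates go to training and the rest to testing. Individual "b" has a single date and is left out of both sets.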

Time-cutoff split

Bases: TimeAwareSplit

Time-cutoff non-random splitting method into training and testing sets.

Puts all individuals observed before year into the training set. Puts all individuals observed during year into the testing set. Ignores all individuals observed after year unless test_one_year_only is False. Implementation of this paper.

__init__(year, test_one_year_only=True, **kwargs)

Initializes the class.

Parameters:

Name Type Description Default
year int

Splitting year.

required
test_one_year_only bool

Whether the test set is df['year'] == year or df['year'] >= year.

True
**kwargs type

See kwargs seed, identity_skip and col_label of the parent class.

{}

split(df)

Implementation of the base splitting method.

Parameters:

Name Type Description Default
df DataFrame

A dataframe of the data. It must contain columns identity and date.

required

Returns:

Type Description
list[tuple[ndarray, ndarray]]

List of splits. Each split is a tuple of labels of the training and testing sets.
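
The cutoff rule, including the effect of test_one_year_only, can be sketched with pandas as follows. The function name is hypothetical and the sketch works on individual samples, deriving the year directly from the date column; it is not the library's implementation.

```python
import pandas as pd


def time_cutoff_split_sketch(df, year, test_one_year_only=True):
    """Return index labels of samples before `year` (train) and in or after it (test)."""
    years = pd.to_datetime(df["date"]).dt.year
    idx_train = df.index[(years < year).to_numpy()].to_numpy()
    if test_one_year_only:
        idx_test = df.index[(years == year).to_numpy()].to_numpy()
    else:
        idx_test = df.index[(years >= year).to_numpy()].to_numpy()
    return idx_train, idx_test


df = pd.DataFrame(
    {
        "identity": ["a", "a", "b", "b"],
        "date": ["2018-03-01", "2019-07-10", "2020-01-05", "2021-11-30"],
    },
    index=[10, 11, 12, 13],
)
idx_train, idx_test = time_cutoff_split_sketch(df, year=2020)
```

With the default setting, the 2021 sample is dropped from both sets; passing test_one_year_only=False moves it into the testing set.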

Lcg

Linear congruential generator for generating random numbers.

Copied from StackOverflow. It is independent of the machine, the distribution and package versions. It has some drawbacks (see the linked StackOverflow post) but is perfectly sufficient for our application.

Attributes:

Name Type Description
state int

Random state of the LCG.

__init__(seed, iterate=0)

Initialization function for LCG.

Parameters:

Name Type Description Default
seed int

Initial random seed.

required
iterate int

Number of initial random iterations.

0

random()

Generates a new random integer from the current state.

Returns:

Type Description
int

New random integer.

random_permutation(n)

Generates a random permutation of range(n).

Parameters:

Name Type Description Default
n int

Length of the sequence to be permuted.

required

Returns:

Type Description
ndarray

Permuted sequence.

random_shuffle(x)

Generates a random shuffle of x.

Parameters:

Name Type Description Default
x ndarray

Array to be permuted.

required

Returns:

Type Description
ndarray

Shuffled array.
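
To make the interface above concrete, here is a minimal pure-Python sketch of a linear congruential generator with the same method names. The class name `LcgSketch`, the multiplier, increment and modulus (the classic ANSI C example constants), and the way permutations are derived from draws are assumptions for illustration; they are not confirmed to match the library's Lcg.

```python
class LcgSketch:
    """Tiny linear congruential generator with a 31-bit state."""

    def __init__(self, seed, iterate=0):
        self.state = seed
        # Optionally advance the generator a few steps before first use.
        for _ in range(iterate):
            self.random()

    def random(self):
        # state <- (a * state + c) mod 2**31, using the assumed ANSI C constants
        self.state = (self.state * 1103515245 + 12345) & 0x7FFFFFFF
        return self.state

    def random_permutation(self, n):
        # Rank n fresh draws; ties are broken by position.
        draws = [self.random() for _ in range(n)]
        return sorted(range(n), key=lambda i: draws[i])

    def random_shuffle(self, x):
        return [x[i] for i in self.random_permutation(len(x))]


first = LcgSketch(1).random()
perm_a = LcgSketch(7).random_permutation(5)
perm_b = LcgSketch(7).random_permutation(5)
shuffled = LcgSketch(7).random_shuffle(["p", "q", "r"])
```

Because the state update uses only integer arithmetic, the same seed yields the same sequence on any machine, which is the property the description above relies on.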