How to use splitting functions

A crucial part of machine learning is training a method on a training set and evaluating it on a separate testing set. We provide a default split for each dataset. However, it is possible to create additional custom splits. The following splits are implemented:

closed-set split. The default split where the testing set does not contain any new individuals. For each sample from the testing set, the goal is to assign some individual from the training set. The name follows from the fact that the population is closed.
open-set split. The split where the testing set contains new individuals. For each sample from the testing set, the goal is to assign some individual from the training set or predict that the sample contains a new individual. The name follows from the fact that the population is open.
disjoint-set split. The split where no individual appears in both the training and testing sets. For each sample from the testing set, the goal is to assign some individual from the testing set. The name follows from the fact that the two populations are disjoint.
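
The three definitions can be checked mechanically from the identities present in each set. The following is a minimal sketch; `classify_split` is a hypothetical helper written for illustration and is not part of wildlife_datasets.

```python
def classify_split(train_ids, test_ids):
    """Name the split type from the identities found in each set."""
    train, test = set(train_ids), set(test_ids)
    if train.isdisjoint(test):
        # No identity is shared between the two sets.
        return "disjoint-set"
    if test <= train:
        # Every testing identity is already known from training.
        return "closed-set"
    # Some, but not all, testing identities are new.
    return "open-set"


print(classify_split(["a", "b", "c"], ["a", "b"]))  # closed-set
print(classify_split(["a", "b"], ["a", "b", "c"]))  # open-set
print(classify_split(["a", "b"], ["c", "d"]))       # disjoint-set
```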

The splits are usually created randomly, but when the dataset contains timestamps, it is also possible to create the split based on timestamps. Due to our representation of the random number generator, the splits are both machine- and system-independent. We followed the presentation from this paper.

We assume that we have already downloaded the MacaqueFaces dataset. Then we load the dataset and the dataframe.

from wildlife_datasets import datasets, splits

dataset = datasets.MacaqueFaces('data/MacaqueFaces')
df = dataset.df

Splits based on identities

Splits on identities perform the split for each individual separately. All these splits are random.
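
To illustrate what splitting each individual separately means, here is a minimal sketch on a toy dataframe. The column name `identity`, the helper `split_per_identity` and the use of NumPy's random generator are assumptions made for this example; the library uses its own generator and implementation.

```python
import numpy as np
import pandas as pd

# Toy dataframe; the column name 'identity' is an assumption for this sketch.
df_toy = pd.DataFrame(
    {"identity": ["a"] * 10 + ["b"] * 5 + ["c"] * 3},
    index=[f"img_{i}" for i in range(18)],
)


def split_per_identity(df, ratio_train, seed=0):
    """Randomly split each identity separately; returns index labels."""
    rng = np.random.default_rng(seed)
    idx_train, idx_test = [], []
    for _, group in df.groupby("identity"):
        labels = rng.permutation(group.index.to_numpy())
        # The ratio is applied to each identity and rounded to whole samples.
        n_train = int(round(ratio_train * len(labels)))
        idx_train.extend(labels[:n_train].tolist())
        idx_test.extend(labels[n_train:].tolist())
    return idx_train, idx_test


idx_train, idx_test = split_per_identity(df_toy, 0.8)
print(len(idx_train), len(idx_test))  # 14 4
```

Because the ratio is rounded per individual (8 + 4 + 2 samples here), the training set holds 14 of 18 samples, roughly 77.8% rather than exactly 80%.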

Closed-set split

The most common split is the closed-set split, where each individual has samples in both the training and testing sets.

splitter = splits.ClosedSetSplit(0.8)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware closed-set
Samples: train/test/unassigned/total = 5024/1256/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 80.00%
Fraction of test set only = 0.00%

This code generates a split where the training set contains approximately 80% of all samples. Although this particular split contains precisely 80%, the actual ratio of the training set may differ because the split is done separately for each individual. The outputs of the splitter are labels, not indices; therefore, we access the training and testing sets by

df_train, df_test = df.loc[idx_train], df.loc[idx_test]
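
The difference between labels and positions matters whenever the dataframe index is not the default 0..n-1 range. A small sketch with a toy dataframe (the index values and the `identity` column are made up for illustration):

```python
import pandas as pd

# Toy dataframe whose index holds labels rather than 0..n-1 positions.
df_toy = pd.DataFrame(
    {"identity": ["a", "a", "b", "b"]},
    index=[10, 20, 30, 40],
)
idx_train, idx_test = [10, 30], [20, 40]

# Labels must be used with .loc; .iloc would interpret them as positions.
df_train, df_test = df_toy.loc[idx_train], df_toy.loc[idx_test]
print(df_train["identity"].tolist())  # ['a', 'b']

# If positions are needed (e.g. to index a NumPy array of features),
# the labels can be converted with Index.get_indexer.
pos_train = df_toy.index.get_indexer(idx_train)
print(pos_train.tolist())  # [0, 2]
```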

Open-set split

In the open-set split, there are some individuals with all their samples only in the testing set. For the remaining individuals, the closed-set split is performed.

splitter = splits.OpenSetSplit(0.8, 0.1)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware open-set
Samples: train/test/unassigned/total = 5028/1252/0/6280
Classes: train/test/unassigned/total = 30/34/0/34
Samples: train only/test only        = 0/760
Classes: train only/test only/joint  = 0/4/30

Fraction of train set     = 80.06%
Fraction of test set only = 12.10%

This code generates a split where approximately 10% of samples are put directly into the testing set. It also specifies that the training set should contain approximately 80% of all samples. As in the previous (and all following) cases, the numbers are only approximate, with the actual ratios being 12.10% and 80.06%. The other possibility to create this split is to prescribe the number of individuals (instead of the ratio of samples) which go directly into the testing set.

splitter = splits.OpenSetSplit(0.8, n_class_test=5)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware open-set
Samples: train/test/unassigned/total = 5025/1255/0/6280
Classes: train/test/unassigned/total = 29/34/0/34
Samples: train only/test only        = 0/940
Classes: train only/test only/joint  = 0/5/29

Fraction of train set     = 80.02%
Fraction of test set only = 14.97%

Disjoint-set split

For the disjoint-set split, each individual has all samples either in the training or testing set but never in both. As with the open-set split, we can create the split either by

splitter = splits.DisjointSetSplit(0.2)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware disjoint-set
Samples: train/test/unassigned/total = 4980/1300/0/6280
Classes: train/test/unassigned/total = 27/7/0/34
Samples: train only/test only        = 4980/1300
Classes: train only/test only/joint  = 27/7/0

Fraction of train set     = 79.30%
Fraction of test set only = 20.70%

or

splitter = splits.DisjointSetSplit(n_class_test=10)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware disjoint-set
Samples: train/test/unassigned/total = 4420/1860/0/6280
Classes: train/test/unassigned/total = 24/10/0/34
Samples: train only/test only        = 4420/1860
Classes: train only/test only/joint  = 24/10/0

Fraction of train set     = 70.38%
Fraction of test set only = 29.62%

The first method puts approximately 20% of the samples into the testing set, while the second method puts 10 classes into the testing set.
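
The two ways of specifying a disjoint-set split can be sketched as follows. This is only one plausible selection rule (shuffle the identities, then take whole identities until the requested share of samples is reached); the helper `disjoint_split_sketch`, its parameter names and the `identity` column are assumptions for illustration and may differ from what the library does internally.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({"identity": list("aaaaabbbbcccdde")})


def disjoint_split_sketch(df, ratio_test=None, n_class_test=None, seed=0):
    """Move whole identities to the testing set; returns index labels."""
    counts = df["identity"].value_counts()
    rng = np.random.default_rng(seed)
    ids = rng.permutation(counts.index.to_numpy())
    if n_class_test is None:
        # Take whole identities until the requested share of samples is reached.
        cumulative = np.cumsum(counts.loc[ids].to_numpy())
        target = ratio_test * len(df)
        n_class_test = int(np.searchsorted(cumulative, target)) + 1
    test_ids = list(ids[:n_class_test])
    is_test = df["identity"].isin(test_ids).to_numpy()
    return df.index[~is_test].tolist(), df.index[is_test].tolist()


idx_train, idx_test = disjoint_split_sketch(df_toy, ratio_test=0.2)
print(len(idx_test) / len(df_toy))

idx_train_n, idx_test_n = disjoint_split_sketch(df_toy, n_class_test=2)
print(df_toy.loc[idx_test_n, "identity"].nunique())  # 2
```

Since identities are moved as a whole, the realized share of testing samples generally differs from the requested one.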

Splits based on time

Splits based on time create some cutoff time and put everything before the cutoff time into the training set and everything after the cutoff time into the testing set. Therefore, these splits are not random but deterministic. These splits also ignore all samples without timestamps.
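
The general idea of a cutoff-based split can be sketched in a few lines of pandas. The column name `date`, the toy data and the choice to send samples taken exactly at the cutoff to the testing set are assumptions of this sketch.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "identity": ["a", "a", "b", "b", "c"],
    "date": ["2014-03-01", "2015-06-10", "2014-07-21", None, "2015-01-05"],
})
df_toy["date"] = pd.to_datetime(df_toy["date"])

cutoff = pd.Timestamp("2015-01-01")
has_date = df_toy["date"].notna()

# Samples before the cutoff go to training, the rest to testing.
idx_train = df_toy.index[has_date & (df_toy["date"] < cutoff)].tolist()
idx_test = df_toy.index[has_date & (df_toy["date"] >= cutoff)].tolist()
# Samples without a timestamp are left out of both sets.
idx_unassigned = df_toy.index[~has_date].tolist()

print(idx_train, idx_test, idx_unassigned)  # [0, 2] [1, 4] [3]
```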

Time-proportion split

The time-proportion split counts the number of days on which each individual was observed. Then it puts all samples corresponding to the first half of the observation days into the training set and all remaining samples into the testing set. It ignores all individuals observed on only one day. Since all individuals are in both the training and testing sets, it leads to a closed-set split.

splitter = splits.TimeProportionSplit()
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-proportion closed-set
Samples: train/test/unassigned/total = 3731/2549/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 59.41%
Fraction of test set only = 0.00%

Even though the split is non-random, it still requires the seed because of the random resplit.
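
The day-counting rule of the time-proportion split can be sketched as follows. The column names and the rounding of an odd number of observation days (rounded up here) are assumptions of this sketch and may differ from the library's behaviour.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "identity": ["a"] * 4 + ["b"] * 3 + ["c"] * 2,
    "date": pd.to_datetime([
        "2014-01-01", "2014-01-01", "2014-01-05", "2014-02-01",  # a: 3 days
        "2014-03-01", "2014-03-02", "2014-03-02",                # b: 2 days
        "2014-04-01", "2014-04-01",                              # c: 1 day
    ]),
})

idx_train, idx_test = [], []
for _, group in df_toy.groupby("identity"):
    day = group["date"].dt.normalize()
    days = day.drop_duplicates().sort_values()
    if len(days) < 2:
        continue  # observed on a single day only: left unassigned
    n_train_days = (len(days) + 1) // 2  # assumption: odd counts round up
    in_train = day.isin(days.iloc[:n_train_days]).to_numpy()
    idx_train.extend(group.index[in_train].tolist())
    idx_test.extend(group.index[~in_train].tolist())

print(idx_train, idx_test)  # [0, 1, 2, 4] [3, 5, 6]
```

Each individual gets its own cutoff day, and individual `c`, seen on one day only, ends up in neither set.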

Time-cutoff split

While the time-proportion split selects a different cutoff day for each individual, the time-cutoff split creates one cutoff year for the whole dataset. All samples taken before the cutoff year go to the training set, while all samples taken during the cutoff year go to the testing set. Since some individuals may be present only in the testing set, this split is usually an open-set split.

splitter = splits.TimeCutoffSplit(2015)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-cutoff closed-set
Samples: train/test/unassigned/total = 3745/2535/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 59.63%
Fraction of test set only = 0.00%

It is also possible to place all samples taken during or after the cutoff year to the testing set by

splitter = splits.TimeCutoffSplit(2015, test_one_year_only=False)
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-cutoff closed-set
Samples: train/test/unassigned/total = 3745/2535/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 59.63%
Fraction of test set only = 0.00%

It is also possible to create all possible time-cutoff splits for different years by

splitter = splits.TimeCutoffSplitAll()
for idx_train, idx_test in splitter.split(df):
    splits.analyze_split(df, idx_train, idx_test)

Split: time-cutoff closed-set
Samples: train/test/unassigned/total = 3745/2535/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 59.63%
Fraction of test set only = 0.00%
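
Enumerating the cutoff years can be sketched as below. Skipping the earliest year (which would leave the training set empty) is an assumption of this sketch, as are the column names and toy data.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "identity": ["a", "a", "a", "b", "b", "c"],
    "date": pd.to_datetime([
        "2013-05-01", "2014-05-01", "2015-05-01",
        "2014-06-01", "2015-06-01", "2015-07-01",
    ]),
})
years = df_toy["date"].dt.year

splits_by_year = {}
# The earliest year is skipped because nothing precedes it.
for cutoff in sorted(years.unique())[1:]:
    idx_train = df_toy.index[years < cutoff].tolist()
    idx_test = df_toy.index[years == cutoff].tolist()
    splits_by_year[int(cutoff)] = (idx_train, idx_test)

for cutoff, (idx_train, idx_test) in splits_by_year.items():
    print(cutoff, idx_train, idx_test)
# 2014 [0] [1, 3]
# 2015 [0, 1, 3] [2, 4, 5]
```

In the 2015 split of this toy example, individual `c` appears only in the testing set, so that split is open-set.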

Random resplit

Since the splits based on time are not random, it is also possible to create similar random splits by

idx_train, idx_test = splitter.resplit_random(df, idx_train, idx_test)
splits.analyze_split(df, idx_train, idx_test)

Split: time-unaware closed-set
Samples: train/test/unassigned/total = 3745/2535/0/6280
Classes: train/test/unassigned/total = 34/34/0/34
Samples: train only/test only        = 0/0
Classes: train only/test only/joint  = 0/0/34

Fraction of train set     = 59.63%
Fraction of test set only = 0.00%

For each individual, the number of samples in the training set will be the same for the original and new splits. Since the new splits are random, they do not utilize the time at all.
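
The count-preserving property can be illustrated with a small sketch. The helper `resplit_random_sketch`, the `identity` column and the use of NumPy's generator are assumptions for illustration only; this is not the library's implementation.

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({"identity": list("aaaabbbcc")})
idx_train = [0, 1, 4, 7]        # e.g. produced by a time-based split
idx_test = [2, 3, 5, 6, 8]


def resplit_random_sketch(df, idx_train, idx_test, seed=0):
    """Reshuffle samples per identity while keeping per-identity train counts."""
    rng = np.random.default_rng(seed)
    used = df.loc[list(idx_train) + list(idx_test)]
    n_train = df.loc[idx_train, "identity"].value_counts()
    new_train, new_test = [], []
    for identity, group in used.groupby("identity"):
        labels = rng.permutation(group.index.to_numpy())
        k = int(n_train.get(identity, 0))  # same train count as before
        new_train.extend(labels[:k].tolist())
        new_test.extend(labels[k:].tolist())
    return new_train, new_test


new_train, new_test = resplit_random_sketch(df_toy, idx_train, idx_test)
print(df_toy.loc[new_train, "identity"].value_counts().sort_index().to_dict())
# {'a': 2, 'b': 1, 'c': 1}
```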