Skip to content

Default splits

The dataset already contains a default split. However, this split can be used only the the closed-set setting. This is the most common case in machine learning, where the identities (classes) in the training and testing sets coincide. Researcher may want to perform experiments on other splits such as the open-set or disjoint-set splits. For comparability of methods, here we list the recommended way of creating splits. We note that the resulting splits are machine-independent and that it is possible to generate additional splits by using different seeds or proportion of the two sets.

Splits based on identities

For a proper description, see its own page.

Closed-set split

splitter = splits.ClosedSetSplit(0.8)
for idx_train, idx_test in splitter.split(df):
    df_train, df_test = df.loc[idx_train], df.loc[idx_test]

Open-set split

splitter = splits.OpenSetSplit(0.8, 0.1)
for idx_train, idx_test in splitter.split(df):
    df_train, df_test = df.loc[idx_train], df.loc[idx_test]

Disjoint-set split

splitter = splits.DisjointSetSplit(0.2)
for idx_train, idx_test in splitter.split(df):
    df_train, df_test = df.loc[idx_train], df.loc[idx_test]

Splits based on time

For a proper description, see its own page.

Time-proportion split

splitter = splits.TimeProportionSplit()
for idx_train, idx_test in splitter.split(df):
    df_train, df_test = df.loc[idx_train], df.loc[idx_test]

Time-cutoff split

splitter = splits.TimeCutoffSplitAll()
for idx_train, idx_test in splitter.split(df):
    df_train, df_test = df.loc[idx_train], df.loc[idx_test]