Reference splitting functions
This file describes methods associated with dataset splitting.
Balanced split
Base class for splitting datasets into training and testing sets.
Implements methods from this paper.
Its subclasses need to implement the split method.
It should perform balanced splits separately for all classes.
Its children are IdentitySplit and TimeAwareSplit.
IdentitySplit has children ClosedSetSplit, OpenSetSplit and DisjointSetSplit.
TimeAwareSplit has children TimeProportionSplit and TimeCutoffSplit.
Source code in wildlife_datasets/splits/balanced_split.py
initialize_lcg()
Returns the random number generator.
Returns:
Type | Description |
---|---|
Lcg | The random number generator. |
Source code in wildlife_datasets/splits/balanced_split.py
resplit_by_features(df, features, idx_train, n_max_cluster=5, eps_min=0.01, eps_max=0.5, eps_step=0.01, min_samples=2, save_clusters_prefix=None)
Creates a random re-split of an already existing split.
The re-split is based on similarity of features.
It runs DBSCAN with increasing eps (cluster radius) until the clusters are smaller than n_max_cluster.
Then it puts clusters of similar images into the training set.
The rest is randomly split into training and testing sets.
The re-split mimics the original split: the training set contains the same number of samples for each individual as before, and so does the testing set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain column | required |
features | ndarray | An array of features with the same length as | required |
idx_train | ndarray | Labels of the training set. | required |
n_max_cluster | int | Maximal size of cluster before | 5 |
eps_min | float | Lower bound for epsilon. | 0.01 |
eps_max | float | Upper bound for epsilon. | 0.5 |
eps_step | float | Step for epsilon. | 0.01 |
min_samples | int | Minimal cluster size. | 2 |
save_clusters_prefix | Optional[bool] | File name prefix for saving clusters. | None |
Returns:
Type | Description |
---|---|
Tuple[ndarray, ndarray] | List of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/balanced_split.py
split(*args, **kwargs)
Splitting method which needs to be implemented by subclasses.
It splits the dataframe df into labels idx_train and idx_test.
The subdataset is obtained by df.loc[idx_train] (not iloc).
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of splits. Each split is a list of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/balanced_split.py
Identity split
Bases: BalancedSplit
Base class for ClosedSetSplit, OpenSetSplit and DisjointSetSplit.
Source code in wildlife_datasets/splits/identity_split.py
general_split(df, individual_train, individual_test)
General-purpose split into the training and testing sets.
It puts all samples of individual_train into the training set and all samples of individual_test into the testing set.
The splitting is performed for each individual separately.
The split will result in at least one sample in both the training and testing sets.
If only one sample is available for an individual, it will be in the training set.
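The rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `general_split_sketch`, the `identities` mapping (sample label to individual name), the explicit `ratio_train` argument and the use of `random.Random` instead of the library's LCG are all hypothetical.

```python
import random
from collections import defaultdict


def general_split_sketch(identities, individual_train, individual_test,
                         ratio_train=0.8, seed=0):
    """Illustrative per-individual split; not the library's implementation.

    identities maps each sample label to the name of its individual.
    """
    rng = random.Random(seed)  # stand-in for the library's LCG generator
    by_individual = defaultdict(list)
    for label, name in identities.items():
        by_individual[name].append(label)

    idx_train, idx_test = [], []
    for name, labels in by_individual.items():
        if name in individual_train:
            # All samples of these individuals go to the training set.
            idx_train.extend(labels)
        elif name in individual_test:
            # All samples of these individuals go to the testing set.
            idx_test.extend(labels)
        elif len(labels) == 1:
            # A single sample cannot be split; it goes to the training set.
            idx_train.extend(labels)
        else:
            # Split the individual, keeping at least one sample on each side.
            rng.shuffle(labels)
            n_train = int(round(ratio_train * len(labels)))
            n_train = min(max(n_train, 1), len(labels) - 1)
            idx_train.extend(labels[:n_train])
            idx_test.extend(labels[n_train:])
    return sorted(idx_train), sorted(idx_test)


identities = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b",
              5: "c", 6: "c", 7: "d", 8: "d"}
idx_train, idx_test = general_split_sketch(
    identities, ["c"], ["d"], ratio_train=0.5)
print(idx_train, idx_test)
```

Here individual "c" is forced into the training set, "d" into the testing set, the single-sample individual "b" lands in the training set, and "a" is divided between both sets.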
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain column | required |
individual_train | List[str] | Individuals to be only in the training set. | required |
individual_test | List[str] | Individuals to be only in the testing set. | required |
Returns:
Type | Description |
---|---|
Tuple[ndarray, ndarray] | List of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/identity_split.py
modify_df(df)
Prepares dataframe for splits.
Removes identities specified in self.identity_skip (usually unknown identities).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain columns | required |
Returns:
Type | Description |
---|---|
DataFrame | Modified dataframe of the data. |
Source code in wildlife_datasets/splits/identity_split.py
Closed-set split
Bases: IdentitySplit
Closed-set splitting method into training and testing sets.
All individuals are in both the training and testing sets. The only exception is that individuals with only one sample are in the training set only. Implementation of this paper.
Source code in wildlife_datasets/splits/identity_split.py
__init__(ratio_train, seed=666, identity_skip='unknown')
Initializes the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio_train | float | Approximate size of the training set. | required |
seed | int | Initial seed for the LCG random generator. | 666 |
identity_skip | str | Name of the identities to ignore. | 'unknown' |
Source code in wildlife_datasets/splits/identity_split.py
split(df)
Implementation of the base splitting method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain column | required |
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of splits. Each split is a list of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/identity_split.py
Open-set split
Bases: IdentitySplit
Open-set splitting method into training and testing sets.
Some individuals are in the testing set but not in the training set. Implementation of this paper.
Source code in wildlife_datasets/splits/identity_split.py
__init__(ratio_train, ratio_class_test=None, n_class_test=None, seed=666, identity_skip='unknown')
Initializes the class.
The user must provide exactly one of ratio_class_test and n_class_test.
The latter specifies the number of individuals to be only in the testing set.
The former specifies the ratio of samples of individuals (not individuals themselves) to be only in the testing set.
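The difference between the two options can be illustrated with a short sketch. It is a hedged example, not the library's code: the helper name `pick_test_only_individuals`, the input format (one individual name per sample) and the random order in which individuals are considered are assumptions made for illustration.

```python
import random
from collections import Counter


def pick_test_only_individuals(identities, ratio_class_test=None,
                               n_class_test=None, seed=0):
    """Choose individuals that will appear only in the testing set.

    identities is a sequence with one individual name per sample.
    """
    if (ratio_class_test is None) == (n_class_test is None):
        raise ValueError(
            "Provide exactly one of ratio_class_test and n_class_test.")

    counts = Counter(identities)  # number of samples per individual
    names = sorted(counts)
    random.Random(seed).shuffle(names)  # assumed selection order

    if n_class_test is not None:
        # Fixed number of individuals, regardless of their sample counts.
        return names[:n_class_test]

    # Add individuals until their samples cover the requested ratio.
    target = ratio_class_test * len(identities)
    chosen, n_samples = [], 0
    for name in names:
        if n_samples >= target:
            break
        chosen.append(name)
        n_samples += counts[name]
    return chosen


samples = ["a"] * 4 + ["b"] * 4 + ["c"] * 2
print(pick_test_only_individuals(samples, n_class_test=2))
print(pick_test_only_individuals(samples, ratio_class_test=0.5))
```

With n_class_test the result always has the requested number of individuals, whereas with ratio_class_test the number of individuals depends on how many samples each one has.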
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio_train | float | Approximate size of the training set. | required |
ratio_class_test | float | Approximate ratio of samples of individuals only in the testing set. | None |
n_class_test | int | Number of individuals only in the testing set. | None |
seed | int | Initial seed for the LCG random generator. | 666 |
identity_skip | str | Name of the identities to ignore. | 'unknown' |
Source code in wildlife_datasets/splits/identity_split.py
split(df)
Implementation of the base splitting method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain column | required |
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of splits. Each split is a list of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/identity_split.py
Disjoint-set split
Bases: IdentitySplit
Disjoint-set splitting method into training and testing sets.
No individuals are in both the training and testing sets. Implementation of this paper.
Source code in wildlife_datasets/splits/identity_split.py
__init__(ratio_class_test=None, n_class_test=None, seed=666, identity_skip='unknown')
Initializes the class.
The user must provide exactly one of ratio_class_test and n_class_test.
The latter specifies the number of individuals to be only in the testing set.
The former specifies the ratio of samples of individuals (not individuals themselves) to be only in the testing set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio_class_test | float | Approximate ratio of samples of individuals only in the testing set. | None |
n_class_test | int | Number of individuals only in the testing set. | None |
seed | int | Initial seed for the LCG random generator. | 666 |
identity_skip | str | Name of the identities to ignore. | 'unknown' |
Source code in wildlife_datasets/splits/identity_split.py
split(df)
Implementation of the base splitting method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain column | required |
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of splits. Each split is a list of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/identity_split.py
Time-aware split
Bases: BalancedSplit
Base class for TimeProportionSplit and TimeCutoffSplit.
Source code in wildlife_datasets/splits/time_aware_split.py
modify_df(df)
Prepares dataframe for splits.
Removes identities specified in self.identity_skip (usually unknown identities).
Converts the date column into a unified format.
Adds the year column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain columns | required |
Returns:
Type | Description |
---|---|
DataFrame | Modified dataframe of the data. |
Source code in wildlife_datasets/splits/time_aware_split.py
resplit_random(df, idx_train, idx_test, year_max=np.inf)
Creates a random re-split of an already existing split.
The re-split mimics the original split: the training set contains the same number of samples for each individual as before, and so does the testing set.
The re-split samples may be drawn only from df['year'] <= year_max.
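The count-preserving idea can be sketched as follows. This is an illustrative example only: the function name `resplit_random_sketch`, the `identities` mapping and the use of `random.Random` in place of the library's LCG are assumptions, and the year_max restriction is deliberately left out to keep the sketch short.

```python
import random
from collections import Counter, defaultdict


def resplit_random_sketch(identities, idx_train, idx_test, seed=0):
    """Randomly re-split while keeping per-individual sample counts.

    identities maps each sample label to the name of its individual.
    """
    rng = random.Random(seed)  # stand-in for the library's LCG generator
    # Number of training samples each individual had in the original split.
    n_train = Counter(identities[label] for label in idx_train)

    # Pool all labels of each individual that took part in the split.
    pool = defaultdict(list)
    for label in list(idx_train) + list(idx_test):
        pool[identities[label]].append(label)

    new_train, new_test = [], []
    for name, labels in pool.items():
        rng.shuffle(labels)
        k = n_train[name]  # 0 for individuals that were test-only
        new_train.extend(labels[:k])
        new_test.extend(labels[k:])
    return sorted(new_train), sorted(new_test)


identities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
old_train, old_test = [0, 1, 3, 5], [2, 4]
new_train, new_test = resplit_random_sketch(identities, old_train, old_test)
print(new_train, new_test)
```

Each individual keeps its original number of training and testing samples; only which samples fall on which side changes.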
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain columns | required |
idx_train | ndarray | Labels of the training set. | required |
idx_test | ndarray | Labels of the testing set. | required |
year_max | int | Considers only entries with | inf |
Returns:
Type | Description |
---|---|
Tuple[ndarray, ndarray] | List of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/time_aware_split.py
Time-proportion split
Bases: TimeAwareSplit
Time-proportion non-random splitting method into training and testing sets.
For each individual, it extracts unique observation dates and puts half into the training set and half into the testing set. Ignores individuals with only one observation date. Implementation of this paper.
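A minimal sketch of this idea is given below. It rests on several assumptions that the text does not confirm: earlier dates are assumed to go to the training set, dates are assumed to be ISO-formatted strings (so that sorting them is chronological), and the names `time_proportion_sketch` and `records` are hypothetical.

```python
from collections import defaultdict


def time_proportion_sketch(records, ratio=0.5):
    """Split samples by observation date, separately for each individual.

    records maps each sample label to a pair (individual, 'YYYY-MM-DD').
    ratio is the fraction of unique dates sent to the testing set.
    """
    dates_by_individual = defaultdict(set)
    for individual, date in records.values():
        dates_by_individual[individual].add(date)

    # For each individual, decide which unique dates belong to training.
    train_dates = {}
    for individual, dates in dates_by_individual.items():
        if len(dates) < 2:
            continue  # individuals with one observation date are ignored
        ordered = sorted(dates)  # ISO date strings sort chronologically
        n_test = min(max(int(ratio * len(ordered)), 1), len(ordered) - 1)
        train_dates[individual] = set(ordered[:len(ordered) - n_test])

    idx_train, idx_test = [], []
    for label, (individual, date) in records.items():
        if individual not in train_dates:
            continue
        if date in train_dates[individual]:
            idx_train.append(label)
        else:
            idx_test.append(label)
    return idx_train, idx_test


records = {
    0: ("a", "2020-01-01"), 1: ("a", "2020-01-01"), 2: ("a", "2021-05-02"),
    3: ("b", "2019-03-03"), 4: ("a", "2022-01-01"), 5: ("a", "2023-01-01"),
}
idx_train, idx_test = time_proportion_sketch(records)
print(idx_train, idx_test)
```

All samples taken on the same date stay on the same side of the split, which is the point of splitting by unique dates rather than by samples.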
Source code in wildlife_datasets/splits/time_aware_split.py
__init__(ratio=0.5, seed=666, identity_skip='unknown')
Initializes the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ratio | float | The fraction of dates going to the testing set. | 0.5 |
seed | int | Initial seed for the LCG random generator. | 666 |
identity_skip | str | Name of the identities to ignore. | 'unknown' |
Source code in wildlife_datasets/splits/time_aware_split.py
split(df)
Implementation of the base splitting method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain columns | required |
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of splits. Each split is a list of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/time_aware_split.py
Time-cutoff split
Bases: TimeAwareSplit
Time-cutoff non-random splitting method into training and testing sets.
Puts all individuals observed before year into the training set.
Puts all individuals observed during year into the testing set.
Ignores all individuals observed after year.
Implementation of this paper.
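The three rules can be written down in a few lines. The sketch below is only an illustration of the behaviour described here: the function name `time_cutoff_sketch` and the `years` mapping are hypothetical, and the test_one_year_only option is not modelled.

```python
def time_cutoff_sketch(years, year):
    """Split sample labels around a cutoff year.

    years maps each sample label to its observation year.
    """
    # Observations before the cutoff year form the training set.
    idx_train = [label for label, y in years.items() if y < year]
    # Observations from the cutoff year form the testing set.
    idx_test = [label for label, y in years.items() if y == year]
    # Observations after the cutoff year are ignored.
    return idx_train, idx_test


years = {0: 2018, 1: 2019, 2: 2020, 3: 2020, 4: 2021}
idx_train, idx_test = time_cutoff_sketch(years, 2020)
print(idx_train, idx_test)
```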
Source code in wildlife_datasets/splits/time_aware_split.py
__init__(year, test_one_year_only=True, seed=666, identity_skip='unknown')
Initializes the class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
year | int | Splitting year. | required |
test_one_year_only | bool | Whether the test set is | True |
seed | int | Initial seed for the LCG random generator. | 666 |
identity_skip | str | Name of the identities to ignore. | 'unknown' |
Source code in wildlife_datasets/splits/time_aware_split.py
split(df)
Implementation of the base splitting method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A dataframe of the data. It must contain columns | required |
Returns:
Type | Description |
---|---|
List[Tuple[ndarray, ndarray]] | List of labels of the training and testing sets. |
Source code in wildlife_datasets/splits/time_aware_split.py
Lcg
Linear congruential generator for generating random numbers.
Copied from StackOverflow. It is machine-, distribution- and package version-independent. It has some drawbacks (check the link above) but is perfectly sufficient for our application.
Attributes:
Name | Type | Description |
---|---|---|
state | int | Random state of the LCG. |
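To show why such a generator is reproducible across machines and package versions, here is a small sketch of a linear congruential generator with the same method names. It is not the library's implementation: the class name `LcgSketch` is hypothetical, the multiplier, increment and modulus are the well-known Numerical Recipes constants chosen as an assumption, and the methods return plain lists rather than arrays.

```python
class LcgSketch:
    """Illustrative linear congruential generator using integers only."""

    MULTIPLIER = 1664525      # assumed constants (Numerical Recipes)
    INCREMENT = 1013904223
    MODULUS = 2 ** 32

    def __init__(self, seed, iterate=0):
        self.state = seed
        # Optionally discard a few initial values.
        for _ in range(iterate):
            self.random()

    def random(self):
        # state <- (a * state + c) mod m, using exact integer arithmetic.
        self.state = (self.state * self.MULTIPLIER + self.INCREMENT) % self.MODULUS
        return self.state

    def random_permutation(self, n):
        # Fisher-Yates shuffle driven by the generator.
        perm = list(range(n))
        for i in range(n - 1, 0, -1):
            # Use the high bits to pick an index in [0, i].
            j = (self.random() * (i + 1)) >> 32
            perm[i], perm[j] = perm[j], perm[i]
        return perm

    def random_shuffle(self, x):
        return [x[i] for i in self.random_permutation(len(x))]


lcg = LcgSketch(seed=666)
print(lcg.random_permutation(5))
print(lcg.random_shuffle(["a", "b", "c", "d"]))
```

Because every step is exact integer arithmetic, two runs with the same seed produce the same permutations on any platform, which is the property the splits rely on.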
Source code in wildlife_datasets/splits/lcg.py
__init__(seed, iterate=0)
Initialization function for LCG.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
seed | int | Initial random seed. | required |
iterate | int | Number of initial random iterations. | 0 |
Source code in wildlife_datasets/splits/lcg.py
random()
Generates a new random integer from the current state.
Returns:
Type | Description |
---|---|
int | New random integer. |
Source code in wildlife_datasets/splits/lcg.py
random_permutation(n)
Generates a random permutation of range(n).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | Length of the sequence to be permuted. | required |
Returns:
Type | Description |
---|---|
ndarray | Permuted sequence. |
Source code in wildlife_datasets/splits/lcg.py
random_shuffle(x)
Generates a random shuffle of x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | ndarray | Array to be permuted. | required |
Returns:
Type | Description |
---|---|
ndarray | Shuffled array. |
Source code in wildlife_datasets/splits/lcg.py