sparselm.model_selection#

Classes implementing hyper-parameter selection beyond GridSearchCV.

class sparselm.model_selection.GridSearchCV(estimator, param_grid, *, opt_selection_method='max_score', scoring='neg_root_mean_squared_error', n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)#

Bases: GridSearchCV

Exhaustive search over specified parameter values for an estimator.

Same as scikit-learn’s GridSearchCV, but allows applying the one-standard-error rule to all non-negative numerical hyper-parameters in order to obtain a robust, sparse estimate. Otherwise the documentation is the same as for scikit-learn’s GridSearchCV.

An additional argument, opt_selection_method, allows switching the hyper-parameter selection mode. Currently supported are “max_score” (default), which maximizes the score, and “one_std_score”, which applies the one-standard-error rule to the score.
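
The one-standard-error rule can be illustrated with a minimal, library-independent sketch. It assumes a single non-negative regularization parameter, per-fold scores where higher is better, and the common convention of picking the largest parameter value whose mean score stays within one standard error of the best mean. The function name and the data are hypothetical, and sparse-lm's actual implementation may differ in detail (for instance in whether it uses the standard error or the standard deviation).

```python
from math import sqrt
from statistics import mean, stdev


def one_std_select(param_values, fold_scores):
    """Pick a parameter by the one-standard-error rule (illustrative only).

    param_values: candidate values of one non-negative hyper-parameter.
    fold_scores: one list of per-fold scores per candidate (higher is better).
    """
    means = [mean(s) for s in fold_scores]
    # Standard error of the mean score across folds.
    sems = [stdev(s) / sqrt(len(s)) for s in fold_scores]
    best = max(range(len(means)), key=means.__getitem__)
    threshold = means[best] - sems[best]
    # Among candidates within one standard error of the best mean,
    # prefer the largest value (assumed to give the sparsest model).
    eligible = [p for p, m in zip(param_values, means) if m >= threshold]
    return max(eligible)


alphas = [0.001, 0.01, 0.1, 1.0]
scores = [
    [-0.50, -0.54, -0.52],  # alpha = 0.001
    [-0.49, -0.53, -0.51],  # alpha = 0.01 (best mean score)
    [-0.50, -0.55, -0.51],  # alpha = 0.1 (within one standard error)
    [-0.80, -0.90, -0.85],  # alpha = 1.0 (clearly worse)
]
chosen = one_std_select(alphas, scores)
```

In this toy data, plain score maximization would pick alpha = 0.01, while the rule as sketched here moves to the larger alpha = 0.1 because its mean score is statistically indistinguishable from the best one.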

Parameters:
  • estimator (Estimator) – An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

  • param_grid (dict or list[dict]) – Dictionary representing grid of hyper-parameters with their names as keys and possible values. If given as a list of multiple dicts, will search on multiple grids in parallel.

  • opt_selection_method (str, default="max_score") – The method used to select the optimal hyper-parameters. Defaults to “max_score”, which maximizes the score. Can also be “one_std_score”, which applies the one-standard-error rule to the scores.

  • scoring (str, callable, list, tuple or dict, default="neg_root_mean_squared_error") –

    Strategy to evaluate the performance of the cross-validated model on the test set.

    If scoring represents a single score, one can use:

    • a single string (see The scoring parameter: defining model evaluation rules)

    • a callable (see Defining your scoring strategy from metric functions) that returns a single value

    If scoring represents multiple scores, one can use:

    • a list or tuple of unique strings

    • a callable returning a dictionary where the keys are the metric names and the values are the metric scores

    • a dictionary with metric names as keys and callables as values

    See Specifying multiple metrics for evaluation for an example. In sparse-lm, “neg_root_mean_squared_error” is the default, in contrast to the r2_score used in scikit-learn.

  • n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • refit (bool, str, or callable, default=True) – Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end. Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this instance. Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer. See scoring parameter to know more about multiple metric evaluation.

  • cv (int, cross-validation generator or an iterable, default=None) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 5-fold cross validation

    • An integer, to specify the number of folds in a (Stratified)KFold

    • A CV splitter

    • An iterable yielding (train, test) splits as arrays of indices

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Refer to the User Guide for the various cross-validation strategies that can be used here.

    Note that if cv is not specified, KFold(5) is used and the training set is not shuffled before it is split. This can be dangerous if the training set has an internal ordering, for example when structures with similar composition are close to each other in the feature rows. In that case, use RepeatedKFold as cv instead.

  • verbose (int, default=0) – Controls the verbosity: the higher, the more messages. - >1 : the computation time for each fold and parameter candidate is displayed; - >2 : the score is also displayed; - >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.

  • pre_dispatch (int, or str, default='2*n_jobs') –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

    • An int, giving the exact number of total jobs that are spawned

    • A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

  • error_score ('raise' or numeric, default=np.nan) – Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

  • return_train_score (bool, default=False) – If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However, computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
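
The warning about unshuffled folds in the cv description above can be seen in a small pure-Python sketch. It mimics how contiguous test folds are laid out when no shuffling is applied; the helper name and the toy labels are hypothetical, and the sketch does not call scikit-learn.

```python
def contiguous_folds(n_samples, n_splits=5):
    """Yield (train, test) index lists for unshuffled k-fold splitting."""
    # The first n_samples % n_splits folds get one extra sample.
    sizes = [n_samples // n_splits + (i < n_samples % n_splits)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy data set: rows are ordered by a hypothetical composition label,
# so similar samples sit next to each other.
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 2
folds = list(contiguous_folds(len(labels), n_splits=5))

# Distinct labels that end up in each test fold.
test_labels = [sorted({labels[i] for i in test}) for _, test in folds]
# Labels available for training when the last fold is held out.
train_labels_last = {labels[i] for i in folds[-1][0]}
```

Here every test fold contains a single label, and the last fold is evaluated on "C" rows that never appear in its training set. A shuffling splitter such as RepeatedKFold, as recommended above, avoids this layout.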

fit(X, y=None, *, groups=None, **fit_params)#

Run fit with all sets of parameters.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,), default=None) – Target relative to X for classification or regression; None for unsupervised learning.

  • groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • **fit_params – Parameters passed to the fit method of the estimator. If a fit parameter is an array-like whose length is equal to n_samples, it will be split across CV groups along with X and y. For example, the sample_weight parameter is split because len(sample_weight) == len(X).

Returns:

self – Instance of fitted estimator.

Return type:

GridSearchCV
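
The splitting rule for array-like fit parameters can be sketched without any library. This only illustrates the behaviour described for **fit_params above and is not the actual code path; the helper name and the example parameters are hypothetical.

```python
def split_fit_params(fit_params, train_indices, n_samples):
    """Slice fit parameters that have one entry per sample; pass others through."""
    split = {}
    for name, value in fit_params.items():
        is_sized = hasattr(value, "__len__") and not isinstance(value, str)
        if is_sized and len(value) == n_samples:
            # One entry per sample: keep only the rows of this training fold.
            split[name] = [value[i] for i in train_indices]
        else:
            # Scalars and other objects are forwarded unchanged.
            split[name] = value
    return split


X = [[0.0], [1.0], [2.0], [3.0]]
fit_params = {
    "sample_weight": [1.0, 2.0, 3.0, 4.0],  # len == len(X): split per fold
    "check_input": True,                     # scalar: passed unchanged
}
train_fold = split_fit_params(fit_params, train_indices=[0, 2, 3], n_samples=len(X))
```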

property classes_#

Class labels.

Only available when refit=True and the estimator is a classifier.

decision_function(X)#

Call decision_function on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports decision_function.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_score – Result of the decision function for X based on the estimator with the best found parameters.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

inverse_transform(Xt)#

Call inverse_transform on the estimator with the best found params.

Only available if the underlying estimator implements inverse_transform and refit=True.

Parameters:

Xt (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

X – Result of the inverse_transform function for Xt based on the estimator with the best found parameters.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

property n_features_in_#

Number of features seen during fit.

Only available when refit=True.

predict(X)#

Call predict on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – The predicted labels or values for X based on the estimator with the best found parameters.

Return type:

ndarray of shape (n_samples,)

predict_log_proba(X)#

Call predict_log_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_log_proba.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – Predicted class log-probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes)

predict_proba(X)#

Call predict_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_proba.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – Predicted class probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes)

score(X, y=None)#

Return the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the best_estimator_.score method otherwise.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,), default=None) – Target relative to X for classification or regression; None for unsupervised learning.

Returns:

score – The score defined by scoring if provided, and the best_estimator_.score method otherwise.

Return type:

float
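
With the default scoring of this class, "neg_root_mean_squared_error", the returned float is the negated RMSE, so values closer to zero are better. A small pure-Python sketch of that quantity follows; the helper name is hypothetical and this is not the scorer's actual implementation.

```python
from math import sqrt


def neg_rmse(y_true, y_pred):
    """Negated root mean squared error: higher (closer to 0) is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return -sqrt(sum(squared) / len(squared))


y_true = [1.0, 2.0, 3.0, 4.0]
good = neg_rmse(y_true, [1.0, 2.0, 3.0, 4.0])  # perfect prediction
poor = neg_rmse(y_true, [3.0, 4.0, 5.0, 6.0])  # every prediction off by 2
```

Negating the error keeps the "greater is better" convention that the search relies on when it ranks candidates.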

score_samples(X)#

Call score_samples on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports score_samples.

New in scikit-learn version 0.24.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of the underlying estimator.

Returns:

y_score – The best_estimator_.score_samples method.

Return type:

ndarray of shape (n_samples,)

set_fit_request(*, groups='$UNCHANGED$')#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:
  • groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for groups parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
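
How names of the form <component>__<parameter> resolve can be sketched with a toy class. ToyEstimator is a hypothetical stand-in and not the scikit-learn implementation, which additionally validates parameter names.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator with plain attributes."""

    def __init__(self, **params):
        self.__dict__.update(params)

    def set_params(self, **params):
        for name, value in params.items():
            # Split "component__parameter" at the first double underscore.
            head, sep, tail = name.partition("__")
            if sep:
                # Delegate the remainder to the nested component.
                getattr(self, head).set_params(**{tail: value})
            else:
                setattr(self, head, value)
        return self


inner = ToyEstimator(alpha=1.0)
search = ToyEstimator(estimator=inner, cv=5)
returned = search.set_params(estimator__alpha=0.1, cv=3)
```

After the call, the nested object's alpha and the outer object's cv have both been updated, and the outer instance is returned so that calls can be chained.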

transform(X)#

Call transform on the estimator with the best found parameters.

Only available if the underlying estimator supports transform and refit=True.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

Xt – X transformed in the new space based on the estimator with the best found parameters.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

class sparselm.model_selection.LineSearchCV(estimator, param_grid, *, opt_selection_method='max_score', n_iter=None, scoring='neg_root_mean_squared_error', n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)#

Bases: BaseSearchCV

Implements line search.

In line search, a one-dimensional grid search is performed on one hyper-parameter at a time, cycling through the hyper-parameters up to a given number of iterations. Each one-dimensional search generates a GridSearchCV object.
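
The idea can be sketched in plain Python as a cycle of one-dimensional searches. This is an illustration under stated assumptions, not sparse-lm's implementation: the function names, the toy score, and the choice to start from the first candidate of each hyper-parameter are hypothetical, and a plain function call stands in for the cross-validated score.

```python
def line_search(score, param_grid, n_iter=None):
    """Cyclic 1D grid searches (illustrative sketch only).

    score: callable mapping a dict of parameters to a number (higher is better).
    param_grid: list of (name, candidate_values) tuples, searched in this order.
    """
    if n_iter is None:
        n_iter = 2 * len(param_grid)  # default given in the n_iter description
    # Start from the first candidate of every hyper-parameter (an assumption).
    current = {name: values[0] for name, values in param_grid}
    for it in range(n_iter):
        name, values = param_grid[it % len(param_grid)]
        # 1D search: vary one hyper-parameter, keep the others fixed.
        current[name] = max(values, key=lambda v: score({**current, name: v}))
    return current


def toy_score(params):
    # Hypothetical smooth score peaked at alpha=0.1, l0_ratio=0.5.
    return -((params["alpha"] - 0.1) ** 2) - (params["l0_ratio"] - 0.5) ** 2


grid = [("alpha", [0.001, 0.01, 0.1, 1.0]), ("l0_ratio", [0.1, 0.5, 0.9])]
best = line_search(toy_score, grid)
```

With two hyper-parameters of 4 and 3 candidates, each cycle evaluates 4 + 3 settings rather than the 4 × 3 of a full grid, at the price of possibly missing optima that depend on interactions between the hyper-parameters.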

Initialize a LineSearch.

Parameters:
  • estimator (Estimator) – An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.

  • param_grid (list[tuple]) – List of tuples with the parameter name (str) as the first element and a list of parameter settings to try as the second element. In LineSearchCV, the hyper-parameters given first are searched first in each cycle. Searching multiple grids is NOT supported!

  • opt_selection_method (list(str) or str, default="max_score") – The method used to select the optimal hyper-parameters. Defaults to “max_score”, which maximizes the score. Can also be “one_std_score”, which applies the one-standard-error rule to the scores. In line search, this argument can also be given as a list of str, which allows a different selection method for each corresponding hyper-parameter in param_grid. For example, a good practice when using the L2L0 estimator could be opt_selection_method = [“one_std_score”, “max_score”] for “alpha” and “l0_ratio”, respectively.

  • n_iter (int, default=None) – Number of iterations to perform. One iteration means a 1D search on one hyper-param, and we scan one hyper-param at a time in the order of param_grid. n_iter must be at least as large as the number of hyper-params. Default is 2 * number of hyper-params.

  • scoring (str, callable, list, tuple or dict, default="neg_root_mean_squared_error") –

    Strategy to evaluate the performance of the cross-validated model on the test set.

    If scoring represents a single score, one can use:

    • a single string (see The scoring parameter: defining model evaluation rules)

    • a callable (see Defining your scoring strategy from metric functions) that returns a single value

    If scoring represents multiple scores, one can use:

    • a list or tuple of unique strings

    • a callable returning a dictionary where the keys are the metric names and the values are the metric scores

    • a dictionary with metric names as keys and callables as values

    See Specifying multiple metrics for evaluation for an example. In sparse-lm, “neg_root_mean_squared_error” is the default, in contrast to the r2_score used in scikit-learn.

  • n_jobs (int, default=None) – Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • refit (bool, str, or callable, default=True) – Refit an estimator using the best found parameters on the whole dataset. For multiple metric evaluation, this needs to be a str denoting the scorer that would be used to find the best parameters for refitting the estimator at the end. Where there are considerations other than maximum score in choosing a best estimator, refit can be set to a function which returns the selected best_index_ given cv_results_. In that case, the best_estimator_ and best_params_ will be set according to the returned best_index_ while the best_score_ attribute will not be available. The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this instance. Also for multiple metric evaluation, the attributes best_index_, best_score_ and best_params_ will only be available if refit is set and all of them will be determined w.r.t this specific scorer. See scoring parameter to know more about multiple metric evaluation.

  • cv (int, cross-validation generator or an iterable, default=None) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 5-fold cross validation

    • An integer, to specify the number of folds in a (Stratified)KFold

    • A CV splitter

    • An iterable yielding (train, test) splits as arrays of indices

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Refer to the User Guide for the various cross-validation strategies that can be used here.

    Note that if cv is not specified, KFold(5) is used and the training set is not shuffled before it is split. This can be dangerous if the training set has an internal ordering, for example when structures with similar composition are close to each other in the feature rows. In that case, use RepeatedKFold as cv instead.

  • verbose (int, default=0) – Controls the verbosity: the higher, the more messages. - >1 : the computation time for each fold and parameter candidate is displayed; - >2 : the score is also displayed; - >3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.

  • pre_dispatch (int, or str, default='2*n_jobs') –

    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

    • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs

    • An int, giving the exact number of total jobs that are spawned

    • A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

  • error_score ('raise' or numeric, default=np.nan) – Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

  • return_train_score (bool, default=False) – If False, the cv_results_ attribute will not include training scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However, computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.
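
The refit description above allows a callable that receives cv_results_ and returns the index of the selected candidate. A minimal sketch of such a callable follows. The selection criterion (the fastest candidate among those close to the best score), the tolerance, and the toy results dictionary are assumptions for illustration; mean_test_score and mean_fit_time are standard keys of scikit-learn's cv_results_, and whether LineSearchCV handles a callable refit in the same way is not confirmed here.

```python
def fastest_near_best(cv_results, tolerance=0.01):
    """Return the index of the fastest candidate whose mean test score
    is within `tolerance` of the best one (hypothetical criterion)."""
    scores = cv_results["mean_test_score"]
    times = cv_results["mean_fit_time"]
    best = max(scores)
    candidates = [i for i, s in enumerate(scores) if s >= best - tolerance]
    return min(candidates, key=lambda i: times[i])


# Toy stand-in for the cv_results_ dictionary of a fitted search.
toy_results = {
    "mean_test_score": [-0.530, -0.510, -0.515, -0.850],
    "mean_fit_time": [0.9, 1.4, 0.3, 0.2],
}
best_index = fastest_near_best(toy_results)
```

Such a function would be passed as refit=fastest_near_best; as noted above, best_score_ is then not available.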

fit(X, y=None, *, groups=None, **fit_params)#

Run fit with all sets of parameters.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,), default=None) – Target relative to X for classification or regression; None for unsupervised learning.

  • groups (array-like of shape (n_samples,), default=None) – Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold).

  • **fit_params – Parameters passed to the fit method of the estimator. If a fit parameter is an array-like whose length is equal to n_samples, it will be split across CV groups along with X and y. For example, the sample_weight parameter is split because len(sample_weight) == len(X).

Returns:

self – Instance of fitted estimator.

Return type:

LineSearchCV

property classes_#

Class labels.

Only available when refit=True and the estimator is a classifier.

decision_function(X)#

Call decision_function on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports decision_function.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_score – Result of the decision function for X based on the estimator with the best found parameters.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes) or (n_samples, n_classes * (n_classes-1) / 2)

get_metadata_routing()#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

inverse_transform(Xt)#

Call inverse_transform on the estimator with the best found params.

Only available if the underlying estimator implements inverse_transform and refit=True.

Parameters:

Xt (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

X – Result of the inverse_transform function for Xt based on the estimator with the best found parameters.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)

property n_features_in_#

Number of features seen during fit.

Only available when refit=True.

predict(X)#

Call predict on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – The predicted labels or values for X based on the estimator with the best found parameters.

Return type:

ndarray of shape (n_samples,)

predict_log_proba(X)#

Call predict_log_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_log_proba.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – Predicted class log-probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes)

predict_proba(X)#

Call predict_proba on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports predict_proba.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

y_pred – Predicted class probabilities for X based on the estimator with the best found parameters. The order of the classes corresponds to that in the fitted attribute classes_.

Return type:

ndarray of shape (n_samples,) or (n_samples, n_classes)

score(X, y=None)#

Return the score on the given data, if the estimator has been refit.

This uses the score defined by scoring where provided, and the best_estimator_.score method otherwise.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples, n_output) or (n_samples,), default=None) – Target relative to X for classification or regression; None for unsupervised learning.

Returns:

score – The score defined by scoring if provided, and the best_estimator_.score method otherwise.

Return type:

float

score_samples(X)#

Call score_samples on the estimator with the best found parameters.

Only available if refit=True and the underlying estimator supports score_samples.

New in scikit-learn version 0.24.

Parameters:

X (iterable) – Data to predict on. Must fulfill input requirements of the underlying estimator.

Returns:

y_score – The best_estimator_.score_samples method.

Return type:

ndarray of shape (n_samples,)

set_fit_request(*, groups='$UNCHANGED$')#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in scikit-learn version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.

Parameters:
  • groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for groups parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

transform(X)#

Call transform on the estimator with the best found parameters.

Only available if the underlying estimator supports transform and refit=True.

Parameters:

X (indexable, length n_samples) – Must fulfill the input assumptions of the underlying estimator.

Returns:

Xt – X transformed in the new space based on the estimator with the best found parameters.

Return type:

{ndarray, sparse matrix} of shape (n_samples, n_features)