Optimizes the model/estimator, its hyperparameters and the preprocessing
operations to be performed on input and output features. It consists of two
hpo loops. The parent or outer loop optimizes preprocessing/feature engineering,
feature selection and model selection, while the child or inner hpo loop optimizes
the hyperparameters of the model selected by the parent loop.
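The two nested loops can be pictured with the following minimal sketch. It is not autotab's implementation: plain random search stands in for whatever optimization algorithm is configured, and every name in it (`TRANSFORMATIONS`, `MODEL_SPACE`, `toy_val_score`, `child_loop`, `parent_loop`) is hypothetical. The toy objective replaces actual model fitting.

```python
import random

# Hypothetical search spaces; these names are illustrative, not autotab's API.
TRANSFORMATIONS = ["none", "minmax", "zscore"]
MODEL_SPACE = {
    "model_a": {"alpha": (0.0, 1.0)},
    "model_b": {"alpha": (0.0, 2.0)},
}


def toy_val_score(transformation, model, alpha):
    """Stand-in for 'fit the model and return a validation error' (lower is better)."""
    penalty = {"none": 0.3, "minmax": 0.1, "zscore": 0.0}[transformation]
    target = {"model_a": 0.25, "model_b": 1.5}[model]
    return penalty + (alpha - target) ** 2


def child_loop(transformation, model, iterations, rng):
    """Inner loop: tune the hyperparameter of one fixed (transformation, model) pair."""
    low, high = MODEL_SPACE[model]["alpha"]
    best_score, best_alpha = float("inf"), None
    for _ in range(iterations):
        alpha = rng.uniform(low, high)
        score = toy_val_score(transformation, model, alpha)
        if score < best_score:
            best_score, best_alpha = score, alpha
    return best_score, best_alpha


def parent_loop(parent_iterations, child_iterations, seed=0):
    """Outer loop: pick preprocessing and model, then delegate tuning to the child loop."""
    rng = random.Random(seed)
    history = []
    for _ in range(parent_iterations):
        transformation = rng.choice(TRANSFORMATIONS)
        model = rng.choice(list(MODEL_SPACE))
        score, alpha = child_loop(transformation, model, child_iterations, rng)
        history.append({"transformation": transformation, "model": model,
                        "alpha": alpha, "val_score": score})
    return min(history, key=lambda h: h["val_score"]), history


best, history = parent_loop(parent_iterations=20, child_iterations=15)
print(best)
```

Each parent iteration costs one full child loop, which is why the total number of model fits grows roughly as parent_iterations times child_iterations.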
-metrics_
a pandas DataFrame of shape (parent_iterations, len(monitor)) which contains
values of metrics being monitored at each parent iteration.
-val_scores_
a 1d numpy array of length equal to parent_iterations which contains value
of evaluation metric at each parent iteration.
-parent_suggestions_
an ordered dictionary of suggestions to the parent objective function
during parent hpo loop
-child_val_scores_
a numpy array of shape (parent_iterations, child_iterations) containing
value of eval_metric at all child hpo loops
-optimizer_
an instance of ai4water.hyperopt.HyperOpt [1]_ for parent optimization
-models
a list of models being considered for optimization
-model_space
a dictionary which contains parameter space for each model
inputs_to_transform (list/dict, optional, (default=None)) – Input features on which feature engineering/transformation is to
be applied. By default all input features are considered. If you
want to apply a single transformation on a group of input features,
then pass this as a dictionary. This is helpful if the input data
consists of hundreds or thousands of input features. If None (default)
transformations will be applied on all input features. If you don’t
want to apply any transformation on any input feature, pass an empty
list.
input_transformations (list/dict, optional (default=None)) – The transformations to be considered for input features. Default
is None, in which case all the transformations listed below are considered.
If list, then it will be the names of transformations to be considered
for all input features. By default the following transformations are
considered:
- minmax: rescale from 0 to 1
- center: center the data by subtracting mean from it
- scale: scale the data by dividing it with its standard deviation
- zscore: first performs centering and then scaling
- box-cox
- yeo-johnson
- quantile
- quantile_normal
- robust
- log: natural logarithm
- log2: log with base 2
- log10: log with base 10
- sqrt: square root
The user can, however, specify a list of transformations to be considered
for each input feature. In such a case, this argument must be a
dictionary whose keys are names of input features and values are
list of transformations.
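As a rough illustration of what the simpler transformations do, and of the dictionary form of this argument, here is a minimal pure-Python sketch. It does not use the library's own transformation classes. The `TRANSFORMS` mapping, the sample values and the use of the population standard deviation are assumptions made for illustration. The feature names are borrowed from the examples elsewhere on this page.

```python
import math
import statistics


def minmax(values):
    """Rescale to the range 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def center(values):
    """Subtract the mean."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def scale(values):
    """Divide by the standard deviation (population form, an assumption)."""
    std = statistics.pstdev(values)
    return [v / std for v in values]


def zscore(values):
    """Centering followed by scaling."""
    return scale(center(values))


# Hypothetical name-to-function mapping for the closed-form transformations.
TRANSFORMS = {
    "minmax": minmax,
    "center": center,
    "scale": scale,
    "zscore": zscore,
    "log": lambda values: [math.log(v) for v in values],
    "log2": lambda values: [math.log2(v) for v in values],
    "log10": lambda values: [math.log10(v) for v in values],
    "sqrt": lambda values: [math.sqrt(v) for v in values],
}

# Dictionary form: keys are input feature names, values are candidate transformations.
input_transformations = {
    "tide_cm": ["minmax", "zscore"],
    "wat_temp_c": ["log2", "sqrt"],
}

# Made-up sample values, only for illustration.
samples = {"tide_cm": [1.0, 2.0, 4.0], "wat_temp_c": [1.0, 4.0, 16.0]}

# Apply every candidate transformation to its feature.
candidates = {
    (feature, name): TRANSFORMS[name](samples[feature])
    for feature, names in input_transformations.items()
    for name in names
}
for key, values in candidates.items():
    print(key, values)
```

Note that the logarithmic and square-root transformations are only defined for positive (or non-negative) values, and that minmax and scale divide by zero on a constant feature. A real pipeline has to guard against both.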
outputs_to_transform (list, optional) – Output features on which feature engineering/transformation is to
be applied. If None, then transformations on outputs are not applied.
output_transformations (Optional (default=None)) – The transformations to be considered for outputs/targets. The user
can consider any transformation as given for input_transformations.
models (list, optional) – The models/algorithms to consider during optimization. If not given, then all
available models from sklearn, xgboost, catboost and lgbm are
considered. For neural networks, the following 6 model types are
considered by default
parent_iterations (int, optional (default=100)) – Number of iterations for parent optimization loop
child_iterations (int, optional) – Number of iterations for child optimization loop. If set to 0,
the child hpo loop is not run which means the hyperparameters
of the model are not optimized. You can customize the number of hpo
iterations for each model by using the change_child_iteration
method.
parent_algorithm (str, optional) – Optimization algorithm for the parent hpo loop
child_algorithm (str, optional) – Optimization algorithm for the child hpo loop
eval_metric (str, optional) – Validation metric to calculate val_score in objective function.
The parent and child hpo loops optimize/improve this metric. This metric is
calculated on validation data. If cross validation is performed then
this metric is calculated using cross validation.
cv_parent_hpo (bool, optional (default=False)) – Whether to apply cross validation in the parent hpo loop or not.
If True, the parent hpo loop will optimize the cross validation score.
The model is fitted on whole training data (training+validation) after
cross validation and the metrics printed (other than parent_val_metric)
are calculated using the updated model i.e. the one fitted on
whole training (training + validation) data.
cv_child_hpo (bool, optional (default=False)) – Whether to apply cross validation in the child hpo loop or not.
If False, then val_score will be calculated on validation data.
The type of cross validator used is taken from model.config['cross_validator']
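To make the difference between a validation score and a cross validation score concrete, the sketch below averages an error metric over folds. It assumes plain, unshuffled K-fold splitting, mean squared error, and a trivial "predict the training mean" model. The cross validator actually used depends on model.config['cross_validator'], and all function names here are hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size


def mse(true, pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def cross_val_score(y, n_splits):
    """Average validation error of a mean predictor over all folds."""
    fold_scores = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        # "Fit": the toy model simply memorises the mean of the training targets.
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        y_val = [y[i] for i in val_idx]
        fold_scores.append(mse(y_val, [train_mean] * len(y_val)))
    return sum(fold_scores) / len(fold_scores)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up targets
score = cross_val_score(y, n_splits=3)
print(score)
```

With cross validation switched on, every objective evaluation fits the model once per fold, so the run time of the corresponding hpo loop scales with the number of folds.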
monitor (Union[str, list], optional, (default=None)) – Names of performance metrics to monitor in parent hpo loop. If None,
then R2 is monitored for regression and accuracy for classification.
mode (str, optional (default="regression")) – whether this is a regression problem or classification
num_classes (int, optional (default=None)) – number of classes, only relevant if mode=="classification".
category (str, optional (default="DL")) – either "DL" or "ML". If DL, the pipeline is optimized for neural networks.
wandb_config (dict) – The keyword arguments to initiate wandb.init() as a dictionary. It is
only valid if the wandb package is installed. Default value is None,
which means wandb will not be utilized. For the simplest case, pass
a dictionary with project as key.
>>> dict(project="my_project")
The user must, however, log in to wandb beforehand. The behaviour of wandb is controlled
by the autotab.OptimizePipeline.wb_init, autotab.OptimizePipeline.wb_log
and autotab.OptimizePipeline.wb_finish methods respectively.
**model_kwargs – any additional key word arguments for ai4water’s Model
It runs all the models with their default parameters and without
any x and y transformation. These results can be considered as
baseline results and can be compared with optimized model’s results.
The model is trained on ‘training’+’validation’ data.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training
and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second value is y.
This is the data on which the performance of optimized pipeline will be
calculated. This should only be given if data argument is not given.
Returns:
a tuple of two dictionaries.
- a dictionary of val_scores on test data for each model
- a dictionary of metrics being monitored for each model on test data.
Build and Evaluate the best model with respect to metric from config.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data – a tuple/list of length 2 whose first element is x and second value
is y. This is the data on which the performance of optimized
pipeline will be calculated. This should only be given if data
argument is not given.
metric_name (str) – the metric with respect to which the best model is fetched
and then built/evaluated. If not given, the best model is
built/evaluated with respect to evaluation metric.
model_name (str, optional) – If given, the best version of this model will be fetched and built.
The ‘best’ will be decided based upon metric_name
verbosity (int, optional (default=1)) – determines the amount of print information
builds, trains and evaluates best versions of all the models.
The model is trained on ‘training’+’validation’ data.
Parameters:
x – the input data for training. If test_data is not given then test data
is extracted from x based upon the train_fraction argument.
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data – a tuple/list of length 2 whose first element is x and second value
is y. This is the data on which the performance of optimized pipeline
will be calculated. This should only be given if data argument
is not given.
metric_name (str) – the name of metric to determine best version of a model. If not
given, parent_val_metric will be used.
verbosity (int, optional (default=0)) – determines the amount of print information
Builds, Trains and Evaluates the best model with respect to metric from
scratch. The model is trained on ‘training’+’validation’ data. Running
this method will also populate the taylor_plot_data_ dictionary.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data – a tuple/list of length 2 whose first element is x and second value
is y. This is the data on which the performance of optimized
pipeline will be calculated. This should only be given if data
argument is not given.
metric_name (str) – the metric with respect to which the best model is searched
and then built/trained/evaluated. If None, the best model is
chosen based on the evaluation metric.
model_name (str, optional) – If given, the best version of this model will be found and built.
The ‘best’ will be decided based upon metric_name
verbosity (int, optional (default=1)) – determines amount of information to be printed.
Builds, trains and evaluates the model from a specific iteration.
The model is trained on ‘training’+’validation’ data.
Parameters:
iter_num (int) – iteration number from which to choose the model
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data – a tuple/list of length 2 whose first element is x and second
value is y. This is the data on which the performance of optimized
pipeline will be calculated. This should only be given if data
argument is not given.
We may want to change the child hpo iterations for one or more models.
For example we may want to run only 10 iterations for LinearRegression but 40
iterations for XGBRegressor. In such a case we can use this function to
modify child hpo iterations for one or more models. The iterations for all
the remaining models will remain same as defined by the user at the start.
This method updates the _child_iters dictionary.
Parameters:
model (dict) – a dictionary whose keys are names of models and values are number
of iterations for that model during child hpo
Example
>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})
... # If we want to change iterations for more than one model
>>> pl.change_child_iteration(({"XGBRegressor": 30,
...                             "RandomForestRegressor": 20}))
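The effect of such per-model overrides can be sketched as a dictionary merge. This is only an illustration of the idea, not the library's code: `resolve_child_iterations` is a hypothetical helper, and raising an error for unknown model names is an assumption.

```python
def resolve_child_iterations(models, default_iterations, overrides=None):
    """Return {model_name: iterations}, applying per-model overrides to the default."""
    child_iters = {name: default_iterations for name in models}
    for name, iterations in (overrides or {}).items():
        if name not in child_iters:
            # Assumed behaviour: reject models that are not being optimized.
            raise ValueError(f"{name} is not among the models being optimized")
        child_iters[name] = iterations
    return child_iters


models = ["LinearRegression", "XGBRegressor", "RandomForestRegressor"]
child_iters = resolve_child_iterations(
    models,
    default_iterations=25,
    overrides={"LinearRegression": 10, "XGBRegressor": 40},
)
print(child_iters)
```

Models that are not named in the overrides keep the number of iterations defined at the start, which matches the behaviour described above.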
Changes the behavior of a transformation i.e. the way it is applied.
If features is not given, it will modify the behavior of the transformation
for all features. This function modifies the feature_transformations
attribute of the class.
Parameters:
transformation (str) – The name of transformation whose behavior is to be modified.
new_behavior (dict) – keyword arguments which determine the new behavior of the transformation.
These keyword arguments are given to the specified transformation
when it is initialized.
features (str/list, optional (default=None)) – The name or names of features for which the behavior should be modified.
If not given, the changed behavior of transformation will apply to all
input features.
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
...     input_features=input_features,
...     output_features=output_features
... )
>>> pl.change_transformation_behavior('yeo-johnson', {'pre_center': True}, 'wind_dir_deg')
... # we can change the behavior for multiple features as well
>>> pl.change_transformation_behavior('yeo-johnson', {'pre_center': True},
...                                   ['air_p_hpa', 'mslp_hpa'])
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=data)
... # compare models with respect to evaluation metric
>>> pl.compare_models()
... # compare models with respect to r2 and plot comparison using bar_chart
>>> pl.compare_models('r2', "bar_chart")
... # compare models with respect to r2 and get the matplotlib axes for further processing
>>> axes = pl.compare_models('r2', show=False)
Generate Dumbbell plot as comparison of baseline models with
optimized models. Note that this command will train all the considered models,
so this can be expensive.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data –
a tuple/list of length 2 whose first element is x and second value
is y. This is the data on which the performance of optimized pipeline
will be calculated. This should only be given if data argument
is not given.
metric_name (str) – The name of metric with respect to which the models have
to be compared. If not given, the evaluation metric is used.
lower_limit (float/int, optional (default=None)) – clip the values below this value. Set this value to None to avoid
clipping.
upper_limit (float/int, optional (default=None)) – clip the values above this value
figsize (tuple) – If given, plot will be generated of this size.
save – By default True. If False, function will not save the
resultant plot in current working directory.
Returns:
matplotlib axes object which can be used for further processing
Return type:
plt.Axes
Examples
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> total_data = busan_beach()
>>> input_features = total_data.columns.tolist()[0:-1]
>>> output_features = total_data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=total_data)
... # compare models with respect to evaluation metric
>>> pl.dumbbell_plot(data=total_data)
... # compare the models with respect to r2_score
>>> pl.dumbbell_plot(data=total_data, metric_name="r2_score")
... # get the matplotlib axes for further processing
>>> axes = pl.dumbbell_plot(data=total_data, metric_name="r2_score",
...                         lower_limit=0.0, show=False)
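The role of lower_limit and upper_limit in the dumbbell plot can be illustrated with a small clipping helper. This is a sketch under assumptions: `clip` is a hypothetical function, and the scores are made-up numbers, not results from any model.

```python
def clip(values, lower_limit=None, upper_limit=None):
    """Clip values to [lower_limit, upper_limit]; a limit of None disables that side."""
    clipped = []
    for value in values:
        if lower_limit is not None:
            value = max(value, lower_limit)
        if upper_limit is not None:
            value = min(value, upper_limit)
        clipped.append(value)
    return clipped


# Made-up metric values for three hypothetical models, before and after optimization.
baseline = [-3.2, 0.41, 0.77]
optimized = [0.35, 0.62, 0.81]

baseline_clipped = clip(baseline, lower_limit=0.0)
optimized_clipped = clip(optimized, lower_limit=0.0)

# Each dumbbell connects the baseline value to the optimized value of one model.
for start, end in zip(baseline_clipped, optimized_clipped):
    print(f"{start:.2f} -> {end:.2f} (change {end - start:+.2f})")
```

Without clipping, a single very poor baseline score (such as the -3.2 above) would stretch the axis and squeeze all other dumbbells together.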
One of the following combinations of arguments must be given:
- only x,y should be given (validation data will be taken from x and y based upon the val_fraction argument)
- or x,y and validation_data should be given
- or only data should be given (training and validation data will be taken from data based upon the train_fraction and val_fraction arguments)
Every other combination of x,y, data and validation_data will raise an error.
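The allowed argument combinations can be expressed as a small check. The function below is a hypothetical sketch of such a validation, not the library's actual code. The returned labels and the error messages are invented for illustration.

```python
def check_fit_arguments(x=None, y=None, data=None, validation_data=None):
    """Return which of the three allowed input modes is in use, or raise ValueError."""
    if data is not None:
        # Mode 3: only `data`; everything is split from it.
        if x is not None or y is not None or validation_data is not None:
            raise ValueError("data cannot be combined with x, y or validation_data")
        return "data"
    if x is not None and y is not None:
        # Mode 2 if validation data is supplied, otherwise mode 1.
        return "x,y,validation_data" if validation_data is not None else "x,y"
    raise ValueError("either x and y, or data must be given")


print(check_fit_arguments(x=[[1.0]], y=[2.0]))
print(check_fit_arguments(x=[[1.0]], y=[2.0], validation_data=([[3.0]], [4.0])))
print(check_fit_arguments(data=[[1.0, 2.0]]))
```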
Note
If test_data is not to be extracted/separated from x,y/data then you must set
train_fraction to 1.0. Please check
this tutorial
for more on data splitting.
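How train_fraction and val_fraction interact can be sketched as follows. The sketch assumes an unshuffled, sequential split in which val_fraction is taken from the training portion that remains after the test portion has been set aside. The default values shown are also assumptions. The tutorial referenced above remains the authoritative description, and `split_indices` is a hypothetical helper.

```python
def split_indices(n_samples, train_fraction=0.7, val_fraction=0.2):
    """Split sample indices into training, validation and test portions."""
    # Everything that is not test data (training + validation).
    n_train_total = round(n_samples * train_fraction)
    # Validation data is carved out of the training portion (assumed convention).
    n_val = round(n_train_total * val_fraction)
    n_train = n_train_total - n_val
    indices = list(range(n_samples))
    return {
        "training": indices[:n_train],
        "validation": indices[n_train:n_train_total],
        "test": indices[n_train_total:],
    }


with_test = split_indices(100, train_fraction=0.7, val_fraction=0.2)
without_test = split_indices(100, train_fraction=1.0, val_fraction=0.2)
print({name: len(idx) for name, idx in with_test.items()})
print({name: len(idx) for name, idx in without_test.items()})
```

Setting train_fraction to 1.0 leaves the test portion empty, which is the situation the note above describes.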
Parameters:
x (np.ndarray) – input data for training + validation + test. If your x does not
contain test portion, set train_fraction to 1.0 during
initialization of OptimizePipeline class.
y (np.ndarray) – output/target/label for training data. It must be of the same length as x.
data – A pandas dataframe which contains input (x) and output (y) features
Only required if x and y are not given. The training and validation
data will be extracted from this data.
validation_data (tuple) – validation data on which pipeline is optimized. Only required if data
is not given.
previous_results (dict, optional (default=None)) – path of file which contains xy values.
process_results (bool, optional (default=True)) – Whether to perform postprocessing of optimization results or not.
callbacks (list, optional (default=None)) – list of callbacks to run
finish_wb (bool) – if set to True, then wandb.finish is called at the end.
If set to False, then the user will have to manually call the autotab._main.OptimizePipeline.wb_finish
method later.
Returns:
an instance of ai4water.hyperopt.HyperOpt class which is used for optimization.
returns the best pipeline with respect to a particular model and
performance metric. The metric must be recorded i.e. must be given as
monitor argument.
Parameters:
model_name (str) – The name of model for which best pipeline is to be found. The best
is defined by metric_name.
metric_name (str, optional) – The name of metric with respect to which the best model is to
be retrieved. If not given, the best model is defined by the
evaluation metric.
Returns:
a tuple of length two
- first value is a float which represents the value of metric
- second value is a dictionary of pipeline with four keys
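The lookup described here amounts to filtering the recorded iterations by model and picking the best value of one monitored metric. The sketch below illustrates that idea only. The `records` structure, the contents of the pipeline dictionaries and the `greater_is_better` flag are all hypothetical and do not reflect how the class stores its results.

```python
def best_pipeline_by_model(records, model_name, metric_name, greater_is_better=True):
    """Return (metric value, pipeline dict) of the best recorded iteration of a model."""
    candidates = [
        record for record in records
        if record["model"] == model_name and metric_name in record["metrics"]
    ]
    if not candidates:
        # The metric must have been monitored, and the model must have been tried.
        raise ValueError(f"no recorded {metric_name} for {model_name}")
    pick = max if greater_is_better else min
    best = pick(candidates, key=lambda record: record["metrics"][metric_name])
    return best["metrics"][metric_name], best["pipeline"]


# Made-up records of three parent iterations.
records = [
    {"model": "XGBRegressor", "metrics": {"r2": 0.71, "mse": 4.0},
     "pipeline": {"iteration": 0}},
    {"model": "LinearRegression", "metrics": {"r2": 0.55, "mse": 6.5},
     "pipeline": {"iteration": 1}},
    {"model": "XGBRegressor", "metrics": {"r2": 0.83, "mse": 2.9},
     "pipeline": {"iteration": 2}},
]

value, pipeline = best_pipeline_by_model(records, "XGBRegressor", "r2")
print(value, pipeline)
```

For error-type metrics, where smaller is better, the same helper would be called with greater_is_better=False.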
post processing of results to draw dumbbell plot and taylor plot.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training
and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second value is y.
This is the data on which the performance of optimized pipeline will be
calculated. This should only be given if data argument is not given.
If this is not given then test data is taken either from x,y or from data
based upon data splitting schemes.
show (bool, optional (default=True)) – whether to show the plots or not
Removes one or more transformations from being considered. This function
modifies the feature_transformations attribute of the class.
Parameters:
transformation (str/list) – the name/names of transformation to be removed.
feature (str/list, optional (default=None)) – name of feature for which the transformation should not be considered.
If not given, the transformation will be removed from all the input features.
Return type:
None
Examples
>>> pl = OptimizePipeline(...)
... # remove box-cox transformation altogether
>>> pl.remove_transformation('box-cox')
... # remove multiple transformations
>>> pl.remove_transformation(['yeo-johnson', 'log'])
... # remove a transformation for a certain feature
>>> pl.remove_transformation('log2', 'tide_cm')
... # remove a transformation for more than one feature
>>> pl.remove_transformation('log10', ['tide_cm', 'wat_temp_c'])
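Conceptually, these calls prune a per-feature list of candidate transformations. The sketch below mimics that on a plain dictionary. The assumption that the candidates are stored as a mapping from feature name to a list of transformation names is made for illustration only, and the standalone function is a hypothetical stand-in for the method.

```python
def remove_transformation(feature_transformations, transformation, feature=None):
    """Drop one or more transformations for one, several or all features (in place)."""
    names = [transformation] if isinstance(transformation, str) else list(transformation)
    if feature is None:
        features = list(feature_transformations)  # all features
    elif isinstance(feature, str):
        features = [feature]
    else:
        features = list(feature)
    for name in features:
        feature_transformations[name] = [
            t for t in feature_transformations[name] if t not in names
        ]


# Assumed structure: feature name -> candidate transformation names.
space = {
    "tide_cm": ["minmax", "box-cox", "log2", "log10"],
    "wat_temp_c": ["minmax", "box-cox", "log2", "log10"],
}

remove_transformation(space, "box-cox")                        # for all features
remove_transformation(space, "log2", "tide_cm")                # for one feature
remove_transformation(space, "log10", ["tide_cm", "wat_temp_c"])
print(space)
```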
saves the results. It is called automatically at the end of optimization.
It saves the tried models and transformations at each step as a json file
with the name parent_suggestions.json.
An errors.csv file is saved which contains validation performance of
the models at each optimization iteration with respect to all metrics
being monitored.
The performance of each model during child optimization iteration is saved
as a csv file with the name child_val_scores.csv.
The global seeds for parent and child iterations are also saved in csv
files with name parent_seeds.csv and child_seeds.csv.
All of these results are saved in pl.path folder.
makes Taylor’s plot using the best version of each model.
The number of models in taylor plot will be equal to the number
of models which have been considered by the model.
Parameters:
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both
training and test will be prepared. It is only required if x, y
are not provided.
test_data (tuple) – a tuple/list of length 2 whose first element is x and second value
is y. This is the data on which the performance of optimized pipeline
will be calculated. This should only be given if data argument
is not given.
plot_bias (bool, optional) – whether to plot the bias or not
figsize (tuple, optional) – a tuple determining figure size
show (bool, optional) – whether to show the plot or not
save (bool, optional) – whether to save the plot or not
verbosity (int, optional (default=0)) – determines the amount of print information
**kwargs – any additional keyword arguments for taylor_plot function of
easy_mpl.
Returns:
matplotlib Figure object which can be used for further processing
Return type:
matplotlib.pyplot.Figure
Examples
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> total_data = busan_beach()
>>> input_features = total_data.columns.tolist()[0:-1]
>>> output_features = total_data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=total_data)
... # compare models with respect to evaluation metric
>>> pl.taylor_plot(data=total_data)
... # compare the models by also plotting bias value
>>> pl.taylor_plot(data=total_data, plot_bias=True)
... # get the matplotlib Figure object for further processing
>>> fig = pl.taylor_plot(data=total_data, show=False)
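A Taylor diagram summarizes each model by the standard deviation of its predictions, their correlation with the observations and the centered root-mean-square difference, with the bias optionally shown in addition. The sketch below computes these quantities in plain Python for made-up numbers. It is independent of the easy_mpl plotting function, and `taylor_statistics` is a hypothetical helper.

```python
import math


def taylor_statistics(reference, prediction):
    """Return the summary statistics that position one model on a Taylor diagram."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(prediction) / n
    # Population standard deviations of observations and predictions.
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in reference) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in prediction) / n)
    covariance = sum((r - mean_r) * (p - mean_p)
                     for r, p in zip(reference, prediction)) / n
    correlation = covariance / (std_r * std_p)
    # Centered RMS difference: the means are removed before comparing.
    crmsd = math.sqrt(sum(((p - mean_p) - (r - mean_r)) ** 2
                          for r, p in zip(reference, prediction)) / n)
    return {"std_ref": std_r, "std_pred": std_p, "correlation": correlation,
            "crmsd": crmsd, "bias": mean_p - mean_r}


# Made-up observations and predictions, only for illustration.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.1]
stats = taylor_statistics(observed, predicted)
print(stats)
```

The three plotted quantities are tied together by the law of cosines, crmsd² = std_ref² + std_pred² − 2·std_ref·std_pred·correlation, which is what allows them to be drawn on a single two-dimensional diagram.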
Similarly we can also update the model space for a deep learning model as below
>>> from ai4water.hyperopt import Integer, Categorical
>>> pl = OptimizePipeline(input_features=["tide_cm"], output_features="tetx_coppml",
...                       category="DL")
>>> pl.update_model_space({"MLP": {
...     "units": Integer(low=8, high=128, prior='uniform', transform='identity', name='units'),
...     "activation": Categorical(["relu", "elu", "tanh", "sigmoid"], name="activation"),
...     "num_layers": Integer(low=1, high=5, name="num_layers")
... }})
we can confirm it by printing the model space
>>> pl.model_space['MLP']