OptimizePipeline
- class autotab.OptimizePipeline(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]
Optimizes the model/estimator, its hyperparameters and the preprocessing operations to be performed on input and output features. It consists of two hpo loops. The parent or outer loop optimizes preprocessing/feature engineering, feature selection and model selection, while the child hpo loop optimizes the hyperparameters of the model.
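The two-loop structure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses random search instead of the 'bayes' algorithm, and the search spaces, `toy_score`, `child_loop` and `parent_loop` are hypothetical names invented here, not part of autotab.

```python
import random

# Hypothetical search spaces, made up for this illustration.
TRANSFORMATIONS = ["none", "minmax", "zscore"]
MODELS = {"model_a": [0.1, 0.5, 1.0], "model_b": [1, 2, 3]}


def toy_score(transformation, model, hyperparameter):
    """Stand-in for 'fit the model and return a validation error' (lower is better)."""
    penalty = {"none": 0.3, "minmax": 0.1, "zscore": 0.0}[transformation]
    model_penalty = 0.05 if model == "model_b" else 0.0
    return penalty + model_penalty + abs(float(hyperparameter) - 1.0)


def child_loop(transformation, model, child_iterations, rng):
    """Inner loop: tune the hyperparameter of one already chosen model."""
    best = float("inf")
    for _ in range(child_iterations):
        hyperparameter = rng.choice(MODELS[model])
        best = min(best, toy_score(transformation, model, hyperparameter))
    return best


def parent_loop(parent_iterations, child_iterations, seed=0):
    """Outer loop: pick a transformation and a model, then run the child loop."""
    rng = random.Random(seed)
    val_scores, suggestions = [], []
    for _ in range(parent_iterations):
        suggestion = {"transformation": rng.choice(TRANSFORMATIONS),
                      "model": rng.choice(list(MODELS))}
        score = child_loop(suggestion["transformation"], suggestion["model"],
                           child_iterations, rng)
        suggestions.append(suggestion)
        val_scores.append(score)
    best_iter = min(range(parent_iterations), key=val_scores.__getitem__)
    return val_scores, suggestions, best_iter


val_scores, suggestions, best_iter = parent_loop(parent_iterations=20,
                                                 child_iterations=5)
```

Here `val_scores` and `suggestions` play roles loosely analogous to the val_scores_ and parent_suggestions_ attributes described below: one entry per parent iteration.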
- - metrics_
a pandas DataFrame of shape (parent_iterations, len(monitor)) which contains values of metrics being monitored at each parent iteration.
- - val_scores_
a 1d numpy array of length equal to parent_iterations which contains value of evaluation metric at each parent iteration.
- - parent_suggestions_
an ordered dictionary of suggestions to the parent objective function during parent hpo loop
- - child_val_scores_
a numpy array of shape (parent_iterations, child_iterations) containing value of eval_metric at all child hpo loops
- - optimizer_
an instance of ai4water.hyperopt.HyperOpt [1]_ for parent optimization
- - models
a list of models being considered for optimization
- - model_space
a dictionary which contains parameter space for each model
Example
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features,
...                       inputs_to_transform=input_features)
>>> results = pl.fit(data=data)
Note
This optimization always solves a minimization problem even if the val_metric is $R^2$.
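One common way to treat a higher-is-better metric such as $R^2$ as a minimization problem is to minimize $1 - R^2$. The sketch below assumes that conversion; whether autotab uses exactly this form is not stated here, and `r2_score` and `to_minimization` are hypothetical helpers written for the illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def to_minimization(metric_value, higher_is_better):
    """Return a quantity where lower is better (assumed conversion: 1 - metric)."""
    return 1.0 - metric_value if higher_is_better else metric_value


y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the targets
poor_pred = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean

good_objective = to_minimization(r2_score(y_true, good_pred), higher_is_better=True)
poor_objective = to_minimization(r2_score(y_true, poor_pred), higher_is_better=True)
```

The better prediction gets the smaller objective value, so an optimizer that only minimizes still prefers it.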
- __init__(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]
initializes the class
- Parameters
inputs_to_transform (list) – Input features on which feature engineering/transformation is to be applied. By default all input features are considered. If you want to apply a single transformation on a group of input features, then pass this as a dictionary. This is helpful if the input data consists of hundreds or thousands of input features.
input_transformations (list, dict) –
The transformations to be considered for input features. Default is None, in which case all input features are considered.
If list, then it will be the names of transformations to be considered for all input features. By default, the following transformations are considered
minmax – rescale from 0 to 1
center – center the data by subtracting mean from it
scale – scale the data by dividing it by its standard deviation
zscore – first performs centering and then scaling
box-cox
yeo-johnson
quantile
robust
log
log2
log10
sqrt – square root
The user can, however, specify a list of transformations to be considered for each input feature. In such a case, this argument must be a dictionary whose keys are names of input features and whose values are lists of transformations.
outputs_to_transform (list, optional) – Output features on which feature engineering/transformation is to be applied. If None, then transformations on outputs are not applied.
output_transformations – The transformations to be considered for outputs/targets. The user can consider any transformation as given for input_transformations.
models (list, optional) –
The models/algorithms to consider during optimization. If not given, then all available models from sklearn, xgboost, catboost and lgbm are considered. For neural networks, 6 model types are considered by default (see References). However, in such cases, the category must be DL.
parent_iterations (int, optional (default=100)) – Number of iterations for parent optimization loop
child_iterations (int, optional) – Number of iterations for child optimization loop. If set to 0, the child hpo loop is not run, which means the hyperparameters of the model are not optimized. You can customize iterations for each model by making use of the change_child_iteration method.
parent_algorithm (str, optional) – Algorithm for optimization of parent optimization
child_algorithm (str, optional) – Algorithm for optimization of child optimization
eval_metric (str, optional) – Validation metric to calculate val_score in objective function. The parent and child hpo loops optimize/improve this metric. This metric is calculated on validation data. If cross validation is performed then this metric is calculated using cross validation.
cv_parent_hpo (bool, optional (default=False)) – Whether to apply cross validation in the parent hpo loop. If given, the parent hpo loop will optimize the cross validation score. The model is fitted on whole training data (training+validation) after cross validation and the metrics printed (other than parent_val_metric) are calculated based on the updated model i.e. the one fitted on whole training (training+validation) data.
cv_child_hpo (bool, optional (default=False)) – Whether to apply cross validation in the child hpo loop. If False, then val_score will be calculated on validation data. The type of cross validator used is taken from model.config['cross_validator']
monitor (Union[str, list], optional, (default=None)) – Names of performance metrics to monitor in parent hpo loop. If None, then R2 is monitored for regression and accuracy for classification.
mode (str, optional (default="regression")) – whether this is a regression problem or classification
num_classes (int, optional (default=None)) – number of classes, only relevant if mode=="classification".
category (str, optional (default="ML")) – either "DL" or "ML". If DL, the pipeline is optimized for neural networks.
**model_kwargs – any additional key word arguments for ai4water’s Model
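The first four transformations listed for input_transformations can be written out by hand to make their definitions concrete. This is a minimal sketch, not the library's implementation: it assumes the population standard deviation for scale, and the feature names and data are invented. It also shows the dictionary form, where each input feature gets its own list of candidate transformations.

```python
import statistics


def minmax(values):
    """Rescale to the range 0 to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def center(values):
    """Subtract the mean."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def scale(values):
    """Divide by the standard deviation (assumption: population std)."""
    std = statistics.pstdev(values)
    return [v / std for v in values]


def zscore(values):
    """Centering followed by scaling."""
    return scale(center(values))


TRANSFORMATIONS = {"minmax": minmax, "center": center,
                   "scale": scale, "zscore": zscore}

# Hypothetical feature names and per-feature candidates (dictionary form).
candidates = {"feature_a": ["minmax", "zscore"], "feature_b": ["center"]}
data = {"feature_a": [2.0, 4.0, 6.0, 8.0], "feature_b": [10.0, 20.0, 30.0]}

transformed = {
    (name, method): TRANSFORMATIONS[method](data[name])
    for name, methods in candidates.items()
    for method in methods
}
```

During optimization only one candidate per feature would be tried in a given parent iteration; the sketch simply applies all of them to show the result of each.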
References
- 1
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.MLP
- 2
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNN
- 3
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTM
- 4
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNNLSTM
- 5
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTMAutoEncoder
- 6
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TCN
- 7
https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TFT
- _build_model(model: dict, val_metric: str, x_transformation, y_transformation, prefix: Optional[str], verbosity: int = 0, batch_size: int = 32, lr: float = 0.001) ai4water.main.Model [source]
builds the ai4water Model. When overriding this method, the user must return an instance of ai4water’s Model class.
batch_size – only used when category is “DL”.
lr – only used when category is “DL”.
- add_dl_model(model: Callable, space: Union[list, ai4water.hyperopt._space.Real, ai4water.hyperopt._space.Categorical, ai4water.hyperopt._space.Integer]) None [source]
adds a deep learning model to be considered.
- Parameters
model (callable) – the model to be added
space (list) – the search space of the model
- add_model(model: dict) None [source]
adds a new model which will be considered during optimization.
- Parameters
model (dict) – a dictionary of length 1 whose value should also be a dictionary of parameter space for that model
Example
>>> pl = OptimizePipeline(...)
>>> pl.add_model({"XGBRegressor": {"n_estimators": [100, 200, 300, 400, 500]}})
- baseline_results(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) tuple [source]
Returns default performance of all models.
It runs all the models with their default parameters and without any x and y transformation. These results can be considered as baseline results and can be compared with optimized model’s results. The model is trained on ‘training’+’validation’ data.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
- Returns
a tuple of two dictionaries.
- a dictionary of val_scores on test data for each model
- a dictionary of metrics being monitored for each model on test data.
- Return type
tuple
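The effect of fit_on_all_train_data can be illustrated with a plain list standing in for the rows of a dataset. This is a sketch under assumptions: the sequential split and the sizes are invented, and `split_data` and `training_rows` are hypothetical helpers, not autotab functions.

```python
def split_data(rows, n_train, n_val):
    """Split rows sequentially into training, validation and test parts."""
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


def training_rows(train, val, fit_on_all_train_data=True):
    """Rows used for the final fit, depending on fit_on_all_train_data."""
    return train + val if fit_on_all_train_data else train


rows = list(range(10))                       # ten toy samples
train, val, test = split_data(rows, n_train=6, n_val=2)

all_train = training_rows(train, val, fit_on_all_train_data=True)
only_train = training_rows(train, val, fit_on_all_train_data=False)
```

In both cases the test part is left untouched, so it can still be used to report performance.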
- be_best_model_from_config(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, verbosity=1) ai4water.main.Model [source]
Build and Evaluate the best model with respect to metric from config.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is fetched and then built/evaluated. If not given, the best model is built/evaluated with respect to evaluation metric.
model_name (str, optional) – If given, the best version of this model will be fetched and built. The ‘best’ will be decided based upon metric_name
verbosity (int, optional (default=1)) – determines the amount of print information
- Return type
an instance of trained ai4water Model
- bfe_all_best_models(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 0) None [source]
builds, trains and evaluates best versions of all the models. The model is trained on ‘training’+’validation’ data.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the name of metric to determine best version of a model. If not given, parent_val_metric will be used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=0)) – determines the amount of print information
- Return type
None
- bfe_best_model_from_scratch(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 1) ai4water.main.Model [source]
Builds, Trains and Evaluates the best model with respect to metric from scratch. The model is trained on ‘training’+’validation’ data. Running this method will also populate the taylor_plot_data_ dictionary.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is searched and then built/trained/evaluated. If None, the best model is chosen based on the evaluation metric.
model_name (str, optional) – If given, the best version of this model will be found and built. The ‘best’ will be decided based upon metric_name
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=1)) – determines amount of information to be printed.
- Return type
an instance of trained ai4water Model
- bfe_model_from_scratch(iter_num: int, x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) ai4water.main.Model [source]
Builds, trains and evaluates the model from a specific iteration. The model is trained on ‘training’+’validation’ data.
- Parameters
iter_num (int) – iteration number from which to choose the model
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
- Return type
an instance of trained ai4water Model
- change_child_iteration(model: dict)[source]
We may want to change the child hpo iterations for one or more models. For example, we may want to run only 10 iterations for LinearRegression but 40 iterations for XGBRegressor. In such a case we can use this method to modify child hpo iterations for one or more models. The iterations for all the remaining models will remain the same as defined by the user at the start. This method updates the _child_iters dictionary.
- Parameters
model (dict) – a dictionary whose keys are names of models and values are number of iterations for that model during child hpo
Example
>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})
If we want to change iterations for more than one model
>>> pl.change_child_iteration({"XGBRegressor": 30,
...                            "RandomForestRegressor": 20})
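The bookkeeping described for change_child_iteration amounts to a per-model dictionary of iteration counts that is partially overwritten. The following sketch mimics that behaviour with plain dictionaries; `make_child_iters` is a hypothetical helper, and rejecting unknown model names is an assumption of this sketch, not documented behaviour.

```python
def make_child_iters(models, child_iterations):
    """Give every model the same default number of child iterations."""
    return {name: child_iterations for name in models}


def change_child_iteration(child_iters, changes):
    """Overwrite the iteration count for the given models only."""
    unknown = set(changes) - set(child_iters)
    if unknown:
        # Assumption of this sketch: unknown model names are an error.
        raise KeyError(f"models not being considered: {sorted(unknown)}")
    child_iters.update(changes)
    return child_iters


child_iters = make_child_iters(
    ["LinearRegression", "XGBRegressor", "RandomForestRegressor"],
    child_iterations=25,
)
change_child_iteration(child_iters, {"LinearRegression": 10, "XGBRegressor": 40})
```

Models that are not mentioned in the update, here RandomForestRegressor, keep the default of 25 iterations.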
- compare_models(metric_name: Optional[str] = None, plot_type: str = 'circular', show: bool = False, **kwargs) matplotlib.axes._axes.Axes [source]
Compares all the models with respect to a metric and plots a bar plot.
- Parameters
metric_name (str, optional) – The metric with respect to which to compare the models.
plot_type (str, optional) – if “circular” then easy_mpl.circular_bar_plot is drawn, otherwise a simple bar_plot is drawn.
show (bool, optional) – whether to show the plot or not
**kwargs – keyword arguments for easy_mpl.circular_bar_plot or easy_mpl.bar_chart
- Return type
matplotlib.pyplot.Axes
- config() dict [source]
Returns a dictionary which contains all the information about the class and from which the class can be created.
- Returns
a dictionary with the keys init_paras, runtime_paras and version_info.
- Return type
dict
- dumbbell_plot(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True) matplotlib.axes._axes.Axes [source]
Generates a dumbbell plot comparing baseline models with optimized models. Note that this command will train all the considered models, so this can be expensive.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – The name of metric with respect to which the models have to be compared. If not given, the evaluation metric is used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
figsize (tuple) – If given, plot will be generated of this size.
show (bool) – whether to show the plot or not
save – By default True. If False, function will not save the resultant plot in current working directory.
- Return type
matplotlib Axes
- fit(x: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, data: Optional[pandas.core.frame.DataFrame] = None, validation_data: Optional[Tuple[numpy.ndarray, numpy.ndarray]] = None, previous_results: Optional[dict] = None, process_results: bool = True) ai4water.hyperopt._main.HyperOpt [source]
Optimizes the pipeline for the given data.
- Parameters
x (np.ndarray) – input training data
y (np.ndarray) – output/target/label data. It must be of the same length as x.
data – A pandas dataframe which contains input (x) and output (y) features. Only required if x and y are not given. The training and validation data will be extracted from this data.
validation_data – validation data on which the pipeline is optimized. Only required if data is not given.
previous_results (dict, optional) – path of file which contains xy values.
process_results (bool) –
- Returns
an instance of ai4water.hyperopt.HyperOpt class which is used for optimization.
- classmethod from_config(config: dict) autotab._main.OptimizePipeline [source]
Builds the class from config dictionary
- Parameters
config (dict) – a dictionary which contains init_paras key.
- Return type
an instance of OptimizePipeline class
- classmethod from_config_file(config_file: str) autotab._main.OptimizePipeline [source]
Builds the class from config file.
- Parameters
config_file (str) – complete path of config file which has .json extension
- Return type
an instance of OptimizePipeline class
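The relationship between config, from_config and from_config_file can be sketched with a toy class that stores its constructor arguments under an init_paras key and writes them to a .json file. `ToyPipeline` is a hypothetical stand-in; the real config also carries runtime and version information, which is omitted here.

```python
import json
import os
import tempfile


class ToyPipeline:
    """Hypothetical stand-in that can be rebuilt from its own config."""

    def __init__(self, parent_iterations=100, child_iterations=25,
                 mode="regression"):
        self.parent_iterations = parent_iterations
        self.child_iterations = child_iterations
        self.mode = mode

    def config(self):
        # Only the constructor arguments are stored in this sketch.
        return {"init_paras": {"parent_iterations": self.parent_iterations,
                               "child_iterations": self.child_iterations,
                               "mode": self.mode}}

    @classmethod
    def from_config(cls, config):
        return cls(**config["init_paras"])

    @classmethod
    def from_config_file(cls, config_file):
        with open(config_file) as fp:
            return cls.from_config(json.load(fp))


original = ToyPipeline(parent_iterations=10, child_iterations=0)

with tempfile.TemporaryDirectory() as folder:
    config_file = os.path.join(folder, "config.json")
    with open(config_file, "w") as fp:
        json.dump(original.config(), fp)
    restored = ToyPipeline.from_config_file(config_file)
```

Because the dictionary holds everything the constructor needs, the restored object is configured identically to the original.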
- get_best_metric(metric_name: str) float [source]
returns the best value of a particular performance metric. The metric must be recorded i.e. must be given as monitor argument.
- Parameters
metric_name (str) – Name of performance metric
- Returns
the best value of performance metric achieved
- Return type
float
- get_best_metric_iteration(metric_name: Optional[str] = None) int [source]
returns iteration of the best value of a particular performance metric.
- Parameters
metric_name (str, optional) – The metric must be recorded i.e. must be given as monitor argument. If not given, then evaluation metric is used.
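Finding the iteration of the best metric value depends on whether a larger or a smaller value is better. The sketch below makes that direction an explicit argument; `best_iteration` is a hypothetical helper, the metric values are invented, and how autotab decides the direction for a given metric is not shown here.

```python
def best_iteration(values, higher_is_better):
    """Index of the best value in a per-iteration list of metric values."""
    chooser = max if higher_is_better else min
    return chooser(range(len(values)), key=values.__getitem__)


# Invented values of two monitored metrics over five parent iterations.
monitored = {
    "r2": [0.61, 0.72, 0.70, 0.85, 0.80],
    "rmse": [4.1, 3.5, 3.6, 2.2, 2.9],
}

best_r2_iter = best_iteration(monitored["r2"], higher_is_better=True)
best_rmse_iter = best_iteration(monitored["rmse"], higher_is_better=False)
best_r2_value = monitored["r2"][best_r2_iter]
```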
- get_best_pipeline_by_metric(metric_name: Optional[str] = None) dict [source]
returns the best pipeline with respect to a particular performance metric.
- Parameters
metric_name (str, optional) – The name of metric whose best value is to be retrieved. The metric must be recorded i.e. must be given as monitor.
- Returns
a dictionary with the following keys
path – path where the model is saved on disk
model – name of model
x_transfromations – transformations for the input data
y_transformations – transformations for the target data
iter_num – iteration number on which this pipeline was achieved
- Return type
dict
- get_best_pipeline_by_model(model_name: str, metric_name: Optional[str] = None) tuple [source]
returns the best pipeline with respect to a particular model and performance metric. The metric must be recorded i.e. must be given as monitor argument.
- Parameters
model_name (str) – The name of model for which best pipeline is to be found. The best is defined by metric_name.
metric_name (str, optional) – The name of metric with respect to which the best model is to be retrieved. If not given, the best model is defined by the evaluation metric.
- Returns
a tuple of length two
- first value is a float which represents the value of metric
- second value is a dictionary of pipeline with the keys x_transformation, y_transformation, model, path and iter_num
- Return type
tuple
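Selecting the best pipeline for one model can be pictured as filtering the per-iteration suggestions by model name and keeping the entry with the best score. The sketch below assumes a lower-is-better score; the records are invented and contain only some of the keys listed above.

```python
def best_pipeline_by_model(suggestions, scores, model_name):
    """Return (best score, pipeline dict) among iterations that used model_name."""
    candidates = [
        (score, iter_num)
        for iter_num, (suggestion, score) in enumerate(zip(suggestions, scores))
        if suggestion["model"] == model_name
    ]
    if not candidates:
        raise ValueError(f"{model_name} was never tried")
    best_score, iter_num = min(candidates)   # assumption: lower is better
    pipeline = dict(suggestions[iter_num], iter_num=iter_num)
    return best_score, pipeline


# Invented per-iteration records.
suggestions = [
    {"model": "XGBRegressor", "x_transformation": "minmax", "y_transformation": None},
    {"model": "RandomForestRegressor", "x_transformation": "zscore", "y_transformation": "log"},
    {"model": "XGBRegressor", "x_transformation": "robust", "y_transformation": "sqrt"},
]
scores = [0.40, 0.25, 0.31]

best_score, pipeline = best_pipeline_by_model(suggestions, scores, "XGBRegressor")
```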
- post_fit(x=None, y=None, data=None, test_data: Optional[Union[list, tuple]] = None, fit_on_all_train_data: bool = True, show: bool = True) None [source]
post processing of results to draw dumbbell plot and Taylor plot.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
show (bool, optional (default=True)) – whether to show the plots or not
- Return type
None
- remove_model(models: Union[str, list]) None [source]
removes a model or models from being considered. The following attributes are updated.
models
model_space
_child_iters
- Parameters
models (list, str) – name or names of model to be removed.
Example
>>> pl = OptimizePipeline(...)
>>> pl.remove_model("ExtraTreeRegressor")
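Removing a model means keeping three containers in sync: the list of models, the model space and the per-model child iterations. The sketch below shows that with a plain dictionary as the state; the `state` layout and the example spaces are assumptions made for the illustration.

```python
def remove_model(state, models):
    """Drop one model (str) or several (list) from all three containers."""
    names = [models] if isinstance(models, str) else list(models)
    for name in names:
        state["models"].remove(name)
        state["model_space"].pop(name)
        state["child_iters"].pop(name)
    return state


# Invented state mirroring models, model_space and _child_iters.
state = {
    "models": ["ExtraTreeRegressor", "XGBRegressor", "LinearRegression"],
    "model_space": {
        "ExtraTreeRegressor": {"max_depth": [3, 5]},
        "XGBRegressor": {"n_estimators": [100, 200]},
        "LinearRegression": {"fit_intercept": [True, False]},
    },
    "child_iters": {"ExtraTreeRegressor": 25, "XGBRegressor": 25,
                    "LinearRegression": 25},
}

remove_model(state, "ExtraTreeRegressor")
remove_model(state, ["LinearRegression"])
```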
- save_results() None [source]
saves the results. It is called automatically at the end of optimization. It saves tried models and transformations at each step as a json file with the name parent_suggestions.json.
An errors.csv file is saved which contains validation performance of the models at each optimization iteration with respect to all metrics being monitored.
The performance of each model during child optimization iterations is saved as a csv file with the name child_val_scores.csv.
The global seeds for parent and child iterations are also saved in csv files with the names parent_seeds.csv and child_seeds.csv. All of these results are saved in the pl.path folder.
- Return type
None
- taylor_plot(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True, plot_bias: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True, verbosity: int = 0, **kwargs) matplotlib.figure.Figure [source]
makes a Taylor plot using the best version of each model. The number of models in the Taylor plot will be equal to the number of models considered during optimization.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but now the model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
plot_bias (bool, optional) – whether to plot the bias or not
figsize (tuple, optional) – a tuple determining figure size
show (bool, optional) – whether to show the plot or not
save (bool, optional) – whether to save the plot or not
verbosity (int, optional (default=0)) – determines the amount of print information
**kwargs – any additional keyword arguments for taylor_plot function of easy_mpl.
- Return type
matplotlib.pyplot.Figure
- update_model_space(space: dict) None [source]
updates or changes the search space of an already existing model
- Parameters
space – a dictionary whose keys are names of models and values are parameter space for that model.
- Return type
None
Example
>>> pl = OptimizePipeline(...)
>>> rf_space = {'max_depth': [5, 10, 15, 20],
...             'n_models': [5, 10, 15, 20]}
>>> pl.update_model_space({"RandomForestRegressor": rf_space})