OptimizePipeline

class autotab.OptimizePipeline(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]

Optimizes the model/estimator, its hyperparameters and the preprocessing operations to be performed on input and output features. It consists of two hpo loops. The parent (outer) loop optimizes preprocessing/feature engineering, feature selection and model selection, while the child hpo loop optimizes the hyperparameters of the model selected by the parent loop.
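The two-loop structure described above can be sketched in plain Python. This is only an illustration, not autotab's implementation: it uses random search in place of the Bayesian optimizer, and the model names, search spaces and the `toy_score` function are hypothetical stand-ins for real training and validation.

```python
import random

# Hypothetical search spaces and transformation names, for illustration only.
MODEL_SPACES = {
    "model_a": {"alpha": [0.1, 1.0, 10.0]},
    "model_b": {"alpha": [0.1, 1.0, 10.0]},
}
TRANSFORMATIONS = ["minmax", "zscore", "log"]


def toy_score(model_name, transformation, params):
    """Made-up validation error (lower is better); stands in for real training."""
    base = {"model_a": 0.5, "model_b": 0.3}[model_name]
    penalty = 0.0 if transformation == "zscore" else 0.1
    return base + penalty + abs(params["alpha"] - 1.0) / 100


def child_loop(model_name, transformation, space, n_iter, rng):
    """Child loop: search the hyperparameters of one already-chosen model."""
    best = float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        best = min(best, toy_score(model_name, transformation, params))
    return best


def parent_loop(parent_iterations, child_iterations, seed=0):
    """Parent loop: pick a model and a transformation, then run the child loop."""
    rng = random.Random(seed)
    history = []
    for _ in range(parent_iterations):
        model_name = rng.choice(sorted(MODEL_SPACES))
        transformation = rng.choice(TRANSFORMATIONS)
        val_score = child_loop(model_name, transformation,
                               MODEL_SPACES[model_name], child_iterations, rng)
        history.append((val_score, model_name, transformation))
    # The best parent iteration is the one with the lowest validation score.
    return min(history, key=lambda item: item[0]), history


best, history = parent_loop(parent_iterations=10, child_iterations=5)
print(best)
```

Each parent iteration costs one full child search, so the total number of model evaluations grows as parent_iterations times child_iterations.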

- metrics_: a pandas DataFrame of shape (parent_iterations, len(monitor)) which contains values of metrics being monitored at each parent iteration.

- val_scores_: a 1d numpy array of length equal to parent_iterations which contains value of evaluation metric at each parent iteration.

- parent_suggestions_: an ordered dictionary of suggestions to the parent objective function during parent hpo loop

- child_val_scores_: a numpy array of shape (parent_iterations, child_iterations) containing value of eval_metric at all child hpo loops

- optimizer_: an instance of ai4water.hyperopt.HyperOpt [1] for parent optimization

- models: a list of models being considered for optimization

- model_space: a dictionary which contains parameter space for each model

Example

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features,
...                       inputs_to_transform=input_features)
>>> results = pl.fit(data=data)

Note

This optimization always solves a minimization problem, even if eval_metric is $R^2$.
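The source does not say how a metric such as $R^2$, where higher is better, is turned into a quantity to minimize. One common approach, shown here purely as an assumption, is to minimize 1 - $R^2$. The helper names below are hypothetical.

```python
def r2_score(true, pred):
    """Coefficient of determination computed from its textbook definition."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


def to_minimization(value, higher_is_better):
    """Assumed conversion: 1 - value for 'higher is better' metrics."""
    return 1.0 - value if higher_is_better else value


true = [1.0, 2.0, 3.0, 4.0]
r2_good = r2_score(true, [1.1, 1.9, 3.2, 3.8])   # close predictions
r2_bad = r2_score(true, [2.5, 2.5, 2.5, 2.5])    # always predicts the mean

loss_good = to_minimization(r2_good, higher_is_better=True)
loss_bad = to_minimization(r2_bad, higher_is_better=True)
print(loss_good, loss_bad)
```

With this convention the better model gets the smaller loss, so a minimizer still prefers it.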

1: https://ai4water.readthedocs.io/en/latest/hpo.html#hyperopt


__init__(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]

initializes the class

Parameters

inputs_to_transform (list) – Input features on which feature engineering/transformation is to be applied. By default all input features are considered. If you want to apply a single transformation on a group of input features, then pass this as a dictionary. This is helpful if the input data consists of hundreds or thousands of input features.
input_transformations (list, dict) –
The transformations to be considered for input features. Default is None, in which case all the transformations listed below are considered.

If list, then it will be the names of transformations to be considered for all input features. By default, the following transformations are considered:
- minmax: rescale from 0 to 1
- center: center the data by subtracting mean from it
- scale: scale the data by dividing it with its standard deviation
- zscore: first performs centering and then scaling
- box-cox
- yeo-johnson
- quantile
- robust
- log
- log2
- log10
- sqrt: square root
The user can, however, specify a list of transformations to be considered for each input feature. In such a case, this argument must be a dictionary whose keys are names of input features and whose values are lists of transformations.
outputs_to_transform (list, optional) – Output features on which feature engineering/transformation is to be applied. If None, then transformations on outputs are not applied.
output_transformations – The transformations to be considered for outputs/targets. The user can consider any transformation as given for input_transformations
models (list, optional) –
The models/algorithms to consider during optimization. If not given, then all available models from sklearn, xgboost, catboost and lgbm are considered. For neural networks, the following model types are considered by default
- MLP [1] multi layer perceptron
- CNN [2] 1D convolution neural network
- LSTM [3] Long short term memory network
- CNNLSTM [4] CNN -> LSTM
- LSTMAutoEncoder [5] LSTM based autoencoder
- TCN [6] Temporal convolution networks
- TFT [7] Temporal fusion Transformer
However, in such cases, the category must be DL.
parent_iterations (int, optional (default=100)) – Number of iterations for parent optimization loop
child_iterations (int, optional) – Number of iterations for child optimization loop. If set to 0, the child hpo loop is not run, which means the hyperparameters of the model are not optimized. You can customize iterations for each model by using the change_child_iteration method.
parent_algorithm (str, optional) – Algorithm for the parent optimization loop
child_algorithm (str, optional) – Algorithm for the child optimization loop
eval_metric (str, optional) – Validation metric to calculate val_score in objective function. The parent and child hpo loops optimize/improve this metric. This metric is calculated on validation data. If cross validation is performed, then this metric is calculated using cross validation.
cv_parent_hpo (bool, optional (default=False)) – Whether to apply cross validation in the parent hpo loop. If True, the parent hpo loop will optimize the cross validation score. After cross validation, the model is fitted on the whole training data (training+validation), and the printed metrics (other than parent_val_metric) are calculated with this refitted model.
cv_child_hpo (bool, optional (default=False)) – Whether to apply cross validation in the child hpo loop. If False, then val_score will be calculated on validation data. The type of cross validator used is taken from model.config['cross_validator'].
monitor (Union[str, list], optional, (default=None)) – Names of performance metrics to monitor in parent hpo loop. If None, then R2 is monitored for regression and accuracy for classification.
mode (str, optional (default="regression")) – whether this is a regression problem or classification
num_classes (int, optional (default=None)) – number of classes, only relevant if mode=="classification".
category (str, optional (default="ML")) – either “DL” or “ML”. If DL, the pipeline is optimized for neural networks.
**model_kwargs – any additional key word arguments for ai4water’s Model
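To make the first four transformations in the list above concrete, here is a minimal standard-library sketch of what their one-line descriptions say. It is not the code used by autotab or ai4water; in particular, using the population standard deviation for scale is an assumption.

```python
from statistics import mean, pstdev


def minmax(values):
    """Rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def scale(values):
    """Divide every value by the standard deviation (population std assumed)."""
    s = pstdev(values)
    return [v / s for v in values]


def zscore(values):
    """Centering followed by scaling."""
    return scale(center(values))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(minmax(data))
print(zscore(data))
```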

References

1: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.MLP
2: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNN
3: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTM
4: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNNLSTM
5: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTMAutoEncoder
6: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TCN
7: https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TFT

_build_model(model: dict, val_metric: str, x_transformation, y_transformation, prefix: Optional[str], verbosity: int = 0, batch_size: int = 32, lr: float = 0.001) → ai4water.main.Model[source]

Builds the ai4water Model. When overriding this method, the user must return an instance of ai4water’s Model class.

Parameters

batch_size – only used when category is “DL”.
lr – only used when category is “DL”.

add_dl_model(model: Callable, space: Union[list, ai4water.hyperopt._space.Real, ai4water.hyperopt._space.Categorical, ai4water.hyperopt._space.Integer]) → None[source]

adds a deep learning model to be considered.

Parameters

model (callable) – the model to be added
space (list) – the search space of the model

add_model(model: dict) → None[source]

adds a new model which will be considered during optimization.

Parameters: model (dict) – a dictionary of length 1 whose key is the name of the model and whose value is a dictionary of parameter space for that model

Example

>>> pl = OptimizePipeline(...)
>>> pl.add_model({"XGBRegressor": {"n_estimators": [100, 200, 300, 400, 500]}})

baseline_results(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) → tuple[source]

Returns default performance of all models.

It runs all the models with their default parameters and without any x and y transformation. These results can be considered as baseline results and can be compared with optimized model’s results. The model is trained on ‘training’+’validation’ data.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.

Returns

a tuple of two dictionaries:
- a dictionary of val_scores on test data for each model
- a dictionary of metrics being monitored for each model on test data

Return type

tuple
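The effect of fit_on_all_train_data can be pictured with a simple three-way split. The split sizes and the helper name below are hypothetical; the sketch only shows which rows would be used for the final fit under each setting.

```python
def rows_for_final_fit(train, validation, fit_on_all_train_data=True):
    """Return the rows used to train the final model."""
    if fit_on_all_train_data:
        # Validation rows were only needed during optimization,
        # so they are merged back into the training rows.
        return list(train) + list(validation)
    return list(train)


rows = list(range(10))
train, validation, test = rows[:6], rows[6:8], rows[8:]   # assumed 60/20/20 split

all_train = rows_for_final_fit(train, validation, fit_on_all_train_data=True)
only_train = rows_for_final_fit(train, validation, fit_on_all_train_data=False)
print(all_train, only_train, test)
```

In both cases the test rows stay untouched and are used only for evaluation.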

be_best_model_from_config(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, verbosity=1) → ai4water.main.Model[source]

Build and Evaluate the best model with respect to metric from config.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is fetched and then built/evaluated. If not given, the best model is built/evaluated with respect to evaluation metric.
model_name (str, optional) – If given, the best version of this model will be fetched and built. The ‘best’ will be decided based upon metric_name
verbosity (int, optional (default=1)) – determines the amount of print information

Return type

an instance of trained ai4water Model

bfe_all_best_models(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 0) → None[source]

builds, trains and evaluates best versions of all the models. The model is trained on ‘training’+’validation’ data.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the name of metric to determine best version of a model. If not given, parent_val_metric will be used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=0)) – determines the amount of print information

Return type

None

bfe_best_model_from_scratch(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 1) → ai4water.main.Model[source]

Builds, trains and evaluates the best model with respect to metric from scratch. The model is trained on ‘training’+’validation’ data. Running this method will also populate the taylor_plot_data_ dictionary.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is searched and then built/trained/evaluated. If None, the best model is chosen based on the evaluation metric.
model_name (str, optional) – If given, the best version of this model will be found and built. The ‘best’ will be decided based upon metric_name
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=1)) – determines amount of information to be printed.

Return type

an instance of trained ai4water Model

bfe_model_from_scratch(iter_num: int, x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) → ai4water.main.Model[source]

Builds, trains and evaluates the model from a specific iteration. The model is trained on ‘training’+’validation’ data.

Parameters

iter_num (int) – iteration number from which to choose the model
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.

Return type

an instance of trained ai4water Model

change_child_iteration(model: dict)[source]

We may want to change the child hpo iterations for one or more models. For example, we may want to run only 10 iterations for LinearRegression but 40 iterations for XGBRegressor. In such a case we can use this function to modify child hpo iterations for one or more models. The iterations for all the remaining models will remain the same as defined by the user at the start. This method updates the _child_iters dictionary.

Parameters: model (dict) – a dictionary whose keys are names of models and values are number of iterations for that model during child hpo
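The per-model override described above amounts to updating a dictionary of default iteration counts. The following is a minimal sketch of that idea, assuming a mapping like _child_iters keyed by model name; the helper function is hypothetical and is not part of autotab.

```python
def child_iterations_per_model(models, default_iterations, overrides):
    """Build a model -> iterations mapping, applying user overrides."""
    unknown = set(overrides) - set(models)
    if unknown:
        raise ValueError(f"unknown models: {sorted(unknown)}")
    # Every model starts with the default given at construction time.
    child_iters = {name: default_iterations for name in models}
    # Only the models named in `overrides` change.
    child_iters.update(overrides)
    return child_iters


models = ["LinearRegression", "XGBRegressor", "RandomForestRegressor"]
child_iters = child_iterations_per_model(
    models,
    default_iterations=25,
    overrides={"LinearRegression": 10, "XGBRegressor": 40},
)
print(child_iters)
```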

Example

>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})

If we want to change iterations for more than one model

>>> pl.change_child_iteration({"XGBRegressor": 30,
...                            "RandomForestRegressor": 20})

compare_models(metric_name: Optional[str] = None, plot_type: str = 'circular', show: bool = False, **kwargs) → matplotlib.axes._axes.Axes[source]

Compares all the models with respect to a metric and plots a bar plot.

Parameters

metric_name (str, optional) – The metric with respect to which to compare the models.
plot_type (str, optional) – if “circular”, then easy_mpl.circular_bar_plot is drawn, otherwise a simple bar_plot is drawn.
show (bool, optional) – whether to show the plot or not
**kwargs – keyword arguments for easy_mpl.circular_bar_plot or easy_mpl.bar_chart

Return type: matplotlib.pyplot.Axes

config() → dict[source]

Returns a dictionary which contains all the information about the class and from which the class can be created.

Returns: a dictionary with the keys init_paras, runtime_paras and version_info.
Return type: dict

dumbbell_plot(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True) → matplotlib.axes._axes.Axes[source]

Generates a dumbbell plot comparing baseline models with optimized models. Note that this command will train all the considered models, so it can be expensive.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
metric_name (str) – The name of metric with respect to which the models have to be compared. If not given, the evaluation metric is used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
figsize (tuple) – If given, plot will be generated of this size.
show (bool) – whether to show the plot or not
save – By default True. If False, function will not save the resultant plot in current working directory.

Return type

matplotlib Axes

fit(x: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, data: Optional[pandas.core.frame.DataFrame] = None, validation_data: Optional[Tuple[numpy.ndarray, numpy.ndarray]] = None, previous_results: Optional[dict] = None, process_results: bool = True) → ai4water.hyperopt._main.HyperOpt[source]

Optimizes the pipeline for the given data.

Parameters

x (np.ndarray) – input training data
y (np.ndarray) – output/target/label data. It must be of the same length as x.
data – A pandas dataframe which contains input (x) and output (y) features. Only required if x and y are not given. The training and validation data will be extracted from this data.
validation_data – validation data on which pipeline is optimized. Only required if data is not given.
previous_results (dict, optional) – path of file which contains xy values.
process_results (bool) –

Returns

an instance of ai4water.hyperopt.HyperOpt class which is used for optimization.

classmethod from_config(config: dict) → autotab._main.OptimizePipeline[source]

Builds the class from config dictionary

Parameters: config (dict) – a dictionary which contains init_paras key.
Return type: an instance of OptimizePipeline class

classmethod from_config_file(config_file: str) → autotab._main.OptimizePipeline[source]

Builds the class from config file.

Parameters: config_file (str) – complete path of config file which has .json extension
Return type: an instance of OptimizePipeline class
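A configuration round trip through a .json file can be sketched with the standard library. The top-level keys (init_paras, runtime_paras, version_info) come from the description of config() above; the values placed under init_paras are taken from the constructor defaults, and the empty runtime_paras and version_info entries are placeholders because their contents are not described here.

```python
import json
import os
import tempfile

# Illustrative config; only the three top-level key names come from the docs.
config = {
    "init_paras": {
        "parent_iterations": 100,
        "child_iterations": 25,
        "mode": "regression",
        "category": "ML",
    },
    "runtime_paras": {},   # placeholder, contents not documented here
    "version_info": {},    # placeholder, contents not documented here
}

with tempfile.TemporaryDirectory() as tmpdir:
    config_file = os.path.join(tmpdir, "config.json")
    with open(config_file, "w") as fp:
        json.dump(config, fp, indent=2)
    with open(config_file) as fp:
        loaded = json.load(fp)

# from_config needs the init_paras key to rebuild the class.
init_paras = loaded["init_paras"]
print(init_paras)
```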

get_best_metric(metric_name: str) → float[source]

returns the best value of a particular performance metric. The metric must be recorded i.e. must be given as monitor argument.

Parameters: metric_name (str) – Name of performance metric
Returns: the best value of performance metric achieved
Return type: float

get_best_metric_iteration(metric_name: Optional[str] = None) → int[source]

returns iteration of the best value of a particular performance metric.

Parameters: metric_name (str, optional) – The metric must be recorded i.e. must be given as monitor argument. If not given, then evaluation metric is used.
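What counts as the "best" value depends on the metric: for error metrics the smallest value is best, for scores such as R2 or accuracy the largest. The sketch below shows that selection together with the iteration at which it occurred. The `HIGHER_IS_BETTER` lookup and the function name are hypothetical and do not reflect how autotab stores this information.

```python
# Hypothetical lookup of metric direction, for illustration only.
HIGHER_IS_BETTER = {"r2": True, "accuracy": True, "rmse": False, "mse": False}


def best_value_and_iteration(values, metric_name):
    """Return (best value, index of the iteration where it was reached)."""
    if HIGHER_IS_BETTER[metric_name]:
        best = max(values)
    else:
        best = min(values)
    # index() returns the first iteration that reached the best value.
    return best, values.index(best)


r2_per_iteration = [0.4, 0.7, 0.6]
rmse_per_iteration = [3.0, 2.0, 1.5]

best_r2 = best_value_and_iteration(r2_per_iteration, "r2")
best_rmse = best_value_and_iteration(rmse_per_iteration, "rmse")
print(best_r2, best_rmse)
```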

get_best_pipeline_by_metric(metric_name: Optional[str] = None) → dict[source]

returns the best pipeline with respect to a particular performance metric.

Parameters

metric_name (str, optional) – The name of metric whose best value is to be retrieved. The metric must be recorded i.e. must be given as monitor.

Returns

a dictionary with the following keys

- path: path where the model is saved on disk
- model: name of model
- x_transformations: transformations for the input data
- y_transformations: transformations for the target data
- iter_num: iteration number on which this pipeline was achieved

Return type

dict

get_best_pipeline_by_model(model_name: str, metric_name: Optional[str] = None) → tuple[source]

returns the best pipeline with respect to a particular model and performance metric. The metric must be recorded i.e. must be given as monitor argument.

Parameters

model_name (str) – The name of model for which best pipeline is to be found. The best is defined by metric_name.
metric_name (str, optional) – The name of metric with respect to which the best model is to be retrieved. If not given, the best model is defined by the evaluation metric.

Returns

a tuple of length two

- first value is a float which represents the value of metric
- second value is a dictionary of pipeline with the following keys: x_transformation, y_transformation, model, path, iter_num

Return type

tuple

post_fit(x=None, y=None, data=None, test_data: Optional[Union[list, tuple]] = None, fit_on_all_train_data: bool = True, show: bool = True) → None[source]

post processing of results to draw dumbbell plot and taylor plot.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
show (bool, optional (default=True)) – whether to show the plots or not

Return type

None

remove_model(models: Union[str, list]) → None[source]

removes a model/models from being considered. The following attributes are updated.

- models
- model_space
- _child_iters

Parameters: models (list, str) – name or names of model to be removed.
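Removing a model has to keep three pieces of state in sync. The sketch below mirrors the attribute names listed above (models, model_space, _child_iters) in a plain dictionary; it is an illustration under that assumption, not autotab's code, and the search spaces shown are made up.

```python
def remove_models(state, names):
    """Drop one or more models from every attribute that refers to them."""
    if isinstance(names, str):
        names = [names]          # accept a single name as well as a list
    for name in names:
        state["models"].remove(name)
        state["model_space"].pop(name)
        state["_child_iters"].pop(name)
    return state


state = {
    "models": ["ExtraTreeRegressor", "XGBRegressor"],
    "model_space": {
        "ExtraTreeRegressor": {"max_depth": [5, 10]},      # made-up space
        "XGBRegressor": {"n_estimators": [100, 200]},      # made-up space
    },
    "_child_iters": {"ExtraTreeRegressor": 25, "XGBRegressor": 25},
}

state = remove_models(state, "ExtraTreeRegressor")
print(state["models"])
```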

Example

>>> pl = OptimizePipeline(...)
>>> pl.remove_model("ExtraTreeRegressor")

report(write: bool = True) → str[source]

makes the report and writes it in text form

save_results() → None[source]

saves the results. It is called automatically at the end of optimization. It saves tried models and transformations at each step as json file with the name parent_suggestions.json.

An errors.csv file is saved which contains validation performance of the models at each optimization iteration with respect to all metrics being monitored.

The performance of each model during child optimization iteration is saved as a csv file with the name child_val_scores.csv.

The global seeds for parent and child iterations are also saved in csv files with name parent_seeds.csv and child_seeds.csv. All of these results are saved in pl.path folder.

Return type: None

taylor_plot(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True, plot_bias: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True, verbosity: int = 0, **kwargs) → matplotlib.figure.Figure[source]

makes Taylor’s plot using the best version of each model. The number of models in the Taylor plot will be equal to the number of models which have been considered by the pipeline.

Parameters

x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This assumes that the data is split into training, validation and test sets. The pipeline was optimized on validation data, but the final model is trained on all available training data, i.e. (training + validation) data. If False, the model is trained only on training data.
plot_bias (bool, optional) – whether to plot the bias or not
figsize (tuple, optional) – a tuple determining figure size
show (bool, optional) – whether to show the plot or not
save (bool, optional) – whether to save the plot or not
verbosity (int, optional (default=0)) – determines the amount of print information
**kwargs – any additional keyword arguments for taylor_plot function of easy_mpl.

Return type

matplotlib.pyplot.Figure

update_model_space(space: dict) → None[source]

updates or changes the search space of an already existing model

Parameters: space – a dictionary whose keys are names of models and values are parameter space for that model.
Return type: None

Example

>>> pl = OptimizePipeline(...)
>>> rf_space = {'max_depth': [5, 10, 15, 20],
...             'n_estimators': [5, 10, 15, 20]}
>>> pl.update_model_space({"RandomForestRegressor": rf_space})