OptimizePipeline

class autotab.OptimizePipeline(input_features, output_features, inputs_to_transform: Optional[Union[list, dict]] = None, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, wandb_config: Optional[dict] = None, **model_kwargs)[source]

Optimizes the model/estimator, its hyperparameters and the preprocessing operations to be performed on input and output features. It consists of two hpo loops. The parent (outer) loop optimizes preprocessing/feature engineering, feature selection and model selection, while the child (inner) hpo loop optimizes the hyperparameters of the model selected by the parent loop.

- metrics_

a pandas DataFrame of shape (parent_iterations, len(monitor)) which contains values of metrics being monitored at each parent iteration.

- val_scores_

a 1d numpy array of length equal to parent_iterations which contains value of evaluation metric at each parent iteration.

- parent_suggestions_

an ordered dictionary of suggestions to the parent objective function during parent hpo loop

- child_val_scores_

a numpy array of shape (parent_iterations, child_iterations) containing value of eval_metric at all child hpo loops

- optimizer_

an instance of ai4water.hyperopt.HyperOpt [1] for parent optimization

- models

a list of models being considered for optimization

- model_space

a dictionary which contains parameter space for each model
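
The two nested loops and the shapes of val_scores_ and child_val_scores_ can be illustrated with a small, self-contained sketch. It is not autotab's implementation: it uses plain random search instead of a Bayesian optimizer, the loss function toy_loss is made up, and reporting the best child score back to the parent loop is an assumption made for illustration.

```python
import random


def toy_loss(model, transformation, hyperparameter):
    """Made-up validation loss; stands in for training and scoring a real model."""
    base = {"linear": 0.5, "tree": 0.3}[model]
    penalty = {"minmax": 0.05, "zscore": 0.0}[transformation]
    return base + penalty + (hyperparameter - 0.4) ** 2


def optimize(parent_iterations=5, child_iterations=3, seed=0):
    rng = random.Random(seed)
    val_scores = []        # one value per parent iteration
    child_val_scores = []  # one row per parent iteration, one column per child iteration
    suggestions = []       # what the parent loop suggested at each iteration
    for _ in range(parent_iterations):
        # parent loop: choose the model and the preprocessing
        model = rng.choice(["linear", "tree"])
        transformation = rng.choice(["minmax", "zscore"])
        # child loop: tune the hyperparameter of the chosen model
        row = [toy_loss(model, transformation, rng.random())
               for _ in range(child_iterations)]
        child_val_scores.append(row)
        # assumption: the best (lowest) child score is what the parent loop sees
        val_scores.append(min(row))
        suggestions.append({"model": model, "transformation": transformation})
    best = min(range(parent_iterations), key=val_scores.__getitem__)
    return val_scores, child_val_scores, suggestions[best]


val_scores, child_val_scores, best_suggestion = optimize()
```

Here val_scores has length parent_iterations and child_val_scores has shape (parent_iterations, child_iterations), mirroring the attributes listed above.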

Example

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features,
...                       inputs_to_transform=input_features)
>>> results = pl.fit(data=data)

Note

This optimization always solves a minimization problem even if the eval_metric is one where higher is better, such as $R^2$.
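
As a rough illustration of what solving a minimization problem means for a metric where higher is better, the sketch below turns $R^2$ into a loss. The conversion 1 - score is an assumption chosen for illustration; the source does not state which conversion autotab applies, and the helper as_minimization_objective is hypothetical.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination computed from its definition."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def as_minimization_objective(score, greater_is_better):
    # Assumption: 1 - score is used for 'greater is better' metrics.
    # Other conversions (for example -score) would serve the same purpose.
    return 1.0 - score if greater_is_better else score


y_true = [1.0, 2.0, 3.0, 4.0]
good_prediction = [1.1, 1.9, 3.2, 3.8]
mean_prediction = [2.5, 2.5, 2.5, 2.5]  # predicting the mean gives R^2 of zero

good_loss = as_minimization_objective(r2_score(y_true, good_prediction), True)
mean_loss = as_minimization_objective(r2_score(y_true, mean_prediction), True)
```

The better prediction has the higher $R^2$ and therefore the lower loss, so a minimizer prefers it.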

__init__(input_features, output_features, inputs_to_transform: Optional[Union[list, dict]] = None, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, wandb_config: Optional[dict] = None, **model_kwargs)[source]

initializes the class

Parameters:
  • input_features (list) – names of input features

  • output_features (str) – names of output features

  • inputs_to_transform (list/dict, optional, (default=None)) – Input features on which feature engineering/transformation is to be applied. If None (default), transformations will be applied on all input features. If you want to apply a single transformation on a group of input features, then pass this as a dictionary. This is helpful if the input data consists of hundreds or thousands of input features. If you don’t want to apply any transformation on any input feature, pass an empty list.

  • input_transformations (list, dict) –

    The transformations to be considered for input features. Default is None, in which case the default transformations listed below are considered for all input features.

    If list, then it will be the names of transformations to be considered for all input features. By default, the following transformations are considered:

    • minmax: rescale from 0 to 1

    • center: center the data by subtracting mean from it

    • scale: scale the data by dividing it with its standard deviation

    • zscore: first performs centering and then scaling

    • box-cox

    • yeo-johnson

    • quantile

    • quantile_normal

    • robust

    • log: natural logarithm

    • log2: log with base 2

    • log10: log with base 10

    • sqrt: square root

    The user can, however, specify a list of transformations to be considered for each input feature. In such a case, this argument must be a dictionary whose keys are names of input features and whose values are lists of transformations.

  • outputs_to_transform (list, optional) – Output features on which feature engineering/transformation is to be applied. If None, then transformations on outputs are not applied.

  • output_transformations (Optional (default=None)) – The transformations to be considered for outputs/targets. Any transformation listed for input_transformations can be used.

  • models (list, optional) –

    The models/algorithms to consider during optimization. If not given, then all available models from sklearn, xgboost, catboost and lgbm are considered. For neural networks, the following model types are considered by default:

    • MLP [1]: multi layer perceptron

    • CNN [2]: 1D convolution neural network

    • LSTM [3]: Long short term memory network

    • CNNLSTM [4]: CNN -> LSTM

    • LSTMAutoEncoder [5]: LSTM based autoencoder

    • TCN [6]: Temporal convolution networks

    • TFT [7]: Temporal fusion Transformer

    However, in such cases, the category must be DL.

  • parent_iterations (int, optional (default=100)) – Number of iterations for parent optimization loop

  • child_iterations (int, optional) – Number of iterations for the child optimization loop. If set to 0, the child hpo loop is not run, which means the hyperparameters of the model are not optimized. You can customize the number of hpo iterations for each model by making use of the change_child_iteration method.

  • parent_algorithm (str, optional) – Optimization algorithm for the parent hpo loop

  • child_algorithm (str, optional) – Optimization algorithm for the child hpo loop

  • eval_metric (str, optional) – Validation metric to calculate val_score in objective function. The parent and child hpo loop optimizes/improves this metric. This metric is calculated on validation data. If cross validation is performed then this metric is calculated using cross validation.

  • cv_parent_hpo (bool, optional (default=False)) – Whether to apply cross validation in the parent hpo loop or not. If given, the parent hpo loop will optimize the cross validation score. The model is fitted on the whole training data (training+validation) after cross validation, and the metrics printed (other than parent_val_metric) are calculated based on the updated model i.e. the one fitted on the whole training (training + validation) data.

  • cv_child_hpo (bool, optional (default=False)) – Whether to apply cross validation in the child hpo loop or not. If False, then val_score will be calculated on validation data. The type of cross validator used is taken from model.config['cross_validator'].

  • monitor (Union[str, list], optional, (default=None)) – Names of performance metrics to monitor in parent hpo loop. If None, then R2 is monitored for regression and accuracy for classification.

  • mode (str, optional (default="regression")) – whether this is a regression problem or classification

  • num_classes (int, optional (default=None)) – number of classes, only relevant if mode=="classification".

  • category (str, optional (default="ML")) – either “DL” or “ML”. If DL, the pipeline is optimized for neural networks.

  • wandb_config (dict) – The keyword arguments for wandb.init() as a dictionary. It is only valid if the wandb package is installed. Default value is None, which means wandb will not be utilized. For the simplest case, pass a dictionary with project as key, e.g. dict(project="my_project"). The user must, however, log in to wandb beforehand. The behaviour of wandb is controlled by the autotab.OptimizePipeline.wb_init, autotab.OptimizePipeline.wb_log and autotab.OptimizePipeline.wb_finish methods respectively.

  • **model_kwargs – any additional key word arguments for ai4water’s Model
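
To make the simpler transformations listed for input_transformations concrete, here is a minimal pure-Python sketch of minmax, center, scale, zscore, log10 and sqrt, together with the dictionary form in which each feature gets its own list of candidates. The function names and the TRANSFORMATIONS lookup are illustrative only, and using the population standard deviation is an assumption; the library's own implementations may differ in such details.

```python
import math
import statistics


def minmax(values):
    """Rescale to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def center(values):
    """Subtract the mean."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def scale(values):
    """Divide by the standard deviation (assumption: population std)."""
    std = statistics.pstdev(values)
    return [v / std for v in values]


def zscore(values):
    """Centering followed by scaling."""
    return scale(center(values))


TRANSFORMATIONS = {
    "minmax": minmax,
    "center": center,
    "scale": scale,
    "zscore": zscore,
    "log10": lambda values: [math.log10(v) for v in values],
    "sqrt": lambda values: [math.sqrt(v) for v in values],
}

# dictionary form: one list of candidate transformations per input feature
candidates = {"tide_cm": ["minmax", "zscore"], "wat_temp_c": ["log10", "sqrt"]}

tide_cm = [2.0, 4.0, 6.0]  # made-up values
transformed = {name: TRANSFORMATIONS[name](tide_cm) for name in candidates["tide_cm"]}
```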

References

add_dl_model(model: Callable, space: Union[list, Real, Categorical, Integer]) None[source]

adds a deep learning model to be considered.

Parameters:
  • model (callable) – the model to be added

  • space (list) – the search space of the model

add_model(model: dict) None[source]

adds a new model which will be considered during optimization.

Parameters:

model (dict) – a dictionary of length 1 whose key is the name of the model and whose value is a dictionary of parameter space for that model

Example

>>> pl = OptimizePipeline(...)
>>> pl.add_model({"XGBRegressor": {"n_estimators": [100, 200, 300, 400, 500]}})

baseline_results(x=None, y=None, data=None, test_data=None) tuple[source]

Returns default performance of all models.

It runs all the models with their default parameters and without any x and y transformation. These results can be considered as baseline results and can be compared with optimized model’s results. The model is trained on ‘training’+’validation’ data.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

Returns:

a tuple of two dictionaries:

  • a dictionary of val_scores on test data for each model

  • a dictionary of metrics being monitored for each model on test data

Return type:

tuple

be_best_model_from_config(x=None, y=None, data=None, test_data: Optional[Union[tuple, list]] = None, metric_name: Optional[str] = None, model_name: Optional[str] = None, verbosity=1) Model[source]

Build and Evaluate the best model with respect to metric from config.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

  • metric_name (str) – the metric with respect to which the best model is fetched and then built/evaluated. If not given, the best model is built/evaluated with respect to evaluation metric.

  • model_name (str, optional) – If given, the best version of this model will be fetched and built. The ‘best’ will be decided based upon metric_name

  • verbosity (int, optional (default=1)) – determines the amount of print information

Return type:

an instance of trained ai4water Model

bfe_all_best_models(x=None, y=None, data=None, test_data: Optional[tuple] = None, metric_name: Optional[str] = None, verbosity: int = 0) DataFrame[source]

builds, trains and evaluates best versions of all the models. The model is trained on ‘training’+’validation’ data.

Parameters:
  • x – the input data for training. If test_data is not given then test data is extracted from x based upon train_fraction arguments.

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

  • metric_name (str) – the name of metric to determine best version of a model. If not given, parent_val_metric will be used.

  • verbosity (int, optional (default=0)) – determines the amount of print information

Return type:

pd.DataFrame

bfe_best_model_from_scratch(x=None, y=None, data=None, test_data: Optional[tuple] = None, metric_name: Optional[str] = None, model_name: Optional[str] = None, verbosity: int = 1) Model[source]

Builds, trains and evaluates the best model with respect to metric from scratch. The model is trained on ‘training’+’validation’ data. Running this method will also populate the taylor_plot_data_ dictionary.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

  • metric_name (str) – the metric with respect to which the best model is searched and then built/trained/evaluated. If None, the best model is chosen based on the evaluation metric.

  • model_name (str, optional) – If given, the best version of this model will be found and built. The ‘best’ will be decided based upon metric_name

  • verbosity (int, optional (default=1)) – determines amount of information to be printed.

Return type:

an instance of trained ai4water Model

bfe_model_from_scratch(iter_num: int, x=None, y=None, data=None, test_data: Optional[Union[tuple, list]] = None) Model[source]

Builds, trains and evaluates the model from a specific iteration. The model is trained on ‘training’+’validation’ data.

Parameters:
  • iter_num (int) – iteration number from which to choose the model

  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

Return type:

an instance of trained ai4water Model

change_batch_size_space(space: list, low=None, high=None)[source]

changes the value of class attribute batch_space. It should be used after pipeline initialization and before calling fit method.

change_child_iteration(model: dict)[source]

We may want to change the child hpo iterations for one or more models. For example, we may want to run only 10 iterations for LinearRegression but 40 iterations for XGBRegressor. In such a case we can use this function to modify child hpo iterations for one or more models. The iterations for all the remaining models will remain the same as defined by the user at the start. This method updates the _child_iters dictionary.

Parameters:

model (dict) – a dictionary whose keys are names of models and values are number of iterations for that model during child hpo
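
The bookkeeping described above can be pictured as a dictionary of per-model iteration counts that starts from the global child_iterations value and is then partially overridden. The sketch below is a guess at the idea, not the library's code: build_child_iters is a hypothetical helper, and raising an error for unknown model names is a design choice made here.

```python
def build_child_iters(models, child_iterations, overrides):
    """Return the number of child hpo iterations for every model."""
    # every model starts with the value given at initialization
    child_iters = {name: child_iterations for name in models}
    unknown = set(overrides) - set(child_iters)
    if unknown:
        # assumption: overriding a model that is not being considered is an error
        raise KeyError(f"models not being considered: {sorted(unknown)}")
    # only the models named in overrides change; the rest keep the default
    child_iters.update(overrides)
    return child_iters


child_iters = build_child_iters(
    models=["LinearRegression", "XGBRegressor", "RandomForestRegressor"],
    child_iterations=25,
    overrides={"LinearRegression": 10, "XGBRegressor": 40},
)
```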

Example

>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})
>>> # If we want to change iterations for more than one model
>>> pl.change_child_iteration({"XGBRegressor": 30,
...                            "RandomForestRegressor": 20})

change_lr_space(space: list, low=None, high=None)[source]

changes the value of class attribute lr_space. It should be used after pipeline initialization and before calling fit method.

change_transformation_behavior(transformation: str, new_behavior: dict, features: Optional[Union[list, str]] = None) None[source]

Changes the behavior of a transformation i.e. the way it is applied. If features is not given, it will modify the behavior of the transformation for all features. This function modifies the feature_transformations attribute of the class.

Parameters:
  • transformation (str) – The name of transformation whose behavior is to be modified.

  • new_behavior (dict) – keyword arguments which determine the new behavior of the transformation. These keyword arguments are given to the specified transformation when it is initialized.

  • features (str/list, optional (default=None)) – The name or names of features for which the behavior should be modified. If not given, the changed behavior of transformation will apply to all input features.

Return type:

None

Example

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features=data.columns.tolist()[0:-1]
>>> output_features=data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
...                    input_features=input_features,
...                    output_features=output_features
...                     )
>>> pl.change_transformation_behavior('yeo-johnson', {'pre_center': True}, 'wind_dir_deg')
>>> # we can change the behavior for multiple features as well
>>> pl.change_transformation_behavior('yeo-johnson', {'pre_center': True},
...                                   ['air_p_hpa',  'mslp_hpa'])

cleanup(dirs_to_exclude: Optional[Union[list, str]] = None) None[source]

removes the folders from path except the ‘results_from_scratch’ and the folders defined by user.

Parameters:

dirs_to_exclude (str, list, optional) – The names of folders inside path which should not be deleted.

Return type:

None

compare_models(metric_name: Optional[str] = None, plot_type: str = 'circular', show: bool = False, **kwargs) Axes[source]

Compares all the models with respect to a metric and plots a bar plot.

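Parameters:
  • metric_name (str, optional) – the metric with respect to which the models are compared. If not given, the evaluation metric is used.

  • plot_type (str, optional (default="circular")) – the type of plot to draw, e.g. "circular" or "bar_chart".

  • show (bool, optional (default=False)) – whether to show the plot or not

Return type: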

matplotlib.pyplot.Axes

Examples

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=data)
>>> # compare models with respect to evaluation metric
>>> pl.compare_models()
>>> # compare models with respect to r2 and plot comparison using bar_chart
>>> pl.compare_models('r2', "bar_chart")
>>> # compare models with respect to r2 and get the matplotlib axes for further processing
>>> axes = pl.compare_models('r2', show=False)

config() dict[source]

Returns a dictionary which contains all the information about the class and from which the class can be created.

Returns:

a dictionary with three keys: init_paras, runtime_paras and version_info.

Return type:

dict

dumbbell_plot(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, lower_limit: Optional[Union[int, float]] = None, upper_limit: Optional[Union[int, float]] = None, figsize: Optional[tuple] = None, show: bool = True, save: bool = True) Axes[source]

Generate Dumbbell plot as comparison of baseline models with optimized models. Note that this command will train all the considered models, so this can be expensive.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

  • metric_name (str) – The name of metric with respect to which the models have to be compared. If not given, the evaluation metric is used.

  • lower_limit (float/int, optional (default=None)) – clip the values below this value. Set this value to None to avoid clipping.

  • upper_limit (float/int, optional (default=None)) – clip the values above this value

  • figsize (tuple) – If given, plot will be generated of this size.

  • show (bool) – whether to show the plot or not

  • save – By default True. If False, function will not save the resultant plot in current working directory.

Returns:

matplotlib axes object which can be used for further processing

Return type:

plt.Axes
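
The effect of lower_limit and upper_limit can be pictured as clipping the metric values before they are drawn, which helps when a baseline model produces an extreme value (for example a strongly negative $R^2$) that would otherwise stretch the axis. This is a sketch of the clipping idea only; clip is a hypothetical helper, the scores are made up, and how the library applies the limits internally is not shown in the source.

```python
def clip(values, lower_limit=None, upper_limit=None):
    """Clip values to [lower_limit, upper_limit]; None disables that bound."""
    clipped = []
    for value in values:
        if lower_limit is not None:
            value = max(value, lower_limit)
        if upper_limit is not None:
            value = min(value, upper_limit)
        clipped.append(value)
    return clipped


baseline_r2 = [-3.2, 0.4, 0.9]  # made-up scores for three models
clipped_r2 = clip(baseline_r2, lower_limit=0.0)
unclipped_r2 = clip(baseline_r2)
```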

Examples

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> total_data = busan_beach()
>>> input_features = total_data.columns.tolist()[0:-1]
>>> output_features = total_data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=total_data)
>>> # compare models with respect to evaluation metric
>>> pl.dumbbell_plot(data=total_data)
>>> # compare the models with respect to r2_score
>>> pl.dumbbell_plot(data=total_data, metric_name="r2_score")
>>> # get the matplotlib axes for further processing
>>> axes = pl.dumbbell_plot(data=total_data, metric_name="r2_score",
...       lower_limit=0.0, show=False)

evaluate_model(model: Model, x=None, y=None, data=None, metric_name: Optional[str] = None) float[source]

Evaluates the ai4water’s Model on the data for the metric.

Parameters:
  • model – an instance of ai4water’s Model class

  • data – raw, unprocessed data from which x,y pairs are made

  • metric_name (str, optional) – name of performance metric. If not given, evaluation metric is used.

  • x – alternative to data. Only required if data is not given.

  • y – only required if x is given

Return type:

float, the evaluation score of model with respect to metric_name

fit(x: Optional[ndarray] = None, y: Optional[ndarray] = None, data: Optional[DataFrame] = None, validation_data: Optional[Tuple[ndarray, ndarray]] = None, previous_results: Optional[dict] = None, process_results: bool = True, callbacks: Optional[Union[Callbacks, List[Callbacks]]] = None, finish_wb: bool = True) HyperOpt[source]

Optimizes the pipeline for the given data. Either

  • only x,y should be given (validation data will be taken from x and y based upon the val_fraction argument)

  • or x,y and validation_data should be given

  • or only data should be given (training and validation data will be taken from data based upon the train_fraction and val_fraction arguments)

every other combination of x,y, data and validation_data will raise error
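
The three accepted combinations can be written down as a small validation routine. The function below is a hypothetical illustration of the rule stated above, not the check that fit performs internally; the returned labels and the error message are invented for the sketch.

```python
def check_fit_args(x=None, y=None, data=None, validation_data=None):
    """Return which of the three accepted input combinations was used."""
    has_xy = x is not None and y is not None
    no_xy = x is None and y is None
    if has_xy and data is None:
        # x,y alone, or x,y together with explicit validation data
        return "x,y and validation_data" if validation_data is not None else "x,y"
    if no_xy and data is not None and validation_data is None:
        return "data"
    raise ValueError(
        "give either (x, y), or (x, y, validation_data), or only data"
    )


case_xy = check_fit_args(x=[[1.0]], y=[2.0])
case_xy_val = check_fit_args(x=[[1.0]], y=[2.0], validation_data=([[3.0]], [4.0]))
case_data = check_fit_args(data={"tide_cm": [1.0], "tetx_coppml": [2.0]})
```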

Note

If test_data is not to be extracted/separated from x,y/data then you must set train_fraction to 1.0. Please check this tutorial for more on data splitting.

Parameters:
  • x (np.ndarray) – input data for training + validation + test. If your x does not contain test portion, set train_fraction to 1.0 during initialization of OptimizePipeline class.

  • y (np.ndarray) – output/target/label for training data. It must be of the same length as x.

  • data – A pandas dataframe which contains input (x) and output (y) features. Only required if x and y are not given. The training and validation data will be extracted from this data.

  • validation_data (tuple) – validation data on which pipeline is optimized. Only required if data is not given.

  • previous_results (dict, optional (default=None)) – path of file which contains xy values.

  • process_results (bool, optional (default=True)) – Whether to perform postprocessing of optimization results or not.

  • callbacks (list, optional (default=None)) – list of callbacks to run

  • finish_wb (bool) – if set to True, then wandb.finish is called at the end. If set to False, then the user will have to manually call the autotab._main.OptimizePipeline.wb_finish method later.

Returns:

an instance of ai4water.hyperopt.HyperOpt class which is used for optimization.

classmethod from_config(config: dict) OptimizePipeline[source]

Builds the class from config dictionary

Parameters:

config (dict) – a dictionary which contains init_paras key.

Returns:

an instance of OptimizePipeline class

Return type:

OptimizePipeline

classmethod from_config_file(config_file: str) OptimizePipeline[source]

Builds the class from config file.

Parameters:

config_file (str) – complete path of config file which has .json extension

Return type:

an instance of OptimizePipeline class
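
The relationship between config, from_config and from_config_file amounts to a round trip through a dictionary and a .json file. The sketch below shows that pattern with a toy class. ToyPipeline is a hypothetical stand-in, the contents of runtime_paras and version_info are placeholders, and nothing here reflects how autotab actually stores its configuration beyond the key names mentioned in this documentation.

```python
import json
import os
import tempfile


class ToyPipeline:
    """Hypothetical stand-in for OptimizePipeline; only the config round trip is modelled."""

    def __init__(self, input_features, output_features, parent_iterations=100):
        self.input_features = input_features
        self.output_features = output_features
        self.parent_iterations = parent_iterations

    def config(self):
        # top-level keys follow the description of config();
        # runtime_paras and version_info are left empty as placeholders
        return {
            "init_paras": {
                "input_features": self.input_features,
                "output_features": self.output_features,
                "parent_iterations": self.parent_iterations,
            },
            "runtime_paras": {},
            "version_info": {},
        }

    @classmethod
    def from_config(cls, config):
        # only init_paras is needed to rebuild the object
        return cls(**config["init_paras"])

    @classmethod
    def from_config_file(cls, config_file):
        with open(config_file) as fp:
            return cls.from_config(json.load(fp))


original = ToyPipeline(["tide_cm", "wat_temp_c"], ["tetx_coppml"], parent_iterations=10)

with tempfile.TemporaryDirectory() as folder:
    config_file = os.path.join(folder, "config.json")
    with open(config_file, "w") as fp:
        json.dump(original.config(), fp)
    restored = ToyPipeline.from_config_file(config_file)
```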

get_best_metric(metric_name: str) float[source]

returns the best value of a particular performance metric. The metric must be recorded i.e. must be given as monitor argument.

Parameters:

metric_name (str) – Name of performance metric

Returns:

the best value of performance metric achieved

Return type:

float

get_best_metric_iteration(metric_name: Optional[str] = None) int[source]

returns iteration of the best value of a particular performance metric.

Parameters:

metric_name (str, optional) – The metric must be recorded i.e. must be given as monitor argument. If not given, then evaluation metric is used.

Returns:

the parent iteration on which metric was obtained.

Return type:

int
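
What get_best_metric and get_best_metric_iteration report can be sketched on a plain list of monitored values, one per parent iteration. Whether the best value is the largest or the smallest depends on the metric. The helper best_metric below is hypothetical, the values are made up, and counting iterations from zero is an assumption.

```python
def best_metric(values, greater_is_better):
    """Return (best value, iteration at which it was obtained)."""
    pick = max if greater_is_better else min
    best_iteration = pick(range(len(values)), key=values.__getitem__)
    return values[best_iteration], best_iteration


# made-up monitored values for three parent iterations
r2_per_iteration = [0.61, 0.74, 0.70]
rmse_per_iteration = [2.5, 2.9, 3.1]

best_r2, best_r2_iteration = best_metric(r2_per_iteration, greater_is_better=True)
best_rmse, best_rmse_iteration = best_metric(rmse_per_iteration, greater_is_better=False)
```

Note that the two metrics need not agree on the best iteration, which is why the metric name is an argument of these methods.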

get_best_pipeline_by_metric(metric_name: Optional[str] = None) dict[source]

returns the best pipeline with respect to a particular performance metric.

Parameters:

metric_name (str, optional) – The name of metric whose best value is to be retrieved. The metric must be recorded i.e. must be given as monitor.

Returns:

a dictionary with following keys

  • path: path where the model is saved on disk

  • model: name of model

  • x_transformations: transformations for the input data

  • y_transformations: transformations for the target data

  • iter_num: iteration number on which this pipeline was achieved

Return type:

dict

get_best_pipeline_by_model(model_name: str, metric_name: Optional[str] = None) tuple[source]

returns the best pipeline with respect to a particular model and performance metric. The metric must be recorded i.e. must be given as monitor argument.

Parameters:
  • model_name (str) – The name of model for which best pipeline is to be found. The best is defined by metric_name.

  • metric_name (str, optional) – The name of metric with respect to which the best model is to be retrieved. If not given, the best model is defined by the evaluation metric.

Returns:

a tuple of length two

  • first value is a float which represents the value of metric

  • second value is a dictionary of pipeline with five keys: x_transformation, y_transformation, model, path and iter_num

Return type:

tuple
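
The shape of the returned tuple can be illustrated by filtering a list of recorded pipelines down to one model and picking the best entry. Everything in this sketch is illustrative: best_pipeline_by_model is a hypothetical helper, and the records, paths and transformation values are placeholders rather than the library's real data structures.

```python
def best_pipeline_by_model(records, model_name, greater_is_better=False):
    """Return (metric value, pipeline dictionary) for the best run of one model."""
    candidates = [r for r in records if r["pipeline"]["model"] == model_name]
    if not candidates:
        raise ValueError(f"{model_name} was not tried in any iteration")
    pick = max if greater_is_better else min
    best = pick(candidates, key=lambda record: record["metric"])
    return best["metric"], best["pipeline"]


def make_pipeline(model, iter_num):
    # placeholder values; only the key names follow the description above
    return {
        "x_transformation": "minmax",
        "y_transformation": "log",
        "model": model,
        "path": f"results/iteration_{iter_num}",
        "iter_num": iter_num,
    }


records = [
    {"metric": 0.42, "pipeline": make_pipeline("XGBRegressor", 0)},
    {"metric": 0.35, "pipeline": make_pipeline("RandomForestRegressor", 1)},
    {"metric": 0.38, "pipeline": make_pipeline("XGBRegressor", 2)},
]

value, pipeline = best_pipeline_by_model(records, "XGBRegressor")
```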

post_fit(x=None, y=None, data=None, test_data: Optional[Union[tuple, list]] = None, show: bool = True) None[source]

post processing of results to draw dumbbell plot and taylor plot.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given. If this is not given, then test data is taken either from x,y or from data based upon data splitting schemes.

  • show (bool, optional (default=True)) – whether to show the plots or not

Return type:

None

remove_model(models: Union[str, list]) None[source]

removes a model/models from being considered. The following attributes are updated.

  • models

  • model_space

  • _child_iters

Parameters:

models (list, str) – name or names of model to be removed.

Example

>>> pl = OptimizePipeline(...)
>>> # If we don't want 'ExtraTreeRegressor' to be considered
>>> pl.remove_model("ExtraTreeRegressor")

remove_transformation(transformation: Union[str, list], feature: Optional[Union[list, str]] = None) None[source]

Removes one or more transformation from being considered. This function modifies the feature_transformations attribute of the class.

Parameters:
  • transformation (str/list) – the name/names of transformation to be removed.

  • feature (str/list, optional (default=None)) – name of feature for which the transformation should not be considered. If not given, the transformation will be removed from all the input features.

Return type:

None

Examples

>>> pl = OptimizePipeline(...)
>>> # remove box-cox transformation altogether
>>> pl.remove_transformation('box-cox')
>>> # remove multiple transformations
>>> pl.remove_transformation(['yeo-johnson', 'log'])
>>> # remove a transformation for a certain feature
>>> pl.remove_transformation('log2', 'tide_cm')
>>> # remove a transformation for more than one feature
>>> pl.remove_transformation('log10', ['tide_cm', 'wat_temp_c'])

report(write: bool = True) str[source]

makes the report and writes it in text form

save_results() None[source]

saves the results. It is called automatically at the end of optimization. It saves the tried models and transformations at each step as a json file with the name parent_suggestions.json.

An errors.csv file is saved which contains validation performance of the models at each optimization iteration with respect to all metrics being monitored.

The performance of each model during child optimization iteration is saved as a csv file with the name child_val_scores.csv.

The global seeds for parent and child iterations are also saved in csv files with name parent_seeds.csv and child_seeds.csv. All of these results are saved in pl.path folder.

Return type:

None

taylor_plot(x=None, y=None, data=None, test_data=None, plot_bias: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True, verbosity: int = 0, **kwargs) Figure[source]

makes Taylor’s plot using the best version of each model. The number of models in the Taylor plot will be equal to the number of models which have been considered by the pipeline.

Parameters:
  • x – the input data for training

  • y – the target data for training

  • data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.

  • test_data (tuple) – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. This should only be given if the data argument is not given.

  • plot_bias (bool, optional) – whether to plot the bias or not

  • figsize (tuple, optional) – a tuple determining figure size

  • show (bool, optional) – whether to show the plot or not

  • save (bool, optional) – whether to save the plot or not

  • verbosity (int, optional (default=0)) – determines the amount of print information

  • **kwargs – any additional keyword arguments for taylor_plot function of easy_mpl.

Returns:

matplotlib Figure object which can be used for further processing

Return type:

matplotlib.pyplot.Figure

Examples

>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> total_data = busan_beach()
>>> input_features = total_data.columns.tolist()[0:-1]
>>> output_features = total_data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features)
>>> results = pl.fit(data=total_data)
>>> # draw the Taylor plot using the best version of each model
>>> pl.taylor_plot(data=total_data)
>>> # compare the models by also plotting bias value
>>> pl.taylor_plot(data=total_data, plot_bias=True)
>>> # get the matplotlib Figure object for further processing
>>> fig = pl.taylor_plot(data=total_data, show=False)

update_model_space(space: dict) None[source]

updates or changes the search space of an already existing model

Parameters:

space (dict) – a dictionary whose keys are names of models and values are parameter space for that model.

Return type:

None

Example

>>> pl = OptimizePipeline(...)
>>> rf_space = {'max_depth': [5, 10, 15, 20],
...             'n_estimators': [5, 10, 15, 20]}
>>> pl.update_model_space({"RandomForestRegressor": rf_space})

Similarly we can also update for a deep learning model as below

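>>> # assumption: Integer and Categorical are importable from ai4water.hyperopt
>>> from ai4water.hyperopt import Integer, Categorical
>>> pl = OptimizePipeline(input_features=["tide_cm"], output_features="tetx_coppml",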
...       category="DL")
>>> pl.update_model_space({"MLP": {
...     "units": Integer(low=8, high=128, prior='uniform', transform='identity', name='units'),
...     "activation": Categorical(["relu", "elu", "tanh", "sigmoid"], name="activation"),
...     "num_layers": Integer(low=1, high=5, name="num_layers")
...         }})

We can confirm it by printing the model space

>>> pl.model_space['MLP']