installation
using pip
The easiest way to install autotab is using pip
pip install autotab
However, if you are interested in optimizing pipelines for deep learning models, you can
install tensorflow as well by using the all option
pip install autotab[all]
For a list of all options see installation options.
using github link
You can install autotab directly from its GitHub repository.
python -m pip install git+https://github.com/Sara-Iftikhar/AutoTab.git
The latest code (possibly with fewer bugs and more features) can instead be installed from the dev
branch
python -m pip install git+https://github.com/Sara-Iftikhar/AutoTab.git@dev
To install the latest (dev) branch with all requirements, use the all
keyword
python -m pip install "AI4Water[all] @ git+https://github.com/Sara-Iftikhar/AutoTab.git@dev"
using setup.py file
Go to the folder where the repository is downloaded and run
python setup.py install
installation options
The all option installs tensorflow 2.7 along with autotab and h5py.
quick start
This page describes pipeline optimization for different problems using different models.
Optimize pipeline for machine learning models (regression)
This covers all scikit-learn models, catboost, lightgbm and xgboost.
>>> from ai4water.datasets import busan_beach
>>> from autotab import OptimizePipeline
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
... inputs_to_transform=input_features,
... outputs_to_transform=output_features,
... models=["LinearRegression",
... "LassoLars",
... "Lasso",
... "RandomForestRegressor",
... "HistGradientBoostingRegressor",
... "CatBoostRegressor",
... "XGBRegressor",
... "LGBMRegressor",
... "GradientBoostingRegressor",
... "ExtraTreeRegressor",
... "ExtraTreesRegressor"
... ],
... parent_iterations=30,
... child_iterations=12,
... parent_algorithm='bayes',
... child_algorithm='bayes',
... eval_metric='mse',
... monitor=['r2', 'nse'],
... input_features=input_features,
... output_features=output_features,
... split_random=True,
... )
>>> pl.fit(data=data)
>>> pl.post_fit(data=data)
machine learning models (classification)
This covers all scikit-learn models, catboost, lightgbm and xgboost.
>>> from ai4water.datasets import MtropicsLaos
>>> from autotab import OptimizePipeline
>>> data = MtropicsLaos().make_classification(lookback_steps=1)
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
... mode="classification",
... eval_metric="accuracy",
... inputs_to_transform=input_features,
... outputs_to_transform=output_features,
... models=["ExtraTreeClassifier",
... "RandomForestClassifier",
... "XGBClassifier",
... "CatBoostClassifier",
... "LGBMClassifier",
... "GradientBoostingClassifier",
... "HistGradientBoostingClassifier",
... "ExtraTreesClassifier",
... "RidgeClassifier",
... "SVC",
... "KNeighborsClassifier",
... ],
... parent_iterations=30,
... child_iterations=12,
... parent_algorithm='bayes',
... child_algorithm='bayes',
... monitor=['accuracy'],
... input_features=input_features,
... output_features=output_features,
... split_random=True,
... )
>>> pl.fit(data=data)
>>> pl.post_fit(data=data)
deep learning models (regression)
This covers MLP, LSTM, CNN, CNNLSTM, TFT, TCN and LSTMAutoEncoder for regression.
Each model can consist of stacks of layers; for example, an MLP can consist of
stacks of Dense layers. The number of layers is also optimized. When using
deep learning models, also set the value of epochs, because the default
value is 14, which is too small for a deep learning model. Also consider
setting values for batch_size and lr.
>>> from ai4water.datasets import busan_beach
>>> from autotab import OptimizePipeline
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
... inputs_to_transform=input_features,
... outputs_to_transform=output_features,
... models=["MLP", "LSTM", "CNN", "CNNLSTM", "TFT", "TCN", "LSTMAutoEncoder"],
... parent_iterations=30,
... child_iterations=12,
... parent_algorithm='bayes',
... child_algorithm='bayes',
... eval_metric='mse',
... monitor=['r2', 'nse'],
... input_features=input_features,
... output_features=output_features,
... split_random=True,
... epochs=100,
... )
>>> pl.fit(data=data)
>>> pl.post_fit(data=data)
deep learning models (classification)
This covers MLP, LSTM, CNN, CNNLSTM, TFT, TCN and LSTMAutoEncoder for classification problems. Each model can consist of stacks of layers; for example, an MLP can consist of stacks of Dense layers. The number of layers is also optimized.
>>> from ai4water.datasets import MtropicsLaos
>>> from autotab import OptimizePipeline
>>> data = MtropicsLaos().make_classification(lookback_steps=1,)
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(
... mode="classification",
... eval_metric="accuracy",
... inputs_to_transform=input_features,
... outputs_to_transform=output_features,
... models=["MLP", "CNN"],
... parent_iterations=30,
... child_iterations=12,
... parent_algorithm='bayes',
... child_algorithm='bayes',
... monitor=['f1_score'],
... input_features=input_features,
... output_features=output_features,
... split_random=True,
... epochs=100,
... )
>>> pl.fit(data=data)
>>> pl.post_fit(data=data)
deep learning models (multi-class classification)
For multi-class classification with neural networks, we must set the
num_classes argument to a value greater than 2.
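For example, a minimal sketch for a 3-class problem; the data here is a placeholder and is assumed to be a DataFrame whose last column is a 3-class target:
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
...     mode="classification",
...     category="DL",
...     num_classes=3,
...     models=["MLP"],
...     eval_metric="accuracy",
...     epochs=100,
... )
>>> pl.fit(data=data)  # data: a DataFrame whose last column is a 3-class target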
Frequently Asked Questions
What is the difference between parent and child iterations/algorithms?
AutoTab operates based upon parent and child optimization iterations. The parent iterations are responsible for optimizing the preprocessing steps and model selection. During each parent iteration, once the preprocessing steps and the model have been suggested by the parent algorithm, the child optimization loop starts. The job of the child optimization loop is to optimize the hyperparameters of the suggested model. The user can specify any of the following algorithms for the parent and child optimization:
bayes
random
grid
bayes_rf
tpe
atpe
cmaes
what splitting scheme is used
By default it is assumed that the data is split into three sets, i.e. training, validation
and test sets. The validation data is only used during pipeline optimization inside the
.fit method, while the test data is only used after optimization. If you have
only two sets, i.e. training and validation, set fit_on_all_train_data
to False during post_fit.
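For example:
>>> pl.fit(data=data)
>>> # train the final model on training data only, since there is no separate test set
>>> pl.post_fit(data=data, fit_on_all_train_data=False)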
Is the pipeline optimized for test data or validation data?
For validation data.
I don’t want to optimize preprocessing step
If you don't want any preprocessing steps, keep the inputs_to_transform and outputs_to_transform arguments equal to None or an empty list. In this way, transformations will be optimized neither for inputs nor for targets, as shown in the example below.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
... inputs_to_transform=[],
... outputs_to_transform=[],
... )
>>> results = pl.fit(data=data)
I don’t want to optimize hyperparameters of the models
If you don't want to optimize the hyperparameters of the models, set child_iterations to zero, as shown in the example below.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(child_iterations=0)
>>> results = pl.fit(data=data)
I don’t want to optimize model selection
If you don't want to optimize model selection, keep the models argument equal to None or an empty list, as shown in the example below.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(models=[])
>>> results = pl.fit(data=data)
I want to optimize pipeline for only one model
You can set the models parameter to the desired model. In this way, the pipeline will be optimized using only one model. For example, in the following code, only AdaBoostRegressor will be used during pipeline optimization.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
...     models=["AdaBoostRegressor"])
>>> results = pl.fit(data=data)
I want to optimize pipeline for only selected models
Give the desired models to the models argument as a list. In this way, the pipeline will be optimized for the selected models only.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
...     models=[
... "GradientBoostingRegressor",
... "HistGradientBoostingRegressor",
... "DecisionTreeRegressor",
... "CatBoostRegressor",
... "ExtraTreeRegressor",
... "ExtraTreesRegressor",
... ])
>>> results = pl.fit(data=data)
Can I use different optimization algorithms for parent and child iterations?
Different optimization algorithms can be set by parent_algorithm and child_algorithm.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
... parent_algorithm="bayes",
... child_algorithm="bayes"
... )
>>> results = pl.fit(data=data)
How to monitor more than one metric
The metrics you want to monitor can be given to the monitor argument as a list. In this example, two metrics, NSE and $R^2$, are being monitored.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(monitor=['r2', 'nse'])
>>> results = pl.fit(data=data)
How to find best/optimized pipeline
There are two functions to get the best pipeline after optimization: get_best_pipeline_by_metric, which returns the optimized pipeline according to a given metric, and get_best_pipeline_by_model, which returns the best pipeline for a given model.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline()
>>> results = pl.fit(data=data)
>>> pl.get_best_pipeline_by_metric(metric_name='nse')
>>> pl.get_best_pipeline_by_model(model_name='RandomForestRegressor')
Find best pipeline with respect to a specific (performance) metric
The get_best_pipeline_by_metric function can be used to get the best pipeline with respect to a specific (performance) metric.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline()
>>> results = pl.fit(data=data)
>>> pl.get_best_pipeline_by_metric(metric_name='nse')
Find best pipeline with respect to a particular model
get_best_pipeline_by_model returns the best pipeline with respect to a particular model and performance metric. The metric must be recorded, i.e. given in the monitor argument.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline()
>>> results = pl.fit(data=data)
>>> pl.get_best_pipeline_by_model(model_name='RandomForestRegressor')
Change search space of a particular model
update_model_space updates or changes the search space of an already existing model.
>>> pl = OptimizePipeline(...)
>>> rf_space = {'max_depth': [5, 10, 15, 20],
...             'n_estimators': [5, 10, 15, 20]}
>>> pl.update_model_space({"RandomForestRegressor": rf_space})
consider only selected transformations
Selected transformations can be given to input_transformations and output_transformations. In this way, only the given transformations will be considered during preprocessing optimization.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
... input_transformations=['minmax', 'log', 'zscore'],
... output_transformations=['quantile', 'box-cox', 'yeo-johnson']
... )
>>> results = pl.fit(data=data)
do not optimize transformations for input data
If you don't want to optimize transformations for input data, keep the inputs_to_transform argument equal to None or an empty list. In this way transformations will not be optimized for input data.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(inputs_to_transform=[])
>>> results = pl.fit(data=data)
change number of optimization iterations of a specific model
The number of optimization iterations for a particular model can be changed with the change_child_iteration function after initializing the OptimizePipeline class. For example, we may want to run only 10 iterations for LinearRegression but 40 iterations for XGBRegressor. In such a case, we can use this function to modify the child hpo iterations for one or more models. The iterations for all the remaining models will remain the same as defined by the user at the start.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})
>>> # If we want to change iterations for more than one model
>>> pl.change_child_iteration({"XGBRegressor": 30,
...                            "RandomForestRegressor": 20})
where are all the results stored
The results are stored in a folder named results in the current working directory. The exact path of the stored results can be checked by printing pl.path.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(...)
>>> print(pl.path)
what if optimization stops in the middle
If optimization stops in the middle due to an error, the results obtained so far can be saved and analyzed using these commands.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(...)
>>> pl.fit(data=data)
>>> # if the above command stops in the middle due to an error
>>> pl.save_results()
>>> pl.post_fit(data=data)
what is config.json file
config.json is a plain text file that stores information about the pipeline, such as its parameters and configuration. The pipeline can be rebuilt using the from_config_file method, as shown below.
>>> from autotab import OptimizePipeline
>>> config_path = "path/to/config.json"
>>> new_pipeline = OptimizePipeline.from_config_file(config_path)
How to include results from previous runs
The path to iterations.json from previous pipeline results has to be given to the fit function in order to include results from previous runs.
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(...)
>>> fpath = "path/to/previous/iterations.json"
>>> results = pl.fit(data=data, previous_results=fpath)
Which versions of the underlying libraries does this package depend on?
Currently AutoTab is strongly coupled with the ML Python framework AI4Water, whose version should be 1.2 or greater. Another dependency is h5py, which does not have any specific version requirement.
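You can check the installed AI4Water version as follows (assuming the package exposes the standard __version__ attribute):
>>> import ai4water
>>> print(ai4water.__version__)  # should be 1.2 or greater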
how to use cross validation during pipeline optimization
By default the pipeline is evaluated on the validation data according to eval_metric.
However, you can choose to perform cross validation in the parent or child iterations, or both.
To perform cross validation at parent iterations, set cv_parent_hpo to True.
Similarly, to perform cross validation at child iterations, set cv_child_hpo to True.
You must pass the cross_validator argument as well, to determine what kind of
cross validation is performed. Consider the following example
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
...     ...
...     cv_parent_hpo=True,
...     cross_validator={"KFold": {"n_splits": 5}},
... )
Instead of KFold, we can also choose LeaveOneOut, ShuffleSplit or TimeSeriesSplit.
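For example, a sketch using TimeSeriesSplit for cross validation in the child hpo loop:
>>> pl = OptimizePipeline(
...     ...
...     cv_child_hpo=True,
...     cross_validator={"TimeSeriesSplit": {"n_splits": 5}},
... )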
how to change search space for batch_size and learning rate
The learning_rate and batch_size search spaces are only active for
deep learning models, i.e. when the category is "DL". The default
search space for learning rate is Real(low=1e-5, high=0.05, num_samples=10, name="lr"),
while for batch_size the default search space is [8, 16, 32, 64].
We can change the default search space using the change_batch_size_space
and change_lr_space methods after class initialization. For example, we can
set a different batch_size search space as below
>>> from autotab import OptimizePipeline
>>> pl = OptimizePipeline(
...     ...
...     category="DL",
... )
>>> pl.change_batch_size_space([32, 64, 128, 256, 512])
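Similarly, a sketch for changing the learning rate search space, assuming change_lr_space accepts a space definition like the default Real space shown above:
>>> from ai4water.hyperopt import Real
>>> # assumed: narrow the learning rate search space
>>> pl.change_lr_space(Real(low=1e-4, high=0.01, num_samples=10, name="lr"))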
OptimizePipeline
- class autotab.OptimizePipeline(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]
optimizes the model/estimator, its hyperparameters and the preprocessing operations to be performed on input and output features. It consists of two hpo loops. The parent or outer loop optimizes preprocessing/feature engineering, feature selection and model selection, while the child hpo loop optimizes the hyperparameters of the selected model.
- metrics_ : a pandas DataFrame of shape (parent_iterations, len(monitor)) which contains the values of the metrics being monitored at each parent iteration.
- val_scores_ : a 1d numpy array of length equal to parent_iterations which contains the value of the evaluation metric at each parent iteration.
- parent_suggestions_ : an ordered dictionary of suggestions to the parent objective function during the parent hpo loop.
- child_val_scores_ : a numpy array of shape (parent_iterations, child_iterations) containing the value of eval_metric at all child hpo loops.
- optimizer_ : an instance of ai4water.hyperopt.HyperOpt [1] for parent optimization.
- models : a list of models being considered for optimization.
- model_space : a dictionary which contains the parameter space for each model.
Example
>>> from autotab import OptimizePipeline
>>> from ai4water.datasets import busan_beach
>>> data = busan_beach()
>>> input_features = data.columns.tolist()[0:-1]
>>> output_features = data.columns.tolist()[-1:]
>>> pl = OptimizePipeline(input_features=input_features,
...                       output_features=output_features,
...                       inputs_to_transform=input_features)
>>> results = pl.fit(data=data)
Note
This optimization always solves a minimization problem even if the val_metric is $R^2$.
- __init__(inputs_to_transform, input_transformations: Optional[Union[list, dict]] = None, outputs_to_transform=None, output_transformations: Optional[list] = None, models: Optional[list] = None, parent_iterations: int = 100, child_iterations: int = 25, parent_algorithm: str = 'bayes', child_algorithm: str = 'bayes', eval_metric: Optional[str] = None, cv_parent_hpo: Optional[bool] = None, cv_child_hpo: Optional[bool] = None, monitor: Optional[Union[list, str]] = None, mode: str = 'regression', num_classes: Optional[int] = None, category: str = 'ML', prefix: Optional[str] = None, **model_kwargs)[source]
initializes the class
- Parameters
inputs_to_transform (list) – Input features on which feature engineering/transformation is to be applied. By default all input features are considered. If you want to apply a single transformation on a group of input features, then pass this as a dictionary. This is helpful if the input data consists of hundreds or thousands of input features.
input_transformations (list, dict) –
The transformations to be considered for input features. Default is None, in which case all of the default transformations listed below are considered. If a list is given, it contains the names of the transformations to be considered for all input features. By default, the following transformations are considered:

- minmax: rescale from 0 to 1
- center: center the data by subtracting its mean
- scale: scale the data by dividing it by its standard deviation
- zscore: first performs centering and then scaling
- box-cox
- yeo-johnson
- quantile
- robust
- log
- log2
- log10
- sqrt: square root

The user can, however, specify the list of transformations to be considered for each input feature individually. In such a case, this argument must be a dictionary whose keys are the names of input features and whose values are lists of transformations.
outputs_to_transform (list, optional) – Output features on which feature engineering/transformation is to be applied. If None, then transformations on outputs are not applied.
output_transformations – The transformations to be considered for outputs/targets. The user can use any of the transformations given for input_transformations.
models (list, optional) –
The models/algorithms to consider during optimization. If not given, then all available models from sklearn, xgboost, catboost and lgbm are considered. For neural networks, the following model types are considered by default: MLP [1], CNN [2], LSTM [3], CNNLSTM [4], LSTMAutoEncoder [5], TCN [6], TFT [7]. However, in such cases, category must be "DL".
parent_iterations (int, optional (default=100)) – Number of iterations for the parent optimization loop
child_iterations (int, optional) – Number of iterations for the child optimization loop. If set to 0, the child hpo loop is not run, which means the hyperparameters of the model are not optimized. You can customize the iterations for each model using the change_child_iteration method.
parent_algorithm (str, optional) – Algorithm for the parent optimization loop
child_algorithm (str, optional) – Algorithm for the child optimization loop
eval_metric (str, optional) – Validation metric to calculate val_score in the objective function. The parent and child hpo loops optimize/improve this metric. This metric is calculated on validation data. If cross validation is performed, then this metric is calculated using cross validation.
cv_parent_hpo (bool, optional (default=False)) – Whether to apply cross validation in the parent hpo loop. If given, the parent hpo loop will optimize the cross validation score. The model is fitted on the whole training data (training+validation) after cross validation, and the metrics printed (other than parent_val_metric) are calculated based on the updated model, i.e. the one fitted on the whole training (training+validation) data.
cv_child_hpo (bool, optional (default=False)) – Whether to apply cross validation in the child hpo loop. If False, then val_score will be calculated on validation data. The type of cross validator used is taken from model.config['cross_validator'].
monitor (Union[str, list], optional (default=None)) – Names of performance metrics to monitor in the parent hpo loop. If None, then R2 is monitored for regression and accuracy for classification.
mode (str, optional (default="regression")) – whether this is a regression problem or a classification problem
num_classes (int, optional (default=None)) – number of classes, only relevant if mode==”classification”.
category (str, optional (default="ML")) – either "DL" or "ML". If "DL", the pipeline is optimized for neural networks.
**model_kwargs – any additional key word arguments for ai4water’s Model
References
[1] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.MLP
[2] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNN
[3] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTM
[4] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.CNNLSTM
[5] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.LSTMAutoEncoder
[6] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TCN
[7] https://ai4water.readthedocs.io/en/latest/models/models.html#ai4water.models.TFT
- _build_model(model: dict, val_metric: str, x_transformation, y_transformation, prefix: Optional[str], verbosity: int = 0, batch_size: int = 32, lr: float = 0.001) ai4water.main.Model [source]
builds the ai4water Model. When overriding this method, the user must return an instance of ai4water's Model class.
batch_size : only used when category is "DL".
lr : only used when category is "DL".
- add_dl_model(model: Callable, space: Union[list, ai4water.hyperopt._space.Real, ai4water.hyperopt._space.Categorical, ai4water.hyperopt._space.Integer]) None [source]
adds a deep learning model to be considered.
- Parameters
model (callable) – the model to be added
space (list) – the search space of the model
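Example
A hypothetical sketch of registering a custom deep learning model; the MLP callable and the units search space here are illustrative assumptions, not part of this documentation:
>>> from ai4water.models import MLP
>>> from ai4water.hyperopt import Integer
>>> pl = OptimizePipeline(..., category="DL")
>>> # assumed: optimize the number of units of the MLP
>>> pl.add_dl_model(MLP, [Integer(low=8, high=128, name="units")])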
- add_model(model: dict) None [source]
adds a new model which will be considered during optimization.
- Parameters
model (dict) – a dictionary of length 1 whose value should also be a dictionary of parameter space for that model
Example
>>> pl = OptimizePipeline(...)
>>> pl.add_model({"XGBRegressor": {"n_estimators": [100, 200, 300, 400, 500]}})
- baseline_results(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) tuple [source]
Returns default performance of all models.
It runs all the models with their default parameters and without any x and y transformations. These results can be considered as baseline results and can be compared with the optimized models' results. The model is trained on 'training'+'validation' data.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
- Returns
a tuple of two dictionaries:
- a dictionary of val_scores on test data for each model
- a dictionary of the monitored metrics on test data for each model
- Return type
tuple
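Example
A usage sketch, assuming the pipeline has already been optimized with fit; it returns the two dictionaries described above:
>>> val_scores, monitored_metrics = pl.baseline_results(data=data)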
- be_best_model_from_config(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, verbosity=1) ai4water.main.Model [source]
Build and Evaluate the best model with respect to metric from config.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is fetched and then built/evaluated. If not given, the best model is built/evaluated with respect to the evaluation metric.
model_name (str, optional) – If given, the best version of this model will be fetched and built. The ‘best’ will be decided based upon metric_name
verbosity (int, optional (default=1)) – determines the amount of print information
- Return type
an instance of trained ai4water Model
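Example
A usage sketch; 'r2' is an assumed metric name and must be among the monitored metrics:
>>> model = pl.be_best_model_from_config(data=data, metric_name='r2')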
- bfe_all_best_models(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 0) None [source]
builds, trains and evaluates the best versions of all the models. Each model is trained on 'training'+'validation' data.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
metric_name (str) – the name of the metric to determine the best version of a model. If not given, parent_val_metric will be used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=0)) – determines the amount of print information
- Return type
None
- bfe_best_model_from_scratch(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, model_name: Optional[str] = None, fit_on_all_train_data: bool = True, verbosity: int = 1) ai4water.main.Model [source]
Builds, trains and evaluates the best model with respect to a metric from scratch. The model is trained on 'training'+'validation' data. Running this method will also populate the taylor_plot_data_ dictionary.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
metric_name (str) – the metric with respect to which the best model is searched and then built/trained/evaluated. If None, the best model is chosen based on the evaluation metric.
model_name (str, optional) – If given, the best version of this model will be found and built. The 'best' will be decided based upon metric_name.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
verbosity (int, optional (default=1)) – determines amount of information to be printed.
- Return type
an instance of trained ai4water Model
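Example
A usage sketch, assuming optimization has finished and 'r2' is being monitored:
>>> model = pl.bfe_best_model_from_scratch(data=data, metric_name='r2')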
- bfe_model_from_scratch(iter_num: int, x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True) ai4water.main.Model [source]
Builds, trains and evaluates the model from a specific iteration. The model is trained on 'training'+'validation' data.
- Parameters
iter_num (int) – iteration number from which to choose the model
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
- Return type
an instance of trained ai4water Model
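Example
A sketch which rebuilds the model from the iteration where the monitored metric was best, using get_best_metric_iteration (documented below):
>>> best_iter = pl.get_best_metric_iteration('r2')
>>> model = pl.bfe_model_from_scratch(iter_num=best_iter, data=data)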
- change_child_iteration(model: dict)[source]
We may want to change the child hpo iterations for one or more models. For example, we may want to run only 10 iterations for LinearRegression but 40 iterations for XGBRegressor. In such a case, we can use this function to modify the child hpo iterations for one or more models. The iterations for all the remaining models will remain the same as defined by the user at the start. This method updates the _child_iters dictionary.
- Parameters
model (dict) – a dictionary whose keys are names of models and values are number of iterations for that model during child hpo
Example
>>> pl = OptimizePipeline(...)
>>> pl.change_child_iteration({"XGBRegressor": 10})
>>> # If we want to change iterations for more than one model
>>> pl.change_child_iteration({"XGBRegressor": 30,
...                            "RandomForestRegressor": 20})
- compare_models(metric_name: Optional[str] = None, plot_type: str = 'circular', show: bool = False, **kwargs) matplotlib.axes._axes.Axes [source]
Compares all the models with respect to a metric and plots a bar plot.
- Parameters
metric_name (str, optional) – The metric with respect to which to compare the models.
plot_type (str, optional) – if "circular" then easy_mpl.circular_bar_plot is drawn, otherwise a simple bar chart is drawn.
show (bool, optional) – whether to show the plot or not
**kwargs – keyword arguments for easy_mpl.circular_bar_plot or easy_mpl.bar_chart
- Return type
matplotlib.pyplot.Axes
- config() dict [source]
Returns a dictionary which contains all the information about the class and from which the class can be created.
- Returns
a dictionary with the keys init_paras, runtime_paras and version_info.
- Return type
dict
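Example
A sketch of a config round trip, using the from_config classmethod documented below:
>>> config = pl.config()
>>> new_pl = OptimizePipeline.from_config(config)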
- dumbbell_plot(x=None, y=None, data=None, test_data=None, metric_name: Optional[str] = None, fit_on_all_train_data: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True) matplotlib.axes._axes.Axes [source]
Generates a dumbbell plot comparing baseline models with optimized models. Note that this command will train all the considered models, so it can be expensive.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
metric_name (str) – The name of the metric with respect to which the models have to be compared. If not given, the evaluation metric is used.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
figsize (tuple) – If given, plot will be generated of this size.
show (bool) – whether to show the plot or not
save (bool, optional (default=True)) – If False, the function will not save the resulting plot in the current working directory.
- Return type
matplotlib Axes
- fit(x: Optional[numpy.ndarray] = None, y: Optional[numpy.ndarray] = None, data: Optional[pandas.core.frame.DataFrame] = None, validation_data: Optional[Tuple[numpy.ndarray, numpy.ndarray]] = None, previous_results: Optional[dict] = None, process_results: bool = True) ai4water.hyperopt._main.HyperOpt [source]
Optimizes the pipeline for the given data.
- Parameters
x (np.ndarray) – input training data
y (np.ndarray) – output/target/label data. It must be of the same length as x.
data – A pandas dataframe which contains the input (x) and output (y) features. Only required if x and y are not given. The training and validation data will be extracted from this data.
validation_data – validation data on which the pipeline is optimized. Only required if data is not given.
previous_results (dict, optional) – path of the file (iterations.json) which contains results from previous runs.
process_results (bool) –
- Returns
an instance of the ai4water.hyperopt.HyperOpt class which is used for optimization.
- classmethod from_config(config: dict) autotab._main.OptimizePipeline [source]
Builds the class from config dictionary
- Parameters
config (dict) – a dictionary which contains init_paras key.
- Return type
an instance of OptimizePipeline class
- classmethod from_config_file(config_file: str) autotab._main.OptimizePipeline [source]
Builds the class from config file.
- Parameters
config_file (str) – complete path of config file which has .json extension
- Return type
an instance of OptimizePipeline class
- get_best_metric(metric_name: str) float [source]
returns the best value of a particular performance metric. The metric must be recorded i.e. must be given as monitor argument.
- Parameters
metric_name (str) – Name of performance metric
- Returns
the best value of the performance metric achieved
- Return type
float
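Example
A usage sketch; 'nse' must have been given in the monitor argument:
>>> best_nse = pl.get_best_metric('nse')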
- get_best_metric_iteration(metric_name: Optional[str] = None) int [source]
returns the iteration at which the best value of a particular performance metric was achieved.
- Parameters
metric_name (str, optional) – The metric must be recorded i.e. must be given as monitor argument. If not given, then evaluation metric is used.
- get_best_pipeline_by_metric(metric_name: Optional[str] = None) dict [source]
returns the best pipeline with respect to a particular performance metric.
- Parameters
metric_name (str, optional) – The name of metric whose best value is to be retrieved. The metric must be recorded i.e. must be given as monitor.
- Returns
a dictionary with the following keys:
- path: path where the model is saved on disk
- model: name of the model
- x_transformations: transformations for the input data
- y_transformations: transformations for the target data
- iter_num: iteration number at which this pipeline was achieved
- Return type
dict
- get_best_pipeline_by_model(model_name: str, metric_name: Optional[str] = None) tuple [source]
returns the best pipeline with respect to a particular model and performance metric. The metric must be recorded i.e. must be given as monitor argument.
- Parameters
model_name (str) – The name of the model for which the best pipeline is to be found. The 'best' is defined by metric_name.
metric_name (str, optional) – The name of the metric with respect to which the best model is to be retrieved. If not given, the best model is defined by the evaluation metric.
- Returns
a tuple of length two:
- the first value is a float which represents the value of the metric
- the second value is a dictionary of the pipeline with the keys x_transformation, y_transformation, model, path and iter_num
- Return type
tuple
- post_fit(x=None, y=None, data=None, test_data: Optional[Union[list, tuple]] = None, fit_on_all_train_data: bool = True, show: bool = True) None [source]
post-processing of the results to draw the dumbbell plot and the Taylor plot.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
show (bool, optional (default=True)) – whether to show the plots or not
- Return type
None
- remove_model(models: Union[str, list]) None [source]
removes a model or models from being considered. The following attributes are updated:
models
model_space
_child_iters
- Parameters
models (list, str) – name or names of model to be removed.
Example
>>> pl = OptimizePipeline(...)
>>> pl.remove_model("ExtraTreeRegressor")
- save_results() None [source]
saves the results. It is called automatically at the end of optimization. It saves the tried models and transformations at each step in a json file named parent_suggestions.json. An errors.csv file is saved which contains the validation performance of the models at each optimization iteration with respect to all the metrics being monitored. The performance of each model during the child optimization iterations is saved in a csv file named child_val_scores.csv. The global seeds for parent and child iterations are also saved in csv files named parent_seeds.csv and child_seeds.csv. All of these results are saved in the pl.path folder.
- Return type
None
- taylor_plot(x=None, y=None, data=None, test_data=None, fit_on_all_train_data: bool = True, plot_bias: bool = True, figsize: Optional[tuple] = None, show: bool = True, save: bool = True, verbosity: int = 0, **kwargs) matplotlib.figure.Figure [source]
makes a Taylor plot using the best version of each model. The number of models in the Taylor plot will be equal to the number of models considered during optimization.
- Parameters
x – the input data for training
y – the target data for training
data – raw unprepared and unprocessed data from which x,y pairs for both training and test will be prepared. It is only required if x, y are not provided.
test_data – a tuple/list of length 2 whose first element is x and second element is y. This is the data on which the performance of the optimized pipeline will be calculated. It should only be given if the data argument is not given.
fit_on_all_train_data (bool, optional (default=True)) – If True, the model is trained on (training+validation) data. This is based on the assumption that the data is split into training, validation and test sets. The optimization of the pipeline was performed on validation data. But now, we are training the model on all available training data, which is the (training+validation) data. If False, the model is trained only on training data.
plot_bias (bool, optional) – whether to plot the bias or not
figsize (tuple, optional) – a tuple determining figure size
show (bool, optional) – whether to show the plot or not
save (bool, optional) – whether to save the plot or not
verbosity (int, optional (default=0)) – determines the amount of print information
**kwargs – any additional keyword arguments for taylor_plot function of easy_mpl.
- Return type
matplotlib.pyplot.Figure
- update_model_space(space: dict) None [source]
updates or changes the search space of an already existing model
- Parameters
space – a dictionary whose keys are names of models and values are parameter space for that model.
- Return type
None
Example
>>> pl = OptimizePipeline(...)
>>> rf_space = {'max_depth': [5, 10, 15, 20],
...             'n_estimators': [5, 10, 15, 20]}
>>> pl.update_model_space({"RandomForestRegressor": rf_space})
Examples
Below is a gallery of examples
regression
from ai4water.datasets import busan_beach
from skopt.plots import plot_objective
from autotab import OptimizePipeline
data = busan_beach()
pl = OptimizePipeline(
inputs_to_transform=data.columns.tolist()[0:-1],
outputs_to_transform=data.columns.tolist()[-1:],
parent_iterations=30,
child_iterations=0,  # don't optimize hyperparameters, for demonstration only
parent_algorithm='bayes',
child_algorithm='random',
eval_metric='mse',
monitor=['r2', 'r2_score'],
models=[ "LinearRegression",
"LassoLars",
"Lasso",
"RandomForestRegressor",
"HistGradientBoostingRegressor",
"CatBoostRegressor",
"XGBRegressor",
"LGBMRegressor",
"GradientBoostingRegressor",
"ExtraTreeRegressor",
"ExtraTreesRegressor"
],
input_features=data.columns.tolist()[0:-1],
output_features=data.columns.tolist()[-1:],
split_random=True,
)
results = pl.fit(data=data, process_results=False)
Out:
(output abridged: library deprecation/overflow warnings, and a per-iteration progress table of the monitored metrics mse, r2 and r2_score over the 30 parent iterations)
pl.optimizer_._plot_convergence(save=False)

pl.optimizer_._plot_parallel_coords(figsize=(16, 8), save=False)

pl.optimizer_._plot_distributions(save=False)

Out:
<Figure size 2100x2100 with 16 Axes>
pl.optimizer_.plot_importance(save=False)

_ = plot_objective(results)

pl.optimizer_._plot_evaluations(save=False)

pl.optimizer_._plot_edf(save=False)

pl.bfe_all_best_models(data=data)

Out:
/home/docs/checkouts/readthedocs.org/user_builds/autotab/envs/latest/lib/python3.7/site-packages/numpy/core/_methods.py:233: RuntimeWarning: overflow encountered in multiply
x = um.multiply(x, x, out=x)
/home/docs/checkouts/readthedocs.org/user_builds/autotab/envs/latest/lib/python3.7/site-packages/sklearn/linear_model/_base.py:138: FutureWarning: The default of 'normalize' will be set to False in version 1.2 and deprecated in version 1.4.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), LassoLars())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * np.sqrt(n_samples).
FutureWarning,
pl.dumbbell_plot(data=data, save=False)

Out:
/home/docs/checkouts/readthedocs.org/user_builds/autotab/envs/latest/lib/python3.7/site-packages/sklearn/linear_model/_base.py:138: FutureWarning: The default of 'normalize' will be set to False in version 1.2 and deprecated in version 1.4.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), LassoLars())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * np.sqrt(n_samples).
FutureWarning,
/home/docs/checkouts/readthedocs.org/user_builds/autotab/envs/latest/lib/python3.7/site-packages/sklearn/linear_model/_coordinate_descent.py:648: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.075e+16, tolerance: 9.458e+12
coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random, positive
<AxesSubplot:xlabel='mse', ylabel='Models'>
pl.dumbbell_plot(data=data, metric_name='r2', save=False)

Out:
<AxesSubplot:xlabel='r2', ylabel='Models'>
pl.taylor_plot(data=data, save=False)

Out:
<Figure size 640x480 with 1 Axes>
pl.compare_models()

Out:
<PolarAxesSubplot:>
pl.compare_models(plot_type="bar_chart")

Out:
<AxesSubplot:xlabel='mse'>
pl.compare_models("r2", plot_type="bar_chart")

Out:
<AxesSubplot:xlabel='r2'>
print(f"all results are save in {pl.path} folder")
Out:
all results are saved in /home/docs/checkouts/readthedocs.org/user_builds/autotab/checkouts/latest/examples/results/pipeline_opt_20220504_162449 folder
pl.cleanup()
Total running time of the script: (3 minutes 7.167 seconds)