Many data scientists, machine learning engineers, and researchers rely on this library for their machine learning projects. I personally love using the scikit-learn library because it offers a ton of flexibility, and its documentation is easy to understand with a lot of examples.
In this article, I’m happy to share with you the 5 best new features in scikit-learn 0.24. First, make sure you have the latest version installed with pip:
pip install --upgrade scikit-learn
Or, if you use conda:
conda install -c conda-forge scikit-learn
Note: This version supports Python versions 3.6 to 3.9.
Now, let’s look at the new features!
The first one is a new regression metric: the mean absolute percentage error (MAPE). Previously, you had to compute it manually, for example:
np.mean(np.abs((y_test - preds) / y_test))
But now you can call the mean_absolute_percentage_error function from the sklearn.metrics module to evaluate the performance of your regression model.
Example:
from sklearn.metrics import mean_absolute_percentage_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_absolute_percentage_error(y_true, y_pred))
Note: Keep in mind that the function does not represent the output as a percentage in the range [0, 100]. Instead, the output is a value in the range [0, 1/eps], where eps is a very small positive number used to avoid division by zero. The best value is 0.0.
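To see why the upper bound is 1/eps rather than 100, here is a small sketch of what happens when a true value is zero (the exact magnitude depends on scikit-learn's internal epsilon, so treat the number as illustrative):
from sklearn.metrics import mean_absolute_percentage_error
# A zero in y_true makes the division clamp to a tiny epsilon instead of raising an error,
# so the percentage error for that sample explodes towards 1/eps.
y_true = [0, 2, 4]
y_pred = [1, 2, 4]
print(mean_absolute_percentage_error(y_true, y_pred))  # a huge number, roughly 1/(3 * eps)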
OneHotEncoder can now handle missing values if they are present in the dataset. It treats a missing value as an additional category. Let’s understand more about how it works in the following example.
First, import the important packages.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# initialise the data as a dictionary of lists
data = {'education_level':['primary', 'secondary', 'bachelor', np.nan,'masters',np.nan]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
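The printed DataFrame should look something like this (NaN marks the two missing values):
  education_level
0         primary
1       secondary
2        bachelor
3             NaN
4         masters
5             NaN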
As you can see, we have two missing values in our education_level column.
Create the instance of OneHotEncoder.
enc = OneHotEncoder()
enc.fit_transform(df).toarray()
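If the encoder treats missing values as their own category, the encoded array gets one extra column for NaN. You can check which column corresponds to which category through the categories_ attribute (a sketch of the expected result, not verbatim output):
print(enc.categories_)
# Expected: something like [array(['bachelor', 'masters', 'primary', 'secondary', nan], dtype=object)],
# so rows 3 and 5 (the NaN rows) put their 1 in the last column of the encoded array.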
SequentialFeatureSelector is a new method for feature selection in scikit-learn. It can perform either forward selection or backward selection.
(a) Forward Selection
It iteratively finds the best new feature and adds it to the set of selected features. This means we start with zero features and find the single feature that maximizes the cross-validation score of an estimator. The selected feature is added to the set, and the procedure is repeated until we reach the desired number of selected features.
(b) Backward Selection
This follows the same idea but in the opposite direction: we start with all features and remove one feature at a time from the set until we reach the desired number of selected features.
Example
Import the important packages.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True, as_frame=True)
feature_names = X.columns
knn = KNeighborsClassifier(n_neighbors=3)
Create the instance of SequentialFeatureSelector, set the number of features to select to be 2, and set the direction to be “backward”.
sfs = SequentialFeatureSelector(knn, n_features_to_select=2, direction='backward')
sfs.fit(X, y)
print("Features selected by backward sequential selection: "
f"{feature_names[sfs.get_support()].tolist()}")
When it comes to hyperparameter tuning, GridSearchCV and RandomizedSearchCV from scikit-learn have been the first choice for many data scientists. But the new version introduces two new classes for hyperparameter tuning: HalvingGridSearchCV and HalvingRandomSearchCV.
HalvingGridSearchCV and HalvingRandomSearchCV use a new approach called successive halving to find the best hyperparameters. Successive halving is like a competition or tournament among all hyperparameter combinations.
How does successive halving work?
In the first iteration, all hyperparameter combinations are trained on a small subset of the observations (training data). In the next iteration, only the combinations that performed well in the first iteration are selected, and they compete again on a larger number of observations. This selection process is repeated at each iteration until the best combination of hyperparameters is selected in the final iteration.
Note: These classes are still experimental.
Example:
Import the important packages.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint
Since these new classes are still experimental, we explicitly import enable_halving_search_cv before importing them, as shown above.
Create a classification dataset by using the make_classification method.
X, y = make_classification(n_samples=1000)
clf = RandomForestClassifier(n_estimators=20)
param_dist = {"max_depth": [3, None],
"max_features": randint(1, 11),
"min_samples_split": randint(2, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
rsh = HalvingRandomSearchCV(
estimator=clf,
param_distributions=param_dist,
cv=5,
factor=2,
min_resources=20)
(a) factor - This determines the proportion of hyperparameter combinations that are selected for each subsequent iteration. For example, factor=3 means that only one-third of the candidates are selected for the next iteration, and each surviving candidate gets three times more resources.
(b) min_resources - This is the number of resources (here, the number of observations) allocated to each hyperparameter combination in the first iteration.
Finally, we can fit the search object that we have created with our dataset.
rsh.fit(X, y)
print(rsh.n_iterations_ )
6
print(rsh.n_candidates_ )
[50, 25, 13, 7, 4, 2]
print(rsh.n_resources_)
[20, 40, 80, 160, 320, 640]
print(rsh.best_params_)
{'bootstrap': False,
'criterion': 'entropy',
'max_depth': None,
'max_features': 5,
'min_samples_split': 2}
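As with GridSearchCV and RandomizedSearchCV, the fitted search object also keeps the refit best model (assuming the default refit=True), so you can use it directly, for example:
best_model = rsh.best_estimator_
print(best_model.score(X, y))  # accuracy of the refit model on the training data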
Scikit-learn 0.24 has introduced a new self-training implementation for semi-supervised learning called SelfTrainingClassifier. It can be used with any supervised classifier that can return probability estimates for each class.
This means any supervised classifier can function as a semi-supervised classifier, allowing it to learn from unlabeled observations in the dataset.
Note: The unlabeled values in the target column must have a value of -1.
Let’s understand more about how it works in the following example.
Import the important packages.
import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
In this example, we will use the iris dataset and the support vector machine algorithm as the supervised classifier (it implements fit and predict_proba).
Then we load the dataset and randomly select some of the observations to be unlabeled.
rng = np.random.RandomState(42)
iris = datasets.load_iris()
random_unlabeled_points = rng.rand(iris.target.shape[0]) < 0.3
iris.target[random_unlabeled_points] = -1
svc = SVC(probability=True, gamma="auto")
self_training_model = SelfTrainingClassifier(base_estimator=svc)
self_training_model.fit(iris.data, iris.target)
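After fitting, the self-training model behaves like any other classifier. A minimal sketch of using it for prediction:
predictions = self_training_model.predict(iris.data)
print(predictions[:10])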
And you can read more articles like this here.