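Ensemble learning is a widely used machine learning technique that combines the predictions of multiple models to achieve better accuracy and robustness than any single model. As datasets grow in size and complexity, ensemble methods are increasingly applied in areas such as image classification, natural language processing, and recommender systems.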
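Bagging (Bootstrap Aggregating)
Bagging trains multiple models independently, each on a different bootstrap sample of the training data (drawn with replacement), and combines their outputs by voting or averaging. This reduces the variance of the model and helps prevent overfitting. Random forest is a typical example of a method built on Bagging.
Example code: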
<code>from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base estimator
base_estimator = DecisionTreeClassifier()

# Define the Bagging classifier. The base estimator is passed positionally
# because its keyword was renamed from `base_estimator` to `estimator`
# in scikit-learn 1.2.
bagging = BaggingClassifier(base_estimator, n_estimators=10, random_state=42)

# Fit the model
bagging.fit(X_train, y_train)

# Predict on the test set
y_pred = bagging.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")</code>
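One advantage of Bagging is that it improves the stability and accuracy of a model, and it is particularly useful for noisy data and high-dimensional input spaces. However, training many models can be computationally expensive, and Bagging may be a poor fit for small or class-imbalanced datasets, where an individual bootstrap sample can contain few or no examples of a rare class.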
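To make the mechanism concrete, here is a minimal hand-rolled sketch of Bagging: draw bootstrap samples, fit one tree per sample, and combine the trees by majority vote. The names `n_models`, `trees`, and `bagged_pred` are illustrative choices, and plain majority voting is a simplification; it is not meant to reproduce what `BaggingClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_models = 10  # illustrative ensemble size
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Row i of all_preds holds the test-set predictions of tree i.
all_preds = np.array([tree.predict(X_test) for tree in trees])

# Majority vote per test sample (iris labels are the integers 0, 1, 2).
bagged_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)
manual_accuracy = (bagged_pred == y_test).mean()
print(f"Manual bagging accuracy: {manual_accuracy:.3f}")
```

Because every tree sees a slightly different resample of the data, their individual errors are only partly correlated, and voting over them smooths out much of that variation.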
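Boosting
Boosting trains a sequence of weak models, where each model focuses on the errors made by its predecessors, so that the combined model becomes more accurate. AdaBoost and Gradient Boosting are common Boosting algorithms.
Example code: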
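<code>from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()

# Split into training and test sets (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Create the base estimator: a depth-1 tree (decision stump)
base_estimator = DecisionTreeClassifier(max_depth=1)

# Create the AdaBoost classifier. The base estimator is passed positionally
# because its keyword was renamed from `base_estimator` to `estimator`
# in scikit-learn 1.2.
adaboost = AdaBoostClassifier(base_estimator, n_estimators=50, random_state=42)

# Fit the model
adaboost.fit(X_train, y_train)

# Predict on the test data
y_pred = adaboost.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)</code>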
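Boosting can model complex relationships; it reduces bias in particular and often lowers variance as well. However, it is prone to overfitting, especially on noisy data, and because the models are trained one after another it is computationally expensive and may not be suitable for large datasets.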
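The idea of "learning from the previous model's errors" can be sketched by hand for the binary case. The code below is a simplified discrete AdaBoost loop on a synthetic dataset, with labels recoded to -1/+1: after each round, misclassified samples get larger weights so the next stump concentrates on them. The dataset, `n_rounds`, and the variable names are assumptions made for illustration, and scikit-learn's own `AdaBoostClassifier` differs in its details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
y = 2 * y01 - 1  # recode labels from {0, 1} to {-1, +1}
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

n_rounds = 20  # illustrative number of boosting rounds
weights = np.full(len(X_train), 1.0 / len(X_train))  # start uniform
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a decision stump on the currently weighted training set.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_train, y_train, sample_weight=weights)
    pred = stump.predict(X_train)

    # Weighted error rate, clipped to keep the logarithm finite.
    err = np.clip(np.sum(weights[pred != y_train]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this stump's say in the vote

    # Increase the weights of misclassified samples, decrease the rest.
    weights = weights * np.exp(-alpha * y_train * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes.
scores = sum(a * s.predict(X_test) for a, s in zip(alphas, stumps))
boosted_pred = np.where(scores >= 0, 1, -1)
boosted_accuracy = (boosted_pred == y_test).mean()
print(f"Manual AdaBoost accuracy: {boosted_accuracy:.3f}")
```

Unlike Bagging, the rounds here cannot run in parallel, since each round's sample weights depend on the previous round's predictions.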
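Stacking
Stacking feeds the outputs of multiple base models into a meta-model, which makes the final prediction. It differs from Bagging and Boosting in how the models are combined: instead of voting or averaging, a second-level model learns how to weigh the base models' predictions. To avoid leakage, the meta-model should be trained on out-of-fold predictions, that is, predictions the base models made for samples they were not fitted on.
Example code: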
<code>import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Hold out a test set that is never used to fit the base models or the meta-model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create base models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Create meta-model
lr = LogisticRegression()

# KFold requires shuffle=True whenever random_state is set
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions on the training set: each sample is predicted
# by a model that did not see it during fitting
rf_train_pred = cross_val_predict(rf, X_train, y_train, cv=kf, method='predict_proba')
gb_train_pred = cross_val_predict(gb, X_train, y_train, cv=kf, method='predict_proba')

# Meta-features for training: class-1 probability from each base model
train_meta_features = np.column_stack([rf_train_pred[:, 1], gb_train_pred[:, 1]])

# Refit the base models on the full training set
rf.fit(X_train, y_train)
gb.fit(X_train, y_train)

# Meta-features for the test set
test_meta_features = np.column_stack([
    rf.predict_proba(X_test)[:, 1],
    gb.predict_proba(X_test)[:, 1],
])

# Fit meta-model on the training meta-features
lr.fit(train_meta_features, y_train)

# Generate final predictions using meta-model on meta-features of test set
final_pred = lr.predict(test_meta_features)

# Calculate accuracy of final predictions on the held-out test set
acc = accuracy_score(y_test, final_pred)
print(f"Accuracy of stacking ensemble: {acc}")</code>
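One advantage of Stacking is that it exploits the strengths of different kinds of models, and it can often achieve higher accuracy than Bagging or Boosting alone. However, it is computationally expensive, because multiple base models and a meta-model must all be trained.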
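For comparison, scikit-learn also provides `StackingClassifier`, which handles the cross-validated meta-feature bookkeeping internally. The sketch below assumes the same two base models and a logistic-regression meta-model; the variable names `stack` and `stack_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,  # folds used to produce the meta-model's training features
    stack_method="predict_proba",  # feed class probabilities to the meta-model
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"StackingClassifier accuracy: {stack_accuracy:.3f}")
```

The `cv` argument makes the training cost explicit: each base model is fitted once per fold to produce meta-features and once more on the full training set.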
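Voting
Voting is a popular ensemble method that combines the predictions of multiple models into a final prediction. In its simplest form, the class predicted by the majority of the models wins.
Example code: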
<code>from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Generate synthetic dataset
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42
)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define individual classifiers
clf1 = LogisticRegression(random_state=42)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = SVC(random_state=42, probability=True)

# Define the voting classifier with hard voting
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svm', clf3)],
    voting='hard',
)

# Fit the voting classifier on the training set
voting_clf.fit(X_train, y_train)

# Evaluate the voting classifier on the test set
accuracy = voting_clf.score(X_test, y_test)
print(f"Accuracy: {accuracy}")</code>
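The advantage of Voting is that it combines the strengths of the individual models, which can yield better accuracy and generalization. Voting can be hard (each model casts one vote for its predicted class) or soft (the models' predicted class probabilities are averaged, and the class with the highest average wins).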
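As a rough illustration of the difference, the sketch below fits a hard-voting and a soft-voting ensemble over the same three classifiers, then recomputes the soft vote by hand as the unweighted mean of the members' `predict_proba` outputs. The variable names and the `max_iter` setting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

estimators = [
    ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    # probability=True is needed so SVC can take part in soft voting.
    ("svm", SVC(probability=True, random_state=42)),
]

# VotingClassifier fits clones, so the same list can be reused.
hard_clf = VotingClassifier(estimators=estimators, voting="hard")
soft_clf = VotingClassifier(estimators=estimators, voting="soft")
hard_clf.fit(X_train, y_train)
soft_clf.fit(X_train, y_train)

hard_accuracy = hard_clf.score(X_test, y_test)
soft_accuracy = soft_clf.score(X_test, y_test)
print(f"Hard voting accuracy: {hard_accuracy:.3f}")
print(f"Soft voting accuracy: {soft_accuracy:.3f}")

# Soft voting by hand: average the fitted members' class probabilities.
manual_avg = np.mean(
    [est.predict_proba(X_test) for est in soft_clf.estimators_], axis=0
)
manual_pred = soft_clf.classes_[manual_avg.argmax(axis=1)]
```

Soft voting lets a confident model outweigh two lukewarm ones, which is why it requires every member to produce probability estimates; which variant scores higher depends on the data and on how well calibrated the members are.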