Four Key Ensemble Learning Methods for Improving Predictive Performance

Ensemble learning is a popular machine learning technique that combines the predictions of multiple models to achieve better accuracy and robustness than any single model. As data volume and complexity grow, ensemble learning is increasingly applied in areas such as image classification, natural language processing, and recommender systems.

Bagging (Bootstrap Aggregating)

Bagging trains multiple models independently on different bootstrap subsets of the training data and combines their outputs, which reduces variance and helps prevent overfitting. Random Forest is a classic example of a Bagging-based method.


Example code:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base estimator
base_estimator = DecisionTreeClassifier()

# Define the Bagging classifier
# (the argument is named "estimator" in scikit-learn >= 1.2; older versions use "base_estimator")
bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)

# Fit the model
bagging.fit(X_train, y_train)

# Make predictions on the test set
y_pred = bagging.predict(X_test)

# Compute the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

One advantage of Bagging is that it improves model stability and accuracy, and it is particularly well suited to noisy data and high-dimensional input spaces. However, it can be computationally expensive, and it may not be appropriate for small or class-imbalanced datasets.
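Since Random Forest was cited above as a typical Bagging method, here is a minimal sketch of the same iris task using scikit-learn's RandomForestClassifier; the n_estimators value is an illustrative choice rather than a tuned setting. A random forest bags decision trees and additionally samples a random subset of features at each split.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same iris data and split as the Bagging example above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagged decision trees with random feature subsets at each split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate on the held-out test set
print(f"Accuracy: {accuracy_score(y_test, rf.predict(X_test))}")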

Boosting

Boosting trains a sequence of weak models, each of which learns from the errors of its predecessor, thereby improving accuracy. AdaBoost and Gradient Boosting are common Boosting algorithms.

Example code:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

# Create the base estimator (a decision stump)
base_estimator = DecisionTreeClassifier(max_depth=1)

# Create the AdaBoost classifier
# (the argument is named "estimator" in scikit-learn >= 1.2; older versions use "base_estimator")
adaboost = AdaBoostClassifier(estimator=base_estimator, n_estimators=50)

# Fit the model
adaboost.fit(X_train, y_train)

# Make predictions on the test data
y_pred = adaboost.predict(X_test)

# Compute and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Boosting can capture complex relationships and reduce both bias and variance. However, it is prone to overfitting, computationally expensive, and may not be suitable for very large datasets.
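Gradient Boosting, mentioned above alongside AdaBoost, builds each new tree to correct the errors of the ensemble trained so far. Below is a minimal sketch on the same iris dataset, assuming scikit-learn's GradientBoostingClassifier; the learning_rate and max_depth values are illustrative, not tuned.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset and split it, mirroring the AdaBoost example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each new tree is fit to the errors (gradient of the loss) of the current ensemble
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = gb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))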

Stacking


Stacking feeds the outputs of several base models into a meta-model, which produces the final prediction. It differs from Bagging and Boosting in how the models are combined.

Example code:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Create base models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Create meta-model
lr = LogisticRegression()

# Cross-validation splitter for out-of-fold predictions
# (random_state requires shuffle=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Meta-features: one column of predicted probabilities per base model,
# filled with out-of-fold predictions so the meta-model never sees
# predictions made on data the base models were trained on
meta_features = np.zeros((X.shape[0], 2))

# Perform stacking: in each fold, fit the base models on the training part
# and predict probabilities for the held-out part
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test = X[test_index]

    # Fit base models
    rf.fit(X_train, y_train)
    gb.fit(X_train, y_train)

    # Store out-of-fold predicted probabilities as meta-features
    meta_features[test_index, 0] = rf.predict_proba(X_test)[:, 1]
    meta_features[test_index, 1] = gb.predict_proba(X_test)[:, 1]

# Fit the final meta-model on the out-of-fold meta-features
lr.fit(meta_features, y)

# Estimate the ensemble's accuracy with cross-validated meta-model predictions
final_pred = cross_val_predict(lr, meta_features, y, cv=kf)
acc = accuracy_score(y, final_pred)

print(f"Accuracy of stacking ensemble: {acc}")

A key advantage of Stacking is that it can exploit the complementary strengths of different models, potentially achieving higher accuracy than Bagging or Boosting. However, it is computationally expensive, because several base models and a meta-model must all be trained.
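As an alternative to the manual loop above, scikit-learn provides a built-in StackingClassifier (available since version 0.22) that generates the out-of-fold meta-features and trains the meta-model internally. Below is a minimal sketch on the same synthetic dataset, reusing the base models and meta-model from the example above; the hyperparameters are illustrative rather than tuned.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic dataset as the manual stacking example
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# StackingClassifier fits the base models, builds out-of-fold meta-features
# with internal cross-validation, and trains the final estimator on them
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(f"Accuracy: {stack.score(X_test, y_test)}")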

Voting

Voting is a popular ensemble method that combines the predictions of multiple models into a final prediction. In its simplest form, the class that receives the majority of votes becomes the final prediction.


Example code:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2, random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Define individual classifiers
clf1 = LogisticRegression(random_state=42)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = SVC(random_state=42, probability=True)

# Define the voting classifier with hard voting
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('svm', clf3)], voting='hard')

# Fit the voting classifier on the training set
voting_clf.fit(X_train, y_train)

# Evaluate the voting classifier on the test set
accuracy = voting_clf.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

The advantage of Voting is that it combines the strengths of the individual models, yielding better accuracy and generalization. Votes can be combined either by hard voting (a majority over predicted labels) or by soft voting (averaging predicted probabilities), as sketched below.
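To illustrate the soft-voting variant, below is a minimal sketch that reuses the same dataset and base classifiers as above but averages predicted class probabilities instead of counting predicted labels; probability=True on the SVC is required so that it can output probabilities.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset and split as the hard-voting example
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Soft voting averages each classifier's predicted probabilities,
# so every estimator must expose predict_proba
soft_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('dt', DecisionTreeClassifier(random_state=42)),
        ('svm', SVC(random_state=42, probability=True)),
    ],
    voting='soft',
)
soft_clf.fit(X_train, y_train)
print(f"Accuracy (soft voting): {soft_clf.score(X_test, y_test)}")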
