The Complete Guide to Machine Learning: How to Generate Synthetic Datasets with Python

• Generate your first synthetic dataset

• Add noise

• Adjust class balance

• Adjust class separation

Generate your first synthetic dataset

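Scikit-Learn ships with a handy make_classification() function. It is not the only function for generating synthetic datasets, but it is the one used throughout this guide. The snippets below import the required libraries, generate a two-class dataset with 1000 samples and 2 features, and define a small helper for plotting it.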

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False


X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
# 5 random rows
df.sample(5)
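Before plotting, it can help to confirm what make_classification() actually returned. The short check below is a sketch that repeats the same call and prints the array shapes and the labels; the printing itself is an addition for illustration and not part of the original walkthrough. With these parameters X has shape (1000, 2) and y has one label per sample.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in the article
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)

print(X.shape)         # feature matrix: one row per sample, one column per feature
print(y.shape)         # label vector: one entry per sample
print(np.unique(y))    # distinct class labels
print(np.bincount(y))  # number of samples in each class
```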


def plot(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = '', save: bool = False, figname='figure.png'):
    plt.figure(figsize=(14, 7))
    plt.scatter(x=df[df[y] == 0][x1], y=df[df[y] == 0][x2], label='y = 0')
    plt.scatter(x=df[df[y] == 1][x1], y=df[df[y] == 1][x2], label='y = 1')
    plt.title(title, fontsize=20)
    plt.legend()
    if save:
        plt.savefig(figname, dpi=300, bbox_inches='tight', pad_inches=0)
    plt.show()


plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes')


Add noise

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Added noise')
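The flip_y parameter sets the fraction of samples whose class is assigned at random, so larger values make the labels noisier. One rough way to see the effect without a plot is to fit a simple classifier and compare its accuracy with and without noise. The sketch below makes that comparison; LogisticRegression and the training_accuracy helper are choices made here for illustration and do not come from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def training_accuracy(flip_y: float) -> float:
    """Generate a dataset with the given flip_y, fit a logistic regression
    on it, and return the accuracy on that same data (hypothetical helper)."""
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        flip_y=flip_y,
        random_state=42,
    )
    model = LogisticRegression().fit(X, y)
    return model.score(X, y)


acc_clean = training_accuracy(flip_y=0.0)   # no label noise
acc_noisy = training_accuracy(flip_y=0.15)  # same setting as in the article
print(f"flip_y=0.00 -> accuracy {acc_clean:.3f}")
print(f"flip_y=0.15 -> accuracy {acc_noisy:.3f}")
```

The noisy dataset should score lower, since some of its labels no longer match the region the points were drawn from.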


Adjust class balance

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 1)')


X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 0)')
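The weights list gives the proportion of samples per class; when it has one entry fewer than the number of classes, the last class gets the remainder. So weights=[0.95] leaves roughly 5% of samples for y = 1, and weights=[0.05] does the opposite. The sketch below counts the labels for both settings; class_counts is a hypothetical helper added here for illustration. The counts are approximate rather than exact because flip_y defaults to a small nonzero value.

```python
import numpy as np
from sklearn.datasets import make_classification


def class_counts(weights):
    """Return the number of samples in class 0 and class 1
    for the given weights (hypothetical helper)."""
    _, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=weights,
        random_state=42,
    )
    return np.bincount(y, minlength=2)


counts_mostly_0 = class_counts([0.95])  # class 0 is the majority
counts_mostly_1 = class_counts([0.05])  # class 1 is the majority
print("weights=[0.95]:", counts_mostly_0)
print("weights=[0.05]:", counts_mostly_1)
```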

Adjust class separation

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification easier')
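The class_sep parameter spreads the class clusters further apart, which is why a larger value makes classification easier. A simple way to put a number on this is to measure the distance between the two class centroids. The sketch below compares class_sep=5 with scikit-learn's default of 1.0; the centroid_distance helper is an illustrative addition, not something the original article defines.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep: float) -> float:
    """Return the Euclidean distance between the mean point of class 0
    and the mean point of class 1 (hypothetical helper)."""
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        class_sep=class_sep,
        random_state=42,
    )
    centroid_0 = X[y == 0].mean(axis=0)
    centroid_1 = X[y == 1].mean(axis=0)
    return float(np.linalg.norm(centroid_0 - centroid_1))


dist_default = centroid_distance(1.0)  # scikit-learn's default class_sep
dist_wide = centroid_distance(5.0)     # same setting as in the article
print(f"class_sep=1 -> centroid distance {dist_default:.2f}")
print(f"class_sep=5 -> centroid distance {dist_wide:.2f}")
```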