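The feature selection posts in this series are based on the scikit-learn machine learning library and are essentially translations of its documentation. The quality of a trained model depends in part on feature extraction, so feature extraction is an important step.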
Removing features with low variance
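【Remove features with small variance. Put simply, such a feature takes nearly the same value for every sample, so it has no power to tell samples apart.】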
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.
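【VarianceThreshold is a very simple criterion for feature selection: it removes every feature whose variance does not reach the specified threshold. By default it removes zero-variance features, that is, features that take the same value in every sample.】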
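The default behaviour described above can be sketched as follows. This is a minimal illustration rather than part of the original documentation; the array `X_demo` is hypothetical toy data chosen so that its second column is constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the second column has the same value in every sample.
X_demo = np.array([[1, 7, 0],
                   [2, 7, 1],
                   [3, 7, 0]])

# With no arguments the threshold is 0.0, so only zero-variance columns are dropped.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X_demo)

# get_support() returns a boolean mask marking the columns that were kept.
kept = selector.get_support()
print(kept)
print(X_reduced.shape)
```

The mask from `get_support()` is handy when the columns have names, because it shows which of them survived the filter.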
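As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by
【For example, suppose we have a dataset of boolean features and want to remove every feature in which 0 or 1 accounts for more than 80% of the samples. Since each feature follows a Bernoulli (0-1) distribution, its variance is given by the following formula.】
$$\mathrm{Var}[X] = p(1 - p)$$
so we can select using the threshold .8 * (1 - .8):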
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
As expected, VarianceThreshold has removed the first column, which has a probability $p = 5/6 > .8$ of containing a zero.
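To see why only the first column is dropped, the per-column Bernoulli variance can be checked by hand. The sketch below uses plain Python and the same `X` as the example. The helper `bernoulli_variances` is a hypothetical name introduced for illustration, and the strict `>` comparison against the threshold is an assumption about how the selector decides which columns to keep.

```python
# Same toy data as in the example above.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
threshold = 0.8 * (1 - 0.8)


def bernoulli_variances(rows):
    """Return p * (1 - p) for each column, where p is the fraction of ones."""
    n = len(rows)
    variances = []
    for column in zip(*rows):  # zip(*rows) iterates over columns
        p = sum(column) / n
        variances.append(p * (1 - p))
    return variances


variances = bernoulli_variances(X)

# Assumption: a column is kept only if its variance strictly exceeds the threshold.
kept_columns = [i for i, v in enumerate(variances) if v > threshold]
print(variances)
print(kept_columns)
```

The first column contains a one in only 1 of 6 samples, so its variance is 5/36 (about 0.139), which falls below the threshold of roughly 0.16. The other two columns have variances 2/9 and 1/4 and are kept.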