一种可扩展的同时进化实例和特征选择方法

2024-05-09 00:58

本文主要是介绍一种可扩展的同时进化实例和特征选择方法,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

#引用

##Latex

@article{GARCIAPEDRAJAS2013150,
title = “A scalable approach to simultaneous evolutionary instance and feature selection”,
journal = “Information Sciences”,
volume = “228”,
pages = “150 - 174”,
year = “2013”,
issn = “0020-0255”,
doi = “https://doi.org/10.1016/j.ins.2012.10.006”,
url = “http://www.sciencedirect.com/science/article/pii/S0020025512006718”,
author = “Nicol’{a}s Garc’{\i}a-Pedrajas and Aida de Haro-Garc’{\i}a and Javier P’{e}rez-Rodr’{\i}guez”,
keywords = “Simultaneous instance and feature selection, Instance selection, Feature selection, Instance-based learning, Very large problems”
}

##Normal

Nicolás García-Pedrajas, Aida de Haro-García, Javier Pérez-Rodríguez,
A scalable approach to simultaneous evolutionary instance and feature selection,
Information Sciences,
Volume 228,
2013,
Pages 150-174,
ISSN 0020-0255,
https://doi.org/10.1016/j.ins.2012.10.006.
(http://www.sciencedirect.com/science/article/pii/S0020025512006718)
Keywords: Simultaneous instance and feature selection; Instance selection; Feature selection; Instance-based learning; Very large problems


#摘要

enormous amount of information
bioinformatics, security and intrusion detection and text mining

data reduction — removing

  1. missing

  2. redundant

  3. information-poor data and/or

  4. erroneous data
    from the dataset to obtain a tractable problem size

  5. feature selection

  6. feature-value discretization

  7. instance selection

the divide-and-conquer principle + bookkeeping

in linear time


#主要内容

Instance selection:choosing a subset of the total available data to achieve the original purpose of the datamining application as though all the data were being used

  1. prototype selection(k-Nearest Neighbors)
  2. obtaining the training set for a learning algorithm(classification trees or neural networks)

‘‘the isolation of the smallest set of instances that enable us to predict the class of a query instance with the same (or higher) accuracy than the original set’’

The objectives of feature selection:

  1. To avoid over-fitting and to improve model performance
  2. To provide faster and more cost-effective models
  3. To gain a deeper insight into the underlying processes that generated the data

simultaneous instance and feature selection

这里写图片描述

scalable simultaneous instance and feature selection method (SSIFSM)


#相关工作

three fundamental approaches for scaling up learning methods:

  1. designing fast algorithms
  2. partitioning the data
  3. using a relational representation

The stratification strategy
分层策略
splits the training data into disjoint strata with equal class distribution

这里写图片描述


#算法

Scalable simultaneous instance and feature selection method (SSIFSM)

K K K classes and N N N training instances with M M M features
T = { ( x 1 , y 1 ) , ( x 2 , y 2 ) , … , ( x N , y N ) } T = \left\{ \left(x_1, y_1 \right), \left(x_2, y_2 \right), \ldots, \left(x_N, y_N \right) \right\} T={(x1,y1),(x2,y2),,(xN,yN)}
Y = { 1 , … , K } Y = \left\{ 1, \ldots, K \right\} Y={1,,K}

simultaneous instance and feature selection

Bookkeeping

这里写图片描述

这里写图片描述

这里写图片描述

这里写图片描述

an evolutionary algorithm

这里写图片描述

a CHC algorithm —
Cross-generational elitist selection, Heterogeneous recombination and Cataclysmic mutation

  1. To obtain the next generation for a population of size P P P, the parents and the offspring are put together and the P P P best individuals are selected.
  2. To avoid premature convergence, only different individuals separated by a threshold Hamming distance—in our implementation, the length of the chromosome divided by four—are allowed to mate.
  3. During crossover, two parents exchange exactly half of their nonmatching bits. This operator is referred to as Half Uniform Crossover (HUX).
  4. Mutation is not used during the regular evolution. To avoid premature convergence or stagnation of the search, the population is reinitialized when the individuals are not diverse. In such a case, only the best individual is retained in the new population.

这里写图片描述

a c c ( x ) acc(x) acc(x) — the accuracy — 1-NN classifier
r e d ( x ) red(x) red(x) — the reduction — KaTeX parse error: Unexpected character: '' at position 3: 1 ̲ - \frac{N'}{N}…

α \alpha α — a weighting parameter — 0.5 0.5 0.5

the instance and feature selection

record the number of times that each instance and feature has been selected to be kept
— the number of votes (its similarity to the combination of classifiers in an ensemble by voting)
— repeated for r r r rounds
all the instances of a certain subset belong to the same class — ignored

the combination of the different rounds

the philosophy of ensembles of classifiers
several weak learners are combined to form a strong classifier — several weak (in the sense that they are applied to subsets of the training data) instance and feature selection procedures are combined to produce a strong and fast selection method

bagging or boosting

number of votes — KaTeX parse error: Unexpected character: '' at position 8: [ 0, r ̲\cdot s ] (an instance)
an instance is in s s s subsets in every round.
Each feature is in t t t subsets each round.
KaTeX parse error: Unexpected character: '' at position 8: [ 0, r ̲\cdot t ] (an feature)

threshold

majority voting — at least half (depends heavily on the problem)

automatically

这里写图片描述

θ i \theta_i θi — threshold for instance
θ f \theta_f θf — threshold for feature
T ( θ i , θ f ) T \left( \theta_i, \theta_f \right) T(θi,θf) — the subset of the training set
1-NN classifier
all the possible values
β \beta β — close to 1 1 1 0.75 0.75 0.75
r e d ( x ) = 1 − ( s i + s f ) N + M red \left( x \right) = 1 - \dfrac{ \left( s_i + s_f \right) }{ N + M } red(x)=1N+M(si+sf)

evaluate all possible pairs of instance and feature thresholds — KaTeX parse error: Unexpected character: '' at position 5: [ r ̲\cdot s ] \tim… O ( N 2 M ) O(N^2M) O(N2M)
a divide-and-conquer method
the same partition philosophy
training set is divided into random disjoint subsets and the accuracy is estimated separately in each subset using the average evaluation of all the subsets for the fitness of each pair of thresholds

这里写图片描述
这里写图片描述

The scalability of the method is assured by the following features:

  1. Application of the method to small datasets. Due to the small size of the subsets in which the selection process is applied, the selection process will always be fast, regardless of the complexity of the instance and feature selection method used in the subsets.
  2. Only small datasets must be kept in memory. This allows the application of instance and feature selection when datasets do not fit into memory.
  3. Bookkeeping is applied in the evolution for every subset.

##Complexity of our methodology

linear in the number of instances, N, of the dataset

The process of random partition — O ( N M ) O(NM) O(NM)
a subset of fixed size, n n n
the complexity of the CHC selection algorithm
K K K — the number of operations required by the selection algorithm to perform its task in a dataset of size n n n
N N N instances and M M M features
his selection process once for each subset, N n M m \frac{N}{n} \frac{M}{m} nNmM times
O ( ( N n M m ) K ) O \left( \left( \frac{N}{n} \frac{M}{m} \right) K \right) O((nNmM)K)
r r r rounds
O ( r ( N n M m ) K ) O \left( r \left( \frac{N}{n} \frac{M}{m} \right) K \right) O(r(nNmM)K)
linear
easy parallel implementation


##Parallel implementation

the master/slave architecture
master — the partition of the dataset and sends the subsets to each slave
slave — performs the selection algorithm + returns the selected instances and features to the master
master — stores the votes for each kept instance and feature

这里写图片描述

communication between different tasks
occurs only twice:

  1. Before each slave initiates its selection process, it must receive the subset of data to perform that process. This amount of information is always small because the method is based on each slave taking care of only a small part of the whole dataset. Furthermore, if the slaves can access the disk, they can read the necessary data directly from it.
  2. Once the selection process is finished, the slaves send the selection performed to the master. This selection consists of a list of the selected instances and features, which is a small sequence of integers.

the best value for the votes threshold:

  1. Before each slave initiates the evaluation of a certain pair of thresholds, it must receive the subset of data to perform that task. This amount of information is always small because the method is based on each slave taking care of only a small part of the whole dataset.
  2. Once the evaluation process is finished, the slaves send the evaluation performed to the master. The evaluation is the error obtained when the corresponding pair of thresholds is used, which is a real number.

#试验

50 problems

这里写图片描述

the UCI Machine Learning Repository

the reduction and accuracy
a 10-fold cross-validation method
a 1-nearest neighbor (1-NN) classifier

from small to medium sizes


##Algorithms for the comparison

  1. Nearest neighbor error using all instances and features is chosen as a baseline measure (1-NN). Any method of performing selection of features or instances must at least match the performance of the 1-NN algorithm or improve its accuracy if possible.
  2. As stated, Cano et al. [10] performed a comprehensive comparison of the performances of different evolutionary algorithms for instance selection. They compared a generational genetic algorithm, a steady-state genetic algorithm, a CHC genetic algorithm and a population-based incremental learning algorithm. Among these methods, CHC achieved the best overall performance (ISCHC). Therefore, we have included the CHC algorithm in our comparison. [10] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Transactions on Evolutionary Computation 7 (2003) 561–575.
  3. Feature selection using a genetic algorithm (FSCHC). Following the same idea used for instance selection, we used a CHC algorithm for feature selection with the same characteristics of the algorithm used for instance selection.
  4. Instance and feature selection using a genetic algorithm (IS + FSCHC). Combining the two previous methods, we also used a CHC algorithm that simultaneously evolved the instances and features.
  5. Intelligent Multiobjective Evolutionary Algorithm (IMOEA) [13] method. This method is a multi-objective evolutionary algorithm, which considers both instance and feature selection. The algorithm has two objectives, maximization of training accuracy and minimization of the number of instances and features selected. The multi-objective algorithm used is based on Pareto dominance because the approach is common in multi-objective algorithms [65]. The fitness of each individual is the difference between the individuals it dominates and the individuals that dominate it. The algorithm also includes a new crossover operator called intelligent crossover, which incorporates the systematic reasoning ability of orthogonal experimental design [43] to estimate the contribution of each gene to the fitness of the individuals.
  6. The major aim of our method is scalability. However, if scalability can be achieved with a simple random sampling method, it may be argued that our method is superfluous. Thus, our last method for comparison is a CHC algorithm for instance and feature selection, which uses a random sampling of instances (SAMPLING). The method is applied using a random 10% of all the instances in the dataset.

The source code used for all methods is in C and is licensed under the GNU General Public License.

in a cluster of 32 blades
Each blade is a biprocessor DELL Power Edge M600 with four cores per processor
256 cores
a 1 GB network
2.5 GHz
each blade has 16 GB of memory


##Statistical tests

Iman-Davenport test — based on the χ F 2 \chi^2_F χF2
Friedman test — compares the average ranks of k k k algorithms — more powerful than the Friedman test

这里写图片描述
following a F F F distribution with KaTeX parse error: Unexpected character: '' at position 4: k -̲ 1 and KaTeX parse error: Unexpected character: '' at position 5: (k -̲ 1) (N - 1) degrees of freedom

pairwise comparisons — the Wilcoxon test — stronger
family-wise error

The test statistic for comparing the ith and jth classifier

这里写图片描述

z z z — used to find the corresponding probability from the table of normal distribution

The Bonferroni–Dunn test — Holm
Holm’s procedure — more powerful than Bonferroni–Dunn’s


##Evaluation measures

accuracy — the percentage of instances classified correctly
reduction — the percentage of total data removed during the evolution

class-imbalanced problems

这里写图片描述

  1. true positives (TPs)
  2. false positives (FPs)
  3. true negatives (TNs)
  4. false negatives (FNs)

  • the sensitivity (Sn) — S n = T P T P + F N Sn = \frac{TP}{TP+FN} Sn=TP+FNTP
  • the specificity (Sp) — S p = T N T N + F P Sp = \frac{TN}{TN+FP} Sp=TN+FPTN
  • the G - mean measure — G − m e a n = S p ⋅ S n G-mean = \sqrt{Sp \cdot Sn} Gmean=SpSn

the reduction — 1 − n N m M 1 - \frac{n}{N} \frac{m}{M} 1NnMm


##实验结果

regarding the size of the subsets used

  1. 1000 instances and seven features — (1000,7)
  2. 500 instances and nine features — (500,9)
  3. 100 instances and 13 features — (100,13)

cache — 300–400 MB

memory thrashing

这里写图片描述
这里写图片描述

no significant differences among the three configurations for both accuracy and reduction

execution time — remarkable differences

这里写图片描述
这里写图片描述

这里写图片描述

The Iman–Davenport test with a p-value of 0.0000 for both accuracy and reduction — significant differences

Holm’s procedure —

这里写图片描述

the κ \kappa κ-error relative movement diagrams —

这里写图片描述
— the reduction difference — instead of the κ \kappa κ difference value
These diagrams use an arrow to represent the results of two methods applied to the same dataset.
a convenient way of summarizing the results
arrows pointing up-right
arrows pointing down-left

这里写图片描述

这里写图片描述
Regarding time — an approximately linear behavior

Holm test — (15 rounds)

这里写图片描述


###运行时间

the philosophy of divide-and-conquer
bookkeeping

这里写图片描述

the wall-clock time

significantly faster

the speedup of SSIFSM with respect to the standard IS + FSCHC for the parallel and sequential implementations respectively —

这里写图片描述

这里写图片描述

three control experiments —

这里写图片描述

这里写图片描述

  1. a random selection method — [ 5 % , 15 % ] [5\%,15\%] [5%,15%] and [ 15 % , 25 % ] [15\%,25\%] [15%,25%] (instance and features)
  2. IB3 [1] algorithm for instance selection and ReliefF [55] method for feature selection
  3. IB3 [1] algorithm for feature selection and ReliefF [55] method for instance selection

the performance in terms of reduction was similar
a very significant worsening of the accuracy


###Scalability to very large datasets

problems with many instances, many features and both many instances and features

the largest set containing 50 million instances and 800 features

a large imbalance ratio

G-mean

这里写图片描述


#结论

simultaneous instance and feature selection

a bookkeeping mechanism

a voting method

a CHC algorithm

class-imbalanced data

scaled up to problems of very large size

scalability:

  1. instance and feature selection is always performed over small datasets
  2. a bookkeeping approach
  3. only small subsets must be kept in memory

decision trees
support vector machines

这篇关于一种可扩展的同时进化实例和特征选择方法的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/971938

相关文章

Python中注释使用方法举例详解

《Python中注释使用方法举例详解》在Python编程语言中注释是必不可少的一部分,它有助于提高代码的可读性和维护性,:本文主要介绍Python中注释使用方法的相关资料,需要的朋友可以参考下... 目录一、前言二、什么是注释?示例:三、单行注释语法:以 China编程# 开头,后面的内容为注释内容示例:示例:四

一文详解Git中分支本地和远程删除的方法

《一文详解Git中分支本地和远程删除的方法》在使用Git进行版本控制的过程中,我们会创建多个分支来进行不同功能的开发,这就容易涉及到如何正确地删除本地分支和远程分支,下面我们就来看看相关的实现方法吧... 目录技术背景实现步骤删除本地分支删除远程www.chinasem.cn分支同步删除信息到其他机器示例步骤

java向微信服务号发送消息的完整步骤实例

《java向微信服务号发送消息的完整步骤实例》:本文主要介绍java向微信服务号发送消息的相关资料,包括申请测试号获取appID/appsecret、关注公众号获取openID、配置消息模板及代码... 目录步骤1. 申请测试系统2. 公众号账号信息3. 关注测试号二维码4. 消息模板接口5. Java测试

在Golang中实现定时任务的几种高效方法

《在Golang中实现定时任务的几种高效方法》本文将详细介绍在Golang中实现定时任务的几种高效方法,包括time包中的Ticker和Timer、第三方库cron的使用,以及基于channel和go... 目录背景介绍目的和范围预期读者文档结构概述术语表核心概念与联系故事引入核心概念解释核心概念之间的关系

MySQL数据库的内嵌函数和联合查询实例代码

《MySQL数据库的内嵌函数和联合查询实例代码》联合查询是一种将多个查询结果组合在一起的方法,通常使用UNION、UNIONALL、INTERSECT和EXCEPT关键字,下面:本文主要介绍MyS... 目录一.数据库的内嵌函数1.1聚合函数COUNT([DISTINCT] expr)SUM([DISTIN

在Linux终端中统计非二进制文件行数的实现方法

《在Linux终端中统计非二进制文件行数的实现方法》在Linux系统中,有时需要统计非二进制文件(如CSV、TXT文件)的行数,而不希望手动打开文件进行查看,例如,在处理大型日志文件、数据文件时,了解... 目录在linux终端中统计非二进制文件的行数技术背景实现步骤1. 使用wc命令2. 使用grep命令

Python中Tensorflow无法调用GPU问题的解决方法

《Python中Tensorflow无法调用GPU问题的解决方法》文章详解如何解决TensorFlow在Windows无法识别GPU的问题,需降级至2.10版本,安装匹配CUDA11.2和cuDNN... 当用以下代码查看GPU数量时,gpuspython返回的是一个空列表,说明tensorflow没有找到

XML重复查询一条Sql语句的解决方法

《XML重复查询一条Sql语句的解决方法》文章分析了XML重复查询与日志失效问题,指出因DTO缺少@Data注解导致日志无法格式化、空指针风险及参数穿透,进而引发性能灾难,解决方案为在Controll... 目录一、核心问题:从SQL重复执行到日志失效二、根因剖析:DTO断裂引发的级联故障三、解决方案:修复

C++ 检测文件大小和文件传输的方法示例详解

《C++检测文件大小和文件传输的方法示例详解》文章介绍了在C/C++中获取文件大小的三种方法,推荐使用stat()函数,并详细说明了如何设计一次性发送压缩包的结构体及传输流程,包含CRC校验和自动解... 目录检测文件的大小✅ 方法一:使用 stat() 函数(推荐)✅ 用法示例:✅ 方法二:使用 fsee

Java继承映射的三种使用方法示例

《Java继承映射的三种使用方法示例》继承在Java中扮演着重要的角色,它允许我们创建一个类(子类),该类继承另一个类(父类)的所有属性和方法,:本文主要介绍Java继承映射的三种使用方法示例,需... 目录前言一、单表继承(Single Table Inheritance)1-1、原理1-2、使用方法1-