Computational Methods of Feature Selection. Huan Liu and Hiroshi Motoda. CRC Press. Published Titles. Series Editor: Vipin Kumar, University of Minnesota.
Computational Methods of Feature Selection, by Huan Liu and Hiroshi Motoda, reviewed by Longbing Cao and David Taniar. Highlighting current research issues, Computational Methods of Feature Selection introduces the basic concepts and principles as well as state-of-the-art algorithms.
It then reports on some recent results of empowering feature selection, including active feature selection, decision-border estimate, the use of ensembles with independent probes, and incremental feature selection. This is followed by discussions of weighting and local methods, such as the ReliefF family, k-means clustering, local feature relevance, and a new interpretation of Relief.
The book subsequently covers text classification, a new feature selection score, and both constraint-guided and aggressive feature selection. The final section examines applications of feature selection in bioinformatics, including feature construction as well as redundancy-, ensemble-, and penalty-based feature selection.
Through a clear, concise, and coherent presentation of topics, this volume systematically covers the key concepts, underlying principles, and inventive applications of feature selection, illustrating how this powerful tool can efficiently harness massive, high-dimensional data and turn it into valuable, reliable information.
Dozens of algorithms and their comparisons in experiments with synthetic and real data are presented, which can be very helpful to researchers and students working with large data stores.
Overall, we enjoyed reading this book. It presents state-of-the-art guidance and tutorials on methodologies and algorithms in computational methods in feature selection. Enhanced by the editors' insights, and based on previous work by these leading experts in the field, the book forms another milestone of relevant research and development in feature selection.
Summary: Due to increasing demands for dimensionality reduction, research on feature selection has deeply and widely expanded into many fields, including computational statistics, pattern recognition, machine learning, data mining, and knowledge discovery.
Part V is on Feature Selection in Bioinformatics, discussing redundancy-based feature selection, feature construction and selection, ensemble-based robust feature selection, and penalty-based feature selection.
A summary of each chapter is given next. The existence of irrelevant features can misguide clustering results. Feature selection can either be global or local, and the features to be selected can vary from cluster to cluster.
Chapter 3 is also an overview, covering randomization techniques for feature selection. There are two broad classes of algorithms: Las Vegas algorithms, which guarantee a correct answer but may, with small probability, require a long time to execute, and Monte Carlo algorithms, which may, with small probability, output an incorrect answer but always complete execution quickly.
The chapter introduces examples of several randomization algorithms. Chapter 4 addresses the notion of causality and reviews techniques for learning causal relationships from data in applications to feature selection. Only direct causes are strongly causally relevant. It is shown that the minimum mean-squared error criterion is equivalent to the maximum average change criterion. The results obtained by using a mixture model for the joint class-feature distribution show the advantage of the active sampling policy over the random sampling in reducing the number of feature samples.
The approach is computationally expensive. Considering only a random subset of the missing entries at each sampling step is a promising solution. Chapter 6 discusses feature extraction (as opposed to feature selection) based on the properties of the decision border. It is shown that this approach is comparable to the SVM-based decision boundary approach and better than the MLP (Multi-Layer Perceptron)-based approach, but with a lower computational cost.
The key is to use the same distribution in generating a probe. Feature relevance is estimated by averaging the relevance obtained from each tree in the ensemble. It shows excellent performance and low computational complexity, and is able to address massive amounts of data. Chapter 8 introduces an incremental feature selection algorithm for high-dimensional data.
The key idea is to decompose the whole process into feature ranking and selection. The incremental subset search does not retract what it has selected, but it can decide not to add the next candidate feature. Thus, the average number of features used to construct a learner during the search is kept small, which makes the wrapper approach feasible for high-dimensional data.
Relief exploits the context of other features through distance measures and can detect highly conditionally-dependent features. The chapter explains the idea, advantages, and applications of Relief and introduces two extensions: ReliefF and RReliefF. RReliefF is its extension designed for regression.
The variety of the Relief family shows the general applicability of the basic idea of Relief as a non-myopic feature quality measure.
Chapter 10 discusses how to automatically determine the important features in the k-means clustering process. The weight of a feature is determined by the sum of the within-cluster dispersions of the feature, which measures its importance in clustering.
The latter, called subspace k-means clustering, has applications in text clustering, bioinformatics, and customer behavior analysis. Chapter 11 is in line with Chapter 5, but focuses on local feature relevance and weighting. The weight has a large value for a direction along which the class probability is not locally constant. Chapter 12 gives further insights into Relief (refer to Chapter 9).
The weights can be iteratively updated by an EM-like algorithm, which guarantees the uniqueness of the optimal weights and the convergence. The generated features are ranked by scoring each feature independently.
A case study shows considerable improvement of the F-measure by feature selection. It also shows that adding two-word phrases as new features generally gives a good performance gain over features comprising only selected words. The score assumes a probability distribution on the words of the documents. A Bernoulli distribution is assumed when only the presence or absence of a word matters, and a Poisson distribution when the number of occurrences of a word matters.
The score computation is inexpensive, and the value that the score assigns to each word has an appealing Bayesian interpretation when the predictive model corresponds to a naive Bayes model.
Chapter 15 focuses on dimensionality reduction for semi-supervised clustering where some weak supervision is available in terms of pairwise instance constraints (must-link and cannot-link). Two methods are proposed by leveraging pairwise instance constraints. This reduces to an elegant eigenvalue decomposition problem. Feature clustering and data clustering are mutually reinforced in the co-clustering process.
Experiments show that feature redundancy can be as destructive as noise. A new multi-stage approach for text feature selection is proposed: In addition, term redundancy is modeled by a term-redundancy tree for visualization purposes.
For high-throughput data like microarrays, redundancy among genes becomes a critical issue. It is known that if there is a Markov blanket for a feature, the feature can be safely eliminated. Finding a Markov blanket is computationally heavy. The solution proposed is to use an approximate Markov blanket, in which it is assumed that the Markov blanket always consists of one feature. Redundancy-based feature selection makes it possible for a biologist to specify what genes are to be included before feature selection.
Chapter 18 presents a scalable method for automatic feature generation on biological sequence data. Chapter 20 presents a penalty-based feature selection method, elastic net, for genomic data, which is a generalization of lasso (a penalized least squares method with an L1 penalty for regression).
Elastic net has the nice property that irrelevant features receive parameter estimates equal to 0, leading, as with lasso, to sparse and easy-to-interpret models. In addition, strongly correlated relevant features are all selected, whereas in lasso only one of them is selected.
Thus, it is a more appropriate tool for feature selection with high-dimensional data than lasso. As data evolve, new challenges arise and the expectations of feature selection are also elevated, due to its own success. In addition to high-throughput data, the pervasive use of Internet and Web technologies has been bringing about a great number of new services and applications, ranging from recent Web 2.
The frontier of feature selection research is expanding incessantly in answering the emerging challenges posed by the ever-growing amounts of data, multiple sources of heterogeneous data, data streams, and disparate data-intensive applications.
Dy Northeastern University 2. Feature Selection. Feature Selection for Unlabeled Data. Local Approaches.
Moreover, human labeling is expensive and subjective. Hence, unsupervised learning is needed. Besides being unlabeled, several applications are characterized by high-dimensional data.
However, not all of the features domain experts utilize to represent these data are important for the learning task. We have seen the need for feature selection in the supervised learning case. This is also true in the unsupervised case. Unsupervised means there is no teacher, in the form of class labels. One type of unsupervised learning problem is clustering.
In the supervised paradigm, feature selection algorithms maximize some function of prediction accuracy. Since class labels are available in supervised learning, it is natural to keep only the features that are related to or lead to these classes. But in unsupervised learning, we are not given class labels. Which features should we keep? Why not use all the information that we have? The problem is that not all the features are important.
Some of the features may be redundant and some may be irrelevant. Furthermore, the existence of several irrelevant features can misguide clustering results. One reason why some clustering algorithms break down in high dimensions is the curse of dimensionality. Additional dimensions increase the volume exponentially and spread the data out such that the data points look equally far from one another.
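The sparsity effect described above can be checked numerically. The following is a minimal sketch, not taken from the chapter: it assumes points drawn uniformly from the cube [0, 2]^d and counts how many land in the unit box [0, 1]^d. The function name `fraction_in_unit_box` and the sample sizes are illustrative assumptions. Under the uniform assumption the expected fraction is (1/2)^d, so it halves with every added dimension.

```python
import random

def fraction_in_unit_box(n_samples, n_dims, seed=0):
    """Sample points uniformly from [0, 2]^n_dims and return the fraction
    that falls inside the unit box [0, 1]^n_dims."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        point = [rng.uniform(0.0, 2.0) for _ in range(n_dims)]
        if all(x <= 1.0 for x in point):
            inside += 1
    return inside / n_samples

# Compare the empirical fraction with the theoretical value (1/2)^d.
for d in (1, 2, 3, 10):
    print(d, fraction_in_unit_box(20000, d), 0.5 ** d)
```

With a fixed number of samples, the unit box therefore holds fewer and fewer points as the dimension grows, which is the sparsity the text refers to.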
Observe that the data become more and more sparse in higher dimensions.
[Figure: Illustration of the curse of dimensionality. The panels plot sample data generated from a uniform distribution between 0 and 2. The data are sparser with respect to the unit-sized volume in higher dimensions: there are 12 samples in the unit-sized box in (a), 7 samples in (b), and 2 samples in (c).]
As noted earlier, supervised learning has class labels to guide the feature search.
Without any labeled information, in unsupervised learning, we need to make some assumptions. We will see examples of these criterion functions later in this chapter. Before we proceed with how to do feature selection on unsupervised data, it is important to know the basics of clustering algorithms. Section 2. In Section 2. Then, we present the methods for unsupervised feature selection in Sections 2.
There are two types of clustering approaches: partitional and hierarchical. Partitional clustering provides one level of clustering. Hierarchical clustering, on the other hand, provides multiple levels (a hierarchy) of clustering solutions. Hierarchical approaches can proceed bottom-up (agglomerative) or top-down (divisive). Bottom-up approaches typically start with each instance as its own cluster and then, at each level, merge the clusters that are most similar to each other.
Top-down approaches divide the data into k clusters at each level. There are several methods for performing clustering. A survey of these algorithms can be found in [29, 39, 18]. K-means is an iterative algorithm that locally minimizes the sum-squared-error (SSE) criterion. It assumes each cluster has a hyper-spherical structure. The k-means algorithm starts with K initial centroids, then assigns each remaining point to the nearest centroid, updates the cluster centroids, and repeats the process until the K centroids do not change (convergence).
There are two versions of k-means: one originates from Forgy and the other from MacQueen. To avoid poor local optima, one typically applies random restarts and picks the clustering solution with the best SSE.
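The assign-update loop and the random-restart strategy can be sketched as follows. This is a minimal batch-update illustration in plain Python, not the chapter's own code: the function names (`kmeans`, `kmeans_restarts`), the initialization by sampling K data points, and the toy data are all assumptions made for the example.

```python
import random

def sq_dist(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((x - y) ** 2 for x, y in zip(p, c))

def kmeans(points, k, rng, max_iter=100):
    """One k-means run: returns (centroids, labels, sse)."""
    centroids = [list(p) for p in rng.sample(points, k)]  # assumed init scheme
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so centroids are too
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, sse

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means several times and keep the solution with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels, sse = kmeans_restarts(points, k=2)
print(centroids, labels, sse)
```

Keeping the run with the lowest SSE does not guarantee the global optimum; it only reduces the chance of reporting a poor local one.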
One can refer to [47, 4] for other ways to deal with the initialization problem. Standard k-means utilizes Euclidean distance to measure dissimilarity between the data points. Note that one can easily create variants of k-means by modifying this distance metric.
For example, on text data, a more suitable metric is the cosine similarity. One can also modify the objective function, instead of SSE, to other criterion measures to create other clustering algorithms.
EM is a general approach for estimating the maximum likelihood or MAP estimate for missing data problems. In the clustering context, the missing or hidden variables are the class labels.
The EM algorithm iterates between an Expectation step (E-step), which computes the expected complete-data log-likelihood given the observed data and the model parameters, and a Maximization step (M-step), which estimates the model parameters by maximizing the expected complete-data log-likelihood from the E-step, until convergence. In clustering, the E-step is similar to estimating the cluster membership and the M-step estimates the cluster model parameters.
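To make the two steps concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is an illustration under stated assumptions rather than the chapter's implementation: the function name `em_gmm_1d`, the fixed iteration count used in place of a convergence test, the variance floor, and the synthetic data are all choices made for this example.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns updated (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: soft cluster memberships (responsibilities) for each point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([v / total for v in joint])
        # M-step: re-estimate mixing weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid collapse
    return means, variances, weights

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(6, 1) for _ in range(200)]
means, variances, weights = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

The responsibilities computed in the E-step play the role of the hidden class labels: they are soft cluster memberships rather than hard assignments.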
A Gaussian distribution is typically utilized for continuous features and multinomials for discrete features. We repeat and summarize them here for completeness. More realistic search strategies have been studied. Sequential search methods generally use greedy techniques and hence do not guarantee global optimality of the selected subsets, only local optimality.
Examples of sequential searches include sequential forward selection, sequential backward elimination, and bidirectional selection [32, 33]. Marill and Green introduced the sequential backward selection (SBS) method, which starts with all the features and sequentially eliminates one feature at a time (eliminating the feature that contributes least to the criterion function).
Whitney introduced sequential forward selection (SFS), which starts with the empty set and sequentially adds one feature at a time. A problem with these hill-climbing search techniques is that when a feature is deleted in SBS, it cannot be re-selected, while a feature added in SFS cannot be deleted once selected.
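Both greedy searches can be written against an arbitrary subset-evaluation function. The sketch below is illustrative only: `criterion` is a hypothetical callable that scores a list of features (higher is better), and the toy relevance scores and redundancy penalty are invented for the example.

```python
def sequential_forward_selection(features, criterion, n_select):
    """SFS: start from the empty set and greedily add the best feature."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)      # once added, a feature is never removed
        remaining.remove(best)
    return selected

def sequential_backward_selection(features, criterion, n_select):
    """SBS: start from all features and greedily drop the least useful one."""
    selected = list(features)
    while len(selected) > n_select:
        # Drop the feature whose removal leaves the highest criterion value.
        worst = max(selected,
                    key=lambda f: criterion([g for g in selected if g != f]))
        selected.remove(worst)     # once removed, a feature is never re-added
    return selected

# Toy criterion (an assumption for illustration): summed relevance, with a
# penalty when the two redundant features "b" and "c" are both present.
RELEVANCE = {"a": 3.0, "b": 2.0, "c": 1.5, "d": 0.0}

def toy_criterion(subset):
    score = sum(RELEVANCE[f] for f in subset)
    if "b" in subset and "c" in subset:
        score -= 2.0
    return score

print(sequential_forward_selection("abcd", toy_criterion, 2))
print(sequential_backward_selection("abcd", toy_criterion, 2))
```

Neither loop revisits an earlier decision, which is the nesting limitation described above.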
Pudil et al. Random search methods such as genetic algorithms and random mutation hill climbing add some randomness to the search procedure to help it escape from a local optimum. Individual search methods evaluate each feature individually according to a criterion or a condition. They then select the features that either satisfy the condition or are top-ranked.
Some of the features may be irrelevant and some of the features may be redundant. Each feature or feature subset needs to be evaluated based on importance by a criterion.
However, in clustering, these class labels are not available. Suppose data have features F1 and F2 only. Feature F2 does not contribute to cluster discrimination, thus, we consider feature F2 to be irrelevant.
We want to remove irrelevant features because they may mislead the clustering algorithm especially when there are more irrelevant features than relevant ones. In this example, feature F2 is irrelevant because it does not contribute to cluster discrimination. In this example, features F1 and F2 have redundant information, because feature F1 provides the same information as feature F2 with regard to discriminating the two clusters.
Therefore, we consider features F1 and F2 to be redundant.
[Figure: Wrapper approach for feature selection for clustering.]
[Figure: Filter approach for feature selection for clustering.]
However, wrapper methods are more computationally expensive, since one needs to run the learning algorithm for every candidate feature subset. He claims that irrelevant features are features that do not depend on the other features.
Manoranjan et al. They observed that when the data are clustered, the distance entropy at that subspace should be low. He, Cai, and Niyogi  select features based on the Laplacian score that evaluates features based on their locality preserving power. The Laplacian score is based on the premise that two data points that are close together probably belong to the same cluster.
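As an illustration of the locality-preserving idea, the sketch below computes a Laplacian score per feature with NumPy. It follows the commonly cited formulation (a k-nearest-neighbor graph with heat-kernel weights, where a smaller score means the feature better respects the neighborhood structure); the neighborhood size, the kernel width `t`, and the synthetic data are assumptions, and the details should be checked against He, Cai, and Niyogi's paper before relying on them.

```python
import numpy as np

def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Return one Laplacian score per column of X (smaller is better)."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Symmetrised k-nearest-neighbor graph (column 0 of argsort is the point itself).
    order = np.argsort(sq_dists, axis=1)[:, 1:n_neighbors + 1]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), n_neighbors), order.ravel()] = True
    adj = adj | adj.T
    S = np.where(adj, np.exp(-sq_dists / t), 0.0)   # heat-kernel edge weights
    d = S.sum(axis=1)                               # node degrees
    L = np.diag(d) - S                              # graph Laplacian
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_tilde = f - (f @ d) / d.sum()             # remove the weighted mean
        scores.append((f_tilde @ L @ f_tilde) / (f_tilde @ (d * f_tilde)))
    return np.array(scores)

# Synthetic data: feature 0 separates two tight clusters, the rest are noise.
rng = np.random.default_rng(0)
cluster_feature = np.concatenate([rng.normal(0, 0.1, 30), rng.normal(5, 0.1, 30)])
X = np.column_stack([cluster_feature, rng.normal(0, 1, size=(60, 3))])
print(laplacian_scores(X))
```

On data like this, the cluster-separating feature varies little between neighboring points relative to its overall variance, so it receives the smallest score.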
One can cluster the features using a k-means clustering [36, 17] type of algorithm with feature correlation as the similarity metric. Instead of a cluster mean, represent each cluster by the feature that has the highest correlation among features within the cluster it belongs to. Popular techniques for dimensionality reduction without labels are principal components analysis (PCA), factor analysis, and projection pursuit [20, 27]. But rather than selecting a subset of the features, they involve some type of feature transformation.
PCA and factor analysis aim to reduce the dimension such that the representation is as faithful as possible to the original data. As such, these techniques aim at reducing dimensionality by removing redundancy. In this case, projection pursuit addresses relevance. Another method is independent component analysis (ICA).
We are interested in subsets of the original features, because we want to retain the original meaning of the features. Moreover, transformations would still require the user to collect all the features to obtain the reduced set, which is sometimes not desired. They incorporate the clustering algorithm inside the feature search and selection. Wrapper approaches consist of: See Figure 2.
One can build a feature selection wrapper approach for clustering by simply picking a favorite search method any method presented in Section 2. However, there are issues that one must take into account in creating such an algorithm. Dy and Brodley investigated the issues involved in creating a general wrapper method where any feature selection, clustering, and selection criteria can be applied.
The second issue they discovered is that various selection criteria are biased with respect to dimensionality. They then introduced a cross-projection normalization scheme that can be utilized by any criterion function. When we are searching for the best feature subset, we run into a new problem: The value of the number of clusters depends on the feature subset. The number of cluster components varies with dimension. Feature evaluation criterion should not be biased with respect to dimensionality.
Dy and Brodley examined two feature selection criteria: maximum likelihood and scatter separability. They showed that the scatter separability criterion prefers higher dimensionality.
However, the separability criterion may not be monotonically increasing with respect to dimension when the clustering assignments change. Scatter separability or the trace criterion prefers higher dimensions, intuitively, because data are more scattered in higher dimensions, and mathematically, because adding features means adding more terms in the trace function. Ideally, one would like the criterion value to remain the same if the discrimination information is the same.
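The dimensionality bias can be observed directly. The sketch below assumes the common form of the trace criterion, trace(S_w^{-1} S_b), with S_w the prior-weighted within-cluster scatter and S_b the between-cluster scatter; the function name and the synthetic data are illustrative. With the cluster assignments held fixed, appending pure-noise features never lowers the value.

```python
import numpy as np

def scatter_separability(X, labels):
    """Trace criterion trace(S_w^{-1} S_b) for a given clustering."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))  # within-cluster scatter
    S_b = np.zeros((d, d))  # between-cluster scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        prior = len(Xc) / n
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += prior * (centered.T @ centered) / len(Xc)
        diff = (mean_c - overall_mean)[:, None]
        S_b += prior * (diff @ diff.T)
    return float(np.trace(np.linalg.solve(S_w, S_b)))

# Two clusters separated along feature 0; features 1 and 2 are pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[:, 0] += 4 * labels

for n_features in (1, 2, 3):
    print(n_features, scatter_separability(X[:, :n_features], labels))
```

The printed values are non-decreasing in the number of features even though the added features carry no cluster information, which is why an unnormalized comparison across subsets of different sizes is unfair.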
Maximum likelihood, on the other hand, prefers lower dimensions. The problem occurs when we compare feature set A with feature set B wherein set A is a subset of B. For sequential searches, this can lead to the trivial result of selecting only a single feature.
To ameliorate this bias, Dy and Brodley  suggest a cross-projection scheme that can be applied with any feature evaluation criterion. In the same way, we obtain the clusters C2 using the features in S2. Which feature subset, S1 or S2 , enables us to discover better clusters? When the normalized criterion values are equal for Si and Sj , we favor the subset with the lower cardinality. Another way to normalize the bias of a feature evaluation criterion with respect to dimensionality is to measure the criterion function of the clustering solution obtained by any subset Si onto the set of all of the original features.
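One way to read the cross-projection idea is sketched below. It assumes the normalization multiplies the criterion of each clustering evaluated in its own subset by the criterion of the same clustering projected onto the other subset, that is, crit(S1, C1) * crit(S2, C1) versus crit(S2, C2) * crit(S1, C2); this reading should be verified against Dy and Brodley's paper. The clustering routine and the criterion here are hypothetical toy stand-ins, not the methods used in the chapter.

```python
def mean(values):
    return sum(values) / len(values)

def toy_cluster(data, subset):
    """Hypothetical stand-in for a clustering algorithm: split the points
    into two clusters at the mean of the first selected feature."""
    f = subset[0]
    threshold = mean([row[f] for row in data])
    return [int(row[f] > threshold) for row in data]

def toy_criterion(data, subset, labels):
    """Hypothetical separability stand-in: between-cluster over within-cluster
    sum of squares per feature, summed over the selected features (so it
    tends to grow as features are added)."""
    total = 0.0
    for f in subset:
        values = [row[f] for row in data]
        overall = mean(values)
        between = within = 0.0
        for c in set(labels):
            members = [v for v, l in zip(values, labels) if l == c]
            m = mean(members)
            between += len(members) * (m - overall) ** 2
            within += sum((v - m) ** 2 for v in members)
        total += between / within
    return total

def cross_projection_choice(data, s1, s2, cluster=toy_cluster, crit=toy_criterion):
    """Pick between subsets s1 and s2 using cross-projection normalization."""
    c1 = cluster(data, s1)  # clusters found using the features in s1
    c2 = cluster(data, s2)  # clusters found using the features in s2
    norm1 = crit(data, s1, c1) * crit(data, s2, c1)
    norm2 = crit(data, s2, c2) * crit(data, s1, c2)
    if norm1 == norm2:      # tie: favor the subset with lower cardinality
        return s1 if len(s1) <= len(s2) else s2
    return s1 if norm1 > norm2 else s2

# Feature 0 separates two groups; feature 1 is unrelated to them.
data = [(0.0, 1.0), (0.2, 3.0), (0.1, 2.0), (0.3, 4.0),
        (5.0, 2.5), (5.2, 0.5), (5.1, 3.5), (5.3, 1.5)]
print(cross_projection_choice(data, [0], [0, 1]))
print(cross_projection_choice(data, [0], [1]))
```

In the first comparison both subsets yield the same clusters, so the normalized values tie and the smaller subset is kept, even though the raw toy criterion is larger for the two-feature subset.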
This way, one can compare any candidate subset. Now, one can build any feature selection wrapper approach for unlabeled data, by performing any favorite feature search, clustering, and evaluation criterion, and take these two issues into account. These two criteria need not be the same. For example, an appropriate metric for text data might be the cosine similarity or a mixture of multinomial model for clustering.
The feature evaluation criterion should quantify what type of features the user is interested in. Unlike supervised learning, which has class labels to guide the feature search, unsupervised feature selection relies on criterion functions and would thus require domain knowledge to choose the appropriate objective functions.
To evaluate the feature subset, they tried maximum likelihood and scatter separability. Maximum likelihood (ML) is the same criterion used in the clustering algorithm. ML prefers the feature subspace that can be modeled best as a Gaussian mixture. They also explored scatter separability, because it can be used with many clustering algorithms.
Scatter separability is similar to the criterion function used in discriminant analysis. It measures how far apart the clusters are from each other normalized by their within cluster distance. High values of ML and scatter separability are desired. The conclusion was that no one criterion is best for all applications. For an image retrieval application, Dy et al.
The features were continuous valued image features; hence, the choice of the Gaussian mixture model for clustering, and since the goal was to retrieve similar images from the same cluster, the separability criterion was chosen for selecting the features. Vaithyanathan and Dom  proposed a probabilistic objective function for both feature selection and clustering, and applied it to text. They modeled the text data as a mixture of multinomials and used a Bayesian approach to estimate the parameters.
To search the feature space, they applied distributional clustering to pre-select candidate subsets and then picked the candidate subset that led to the largest value in the objective function. They address dimensionality bias by formulating the objective function as the integrated likelihood of the joint distribution of the relevant and irrelevant features and assumed the relevant and irrelevant features as conditionally independent given the class.
The dimensionality of the objective function will be equal to the original number of features no matter how many relevant features there are. Kim, Street, and Menczer  applied an evolutionary local selection algorithm ELSA to search the feature subset and number of clusters on two clustering algorithms: Law, Figueiredo, and Jain  added feature saliency, a measure of feature relevance, as a missing variable to a probabilistic objective function. The objective function was similar to that in  i.
To add feature saliency, they utilized the conditional feature independence assumption to build their model. Then, they derived an Expectation-Maximization EM  algorithm to estimate the feature saliency for a mixture of Gaussians. They also developed a wrapper approach that selects features using Kullback-Leibler divergence and entropy.
Another way to group the methods is based on whether the approach is global or local. Global methods select a single set of features for all the clusters.
Local methods select subsets of features associated with each cluster. All the methods presented earlier are global methods. In this section, we present two types of local unsupervised feature selection approaches: subspace clustering and co-clustering. Typical subspace clustering approaches measure the existence of a cluster in a feature subspace based on density.
They take advantage of the downward closure property of density to reduce the search space. One can start from one dimension going up until no more dense units are found. When no more dense units are found, the algorithm combines adjacent dense units to form clusters. Density is measured by creating histograms in each dimension and measuring the density within each bin. Here is where the term subspace clustering was coined. CLIQUE proceeds level-by-level from one feature to the highest dimension or until no more feature subspaces with clusters (regions with high-density points) are generated.
Once the dense units are found, CLIQUE keeps the units with high coverage (the fraction of the dataset covered by the dense units). Then, clusters are found by combining adjacent dense and high-coverage units. By combining adjacent units, CLIQUE is capable of discovering irregular-shaped clusters, and points can belong to multiple clusters. A subspace with clusters typically has lower entropy than one without clusters. To learn more about subspace clustering, see the survey by Parsons et al.
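The bottom-up use of downward closure can be sketched for the first two levels. This is a simplified illustration, not CLIQUE itself: it assumes equal-width bins over a known range and a fixed count threshold `min_count`, and it stops at two dimensions without the coverage pruning or the merging of adjacent units. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_bins, low, high, min_count):
    """Find dense 1-D units, then dense 2-D units built only from them.
    A unit is a tuple of (dimension, bin_index) pairs."""
    width = (high - low) / n_bins

    def bin_of(x):
        return min(int((x - low) / width), n_bins - 1)

    n_dims = len(points[0])
    binned = [tuple(bin_of(x) for x in p) for p in points]

    # Level 1: histogram each dimension and keep bins with enough points.
    counts1 = Counter((d, b[d]) for b in binned for d in range(n_dims))
    level1 = {((d, i),) for (d, i), c in counts1.items() if c >= min_count}

    # Level 2: by downward closure, a dense 2-D unit must project onto two
    # dense 1-D units, so only those combinations are counted.
    candidates = {(u[0], v[0])
                  for u, v in combinations(sorted(level1), 2)
                  if u[0][0] != v[0][0]}
    counts2 = Counter()
    for b in binned:
        for cand in candidates:
            if all(b[d] == i for d, i in cand):
                counts2[cand] += 1
    level2 = {cand for cand, c in counts2.items() if c >= min_count}
    return {1: level1, 2: level2}

# Five points that agree on dimensions 0 and 1 but are spread out on dimension 2.
points = [(0.10, 0.10, 0.05), (0.12, 0.20, 0.30), (0.05, 0.15, 0.55),
          (0.20, 0.05, 0.80), (0.15, 0.22, 0.90)]
print(dense_units(points, n_bins=4, low=0.0, high=1.0, min_count=4))
```

Because dimension 2 has no dense bin, no candidate involving it is ever counted, which is how the closure property prunes the search space.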
An approach called co-clustering, initially inspired by Hartigan, has recently become popular due to research on microarray analysis. Co-clustering, also known as bi-clustering, is simply the clustering of both the row (sample) space and the column (feature) space simultaneously.
The algorithms for performing co-clustering typically quantify the quality of a co-clustering as a measure of the approximation error between the original data matrix and the matrix reconstructed from the co-clustering. Dhillon, Mallela, and Modha introduced an information-theoretic formulation for co-clustering. Banerjee et al. Bregman divergence covers a large class of divergence measures, which include the Kullback-Leibler divergence and the squared Euclidean distance. They show that the update steps that alternately update the row and column clusters and the minimum Bregman solution will progressively decrease the matrix approximation error and lead to a locally optimal co-clustering solution.
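The notion of approximation error can be illustrated with the simplest possible reconstruction. The sketch below assumes each entry is replaced by the mean of its (row cluster, column cluster) block and measures the squared Euclidean error; this is only one instance of the divergences mentioned above, and the function name and toy matrix are assumptions for the example.

```python
def coclustering_error(matrix, row_labels, col_labels):
    """Reconstruct the matrix from block means and return
    (reconstruction, squared approximation error)."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    block_mean = {key: sums[key] / counts[key] for key in sums}
    recon = [[block_mean[(row_labels[i], col_labels[j])]
              for j in range(len(row))]
             for i, row in enumerate(matrix)]
    error = sum((value - approx) ** 2
                for row, recon_row in zip(matrix, recon)
                for value, approx in zip(row, recon_row))
    return recon, error

matrix = [[1, 1, 5, 5],
          [1, 1, 5, 5],
          [9, 9, 2, 2]]
# A co-clustering that matches the block structure reconstructs the matrix exactly.
print(coclustering_error(matrix, [0, 0, 1], [0, 0, 1, 1])[1])
# A column clustering that mixes the two column groups leaves a large error.
print(coclustering_error(matrix, [0, 0, 1], [0, 1, 0, 1])[1])
```

An alternating algorithm would repeatedly reassign rows and columns so that this error decreases, stopping at a locally optimal co-clustering.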
Cheng and Church and Cho et al. Their algorithm, COSA, starts by initializing the weights for the features; it then clusters the data based on these weights and recomputes the weights until the solution stabilizes.
The cluster update minimizes a criterion that minimizes the inverse exponential mean with separate attribute weighting within each cluster. The data points in high dimensions would look equally far. Because of this, many clustering algorithms break down in high dimensions.
In addition, usually not all the features are important — some are redundant and some are irrelevant. Data with several irrelevant features can misguide the clustering results.
There are two ways to reduce the dimensionality: feature transformation and feature selection. Feature transformation reduces the dimension by applying some type of linear or non-linear function to the original features, whereas feature selection selects a subset of the original features. One may prefer feature selection to transformation in order to keep the original meaning of the features.
Furthermore, after feature selection, one does not need to measure the features that are not selected. Feature transformation, on the other hand, still needs all the features to extract the reduced dimensions.
This chapter presents a survey of methods to perform feature selection on unsupervised data. One can select a global set of features or a local set. Global means that one selects a single subset of features that clusters the data.
Local feature selection methods include subspace clustering and co-clustering approaches. Thus, no single criterion is best for all applications. This led to research work on visualization as a guide to feature search . This led Kim, Street, and Menczer  to optimize multi-objective criteria. Knowing a few labeled points or constrained must-link and cannot-link pairs can help guide the feature search.
References R. Agrawal, J. Gehrke, D. Gunopulos, and P. Automatic subspace clustering of high dimensional data for data mining applications. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. A generalized maximum entropy approach to Bregman co-clustering and matrix approximations. Adaptive Control Processes.
Bradley and U. Censor and S. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, Chang and D. A new cell-based clustering method for large, high-dimensional data in data mining applications.
Cheng, A. Fu, and Y. Entropy-based subspace clustering for mining numerical data. ACM Press, August Cheng and G.
Biclustering of expression data. Cho, I. Dhillon, Y. Guan, and S. Minimum sum-squared residue co-clustering of gene expression data. Dempster, N. Laird, and D. Maximum likelihood from incomplete data via the EM algorithm. Devaney and A. Dhillon, S. Mallela, and D. Information-theoretic co-clustering. Interactive visualization and feature selection for unsupervised data.
Dy, C. Brodley, A. Kak, L. Broderick, and A. Unsupervised feature selection applied to content-based retrieval of lung images. Fern and C. Solving cluster ensemble problems by bipartite graph partitioning. Cluster analysis of multivariate data: Biometrics, Fraley and A. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97 Fred and A.
Combining multiple clustering using evidence accumulation. Exploratory projection pursuit. Journal American Statistical Association, Friedman and J. Clustering objects on subsets of attributes. Journal Royal Statistical Society B, Concept formation and attention.
Goil, H. Nagesh, and A. Journal of Machine Learning Research, 3: Direct clustering of a data matrix. Journal of the American Statistical Association, 67 He, D.
Cai, and P. Laplacian score for feature selection. Weiss, B. Projection pursuit. The Annals of Statistics, 13 2: Survey on independent component analysis. Neural Computing Surveys, 2: Jain, M. Murty, and P. Data clustering: A review. ACM Computing Surveys, 31 3: Principal Component Analysis. Kim, N. Street, and F. Evolutionary model selection in unsupervised learning.
Intelligent Data Analysis, 6: Feature set search algorithms. In Pattern Recognition and Signal Processing, pages 41—60, Kohavi and G. Wrappers for feature subset selection. Law, M. Figueiredo, and A. Feature selection in mixturebased clustering. Liu, Y. Xia, and P. Clustering through decision tree construction. Mathematical Statistics and Probability, 5th, Berkeley, 1: Manoranjan, K. Choi, P. Scheuermann, and H. Marill and D. McLachlan and D.
Finite Mixture Models.
Wiley, New York, Narendra and K. A branch and bound algorithm for feature subset selection. Parsons, E. Haque, and H. Subspace clustering for high dimensional data: Procopiuc, M. Jones, P. Agarwal, and T. A monte carlo algorithm for fast projective clustering. Floating search methods in feature selection. Pattern Recognition Letters, Estimating the dimension of a model. The Annals of Statistics, 6 2: Strehl and J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions.
Journal on Machine Learning Research, 3 , Su and J.
In search of deterministic methods for initializing kmeans andgaussian mixture clustering. Intelligent Data Analysis, 11 4 , Feature selection as a preprocessing step for hierarchical clustering. Vaithyanathan and B. Model selection in unsupervised learning with applications to document clustering.
Stracuzzi Arizona State University
3. Types of Randomizations. Randomized Complexity Classes. Applying Randomization to Feature Selection.
For many applications, randomized algorithms are either the simplest or the fastest algorithms available, and sometimes both.
This chapter provides an overview of randomization techniques as applied to feature selection. Motwani and Raghavan provide a broader and more widely applicable introduction to randomized algorithms.
Learning algorithms must often make choices during execution. In the context of feature selection, randomized methods tend to be useful when the space of possible feature subsets is prohibitively large.
Likewise, randomization is often called for when deterministic feature selection algorithms are prone to becoming trapped in local optima.
In these cases, the ability of randomization to sample the feature subset space is of particular value. We then provide an overview of three complexity classes used in the analysis of randomized algorithms. Following this brief theoretical introduction, we discuss explicit methods for applying randomization to feature selection problems, and provide examples.
Finally, the chapter concludes with a discussion of several advanced issues in randomization, and a summary of key points related to the topic.
Las Vegas algorithms always output a correct answer, but with small probability may require a long time to execute. One example of a Las Vegas algorithm is the randomized quicksort algorithm (see Cormen, Leiserson, and Rivest, for example). Randomized quicksort selects a pivot point at random, but always produces a correctly sorted output. The goal of randomization is to avoid degenerate inputs, such as a pre-sorted sequence, which produce the worst-case O(n^2) runtime of the deterministic (pivot point always the same) quicksort algorithm.
Monte Carlo algorithms may output an incorrect answer with small probability, but always complete execution quickly.
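As an example of a Monte Carlo algorithm, consider the following method for estimating the value of π. Draw a circle inside a square such that the sides of the square are tangent to the circle. Next, toss pebbles or coins randomly in the direction of the square. Pebbles that land outside the square are ignored. Among the pebbles that land inside the square, the fraction that also land inside the circle approximates π/4, the ratio of the two areas. Notice that the longer the algorithm runs (the more pebbles tossed), the more accurate the solution. This is a common, but not required, property of randomized algorithms.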
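The two kinds of randomization can be contrasted in a short Python sketch. It is not taken from the chapter: the function names, the unit circle centered in a 2-by-2 square, and the choice of π as the estimated quantity are assumptions made for illustration. `randomized_quicksort` behaves like a Las Vegas procedure (the output is always correct, only the running time is random), while `estimate_pi` behaves like a Monte Carlo procedure (the running time is fixed by the number of pebbles, and the answer is only approximate).

```python
import random


def randomized_quicksort(items, rng):
    """Las Vegas style: always returns a correctly sorted copy of items."""
    if len(items) <= 1:
        return list(items)
    # The pivot is chosen uniformly at random, so no fixed input
    # (such as a pre-sorted list) is reliably a worst case.
    pivot = items[rng.randrange(len(items))]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller, rng) + equal + randomized_quicksort(larger, rng)


def estimate_pi(num_pebbles, rng):
    """Monte Carlo style: fixed amount of work, approximate answer."""
    inside_circle = 0
    for _ in range(num_pebbles):
        # Every pebble lands inside the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside_circle += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside_circle / num_pebbles


rng = random.Random(0)  # seeded only to make the sketch repeatable
sorted_values = randomized_quicksort([5, 3, 8, 1, 3], rng)
rough_pi = estimate_pi(1_000, rng)
better_pi = estimate_pi(100_000, rng)
```

Increasing `num_pebbles` trades computation time for accuracy, which is the property the pebble example is meant to show.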
Algorithms that generate initial solutions quickly and then improve them over time are also known as anytime algorithms. Anytime algorithms provide a mechanism for trading solution quality against computation time.
This approach is particularly relevant to tasks, such as feature selection, in which computing the optimal solution is infeasible. Such algorithms are typically also labeled as Monte Carlo. The type of randomization used for a given problem depends on the nature and needs of the problem.
In this section, we provide a brief introduction to three complexity classes of practical importance for randomized algorithms. Papadimitriou provides a rigorous and detailed discussion of these and other randomized complexity classes.
Randomized algorithms are related to nondeterministic algorithms, which may have more than one possible next step available at each step of execution. Contrast this to deterministic algorithms, which have exactly one next step available at each step of the algorithm. The well-known class NP therefore includes languages accepted by nondeterministic algorithms in a polynomial number of steps, while class P does the same for languages accepted by deterministic algorithms. For example, consider the class RP, for randomized polynomial time.
RP encompasses algorithms that accept good inputs (members of the underlying language) with non-trivial probability, always reject bad inputs (nonmembers of the underlying language), and always execute in polynomial time. The complement of this class, co-RP, then corresponds to the set of algorithms that can make mistakes only if the input string is not a member of the target language.
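One concrete algorithm with exactly this kind of one-sided error is Freivalds' check for a matrix product, which is not discussed in the chapter and is used here only as an illustration. For the language of triples (A, B, C) with AB = C, the check always accepts members and wrongly accepts a nonmember with probability at most 1/2 per trial, so it can err only on nonmembers, as in the co-RP description above. The sketch below assumes square-compatible matrices given as lists of lists of integers; all function names are hypothetical.

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def freivalds_trial(a, b, c, rng):
    """One randomized trial: compare A(Br) with Cr for a random 0/1 vector r."""
    r = [rng.randrange(2) for _ in range(len(b[0]))]
    return mat_vec(a, mat_vec(b, r)) == mat_vec(c, r)


def probably_product(a, b, c, trials, rng):
    """Accept only if every independent trial accepts.

    If AB == C this always returns True. If AB != C, each trial
    wrongly accepts with probability at most 1/2, so all trials
    wrongly accept with probability at most 2 ** -trials.
    """
    return all(freivalds_trial(a, b, c, rng) for _ in range(trials))


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
correct = [[2, 1], [4, 3]]  # the true product AB
wrong = [[2, 1], [4, 4]]    # differs from AB in one entry

rng = random.Random(0)
accepts_member = probably_product(a, b, correct, 40, rng)
accepts_nonmember = probably_product(a, b, wrong, 40, rng)
```

Each trial costs only matrix-vector products, which is why such a check can be cheaper than recomputing AB outright.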
Figure: Illustration of the randomized complexity classes in relation to each other and the deterministic classes P and NP.
For a language in both RP and co-RP, two algorithms are available: one never outputs a false positive, while the other never outputs a false negative. By conducting many repeated and independent executions of both algorithms, we are guaranteed to eventually arrive at the correct output.
Recall that Las Vegas algorithms always output the correct answer, but may take a long time to do so. This intersection is also known as the class ZPP, for polynomial randomized algorithms with zero probability of error. In practice we can use algorithms in RP to construct Monte Carlo algorithms that produce the correct output with high probability simply by running them polynomially many times.
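The effect of repeating an RP algorithm can be made explicit with a short calculation. The symbols here are assumptions introduced for illustration: p is the per-run probability of accepting a member x of the language L (the "non-trivial probability" above), k is the number of independent runs, and δ is the target error probability. The combined procedure accepts if any run accepts.

```latex
\Pr[\text{all } k \text{ runs reject} \mid x \in L] \le (1-p)^k \le e^{-pk},
\qquad
\Pr[\text{some run accepts} \mid x \notin L] = 0 .
% Hence k \ge \tfrac{1}{p}\ln\tfrac{1}{\delta} runs suffice for error at most \delta.
% Example: p = \tfrac12,\ k = 20 \ \Rightarrow\ (1-p)^k = 2^{-20} \approx 9.5 \times 10^{-7}.
```

Because the error shrinks exponentially in k, a polynomial number of repetitions is already far more than is needed to reach any fixed error target when p is a constant.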
The third and largest complexity class of practical importance is BPP, for polynomial time algorithms with bounded probability of error. This class encompasses algorithms that accept good inputs a majority of the time and reject bad inputs a majority of the time. As with RP and ZPP, we can create an algorithm that produces the correct result with high probability simply by repeatedly executing an algorithm that meets the stated minimums. Figure 3. Finally, note that the randomized complexity classes are semantic classes, as opposed to syntactic classes such as P and NP.
For example, we can determine whether an algorithm is a member of class P by counting the number of times the input is processed. Conversely, we must consider the probability that a given input is accepted to determine membership in the class RP. No complete problems are known for such classes.
In some cases there may be only one clear option. Given a set of supervised training examples described by a set of input features or variables x and a target concept or function y, produce a subset of the original input variables that best predicts the target concept or function when combined into a hypothesis by a learning algorithm.
In this context, there are at least two possible sources of randomization. A feature selection algorithm may choose at random which variables to include in a subset. The resulting algorithm searches for the best variable subset by sampling the space of possible subsets.
This approach to randomization carries an important advantage. As compared to the popular greedy stepwise search algorithms [1, 8], which add or remove a single variable at a time, randomization protects against local minima. The probability of selecting one particular subset at random out of all possible subsets is simply too small. A parameter must be set arbitrarily within the algorithm, or the algorithm can be run until the available computation time expires as an anytime algorithm.
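A minimal sketch of this first source of randomization, assuming a caller-supplied scoring function: subsets are drawn uniformly at random and the best one seen so far is kept. The names `random_subset_search`, `score_subset`, and `toy_score` are hypothetical, and `max_samples` plays the role of the arbitrarily set stopping parameter mentioned above. In practice `score_subset` would train and evaluate a learner. Here a toy score stands in for it.

```python
import random


def random_subset_search(num_features, score_subset, max_samples, rng):
    """Sample feature subsets uniformly and keep the best one seen so far."""
    best_subset = frozenset()
    best_score = score_subset(best_subset)
    for _ in range(max_samples):
        # Including each feature independently with probability 1/2
        # makes every one of the 2**num_features subsets equally likely.
        candidate = frozenset(
            i for i in range(num_features) if rng.random() < 0.5
        )
        score = score_subset(candidate)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score


def toy_score(subset):
    """Toy stand-in for learner performance (an assumption of this sketch):
    features 0 and 3 are relevant, every other feature carries a small penalty."""
    relevant = {0, 3}
    return len(subset & relevant) - 0.1 * len(subset - relevant)


best, score = random_subset_search(6, toy_score, 2000, random.Random(0))
```

Because the best subset so far is always available, the loop can also be stopped whenever the computation budget runs out, which gives the anytime behavior described above.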
The second possible source of randomization is the set of training examples, often known as the prototype selection problem. If the number of available examples is very large, an algorithm can select at random which examples to include in a given subset evaluation. The resulting algorithm may conduct a traditional deterministic search through the space of feature subsets, but evaluates those subsets based on a random sample of data. This option is particularly useful when the number of examples available is intractably large, or the available computation time is short.
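The second source of randomization can be sketched in the same spirit: the search over subsets is left to the caller, and only the evaluation draws a random sample of examples. Everything named here is hypothetical. In particular, `pattern_accuracy` is a deliberately simple stand-in scorer (predict the majority label for each distinct feature pattern) chosen so that the sketch needs no learning library. It is not the evaluation used in the chapter.

```python
import random
from collections import Counter, defaultdict


def pattern_accuracy(rows, labels):
    """Stand-in scorer: accuracy of predicting the majority label
    for each distinct row pattern. Assumes rows is non-empty."""
    by_pattern = defaultdict(Counter)
    for row, label in zip(rows, labels):
        by_pattern[tuple(row)][label] += 1
    correct = sum(max(counts.values()) for counts in by_pattern.values())
    return correct / len(rows)


def sampled_score(examples, labels, subset, sample_size, rng):
    """Score a feature subset using only a random sample of the examples."""
    columns = sorted(subset)
    # Sample example indices without replacement, capped at the data size.
    chosen = rng.sample(range(len(examples)), min(sample_size, len(examples)))
    rows = [[examples[i][j] for j in columns] for i in chosen]
    return pattern_accuracy(rows, [labels[i] for i in chosen])


# Toy data (an assumption): three binary features, label equals feature 0.
examples = [[i % 2, (i // 2) % 2, (i // 4) % 2] for i in range(800)]
labels = [row[0] for row in examples]

rng = random.Random(0)
score_relevant = sampled_score(examples, labels, {0}, 200, rng)
score_irrelevant = sampled_score(examples, labels, {1}, 200, rng)
```

Each call touches only `sample_size` examples, so the cost of one subset evaluation no longer grows with the size of the full data set. The price is that scores become noisy estimates.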
Randomization itself is a problem solving heuristic. In many cases there is no guarantee that the best possible solution will be found, but often a relatively good solution is found with an acceptable amount of computation.
Many algorithms employ multiple heuristics. One type of heuristic appropriate to a randomized algorithm is a sampling bias. In the context of feature selection, an algorithm that always samples uniformly from the entire space of feature subsets to obtain its next candidate solution uses randomization as its only heuristic. However, algorithms that bias their samples, for example by sampling only in the neighborhood of the current best solution, employ a second heuristic in conjunction with randomization.
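The difference between the two sampling schemes can be sketched as follows, assuming feature subsets are represented as frozensets of feature indices. `sample_uniform` uses randomization alone, while `sample_neighbor` adds a sampling bias by proposing only subsets that differ from the current best in a few features. All names, and the single-flip neighborhood, are assumptions made for illustration.

```python
import random


def sample_uniform(num_features, rng):
    """Unbiased sampling: every subset is equally likely."""
    return frozenset(i for i in range(num_features) if rng.random() < 0.5)


def sample_neighbor(current, num_features, rng, flips=1):
    """Biased sampling: toggle `flips` randomly chosen features of `current`."""
    toggled = frozenset(rng.sample(range(num_features), flips))
    return current ^ toggled  # symmetric difference adds or removes them


def local_random_search(num_features, score_subset, max_samples, rng):
    """Randomized search that samples only near the current best subset."""
    best = sample_uniform(num_features, rng)
    best_score = score_subset(best)
    for _ in range(max_samples):
        candidate = sample_neighbor(best, num_features, rng)
        score = score_subset(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

With a single flip per step this search can stall when no one-feature change improves the score, which is one way a sampling bias can be a poor match for a given selection problem.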
We illustrate several examples in the following section. However, not all sampling biases are appropriate to all selection problems. A sampling bias that quickly focuses the search on a small set of features may not be appropriate if there are several disjoint feature sets capable of producing good learner performance.
Likewise, an approach that samples the space broadly throughout the search may not be appropriate if the number of features is large but few are relevant. As noted above, randomization may not be a good choice of heuristic if there is some reason to believe that only a very small number of feature subsets produce desirable results, while all other subsets produce undesirable results.
Successful application of a randomized or deterministic selection algorithm requires some understanding of the underlying feature space.
The heuristics and sampling biases used must be appropriate to the given task.