Columns: Dimensionality Reduction; Reduction Rate; Accuracy on validation set; Best Threshold; AuC; Notes
Baseline: (Notes) Baseline models are using all input features
Missing Values Ratio: 71.4, 82
Low Variance Filter: 73.03, 82 (Notes) Only for numerical columns
High Correlation Filter: 74.2
The answers involved Random Projections, NMF, (Stacked) Auto-encoders, Chi-square or Information Gain, Multidimensional Scaling, Correspondence Analysis, Factor Analysis, Clustering, and Bayesian Models.

Forward Feature Construction adds, at each step, the feature that produces the highest increase in performance.

If an attribute is often selected as the best split, it is most likely an informative feature to retain.

One of my most recent projects was about churn prediction and used the large data set from the 2009 KDD Challenge.

Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite expensive in both time and computation.
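
To make the cost of Forward Feature Construction concrete, here is a minimal sketch in plain Python. It is not the workflow used in the project: the function name `forward_feature_construction`, the toy data `X` and `y`, and the `centroid_accuracy` scorer are all assumptions made up for illustration, and the scorer measures accuracy on the training rows rather than on a validation set to keep the example short.

```python
from typing import Callable, List

def forward_feature_construction(
    n_features: int,
    score: Callable[[List[int]], float],
    max_features: int,
) -> List[int]:
    """Greedy forward selection: start with no features and, at each step,
    add the single feature that gives the highest score."""
    selected: List[int] = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # One model evaluation per remaining candidate: this is the expensive part.
        scores = {f: score(selected + [f]) for f in candidates}
        top_feature = max(scores, key=scores.get)
        if scores[top_feature] <= best_score:
            break  # no remaining feature increases performance
        selected.append(top_feature)
        best_score = scores[top_feature]
    return selected

# Hypothetical toy data: column 0 separates the classes, columns 1 and 2 do not.
X = [
    [0.0, 5.0, 1.0],
    [0.1, 1.0, 0.9],
    [0.2, 4.0, 1.1],
    [1.0, 2.0, 1.0],
    [1.1, 5.0, 0.9],
    [0.9, 1.0, 1.1],
]
y = [0, 0, 0, 1, 1, 1]

def centroid_accuracy(features: List[int]) -> float:
    """Accuracy of a nearest-centroid classifier restricted to `features`
    (evaluated on the training rows, a simplification for this sketch)."""
    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[f] for r in rows) / len(rows) for f in features]

    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for x, t in zip(X, y):
        d0 = sum((x[f] - c) ** 2 for f, c in zip(features, c0))
        d1 = sum((x[f] - c) ** 2 for f, c in zip(features, c1))
        predicted = 1 if d1 < d0 else 0
        correct += int(predicted == t)
    return correct / len(X)

selected_features = forward_feature_construction(3, centroid_accuracy, max_features=3)
print(selected_features)  # prints [0]
```

Each pass through the loop retrains and scores one model per remaining feature, so the number of evaluations grows roughly with the square of the number of input columns. On a data set with thousands of columns, that is what makes the procedure so slow in practice.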