Leveraging kindred projections for dimensionality reduction and improved classification

Diego Castaneda

doi:10.17077/etd.005336

Thesis

Open access

University of Iowa

Master of Science (MS), University of Iowa

Spring 2020

File: Leveraging_Kindred_Projections_for_Dimensionality_Reduction_and_Improved_Classification (PDF, 423.63 kB)

Abstract

Classification of multi-dimensional data is an essential task in machine learning. Although a large number of features may seem beneficial for classification, most machine learning algorithms perform poorly in high dimensions because the feature space is noisy and can lead to over-fitted models. Random Forests create random projections, or subsets of attributes, and build many classification trees over these projections, which makes them very effective on high-dimensional data. In this thesis, we propose combining data projections with clustering as an effective way of categorizing data samples within distinct feature projections. We use the term Kindred Projections (KP) to refer to multi-dimensional projections of related features. We apply a machine learning approach to select the best clustering over each KP. A clustering's quality is measured by the adjusted mutual information criterion, which indicates how well a grouping corresponds to the desired outcome variables. We then train a Logistic Regression model with the learned clusterings over the KPs as predictors. The proposed method reduces dimensionality while maintaining the interpretability of the model. We evaluate the approach on four publicly available datasets. It yields AUC improvements of up to 18% over Logistic Regression, Random Forest, and Naive Bayes classifiers built on the original set of features. Additionally, the KP models reduce dimensionality by up to 76% relative to models fit on the original feature space.
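The pipeline described in the abstract can be illustrated with a minimal Python sketch using scikit-learn. This is not the thesis implementation: the abstract does not specify the clustering algorithm, the candidate cluster counts, how features are grouped, or how cluster labels are encoded. The use of `KMeans`, the `k_values` range, the consecutive-column `groups`, one-hot encoding, and the synthetic data are all assumptions made for illustration, and `fit_best_clustering` and `cluster_features` are hypothetical helper names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_mutual_info_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def fit_best_clustering(X_group, y, k_values=(2, 3, 4, 5), seed=0):
    """Fit KMeans for each candidate k and keep the model whose
    training cluster labels have the highest AMI with the outcome y."""
    best_model, best_ami = None, -np.inf
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_group)
        ami = adjusted_mutual_info_score(y, model.labels_)
        if ami > best_ami:
            best_model, best_ami = model, ami
    return best_model, best_ami


# Synthetic stand-in data (assumption: not one of the thesis datasets).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Hypothetical feature groups standing in for kindred projections:
# here simply consecutive blocks of columns.
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Select one clustering per group, using training labels only.
models = [fit_best_clustering(X_train[:, g], y_train)[0] for g in groups]


def cluster_features(X_part):
    """Map each sample to one cluster label per feature group."""
    return np.column_stack(
        [m.predict(X_part[:, g]) for m, g in zip(models, groups)]
    )


Z_train = cluster_features(X_train)
Z_test = cluster_features(X_test)

# One-hot encode the cluster labels and fit Logistic Regression on them.
encoder = OneHotEncoder(handle_unknown="ignore")
clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.fit_transform(Z_train), y_train)

auc = roc_auc_score(y_test, clf.predict_proba(encoder.transform(Z_test))[:, 1])
print(
    f"{X.shape[1]} original features -> {Z_train.shape[1]} cluster predictors, "
    f"test AUC = {auc:.3f}"
)
```

In this sketch each group of four columns collapses into a single categorical predictor, which is the sense in which dimensionality is reduced. Selecting the clustering by AMI against the training labels makes that selection step supervised, so the held-out split is kept out of it.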

Computer Science

Classification

Clustering

Dimensionality Reduction

Kindred Projections

Multi-view

Details

Title: Leveraging kindred projections for dimensionality reduction and improved classification
Creators: Diego Castaneda
Contributors: Guadalupe M Canahuate (Advisor)
Hans J Johnson (Committee Member)
Thomas L Casavant (Committee Member)
Resource Type: Thesis
Degree Awarded: Master of Science (MS), University of Iowa
Degree in: Electrical and Computer Engineering
Date degree season: Spring 2020
DOI: 10.17077/etd.005336
Publisher: University of Iowa
Number of pages: viii, 32 pages
Language: English
Description illustrations: color illustrations
Description bibliographic: Includes bibliographical references (pages 31-32).
Public Abstract (ETD): In classification tasks, it is common to deal with data that contain hundreds or thousands of attributes. It might seem that having more data means one can build more accurate and robust models, but this is often a misconception. Models in high dimensions usually struggle to find useful patterns because these patterns are obscured by noise within the data. Tree-based decision models have proven effective in these situations by building several models over random subsets of the features and then reaching a consensus on the label of each data point through a voting process.

This work proposes a method inspired by models that use subsets of features. Instead of defining subsets of random features, this work suggests grouping attributes by how related they are to each other, specifically by autological reasoning. We then use unsupervised methods to summarize each disjoint subset of associated features as a single categorical variable. A learning step is applied to each disjoint subset to determine the clustering that best corresponds to the outcome variables, as judged by information-theoretic measures. The resulting categorizations are then used as factors for fitting a highly explainable classification model. This work demonstrates that we can effectively reduce the dimensions of the original data into meaningful categories and then leverage these groups to improve performance over models that use all the features.
Academic Unit: Electrical and Computer Engineering
Record Identifier: 9983956194502771
