Big Data Algorithms for Visualization and Supervised Learning
dc.contributor.advisor | Vucetic, Slobodan | |
dc.creator | Djuric, Nemanja | |
dc.date.accessioned | 2020-11-03T16:23:51Z | |
dc.date.available | 2020-11-03T16:23:51Z | |
dc.date.issued | 2013 | |
dc.identifier.other | 881265180 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12613/2790 | |
dc.description.abstract | Explosive growth in data size, data complexity, and data rates, triggered by the emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, and computational advertising, has in recent years led to an increasing availability of data sets of unprecedented scale, with billions of high-dimensional data examples stored on hundreds of terabytes of memory. To make use of this large-scale data and extract useful knowledge, researchers in the machine learning and data mining communities face numerous challenges, since the data mining and machine learning tools designed for standard desktop computers cannot address these problems due to memory and time constraints. As a result, there is an evident need for the development of novel, scalable algorithms for big data. In this thesis we address these important problems and propose both supervised and unsupervised tools for handling large-scale data. First, we consider an unsupervised approach to big data analysis and explore a scalable, efficient visualization method that allows fast knowledge extraction. Next, we consider the supervised learning setting and propose algorithms for fast training of accurate classification models on large data sets, capable of learning state-of-the-art classifiers on data sets with millions of examples and features within minutes. Data visualization has been used for hundreds of years in scientific research, as it allows humans to easily gain better insight into the complex data they are studying. Despite its long history, there is a clear need for further development of visualization methods when working with large-scale, high-dimensional data, where commonly used visualization tools are either too simplistic to give deeper insight into the data properties, or are too cumbersome or computationally costly. We present a novel method for data ordering and visualization. 
By combining efficient clustering using the k-means algorithm with near-optimal ordering of the found clusters using a state-of-the-art TSP solver, we obtain an efficient algorithm that achieves better performance than existing, computationally intensive methods. In addition, we present a visualization method for smaller-scale problems based on object matching. The experiments show that the methods allow for fast detection of hidden patterns, even by users without expertise in data mining and machine learning. Supervised learning is another important task, often intractable in modern applications due to time and memory constraints, given the prohibitively large scale of the data sets. To address this issue, we first consider the Multi-hyperplane Machine (MM) classification model and propose the online Adaptive MM algorithm, which represents a trade-off between linear and kernel Support Vector Machines (SVMs), as it trains MMs in linear time on limited memory while achieving competitive accuracies on large-scale non-linear problems. Moreover, we present a C++ toolbox for developing scalable classification models, which provides an Application Programming Interface (API) for training large-scale classifiers, as well as highly optimized implementations of several state-of-the-art SVM approximators. Lastly, we consider parallelization and distributed learning approaches to large-scale supervised learning, and propose AROW-MapReduce, a distributed learning algorithm for confidence-weighted models using the MapReduce framework. Experimental evaluation of the proposed methods shows state-of-the-art performance on a number of synthetic and real-world data sets, further paving the way for efficient and effective knowledge extraction from big data problems. | |
dc.format.extent | 135 pages | |
dc.language.iso | eng | |
dc.publisher | Temple University. Libraries | |
dc.relation.ispartof | Theses and Dissertations | |
dc.rights | IN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available. | |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | |
dc.subject | Computer Science | |
dc.subject | Big Data | |
dc.subject | Data Mining | |
dc.subject | Data Visualization | |
dc.subject | Large-scale Learning | |
dc.subject | Machine Learning | |
dc.title | Big Data Algorithms for Visualization and Supervised Learning | |
dc.type | Text | |
dc.type.genre | Thesis/Dissertation | |
dc.contributor.committeemember | Obradovic, Zoran | |
dc.contributor.committeemember | Latecki, Longin | |
dc.contributor.committeemember | Bai, Li | |
dc.description.department | Computer and Information Science | |
dc.relation.doi | http://dx.doi.org/10.34944/dspace/2772 | |
dc.ada.note | For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu | |
dc.description.degree | Ph.D. | |
refterms.dateFOA | 2020-11-03T16:23:51Z |