
dc.contributor.advisor: Vucetic, Slobodan
dc.creator: Djuric, Nemanja
dc.date.accessioned: 2020-11-03T16:23:51Z
dc.date.available: 2020-11-03T16:23:51Z
dc.date.issued: 2013
dc.identifier.other: 881265180
dc.identifier.uri: http://hdl.handle.net/20.500.12613/2790
dc.description.abstract: Explosive growth in data size, data complexity, and data rates, triggered by the emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, or computational advertising, has in recent years led to an increasing availability of data sets of unprecedented scale, with billions of high-dimensional data examples stored on hundreds of terabytes of memory. In order to make use of this large-scale data and extract useful knowledge, researchers in the machine learning and data mining communities are faced with numerous challenges, since the data mining and machine learning tools designed for standard desktop computers are not capable of addressing these problems due to memory and time constraints. As a result, there exists an evident need for the development of novel, scalable algorithms for big data. In this thesis we address these important problems, and propose both supervised and unsupervised tools for handling large-scale data. First, we consider an unsupervised approach to big data analysis, and explore a scalable, efficient visualization method that allows fast knowledge extraction. Next, we consider the supervised learning setting and propose algorithms for fast training of accurate classification models on large data sets, capable of learning state-of-the-art classifiers on data sets with millions of examples and features within minutes.
Data visualization has been used for hundreds of years in scientific research, as it allows humans to easily gain better insight into the complex data they are studying. Despite its long history, there is a clear need for further development of visualization methods when working with large-scale, high-dimensional data, where commonly used visualization tools are either too simplistic to give deeper insight into the data properties, or are too cumbersome or computationally costly. We present a novel method for data ordering and visualization. By combining efficient clustering using the k-means algorithm and near-optimal ordering of the found clusters using a state-of-the-art TSP solver, we obtain an efficient algorithm that achieves better performance than existing, computationally intensive methods. In addition, we present a visualization method for smaller-scale problems based on object matching. The experiments show that the methods allow for fast detection of hidden patterns, even by users without expertise in the areas of data mining and machine learning.
Supervised learning is another important task, often intractable in many modern applications due to time and memory constraints, given the prohibitively large scale of the data sets. To address this issue, we first consider the Multi-hyperplane Machine (MM) classification model, and propose the online Adaptive MM algorithm, which represents a trade-off between linear and kernel Support Vector Machines (SVMs), as it trains MMs in linear time on limited memory while achieving competitive accuracies on large-scale non-linear problems. Moreover, we present a C++ toolbox for developing scalable classification models, which provides an Application Programming Interface (API) for training of large-scale classifiers, as well as highly optimized implementations of several state-of-the-art SVM approximators. Lastly, we consider parallelization and distributed learning approaches to large-scale supervised learning, and propose AROW-MapReduce, a distributed learning algorithm for confidence-weighted models using the MapReduce framework. Experimental evaluation of the proposed methods shows state-of-the-art performance on a number of synthetic and real-world data sets, further paving the way for efficient and effective knowledge extraction from big data.
dc.format.extent: 135 pages
dc.language.iso: eng
dc.publisher: Temple University. Libraries
dc.relation.ispartof: Theses and Dissertations
dc.rights: IN COPYRIGHT - This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available.
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Computer Science
dc.subject: Big Data
dc.subject: Data Mining
dc.subject: Data Visualization
dc.subject: Large-scale Learning
dc.subject: Machine Learning
dc.title: Big Data Algorithms for Visualization and Supervised Learning
dc.type: Text
dc.type.genre: Thesis/Dissertation
dc.contributor.committeemember: Obradovic, Zoran
dc.contributor.committeemember: Latecki, Longin
dc.contributor.committeemember: Bai, Li
dc.description.department: Computer and Information Science
dc.relation.doi: http://dx.doi.org/10.34944/dspace/2772
dc.ada.note: For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu
dc.description.degree: Ph.D.
refterms.dateFOA: 2020-11-03T16:23:51Z


Files in this item

Name: TETDEDXDjuric-temple-0225E-116 ...
Size: 2.949Mb
Format: PDF

This item appears in the following Collection(s)
