• Big Data Algorithms for Visualization and Supervised Learning

      Vucetic, Slobodan; Obradovic, Zoran; Latecki, Longin; Bai, Li (Temple University. Libraries, 2013)
      Explosive growth in data size, data complexity, and data rates, triggered by the emergence of high-throughput technologies such as remote sensing, crowd-sourcing, social networks, or computational advertising, has in recent years led to an increasing availability of data sets of unprecedented scale, with billions of high-dimensional data examples stored on hundreds of terabytes of memory. In order to make use of this large-scale data and extract useful knowledge, researchers in the machine learning and data mining communities face numerous challenges, since the data mining and machine learning tools designed for standard desktop computers are not capable of addressing these problems due to memory and time constraints. As a result, there is an evident need for the development of novel, scalable algorithms for big data. In this thesis we address these important problems and propose both supervised and unsupervised tools for handling large-scale data. First, we take an unsupervised approach to big data analysis and explore a scalable, efficient visualization method that allows fast knowledge extraction. Next, we consider the supervised learning setting and propose algorithms for fast training of accurate classification models on large data sets, capable of learning state-of-the-art classifiers on data sets with millions of examples and features within minutes. Data visualization has been used for hundreds of years in scientific research, as it allows humans to easily gain better insight into the complex data they are studying. Despite this long history, there is a clear need for further development of visualization methods for large-scale, high-dimensional data, where commonly used visualization tools are either too simplistic to offer deeper insight into data properties, or too cumbersome and computationally costly. We present a novel method for data ordering and visualization.
By combining efficient clustering using the k-means algorithm with near-optimal ordering of the found clusters using a state-of-the-art TSP solver, we obtain an efficient algorithm that outperforms existing, computationally intensive methods. In addition, we present a visualization method for smaller-scale problems based on object matching. The experiments show that the methods allow fast detection of hidden patterns, even by users without expertise in data mining and machine learning. Supervised learning is another important task, often intractable in many modern applications due to time and memory constraints, given the prohibitively large scale of the data sets. To address this issue, we first consider the Multi-hyperplane Machine (MM) classification model and propose the online Adaptive MM algorithm, which represents a trade-off between linear and kernel Support Vector Machines (SVMs), as it trains MMs in linear time on limited memory while achieving competitive accuracies on large-scale non-linear problems. Moreover, we present a C++ toolbox for developing scalable classification models, which provides an Application Programming Interface (API) for training large-scale classifiers, as well as highly optimized implementations of several state-of-the-art SVM approximators. Lastly, we consider parallelization and distributed learning approaches to large-scale supervised learning and propose AROW-MapReduce, a distributed learning algorithm for confidence-weighted models using the MapReduce framework. Experimental evaluation of the proposed methods shows state-of-the-art performance on a number of synthetic and real-world data sets, further paving the way for efficient and effective knowledge extraction from big data problems.
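The clustering-plus-ordering idea can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: plain Lloyd's k-means and a greedy nearest-neighbour tour stand in for the actual clustering and TSP solver.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means: assign points to nearest centre, recompute centres."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist2(p, centers[j]))].append(p)
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

def order_centers(centers):
    """Greedy nearest-neighbour tour over cluster centres -- a crude stand-in
    for a proper TSP solver, just to show where the ordering step fits."""
    unvisited = list(range(len(centers)))
    tour = [unvisited.pop(0)]
    while unvisited:
        nxt = min(unvisited, key=lambda j: dist2(centers[tour[-1]], centers[j]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour
```

Ordering the k cluster centres rather than all n points is what keeps such a method scalable: the TSP instance has k cities regardless of the data size.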
    • Context-aware Learning from Partial Observations

      Obradovic, Zoran; Vucetic, Slobodan; Dragut, Eduard Constantin; Zhao, Zhigen (Temple University. Libraries, 2018)
      The Big Data revolution brought an increasing availability of data sets of unprecedented scale, enabling researchers in the machine learning and data mining communities to learn from such data and provide data-driven insights, decisions, and predictions. However, on their journey, they face numerous challenges, including dealing with missing observations while learning from such data, and making predictions on previously unobserved or rare (“tail”) examples, which arise in a large span of domains including climate, medical, social network, consumer, and computational advertising domains. In this thesis, we address this important problem and propose tools for handling partially observed or completely unobserved data by exploiting information from its context. Here, we assume that the context is available in the form of a network or sequence structure, or as additional information attached to point-informative data examples. First, we propose two structured regression methods for dealing with missing values in partially observed temporal attributed graphs, based on the Gaussian Conditional Random Fields (GCRF) model, which draw power from the network/graph structure (context) of the unobserved instances. The Marginalized Gaussian Conditional Random Fields (m-GCRF) model is designed for dealing with missing response variable values (labels) in graph nodes, whereas Deep Feature Learning GCRF is able to deal with missing values in explanatory variables while learning a feature representation jointly with the complex interactions of nodes in a graph and the overall GCRF objective. Next, we consider unsupervised and supervised shallow and deep neural models for monetizing web search.
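To make the GCRF idea concrete, here is a minimal sketch of inference in a standard Gaussian CRF over a graph: the posterior mean given unstructured per-node predictions R and a node-similarity matrix S. This is the textbook closed form for the usual GCRF quadratic energy (alpha weighting fit to R, beta weighting smoothness over edges), not the thesis code, and the parameter names are illustrative.

```python
import numpy as np

def gcrf_mean(R, S, alpha, beta):
    """Posterior mean of a Gaussian CRF over graph nodes.

    Energy: alpha * sum_i (y_i - R_i)^2 + beta * sum_ij S_ij (y_i - y_j)^2,
    which is Gaussian in y with precision Q = 2(alpha*I + beta*L).
    """
    R = np.asarray(R, dtype=float)
    S = np.asarray(S, dtype=float)
    L = np.diag(S.sum(axis=1)) - S                      # graph Laplacian
    Q = 2.0 * (alpha * np.eye(len(R)) + beta * L)       # precision matrix
    b = 2.0 * alpha * R
    return np.linalg.solve(Q, b)                        # mean = Q^{-1} b
```

With beta = 0 the model reduces to the unstructured predictions R; increasing beta smooths the predictions over graph neighbours, which is what lets the context fill in weakly observed nodes.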
We focus on two sponsored search tasks here: (i) query-to-ad matching, where we propose a novel shallow neural embedding model, worLd2vec, with improved utilization of local query context (location), and (ii) click-through-rate prediction for ads and queries, where the Deeply Supervised Semantic Match model is introduced for the click-through-rate prediction problem on unobserved and tail queries, jointly learning the semantic embeddings of a query and an ad as well as their corresponding click-through rate. Finally, we propose a deep learning approach for ranking investigators based on their expected enrollment performance on new clinical trials. It learns from both investigator- and trial-related heterogeneous (structured and free-text) data sources, is applicable to matching investigators to new trials from partial observations, and supports recruitment of experienced investigators as well as new investigators with no previous experience in enrolling patients in clinical trials. Experimental evaluation of the proposed methods on a number of synthetic and diverse real-world data sets shows that they surpass their alternatives.
    • Data Mining Algorithms for Classification of Complex Biomedical Data

      Vucetic, Slobodan; Obradovic, Zoran; Latecki, Longin; Davey, Adam (Temple University. Libraries, 2012)
      In my dissertation, I present research that contributes to solving the following three open problems in biomedical informatics: (1) multi-task approaches for microarray classification; (2) multi-label classification for gene and protein function prediction from multi-source biological data; (3) spatial scan for movement data. In microarray classification, samples belong to several predefined categories (e.g., cancer vs. control tissues) and the goal is to build a predictor that classifies a new tissue sample based on its microarray measurements. When faced with small-sample, high-dimensional microarray data, most machine learning algorithms would produce an overly complicated model that performs well on training data but poorly on new data. To reduce the risk of over-fitting, feature selection becomes an essential technique in microarray classification. However, standard feature selection algorithms are bound to underperform when the size of the microarray data is particularly small. The best remedy is to borrow strength from external microarray datasets. In this dissertation, I present two new multi-task feature filter methods which can improve classification performance by utilizing external microarray data. The first method aggregates the feature selection results from multiple microarray classification tasks. The resulting multi-task feature selection can be shown to improve the quality of the selected features and lead to higher classification accuracy. The second method jointly selects a small gene set with maximal discriminative power and minimal redundancy across multiple classification tasks by solving an objective function with integer constraints. In the protein function prediction problem, gene functions are predicted from a predefined set of possible functions (e.g., the functions defined in the Gene Ontology).
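The aggregation step of a multi-task feature filter can be illustrated with a small sketch: score features within each task, average the ranks across tasks, and keep the top-k. The rank-averaging rule and input scores here are illustrative choices, not the dissertation's exact method.

```python
def multitask_filter(task_scores, k):
    """Aggregate per-task feature-relevance scores (e.g. |t|-statistics,
    one list per task) by averaging each feature's rank across tasks,
    then keep the k best-ranked features."""
    n = len(task_scores[0])
    avg_rank = [0.0] * n
    for scores in task_scores:
        order = sorted(range(n), key=lambda i: -scores[i])   # best first
        for rank, i in enumerate(order):
            avg_rank[i] += rank / len(task_scores)
    return sorted(range(n), key=lambda i: avg_rank[i])[:k]
```

Borrowing ranks from external tasks stabilizes the selection when any single task has too few samples to rank its features reliably.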
Gene function prediction is a complex classification problem characterized by the following aspects: (1) a single gene may have multiple functions; (2) the functions are organized in a hierarchy; (3) the training data for each function are unbalanced (far fewer positive than negative examples); (4) class labels may be missing; (5) multiple biological data sources are available, such as microarray data, genome sequence, and protein-protein interactions. As participants in the 2011 Critical Assessment of Function Annotation (CAFA) challenge, our team achieved the highest AUC accuracy among 45 groups. In the competition, we gained by focusing on the fifth aspect of the problem. Thus, in this dissertation, I discuss several schemes to integrate the prediction scores from multiple data sources and show their results. Interestingly, the experimental results show that a simple averaging integration method is competitive with other state-of-the-art data integration methods. The original spatial scan algorithm is used for detection of spatial overdensities: the discovery of spatial subregions with significantly higher scores according to some density measure. This algorithm is widely used in identifying clusters of disease cases (e.g., identifying environmental risk factors for child leukemia). However, the original spatial scan algorithm only works on static spatial data. In this dissertation, I propose one possible solution for spatial scan on movement data.
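The simple averaging integration can be sketched directly. One detail is assumed for illustration: when a data source did not score a given (gene, function) pair, that source is simply skipped in the average.

```python
def average_integration(source_scores):
    """Average per-source prediction scores over (gene, function) pairs.

    source_scores: one dict per data source mapping (gene, function) -> score.
    Pairs missing from a source are averaged over the sources that have them.
    """
    totals, counts = {}, {}
    for scores in source_scores:
        for key, s in scores.items():
            totals[key] = totals.get(key, 0.0) + s
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}
```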
    • Data Mining Algorithms for Decentralized Fault Detection and Diagnostic in Industrial Systems

      Vucetic, Slobodan; Obradovic, Zoran; Latecki, Longin; Seibold, Benjamin (Temple University. Libraries, 2012)
      Timely Fault Detection and Diagnosis in complex manufacturing systems is critical to ensure safe and effective operation of plant equipment. A process fault is defined as a deviation from normal process behavior, defined within the limits of safe production. The quantifiable objectives of Fault Detection include achieving low detection delay time, a low false positive rate, and a high detection rate. Once a fault has been detected, pinpointing the type of fault is needed for purposes of fault mitigation and returning to normal process operation. This is known as Fault Diagnosis. Data-driven Fault Detection and Diagnosis methods have emerged as an attractive alternative to traditional mathematical model-based methods, especially for complex systems, due to the difficulty of describing the underlying process. A distinct feature of data-driven methods is that no a priori information about the process is necessary. Instead, it is assumed that historical data, containing process features measured at regular time intervals (e.g., power plant sensor measurements), are available for development of a fault detection/diagnosis model through generalization of the data. The goal of my research was to address the shortcomings of existing data-driven methods and contribute to solving open problems such as: 1) decentralized fault detection and diagnosis; 2) fault detection in the cold start setting; 3) optimizing the detection delay and dealing with noisy data annotations; 4) developing models that can adapt to concept changes in power plant dynamics. For small-scale sensor networks, it is reasonable to assume that all measurements are available at a central location (sink) where fault predictions are made. This is known as the centralized fault detection approach. For large-scale networks, a decentralized approach is often used, where the network is decomposed into potentially overlapping blocks and each block provides local decisions that are fused at the sink.
The appealing properties of the decentralized approach include fault tolerance, scalability, and reusability. When one or more blocks go offline due to maintenance of their sensors, predictions can still be made using the remaining blocks. In addition, when the physical facility is reconfigured, either by changing its components or its sensors, it can be easier to modify the part of the decentralized system impacted by the changes than to overhaul the whole centralized system. The scalability comes from reduced costs of system setup, update, communication, and decision making. The main challenges in decentralized monitoring include process decomposition and decision fusion. We proposed a decentralized model where the sensors are partitioned into small, potentially overlapping blocks based on the Sparse Principal Component Analysis (PCA) algorithm, which preserves strong correlations among sensors, followed by training local models at each block and fusing decisions based on the proposed Maximum Entropy algorithm. Moreover, we introduced a novel framework for adding constraints to the Sparse PCA problem. The constraints limit the set of possible solutions by imposing additional goals to be reached through optimization along with the existing Sparse PCA goals. The experimental results on benchmark fault detection data show that Sparse PCA can utilize prior knowledge, which is not directly available in the data, to produce desirable network partitions with a pre-defined limit on communication cost and/or robustness.
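The block-then-fuse architecture can be sketched with deliberately simple stand-ins: a per-block z-score detector in place of the local models, and a weighted vote in place of the Maximum Entropy fusion. Thresholds and weights below are illustrative assumptions.

```python
import statistics

def block_alarm(window, threshold=3.0):
    """Local detector for one block. `window` holds one list of readings per
    sensor in the block; the last element is the current reading. Flag a fault
    if any sensor's current reading deviates from its history by more than
    `threshold` standard deviations."""
    for history in window:
        mu = statistics.mean(history[:-1])
        sd = statistics.pstdev(history[:-1]) or 1.0   # guard against zero spread
        if abs(history[-1] - mu) > threshold * sd:
            return True
    return False

def fuse(decisions, weights):
    """Sink-side fusion: weighted vote over the blocks' local decisions."""
    score = sum(w for d, w in zip(decisions, weights) if d)
    return score >= 0.5 * sum(weights)
```

Because each block decides from its own sensors only, blocks taken offline for maintenance simply drop out of the vote, which is the fault-tolerance property described above.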
    • Entity Information Extraction using Structured and Semi-structured resources

      Yates, Alexander; Obradovic, Zoran; Guo, Yuhong; Cucerzan, Silviu-Petru (Temple University. Libraries, 2014)
      Among the tasks in Information Extraction, Entity Linking, also referred to as entity disambiguation or entity resolution, is a new and important problem which has recently caught the attention of many researchers in the Natural Language Processing (NLP) community. The task involves linking/matching a textual mention of a named entity (like a person or a movie name) to an appropriate entry in a database (e.g., Wikipedia or IMDB). If the database does not contain the entity, the system should return a NIL (out-of-database) value. Existing techniques for linking named entities in text mostly focus on Wikipedia as the target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. In this dissertation, we introduce a new framework, called Open-Database Entity Linking (Open-DB EL), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. In experiments on two domains, our Open-DB EL strategies outperform a state-of-the-art Wikipedia EL system by over 25% in accuracy. Existing approaches typically perform EL using a pipeline architecture: they use a Named-Entity Recognition (NER) system to find the boundaries of mentions in text, and an EL system to connect the mentions to entries in structured or semi-structured repositories like Wikipedia. However, the two tasks are tightly coupled, and each type of system can benefit significantly from the kind of information provided by the other. We propose and develop a joint model for NER and EL, called NEREL, that takes a large set of candidate mentions from typical NER systems and a large set of candidate entity links from EL systems, and ranks the candidate mention-entity pairs together to make joint predictions.
In NER and EL experiments across three datasets, NEREL significantly outperforms or comes close to the performance of two state-of-the-art NER systems, and it outperforms 6 competing EL systems. On the benchmark MSNBC dataset, NEREL provides a 60% reduction in error over the next-best NER system and a 68% reduction in error over the next-best EL system. We also extend the idea of using semi-structured resources to a relatively less explored area of entity information extraction. Most previous work on information extraction from text has focused on named-entity recognition, entity linking, and relation extraction. Much less attention has been paid to extracting the temporal scope of relations between named entities; for example, the relation president-Of (John F. Kennedy, USA) is true only in the time frame (January 20, 1961 - November 22, 1963). In this dissertation we present a system for temporal scoping of relational facts, called TSRF, which is trained with distant supervision based on the largest semi-structured resource available: Wikipedia. TSRF employs language models consisting of patterns automatically bootstrapped from sentences collected from Wikipedia pages that contain the main entity of a page and slot-fillers extracted from the infobox tuples. The proposed system achieves state-of-the-art results on 6 out of 7 relations on the benchmark Text Analysis Conference (TAC) 2013 dataset for the task of temporal slot filling (TSF). Overall, the system outperforms the next-best system that participated in the TAC evaluation by 10 points on the TAC-TSF evaluation metric.
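The link-or-NIL behaviour of entity linking against an arbitrary database can be illustrated with a toy string-similarity matcher. This is only a sketch of the interface, not the Open-DB EL scoring model; the similarity measure and threshold are assumptions.

```python
from difflib import SequenceMatcher

def link_entity(mention, database, nil_threshold=0.75):
    """Match a textual mention against a list of entity names from some
    database; return the best match, or "NIL" when nothing is similar
    enough (the out-of-database case)."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(database, key=lambda name: sim(mention, name))
    return best if sim(mention, best) >= nil_threshold else "NIL"
```

A real system scores candidates with far richer features than surface similarity, but the decision structure (rank candidates, fall back to NIL below a threshold) is the same.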

      Callaway, Brantly Mercer, IV; Swanson, Charles E.; Webber, Douglas (Douglas A.); Scott, Jonathan A. (Temple University. Libraries, 2020)
      This dissertation includes three chapters, which are three papers on banking mergers and acquisitions. Bank failure and bank takeover are major risks which cause a bank to cease to exist, and Chapter 1 focuses on analyzing the factors that indicate a bank takeover target vs. a bank failure. Target banks are integrated into acquiring banks, and the performance of the acquiring banks may change after the takeovers. Therefore, Chapter 2 focuses on the impact of bank acquisition on the acquiring bank in the U.S. Chapter 3 focuses on prediction and compares two different methodologies (multinomial logistic regression and the machine learning method XGBoost) for predicting bank failure or takeover. Chapter 1, titled FACTORS THAT INDICATE BANK TAKEOVER TARGET VS. BANK FAILURE, analyzes mergers and acquisitions data for the US banking industry from 2001 to late 2015, using both the multinomial logistic method and the competing risk proportional hazard method, to see how financial ratios and bank-specific features affect the risk of bank failure, bank takeover by a correlated bank under the same ultimate parent bank holding company, and bank takeover by an independent bank with a different ultimate parent bank holding company. This chapter also analyzes the characteristics of failed banks and target banks in different stages of the financial economic cycle. The results show that failed banks, and banks which were taken over by independent banks, have a lower capital ratio, higher real estate loan ratio and commercial and industrial loan ratio, higher non-performing loan ratio, lower after-tax profit ratio, higher operating profit ratio, higher liquidity ratio, younger age, and smaller asset growth ratio than the baseline banks which continued to operate as usual through the cycle. One notable difference between these two risks is that failed banks tend to be of bigger size, while acquired banks tend to be of smaller size.
Banks which were taken over by correlated banks exhibit a higher equity ratio, higher commercial and industrial loan ratio, lower after-tax profit ratio, lower liquidity ratio, bigger size, smaller asset growth ratio, and younger age compared to the baseline banks which continued to operate as usual through the cycle. The results show that the three risk events are to some extent sensitive to the stage of the financial economic cycle, with the risk of bank takeover by a correlated bank being the most sensitive. The results also show that the factors indicating the three risks exhibit only small sensitivity to the methodology utilized. Chapter 2, titled IMPACT OF BANK ACQUISITION ON THE ACQUIRING BANK IN THE U.S., focuses on merger and acquisition activities in the U.S. banking industry between 2003 and 2014 and analyzes the data to see the effects of a merger and acquisition on the acquiring bank's performance after the event. This chapter selects performance measures based on the financial ratios implied by the CAMEL measure, and uses both the group-time difference-in-differences method and the quantile difference-in-differences method to assess the impacts. The results show that not all the financial ratios are significantly impacted by merger and acquisition, and the impacts show some variation depending on the stage of the economic cycle in which the mergers and acquisitions are conducted. The equity ratio, commercial and industrial loan ratio, delinquent assets ratio, non-performing assets ratio, and return on equity show significant impact from the mergers and acquisitions in all three stages across the economic cycle. The results also show that the merger and acquisition effects on the performance measures vary depending on whether the measures are at the high or low end of their distributions. Chapter 3, titled PREDICTION OF U.S. BANK STATUS USING MACHINE LEARNING VS.
MULTINOMIAL LOGISTIC REGRESSION, compares multinomial logistic regression with the machine learning method eXtreme Gradient Boosting (XGBoost), to see which methodology gives better predictions for two types of risk events faced by U.S. banks, namely bank failure and bank takeover, using features consisting of financial ratios on data from 2002 to 2014. This paper also compares the most important features in each methodology. Beyond that, it uses SHapley Additive exPlanations (SHAP) analysis to interpret how bank features influence these two types of risk events in the XGBoost model. The results show that the XGBoost method gives better prediction accuracy when the model is both developed and evaluated on the whole span of US banking mergers and acquisitions data from 2002 to 2014, but the outperformance of XGBoost is not obvious when the model is developed on restricted in-sample data (from 2002 to 2010) and evaluated on out-of-sample data (from 2011 to 2014). Both methodologies give better prediction accuracy for the risk of bank failure than for the risk of bank takeover. In addition, the most important features from the XGBoost method and the multinomial logistic regression method are highly aligned, with the non-operating expense ratio, net after-tax income ratio, equity ratio, and non-performing asset ratio being the top features. Finally, the SHAP analysis of the XGBoost model shows that the features contribute to the targeted risks in a non-linear way.
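For concreteness, here is a minimal multinomial logistic regression fitted by gradient descent on a toy three-class problem (the three bank statuses in the chapter). The features, data, and hyperparameters below are illustrative only; the chapter's actual models and financial-ratio features are not reproduced.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_multinomial(X, y, classes, lr=0.5, steps=2000):
    """Multinomial logistic regression via batch gradient descent on the
    cross-entropy loss (the loss is convex, so this reaches the optimum)."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append intercept column
    W = np.zeros((X.shape[1], classes))
    Y = np.eye(classes)[y]                     # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / len(X)
    return W

def predict_classes(W, X):
    X = np.hstack([X, np.ones((len(X), 1))])
    return softmax(X @ W).argmax(axis=1)
```

Unlike XGBoost, this model is linear in the features, which is one reason a SHAP-style analysis of the boosted trees can reveal the non-linear contributions the chapter reports.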
    • Exploration of 3D Images to Understand 3D Real World

      Ling, Haibin; Shi, Justin Y.; Vucetic, Slobodan; Zheng, Yefeng, 1975- (Temple University. Libraries, 2016)
      Our world is composed of three-dimensional objects. Every one of us lives in a world with X, Y, and Z axes. Even though the way we usually record our world is by taking a photo, reducing dimensionality from three dimensions to two, the most natural and vivid way to understand the world, and to interact with it, is to sense our 3D real world directly. We human beings sense our 3D real world every day using our built-in stereo system: two eyes. In other words, the raw source data human beings use to recognize the real 3D world has depth information. It is not difficult to ask: will it help if we give machines a depth map of a scene when understanding the 3D real world using computer vision technologies? The answer is yes. Following this idea, my research is focused on 3D topics in Computer Vision. The three-dimensional world is the most intuitive and vivid world human beings can perceive. In the past, it was very costly to obtain 3D raw source data. However, things have changed since the release of many 3D sensors in recent decades. With the help of modern 3D sensors, I was motivated to choose my research topics in this direction. Nowadays, 3D sensors are used in various industries. In the gaming industry, we have many kinds of commercial indoor 3D sensors. These sensors can generate 3D point clouds in indoor environments at very low cost, providing depth information to traditional computer vision algorithms and achieving state-of-the-art detection of the human body skeleton. 3D sensing in gaming brings new ways to interact with computers. In the medical industry, engineers offer cone beam computed tomography (CBCT). The raw source data this technology provides gives doctors an idea of the holographic structure of the target soft/hard tissue. By extending pattern recognition algorithms from 2D to 3D, computer vision scientists can now present doctors with 3D texture features and assist them in diagnosis.
My research follows these two lines. In medical imaging, by looking into trabecular bone 3D structures, I want to use Computer Vision tools to interpret the subtlest density changes. In the human-computer-interaction task, by studying the 3D point cloud, I want to find a way to estimate human hand pose. First, in medical imaging, using Computer Vision methods, I want to find a useful algorithm to distinguish bone texture patterns. This task is critical in clinical diagnosis. Variations in trabecular bone texture are known to be correlated with bone diseases, such as osteoporosis. In my research work, we propose a multi-feature multi-ROI (MFMR) approach for analyzing trabecular patterns inside the oral cavity using cone beam computed tomography (CBCT) volumes. For each dental CBCT volume, a set of features including fractal dimension, multi-fractal spectrum, and gradient-based features is extracted from eight regions of interest (ROI) to address the low image quality of trabecular patterns. Then, we use generalized multi-kernel learning (GMKL) to effectively fuse these features for distinguishing trabecular patterns from different groups. To validate the proposed method, we apply it to distinguish trabecular patterns from different gender-age groups. On a dataset containing dental CBCT volumes from 96 subjects, divided into gender-age subgroups, our approach achieves a 96.1% average classification rate, which greatly outperforms approaches without the feature fusion. Besides, in the human-computer-interaction task, the most natural way is to point at things with your hand, or to use a gesture to express your ideas. I am motivated to estimate all skeletal joint locations in 3D space, which is the foundation of all gesture understanding. Through logical decisions on these skeletal joint locations, we can obtain the semantics behind the hand gesture. So, the task is to estimate a hand pose in 3D space, locating all skeletal joints.
A real-time 3D hand pose estimation algorithm is then proposed using the randomized decision forest framework. The algorithm takes a depth image as input and generates a set of skeletal joints as output. Previous decision-forest-based methods often give labels to all points in a point cloud at a very early stage and vote for the joint locations. By contrast, this algorithm only tracks a set of more flexible virtual landmark points, named segmentation index points (SIPs), before reaching the final decision at a leaf node. Roughly speaking, an SIP represents the centroid of a subset of skeletal joints, which are to be located at the leaves of the branch expanded from the SIP. Inspired by a latent regression-forest-based hand pose estimation framework, we integrate SIPs into the framework with several important improvements. The experimental results on public benchmark datasets clearly show the advantage of the proposed algorithm over previous state-of-the-art methods, and the algorithm runs at 55.5 fps on a normal CPU without parallelism. After the study on RGBD (RGB-depth) images, we came to another issue. When we want to take advantage of our algorithms and build an application, we find it really hard to accomplish. The majority of devices today are equipped with RGB cameras; smart devices in recent years rarely have RGBD cameras on them. We came to a dilemma: we were not able to apply our algorithms to more general scenarios. So I changed my perspective and tried some 3D reconstruction algorithms on ordinary RGB cameras. As a result, we shifted our attention to human face analysis in RGB images. Detecting faces in photos is critical in intelligent applications. However, this is far from enough for modern application scenarios. Many applications require accurate localization of facial landmarks. Face Alignment (FA) is critical for face analysis, and it has been studied extensively in recent years.
For academia, research along this line is challenging when face images have extreme poses, lighting, expressions, occlusions, etc. Besides, FA is a fundamental component in all face analysis algorithms. For industry, once these facial key point locations are available, many previously impossible applications become reachable. A robust FA algorithm is in great demand. We developed our proposed Convolutional Neural Network (CNN) on the Deep Learning framework Caffe while employing a GPU server with 8 NVIDIA TitanX GPUs. Once the CNN structure was finalized, thousands of human-labeled face images were used to train the proposed CNN on a GPU server cluster with 2 nodes connected by InfiniBand, each node having 4 NVIDIA K40 GPUs of its own. Our framework outperforms state-of-the-art deep learning algorithms.

      Megalooikonomou, Vasilis; Obradovic, Zoran; Ling, Haibin; Faro, Scott H. (Temple University. Libraries, 2012)
      Gene expression signatures in the mammalian brain hold the key to understanding neural development and neurological diseases, and gene expression profiles have been widely used in functional genomic studies. However, not much work in traditional gene expression profiling takes into account the location information of a gene's expression in the brain. Gene expression maps, which are obtained by combining voxelation and microarrays, contain spatial information regarding the expression of genes in the mouse brain. We study approaches for identifying the relationship between gene expression maps and gene functions, for mining association rules, and for predicting certain gene functions and functional similarities based on the gene expression maps obtained by voxelation. First, we identified the relationship between gene functions and gene expression maps. On one hand, we chose typical genes as queries and aimed at discovering the groups of genes which have gene expression maps similar to the queries. We then studied the relationship between functions and maps by checking the similarities of gene functions in the detected gene groups. The similarity between a pair of gene expression maps was computed as the Euclidean distance between the pair of feature vectors extracted by wavelet transformation from the hemisphere-averaged gene expression maps. Similarities of gene functions were identified by Lin's method based on gene ontology structures. On the other hand, we proposed a multiple clustering approach, combined with a hierarchical clustering method, to detect significant clusters of genes which have both similar gene functions and similar gene expression maps. Within each group of similar genes, the gene function similarity was measured by calculating the average pair-wise gene function distance in the group and then ranking it against random cases.
By finding groups of genes similar to typical query genes, we were able to improve our understanding of gene expression patterns and gene functions. Through the multiple clustering, we obtained significant clusters of genes with very similar functions with respect to their corresponding gene ontologies. The cellular component ontology resulted in prominent clusters expressed in cortex and corpus callosum. The molecular function ontology gave prominent clusters in cortex, corpus callosum and hypothalamus. The biological process ontology resulted in clusters in cortex, hypothalamus and choroid plexus. Clusters from all three ontologies combined were most prominently expressed in cortex and corpus callosum. The experimental results confirm the hypothesis that, for certain genes, genes with similar gene expression maps have similar gene functions. Based on the relationship between gene functions and expression maps, we developed a modified Apriori algorithm to mine association rules among gene functions in the significant clusters. The experimental results show that the detected association rules (frequent itemsets of gene functions) make sense biologically. By inspecting the obtained clusters and the genes sharing the same frequent itemsets of functions, interesting clues were discovered that provide valuable insight to biological scientists. The discovered association rules can potentially be used to predict gene functions based on similarity of gene expression maps. Moreover, we proposed an efficient approach to identify gene functions. A gene function, or a set of certain gene functions, can potentially be associated with a specific gene expression profile. We named this specific gene expression profile a Functional Expression Profile (FEP) for one function, or a Multiple Functional Expression Profile (MFEP) for a set of functions. We suggested two different ways of finding (M)FEPs, a cluster-based and a non-cluster-based method.
Both of these methods achieved high accuracy in predicting gene functions, each for different kinds of gene functions. Compared to the traditional K-nearest-neighbor method, our approach showed higher accuracy in predicting functions. The visualized gene expression maps of (M)FEPs were in good agreement with anatomical components of the mouse brain. Furthermore, we proposed a supervised learning methodology to predict pairwise gene functional similarity from gene expression maps. Using a modified AdaBoost algorithm coupled with our proposed weak classifier, we predicted the functional similarities between genes to a certain degree. The experimental results showed that as the similarity of gene expression maps increased, the functional similarity increased as well. The weights of the features in the model indicated the most significant single voxels and pairs of neighboring voxels, which can be visualized in the expression map image of a mouse brain.
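The map-similarity computation described above (Euclidean distance between wavelet feature vectors, used to rank genes against a query) can be sketched minimally as follows. This is an illustrative assumption, not the dissertation's actual feature pipeline: a single-level Haar transform stands in for the full wavelet decomposition, and the tiny expression vectors and gene names are made up.

```python
import math

def haar_1d(signal):
    """One level of the 1-D Haar wavelet transform: pairwise averages
    (approximation) followed by pairwise differences (detail),
    each scaled by 1/sqrt(2)."""
    s = math.sqrt(2.0)
    approx = [(a + b) / s for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / s for a, b in zip(signal[0::2], signal[1::2])]
    return approx + detail

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Rank candidate genes by feature-space distance to a query map.
query = haar_1d([4.0, 4.0, 1.0, 1.0])
candidates = {
    "gene_a": haar_1d([4.0, 4.0, 1.0, 1.0]),   # identical map
    "gene_b": haar_1d([0.0, 0.0, 5.0, 5.0]),   # very different map
}
ranked = sorted(candidates, key=lambda g: euclidean(query, candidates[g]))
```

Genes at the top of `ranked` would then be inspected for gene-ontology similarity to the query, as in the clustering analysis above.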

      Obradovic, Zoran; Vucetic, Slobodan; Latecki, Longin; Davey, Adam (Temple University. Libraries, 2013)
      Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires preprocessing, such as selecting informative features and imputing missing values based on the observed data. These steps can yield more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, as well as my follow-up work on feature selection in incomplete data sets without imputing missing values. In the last part of my dissertation, I present my current work on the more challenging situation where high-dimensional data varies in time. The first two parts of my dissertation consist of methods that handle such data in a straightforward way: impute missing values first, and then apply a traditional feature selection method to select informative features. We proposed two novel methods, one for imputing missing values and the other for selecting informative features. The imputation method fills in missing attributes by exploiting temporal correlation of attributes, correlations among multiple attributes collected at the same time and space, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimum subset of the most informative variables for classification/regression by efficiently approximating the Markov blanket, i.e., the set of variables that can shield a certain variable from the target. In the third part, I show how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data are missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution.
We define the objective function of the uncertainty-margin-based feature selection method to maximize each instance's uncertainty margin in its own relevant subspace. In the optimization, we take into account the uncertainty of each instance due to the missing values. The experimental results on synthetic and six benchmark data sets with few missing values (less than 25%) provide evidence that our method selects features as accurate as those chosen by alternative methods that apply an imputation method first. However, when there is a large fraction of missing values (more than 25%) in the data, our feature selection method outperforms the alternatives that impute missing values first. In the fourth part, I introduce my method for the more challenging situation where the high-dimensional data varies in time. The existing way to handle such data is to flatten the temporal data into a single static data matrix and then apply a traditional feature selection method. In order to preserve the dynamics in the time series data, our method avoids flattening the data in advance. We propose a way to measure the distance between the multivariate temporal data of two instances. Based on this distance, we define a new objective function based on the temporal margin of each data instance. A fixed-point gradient descent method is proposed to solve the formulated objective function and learn the optimal feature weights. The experimental results on real temporal microarray data provide evidence that the proposed method can identify more informative features than the alternatives that flatten the temporal data in advance.
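The margin idea underlying the feature selection methods above can be illustrated with a Relief-style sketch. This is an assumed stand-in, not the dissertation's uncertainty-margin or temporal-margin objective: a feature that separates an instance from its nearest other-class neighbor (nearest miss) more than from its nearest same-class neighbor (nearest hit) earns a higher weight.

```python
def relief_weights(X, y):
    """Relief-style margin-based feature weighting: for each instance,
    reward features on which the nearest miss is far and the nearest
    hit is close; average the per-instance contributions."""
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda z: dist(X[i], z))
        miss = min(misses, key=lambda z: dist(X[i], z))
        for k in range(d):
            w[k] += abs(X[i][k] - miss[k]) - abs(X[i][k] - hit[k])
    return [v / n for v in w]

# Feature 0 determines the class; feature 1 is pure noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
```

Thresholding or ranking `w` then yields the selected feature subset; the dissertation's methods additionally weight each instance's contribution by its uncertainty from missing values, which this sketch omits.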
    • Learning from Multiple Knowledge Sources

      Obradovic, Zoran; Vucetic, Slobodan; Yates, Alexander; McLaughlin, Joseph P., Jr.; Agarwal, Pankaj (Temple University. Libraries, 2013)
      In supervised learning, it is usually assumed that true labels are readily available from a single annotator or source. However, recent advances in corroborative technology have given rise to situations where the true label of the target is unknown. In such problems, multiple sources or annotators are often available that provide noisy labels of the targets. In these multi-annotator problems, building a classifier in the traditional single-annotator manner, without regard for the annotator properties, may not be effective in general. In recent years, how to make the best use of the labeling information provided by multiple annotators to approximate the hidden true concept has drawn the attention of researchers in machine learning and data mining. In our previous work, a probabilistic method (the MAP-ML algorithm) was developed that iteratively evaluates the different annotators and estimates the hidden true labels. However, the method assumes the error rate of each annotator is consistent across all the input data. This is an impractical assumption in many cases, since annotator knowledge can fluctuate considerably depending on the groups of input instances. In this dissertation, one of our proposed methods, the GMM-MAPML algorithm, follows MAP-ML but relaxes the data-independent assumption; i.e., we assume an annotator may not be consistently accurate across the entire feature space. GMM-MAPML uses a Gaussian mixture model (GMM) and the Bayesian information criterion (BIC) to find the model that best approximates the distribution of the instances. The maximum a posteriori (MAP) estimate of the hidden true labels and the maximum-likelihood (ML) estimate of the quality of the multiple annotators at each Gaussian component are then computed alternately. Recent studies show that employing more annotators regardless of their expertise does not necessarily improve the aggregated performance.
In this dissertation, we also propose a novel algorithm to integrate multiple annotators by Aggregating Experts and Filtering Novices, which we call AEFN. AEFN iteratively evaluates annotators, filters out the low-quality annotators, and re-estimates the labels based only on information obtained from the good annotators. The noisy annotations we integrate can come from any combination of human annotators and previously existing machine-based classifiers, so AEFN can be applied to many real-world problems. Emotional speech classification, CASP9 protein disorder prediction, and biomedical text annotation experiments show a significant performance improvement of the proposed methods (GMM-MAPML and AEFN) compared to the majority voting baseline and the previous data-independent MAP-ML method. Recent experiments include predicting novel drug indications (i.e., drug repositioning) for both approved drugs and new molecules by integrating multiple chemical, biological or phenotypic data sources.
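The iterate-evaluate-re-estimate loop that MAP-ML and AEFN share can be sketched for binary labels as follows. This is a simplified illustration of the shared idea, not either published algorithm: it alternates an accuracy-weighted vote for the hidden truth with a re-scoring of each annotator against that estimate, and all numbers are toy values.

```python
def aggregate_annotators(labels, n_iter=10):
    """labels: one 0/1 label list per annotator, all over the same
    instances.  Alternately (1) estimate the hidden truth by an
    accuracy-weighted vote and (2) re-estimate each annotator's
    accuracy against that truth estimate."""
    n_ann, n_items = len(labels), len(labels[0])
    acc = [0.8] * n_ann                 # optimistic starting accuracies
    truth = [0] * n_items
    for _ in range(n_iter):
        # Weighted vote: an annotator with accuracy 0.5 gets zero weight,
        # and a perfectly wrong annotator would vote negatively.
        for i in range(n_items):
            score = sum((2 * acc[a] - 1) * (2 * labels[a][i] - 1)
                        for a in range(n_ann))
            truth[i] = 1 if score > 0 else 0
        # Re-score annotators by agreement with the current estimate.
        for a in range(n_ann):
            agree = sum(labels[a][i] == truth[i] for i in range(n_items))
            acc[a] = agree / n_items
    return truth, acc

# Two reliable annotators agree; the third is unreliable.
truth, acc = aggregate_annotators([[1, 1, 0, 0],
                                   [1, 1, 0, 0],
                                   [0, 1, 1, 0]])
```

AEFN's filtering step corresponds to dropping annotators whose estimated accuracy falls below a threshold before the next vote; GMM-MAPML instead maintains separate accuracy estimates per Gaussian component of the input distribution.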
    • Learning Top-N Recommender Systems with Implicit Feedbacks

      Guo, Yuhong; Shi, Justin Y.; Dragut, Eduard Constantin; Dong, Yuexiao (Temple University. Libraries, 2017)
      Top-N recommender systems automatically recommend N items for users from a huge number of products. Personalized top-N recommender systems have a great impact on many real-world applications such as e-commerce platforms and social networks. Sometimes there is no rating information in the user-item feedback matrix, only implicit purchase or browsing history; the feedback matrix is then binary, and we call such feedback implicit feedback. In our work we learn top-N recommender systems from implicit feedback. First, we design a heterogeneous loss function to learn the model. Second, we incorporate item side information into the recommender system, formulating a low-rank constrained minimization problem and giving a closed-form solution for it. Third, we again use item side information, this time learning the model with gradient descent. Most existing methods produce personalized top-N recommendations by minimizing a specific uniform loss such as a pairwise ranking loss or a pointwise recovery loss. In our first model, we propose a novel personalized top-N recommendation approach that minimizes a combined heterogeneous loss based on linear self-recovery models. The heterogeneous loss integrates the strengths of both the pairwise ranking loss and the pointwise recovery loss to provide more informative recommendation predictions. We formulate the learning problem with the heterogeneous loss as a constrained convex minimization problem and develop a projected stochastic gradient descent optimization algorithm to solve it. Most previous systems are based only on the user-item feedback matrix. In many applications, in addition to the user-item rating/purchase matrix, item-based side information such as product reviews, book reviews, item comments, and movie plots can be easily collected from the Internet. This abundant item-based information can be used for recommendation systems.
In the second model, we propose a novel predictive collaborative filtering approach that exploits both the partially observed user-item recommendation matrix and the item-based side information to produce top-N recommender systems. The proposed approach automatically identifies the most interesting items for each user from his or her non-recommended item pool by aggregating over his or her recommended items via a low-rank coefficient matrix. Moreover, it also simultaneously builds linear regression models from the item-based side information such as item reviews to predict the item recommendation scores for the users. The proposed approach is formulated as a rank constrained joint minimization problem with integrated least squares losses, for which an efficient analytical solution can be derived. In the third model, we also propose a joint discriminative prediction model that exploits both the partially observed user-item recommendation matrix and the item-based side information to build top-N recommender systems. This joint model aggregates observed user-item recommendation activities to predict the missing/new user-item recommendation scores while simultaneously training a linear regression model to predict the user-item recommendation scores from auxiliary item features. We evaluate the proposed approach on a variety of recommendation tasks. The experimental results show that the proposed joint model is very effective for producing top-N recommendation systems.
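The heterogeneous loss idea from the first model, mixing a pointwise recovery term with a pairwise ranking term, can be sketched for a single user as below. The mixing weight `alpha`, the hinge margin of 1, and the tiny score/feedback vectors are assumptions for illustration, not the dissertation's exact formulation.

```python
def heterogeneous_loss(scores, feedback, alpha=0.5):
    """Combined loss for one user with binary implicit feedback:
    pointwise squared recovery error plus a pairwise hinge that
    penalizes any non-purchased item scored within a margin of a
    purchased item, mixed by alpha."""
    pointwise = sum((s - f) ** 2 for s, f in zip(scores, feedback))
    pos = [s for s, f in zip(scores, feedback) if f == 1]
    neg = [s for s, f in zip(scores, feedback) if f == 0]
    pairwise = sum(max(0.0, 1.0 - (p - n)) for p in pos for n in neg)
    return alpha * pointwise + (1 - alpha) * pairwise

# Scores that both recover the feedback and rank purchased items above
# non-purchased ones incur a much smaller loss.
good = heterogeneous_loss([2.0, 1.5, -1.0], [1, 1, 0])
bad = heterogeneous_loss([-1.0, 0.0, 2.0], [1, 1, 0])
```

A projected stochastic gradient step on this combined objective, with the projection enforcing the model's constraints, is the optimization pattern the first model uses.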
    • Machine Learning Algorithms for Characterization and Prediction of Protein Structural Properties

      Vucetic, Slobodan; Obradovic, Zoran; Zhang, Kai; Dunbrack, Roland L.; Carnevale, Vincenzo (Temple University. Libraries, 2019)
      Proteins are large biomolecules that are the functional building blocks of living organisms. There are about 22,000 protein-coding genes in the human genome. Each gene encodes a unique protein sequence, typically 100-1000 residues long, built from a 20-letter alphabet of amino acids. Each protein folds up into a unique 3D shape that enables it to perform its function. Each protein structure consists of some number of helical segments, extended segments called sheets, and loops that connect these elements. In the last two decades, machine learning methods, coupled with exponentially expanding biological knowledge databases and computational power, have been enabling significant progress in the field of computational biology. In this dissertation, I carry out machine learning research on three major interconnected problems to advance protein structural biology as a field. A separate chapter of this dissertation is devoted to each problem, and after the three chapters I conclude with a summary and directions for future work. Chapter 1 describes the design, training and application of a convolutional neural network (SecNet) that achieves 84% accuracy on the 60-year-old problem of predicting protein secondary structure from a protein sequence. Our accuracy is 2-3% better than any previous result, which had risen only 5% in the last 20 years. We identified the key factors for successful prediction in a detailed ablation study. A paper submitted for publication includes our secondary-structure prediction software, data set generation, and training and testing protocols [1]. Chapter 2 describes the design and development of a protocol for clustering of beta turns, i.e., short structural motifs responsible for U-turns in protein loops. We identified 18 turn types, 11 of which are newly described [2]. We also developed a turn library and cross-platform software for turn assignment in new structures.
In Chapter 3 I build upon the results from these two problems and predict geometries in loops of unknown structure with custom residual neural networks (ResNets). I demonstrate solid results on (a) locating turns and predicting the 18 turn types and (b) predicting backbone torsion angles in loops. Given the recent progress in machine learning, these two results provide a strong foundation for successful loop modeling and encourage us to develop a new loop structure prediction program, a critical step in protein structure prediction and modeling.
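Sequence-based secondary-structure predictors of the kind described in Chapter 1 classify each residue from a fixed-length window of one-hot-encoded neighbors. A minimal sketch of that input encoding follows; the window size of 15 and the `X` padding symbol are generic assumptions, not SecNet's actual settings.

```python
# Standard 20-letter amino-acid alphabet.
AA = "ACDEFGHIKLMNPQRSTVWY"

def window_features(seq, center, half=7, pad="X"):
    """One-hot encode a window of 2*half+1 residues around `center`,
    padding past the termini.  The pad symbol is absent from AA, so
    padded positions encode to all zeros."""
    window = []
    for i in range(center - half, center + half + 1):
        window.append(seq[i] if 0 <= i < len(seq) else pad)
    feats = []
    for aa in window:
        feats.extend(1 if aa == a else 0 for a in AA)
    return feats

# Encode the middle residue of a (very) short sequence.
f = window_features("MKV", 1)
```

A convolutional network then maps each such window (or, in fully convolutional designs, the whole encoded sequence) to one of the secondary-structure classes for the central residue.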

      Bai, Li; Wang, Ze; Kim, Albert; Lu, Xiaonan; Ji, Bo, 1982- (Temple University. Libraries, 2020)
      Arterial spin labeling (ASL) perfusion magnetic resonance imaging (MRI) is a noninvasive technique for measuring quantitative cerebral blood flow (CBF), but it suffers from an inherently low signal-to-noise ratio (SNR), which poses a major challenge for data processing. Traditional post-processing methods have been proposed to reduce artifacts, suppress non-local noise, and remove outliers. However, these methods are based on either implicit or explicit models of the data, which may not be accurate and may change across subjects. Deep learning (DL) is an emerging machine learning technique that can learn a transform function from acquired data without any explicit hypothesis about that function. Such flexibility may be particularly beneficial for ASL denoising. In this dissertation, three different machine learning-based methods are proposed to improve the image quality of ASL MRI: 1) a learning-from-noise method, which does not require noise-free references for DL training, was proposed for DL-based ASL denoising and BOLD-to-ASL prediction; 2) a novel deep learning neural network that combines dilated convolution and wide-activation residual blocks was proposed to improve the image quality of ASL CBF while reducing ASL acquisition time; 3) a prior-guided and slice-wise adaptive outlier cleaning algorithm was developed for ASL MRI. In the first part of this dissertation, a learning-from-noise method is proposed for DL-based ASL denoising. The method shows that DL-based ASL denoising models can be trained using only noisy image pairs, without any deliberate post-processing to obtain a quasi-noise-free reference during training. The learning-from-noise method can also be applied to DL-based ASL perfusion prediction from BOLD fMRI, since the ASL references in this BOLD-to-ASL prediction are extremely noisy.
Experimental results demonstrate that this learning-from-noise method can reliably denoise ASL MRI and predict ASL perfusion from BOLD fMRI, resulting in an improved signal-to-noise ratio (SNR) of ASL MRI. Moreover, with this method more training data can be generated, as fewer samples are needed to produce quasi-noise-free references, which is particularly useful when ASL CBF data are limited. In the second part of this dissertation, we propose a novel deep learning neural network, the Dilated Wide Activation Network (DWAN), that is optimized for ASL denoising. Our method presents two novelties: first, we incorporate wide-activation residual blocks into a dilated convolutional neural network to achieve improved denoising performance in terms of several quantitative and qualitative measurements; second, we evaluate the proposed model with different inputs and references to show that our denoising model generalizes to inputs with different levels of SNR and yields images with better quality than other methods. In the final part of this dissertation, a prior-guided and slice-wise adaptive outlier cleaning (PAOCSL) method is proposed to improve the original adaptive outlier cleaning (AOC) method. Reference CBF maps guided by prior information are used to avoid bias from extreme outliers in the early iterations of outlier cleaning, ensuring correct identification of the true outliers. Slice-wise outlier rejection is adopted to preserve slices with CBF values in a reasonable range even when they lie within outlier volumes. Experimental results show that the proposed outlier cleaning method improves both CBF quantification quality and CBF measurement stability.
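The learning-from-noise idea, training a denoiser on pairs of independent noisy acquisitions instead of noisy/clean pairs, rests on the fact that under zero-mean noise and a squared loss, the minimizer against noisy targets coincides with the one against the clean signal. A toy numerical sketch of both the pairing and that expectation argument (all values assumed, and a single scalar stands in for an image):

```python
import random

def make_noise2noise_pairs(repeats):
    """Pair independent noisy acquisitions of the same underlying image:
    input = one repeat, training target = another noisy repeat."""
    return [(repeats[i], repeats[i + 1])
            for i in range(0, len(repeats) - 1, 2)]

# Under zero-mean noise, the squared-loss-optimal prediction against
# noisy targets is their mean, which approaches the clean value as
# acquisitions accumulate -- no noise-free reference needed.
random.seed(0)
clean = 5.0
repeats = [clean + random.gauss(0.0, 1.0) for _ in range(2000)]
best_pred = sum(repeats) / len(repeats)
pairs = make_noise2noise_pairs(repeats)
```

In the ASL setting, each element of `pairs` would be a (noisy input volume, noisy target volume) training example, with no averaging step needed to manufacture a quasi-noise-free reference.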
    • Multi-label Learning under Different Labeling Scenarios

      Guo, Yuhong; Vucetic, Slobodan; Dragut, Eduard Constantin; Dong, Yuexiao (Temple University. Libraries, 2015)
      Traditional multi-class classification problems assume that each instance is associated with a single label from a category set Y, where |Y| > 2. Multi-label classification generalizes multi-class classification by allowing each instance to be associated with multiple labels from Y. In many real-world data analysis problems, data objects can be assigned to multiple categories, producing multi-label classification problems. For example, an image for object categorization can be labeled as 'desk' and 'chair' simultaneously if it contains both objects. A news article about the effect of the Olympic games on the tourism industry might belong to multiple categories such as 'sports', 'economy', and 'travel', since it may cover multiple topics. Regardless of the approach used, multi-label learning in general requires a sufficient amount of labeled data to recover high-quality classification models. However, due to label sparsity, i.e., each instance carrying only a small number of labels from the label set Y, it is difficult to prepare sufficient well-labeled data for each class. Many approaches have been developed in the literature to overcome this challenge by exploiting label correlation or label dependency. In this dissertation, we propose a probabilistic model that captures the pairwise interaction between labels so as to alleviate the label sparsity. Besides the traditional setting, which assumes training data are fully labeled, we also study multi-label learning under other scenarios. For instance, training data can be unreliable due to missing values; a conditional restricted Boltzmann machine (CRBM) is proposed to address this challenge. Furthermore, labeled training data can be very scarce due to the cost of labeling, while unlabeled data are abundant. We propose two novel multi-label learning algorithms in the active learning setting to alleviate this problem, one for the standard single-level problem and one for the hierarchical problem.
Our empirical results on multiple multi-label data sets demonstrate the efficacy of the proposed methods.
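The pairwise label interactions that such a probabilistic model exploits are grounded in simple co-occurrence statistics: labels that frequently appear together lend each other evidence at prediction time, easing label sparsity. A minimal sketch of that statistic (the tiny label sets are illustrative assumptions):

```python
from itertools import combinations

def label_cooccurrence(label_sets, n_labels):
    """Symmetric pairwise co-occurrence counts over a collection of
    per-instance label sets -- the empirical signal a
    pairwise-interaction multi-label model builds on."""
    counts = [[0] * n_labels for _ in range(n_labels)]
    for labels in label_sets:
        for i, j in combinations(sorted(labels), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts

# Labels 0 and 1 co-occur often; labels 0 and 2 never do.
counts = label_cooccurrence([{0, 1}, {0, 1}, {1, 2}], 3)
```

A full model would normalize these counts into interaction potentials and combine them with per-instance feature evidence; the CRBM variant additionally conditions on the observed (possibly incomplete) inputs.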

      Dragut, Eduard Constantin; Guo, Yuhong; Zhang, Kai; Shi, Justin Y.; Meng, Weiyi (Temple University. Libraries, 2020)
      Data plays the key role in almost every field of computer science, including the knowledge graph field. The type of data varies across fields: in the knowledge graph field the data are knowledge triples, in computer vision they are visual data such as images and videos, and in natural language processing they are textual data such as articles and news. Data cannot be utilized directly by machine learning models, so data representation learning and feature design for various types of data are two critical tasks in many fields of computer science. Researchers develop various models and frameworks to learn and extract features, aiming to represent the information in defined embedding spaces. Classic models usually embed the data in a low-dimensional space, while in recent years neural network models have been able to generate more meaningful and complex high-dimensional deep features. In the knowledge graph field, almost every approach represents entities and relations in a low-dimensional space, because real-world knowledge graphs contain very large numbers of entities and triples. Recently a few approaches have applied neural networks to knowledge graph learning; however, these models are only able to capture local and shallow features. We observe three important issues in the development of feature learning with neural networks. First, neural networks are not black boxes that work well in every case without specific design; there is still a lot of work to do on how to design more powerful and robust neural networks for different types of data. Second, more studies are needed on utilizing these representations and features in applications. Third, traditional representations and features work better in some domains, while deep representations and features perform better in others; transfer learning is introduced to bridge the gap between domains and adapt various types of features for many tasks.
In this dissertation, we aim to address the above issues. For the knowledge graph learning task, we present several important observations, both theoretical and practical, about current knowledge graph learning approaches, especially those based on convolutional neural networks. Beyond the knowledge graph work, we not only develop different types of feature and representation learning frameworks for various data types, but also develop an effective transfer learning algorithm to utilize the resulting features and representations. The features and representations obtained by neural networks are applied successfully in multiple fields. First, we analyze current issues in knowledge graph learning models and present eight observations about existing knowledge graph embedding approaches, especially those based on convolutional neural networks. Second, we propose a novel unsupervised heterogeneous domain adaptation framework that can deal with features of various types: multimedia features can be adapted, and the proposed algorithm bridges the representation gap between the source and target domains. Third, we propose a novel framework to learn and embed user comments and online news data in units of sessions, predicting the article of interest for users with deep neural networks and attention models. Lastly, we design and analyze a large number of features to represent the dynamics of user comments and news articles. The features span a broad spectrum of facets including news article and comment contents, temporal dynamics, sentiment/linguistic features, and user behaviors. Our main insight is that the early dynamics of user comments contribute the most to an accurate prediction, while news-article-specific factors have surprisingly little influence.
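Low-dimensional knowledge graph embeddings of the kind discussed score a triple by how well the entity and relation vectors fit a simple geometric relation. A TransE-style sketch illustrates the idea; TransE is a standard baseline, not the CNN-based models analyzed in this dissertation, and the toy embeddings below are assumptions.

```python
def transe_score(head, relation, tail):
    """TransE-style L1 score: a plausible triple (head, relation, tail)
    should satisfy head + relation ≈ tail in embedding space, so a
    smaller score means a more plausible triple."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))

# Hand-crafted 2-D embeddings for illustration only.
emb = {
    "paris":      [1.0, 0.0],
    "france":     [1.0, 1.0],
    "berlin":     [5.0, 5.0],
    "capital_of": [0.0, 1.0],
}
true_score = transe_score(emb["paris"], emb["capital_of"], emb["france"])
false_score = transe_score(emb["paris"], emb["capital_of"], emb["berlin"])
```

CNN-based approaches such as those examined here replace this fixed translational form with learned convolutional interactions between the head and relation embeddings, which is where the dissertation's observations about local and shallow features apply.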
    • On Leveraging Representation Learning Techniques for Data Analytics in Biomedical Informatics

      Obradovic, Zoran; Vucetic, Slobodan; Souvenir, Richard M.; Kaplan, Avi (Temple University. Libraries, 2019)
      Representation learning is ubiquitous in state-of-the-art machine learning workflows, including data exploration/visualization, data preprocessing, data model learning, and model interpretation. However, the majority of newly proposed representation learning methods are better suited to problems with a large amount of data, and applying them to problems with limited data may lead to unsatisfactory performance. Therefore, there is a need for representation learning methods tailored to problems with "small data", such as clinical and biomedical data analytics. In this dissertation, we describe our studies tackling challenging clinical and biomedical data analytics problems from four perspectives: data preprocessing, temporal data representation learning, output representation learning, and joint input-output representation learning. Data scaling is an important component of data preprocessing. The objective of data scaling is to scale/transform the raw features into reasonable ranges so that each feature of an instance is equally exploited by the machine learning model. For example, in a credit fraud detection task, a machine learning model may use a person's credit score and annual income as features, but because the ranges of these two features differ, the model may weigh one more heavily than the other. In this dissertation, I thoroughly introduce the data scaling problem and describe an approach to data scaling that intrinsically handles the outlier problem and leads to better model prediction performance. Learning new representations for data in unstandardized form is a common task in data analytics and data science applications. Usually, data come in tabular form: the data are represented by a table in which each row is the feature (row) vector of an instance.
However, it is also common that data are not in this form; examples include texts, images, and video/audio records. In this dissertation, I describe the challenge of analyzing imperfect multivariate time series data in healthcare and biomedical research and show that the proposed method can learn a powerful representation that accommodates various imperfections and leads to improved prediction performance. Learning output representations is a new aspect of representation learning whose applications have shown promising results in complex tasks, including computer vision and recommendation systems. The main objective of an output representation algorithm is to exploit the relationships among the target variables so that a prediction model can efficiently leverage their similarities and potentially improve prediction performance. In this dissertation, I describe a learning framework that incorporates output representation learning into time-to-event estimation. In particular, the approach learns the model parameters and time vectors simultaneously. Experimental results not only show the effectiveness of this approach but also show its interpretability through visualizations of the time vectors in 2-D space. Learning the input (feature) representation, the output representation, and the predictive model are closely related to each other, so it is a natural extension of the state of the art to consider them together in a joint framework. In this dissertation, I describe a large-margin ranking-based learning framework for time-to-event estimation with joint input embedding learning, output embedding learning, and model parameter learning. In this framework, I cast the functional learning problem as a kernel learning problem, and by adopting theories from multiple kernel learning, I propose an efficient optimization algorithm. Empirical results also show its effectiveness on several benchmark datasets.
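One standard way to make the data scaling step intrinsically robust to outliers, as motivated above, is to center each feature by its median and scale by its interquartile range: a single extreme value barely moves either statistic, unlike the min/max or the mean. This is a generic sketch of that idea, not necessarily the dissertation's exact formulation.

```python
def robust_scale(values):
    """Median/IQR scaling of one feature: subtract the median and
    divide by the interquartile range, so inliers land in a small
    range regardless of how extreme the outliers are."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between order statistics.
        idx = q * (n - 1)
        lo, hi = int(idx), min(int(idx) + 1, n - 1)
        frac = idx - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    med, q1, q3 = quantile(0.5), quantile(0.25), quantile(0.75)
    iqr = (q3 - q1) or 1.0          # guard against constant features
    return [(v - med) / iqr for v in values]

# The extreme value 1000 does not distort the scaling of the inliers.
scaled = robust_scale([1.0, 2.0, 3.0, 4.0, 1000.0])
```

Min-max scaling of the same feature would squeeze the four inliers into roughly 0.3% of the unit interval; the median/IQR version keeps them well spread while leaving the outlier identifiable.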