• ROBUST ESTIMATION OF THE PARAMETERS OF g-and-h DISTRIBUTIONS, WITH APPLICATIONS TO OUTLIER DETECTION

      Iglewicz, Boris; Zhao, Zhigen; Chervoneva, Inna; Dong, Yuexiao; Heyse, Joseph F. (Temple University. Libraries, 2014)
      The g-and-h distributional family is generated from a relatively simple transformation of the standard normal. By changing the skewness parameter g and the elongation parameter h, this family can approximate a broad spectrum of commonly used distributional shapes, such as the normal, lognormal, Weibull and exponential. Consequently, it is easy to use in simulation studies and has been applied in multiple areas, including risk management, stock return analysis and missing data imputation studies. The currently available methods for estimating g-and-h parameters include the letter-value-based method (LV), the numerical maximum likelihood method (NMLE), and moment methods. Although these methods work well when no outliers or contamination exist, they are not resistant to a moderate amount of contaminated observations or outliers. In addition, NMLE is computationally time consuming when the sample size is large. In this dissertation, a quantile-based least squares (QLS) estimation method is proposed to fit the parameters of the g-and-h distributional family, and its basic properties are derived. The QLS method is then extended to a robust version (rQLS). Simulation studies are performed to compare the performance of the QLS and rQLS methods with that of the LV and NMLE methods in estimating the g-and-h parameters from random samples with or without outliers. In random samples without outliers, QLS and rQLS estimates are comparable to LV and NMLE estimates in terms of bias and standard error. On the other hand, rQLS performs better than the non-robust methods in estimating the g-and-h parameters when a moderate amount of contaminated observations or outliers exist. The flexibility of the g-and-h distribution and the robustness of the rQLS method make it a useful tool in various fields.
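As a rough illustration of the transformation and of the idea of matching quantiles by least squares, the Python sketch below uses the standard Tukey form X = A + B * ((exp(gZ) - 1) / g) * exp(hZ^2 / 2), with the g = 0 case taken as the limit A + B * Z * exp(hZ^2 / 2). The function names, the plotting positions (i - 0.5)/n, and the coarse grid search are assumptions made for this example; this is a plain, non-robust quantile least-squares fit and is not the QLS or rQLS estimator developed in the dissertation.

```python
import math
from statistics import NormalDist

def gh_transform(z, g=0.0, h=0.0, loc=0.0, scale=1.0):
    """Map a standard normal value z to a g-and-h value (Tukey's form)."""
    # For g == 0 the skewing factor (exp(g*z) - 1) / g reduces to z.
    skew = z if abs(g) < 1e-12 else math.expm1(g * z) / g
    return loc + scale * skew * math.exp(h * z * z / 2.0)

def gh_quantile(p, g=0.0, h=0.0, loc=0.0, scale=1.0):
    """Quantile function; valid because the transform is increasing for h >= 0."""
    return gh_transform(NormalDist().inv_cdf(p), g, h, loc, scale)

def fit_gh_quantile_ls(data, g_grid=None, h_grid=None):
    """Illustrative grid-search fit: least squares between order statistics
    and g-and-h quantiles. Hypothetical helper, not the dissertation's QLS."""
    x = sorted(data)
    n = len(x)
    # Standard normal quantiles at the assumed plotting positions (i - 0.5) / n.
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    if g_grid is None:
        g_grid = [round(-1.0 + 0.05 * i, 2) for i in range(41)]   # -1.0 .. 1.0
    if h_grid is None:
        h_grid = [round(0.025 * j, 3) for j in range(21)]         # 0.0 .. 0.5
    mean_x = sum(x) / n
    best = None
    for g in g_grid:
        for h in h_grid:
            # For fixed (g, h) the model is linear in loc and scale,
            # so those two come from ordinary least squares.
            t = [gh_transform(zi, g, h) for zi in z]
            mean_t = sum(t) / n
            sxx = sum((ti - mean_t) ** 2 for ti in t)
            sxy = sum((ti - mean_t) * (xi - mean_x) for ti, xi in zip(t, x))
            scale = sxy / sxx
            loc = mean_x - scale * mean_t
            sse = sum((xi - loc - scale * ti) ** 2 for ti, xi in zip(t, x))
            if best is None or sse < best[0]:
                best = (sse, g, h, loc, scale)
    sse, g, h, loc, scale = best
    return {"g": g, "h": h, "loc": loc, "scale": scale, "sse": sse}
```

Setting g = 1 and h = 0 gives exp(Z) - 1, a shifted lognormal, which is one way to see how the family reaches familiar skewed shapes. Because every order statistic enters the squared error with equal weight, a fit of this kind is sensitive to outliers, which is the weakness a robust variant is meant to address.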
The boxplot (BP) method has been used to detect multiple outliers by controlling the some-outside rate, which is the probability that one or more observations in an outlier-free sample fall into the outlier region. The BP method is distribution dependent. Usually the random sample is assumed to be normally distributed; however, this assumption may not be valid in many applications. The robustly estimated g-and-h distribution provides an alternative approach without distributional assumptions. Simulation studies indicate that, when the normal assumption is not valid, the BP method based on the robustly estimated g-and-h distribution identifies a reasonable number of true outliers while controlling the number of false outliers and the some-outside rate, in comparison with the BP method based on the normal assumption. Another application of the robust g-and-h distribution is as an empirical null distribution in the false discovery rate method (denoted as the BH method hereafter). The performance of the BH method depends on the accuracy of the null distribution. It has been found that theoretical null distributions are often not valid when many thousands, or even millions, of hypothesis tests are performed simultaneously. Therefore, an empirical null distribution approach is introduced that uses a distribution estimated from the data. This is recommended as a substitute for the currently used empirical null methods of fitting a normal distribution or another member of the exponential family. As with the BP outlier detection method, the robustly estimated g-and-h distribution can be used as the empirical null distribution without any distributional assumptions. Several real microarray data examples are used as illustrations. The QLS and rQLS methods are useful tools for estimating g-and-h parameters, especially rQLS because it noticeably reduces the effect of outliers on the estimates.
The robustly estimated g-and-h distributions have multiple applications where distributional assumptions are required, such as boxplot outlier detection and the BH method.
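The some-outside rate can be made concrete with a short Python sketch. It assumes the g-and-h parameters have already been estimated, places fixed fences at quantiles of that fitted distribution, treats observations as independent, and splits the outside probability equally between the two tails. Under those assumptions, if each observation falls outside with probability p, the some-outside rate for a sample of size n is 1 - (1 - p)^n. The function names and the equal-tail split are choices made for this example; the classical boxplot rule instead builds its fences from sample quartiles and a multiple of the interquartile range, and the dissertation's exact procedure is not reproduced here.

```python
import math
from statistics import NormalDist

def gh_quantile(p, g=0.0, h=0.0, loc=0.0, scale=1.0):
    """Quantile of a g-and-h distribution (Tukey's form, h >= 0 assumed)."""
    z = NormalDist().inv_cdf(p)
    skew = z if abs(g) < 1e-12 else math.expm1(g * z) / g
    return loc + scale * skew * math.exp(h * z * z / 2.0)

def some_outside_rate(per_obs_outside_prob, n):
    """P(at least one of n independent observations lies outside fixed fences)."""
    return 1.0 - (1.0 - per_obs_outside_prob) ** n

def gh_fences(n, target_rate=0.05, g=0.0, h=0.0, loc=0.0, scale=1.0):
    """Fences whose some-outside rate equals target_rate for sample size n.

    Assumption for this sketch: the outside probability is split equally
    between the lower and upper tails of the fitted distribution.
    """
    per_obs = 1.0 - (1.0 - target_rate) ** (1.0 / n)
    lower = gh_quantile(per_obs / 2.0, g, h, loc, scale)
    upper = gh_quantile(1.0 - per_obs / 2.0, g, h, loc, scale)
    return lower, upper, per_obs

def flag_outliers(data, lower, upper):
    """Return the observations that fall outside the fences."""
    return [x for x in data if x < lower or x > upper]

# Example with hypothetical fitted values g = 0.5, h = 0.1 and n = 20:
lower, upper, per_obs = gh_fences(20, target_rate=0.05, g=0.5, h=0.1)
flagged = flag_outliers([0.0, 1.0, 50.0, -0.5], lower, upper)
```

With g > 0 the upper fence sits farther from the median than the lower fence, so a right-skewed but outlier-free sample is less likely to have its long upper tail flagged than it would be under symmetric, normal-based fences.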