Department of Mathematics and Statistics
111 Cummington Street, Boston, MA 02215
Phone: (617) 353-5209, Fax: (617) 353-8100


Published Documents

Model selection in High-Dimensions: A Quadratic-risk Based Approach

with B.G. Lindsay. Journal of the Royal Statistical Society - Series B: 70(1) Page 95-118, February 2008
In this article we propose a general class of risk measures which can be used for data based evaluation of parametric models. The loss function is defined as generalized quadratic distance between the true density and the proposed model. These distances are characterized by a simple quadratic form structure that is adaptable through the choice of a nonnegative definite kernel and a bandwidth parameter. Using asymptotic results for the quadratic distances we build a quick-to-compute approximation for the risk function. Its derivation is analogous to the Akaike Information Criterion (AIC), but unlike AIC, the quadratic risk is a global comparison tool. The method does not require resampling, a great advantage when point estimators are expensive to comput$

Keywords: Global comparison of models, high dimensional data, model selection, mixture models, quadratic distance, quadratic risk, spectral degrees of freedom.

Quadratic distances on probabilities: a unified foundation

with B.G. Lindsay, M. Markatou, K. Yang, S.C. Chen. To appear in The Annals of Statistics
This work builds a unified framework for the study of quadratic form distance measures as they are used in assessing the goodness of fit of models. Many important procedures have this structure, but the theory for these methods is dispersed and incomplete. Central to the statistical analysis of these distances is the spectral decomposition of the kernel that generates the distance. We show how this determines the limiting distribution of natural goodness of fit tests. Additionally, we develop a new notion, the spectral degrees of freedom of the test, based on this decomposition. The degrees of freedom are easy to compute and estimate, and can be used as a guide in the construction of useful procedures in this class.
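
As an illustration of the quadratic form structure, a kernel quadratic distance between distributions F and G can be written as the double integral of K(s, t) against d(F - G)(s) d(F - G)(t). The sketch below is a minimal plug-in (V-statistic) estimate from two samples, assuming a Gaussian kernel with bandwidth `h`. The function names, the kernel choice, and the two-sample setup are assumptions made for illustration, not the authors' implementation; to assess model fit, one sample could be drawn from the fitted model.

```python
import math

def gaussian_kernel(x, y, h):
    """Gaussian kernel with bandwidth h; x and y are equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * h * h))

def quadratic_distance(xs, ys, h=1.0):
    """Plug-in estimate of the kernel quadratic distance between two samples.

    Expands the quadratic form into three averages of kernel evaluations:
    within xs, between xs and ys, and within ys.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, h) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) - 2.0 * mean_kernel(xs, ys) + mean_kernel(ys, ys)

# Hypothetical toy samples in two dimensions.
sample_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sample_b = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
same = quadratic_distance(sample_a, sample_a)
different = quadratic_distance(sample_a, sample_b)
```

The bandwidth plays the role of the tuning parameter: a larger `h` smooths over fine differences between the samples, while a smaller `h` is more sensitive but noisier.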

Amino acid biophysical properties in the statistical prediction of peptide-MHC class I binding

with Tom Kepler. Immunome Research 2007, Oct 29; 3(1):9
Background: A key step in the development of an adaptive immune response to pathogens or vaccines is the binding of short peptides to molecules of the Major Histocompatibility Complex (MHC) for presentation to T lymphocytes, which are thereby activated and differentiate into effector and memory cells. The rational design of vaccines consists in part in the identification of appropriate peptides to effect this process. There are several algorithms currently in use for making such predictions, but these are limited to a small number of MHC molecules and have good but imperfect prediction power.
Results: We have undertaken an exploration of the power gained by taking advantage of a natural representation of the amino acids in terms of their biophysical properties. We used several well-known statistical classifiers using either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding by biophysical properties leads to substantially lower misclassification error.
Conclusion: Representation of amino acids using a few important bio-physio-chemical properties provides a natural basis for representing peptides and greatly improves peptide-MHC class I binding prediction.

A Nonparametric Statistical Approach to Clustering via Mode Identification

with Bruce G. Lindsay and Jia Li. Journal of Machine Learning Research 8(Aug):1687-1723, 2007
In this paper, we develop a mode-based clustering approach applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This clustering method shares the major advantages of mixture model based clustering. Moreover, it requires no model fitting and ensures that every cluster corresponds to a bump of the density. A hierarchical clustering algorithm is also developed by applying MEM recursively to kernel density estimators with increasing bandwidths. The issue of diagnosing clustering results is investigated. Specifically, a pairwise cluster separability measure is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is developed to guarantee strong separation between clusters. Experiments demonstrate that our clustering approach tends to combine the strengths of mixture-model-based and linkage-based clustering. Tests on both simulated and real data show that the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling.
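
The ascent step described above can be sketched for the special case of a Gaussian kernel density estimate with a common bandwidth. This is a minimal illustration under that assumption, not the paper's code: the E-step computes the posterior weight of each kernel component at the current point, and the M-step moves the point to the weighted mean of the data. The names `modal_em` and `cluster_by_mode`, the bandwidth `h`, and the rounding used to match modes are all hypothetical choices.

```python
import math

def modal_em(start, data, h, tol=1e-8, max_iter=500):
    """Ascend from `start` to a local mode of a Gaussian kernel density estimate."""
    x = list(start)
    for _ in range(max_iter):
        # E-step: unnormalized posterior weight of each kernel centered at a data point.
        weights = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * h * h))
            for xi in data
        ]
        total = sum(weights)
        # M-step: for equal-bandwidth Gaussian kernels the update is a weighted mean.
        new_x = [
            sum(w * xi[d] for w, xi in zip(weights, data)) / total
            for d in range(len(x))
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x

def cluster_by_mode(data, h, digits=3):
    """Label each point by the (rounded) mode its ascent path reaches."""
    labels, seen = [], {}
    for point in data:
        mode = tuple(round(c, digits) for c in modal_em(point, data, h))
        labels.append(seen.setdefault(mode, len(seen)))
    return labels

# Hypothetical one-dimensional data with two well-separated groups.
points = [(0.0,), (0.2,), (-0.2,), (10.0,), (10.2,), (9.8,)]
labels = cluster_by_mode(points, h=0.5)
```

Points that climb to the same mode share a label, so the number of clusters follows from the density estimate rather than from a fitted parametric model. Rerunning with larger values of `h` merges modes, which is the idea behind the hierarchical version.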

Statistics on Anatomic Objects Reflecting Inter-Object Relations

with Ja-Yeon Jeong and Stephen M. Pizer. Proceedings of International Workshop on Mathematical Foundations of Computational Anatomy
Describing the probability densities of multi-object complexes by describing individual objects and their inter-object relationships leads to desirable locality without ignoring the context of an object. We describe a means of decomposing object variations into self effects and neighbor effects. We describe an approach for estimating the self and neighbor effect probability densities for each object in the complex using augmentation and prediction, supported by PGA on m-reps. We apply this method to the inter-day variation of m-reps of male pelvic organs within an individual patient.

The Topography of Multivariate Normal Mixtures

with B.G. Lindsay. The Annals of Statistics Vol. 33, No. 5 - October 2005
Multivariate normal mixtures provide a flexible method of fitting high-dimensional data. It is shown that their topography, in the sense of their key features as a density, can be analyzed rigorously in lower dimensions by use of a ridgeline manifold that contains all critical points as well as the ridges of the density. A plot of the elevations on the ridgeline shows the key features of the mixed density. In addition, by use of the ridgeline we uncover a function that determines the number of modes of the mixed density when there are two components being mixed. A followup analysis then gives a curvature function that can be used to prove a set of modality theorems.
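
For two components with means mu1, mu2 and covariances Sigma1, Sigma2, the ridgeline can be parameterized by alpha in [0, 1] as x(alpha) = [(1 - alpha) Sigma1^{-1} + alpha Sigma2^{-1}]^{-1} [(1 - alpha) Sigma1^{-1} mu1 + alpha Sigma2^{-1} mu2]. The sketch below evaluates this curve in two dimensions. The helper names and the restriction to 2x2 matrices are assumptions made to keep the example self-contained.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def ridgeline_point(alpha, mu1, sigma1, mu2, sigma2):
    """Point on the two-component ridgeline for alpha in [0, 1]."""
    p1, p2 = inv2(sigma1), inv2(sigma2)  # precision matrices
    w1, w2 = 1.0 - alpha, alpha
    # Weighted combination of the two precision matrices.
    precision = [[w1 * p1[i][j] + w2 * p2[i][j] for j in range(2)] for i in range(2)]
    b1, b2 = matvec(p1, mu1), matvec(p2, mu2)
    rhs = [w1 * b1[i] + w2 * b2[i] for i in range(2)]
    return matvec(inv2(precision), rhs)

# Hypothetical component parameters.
mu1, sigma1 = [0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]]
mu2, sigma2 = [4.0, 1.0], [[2.0, -0.5], [-0.5, 1.0]]
curve = [ridgeline_point(k / 10.0, mu1, sigma1, mu2, sigma2) for k in range(11)]
```

The curve starts at mu1 (alpha = 0) and ends at mu2 (alpha = 1). Evaluating the mixture density along `curve` gives the one-dimensional elevation plot mentioned in the abstract, whatever the dimension of the data.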

Distance-based Model-Selection with application to the Analysis of Gene Expression Data

Electronic Thesis June, 2003
Multivariate mixture models provide a convenient method of density estimation and model based clustering as well as providing possible explanations for the actual data generation process. But the problem of choosing the number of components ($g$) in a statistically meaningful way is still a subject of considerable research. Available methods for estimating $g$ include optimizing AIC and BIC, estimating the number through nonparametric maximum likelihood, hypothesis testing, and Bayesian approaches with entropy distances. In our current research we present several rules for selecting a finite mixture model, and hence $g$, based on estimation and inference using a quadratic distance measure. In one methodology the goal is to find the minimal number of components that are needed to adequately describe the true distribution based on a nonparametric confidence set for the true distribution. We also present results for selecting $g$ based on a risk analysis that includes a penalty for overfitting. Another less formal methodology is based on the concordance measure, which is analogous to $R^2$ in regression. Moreover, we develop diagnostics for purposes of outlier detection. These diagnostics help to distinguish between outliers and true clusters, and they provide insight into the initial values for iterative estimation of additional components. In this dissertation we also develop tools for determining the number of modes in a mixture of multivariate normal densities. We use these criteria to select clusters which display distinct modes. Finally, we fine-tune our methods to analyze gene-expression data from micro-arrays, and compare them with other competitive methods.

Improved power and sensitivity in multinomial goodness-of-fit tests.

with A. Basu, C. Park and S. Basu. Journal of the Royal Statistical Society, Series D (The Statistician), 51, 2002, 381-393.
Pearson's chi-square and the log likelihood ratio chi-square statistics are fundamental tools in goodness-of-fit testing. Cressie and Read (1984) constructed a general family of divergences which includes both statistics as special cases. This family is indexed by a single parameter, and divergences at either end of the scale are more powerful against alternatives of one type while being rather poor against the opposite type. Here we present several new goodness-of-fit testing procedures which have reasonably high power against both kinds of alternatives. Graphical studies illustrate the advantages of the new methods.
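
The Cressie-Read family referred to above can be sketched as follows. With observed counts O_i, expected counts E_i, and index lambda, the statistic is 2 / (lambda (lambda + 1)) times the sum of O_i [(O_i / E_i)^lambda - 1]; lambda = 1 gives Pearson's chi-square and the limit lambda -> 0 gives the log likelihood ratio statistic. This shows only the classical family, not the new procedures proposed in the paper, and the function name and example counts are hypothetical.

```python
import math

def power_divergence(observed, expected, lam):
    """Cressie-Read power divergence statistic for multinomial counts.

    Assumes the observed and expected counts have the same total and that
    all expected counts are positive.
    """
    if lam == 0:
        # Limiting case: log likelihood ratio statistic (empty cells contribute 0).
        return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    if lam == -1:
        raise ValueError("lam = -1 is a separate limiting case, not handled in this sketch")
    scale = 2.0 / (lam * (lam + 1.0))
    return scale * sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))

# Hypothetical counts: 100 observations in 4 cells under a uniform null.
observed = [18, 22, 30, 30]
expected = [25.0, 25.0, 25.0, 25.0]
pearson = power_divergence(observed, expected, 1)
likelihood_ratio = power_divergence(observed, expected, 0)
```

Varying `lam` on the same table shows how members of the family weight over- and under-filled cells differently, which is the source of the power trade-off the abstract describes.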

Working Papers

Diffusion kernels and quadratic distances as building blocks for high dimensional inference.

with B.G. Lindsay, S.C. Chen, M. Markatou, K. Yang.
Modern scientific work has presented statistics with many important challenges, but of particular importance are the challenges presented by "large magnitude", both in the dimension of data vectors and in the number of vectors (see, e.g., Lindsay et al 2004). This paper develops statistical distances that are especially designed for assessment of model fit in high dimensional data. At the same time it develops a new set of probability models, based on mixtures of diffusion kernels, for fitting high dimensional data.

Spectral degrees of freedom and high-dimensional smoothing.

with B.G. Lindsay, S.C. Chen, K. Yang.
This paper is concerned with the selection of tuning parameters in statistical distance measures, where the tuning parameter plays a key role in determining the tradeoff between the sensitivity of the theoretical distance and the variability generated by its estimation. Our focus will be on quadratic distances (Lindsay et al 2005), a large and general class whose relevance is even further enhanced by the fact that many other distances are locally quadratic. The problem of choosing the tuning parameter is analogous to choosing the bin-width of each cell (or, equivalently, choosing the number of cells) in a $\chi^2$ goodness-of-fit test. If the number of cells is small, then the test may be unable to detect important discrepancies between two distributions because too much has been "smoothed out". It should be noted that here we want to define the degrees of freedom in a multivariate situation, where chi-squared methods would require constructing higher dimensional bins (hypercubes or hyperballs). We use kernels to define the bins. Our method requires selecting not the number of bins but the effective bin-width. Key to this selection is the notion of spectral degrees of freedom.

Bayes Factors in Structural Equation Models (SEMs): Schwarz's BIC and Other Approximations

with K. Bollen and J. Zavisca
Model fit and comparisons are subjects of much debate in the Structural Equation Models (SEMs) literature. Researchers typically apply likelihood ratio tests and numerous fit indices to assess the adequacy of a model's fit. Schwarz's (1978) Bayesian Information Criterion or BIC is one such measure of fit. The BIC measure is an approximation to the Bayes Factor. The Bayes Factor is $B_{12} = \Pr(D \mid H_1) / \Pr(D \mid H_2)$, where $\Pr(D \mid H_k)$ is the probability of the data ($D$) if hypothesis or model $H_k$ is true. The BIC measure is the best known approximation to the Bayes Factor in SEM. However, the BIC is derived under simplifying assumptions that permit its calculation without explicit prior probabilities. Furthermore, the BIC derives from other approximations to make it simple to apply in SEMs [$\mathrm{BIC} = T - t \ln(N)$, where $T$ is the likelihood ratio test statistic, $t$ is the number of independent estimated parameters, $\ln$ is the natural log, and $N$ is the sample size]. It is possible to develop other approximations to the Bayes Factor that make use of fewer approximations than the BIC and thus hold the potential to be more accurate. In this paper, we develop two such approximations, Approximate Bayes Factor 1 and 2, or ABF1 and ABF2. The paper provides the rationale for the BIC, ABF1, and ABF2, discusses their calculations using standard SEM software, illustrates and compares these measures for simulation examples, and finally discusses the evidence in favor of or against these approximations to the Bayes Factor. Our position is that the Bayes Factor could be a useful addition to the SEM literature, yet we need to evaluate the quality of measures that approximate it. We conclude with recommendations for the researcher.
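
The BIC expression quoted in the abstract is simple enough to compute by hand from standard SEM output. The snippet below is a minimal sketch of that arithmetic only; the function name and the example values are hypothetical, and it does not implement ABF1 or ABF2, whose formulas are not given here.

```python
import math

def sem_bic(lr_statistic, n_parameters, sample_size):
    """BIC as stated in the abstract: T - t * ln(N)."""
    return lr_statistic - n_parameters * math.log(sample_size)

# Hypothetical values: T = 30.0, t = 5 parameters, N = 200 cases.
bic_value = sem_bic(30.0, 5, 200)
```

Because the penalty term grows with ln(N), the same test statistic T yields a smaller BIC in a larger sample.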

HDLSS geometry and two-way mixtures of normals for microarray data analysis

with J.S. Marron
This research was initiated by the analysis of the NCI60 cancer dataset. The dataset contains gene expression values (from cDNA arrays) corresponding to 3509 genes collected from 60 different patients diagnosed with 8 different cancer types (assumed unknown in the following discussion). The goal is to provide a model based approach for simultaneously clustering cancer types (columns) and the genes (rows) involved in differentiating these cancer types. We formulate a novel two-way mixture framework and adapt our distance-based model selection tool to determine the unknown number of row and column clusters. This methodology avoids two major pitfalls of using model-based clustering in high dimensions. First, the two-way mixture has a considerably smaller parameter set, compared to the full multivariate analysis, making all parameters estimable. Second, unlike the complex distribution of likelihood-ratio-based tools under the composite null hypothesis of fixed row and column clusters, the distribution of our distance-based model selection tool is well defined, even for composite hypotheses. Finally, based on the geometry of pure Gaussian HDLSS data, we provide an effective visual diagnostic tool to uncover any remaining structure in the data. Through our analysis, we uncovered some interesting sets of gene clusters. But some of our cancer-type clusters did not match the initial cancer labels. On later verification we found that such discordance was due to the close similarity in symptoms and pathological test results of the two types of cancer in question.

Collaborations on Medical Imaging

| MIDAG |Object Shape Tutorial| Select Bibliography |Mixture talk|