111 Cummington Street, Boston, MA 02215
Phone: (617) 353-5209, Fax: (617) 353-8100
Email: sray at followed by bu dot edu
Recently Published and Working Papers
PLoS one (May, 2012)
Journal Page | PDF | BibTex
Results: To address this, we developed a new framework, flowScape, for emulating certain key aspects of the human perspective in analyzing flow data, which we implemented in multiple steps. First, flowScape creates a mathematically rigorous map of the high-dimensional flow data landscape based on dense and sparse regions defined by relative concentrations of events around modes. Second, these modal clusters are connected with a global hierarchical structure. This representation allows flowScape to perform ridgeline analysis for both traversing the landscape and isolating cell populations at different levels of resolution. Finally, we extended manual gating with a new capacity for constructing templates that can identify target populations in terms of their relative parameters, as opposed to the more commonly used absolute or physical parameters. This allows flowScape to apply such templates in batch mode for detecting the corresponding populations in a flexible, sample-specific manner. We also demonstrated different applications of our framework to flow data analysis and showed its superiority over other analytical methods.
Conclusions: The human perspective, built on top of intuition and experience, is a very important component of flow cytometric data analysis. By emulating some of its approaches and extending these with automation and rigor, flowScape provides a flexible and robust framework for computational cytomics.
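The idea of grouping events by the density mode they belong to can be illustrated with a small sketch. This is not flowScape's algorithm, which is not described here; it substitutes a generic one-dimensional mean-shift hill climb on a Gaussian kernel density estimate, and the function names, the bandwidth and the toy data are all assumptions made for illustration.

```python
import math

def mean_shift_mode(x, data, bandwidth=0.5, tol=1e-6, max_iter=500):
    """Climb from x to a nearby mode of a Gaussian kernel density estimate."""
    for _ in range(max_iter):
        # Weight every event by its kernel distance to the current position.
        weights = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        new_x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def modal_clusters(data, bandwidth=0.5, merge_tol=1e-2):
    """Label each event by the mode its hill climb converges to."""
    modes, labels = [], []
    for d in data:
        m = mean_shift_mode(d, data, bandwidth)
        for idx, known in enumerate(modes):
            if abs(m - known) < merge_tol:
                labels.append(idx)
                break
        else:
            # No known mode is close enough, so record a new one.
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels

# Hypothetical one-dimensional "events" forming two dense regions.
events = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
modes, labels = modal_clusters(events)
```

With these toy values the events split into two modal clusters, one around each dense region. Real flow data are high-dimensional, so this only conveys the mode-seeking step, not the hierarchical structure or the ridgeline analysis.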
Annals of Applied Statistics 2012, Vol 6, No 2
Journal Page | PDF | Supplement
To appear in Sociological Methods & Research (2012)
Preprint
Journal of Multivariate Analysis Volume 108, July, 2012 Pages 41-52
Preprint | Journal Page
Preprint | Software available from CRAN
 
BMC Bioinformatics 2011, 12:375
Journal Page | pdf |
Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, combining the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with two other feature selection methods, a univariate method similar to Fisher’s discriminant criterion (Fisher) and recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, in terms of both classification performance and the ability to recover the true informative genes. We also applied this hybrid scheme to three cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves performance comparable or superior to that of SVM alone. Consistent with the simulation results, TSP appears to be a better feature selector than Fisher and RFE in two of the three cancer datasets.
Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, k-TSP is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
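A minimal sketch of pair-based feature ranking in the spirit of top scoring pairs may help make the filter step concrete. It assumes the usual TSP score, the absolute difference between classes in the frequency with which one feature is smaller than another, and a greedy choice of disjoint pairs; the function names and the toy data are hypothetical, and the paper's exact procedure may differ.

```python
from itertools import combinations

def tsp_scores(samples, labels):
    """Score each feature pair (i, j) by |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    n_features = len(samples[0])
    by_class = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    scores = {}
    for i, j in combinations(range(n_features), 2):
        # Fraction of samples in each class where feature i is below feature j.
        p = [sum(s[i] < s[j] for s in by_class[c]) / len(by_class[c]) for c in (0, 1)]
        scores[(i, j)] = abs(p[0] - p[1])
    return scores

def top_k_disjoint_pairs(scores, k):
    """Greedily keep the k highest-scoring pairs that share no feature."""
    chosen, used = [], set()
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if len(chosen) == k:
            break
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return chosen

# Toy data: features 0 and 1 swap their ordering between classes, features 2 and 3 do not.
samples = [
    [1, 7, 0, 5], [2, 8, 5, 10], [3, 9, 10, 0],   # class 0
    [7, 1, 0, 5], [8, 2, 5, 10], [9, 3, 10, 0],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]

scores = tsp_scores(samples, labels)
selected = top_k_disjoint_pairs(scores, k=1)
```

In a hybrid scheme, the features appearing in the selected pairs would then be passed to a multivariate classifier such as an SVM; that step is omitted here to keep the sketch free of external dependencies.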
Submitted to Sociological Methodology (2011)
Sankhya, Series A. (2011) Vol 72
Journal Page | pdf |
Methods Mol Biol. (2011) 723:337-47. pmid 21370075
pdf | PubMed | Journal Page |
 
BMC Immunology 2008, 9:8
pdf | PubMed | Journal Page |
Tumor-specific antigens and their specific epitopes are formulation targets for patient-specific cancer vaccines. A selection of prediction servers is available for identification of peptides that bind major histocompatibility complex class I (MHC-I) molecules. However, the lack of standardized methodology and the large number of human MHC-I molecules make the selection of appropriate prediction servers difficult. This study reports a comparative evaluation of thirty prediction servers for seven human MHC-I molecules.
Results
Of 147 individual predictors, 39 have shown excellent, 47 good, 33 marginal, and 28 poor ability to classify binders from non-binders. The classifiers for HLA-A*0201, A*0301, A*1101, B*0702, B*0801, and B*1501 have excellent, and for A*2402 moderate, classification accuracy. In addition, 16 prediction servers predict peptide binding affinity to MHC-I molecules with high accuracy, with correlation coefficients ranging from r=0.55 (B*0801) to r=0.87 (A*0201).
Conclusions
Non-linear predictors outperform matrix-based predictors, and the majority of predictors can be improved by non-linear transformations of their raw prediction scores. The best predictors of peptide binding (both classification and binding affinity) show the best performance in prediction of T-cell epitopes. We propose a new standard for prediction of MHC-I binding: a common scale for normalization of prediction scores that is applicable to both experimental and predicted scores.
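As an illustration of what a non-linear rescaling of binding scores onto a common scale can look like, the sketch below uses the logarithmic transform 1 - log(IC50)/log(50000), which is widely used for MHC-I binding affinities in nM. It is offered as an assumption for illustration only: the scale actually proposed in the paper is not specified here, and the function name is hypothetical.

```python
import math

def ic50_to_unit_score(ic50_nm, max_nm=50000.0):
    """Map an IC50 in nM onto [0, 1]; stronger binders get higher scores."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    score = 1.0 - math.log(ic50_nm) / math.log(max_nm)
    # Clip so that values outside 1..max_nm still land on the common scale.
    return min(1.0, max(0.0, score))

# Hypothetical affinities in nM, from strong to very weak binding.
affinities = [1.0, 50.0, 500.0, 50000.0]
normalized = [ic50_to_unit_score(a) for a in affinities]
```

Because both measured and predicted affinities can be pushed through the same function, scores from different sources become directly comparable, which is the point of a shared scale.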
Journal of the Royal Statistical Society - Series B: 70(1) Page 95-118, February 2008
pdf | ps | arxiv | Journal Page |
Keywords: Global comparison of models, high dimensional data, model selection, mixture models, quadratic distance, quadratic risk, spectral degrees of freedom.
Annals of Statistics 2008, Vol. 36, No. 2, page 983--1006
pdf | ps | Journal Page |
Immunome Research 2007, Oct 29;3(1):9
 
pdf | PubMed | Journal Page |
Results: We have undertaken an exploration of the power gained by taking advantage of a natural representation of the amino acids in terms of their biophysical properties. We used several well-known statistical classifiers using either a naive encoding of amino acids by name or an encoding by biophysical properties. In all cases, the encoding by biophysical properties leads to substantially lower misclassification error.
Conclusion: Representing amino acids by a few important bio-physico-chemical properties provides a natural basis for representing peptides and greatly improves peptide-MHC class I binding prediction.
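The two encodings contrasted above can be sketched as follows. The property set is an assumption: the paper's actual properties are not listed here, so the sketch uses the Kyte-Doolittle hydropathy index and a coarse side-chain charge purely as stand-ins, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (stand-in property, not the paper's choice).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Coarse side-chain charge near neutral pH (illustrative simplification).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}

def encode_by_name(peptide):
    """Naive encoding: one 20-dimensional indicator vector per residue."""
    vec = []
    for aa in peptide:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec

def encode_by_properties(peptide):
    """Property encoding: a few numeric descriptors per residue."""
    vec = []
    for aa in peptide:
        vec.extend((HYDROPATHY[aa], CHARGE.get(aa, 0)))
    return vec

# Example 9-mer (the influenza M1 58-66 epitope, chosen only for illustration).
peptide = "GILGFVFTL"
by_name = encode_by_name(peptide)
by_props = encode_by_properties(peptide)
```

For a 9-mer, the name encoding yields 180 sparse indicator features, while the property encoding yields 18 dense numeric features in which chemically similar residues receive similar values.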
Journal of Machine Learning Research 8(Aug):1687--1723, 2007
pdf | Journal Page | Software
Annals of Statistics 2005, Vol. 33, No. 5, page 2042-2065
pdf | ps | Journal Page |
Proceedings of International Workshop on Mathematical Foundations of Computational Anatomy pp. 136-145, 2006
pdf | Poster |
Proceedings of the SPIE, Vol. 6512, 2007
pdf | Journal Page |