BU-Keio-Tsinghua 2023 Abstracts

BOSTON UNIVERSITY-KEIO UNIVERSITY-TSINGHUA WORKSHOP 2023

Probability and Statistics

June 26-30, 2023

Videos

Monday morning: Click here Passcode: A27%X6*c

Monday Afternoon: Click here Passcode: wS.*G^e6

Tuesday morning: Click here Passcode: 3amp?c.p

Tuesday afternoon: Click here Passcode: v!q*Dn8P

Wednesday morning: Click here Passcode: ET4*1Vh#

Thursday morning: Click here Passcode: GY$GX3UJ

Thursday afternoon: Click here Passcode: 8B4&H.oY

Abstracts and slides for talks

Yves Atchade (Boston University) Markov Chain Monte Carlo methods for sparse deep learning Slides
This talk describes two Markov chain Monte Carlo (MCMC) techniques, asynchronous MCMC and cyclical MCMC, that are applicable to many high-dimensional and multimodal distributions that arise in sparse Bayesian modeling. We present some theoretical results for the algorithm in the context of the linear regression model. The analysis also captures well some of the observed shortcomings of the approach. We use the sampler to gain some valuable insights on posterior distributions in deep learning models.

Atsuji, Atsushi (Keio) Some analogues of function theoretic objects and stochastic processes on infinite graphs Slides
Recently, several discrete analogues of objects in classical function theory have been considered in the context of tropical mathematics. Baker and Norine gave a discrete Riemann-Roch theorem on finite graphs. Halburd and Southall, and Laine and Tohge, gave a one-dimensional tropical Nevanlinna theory. In this talk we discuss probabilistic aspects of these objects and give some extensions using stochastic calculus on some infinite graphs. This talk is based on joint work with H. Kaneko (Tokyo University of Science).

Solesne Bourguin (BU) Quantitative fluctuation analysis of multiscale dynamical systems Slides
In this talk, we consider multiscale dynamical systems perturbed by a small Brownian noise and study the limiting behavior of the fluctuations around their deterministic limit from a quantitative standpoint. Using PDE techniques and a second order Poincaré inequality based on Malliavin calculus, we obtain rates of convergence for the central limit theorem satisfied by the slow component in the Wasserstein metric.

Huimin Cheng (Boston University) Masked Mirror Validation in Graphon Estimation Slides
Graphon, short for graph function, provides a generative model for networks. In recent decades, various methods for graphon estimation have been proposed. The success of most graphon estimation methods depends on a proper specification of hyperparameters. Some network cross-validation methods have been proposed, but they suffer from restrictive model assumptions, expensive computational costs, and a lack of theoretical guarantees. To address these issues, we propose a masked mirror validation (MMV) method. Asymptotic properties of the MMV are established. The effectiveness of the proposed method in terms of both computation and accuracy is demonstrated by extensive simulation studies and real experiments.

Mamikon Ginovyan (BU) On the Prediction Error for Singular Stationary Processes Slides
Abstract

Chenlin Gu (Tsinghua) Quantitative homogenization of interacting particle systems
This talk shows that, for a class of interacting particle systems in continuous space, the finite-volume approximations of the bulk diffusion matrix converge at an algebraic rate. The models we consider are reversible with respect to the Poisson measures with constant density, and are of non-gradient type. This approach is inspired by recent progress in the quantitative homogenization of elliptic equations. Along the way, a modified Caccioppoli inequality and a multiscale Poincaré inequality are developed, which are of independent interest. The talk is based on joint work with Arianna Giunti and Jean-Christophe Mourrat.

Hayashi, Kenichi (Keio) Odds-based predictive improvement index for binary regression models Slides
Consider adding new covariates to an established binary regression model to improve prediction performance. Although the difference of the areas under the ROC curve (delta AUC) is typically used to evaluate the degree of improvement in such situations, its power is not high due to the nature of the rank-based statistic. As an alternative to delta AUC, the integrated discrimination improvement (IDI) has been proposed. However, several studies have pointed out that IDI erroneously detects meaningless improvement. In the present study, we propose a novel index for prediction improvement. Our proposed index has Fisher consistency, implying that it overcomes the problems in delta AUC and IDI. Moreover, our proposal also has more attractive properties which are not seen in our previous study.
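
For readers unfamiliar with the two existing indices named in this abstract, the following is a minimal sketch of their standard textbook definitions: delta AUC as a difference of rank-based AUCs, and IDI as a difference of discrimination slopes. It does not implement the index proposed in the talk, and the function names and toy probabilities are illustrative assumptions.

```python
from itertools import product
from statistics import mean


def auc(scores, labels):
    """Rank-based AUC: share of (event, non-event) pairs ranked correctly; ties count 1/2."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if e > n else 0.5 if e == n else 0.0
               for e, n in product(events, non_events))
    return wins / (len(events) * len(non_events))


def discrimination_slope(scores, labels):
    """Mean predicted probability among events minus that among non-events."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    return mean(events) - mean(non_events)


def delta_auc(old, new, labels):
    """Difference in AUC between the extended and the established model."""
    return auc(new, labels) - auc(old, labels)


def idi(old, new, labels):
    """Integrated discrimination improvement: difference of discrimination slopes."""
    return discrimination_slope(new, labels) - discrimination_slope(old, labels)


# Hypothetical predicted probabilities from an old and an extended model.
labels = [0, 0, 0, 1, 1, 1]
p_old = [0.2, 0.3, 0.6, 0.4, 0.7, 0.8]
p_new = [0.1, 0.2, 0.5, 0.6, 0.8, 0.9]
print(delta_auc(p_old, p_new, labels), idi(p_old, p_new, labels))
```

Because the AUC depends only on the ranking of the scores, any strictly increasing transformation of the predicted probabilities leaves it unchanged, which is one way to see why it can be insensitive to improvements in calibration.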

Jonathan Huggins (BU) Reproducible Statistical Inference Slides
If slightly changing a model specification or including more data results in contradictory inferences, then the validity of any conclusions drawn from such inferences is put in doubt: they are not, in a statistical sense, reproducible. Motivated by examples ranging from phylogenetic tree reconstruction to cell type identification, I’ll discuss three ways standard likelihood-based inference methods can produce such non-reproducible results. I will describe the source of these problems – all of which implicate model misspecification – and propose some easy-to-implement solutions to fix them.

Imoto, Tomoaki (University of Shizuoka) A new distribution for modeling cylindrical data Slides
In diverse scientific fields, observations are often represented as points on the circumference of a unit circle. Typical examples are wind direction and event time measured on a 24-h clock. Such data are called circular data. Circular data are often observed together with linear data, such as the wind direction and wind speed at some point. In this talk, a new distribution for modeling a combination of linear and circular observations is proposed. The properties and method of statistical estimation are also shown, and illustrative applications are provided.
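
To illustrate why circular data need their own tools, here is a small sketch of the standard circular mean, computed from the resultant of unit vectors, applied to event times on a 24-h clock. It is not the distribution proposed in the talk; the helper names and the example times are assumptions made for illustration.

```python
import math


def circular_mean(angles):
    """Mean direction in radians, in [0, 2*pi], from the resultant of unit vectors."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % (2 * math.pi)


def hours_to_angle(hours):
    """Map a time on a 24-h clock to an angle on the unit circle."""
    return 2 * math.pi * hours / 24


def angle_to_hours(angle):
    """Map an angle on the unit circle back to a time on a 24-h clock."""
    return 24 * angle / (2 * math.pi)


# Two events close to midnight: 23:00 and 01:00.
times = [23.0, 1.0]
naive = sum(times) / len(times)  # arithmetic mean lands at noon
circ = angle_to_hours(circular_mean([hours_to_angle(h) for h in times]))
print(naive, circ)
```

The arithmetic mean places the average at noon, while the circular mean places it at midnight (reported as a value at or next to 0 or 24 because of floating-point rounding).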

Jianping Jiang (YMSC, Tsinghua) Thermodynamic limit of the first Lee-Yang zero
For the standard ferromagnetic Ising model on Z^d, we completed the rigorous verification of the proposal by Yang-Lee (1952) and Lee-Yang (1952) that singularities of thermodynamic functions are exactly the limits in the real physical parameter space of finite-volume singularities in the complex plane. Based on joint works with Federico Camia and Charles M. Newman.

Kobayashi, Kei (Keio) Novel geometric methods for data analysis focusing on curvature and geodesics in data space Slides
We propose methods for data analysis that involve two types of transformations to the data space metric. The first transformation is based on powered density integration, which can be implemented approximately using empirical graphs. The second transformation involves computing the extrinsic distance after embedding the data space into a metric cone. We present some statistical applications of these transformations, along with their theoretical justification. This study is a collaboration with Henry P. Wynn (LSE).

Ma, Ping (University of Georgia) Subsampling in Large Graphs Slides
In the past decades, many large graphs with millions of nodes have been collected/constructed. The high computational cost and significant visualization difficulty hinder the analysis of large graphs. Researchers have developed many graph subsampling approaches to provide a rough sketch that preserves global properties. By selecting representative nodes, these graph subsampling methods can help researchers estimate the graph statistics, e.g., the number of communities, of the large graph from the subsample. However, the available subsampling methods, e.g., degree node sampler and random walk sampler, tend to leave out minority communities because nodes with high degrees are more likely to be sampled. In this talk, I will present a novel subsampling method based on Ollivier Ricci curvature to overcome the aforementioned shortcomings. Experiments on synthetic and benchmark datasets will be used to demonstrate the advantages of our algorithm.

Kun Meng (Brown) "Randomness and Statistical Inference of Shapes via the Smooth Euler Characteristic Transform" Slides
We provide the foundations for deriving the distributional properties of the smooth Euler characteristic transform. Motivated by functional data analysis, we propose two algorithms for testing hypotheses on random shapes based on these foundations. Simulation studies are provided to support our mathematical derivations and show the performance of our hypothesis testing framework. We apply our proposed algorithms to analyze a data set of mandibular molars from four genera of primates to test for shape differences and interpret the corresponding results from the morphology viewpoint. Our discussions connect the following fields: algebraic and computational topology, probability theory and stochastic processes, Sobolev spaces and functional analysis, statistical inference, morphology, and medical imaging.

Minami, Mihoko and Lennert-Cody, Cleridy E. (Keio) Regression Tree and Clustering for Distributions, and Homogeneous Structure of Population Characteristics Slides
Scientists often collect samples on characteristics of different observation units and wonder whether the characteristics of the observation units have similar distributional structure. We consider methods to find homogeneous subpopulations using regression tree and clustering for distributions based on a modified Jensen-Shannon divergence. We present a standardized measure of distance between clusters and propose a hierarchical testing procedure to find the minimal homogeneous or near-homogeneous tree structure of the distributions of a population characteristic. As a motivational example, we introduce yellowfin tuna fork length data collected from the tuna catch of purse-seine vessels that operated in the eastern Pacific Ocean.

Nakamura, Tomoshige (Juntendo) Variable importance for causal forest Slides
In this presentation, we propose a variable importance measure for causal forests, which are methods for estimating conditional causal effects using random forests. Variable importance is an approach used to prioritize and rank variables that are effective for prediction when employing random forests, and is often utilized due to its intuitiveness and ease of interpretation in data analysis. We will extend the variable importance measure from traditional random forests to causal forests, and propose two types of computation procedures for variable importance to identify variables influencing the variation in conditional causal effects.

Mickey Salins (BU) Stochastic partial differential equations with superlinear forcing Slides
I present some recent global existence and uniqueness results about stochastic partial differential equations with superlinear forcing.

Yoshihiro Shirai (Maryland) Empirical bounds of log-returns local characteristics Slides
Bounds on a set K are defined and estimated with quantile regression based on time series of market prices of equities and options for the purpose of dynamically hedging and pricing future payoffs via the nonlinear expectation corresponding to the collection of laws of semimartingales on Skorohod space whose differential characteristics evolve in the set Θ = {(0, 0, κ_p)}_{p∈K}, where κ_p is the Lévy density of a bilateral gamma process with parameters p. The estimated set K is assessed based on its implied performance measures, risk-reward relationships and valuations. A nonlinear expectation with distorted bilateral gamma characteristics, which introduces uncertainty in the chosen distribution family, is also constructed and empirically estimated.

Shiraishi, Hiroshi (Keio) Asymptotic Property for Generalized Random Forests Slides
We develop asymptotic properties of estimators constructed by Generalized Random Forests (GRF), a method introduced by Athey et al. (2019) to statistically estimate an unknown function defined as a solution to a local estimating equation. In this talk, by using the theory of empirical processes, we discuss the uniform consistency, rate of convergence and weak convergence of the GRF estimator.

Dongyuan Song (UCLA) In silico data generation and statistical model inference for single-cell and spatial omics Slides
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs, and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories, and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.

Sriram, Tharuvai N (University of Georgia) Robust Recovery of the Central Subspace for Regression Using the Influence Function of the Rényi Divergence Slides
A considerable amount of research in the literature has focused on quantifying the effect of extreme observations on classical methods for estimating the Central Subspace (CS) for regression through the study of influence functions and their sample estimates. Alternatively, a method that is inherently robust to data contamination is also important and desirable for increased reliability in the estimation of the CS without relying on the identification and removal of influential values. To this end, we develop a new method that is innately resistant to outlying observations in recovering a dimension reduction subspace for regression based on the Rényi divergence. In addition to deriving the theoretical Influence Function (IF), the Sample Influence Function (SIF) values are directly utilized to provide new powerful and efficient methods for both estimating the dimension of the CS and selecting an optimal level of the tuning parameter to decrease the impact of extreme observations. The model-free approach is detailed theoretically, its performance is investigated through simulation, and the application in practice is demonstrated through real data analysis.

Emily Stephen (BU) Using state space models to understand rhythmic dynamics in the brain Slides
Rhythmic dynamics in the brain are key identifiers of different brain states in health and disease, and understanding their biophysical mechanisms is an important area of neuroscientific research. Because of their non-sinusoidal shapes, interaction across frequencies, and complex interactions across space, classical frequency-domain statistics are often insufficient to fully characterize the repertoire of observed rhythmic dynamics. Here I will present the state-space oscillator framework, a family of time-domain stochastic process models that can be used to construct tailored models of neural rhythms. I will show how these models can be used to separate multiple overlapping rhythms, capture interactions between rhythms at multiple frequencies, and describe functional networks across space in several different ways.

Dan Sussman (BU) Matching embeddings via shuffled total least squares regression Slides
A frequently used approach for graph matching is first to embed the networks as points in Euclidean space and then match the embeddings. We consider the case that the two graphs have related but not identical distributions that necessitate a more complex alignment in the matching step. This is related to the problem known as shuffled linear regression. We consider a modified shuffled regression setting where there is noise in both the response and the predictor variables. This setting better matches the graph matching problem and we provide convergence rates for a shuffled total least squares method in terms of the normalized Procrustes quadratic loss.

Takahashi, Hiroshi (Keio) Diffusion processes in random environments on disconnected self-similar fractal sets in R Slides
We study the limiting behavior of diffusion processes in random environments on disconnected self-similar fractal sets in R. Due to the effect of random environments, the diffusion processes exhibit an ultraslow diffusive behavior. We show that the limiting distributions are given under suitable scalings determined by self-similar fractal sets and measures associated with the sets. This talk is based on joint work with Y. Tamura.

Tanemura, Hideki (Keio) On a model of evolution of subspecies Slides
Ben-Ari and Schinazi (2016) introduced a stochastic model to study a "virus-like evolving population with high mutation rate". This model is a birth and death model in which an individual at birth is either a mutant with a random fitness parameter in [0,1] or has one of the existing fitness parameters with uniform probability, whereas a death event removes the entire population of the least fit site. We change this to incorporate the notion of "survival of the fittest" by requiring that a non-mutant individual, at birth, has a fitness chosen according to a preferential attachment mechanism, i.e., it has fitness f with probability proportional to the size of the population of fitness f. Also, a death event removes just one individual at the least fit site. This preferential attachment rule leads to a power law behaviour in the asymptotics, unlike the exponential behaviour obtained by Ben-Ari and Schinazi (2016).
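
The modified dynamics described above can be sketched as a short discrete-time simulation. This is only an illustration under stated assumptions: the abstract does not give the time scale or the parameters, so the birth probability `p_birth`, the mutation probability `p_mutant`, the uniform fitness of mutants, and the rule that a death event does nothing on an empty population are all hypothetical choices.

```python
import random
from collections import Counter


def simulate(steps, p_birth=0.6, p_mutant=0.1, seed=0):
    """Simulate the preferential-attachment variant; returns fitness -> population size."""
    rng = random.Random(seed)
    sites = Counter()
    for _ in range(steps):
        if rng.random() < p_birth:
            if not sites or rng.random() < p_mutant:
                # Mutant birth: a new site with a fitness drawn uniformly from [0, 1).
                sites[rng.random()] += 1
            else:
                # Non-mutant birth: pick an existing fitness with probability
                # proportional to the current population at that fitness.
                fitnesses = list(sites)
                weights = [sites[f] for f in fitnesses]
                chosen = rng.choices(fitnesses, weights=weights, k=1)[0]
                sites[chosen] += 1
        elif sites:
            # Death: remove a single individual from the least fit site.
            least_fit = min(sites)
            sites[least_fit] -= 1
            if sites[least_fit] == 0:
                del sites[least_fit]
    return sites


population = simulate(2000)
print(len(population), sum(population.values()))
```

Replacing the weighted draw with a uniform draw over existing fitnesses, and the single removal with deletion of the whole least fit site, would give a sketch of the original rule the abstract contrasts against.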

Wu, Hao (Tsinghua) Connection probabilities for 2D critical lattice models Slides
Conformal invariance of critical lattice models in two dimensions has been vigorously studied for decades. The first example where the conformal invariance was rigorously verified was the planar uniform spanning tree (together with loop-erased random walk), proved by Lawler, Schramm and Werner around 2000. Later, the conformal invariance was also verified for Bernoulli percolation (Smirnov 2001), level lines of Gaussian free field (Schramm-Sheffield 2009), and Ising model and FK-Ising model (Chelkak-Smirnov et al. 2012). In this talk, we focus on connection probabilities of these critical lattice models in polygons with alternating boundary conditions.

Wu, Rongling (BIMSA) The hypernetwork model of complex systems Slides
Network models have been widely used as a powerful tool to study complex systems. Existing approaches reconstruct pairwise networks whose interacting pairs of nodes are connected by edges, failing to characterize the high-order architecture of complex systems. In this talk, I will present a statistical physics model that marries evolutionary game theory and ecology theory to leverage the definition and estimation of high-order interactions (HOI). This model can quantitatively reveal both how a node shapes interactions between pairs of other nodes (active HOI) and how a pairwise interaction influences a third node (passive HOI). We coalesce active and passive HOI into hypernetworks, shedding light on the mechanistic understanding of emergent properties of complex systems.

Beibei Xu (UGA) Tail Spectral Density Estimation and Its Uncertainty Quantification: Another Look at Tail Dependent Time Series Analysis Slides

Yang, Fan (Tsinghua) Mediation analysis with the mediator and outcome missing not at random Slides
Mediation analysis is widely used for investigating direct and indirect causal pathways through which an effect arises. However, many mediation analysis studies are challenged by missingness in the mediator and outcome. In general, when the mediator and outcome are missing not at random, the direct and indirect effects are not identifiable without further assumptions. In this work, we study the identifiability of the direct and indirect effects under some interpretable missing not at random mechanisms. We evaluate the performance of statistical inference under those assumptions through simulation studies and illustrate the proposed methods via the National Job Corps Study.

Yang, Fan (Tsinghua) SIMPLE-RC: Group Network Inference with Non-Sharp Nulls and Weak Signals Slides
The recent work of Fan, Fan, Han and Lv (2022) introduced a general framework of statistical inference on membership profiles in large networks (SIMPLE) for testing the sharp null hypothesis that a pair of given nodes share the same membership profiles. In real applications, there are often groups of nodes under investigation that may share similar membership profiles in the presence of relatively weaker signals than the setting considered in SIMPLE. To address these practical challenges, we propose a SIMPLE method with random coupling (SIMPLE-RC) for testing the non-sharp null hypothesis that a group of given nodes share similar (not necessarily identical) membership profiles under weaker signals. Utilizing the idea of random coupling, we construct our test as the maximum of the SIMPLE tests for subsampled node pairs from the group. This technique significantly reduces the correlation among individual SIMPLE tests while largely maintaining the power, enabling delicate analysis of the asymptotic distributions of the SIMPLE-RC test. Our method and theory cover both the cases with and without node degree heterogeneity. These new theoretical developments are empowered by a second-order expansion of spiked eigenvectors, built upon our work for random matrices with weak spikes. Based on joint work with Jianqing Fan, Yingying Fan and Jinchi Lv.

Zhang, Ting (University of Georgia) High-Quantile Regression for Tail-Dependent Time Series Slides
Quantile regression is a popular and powerful method for studying the effect of regressors on quantiles of a response distribution. However, existing results on quantile regression were mainly developed for cases in which the quantile level is fixed, and the data are often assumed to be independent. Motivated by recent applications, we consider the situation where (i) the quantile level is not fixed and can grow with the sample size to capture the tail phenomena, and (ii) the data are no longer independent, but collected as a time series that can exhibit serial dependence in both tail and non-tail regions. To study the asymptotic theory for high-quantile regression estimators in the time series setting, we introduce a tail adversarial stability condition, which had not previously been described, and show that it leads to an interpretable and convenient framework for obtaining limit theorems for time series that exhibit serial dependence in the tail region, but are not necessarily strongly mixing. Numerical experiments are conducted to illustrate the effect of tail dependence on high-quantile regression estimators, for which simply ignoring the tail dependence may yield misleading p-values.
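
As background for the quantile level discussed in this abstract, the sketch below shows the standard check (pinball) loss whose minimizer is a sample quantile, in the simplest intercept-only case. It assumes independent toy data and a fixed level, so it illustrates only the classical setting that the talk moves beyond; the function names are illustrative.

```python
def check_loss(u, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(q, data, tau):
    """Empirical check loss of a candidate quantile q."""
    return sum(check_loss(y - q, tau) for y in data)


def sample_quantile(data, tau):
    """Intercept-only quantile regression: minimize the check loss over the data points."""
    return min(sorted(data), key=lambda q: total_loss(q, data, tau))


data = list(range(1, 101))
for tau in (0.5, 0.9, 0.99):
    print(tau, sample_quantile(data, tau))
```

As the level approaches 1, fewer and fewer observations lie above the fitted quantile, which is one reason the high-quantile regime needs separate asymptotic theory.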

Wenxuan Zhong (University of Georgia) Med-Reader: A query-based multisource learner using complex network
As the volume and velocity of medical publications have increased at an unprecedented pace, extracting knowledge that is buried in these publications is becoming more and more essential for conducting medical research. By integrating an enormous number of published studies and discoveries, we can comprehensively interpret biological processes, cross-validate biological reasonings, contribute multisource biological evidence, and propose advisable biological hypotheses.

Poster presentations

Lin, Nian (Michigan State University) Cross Smoothness Parameter Estimation for Bivariate Gaussian Processes Poster

Lin, Xuanan (Keio University) Application of Particle Filter on the Estimation of Epidemic Statistics with Non-Gaussian Errors
In recent years, the world has witnessed the devastating impact of epidemics on global health and socioeconomic systems. Accurate estimation and prediction of epidemic dynamics play a crucial role in guiding public health interventions and mitigating the spread of infectious diseases. Particle filters have emerged as a valuable tool in modelling and estimating complex epidemic processes. The ability to handle nonlinearity, non-Gaussianity, and complex dynamics makes particle filters particularly well-suited for capturing the intricacies of epidemic spread. Throughout the presentation, we will showcase COVID-19 examples of particle filters applied to epidemic estimation. These case studies will demonstrate the practical utility of particle filters in capturing epidemic dynamics and informing public health decision-making.
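
A bootstrap particle filter, the simplest member of the family mentioned in this abstract, can be sketched as follows. The model here is a generic toy, not an epidemic model and not the one used in the poster: a Gaussian random-walk state observed with Laplace (non-Gaussian) noise. All parameter names and values are assumptions chosen for illustration.

```python
import math
import random


def laplace_pdf(x, scale):
    """Density of a centered Laplace distribution."""
    return math.exp(-abs(x) / scale) / (2 * scale)


def bootstrap_filter(observations, n_particles=500, state_sd=0.1,
                     obs_scale=0.5, seed=0):
    """Return filtered state means for a random-walk state with Laplace observation noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the assumed state dynamics.
        particles = [x + rng.gauss(0.0, state_sd) for x in particles]
        # Weight each particle by the non-Gaussian observation likelihood.
        weights = [laplace_pdf(y - x, obs_scale) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Guard against numerical underflow: fall back to equal weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


estimates = bootstrap_filter([1.0] * 50)
print(estimates[-1])
```

Only the likelihood function would need to change to accommodate a different observation error distribution, which is the flexibility the abstract refers to.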

Takeuchi, Yutaka (Keio University) Quenched Invariance Principle for a Reflecting Diffusion in a Continuum Percolation Cluster
Since the 2000s, many researchers have established quenched results in the study of random media. For instance, quenched invariance principles for random walks in random conductance models and for diffusions in random environments have been shown. We study continuum percolation. Assuming that the occupied region has a unique unbounded cluster and that the cluster satisfies some geometric conditions, we prove a quenched invariance principle for reflecting diffusions on the continuum percolation cluster.

Yoneyama, Shintaro (Keio University) Variable Selection for the Extended Propensity Score to Adjust Missing Not at Random
A pattern of missingness that depends not only on the observed values but also on the missing values themselves is called Missing Not at Random (MNAR). For example, in a test of cognitive function for the elderly, participants with greater cognitive decline are more likely to drop out of the test. In this case, inferences about cognitive function that ignore missing data are not valid because this leads to overestimation of average cognitive function. Therefore, for MNAR data, adjustments for missingness are necessary for valid inference. Sun et al. (2018) proposed a method to estimate the population mean when the outcome is MNAR. They assume that covariates and an instrumental variable (IV), which are used to adjust for missing data, are observed. The IV is a variable that is conditionally independent of the outcome given covariates and is related to the missing pattern. This method adopts a two-step estimation process: estimating the extended propensity score, which is the probability that the outcome is observed conditional on the covariates, outcome and IV, and then estimating the population mean with the extended propensity score estimates. In this study, we examine which covariates should be included in the model for estimating the extended propensity score to obtain unbiased and low variance estimates of the population mean when using the method of Sun et al. (2018).
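
The overestimation described in the cognitive test example can be reproduced with a small simulation. This sketch only illustrates the bias of the complete-case mean under MNAR; it does not implement the method of Sun et al. (2018), and the score distribution and the logistic response mechanism are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean


def simulate_dropout(n=20000, seed=0):
    """Return (mean of all scores, mean of observed scores) under an MNAR mechanism."""
    rng = random.Random(seed)
    # Hypothetical cognitive scores for the full population.
    scores = [rng.gauss(50.0, 10.0) for _ in range(n)]
    observed = []
    for y in scores:
        # The probability of completing the test rises with the score itself,
        # so missingness depends on the value that would go missing.
        p_observed = 1.0 / (1.0 + math.exp(-(y - 50.0) / 10.0))
        if rng.random() < p_observed:
            observed.append(y)
    return mean(scores), mean(observed)


full_mean, complete_case_mean = simulate_dropout()
print(full_mean, complete_case_mean)
```

The complete-case mean exceeds the full-population mean because low scorers are underrepresented among those observed, which is the bias the extended propensity score is meant to correct.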

Our thanks to the following organizations for workshop funding:

Boston University Department of Mathematics and Statistics
Boston University Graduate School of Arts and Sciences
Keio University
Tsinghua University
National Science Foundation