I only list, reverse chronologically, rigorously peer-reviewed/submitted, full-length papers at reputable journals or conferences. Note that I exclude all short workshop/conference papers, poster/talk abstracts and papers with little/no review.
Given the widely available online customer ratings on products, the individual-level rating prediction and clustering of customers and products are increasingly important for sellers to create targeting strategies for expanding the customer base and improving product ratings. However, the massive missing data problem is a significant challenge for modeling online product ratings. To address this issue, we propose a new co-clustering methodology based on a bipartite network modeling of large-scale ordinal product ratings. Our method extends existing co-clustering methods by incorporating covariates and ordinal ratings in the model-based co-clustering of a weighted bipartite network. We devise an efficient variational EM algorithm for model estimation. A simulation study demonstrates that our methodology is scalable for modeling large datasets and provides accurate estimation and clustering results. We further show that our model can successfully identify different groups of customers and products with meaningful interpretations and achieve promising predictive performance in a real application for customer targeting.
Model-based clustering of time-evolving networks has emerged as one of the important research topics in statistical network analysis. It is a fundamental research question to model time-varying network parameters. However, due to difficulties in modeling functional network parameters, there is little progress in the current literature on modeling time-varying network parameters effectively. In this work, we model network parameters as univariate nonparametric functions instead of constants. We effectively estimate those functional network parameters in temporal exponential-family random graph models using a kernel regression technique and a local likelihood approach. Furthermore, we propose a semi-parametric finite mixture of temporal exponential-family random graph models by adopting finite mixture models, which simultaneously allows both modeling and detecting groups in time-evolving networks. Also, we use a conditional likelihood to construct an effective model selection criterion and network cross-validation to choose an optimal bandwidth. The power of our method is demonstrated in simulation studies and real-world applications to dynamic international trade networks and dynamic arms trade networks.
Model-based clustering of dynamic networks has emerged as an important research topic in statistical network analysis. It is critical to effectively and efficiently model the time-evolving latent block structure of dynamic networks in practice. However, the focus of most existing methods is on the static or dynamically invariant block structure. We present a principled statistical clustering of large-scale dynamic networks through the dynamic exponential-family random graph models with a hidden Markov structure. The hidden Markov structure is used to effectively infer the time-evolving block structure of dynamic networks. We prove the identification conditions for both network parameters and transition matrix in our proposed model-based clustering. We propose an effective model selection criterion based on the integrated classification likelihood to choose an appropriate number of clusters. We develop a scalable variational expectation-maximization algorithm to efficiently compute the approximate maximum likelihood estimate. The numerical performance of our proposed method is demonstrated in simulation studies and two real data applications to dynamic international trade networks and dynamic email networks of a large institute.
Water pollution is a major global environmental problem, and it poses a great environmental risk to public health and biological diversity. This work is motivated by assessing the potential environmental threat of coal mining through increased sulfate concentrations in river networks, which do not belong to any simple parametric distribution. However, existing network models mainly focus on binary or discrete networks and weighted networks with known parametric weight distributions. We propose a principled nonparametric weighted network model based on exponential-family random graph models and local likelihood estimation, and study its model-based clustering with application to large-scale water pollution network analysis. We do not require any parametric distribution assumption on network weights. The proposed method greatly extends the methodology and applicability of statistical network models. Furthermore, it is scalable to large and complex networks in large-scale environmental studies and geoscientific research. The power of our proposed methods is demonstrated in simulation studies and a real application to sulfate pollution network analysis in the Ohio watershed located in Pennsylvania, United States.
Chemical spills in streams can impact ecosystem or human health. Typically, the public learns of spills from industry, media, or government reporting rather than monitoring data. For example, ~1300 spills (76 ≥400 gallons or ~1,500 liters) were reported from 2007 to 2014 by the regulator for natural gas wellpads in the Marcellus shale region of Pennsylvania (U.S.), a region of extensive drilling and hydraulic fracturing. Only one such incident of stream contamination in Pennsylvania has been documented with water quality data in peer-reviewed literature. This could indicate that spills (1) were small or contained on wellpads, (2) were diluted, biodegraded, or obscured by other contaminants, (3) were not detected because of sparse monitoring, or (4) were not detected because of the difficulties of inspecting data for complex stream networks. As a first step addressing the last problem, we developed a geospatial-analysis tool, GeoNet, that analyzes stream networks to detect statistically significant changes between background and potentially-impacted sites. GeoNet was used on data in the Water Quality Portal for the Pennsylvania Marcellus region. With the most stringent statistical tests, GeoNet detected 0.2 to 2% of the known contamination incidents (Na±Cl) in streams. With denser sensor networks, tools like GeoNet could allow real-time detection of polluting events.
With recent improvements in high-volume hydraulic fracturing (HVHF, known to the public as fracking), vast new reservoirs of natural gas and oil are now being tapped. As HVHF has expanded into the populous northeastern USA, some residents have become concerned about impacts on water quality. Scientists have addressed this concern by investigating individual case studies or by statistically assessing the rate of problems. In general, however, the lack of access to new or historical water quality data hinders the latter assessments. We introduce a new statistical approach to assess water quality datasets – especially sets that differ in data volume and variance – and apply the technique to one region of intense shale gas development in northeastern Pennsylvania (PA) and one with fewer shale gas wells in northwestern PA. The new analysis for the intensely developed region corroborates an earlier analysis based on a different statistical test: in that area, changes in groundwater chemistry show no degradation despite that area’s dense development of shale gas. In contrast, in the region with fewer shale gas wells, we observe slight but statistically significant increases in concentrations in some solutes in groundwaters. One potential explanation for the slight changes in groundwater chemistry in that area (northwestern PA) is that it is the regional focus of the earliest commercial development of conventional oil and gas (O&G) in the USA. Alternate explanations include the use of brines from conventional O&G wells as well as other salt mixtures on roads in that area for dust abatement or de-icing, respectively.
To understand how extraction of different energy sources impacts water resources requires assessment of how water chemistry has changed in comparison with the background values of pristine streams. With such understanding, we can develop better water quality standards and ecological interpretations. However, determination of pristine background chemistry is difficult in areas with heavy human impact. To learn to do this, we compiled a master dataset of sulfate and barium concentrations ([SO4], [Ba]) in Pennsylvania (PA, USA) streams from publicly available sources. These elements were chosen because they can represent contamination related to oil/gas and coal, respectively. We applied changepoint analysis (i.e., likelihood ratio test) to identify pristine streams, which we defined as streams with a low variability in concentrations as measured over years. From these pristine streams, we estimated the baseline concentrations for major bedrock types in PA. Overall, we found that 48,471 data values are available for [SO4] from 1904 to 2014 and 3243 data for [Ba] from 1963 to 2014. Statewide [SO4] baseline was estimated to be 15.8 ± 9.6 mg/L, but values range from 12.4 to 26.7 mg/L for different bedrock types. The statewide [Ba] baseline is 27.7 ± 10.6 µg/L and values range from 25.8 to 38.7 µg/L. Results show that most increases in [SO4] from the baseline occurred in areas with intensive coal mining activities, confirming previous studies. Sulfate inputs from acid rain were also documented. Slight increases in [Ba] since 2007 and higher [Ba] in areas with higher densities of gas wells when compared to other areas could document impacts from shale gas development, the prevalence of basin brines, or decreases in acid rain and its coupled effects on [Ba] related to barite solubility. The largest impacts on PA stream [Ba] and [SO4] are related to releases from coal mining or burning rather than oil and gas development.
Causality analysis, beyond “mere” correlations, has become increasingly important for scientific discoveries and policy decisions. Many of these real-world applications involve time series data. A key observation is that the causality between time series could vary significantly over time. For example, rain could cause severe traffic jams during rush hours but have little impact on traffic at midnight. However, previous studies mostly look at the whole time series when determining the causal relationship between them. Instead, we propose to detect the partial time intervals with causality. As it is time-consuming to enumerate all time intervals and test causality for each interval, we further propose an efficient algorithm that can avoid unnecessary computations based on the bounds of the F-test in the Granger causality test. We use both synthetic datasets and real datasets to demonstrate the efficiency of our pruning techniques and that our method can effectively discover interesting causal intervals in the time series data.
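As a rough illustration of the interval idea in this abstract, the sketch below computes the standard Granger F statistic on fixed-width windows and scans them by brute force. It is not the paper's algorithm: the pruning bounds are not reproduced, and the function names, window width, lag order, and synthetic data are all assumptions made for the example.

```python
import numpy as np

def lagged(series, p):
    """Return an (n - p, p) matrix whose columns are lags 1..p of `series`."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def granger_f(x, y, p=1):
    """F statistic for 'x Granger-causes y' with p lags, using plain OLS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[p:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged(y, p)])          # own lags only
    unrestricted = np.hstack([restricted, lagged(x, p)])  # plus lags of x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / dof)

def scan_intervals(x, y, width, step, p=1):
    """Brute-force scan (no pruning): F statistic for every window."""
    results = []
    for start in range(0, len(x) - width + 1, step):
        stop = start + width
        results.append((start, stop, granger_f(x[start:stop], y[start:stop], p)))
    return results

# Hypothetical data: x drives y only in the second half of the series.
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = 0.5 * rng.normal(size=n)
y[200:] += 2.0 * x[199:-1]          # y[t] depends on x[t-1] for t >= 200

results = scan_intervals(x, y, width=100, step=100)
best_start, best_stop, best_f = max(results, key=lambda r: r[2])
```

Here the window with the largest F statistic should fall in the second half of the series, where the dependence was planted. A full treatment would also convert each F statistic into a p-value and correct for testing many intervals.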
Eukaryotic cells undergo shape changes during their division and growth. This involves flow of material both in the cell membrane and in the cytoskeletal layer beneath the membrane. Such flows result in redistribution of phospholipid at the cell surface and actomyosin in the cortex. Here we focus on the growth of the intercellular surface during cell division in a Caenorhabditis elegans embryo. The growth of this surface leads to the formation of a double-layer of separating membranes between the two daughter cells. The division plane typically has a circular periphery and the growth starts from the periphery as a membrane invagination, which grows radially inward like the shutter of a camera. The growth is typically not concentric, in the sense that the closing internal ring is located off-center. Cytoskeletal proteins anillin and septin have been found to be responsible for initiating and maintaining the asymmetry of ring closure but the role of possible asymmetry in the material flow into the growing membrane has not been investigated yet. Motivated by experimental evidence of such flow asymmetry, here we explore the patterns of internal ring closure in the growing membrane in response to asymmetric boundary fluxes. We highlight the importance of the flow asymmetry by showing that many of the asymmetric growth patterns observed experimentally can be reproduced by our model, which incorporates the viscous nature of the membrane and contractility of the associated cortex.
This paper demonstrates that inverse source reconstruction can be performed using a methodology of particle filters that relies primarily on the Bayesian approach of parameter estimation. The proposed approach is applied in the context of nearfield acoustic holography based on the equivalent source method (ESM). A state-space model is formulated in light of the ESM. The parameters to estimate are amplitudes and locations of the equivalent sources. The parameters constitute the state vector which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) which evolves with iteration. It is evident from the results that the inclusion of the appropriate prior distribution is crucial in the parameter estimation.
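The recursive scheme described in this abstract can be sketched as a bootstrap particle filter. The example below is a toy under stated assumptions, not the paper's implementation: it uses one source, a real-valued 1/r forward model instead of a complex frequency-domain ESM propagator, uniform priors, and a small random jitter added to the identity transition. The microphone layout, constants, and function names are all hypothetical.

```python
import numpy as np

MIC_X = np.linspace(-1.0, 1.0, 8)  # hypothetical microphone positions on a line
STANDOFF = 0.2                     # hypothetical distance from array to source line

def forward(amplitude, location):
    """Toy forward model: 1/r decay from one source to each microphone."""
    loc = np.atleast_1d(location)[:, None]
    amp = np.atleast_1d(amplitude)[:, None]
    r = np.sqrt((MIC_X - loc) ** 2 + STANDOFF ** 2)
    return amp / r                 # shape (n_particles, n_mics)

def particle_filter(frames, rng, n_particles=5000, noise_std=0.1, jitter_std=0.02):
    """Estimate (amplitude, location) from a non-empty sequence of data frames."""
    # Assumed uniform priors over amplitude and location.
    amp = rng.uniform(0.2, 2.0, n_particles)
    loc = rng.uniform(-1.0, 1.0, n_particles)
    estimate = None
    for z in frames:
        # Identity state transition plus small jitter (the jitter is an assumption).
        amp = amp + rng.normal(0.0, jitter_std, n_particles)
        loc = loc + rng.normal(0.0, jitter_std, n_particles)
        # Gaussian likelihood, computed in log space to avoid underflow.
        resid = forward(amp, loc) - z
        log_w = -0.5 * np.sum(resid ** 2, axis=1) / noise_std ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Weighted mean of the point masses: the discrete PMF at this frame.
        estimate = (float(w @ amp), float(w @ loc))
        # Multinomial resampling.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        amp, loc = amp[idx], loc[idx]
    return estimate

# Hypothetical experiment: one source with amplitude 1.0 at location 0.3.
rng = np.random.default_rng(1)
clean = forward(1.0, 0.3)[0]
frames = [clean + rng.normal(0.0, 0.1, clean.shape) for _ in range(40)]
est_amp, est_loc = particle_filter(frames, rng)
```

With a narrower or better-centred prior the particles concentrate near the true parameters sooner, which is one way to see the abstract's point that the choice of prior matters.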