Publications

Model-Based Co-Clustering in Customer Targeting Utilizing Large-Scale Online Product Rating Networks

Journal Paper

Qian Chen, Amal Agarwal (co-first author), Duncan Fong, Wayne DeSarbo and Lingzhou Xue.

Journal of Business & Economic Statistics

Publication year: 2024

Abstract

Given the widely available online customer ratings on products, the individual-level rating prediction and clustering of customers and products are increasingly important for sellers to create targeting strategies for expanding the customer base and improving product ratings. However, the massive missing data problem is a significant challenge for modeling online product ratings. To address this issue, we propose a new co-clustering methodology based on a bipartite network modeling of large-scale ordinal product ratings. Our method extends existing co-clustering methods by incorporating covariates and ordinal ratings in the model-based co-clustering of a weighted bipartite network. We devise an efficient variational EM algorithm for model estimation. A simulation study demonstrates that our methodology is scalable for modeling large datasets and provides accurate estimation and clustering results. We further show that our model can successfully identify different groups of customers and products with meaningful interpretations and achieve promising predictive performance in a real application for customer targeting.

Model‐Based Clustering of Semiparametric Temporal Exponential‐Family Random Graph Models

Journal Paper

Kevin H. Lee, Amal Agarwal, Anna Y. Zhang, Lingzhou Xue

Stat Journal

Publication year: 2022

Abstract

Model-based clustering of time-evolving networks has emerged as one of the important research topics in statistical network analysis. It is a fundamental research question to model time-varying network parameters. However, due to difficulties in modeling functional network parameters, there has been little progress in the current literature on modeling time-varying network parameters effectively. In this work, we model network parameters as univariate nonparametric functions instead of constants. We effectively estimate those functional network parameters in temporal exponential-family random graph models using a kernel regression technique and a local likelihood approach. Furthermore, we propose a semi-parametric finite mixture of temporal exponential-family random graph models by adopting finite mixture models, which simultaneously allows both modeling and detecting groups in time-evolving networks. Also, we use a conditional likelihood to construct an effective model selection criterion and network cross-validation to choose an optimal bandwidth. The power of our method is demonstrated in simulation studies and real-world applications to dynamic international trade networks and dynamic arms trade networks.

Temporal Exponential-Family Random Graph Models with Time-Evolving Latent Block Structure for Dynamic Networks

Journal Paper

Amal Agarwal, Kevin Lee and Lingzhou Xue

Submitted to Journal of Business & Economic Statistics

Publication year: 2020

Abstract

Model-based clustering of dynamic networks has emerged as an important research topic in statistical network analysis. It is critical to effectively and efficiently model the time-evolving latent block structure of dynamic networks in practice. However, the focus of most existing methods is on the static or dynamically invariant block structure. We present a principled statistical clustering of large-scale dynamic networks through the dynamic exponential-family random graph models with a hidden Markov structure. The hidden Markov structure is used to effectively infer the time-evolving block structure of dynamic networks. We prove the identification conditions for both network parameters and transition matrix in our proposed model-based clustering. We propose an effective model selection criterion based on the integrated classification likelihood to choose an appropriate number of clusters. We develop a scalable variational expectation-maximization algorithm to efficiently compute the approximate maximum likelihood estimate. The numerical performance of our proposed method is demonstrated in simulation studies and two real data applications to dynamic international trade networks and dynamic email networks of a large institute.

Model-Based Clustering of Nonparametric Weighted Networks with Application to Water Pollution Analysis

Journal Paper

Amal Agarwal and Lingzhou Xue

Technometrics, Volume 62, Issue 2, Pages 161-172

Publication year: 2020

Abstract

Water pollution is a major global environmental problem, and it poses a great environmental risk to public health and biological diversity. This work is motivated by assessing the potential environmental threat of coal mining through increased sulfate concentrations in river networks, which do not belong to any simple parametric distribution. However, existing network models mainly focus on binary or discrete networks and weighted networks with known parametric weight distributions. We propose a principled nonparametric weighted network model based on exponential-family random graph models and local likelihood estimation, and study its model-based clustering with application to large-scale water pollution network analysis. We do not require any parametric distribution assumption on network weights. The proposed method greatly extends the methodology and applicability of statistical network models. Furthermore, it is scalable to large and complex networks in large-scale environmental studies and geoscientific research. The power of our proposed methods is demonstrated in simulation studies and a real application to sulfate pollution network analysis in the Ohio watershed located in Pennsylvania, United States.

Assessing Contamination of Stream Networks Near Shale Gas Development Using a New Geospatial Tool

Journal Paper

Amal Agarwal, Tao Wen, Alex Chen, Anna Yinqi Zhang, Xianzeng Niu, Xiang Zhan, Lingzhou Xue, Susan L. Brantley

Environmental Science & Technology, Volume 54, Issue 14, Pages 8632-8639

Publication year: 2020

Abstract

Chemical spills in streams can impact ecosystem or human health. Typically, the public learns of spills from industry, media, or government reporting rather than monitoring data. For example, ~1300 spills (76 ≥400 gallons or ~1,500 liters) were reported from 2007 to 2014 by the regulator for natural gas wellpads in the Marcellus shale region of Pennsylvania (U.S.), a region of extensive drilling and hydraulic fracturing. Only one such incident of stream contamination in Pennsylvania has been documented with water quality data in peer-reviewed literature. This could indicate that spills (1) were small or contained on wellpads, (2) were diluted, biodegraded, or obscured by other contaminants, (3) were not detected because of sparse monitoring, or (4) were not detected because of the difficulties of inspecting data for complex stream networks. As a first step addressing the last problem, we developed a geospatial-analysis tool, GeoNet, that analyzes stream networks to detect statistically significant changes between background and potentially-impacted sites. GeoNet was used on data in the Water Quality Portal for the Pennsylvania Marcellus region. With the most stringent statistical tests, GeoNet detected 0.2 to 2% of the known contamination incidents (Na±Cl) in streams. With denser sensor networks, tools like GeoNet could allow real-time detection of polluting events.

Assessing changes in groundwater chemistry in landscapes with more than 100 years of oil and gas development

Journal Paper

Tao Wen, Amal Agarwal (co-first author), Lingzhou Xue, Alex Chen, Alison Herman, Zhenhui Li and Susan L. Brantley

Environmental Science: Processes & Impacts, Volume 21, Issue 2, Pages 384-396

Publication year: 2019

Abstract

With recent improvements in high-volume hydraulic fracturing (HVHF, known to the public as fracking), vast new reservoirs of natural gas and oil are now being tapped. As HVHF has expanded into the populous northeastern USA, some residents have become concerned about impacts on water quality. Scientists have addressed this concern by investigating individual case studies or by statistically assessing the rate of problems. In general, however, the lack of access to new or historical water quality data hinders the latter assessments. We introduce a new statistical approach to assess water quality datasets – especially sets that differ in data volume and variance – and apply the technique to one region of intense shale gas development in northeastern Pennsylvania (PA) and one with fewer shale gas wells in northwestern PA. The new analysis for the intensely developed region corroborates an earlier analysis based on a different statistical test: in that area, changes in groundwater chemistry show no degradation despite that area’s dense development of shale gas. In contrast, in the region with fewer shale gas wells, we observe slight but statistically significant increases in concentrations of some solutes in groundwaters. One potential explanation for the slight changes in groundwater chemistry in that area (northwestern PA) is that it is the regional focus of the earliest commercial development of conventional oil and gas (O&G) in the USA. Alternate explanations include the use of brines from conventional O&G wells as well as other salt mixtures on roads in that area for dust abatement or de-icing, respectively.

Detecting the effects of coal mining, acid rain, and natural gas extraction in Appalachian basin streams in Pennsylvania (USA) through analysis of barium and sulfate concentrations

Journal Paper

Xianzeng Niu, Anna Wendt, Zhenhui Li, Amal Agarwal, Lingzhou Xue, and Susan L. Brantley

Environmental Geochemistry and Health Journal, Volume 40, Issue 2, Pages 865-885

Publication year: 2018

Abstract

To understand how extraction of different energy sources impacts water resources requires assessment of how water chemistry has changed in comparison with the background values of pristine streams. With such understanding, we can develop better water quality standards and ecological interpretations. However, determination of pristine background chemistry is difficult in areas with heavy human impact. To learn to do this, we compiled a master dataset of sulfate and barium concentrations ([SO₄], [Ba]) in Pennsylvania (PA, USA) streams from publicly available sources. These elements were chosen because they can represent contamination related to coal and oil/gas, respectively. We applied changepoint analysis (i.e., likelihood ratio test) to identify pristine streams, which we defined as streams with a low variability in concentrations as measured over years. From these pristine streams, we estimated the baseline concentrations for major bedrock types in PA. Overall, we found that 48,471 data values are available for [SO₄] from 1904 to 2014 and 3,243 data values for [Ba] from 1963 to 2014. Statewide [SO₄] baseline was estimated to be 15.8 ± 9.6 mg/L, but values range from 12.4 to 26.7 mg/L for different bedrock types. The statewide [Ba] baseline is 27.7 ± 10.6 µg/L and values range from 25.8 to 38.7 µg/L. Results show that most increases in [SO₄] from the baseline occurred in areas with intensive coal mining activities, confirming previous studies. Sulfate inputs from acid rain were also documented. Slight increases in [Ba] since 2007 and higher [Ba] in areas with higher densities of gas wells when compared to other areas could document impacts from shale gas development, the prevalence of basin brines, or decreases in acid rain and its coupled effects on [Ba] related to barite solubility. The largest impacts on PA stream [Ba] and [SO₄] are related to releases from coal mining or burning rather than oil and gas development.

Discovery of Causal Time Intervals

Conference

Zhenhui Li, Guanjie Zheng, Amal Agarwal and Lingzhou Xue

SDM’17: the Seventeenth SIAM International Conference on Data Mining, Pages 804-812

Publication year: 2017

Abstract

Causality analysis, beyond “mere” correlations, has become increasingly important for scientific discoveries and policy decisions. Many of these real-world applications involve time series data. A key observation is that the causality between time series could vary significantly over time. For example, rain could cause severe traffic jams during rush hours but have little impact on traffic at midnight. However, previous studies mostly look at the whole time series when determining the causal relationship between them. Instead, we propose to detect the partial time intervals with causality. As it is time-consuming to enumerate all time intervals and test causality for each interval, we further propose an efficient algorithm that can avoid unnecessary computations based on the bounds of F-test in the Granger causality test. We use both synthetic datasets and real datasets to demonstrate the efficiency of our pruning techniques and that our method can effectively discover interesting causal intervals in the time series data.
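
Illustrative sketch (not the paper's algorithm or code): the idea of testing Granger causality on a single time interval can be shown with a lag-1 F-test that compares a model of y on its own past against a model that also uses the past of x. The synthetic series, the lag of 1, the interval boundaries, and the helper names `ols_rss` and `granger_f` are all assumptions made for this example. It evaluates two fixed intervals directly and does not implement the bound-based pruning described in the abstract.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(x, y, start, end):
    """Lag-1 F statistic for 'x Granger-causes y' using samples start..end-1."""
    ts = range(max(start, 1), end)
    targets = [y[t] for t in ts]
    restricted = [[1.0, y[t - 1]] for t in ts]              # y's own past only
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in ts]  # plus x's past
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    n = len(targets)
    # One restriction (the coefficient on x), three parameters in the full model.
    return (rss_r - rss_u) / (rss_u / (n - 3))


# Synthetic data (assumed): x drives y only for 100 <= t < 200.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(300)]
y = [0.0]
for t in range(1, 300):
    effect = 2.0 * x[t - 1] if 100 <= t < 200 else 0.0
    y.append(0.3 * y[t - 1] + effect + rng.gauss(0, 1))

f_inside = granger_f(x, y, 100, 200)
f_outside = granger_f(x, y, 200, 300)
print(f"F on causal interval: {f_inside:.1f}, F elsewhere: {f_outside:.1f}")
```

A full-series test would average the strong and absent effects together, which is the motivation the abstract gives for searching over intervals.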

Asymmetric flows in the intercellular membrane during cytokinesis

Journal Paper

Vidya V. Menon, Soumya S S, Amal Agarwal, Sundar R. Naganathan, Mandar M. Inamdar and Anirban Sain

Biophysical Journal, Volume 113, Issue 12, Pages 2787-2795

Publication year: 2017

Abstract

Eukaryotic cells undergo shape changes during their division and growth. This involves flow of material both in the cell membrane and in the cytoskeletal layer beneath the membrane. Such flows result in redistribution of phospholipid at the cell surface and actomyosin in the cortex. Here we focus on the growth of the intercellular surface during cell division in a Caenorhabditis elegans embryo. The growth of this surface leads to the formation of a double-layer of separating membranes between the two daughter cells. The division plane typically has a circular periphery and the growth starts from the periphery as a membrane invagination, which grows radially inward like the shutter of a camera. The growth is typically not concentric, in the sense that the closing internal ring is located off-center. Cytoskeletal proteins anillin and septin have been found to be responsible for initiating and maintaining the asymmetry of ring closure but the role of possible asymmetry in the material flow into the growing membrane has not been investigated yet. Motivated by experimental evidence of such flow asymmetry, here we explore the patterns of internal ring closure in the growing membrane in response to asymmetric boundary fluxes. We highlight the importance of the flow asymmetry by showing that many of the asymmetric growth patterns observed experimentally can be reproduced by our model, which incorporates the viscous nature of the membrane and contractility of the associated cortex.

Bayesian inversion and sequential Monte Carlo sampling techniques applied to nearfield acoustic sensor arrays

Journal Paper

MR Bai, A Agarwal, CC Chen and YC Wang

Journal of the Acoustical Society of America, Volume 136, Issue 4, Page 2084

Publication year: 2014

Abstract

This paper demonstrates that inverse source reconstruction can be performed using a methodology of particle filters that relies primarily on the Bayesian approach of parameter estimation. The proposed approach is applied in the context of nearfield acoustic holography based on the equivalent source method (ESM). A state-space model is formulated in light of the ESM. The parameters to estimate are amplitudes and locations of the equivalent sources. The parameters constitute the state vector which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) which evolves with iteration. It is evident from the results that the inclusion of the appropriate prior distribution is crucial in the parameter estimation.

Bayesian approach of nearfield acoustic reconstruction with particle filters

Journal Paper

MR Bai, A Agarwal, CC Chen and YC Wang

Journal of the Acoustical Society of America, Volume 133, Issue 6, Pages 4032-4043

Publication year: 2013

Abstract

This paper demonstrates that inverse source reconstruction can be performed using a methodology of particle filters that relies primarily on the Bayesian approach of parameter estimation. In particular, the proposed approach is applied in the context of nearfield acoustic holography based on the equivalent source method (ESM). A state-space model is formulated in light of the ESM. The parameters to estimate are amplitudes and locations of the equivalent sources. The parameters constitute the state vector which follows a first-order Markov process with the transition matrix being the identity for every frequency-domain data frame. Filtered estimates of the state vector obtained are assigned weights adaptively. The implementation of recursive Bayesian filters involves a sequential Monte Carlo sampling procedure that treats the estimates as point masses with a discrete probability mass function (PMF) which evolves with iteration. The weight update equation governs the evolution of this PMF and depends primarily on the likelihood function and the prior distribution. It is evident from the simulation results that the inclusion of the appropriate prior distribution is crucial in the parameter estimation.
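
Illustrative sketch (not the paper's implementation): a toy sequential Monte Carlo filter for a state that stays fixed across data frames, in the spirit of the identity transition described above. Everything specific here is an assumption for the example: the 1/r forward model, the microphone positions in `MICS`, the noise level, the uniform prior, and the small jitter added after resampling to keep particles diverse. The prior enters through the initial particle draw, and each frame reweights the point masses by a Gaussian likelihood.

```python
import math
import random

MICS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical microphone x-positions
HEIGHT = 0.5                         # assumed source stand-off distance
NOISE_STD = 0.2                      # assumed measurement noise level


def forward(loc, amp):
    """Toy forward model: 1/r decay from one point source to each microphone."""
    return [amp / math.hypot(m - loc, HEIGHT) for m in MICS]


def simulate(loc, amp, n_frames=30, seed=0):
    """Generate noisy measurement frames for a fixed source."""
    rng = random.Random(seed)
    return [[h + rng.gauss(0, NOISE_STD) for h in forward(loc, amp)]
            for _ in range(n_frames)]


def particle_filter(frames, n_particles=2000, jitter=0.02, seed=1):
    """Estimate (location, amplitude) with reweighting and resampling."""
    rng = random.Random(seed)
    # Point masses drawn from a uniform prior over location and amplitude.
    particles = [(rng.uniform(-1.0, 1.0), rng.uniform(0.5, 2.0))
                 for _ in range(n_particles)]
    for z in frames:
        # Gaussian log-likelihood of this frame under each particle.
        logw = [-sum((zi - hi) ** 2 for zi, hi in zip(z, forward(*p)))
                / (2 * NOISE_STD ** 2) for p in particles]
        top = max(logw)  # subtract the maximum for numerical stability
        weights = [math.exp(lw - top) for lw in logw]
        # Resample in proportion to the weights, then jitter slightly.
        picked = rng.choices(particles, weights=weights, k=n_particles)
        particles = [(loc + rng.gauss(0, jitter), amp + rng.gauss(0, jitter))
                     for loc, amp in picked]
    est_loc = sum(p[0] for p in particles) / n_particles
    est_amp = sum(p[1] for p in particles) / n_particles
    return est_loc, est_amp


frames = simulate(loc=0.3, amp=1.2)
est_loc, est_amp = particle_filter(frames)
print(f"estimated location={est_loc:.2f}, amplitude={est_amp:.2f}")
```

Narrowing or widening the uniform prior changes how many particles start near the true state, which gives a rough feel for why the abstract stresses the choice of prior.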

Amal

ML Engineer
