Projects

Journal papers and conference proceedings

Regularized Simple Graph Convolution (SGC) for improved interpretability of large datasets

Attributed graphs are a powerful tool for modeling many complex systems. They contain two valuable sources of information, i.e., the node features and the network structure, which Graph Neural Networks (GNNs) can analyze efficiently to produce accurate inference for many downstream tasks such as node classification and link prediction. The exponential growth of graph-structured data drives up the complexity of GNN models, raising concerns about processing time and the interpretability of the results. This work extends a fast GNN framework, the Simple Graph Convolution (SGC), by incorporating a flexible shrinkage scheme to produce a sparser model whose fitted parameters highlight the key node features that determine class characteristics.


Application of SGC on the Cora citation dataset. The fitted weight vector of each class is plotted as a heatmap.	The proposed regularization significantly reduces the number of nonzero weights and highlights the important features defining class membership.
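The idea can be illustrated with a minimal sketch. SGC propagates the node features K times through the normalized adjacency matrix and then fits a linear softmax classifier on the result. The shrinkage scheme used in this work is not detailed here, so the sketch below substitutes a plain L1 penalty fitted by proximal gradient descent as a stand-in; the toy graph, the features, and the function names `sgc_features` and `fit_l1_softmax` are all illustrative assumptions.

```python
import numpy as np


def sgc_features(adj, X, k=2):
    """Propagate features k times with the symmetrically normalized
    adjacency matrix (self-loops added), as in SGC preprocessing."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    s = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    out = np.asarray(X, dtype=float)
    for _ in range(k):
        out = s @ out
    return out


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_l1_softmax(feats, y, n_classes, lam=0.01, lr=0.1, n_iter=500):
    """L1-penalized softmax regression via proximal gradient descent.
    The L1 penalty is an assumed stand-in for the shrinkage scheme."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(n_iter):
        grad = feats.T @ (softmax(feats @ W) - Y) / n
        W = W - lr * grad
        # Soft-thresholding sets small weights exactly to zero.
        W = np.sign(W) * np.maximum(np.abs(W) - lr * lam, 0.0)
    return W


# Toy graph (illustrative only): two triangles joined by one edge.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

rng = np.random.default_rng(0)
# Two informative columns followed by two noise columns.
X = np.column_stack([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1],
                     rng.random(6), rng.random(6)])
y = np.array([0, 0, 0, 1, 1, 1])

feats = sgc_features(adj, X, k=2)
W_dense = fit_l1_softmax(feats, y, n_classes=2, lam=0.0)
W_sparse = fit_l1_softmax(feats, y, n_classes=2, lam=0.05)
print("nonzero weights without penalty:", np.count_nonzero(W_dense))
print("nonzero weights with penalty:", np.count_nonzero(W_sparse))
```

Each column of the fitted weight matrix corresponds to one class, so plotting it as a heatmap shows which features survive the penalty for that class.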

Exploring the value of nodes with multicommunity membership for classification with graph convolutional neural networks

The sampling process plays an important role in the machine learning pipeline, as it provides the quality training samples used to build the model. However, how to determine the best sampling method has rarely been studied in the context of graph neural networks. In this work, we evaluate the effect of multiple sampling strategies on the node classification task using SGC. Our findings include an indicative measure of sampling efficiency and a heuristic criterion based on network topology that can suggest the best sampling strategy to practitioners.

Dataset	Predicted optimal strategy	Actual optimal strategy
Cora	Descending	Descending
Citeseer	Descending	Descending
Pubmed	Ascending	Ascending
Amazon-pc	Descending	Ascending
Amazon-photo	Ascending	Ascending
Coauthor-cs	Descending	Descending
Coauthor-physics	Ascending	Descending
Lastfm_Asia	Ascending	Ascending
Deezer_Europe	Descending	Descending
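As a quick check of the table above, the following short snippet counts how often the predicted strategy agrees with the actual optimal one. It only restates the table's rows; the variable names are illustrative.

```python
# (predicted, actual) optimal strategy per dataset, copied from the table.
results = {
    "Cora": ("Descending", "Descending"),
    "Citeseer": ("Descending", "Descending"),
    "Pubmed": ("Ascending", "Ascending"),
    "Amazon-pc": ("Descending", "Ascending"),
    "Amazon-photo": ("Ascending", "Ascending"),
    "Coauthor-cs": ("Descending", "Descending"),
    "Coauthor-physics": ("Ascending", "Descending"),
    "Lastfm_Asia": ("Ascending", "Ascending"),
    "Deezer_Europe": ("Descending", "Descending"),
}

matches = [name for name, (pred, actual) in results.items() if pred == actual]
misses = [name for name, (pred, actual) in results.items() if pred != actual]
agreement = len(matches) / len(results)

print(f"{len(matches)}/{len(results)} datasets match ({agreement:.0%})")
print("Mismatches:", misses)
# 7/9 datasets match (78%)
# Mismatches: ['Amazon-pc', 'Coauthor-physics']
```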

Exploring a link between network topology and active learning

Building on the findings of our previous work on the importance of network topology in the selection of training samples, we further study the proposed heuristic measure. We design a comprehensive set of synthetic datasets covering a wide range of network structures and input features. Using both synthetic and real-life datasets, we then derive effective sampling approaches based on the proposed measure to facilitate the task of predicting node labels.

The simulation study suggests a threshold from which the optimal sampling strategy can be determined

Online social networks have a significant effect on human society and have become an important research topic for maintaining the integrity of the common social understanding. In this work, we study the application of Simple Graph Convolution (SGC) and our proposed extension (regularized SGC) to a social network dataset as well as a comprehensive set of synthetic attributed graphs with varying network topologies. Our proposed framework has high predictive capability and also produces a sparser model, which helps practitioners investigate the features relevant to class assignment.


The simulation study covers a wide range of network topologies	The regularized framework has high predictive power and enhances users' understanding of each feature's contribution to class assignment

Link prediction with Simple Graph Convolution and regularized Simple Graph Convolution

Link prediction is an important task on attributed graphs with a wide range of useful applications. In this work, we adapt the flexible regularization mechanism of regularized SGC for link prediction, with the goal of improving the interpretability of the fitted model.

Personal projects

OUC Meter Data Science Competition

The goal of this competition is to identify complex electric theft/tampering patterns in the Orlando Utilities Commission (OUC) meter reading database. We use temporal and spatial outlier detection approaches to discover potential suspects based on the monthly ratio of water-to-electric usage. Next, the daily likelihood of tampering is obtained by analyzing the variation in the daily consumption trend for a given month.
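The outlier detectors used in the competition are not described here. As a rough illustration of the temporal case, the sketch below flags months whose water-to-electric ratio deviates strongly from a meter's own history, using a robust z-score based on the median and the median absolute deviation (MAD). The data, the 3.5 cutoff, and the function names are illustrative assumptions, not the competition's actual method.

```python
import statistics


def robust_z_scores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD to be comparable to a standard deviation.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return the indices whose absolute robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical monthly water-to-electric ratios for one meter.
# A sudden jump suggests electric usage dropped while water usage did not.
ratios = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 4.0, 1.02, 0.98, 1.1, 0.9, 1.0]

print(flag_outliers(ratios))  # [6]
```

The same scoring could be applied spatially by comparing one meter's ratio against those of neighboring meters in the same month instead of against its own history.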

2019 Microsoft Scholarship - Predicting Major Depressive Disorder using genotyping data

In this project, we develop statistical models to classify patients’ disease status using over twenty thousand categorical biomarkers (SNPs). The data present two major challenges: high dimensionality and a large proportion of missing data. We design a novel analytics process that includes effective treatment of missing information and automatic variable selection. The final model has high predictive power and selects a sparse set of informative SNPs. A simulation study validates our proposed approach under a wide range of missing rates and signal-to-noise ratios.

How to wash white clothes? Segregation or No Segregation

The most common advice for caring for white clothes is to wash them separately from colored ones. Given recent advances in laundry detergents and garment dyeing, is this long-standing holdover still relevant? In this project, we employ a blocked \(2^3\) factorial experiment and a two-sample t-test to study the effect of segregation on the whiteness of the garments.
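For readers unfamiliar with the design, here is a small sketch of how a blocked \(2^3\) factorial layout and a two-sample t statistic can be set up. The factor names, the number of blocks, and the whiteness scores are hypothetical, and Welch's unequal-variance form of the t-test is an assumption, since the exact variant used in the project is not stated.

```python
import itertools
import math
import statistics

# Three two-level factors coded as -1 / +1 (names are hypothetical).
factors = ("detergent", "temperature", "segregation")
design = [dict(zip(factors, levels))
          for levels in itertools.product((-1, 1), repeat=3)]

# Blocking: repeat the full set of 8 runs inside each block.
blocks = (1, 2)
runs = [{"block": b, **run} for b in blocks for run in design]


def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up whiteness scores for segregated vs. mixed washes.
segregated = [82.1, 80.5, 83.0, 81.7]
mixed = [79.8, 80.9, 78.6, 80.2]

t_stat, dof = welch_t(segregated, mixed)
print(f"{len(design)} treatment combinations, {len(runs)} runs in total")
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The t statistic is then compared with a t distribution with the computed degrees of freedom to judge whether segregation changes whiteness.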

Topic Modeling with Glassdoor Company Reviews

Employment-oriented online platforms such as Glassdoor, LinkedIn, and Indeed provide anonymous employee reviews, which are extremely helpful for job seekers. However, the number of online reviews is growing rapidly, and reading all available reviews takes a significant amount of time. Hence, we employ topic modeling to extract useful topics that help potential candidates assess hiring companies quickly.


Google’s employees are positive about the working environment and job benefits	On the other hand, they express negative opinions about high living expenses and unclear career growth

Face recognition system

A face recognition system identifies an individual by detecting their face in a digital image or video frame and matching it against a database. It has numerous applications in biometrics, information security, and other areas. In this project, we build a simplified system that takes an image of an arbitrary individual and determines whether they are the target person.

Touring Plan – 2017 Big Data Competition

The competition was held by Touring Plans Co. and the Department of Statistics, University of Central Florida, with the goal of predicting wait times for 4 popular rides at Disney World (Splash Mountain, Kilimanjaro Safaris, Toy Story Mania, Soarin’). The real-life data span more than a hundred variables and more than half a million observations. By employing various feature engineering strategies and regularization techniques, we obtain sparse final models that highlight the impact of key features while achieving the best predictive power.