Projects

Journal papers and conference proceedings

Regularized Simple Graph Convolution (SGC) for improved interpretability of large datasets

Attributed graphs are a powerful tool for modeling many complex systems. They contain two valuable sources of information, i.e., the node features and the network structure, which can be analyzed efficiently by Graph Neural Networks (GNNs) to produce accurate inference for many downstream tasks such as node classification and link prediction. The exponential growth of graph-structured data drives up the complexity of GNN models, raising concerns about processing time and the interpretability of the results. This work extends a fast GNN framework, namely Simple Graph Convolution (SGC), by incorporating a flexible shrinkage scheme to produce a sparser model whose fitted parameters highlight the key node features that determine class characteristics.

sgc.png reg-sgc.png
Application of SGC to the Cora citation dataset. The fitted weight vector of each class is plotted as a heatmap. The proposed regularization significantly reduces the number of nonzero weights and highlights the important features defining class membership.
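To make the idea concrete, here is a minimal sketch of SGC with a shrinkage penalty. It assumes the standard SGC formulation (a softmax classifier on features propagated k times through the symmetrically normalized adjacency matrix with self-loops) and uses a plain L1 penalty with a proximal-gradient update as a stand-in for the flexible shrinkage scheme, whose exact form is not given here. The toy graph, the function names, and the values of `lam`, `lr`, and `n_iter` are all illustrative assumptions.

```python
import numpy as np

def sgc_features(adj, X, k=2):
    """Return S^k X, where S = D^{-1/2} (A + I) D^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))  # degrees are >= 1
    S = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(k):
        X = S @ X                                    # one propagation step
    return X

def soft_threshold(W, t):
    """Proximal operator of the L1 norm: shrink every entry toward zero by t."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def fit_l1_softmax(Z, y, n_classes, lam=0.01, lr=0.1, n_iter=500):
    """Softmax regression with an L1 penalty, fitted by proximal gradient descent."""
    n, d = Z.shape
    Y = np.eye(n_classes)[y]                         # one-hot labels
    W = np.zeros((d, n_classes))
    for _ in range(n_iter):
        logits = Z @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        grad = Z.T @ (P - Y) / n                     # cross-entropy gradient
        W = soft_threshold(W - lr * grad, lr * lam)
    return W

# Toy graph (hypothetical data): two triangles joined by a single edge.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
X[:, 0] += 2.0 * y                                   # only feature 0 carries class signal

Z = sgc_features(adj, X, k=2)
W_dense = fit_l1_softmax(Z, y, 2, lam=0.0)           # no shrinkage
W_sparse = fit_l1_softmax(Z, y, 2, lam=0.05)         # with shrinkage
print(np.count_nonzero(W_dense), np.count_nonzero(W_sparse))
```

Plotting the rows of `W_sparse` as a heatmap gives the kind of per-class view described in the caption above: weights that were shrunk to exactly zero drop out, leaving the features that matter for each class.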

Exploring the value of nodes with multicommunity membership for classification with graph convolutional neural networks

The sampling process plays an important role in the machine learning pipeline, as it provides the quality training samples used to build the model. However, how to determine the best sampling method has rarely been studied in the context of graph neural networks. In this work, we evaluate the effect of multiple sampling strategies on the node classification task using SGC. Our findings include an indicative measure of sampling efficiency and a heuristic criterion based on network topology, which can be used to suggest the best sampling strategy to practitioners.

Dataset            Predicted optimal strategy   Actual optimal strategy
Cora               Descending                   Descending
Citeseer           Descending                   Descending
Pubmed             Ascending                    Ascending
Amazon-pc          Descending                   Ascending
Amazon-photo       Ascending                    Ascending
Coauthor-cs        Descending                   Descending
Coauthor-physics   Ascending                    Descending
Lastfm_Asia        Ascending                    Ascending
Deezer_Europe      Descending                   Descending
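As a quick way to read the table, the following sketch tallies how often the predicted strategy matches the actual one. The values are copied from the table above; the variable names are illustrative.

```python
# (predicted, actual) optimal strategy per dataset, copied from the table.
results = {
    "Cora": ("Descending", "Descending"),
    "Citeseer": ("Descending", "Descending"),
    "Pubmed": ("Ascending", "Ascending"),
    "Amazon-pc": ("Descending", "Ascending"),
    "Amazon-photo": ("Ascending", "Ascending"),
    "Coauthor-cs": ("Descending", "Descending"),
    "Coauthor-physics": ("Ascending", "Descending"),
    "Lastfm_Asia": ("Ascending", "Ascending"),
    "Deezer_Europe": ("Descending", "Descending"),
}

matches = [name for name, (pred, actual) in results.items() if pred == actual]
misses = [name for name in results if name not in matches]

print(f"{len(matches)}/{len(results)} datasets agree")  # 7/9 datasets agree
print("Disagreements:", misses)  # ['Amazon-pc', 'Coauthor-physics']
```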

Building on the findings of previous work on the importance of network topology in the selection of training samples, we further study the proposed heuristic measure. We design a comprehensive set of synthetic datasets covering a wide range of network structures and input features. Utilizing both synthetic and real-life datasets, we then derive effective sampling approaches based on the proposed measure to facilitate the task of predicting node labels.

exp-link.png
A simulation study suggests a threshold from which the optimal sampling strategy can be determined.

Exploring the Regularized SGC in application to social network data

Online social networks have a significant effect on human society and have become an important research topic for maintaining the integrity of the common social understanding. In this work, we study the application of Simple Graph Convolution (SGC) and our proposed extension (regularized SGC) to a social network dataset as well as a comprehensive set of synthetic attributed graphs with varying network topologies. Our proposed framework has high predictive capability and also produces a sparser model, helping practitioners investigate the important features relevant to class assignment.

regsgc_graphtopo.png regsgc_output.png
The simulation study covers a wide range of network topologies. The regularized framework has high predictive power and enhances users' understanding of each feature's contribution to class assignment.

Link prediction is an important task on attributed graphs with a wide range of useful applications. In this work, we adapt the flexible regularization mechanism of regularized SGC for link prediction, with the goal of improving the interpretability of the fitted model.

link_pred_diagram.png

Personal projects

OUC Meter Data Science Competition

The goal of this competition is to identify complex electricity theft/tampering patterns in the Orlando Utilities Commission (OUC) meter reading database. We use temporal and spatial outlier detection approaches to discover potential suspects based on the monthly ratio of water-to-electric usage. Next, the daily likelihood of tampering is obtained by analyzing the variation in the daily consumption trend for a given month.

ouc_dashboard.png
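The temporal screening step can be illustrated with a small sketch. The detector used in the competition entry is not specified here, so this example assumes a robust z-score (median and MAD) applied to one meter's monthly water-to-electric ratios; the function names, the 3.5 cutoff, and the readings are all hypothetical.

```python
from statistics import median

def robust_z_scores(values):
    """Robust z-scores based on the median and the median absolute deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]  # no spread, so nothing stands out
    # 0.6745 rescales the MAD to match the standard deviation under normality.
    return [0.6745 * (v - med) / mad for v in values]

def flag_outlier_months(water, electric, threshold=3.5):
    """Return indices of months whose water-to-electric ratio is unusual."""
    ratios = [w / e for w, e in zip(water, electric)]
    scores = robust_z_scores(ratios)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

# Made-up monthly readings for one meter: electric usage collapses in the
# fifth month while water usage stays flat, so the ratio jumps.
water = [10, 11, 9, 10, 10, 11]
electric = [500, 520, 480, 510, 150, 530]

print(flag_outlier_months(water, electric))  # [4]
```

A spatial version of the same idea would compare each meter's ratio with those of neighboring meters in the same month rather than with its own history.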

2019 Microsoft Scholarship - Predicting Major Depressive Disorder using genotyping data

In this project, we develop statistical models to classify patients’ disease status using over twenty thousand categorical biomarkers (SNPs). The data present two major challenges: high dimensionality and a large proportion of missing data. We design a novel analytics process that includes effective treatment of missing information and automatic variable selection. The final model has high predictive power and selects a sparse set of informative SNPs. A simulation study validates our proposed approach under a wide range of missing rates and signal-to-noise ratios.

snp_result.png

How to wash white clothes? Segregation or No Segregation

The most common advice for caring for white clothes is to wash them separately from colored ones. Given recent advances in laundry detergents and garment dyeing, is this long-standing habit still relevant? In this project, we employ a blocked \(2^3\) factorial experiment and a two-sample t-test to study the effect of segregation on the whiteness of the garments.

doe_anova.png
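For readers unfamiliar with the design, the sketch below builds a \(2^3\) factorial layout, estimates a main effect, and computes a pooled-variance two-sample t statistic. Only the segregation factor comes from the project description; the other two factor names, the blocking rule (splitting runs by the sign of the three-way interaction), and the whiteness scores are assumptions made purely for illustration.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

# Hypothetical factor names; only "segregated" is implied by the project text.
FACTORS = ("segregated", "detergent", "temperature")

# Full 2^3 design: every combination of low (-1) and high (+1) levels.
design = [dict(zip(FACTORS, levels)) for levels in product((-1, 1), repeat=3)]

# One common blocking choice (an assumption here): split the 8 runs into two
# blocks of 4 by the sign of the three-way interaction.
for run in design:
    sign = run["segregated"] * run["detergent"] * run["temperature"]
    run["block"] = 1 if sign > 0 else 2

def main_effect(design, response, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, response) if run[factor] == 1]
    low = [y for run, y in zip(design, response) if run[factor] == -1]
    return mean(high) - mean(low)

def pooled_t_statistic(x, y):
    """Two-sample t statistic assuming equal variances in both groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))

# Made-up whiteness scores, one per run, in the same order as `design`.
whiteness = [71.0, 73.5, 70.2, 72.8, 74.1, 76.0, 73.3, 75.9]

separate = [y for run, y in zip(design, whiteness) if run["segregated"] == 1]
mixed = [y for run, y in zip(design, whiteness) if run["segregated"] == -1]

print("Main effect of segregation:", main_effect(design, whiteness, "segregated"))
print("t statistic:", pooled_t_statistic(separate, mixed))
```

The t statistic would then be compared with a t distribution on `len(separate) + len(mixed) - 2` degrees of freedom to judge whether segregation makes a detectable difference.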

Topic Modeling with Glassdoor Company Reviews

Employment-oriented online platforms such as Glassdoor, LinkedIn, and Indeed provide anonymous employee reviews, which are extremely helpful for job seekers. However, the number of online reviews is increasing exponentially, and reading all available reviews takes a significant amount of time. Hence, we employ topic modeling to extract useful topics that help potential candidates assess hiring companies quickly.

google_pros.png google_cons.png
Google’s employees are positive about the working environment and job benefits. On the other hand, they express negative opinions about high living expenses and unclear career growth.

Face recognition system

A face recognition system identifies an individual by detecting their face in a digital image or video frame and matching it against a database. It has numerous applications in biometrics, information security, etc. In this project, we build a simplified system that takes an image of an arbitrary individual and determines whether they are the target person.

face_test.png
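The matching step can be sketched as follows. This assumes that each detected face has already been converted to a numeric embedding vector by some model (the project description does not say which), and that a candidate is accepted when its cosine similarity to any stored target embedding reaches a threshold. The function names, the 0.8 threshold, and the two-dimensional vectors are illustrative only.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_target(candidate, target_embeddings, threshold=0.8):
    """True if the candidate is close enough to any stored target embedding."""
    best = max(cosine_similarity(candidate, t) for t in target_embeddings)
    return best >= threshold

# Hypothetical low-dimensional embeddings; real face embeddings are much longer.
target_db = [[1.0, 0.1], [0.9, 0.2]]

print(is_target([1.0, 0.0], target_db))  # True
print(is_target([0.0, 1.0], target_db))  # False
```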

Touring Plans – 2017 Big Data Competition

The competition was held by Touring Plans Co. and the Department of Statistics, University of Central Florida, with the goal of predicting wait times for 4 popular rides at Disney World (Splash Mountain, Kilimanjaro Safaris, Toy Story Mania, Soarin’). The real-life data span over a hundred variables with more than half a million observations. By employing various feature engineering strategies and regularization techniques, we obtain final sparse models that highlight the impact of key features while achieving the best predictive power.