Projects
Journal papers and conference proceedings
Regularized Simple Graph Convolution (SGC) for improved interpretability of large datasets
Attributed graphs are a powerful tool for modeling many complex systems. They contain two valuable sources of information, i.e., the node features and the network structure, which Graph Neural Networks (GNNs) can analyze efficiently to produce accurate inference for downstream tasks such as node classification and link prediction. The exponential growth of graph-structured data drives up the complexity of GNN models, raising concerns about processing time and the interpretability of results. This work extends a fast GNN framework, the Simple Graph Convolution (SGC), by incorporating a flexible shrinkage scheme to produce a sparser model whose fitted parameters highlight the key node features that determine class characteristics.
Application of SGC to the Cora citation dataset. The fitted weight vectors of each class are plotted as a heatmap | The proposed regularization significantly reduces the number of nonzero weights and highlights important features defining class membership |
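
As a rough illustration of the idea, the sketch below propagates node features through the normalized adjacency matrix, as in the standard SGC formulation (features become S^K X before a linear classifier is fitted). The shrinkage scheme is not specified above, so the soft-thresholding step assumes a plain L1 (lasso) penalty as a stand-in. The toy graph, the feature values, and all function names are hypothetical.

```python
import math

def normalized_adjacency(edges, n):
    """Return S = D^{-1/2} (A + I) D^{-1/2} as a dense list of lists."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0                 # add self-loops
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0       # undirected edge
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(p, q):
    """Plain dense matrix product for small examples."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def sgc_features(edges, x, k=2):
    """Propagate node features k times: S^k X."""
    s = normalized_adjacency(edges, len(x))
    for _ in range(k):
        x = matmul(s, x)
    return x

def soft_threshold(w, lam):
    """Proximal step of an L1 penalty (assumed here, not taken from the paper):
    shrink each weight toward zero and set small ones exactly to zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in w]

# Toy path graph 0-1-2 with two features per node (hypothetical data).
edges = [(0, 1), (1, 2)]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
z = sgc_features(edges, x, k=2)       # smoothed features fed to the classifier

# Weights below the threshold are zeroed, which is what makes the model sparse.
shrunk = soft_threshold([0.8, -0.05, 0.02, -0.6], lam=0.1)
```

In a full pipeline the thresholding would be applied to the classifier weights after each gradient step, so that only features with consistently large weights survive.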
Exploring the value of nodes with multicommunity membership for classification with graph convolutional neural networks
The sampling process plays an important role in the machine learning pipeline, as it provides quality training samples for building the model. However, how to determine the best sampling method has rarely been studied in the context of graph neural networks. In this work, we evaluate the effect of multiple sampling strategies on the node classification task using the SGC. Our findings include an indicative measure of sampling efficiency and a heuristic criterion based on network topology, which can be used to suggest the best sampling strategy to practitioners.
Dataset | Predicted optimal strategy | Actual optimal strategy |
---|---|---|
Cora | Descending | Descending |
Citeseer | Descending | Descending |
Pubmed | Ascending | Ascending |
Amazon-pc | Descending | Ascending |
Amazon-photo | Ascending | Ascending |
Coauthor-cs | Descending | Descending |
Coauthor-physics | Ascending | Descending |
Lastfm_Asia | Ascending | Ascending |
Deezer_Europe | Descending | Descending |
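
To make the comparison in the table easier to read, the short sketch below tallies how often the predicted strategy matches the actual one. The rows are copied from the table; the variable names are only illustrative.

```python
# (dataset, predicted optimal strategy, actual optimal strategy), from the table above
rows = [
    ("Cora", "Descending", "Descending"),
    ("Citeseer", "Descending", "Descending"),
    ("Pubmed", "Ascending", "Ascending"),
    ("Amazon-pc", "Descending", "Ascending"),
    ("Amazon-photo", "Ascending", "Ascending"),
    ("Coauthor-cs", "Descending", "Descending"),
    ("Coauthor-physics", "Ascending", "Descending"),
    ("Lastfm_Asia", "Ascending", "Ascending"),
    ("Deezer_Europe", "Descending", "Descending"),
]

matches = [name for name, predicted, actual in rows if predicted == actual]
misses = [name for name, predicted, actual in rows if predicted != actual]

print(f"{len(matches)}/{len(rows)} datasets agree")   # 7/9 datasets agree
print("Disagreements:", ", ".join(misses))            # Amazon-pc, Coauthor-physics
```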
Exploring a link between network topology and active learning
Building on the findings of our previous work on the importance of network topology in the selection of training samples, we further study the proposed heuristic measure. We design a comprehensive set of synthetic datasets covering a wide range of network structures and input features. Using both synthetic and real-life datasets, we then derive effective sampling approaches based on the proposed measure to facilitate the task of predicting node labels.
A simulation study suggests a threshold by which the optimal sampling strategy can be determined |
Exploring the Regularized SGC in application to social network data
Online social networks have a significant effect on human society and have become an important research topic for maintaining the integrity of the common social understanding. In this work, we study the application of Simple Graph Convolution (SGC) and our proposed extension (regularized SGC) to a social network dataset as well as a comprehensive set of synthetic attributed graphs with varying network topologies. Our proposed framework has high predictive capability and also produces a sparser model, helping practitioners investigate the features relevant to class assignment.
The simulation study covers a wide range of network topologies | The regularized framework has high predictive power and enhances users' understanding of each feature's contribution to class assignment |
Link prediction with Simple Graph Convolution and regularized Simple Graph Convolution
Link prediction is an important task on attributed graphs with a wide range of useful applications. In this work, we adapt the flexible regularization mechanism of regularized SGC for link prediction with the goal of improving the interpretability of the fitted model.
Personal projects
OUC Meter Data Science Competition
The goal of this competition is to identify complex electricity theft and tampering patterns in the Orlando Utilities Commission (OUC) meter reading database. We use temporal and spatial outlier detection approaches to discover potential suspects based on the monthly ratio of water-to-electric usage. Next, the daily likelihood of tampering is obtained by analyzing the variation in the daily consumption trend for a given month.
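
The description does not name the outlier detection method, so the following is only a minimal sketch of one common choice: a modified z-score based on the median and the median absolute deviation (MAD), applied across meters to the water-to-electric ratio for a single month. The meter labels, the readings, and the 3.5 cutoff (a conventional threshold for modified z-scores) are assumptions for illustration, not values from the competition.

```python
from statistics import median

def modified_z_scores(values):
    """Robust z-scores: 0.6745 * (x - median) / MAD."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:                      # all values (nearly) identical
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical monthly readings: meter -> (water usage, electric usage)
readings = {
    "A": (4000, 1000),
    "B": (4200, 1050),
    "C": (3900, 950),
    "D": (4100, 1000),
    "E": (4000, 250),    # unusually low electric usage for its water usage
    "F": (3800, 980),
}

# Water-to-electric ratio per meter, then a robust score across meters.
ratios = {meter: water / electric for meter, (water, electric) in readings.items()}
scores = dict(zip(ratios, modified_z_scores(ratios.values())))

# Meters whose ratio deviates strongly from the rest are candidate suspects.
flagged = [meter for meter, z in scores.items() if abs(z) > 3.5]
print(flagged)
```

The same scoring function could be applied along the time axis for a single meter to obtain a temporal counterpart of this cross-sectional check.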
2019 Microsoft Scholarship - Predicting Major Depressive Disorder using genotyping data
In this project, we develop statistical models to classify patients’ disease status using over twenty thousand categorical biomarkers (SNPs). The data present two major challenges: high dimensionality and a large proportion of missing data. We design a novel analytics process that includes effective treatment of missing information and automatic variable selection. The final model has high predictive power and selects a sparse set of informative SNPs. A simulation study validates our proposed approach under a wide range of missing rates and signal-to-noise ratios.
How to wash white clothes? Segregation or No Segregation
The most common advice for caring for white clothes is to wash them separately from colored ones. Given recent advances in laundry detergents and garment dyeing, is this long-standing advice still relevant? In this project, we employ a blocked \(2^3\) factorial experiment and a two-sample t-test to study the effect of segregation on the whiteness of garments.
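
For readers unfamiliar with the design, the sketch below enumerates the eight runs of a \(2^3\) factorial and splits them into two blocks by confounding the three-way interaction, which is the textbook way to block such a design. It also computes a pooled two-sample t statistic. The factor names other than segregation, the blocking rule, and the whiteness scores are assumptions for illustration; the project description does not specify them.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance

# Hypothetical factors, each at a low (-1) and high (+1) level.
factors = ("segregation", "detergent", "temperature")
runs = [dict(zip(factors, levels)) for levels in product((-1, 1), repeat=3)]

# Two blocks of four runs, assigned by the sign of the three-way interaction,
# so every main effect stays balanced within each block.
for run in runs:
    interaction = run["segregation"] * run["detergent"] * run["temperature"]
    run["block"] = 1 if interaction == -1 else 2

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    dof = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / dof
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, dof

# Hypothetical whiteness scores for segregated vs. mixed loads.
segregated = [82.1, 80.4, 83.0, 81.7]
mixed = [79.8, 80.9, 78.5, 80.1]
t_stat, dof = pooled_t(segregated, mixed)
print(f"t = {t_stat:.2f} on {dof} degrees of freedom")
```

The t statistic would then be compared against a t distribution with the reported degrees of freedom to judge whether segregation changes whiteness.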
Topic Modeling with Glassdoor Company Reviews
Employment-oriented online platforms such as Glassdoor, LinkedIn, and Indeed provide anonymous employee reviews, which are extremely helpful for job seekers. However, the number of online reviews is increasing exponentially, and reading all available reviews takes a significant amount of time. Hence, we employ topic modeling to extract useful topics that help potential candidates assess hiring companies quickly.
Google’s employees are positive about the working environment and job benefits | On the other hand, they express negative opinions about high living expenses and unclear career growth |
Face recognition system
A face recognition system identifies an individual by detecting their face in a digital image or video frame and matching it against a database. It has numerous applications in biometrics, information security, etc. In this project, we build a simplified system capable of assessing an image of an arbitrary individual and determining whether they are the target person or not.
Touring Plan – 2017 Big Data Competition
The competition was held by Touring Plans Co. and the Department of Statistics at the University of Central Florida, with the goal of predicting wait times for four popular rides at Disney World (Splash Mountain, Kilimanjaro Safaris, Toy Story Mania, Soarin’). The real-life data span over a hundred variables and more than half a million observations. By employing various feature engineering strategies and regularization techniques, we obtain final sparse models that highlight the impact of key features while achieving the best predictive power.