PUBLICATIONS
2024
- FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation. Yuechun Gu, Jiajie He, and Keke Chen. 2024.
Training data privacy has been a top concern in AI modeling. While methods like differentially private learning allow data contributors to quantify an acceptable privacy loss, model utility is often significantly damaged. In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments. In controlled data access, authorized model builders work in a restricted environment to access sensitive data, which fully preserves data utility with a reduced risk of data leakage. However, unlike differential privacy, there is no quantitative measure that lets individual data contributors assess their privacy risk before participating in a machine learning task. We developed the demo prototype FT-PrivacyScore to show that it is possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task. The demo source code will be available at the URL provided in the paper.
- Calibrating Practical Privacy Risks for Differentially Private Machine Learning. Yuechun Gu and Keke Chen. 2024.
Differential privacy quantifies privacy through the privacy budget ϵ, yet its practical interpretation is complicated by variations across models and datasets. Recent research on differentially private machine learning and membership inference has highlighted that, with the same theoretical ϵ setting, the likelihood-ratio-based membership inference (LiRA) attack success rate (ASR) may vary across datasets and models, which might be a better indicator for evaluating real-world privacy risks. Inspired by this practical privacy measure, we study approaches that lower the attack success rate to allow more flexible privacy budget settings in model training. We find that by selectively suppressing privacy-sensitive features, we can achieve lower ASR values without compromising application-specific data utility. We use the SHAP and LIME model explainers to evaluate feature sensitivities and develop feature-masking strategies. Our findings demonstrate that the LiRA ASR on a model M can properly indicate the inherent privacy risk of a dataset for modeling, and that datasets can be modified to enable the use of larger theoretical ϵ settings while achieving equivalent practical privacy protection. We have conducted extensive experiments to show the inherent link between ASR and a dataset's privacy risk. By carefully selecting features to mask, we can preserve more data utility with equivalent practical privacy protection and relaxed ϵ settings. The implementation details are shared online at the GitHub URL provided in the paper.
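The feature-masking idea can be illustrated with a short sketch: score each feature's sensitivity with a model explainer, then suppress the most sensitive features before training and re-measure utility (and, in the paper's workflow, the LiRA ASR). This is a minimal sketch under assumptions: a generic SHAP explainer stands in for the paper's SHAP/LIME analysis, and masking-to-the-mean with a fixed top-k cutoff stands in for its feature-masking strategies.

```python
# Minimal sketch: rank features by SHAP sensitivity, then mask the top-k most
# sensitive ones by replacing them with their column mean before retraining.
# The masking-to-the-mean rule and the k=5 cutoff are illustrative assumptions.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)        # stand-in tabular dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Explain the positive-class probability so SHAP values have shape (n, d).
explainer = shap.Explainer(lambda Z: model.predict_proba(Z)[:, 1], X_tr[:100])
sensitivity = np.abs(explainer(X_te[:100]).values).mean(axis=0)

k = 5                                              # assumed number of features to suppress
mask_idx = np.argsort(sensitivity)[-k:]

X_masked = X_tr.copy()
X_masked[:, mask_idx] = X_tr[:, mask_idx].mean(axis=0)   # suppress sensitive features

# Retrain on the masked data; in the paper's workflow the LiRA ASR would be
# re-measured here to check the practical privacy gain.
masked_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_masked, y_tr)
print("utility on original test data:", masked_model.score(X_te, y_te))
```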
2023
- DisguisedNets: Secure Image Outsourcing for Confidential Model Training in Clouds. Yuechun Gu, Keke Chen, and Sagar Sharma. ACM Trans. Internet Technol., Aug 2023.
Large training data and expensive model tweaking are standard features of deep learning with images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which also raises privacy concerns. Existing cryptographic solutions for training deep neural networks (DNNs) are too expensive, cannot effectively utilize cloud GPU resources, and put a significant burden on client-side pre-processing. This article presents an image disguising approach, DisguisedNets, which allows users to securely outsource images to the cloud and enables confidential, efficient GPU-based model training. DisguisedNets uses a novel combination of image blocktization, block-level random permutation, and block-level secure transformations: random multidimensional projection (RMT) or AES pixel-level encryption (AES). Users can apply existing DNN training methods and GPU resources without any modification to train models on disguised images. We have analyzed and evaluated the methods under a multi-level threat model and compared them with a similar method, InstaHide. We also show that the image disguising approach, including both DisguisedNets and InstaHide, can effectively protect models from model-targeted attacks.
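A minimal numpy sketch of the disguising pipeline described above: blocktize an image, apply a secret block-level permutation, and transform each block with a secret random projection matrix (the RMT option). Key management and the AES variant are omitted; the block size and the single shared projection matrix are illustrative assumptions rather than the authors' exact parameters.

```python
# Minimal sketch of DisguisedNets-style image disguising:
# blocktization -> secret block-level permutation -> per-block random projection (RMT).
import numpy as np

rng = np.random.default_rng(42)   # the seed plays the role of a secret key here

def blocktize(img, b):
    """Split an (H, W) image into non-overlapping b x b blocks."""
    H, W = img.shape
    return (img.reshape(H // b, b, W // b, b)
               .transpose(0, 2, 1, 3)
               .reshape(-1, b, b))

def make_keys(h, w, b):
    """Secret keys shared across the whole training set:
    a block permutation and a random projection matrix."""
    n = (h // b) * (w // b)
    perm = rng.permutation(n)
    R = rng.normal(size=(b * b, b * b))
    return perm, R

def disguise(img, perm, R, b=8):
    blocks = blocktize(img, b)
    flat = blocks[perm].reshape(len(blocks), -1)   # block-level random permutation
    return (flat @ R).reshape(-1, b, b)            # multidimensional projection per block

perm, R = make_keys(32, 32, 8)
img = rng.random((32, 32))                         # stand-in for a training image
print(disguise(img, perm, R).shape)                # (16, 8, 8): 16 transformed blocks
```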
- Demo: Image Disguising for Scalable GPU-accelerated Confidential Deep Learning. Yuechun Gu, Sagar Sharma, and Keke Chen. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, Aug 2023.
Deep learning training involves large training data and expensive model tweaking, for which cloud GPU resources can be a popular option. However, outsourcing data often raises privacy concerns. The challenge is to preserve data and model confidentiality without sacrificing GPU-based scalable training and low-cost client-side preprocessing, which is difficult for conventional cryptographic solutions to achieve. This demonstration shows a new approach, image disguising, represented by recent work: DisguisedNets, NeuraCrypt, and InstaHide, which aim to securely transform training images while still enabling the desired scalability and efficiency. We present an interactive system for visually and comparatively exploring these methods. Users can view disguised images, note low client-side processing costs, and observe the maintained efficiency and model quality during server-side GPU-accelerated training. This demo aids researchers and practitioners in swiftly grasping the advantages and limitations of image-disguising methods.
- GAN-based Domain Inference Attack. Yuechun Gu and Keke Chen. In Proceedings of the AAAI Conference on Artificial Intelligence, Aug 2023.
Model-based attacks can infer training data information from deep neural network models. These attacks heavily depend on the attacker's knowledge of the application domain, e.g., using it to determine the auxiliary data for model-inversion attacks. However, attackers may not know what the model is used for in practice. We propose a generative adversarial network (GAN) based method to explore likely or similar domains of a target model: the model domain inference (MDI) attack. For a given target (classification) model, we assume that the attacker knows nothing but the input and output formats and can use the model to derive the prediction for any input in the desired form. Our basic idea is to use the target model to affect a GAN training process on a candidate domain's dataset that is easy to obtain. We find that the target model distorts the training procedure less when the candidate domain is more similar to the target domain. We then measure the distortion level with the distance between GAN-generated datasets, which can be used to rank candidate domains for the target model. Our experiments show that the auxiliary dataset from an MDI top-ranked domain can effectively boost the result of model-inversion attacks.
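The ranking step can be sketched as follows, with GAN training elided: assuming you already have samples from a reference GAN trained normally on each candidate dataset and from a GAN whose training was influenced by the target model, rank domains by the distance between the two sample sets. The Fréchet-style Gaussian distance used here is an illustrative choice, not necessarily the distance used in the paper.

```python
# Sketch of the MDI ranking step: smaller distortion (distance between the
# reference GAN's samples and the target-model-influenced GAN's samples)
# suggests a candidate domain closer to the target model's training domain.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(A, B):
    """Frechet distance between Gaussians fit to sample matrices A, B of shape (n, d)."""
    mu_a, mu_b = A.mean(0), B.mean(0)
    cov_a, cov_b = np.cov(A, rowvar=False), np.cov(B, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))

def rank_domains(candidate_samples):
    """candidate_samples: {domain: (reference_gan_samples, guided_gan_samples)}."""
    scores = {d: frechet_distance(ref, guided)
              for d, (ref, guided) in candidate_samples.items()}
    return sorted(scores, key=scores.get)          # least-distorted domain first

# Toy stand-in data: the 'faces' domain is barely distorted, 'textures' heavily.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 16))
candidates = {
    "faces":    (base, base + rng.normal(scale=0.05, size=base.shape)),
    "textures": (base, base + rng.normal(scale=2.0,  size=base.shape)),
}
print(rank_domains(candidates))                    # expected: ['faces', 'textures']
```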
- Adaptive Domain Inference Attack. Yuechun Gu and Keke Chen. Aug 2023.
As deep neural networks are increasingly deployed in sensitive application domains, such as healthcare and security, it is necessary to understand what kind of sensitive information can be inferred from these models. Existing model-targeted attacks all assume the attacker knows the application domain or training data distribution, which plays an essential role in successful attacks. Can removing the domain information from model APIs protect models from these attacks? This paper studies this critical problem. Unfortunately, even with minimal knowledge, i.e., accessing the model as an unnamed function without leaking the meaning of its input and output, the proposed adaptive domain inference attack (ADI) can still successfully estimate relevant subsets of training data. We show that the extracted relevant data can significantly improve, for instance, the performance of model-inversion attacks. Specifically, the ADI method utilizes a concept hierarchy built on top of a large collection of available public and private datasets and a novel algorithm to adaptively tune the likelihood of leaf concepts showing up in the unseen training data. The ADI attack not only extracts partial training data at the concept level, but also converges quickly and requires far fewer target-model accesses than another domain inference attack, GDI.
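A toy sketch of the adaptive idea, under loose assumptions about the actual algorithm: keep a weight for each leaf concept, probe the target model with samples drawn from each concept, and multiplicatively boost concepts whose probes receive high-confidence predictions. The confidence-based update rule and the synthetic concepts below are illustrative only, not the paper's exact procedure.

```python
# Toy sketch of adaptive domain inference: reweight leaf concepts based on how
# confidently the black-box target model responds to probes drawn from them.
# The multiplicative confidence update is an illustrative assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Leaf concepts = clusters of 2-D points; the target model was trained only on
# concepts A and B. The off-domain concepts C and D are placed near the model's
# decision boundary so probes from them draw low-confidence answers.
centers = {"A": (0, 0), "B": (3, 0), "C": (1.5, 6), "D": (1.5, -6)}
concepts = {name: rng.normal(loc=c, scale=0.3, size=(200, 2))
            for name, c in centers.items()}

X = np.vstack([concepts["A"], concepts["B"]])
y = np.r_[np.zeros(200), np.ones(200)]
target_model = LogisticRegression().fit(X, y)      # stands in for the unnamed model API

weights = {name: 1.0 for name in concepts}
for _ in range(10):                                # adaptive probing rounds
    for name, data in concepts.items():
        probes = data[rng.choice(len(data), size=20, replace=False)]
        conf = target_model.predict_proba(probes).max(axis=1).mean()
        weights[name] *= conf                      # boost concepts the model "knows"
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

print(sorted(weights, key=weights.get, reverse=True))  # expect A and B ranked first
```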
2022
- A Comparative Study of Image Disguising Methods for Confidential Outsourced Learning. Aug 2022.
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel image disguising mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets is a novel combination of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique. We have analyzed and evaluated them under a multi-level threat model. RMT provides a stronger security guarantee than InstaHide under Level-1 adversarial knowledge, with well-preserved model quality. In contrast, AES provides a security guarantee under Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help protect models from model-targeted attacks. We have conducted an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
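For contrast with the DisguisedNets sketch above, here is a minimal numpy sketch of an InstaHide-style encoding as described in this abstract: mix a private image with randomly chosen public images under random coefficients, then flip the sign of each pixel with a fresh random mask. The choice of k = 4 and the coefficient normalization are illustrative assumptions, not the exact InstaHide parameters.

```python
# Minimal sketch of an InstaHide-style encoding: mix one private image with
# (k - 1) public images using random nonnegative coefficients that sum to 1,
# then apply a random pixel-wise sign flip.
import numpy as np

rng = np.random.default_rng(0)

def instahide_encode(private_img, public_pool, k=4):
    picks = public_pool[rng.choice(len(public_pool), size=k - 1, replace=False)]
    imgs = np.concatenate([private_img[None], picks])   # (k, H, W)
    lam = rng.random(k)
    lam /= lam.sum()                                    # random mixing coefficients
    mixed = np.tensordot(lam, imgs, axes=1)             # weighted pixel-level mixup
    signs = rng.choice([-1.0, 1.0], size=mixed.shape)   # random pixel flipping
    return signs * mixed

private_img = rng.random((32, 32))
public_pool = rng.random((100, 32, 32))                 # stand-in public dataset
encoded = instahide_encode(private_img, public_pool)
print(encoded.shape)                                    # (32, 32)
```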
2020
- Price Forecast with High-Frequency Finance Data: An Autoregressive Recurrent Neural Network Model with Technical Indicators. Yuechun Gu, Da Yan, Sibo Yan, et al. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Aug 2020.
The availability of high-frequency trade data has made intraday forecasting of price patterns possible. With the help of technical indicators, recent studies have shown that LSTM-based deep learning models are able to predict price directions (a binary classification problem) better than a random guess. However, only naive recurrent networks were adopted, and these works did not compare with the tools used by finance practitioners. Our experiments show that GARCH beats their LSTM models by a large margin. We propose to adopt an autoregressive recurrent network instead, so that the loss of the prediction at every time step contributes to model training; we also treat a rich set of technical indicators at each time step as covariates to enhance the model input. Finally, we treat the problem of price pattern forecast as a regression problem on the price itself; even for price direction prediction, we show that our performance is much better than if we model the problem as binary classification. We show that only when all these designs are adopted can an LSTM model beat GARCH, and by a large margin. This work corrects the poor use of LSTM networks in recent studies and provides "the" baseline that fully unleashes the power of LSTM for future work to compare with. Moreover, since our model is a price regressor with very good prediction performance, it can serve as a valuable tool for designing trading strategies (including day trading). Our model has been used by quantitative analysts at Freddie Mac for over a quarter and has been found to be more effective than traditional GARCH variants in market prediction.
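A minimal PyTorch sketch of the modeling choices described above: an LSTM that consumes the price plus technical-indicator covariates at every time step and is trained as a next-price regressor with the loss applied at every step. The window length, hidden size, and the single moving-average covariate are placeholders, not the paper's configuration.

```python
# Minimal sketch: autoregressive LSTM price regressor with technical-indicator
# covariates, trained so every time step contributes to the loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PriceRegressor(nn.Module):
    def __init__(self, n_covariates, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1 + n_covariates, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, prices, covariates):
        # prices: (B, T, 1); covariates: (B, T, C) -> predicted next price at every t
        x = torch.cat([prices, covariates], dim=-1)
        out, _ = self.lstm(x)
        return self.head(out)                          # (B, T, 1)

# Toy data: a random-walk (de-meaned) price series and one covariate,
# a 5-step moving average standing in for a technical indicator.
torch.manual_seed(0)
B, T = 32, 60
prices = torch.cumsum(torch.randn(B, T + 1, 1) * 0.01, dim=1)
ma5 = F.avg_pool1d(prices.transpose(1, 2), kernel_size=5, stride=1,
                   padding=2, count_include_pad=False).transpose(1, 2)
inputs, covs, targets = prices[:, :-1], ma5[:, :-1], prices[:, 1:]

model = PriceRegressor(n_covariates=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = F.mse_loss(model(inputs, covs), targets)    # loss at every time step
    loss.backward()
    opt.step()
print("final training MSE:", loss.item())
```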