Tune out
Membership inference attacks on fine-tuned language models
In this project we aim to demonstrate membership inference attacks on fine-tuned language models. The long-term goal is to add this capability to the LeakPro platform.
Considerable work has gone into devising membership inference algorithms for models trained on private data; a prominent example is research on the security of models trained on hospital data. The conclusion from this line of work is that such models cannot currently be considered fully private: by querying a model and performing statistical analysis of its outputs, an attacker can determine which data points were part of its training set, as sketched below. [1, 2, 3, 4]
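To make the idea concrete, the following is a minimal sketch of a loss-threshold membership inference test in the spirit of Shokri et al. [3]: a sample that the target model fits with unusually low loss is flagged as a likely training member. The model, sample, and threshold here are placeholders for illustration; a practical attack would calibrate the threshold with shadow models or a held-out calibration set.

```python
# Minimal loss-threshold membership inference sketch (hypothetical model/data).
import torch
import torch.nn.functional as F

def membership_score(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """Negative cross-entropy loss of the target model on (x, y); higher = more member-like."""
    model.eval()
    with torch.no_grad():
        logits = model(x.unsqueeze(0))                  # (1, num_classes)
        loss = F.cross_entropy(logits, y.unsqueeze(0))  # y is a scalar class label
    return -loss.item()

def predict_membership(model, x, y, threshold: float) -> bool:
    """Flag the sample as a training member if its score exceeds the calibrated threshold."""
    return membership_score(model, x, y) > threshold
```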
In this project we explore how susceptible LLMs are to these kinds of attacks. The techniques for assessing language models differ considerably from those used for classical machine learning models, since text is not as easily quantified and compared as structured model outputs. The size of large language models also presents new challenges: they offer very limited introspection and require significant computational resources to query and evaluate.
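As one example of an LLM-oriented technique, the sketch below approximates the Min-K% Prob score of Shi et al. [2] (later refined as Min-K%++ [4]): a text is scored by the average log-probability of its least likely tokens under the target model, with higher scores suggesting membership. The model checkpoint, the choice of k, and the decision threshold are illustrative assumptions, not fixed choices of this project.

```python
# Minimal Min-K% Prob sketch for a causal LM, assuming a Hugging Face checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_score(model, tokenizer, text: str, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens; higher suggests membership."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc.input_ids                            # (1, seq_len)
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, seq_len, vocab)
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% lowest-probability tokens.
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Hypothetical usage against a fine-tuned checkpoint:
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# score = min_k_percent_score(lm, tok, "Example candidate training text ...")
```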
References
[1] Kaneko, M., Ma, Y., Wata, Y., & Okazaki, N. (2024, April 17). Sampling-based Pseudo-Likelihood for Membership Inference Attacks. arXiv.org. https://arxiv.org/abs/2404.11262
[2] Shi, W., Ajith, A., Xia, M., Huang, Y., Liu, D., Blevins, T., Chen, D., & Zettlemoyer, L. (2023, October 25). Detecting Pretraining Data from Large Language Models. arXiv.org. https://arxiv.org/abs/2310.16789
[3] Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2016, October 18). Membership Inference Attacks against Machine Learning Models. arXiv.org. https://arxiv.org/abs/1610.05820
[4] Zhang, J., Sun, J., Yeats, E., Ouyang, Y., Kuo, M., Zhang, J., Yang, H. F., & Li, H. (2024, April 3). Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models. arXiv.org. https://arxiv.org/abs/2404.02936