Status: in progress / Type of Thesis: Master thesis / Location: Dresden
Accelerating the text generation of large language models (LLMs) [2,3,4] is a highly relevant topic for real-time question-answering applications. Although inference via the OpenAI API already achieves low latency, the user is charged per input and output token. To save additional time and money spent on input tokens, several prompt compression techniques have emerged. One representative, LLMLingua [0][1], uses a smaller model to compress the prompt before sending it to the target model.
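As a minimal sketch of the compression step, the snippet below uses the open-source llmlingua Python package. The model name, parameter names, and return keys follow the package's public examples and are assumptions here; they may differ between package versions.

    # Minimal sketch of prompt compression with LLMLingua-2 (pip install llmlingua).
    # Model name and parameters follow the package's public examples and are
    # assumptions, not a prescribed setup for the thesis.
    from llmlingua import PromptCompressor

    # LLMLingua-2 uses a small token-classification model to decide which
    # tokens of the original prompt to keep.
    compressor = PromptCompressor(
        model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
        use_llmlingua2=True,
    )

    long_prompt = "..."  # the original (long) context for the target model

    result = compressor.compress_prompt(
        long_prompt,
        rate=0.33,  # keep roughly a third of the tokens
    )
    print(result["compressed_prompt"])  # shortened prompt, sent to the target model
    print(result["origin_tokens"], "->", result["compressed_tokens"])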
Although latency has been measured in the literature, the analysis lacks completeness, and there is no extensive investigation of the circumstances under which prompt compression is actually beneficial from a latency and cost perspective.
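To make the trade-off concrete, the sketch below gives a simple break-even model: compression pays off on latency only if the time saved by the shorter prompt exceeds the extra time spent running the compressor. All prices, token counts, and timings are illustrative assumptions, not measurements.

    # Illustrative break-even arithmetic; all numbers are assumptions.
    PRICE_PER_INPUT_TOKEN = 0.50 / 1_000_000  # hypothetical $/token, not a real rate

    def latency_benefit(n_input_tokens: int,
                        keep_rate: float,          # fraction of tokens kept, e.g. 0.33
                        t_compress: float,         # compressor latency (s)
                        t_per_input_token: float,  # target-model latency per input token (s)
                        ) -> float:
        """Seconds saved end to end; positive means compression is faster overall."""
        tokens_removed = n_input_tokens * (1.0 - keep_rate)
        return tokens_removed * t_per_input_token - t_compress

    # Cost side: every removed input token saves money, since the
    # compressor runs on a smaller (e.g. locally hosted) model.
    n_tokens, keep_rate = 8_000, 0.33
    cost_saved = n_tokens * (1.0 - keep_rate) * PRICE_PER_INPUT_TOKEN
    print(f"cost saved per request: ${cost_saved:.6f}")
    print(f"latency saved: {latency_benefit(n_tokens, keep_rate, 0.4, 1e-4):.3f}s")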
The following questions shall be answered in the thesis:
[0] – Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.
[1] – Pan, Z., Wu, Q., Jiang, H., Xia, M., Luo, X., Zhang, J., Lin, Q., Rühle, V., Yang, Y., Lin, C.-Y., Zhao, H. V., Qiu, L., & Zhang, D. (2024). LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. http://arxiv.org/abs/2403.12968
[2] – Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., Chowdhury, M., & Zhang, M. (2023). Efficient Large Language Models: A Survey.
[3] – Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.
[4] – Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners.