Latency and Cost Analysis of Prompt Compression for LLMs

Type of thesis: Master's thesis / Location: Dresden / Status of thesis: Open

Accelerating the text generation of large language models [2,3,4] is a highly relevant topic for real-time question answering applications. Although inference via the OpenAI API already achieves low latency, users are charged per input and output token. To save additional time and money spent on input tokens, several prompt compression techniques have emerged. One representative, LLMLingua [0][1], uses a smaller model to compress the prompt before sending it to the target model.
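The intended workflow can be illustrated with a minimal sketch: a small compressor model shortens the prompt locally, and only the compressed prompt is sent to the paid target model. The snippet below assumes the open-source llmlingua Python package and its PromptCompressor interface; the example prompt, question, and token budget are placeholders, and the exact defaults (e.g. which compressor model is loaded) may differ between package versions.

```python
# Minimal sketch of prompt compression with LLMLingua (assumption: the
# llmlingua package is installed; prompt, question and token budget are
# illustrative placeholders).
from llmlingua import PromptCompressor

# A smaller model, loaded locally, performs the compression.
compressor = PromptCompressor()

long_prompt = "... many in-context examples and retrieved documents ..."
question = "What does the contract say about early termination?"

# Compress the prompt to roughly 500 tokens before calling the target model.
result = compressor.compress_prompt(
    long_prompt,
    question=question,
    target_token=500,
)

# The compressed prompt is what would be sent to the (per-token billed) API.
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```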

Although latency was measured in the literature, the analysis remains incomplete: there is no extensive investigation of the circumstances under which prompt compression is actually beneficial from a latency and cost perspective.

The following questions shall be answered in the thesis:

  1. When is LLMLingua beneficial from a latency and cost perspective?
  2. What latency do different compression setups have?
  3. How much cost can be saved using LLMLingua prompt compression? (See the sketch below for how such savings can be estimated.)
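Since the API bills input and output tokens separately, the cost saving per request is roughly the number of input tokens removed times the input-token price, while the end-to-end latency additionally includes the time spent on compression itself. The back-of-the-envelope calculation below illustrates this; the prices are placeholders, not actual OpenAI rates.

```python
# Back-of-the-envelope cost comparison for a single request.
# Prices are placeholders (USD per 1K tokens), not actual OpenAI rates.
PRICE_INPUT_PER_1K = 0.01
PRICE_OUTPUT_PER_1K = 0.03


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call: input and output tokens are billed separately."""
    return (input_tokens / 1000) * PRICE_INPUT_PER_1K \
        + (output_tokens / 1000) * PRICE_OUTPUT_PER_1K


# Example: a 4000-token prompt compressed to 1000 tokens, 300 output tokens.
full = request_cost(4000, 300)
compressed = request_cost(1000, 300)
print(f"full prompt:        {full:.4f} USD")
print(f"compressed prompt:  {compressed:.4f} USD")
print(f"saving per request: {full - compressed:.4f} USD "
      f"({(full - compressed) / full:.0%})")
```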

Literature

[0] – Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., & Qiu, L. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.

[1] – Pan, Z., Wu, Q., Jiang, H., Xia, M., Luo, X., Zhang, J., Lin, Q., Rühle, V., Yang, Y., Lin, C.-Y., Zhao, H. V., Qiu, L., & Zhang, D. (2024). LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. http://arxiv.org/abs/2403.12968

[2] – Wan, Z., Wang, X., Liu, C., Alam, S., Zheng, Y., Liu, J., Qu, Z., Yan, S., Zhu, Y., Zhang, Q., Chowdhury, M., & Zhang, M. (2023). Efficient Large Language Models: A Survey.

[3] – Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.

[4] – Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners.

Contact

Lena Jurkschat

TU Dresden

GPT-X, Natural Language Processing
