Why are embedding and vectorisation still relevant for RAG, even with powerful LLMs like Mistral?

Introduction

Large Language Models (LLMs) have revolutionised the field of natural language processing. However, to fully exploit their potential in real-world applications, it is essential to give them access to relevant contextual information. This is where RAG (Retrieval-Augmented Generation) architectures come in, combining the capabilities of LLMs with information retrieval systems.

At the heart of these architectures, embedding and vectorisation play a crucial role: they represent texts semantically and make it possible to find relevant information efficiently.

Before delving into the heart of the matter, let’s briefly recall what a Retrieval-Augmented Generation (RAG) system is.

A RAG system combines the capabilities of a large language model (LLM) with an information retrieval system. In concrete terms, when a question is asked, the system first searches a knowledge base for the most relevant information, then provides it to the LLM so that it can generate a complete and coherent response.
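To make this flow concrete, here is a minimal sketch in Python. It assumes the sentence-transformers package for the embedding step and leaves the final generation step as a comment, since the exact LLM call (for example to Mistral's API) depends on your setup; the corpus and helper names are illustrative only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

# Small in-memory corpus standing in for a real document store.
documents = [
    "Mistral 7B is an open-weight large language model.",
    "RAG combines retrieval over a corpus with LLM generation.",
    "Embeddings map text to dense vectors for semantic search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

# Step 1: embed the corpus once (the upstream "vectorisation" work).
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # dot product of normalised vectors = cosine similarity
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(question: str) -> str:
    """Step 2: assemble the retrieved context and the question into an LLM prompt."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What does RAG do?")
# Step 3 (not shown): send `prompt` to an LLM such as Mistral to generate the answer.
print(prompt)
```

The important point is the separation of steps: the corpus is embedded once, only the question is embedded at query time, and only the retrieved passages are injected into the prompt.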

Performing this embedding and vectorisation work upstream, rather than at query time, brings several advantages:

  • Improved accuracy of answers:
    • Fine-grained contextualisation: by pre-calculating high-quality embeddings for the documents in the corpus, we ensure that the LLM receives the most relevant information to answer a query.
    • Reduced hallucinations: by providing a solid context, we lower the risk of the LLM generating factually incorrect information.
  • Optimised performance:
    • Faster search: prior indexing of the vectors allows semantic searches to run very quickly, which is crucial for real-time applications (see the index sketch after this list).
  • Flexibility and control:
    • Customisable embeddings: by using a specific embedding model, you can adapt the vector representation to the needs of a particular task.
    • Integration of external knowledge: embeddings can be enriched with additional information (e.g. named entities, semantic relationships) to improve the LLM's understanding.

This upstream work is particularly worthwhile in the following situations:

  • Large and dynamic corpora: when the corpus is updated frequently, it is more efficient to keep a vector index up to date than to recalculate embeddings for every query.
  • Tasks requiring high precision: for critical applications (for example, in the medical or legal fields), it is important to guarantee the quality and reliability of responses.
  • Low-latency applications: if response times are a limiting factor, pre-processing speeds the process up considerably.
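As noted in the list above, it is the prior indexing of the vectors that makes low-latency retrieval possible. The sketch below illustrates the idea with the faiss library (an assumed dependency, installable as faiss-cpu); the random vectors and file name are placeholders standing in for real pre-computed embeddings.

```python
import numpy as np
import faiss  # assumed dependency: pip install faiss-cpu

DIM = 384  # dimension of the embedding model used offline (illustrative)

# Offline step: embed the whole corpus once and build a vector index.
# `corpus_vectors` would come from an embedding model, as in the previous sketch.
corpus_vectors = np.random.rand(10_000, DIM).astype("float32")  # placeholder data
faiss.normalize_L2(corpus_vectors)        # cosine similarity via inner product
index = faiss.IndexFlatIP(DIM)
index.add(corpus_vectors)
faiss.write_index(index, "corpus.index")  # persist so queries never re-embed the corpus

# Online step: at query time, only the question is embedded, then searched.
index = faiss.read_index("corpus.index")
query_vector = np.random.rand(1, DIM).astype("float32")  # placeholder for an embedded question
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```

Because the index is built and persisted offline, the online path only has to embed the question and run a single similarity search, and the index can be rebuilt or incrementally updated whenever the corpus changes.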

Conclusion

Although LLMs such as Mistral are capable of generating embeddings on the fly, there are many advantages to carrying out embedding and vectorisation work upstream as part of a RAG pipeline. The best approach will depend on your specific constraints and objectives.