Retrieval Augmented Generation

A gentle introduction to Retrieval Augmented Generation
Kaushik Tiwari
Founder @SNR.Audio
April 22, 2024

Introduction and Context

Instruction-tuned Large Language Models have become pervasive across industries, pushing companies to adapt their business practices accordingly. This technology is truly remarkable and has a thriving research community dedicated to its advancement. Currently, the community is focused on addressing three main challenges:

  • Lack of awareness of domain-specific information → The models are not equipped with knowledge of recent events or of specific information that emerged after they were trained.
  • Limited reasoning capability → The ability of the large language model to reason is quite constrained.
  • Hallucinations → The tendency of large language models to generate answers that go beyond what they actually know; when a question falls outside their knowledge, they simply make information up.

To tackle these issues, the concept of Retrieval Augmented Generation (RAG) was introduced. The approach aims to address both outdated knowledge and hallucinations. The term was coined by Lewis et al. in the paper **Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks**. To summarise: LLMs are good at answering surface-level questions but struggle to keep up when the user wants to dive deep into a topic.

Are RAG and Fine-tuning the Same?

The original authors describe RAG as a "general-purpose fine-tuning recipe": a technique that almost any language model can use to connect with external resources. By grounding its responses in user-provided information, a language model can help build trust with enterprises that possess crucial information and processes they wish to automate and digitize.

Stages of Retrieval Augmented Generation System

There are five key stages in a RAG system (a minimal end-to-end sketch follows the list):

  • Loading → This stage involves consuming data from the available sources, whether it's structured, semi-structured, or unstructured data, and passing it into the pipeline.
  • Indexing → This is one of the fundamental processes in RAG. It involves creating data structures that allow the data to be queried on behalf of the LLM. First, the data is converted into vector embeddings, which are stored in specialized databases called vector DBs, since an LLM cannot consume the raw data in its original form.
  • Storing → Once the data has been indexed, the index and its corresponding metadata are persisted so they do not have to be rebuilt on every run.
  • Querying → After indexing, the user's query is embedded and the indexed vectors closest to it are retrieved.
  • Evaluation → The most important step in any pipeline: checking the accuracy, reliability, and speed of the responses the RAG pipeline produces.

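To make these stages concrete, below is a minimal, illustrative sketch of the first four stages in Python. Everything here is an assumption for illustration rather than any specific library's API: `embed()` is a toy hashed bag-of-words stand-in for a real embedding model, a plain JSON file plays the role of the vector DB, and the final prompt would be sent to whatever LLM you use. Evaluation, the fifth stage, is covered in the next section.

```python
import json
import zlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real embedding model: a hashed bag-of-words vector.
    In practice, replace this with an actual sentence-embedding model."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# 1. Loading: read raw documents and split them into fixed-size chunks.
def load_and_chunk(paths, chunk_size=500):
    chunks = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        chunks += [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks

# 2. Indexing: embed every chunk; here the "vector DB" is just a list of records.
def build_index(chunks):
    return [{"text": c, "vector": embed(c).tolist()} for c in chunks]

# 3. Storing: persist the index and its metadata so nothing is re-embedded needlessly.
def save_index(index, path="index.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)

# 4. Querying: embed the question, rank chunks by cosine similarity,
#    and pack the top-k chunks into the prompt sent to the LLM.
def retrieve(question, index, k=3):
    q = embed(question)
    ranked = sorted(index, key=lambda r: cosine(q, np.array(r["vector"])), reverse=True)
    return [r["text"] for r in ranked[:k]]

def build_prompt(question, context_chunks):
    context = "\n\n".join(context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```
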
How To Evaluate A RAG system

Evaluating a language model on its own is one task, but evaluating an entire system is a significant undertaking. The aim is to break the system down so we can identify where the issues lie.

  • Evaluating the Retriever → In a RAG system, the retriever is responsible for fetching the most relevant information for a given query. The two main metrics for evaluating retrievers are Context Relevancy and Context Recall.
  • Context Relevancy → This metric assesses how relevant the retrieved context is to the query. It is measured on a scale of 0 to 1, with higher scores indicating a better-performing retriever.
  • Context Recall → Context Recall measures how closely the retrieved context aligns with the ground truth, also on a scale of 0 to 1 (higher is better). Each sentence in the ground truth is analyzed to determine whether it can be attributed to the retrieved context or not (a small sketch of this computation follows the list).
  • Evaluating Answer Generation → The generation step produces the final answer from the retrieved information. Some metrics for evaluating generation are faithfulness, answer relevancy, and answer similarity.
  • Faithfulness → Faithfulness measures the factual consistency of the generated answer with the retrieved context, scored between 0 and 1 with higher scores indicating better consistency. An answer is considered faithful if every claim it makes is supported by the context and it contains no fabricated information. To calculate faithfulness, the claims made in the generated answer are identified and cross-checked against the provided context.
  • Answer Relevancy → Answer Relevancy evaluates how relevant the generated answer is to the original prompt; incomplete answers or answers containing redundant information receive lower scores. It is calculated from the generated answer and the question and returned in the range 0 to 1, the higher the better. An answer is relevant when it directly addresses the original question. To compute the score, an LLM is repeatedly asked to generate a question that the answer would respond to, and the cosine similarity between these generated questions and the original question is measured (a sketch follows the list).
  • Answer Similarity → Answer Similarity, or Answer Semantic Similarity, checks how semantically similar the generated answer and the ground truth are. It uses a cross-encoder model to compute the similarity score, based on the ground truth and the answer, and returns a value in the range 0 to 1, the higher the better. Answer Similarity gives valuable insight into the quality of the generated response.
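
As a concrete illustration of Context Recall, here is a small sketch of the computation. In practice the attribution check (does the retrieved context support this ground-truth sentence?) is usually delegated to an LLM judge; the token-overlap heuristic below is only a crude, hypothetical stand-in for that call.

```python
import re

def naive_attributable(sentence: str, context: str, threshold: float = 0.6) -> bool:
    """Crude stand-in for the LLM judge that decides whether a ground-truth
    sentence can be attributed to the retrieved context."""
    sent_tokens = set(re.findall(r"\w+", sentence.lower()))
    ctx_tokens = set(re.findall(r"\w+", context.lower()))
    if not sent_tokens:
        return False
    return len(sent_tokens & ctx_tokens) / len(sent_tokens) >= threshold

def context_recall(ground_truth: str, retrieved_context: str) -> float:
    """Fraction of ground-truth sentences supported by the retrieved context (0 to 1)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", ground_truth) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(naive_attributable(s, retrieved_context) for s in sentences)
    return supported / len(sentences)
```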

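And a sketch of Answer Relevancy along the same lines, reusing the toy `embed()` and `cosine()` helpers from the pipeline sketch above. The `question_from_answer()` function is hypothetical: it stands in for an LLM call prompted to write a question that the given answer would respond to.

```python
def question_from_answer(generated_answer: str) -> str:
    # Hypothetical placeholder: prompt an LLM with something like
    # "Write a question that the following answer would respond to: <answer>"
    raise NotImplementedError("plug in an LLM call here")

def answer_relevancy(original_question: str, generated_answer: str, n: int = 3) -> float:
    """Mean cosine similarity between the original question and n questions
    reverse-engineered from the generated answer (higher means more relevant)."""
    generated_questions = [question_from_answer(generated_answer) for _ in range(n)]
    q_vec = embed(original_question)
    sims = [cosine(q_vec, embed(gq)) for gq in generated_questions]
    return sum(sims) / len(sims)
```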