LLMS Evaluation and Metrics

A deep dive into LLMs, how they are evaluated, and the top 5 LLM evaluation frameworks
Kaushik Tiwari
Founder @SNR.Audio
April 16, 2024

The Large Language Model

After gaining a deep understanding of the complexities of the ASR system that interacts with the user and accurately interprets their intentions, the baton is passed to the second player, which acts as the central processing unit: the large language model. At its core, a large language model is a text generation model that produces textual content. These models are now the key technology behind advanced generative artificial intelligence and have gained widespread recognition thanks to models such as Llama 2, Gemini Ultra, Claude, Zephyr and the GPT series. They undergo extensive training on vast amounts of internet data for months, enabling them to understand text and audio and even generate images with impressive proficiency, and they can perform reasoning-based tasks like solving simple equations. These systems are unique and carry many complexities, and companies often need to adjust their entire business logic to take full advantage of them.

Before we explore how we make use of this technology, let's answer an important question: what makes a language model "chattable"? Is it some internal optimization, the data, the training process, or something completely different? The answer encompasses a bit of everything. Alignment techniques applied to a large language model, such as RLHF and DPO, teach it to prefer the best response. Reformatting the training data into questions and answers teaches it to reply with answers; this process is called instruction tuning. These are the basic components that make up large language models, on which companies like Microsoft, OpenAI, and Google have built advanced features like function calling, structured data extraction, and processing of unstructured data from documents.
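As an illustration, a hypothetical instruction-tuning record might look like the sketch below. The schema and field names are assumptions (providers use different formats); the point is simply that the raw text-completion model is retrained on prompt/response pairs so it learns to answer rather than merely continue the text.

```python
# A hypothetical instruction-tuning example (schema is illustrative only).
instruction_example = {
    "messages": [
        {"role": "system",    "content": "You are a helpful support assistant."},
        {"role": "user",      "content": "How do I request a refund?"},
        {"role": "assistant", "content": "Go to Orders, select the item, and choose 'Request refund'."},
    ]
}
```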

Let's Evaluate the Brains of the Operation

There are various commonly used metrics for evaluating LLMs, which include Perplexity, Accuracy, F1-score, ROUGE score, BLEU score, METEOR score, question answering metrics, sentiment analysis metrics, named entity recognition metrics, and contextualized word embeddings. These metrics assist in assessing the performance of LLMs by measuring different aspects of the generated text, such as fluency, coherence, accuracy, and relevance.

A thorough examination of the following factors would easily help you assess the quality of a large language model:

  • Reliability → LLMs must produce accurate and truthful outputs, avoiding misinformation and ensuring factual accuracy.
  • Safety → LLMs should avoid generating outputs that could be harmful, illegal, or sensitive, prioritizing user privacy and mental health awareness.
  • Fairness → The model must remain neutral, avoiding biases and ensuring equal performance across different inputs.
  • Resistance to Misuse → LLMs should resist malicious use, such as promoting propaganda or infringing on copyrighted materials.
  • Explainability and Reasoning → LLMs need to be able to analyze and reason logically, providing step-by-step explanations.
  • Robustness → LLMs must be able to withstand adversarial tactics, including prompt injection, base prompt overriding, and data poisoning.
  • Performance and Efficiency → LLMs should deliver results quickly and efficiently, adapting to new data as needed.
  • Exactness and Linguistic Accuracy → LLMs should demonstrate linguistic accuracy and provide precise responses to diverse prompts.
  • Comprehensive Intelligence → LLMs should exhibit comprehensive intelligence, showcasing versatility across various tasks and domains.

Methods for Evaluating LLMs (Textual Quality and Subjective Quality)

Perplexity → Perplexity measures how well the distribution predicted by the model aligns with the actual distribution of the words. In a larger context, it measures how uncertain the model is about its response in relation to your query. For example, if you ask a question about refunds and purchases to an LLM-backed customer support system, the response should pinpoint your exact query; a reply like "here are the deals we currently have for you" drifts away from the query and will have a higher perplexity, while a system that answers the exact question is less uncertain and will have a lower perplexity. In short, the rule of thumb is:

the higher the perplexity, the worse the model fits the text.

Perplexity formula: PP(W) = b^(−(1/N) Σᵢ log_b p(wᵢ)), where b = 2, N is the number of words, and p(wᵢ) is the probability the model assigns to each word.
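As a toy illustration, the sketch below computes perplexity directly from this formula in pure Python. The token probabilities are made up, meant only to show the contrast between an on-topic and an off-topic answer.

```python
import math

def perplexity(token_probs, base=2):
    """Perplexity of a sequence given the probability the model assigned to each token.

    Lower perplexity = the model was less "surprised" by the text it produced.
    """
    n = len(token_probs)
    # Average negative log-probability per token, then exponentiate.
    avg_neg_log = -sum(math.log(p, base) for p in token_probs) / n
    return base ** avg_neg_log

# Hypothetical probabilities for a 4-token answer to a refund question.
on_topic  = [0.40, 0.35, 0.50, 0.30]   # model is fairly confident in each token
off_topic = [0.05, 0.08, 0.04, 0.10]   # model is "surprised" by its own wording

print(perplexity(on_topic))   # ~2.6  -> lower perplexity, better fit
print(perplexity(off_topic))  # ~15.8 -> higher perplexity, worse fit
```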

BLEU → Bilingual Evaluation Understudy is a metric used to evaluate the quality of machine-translated text against one or more reference translations. It measures the similarity between the machine-generated translation and the ground truth based on the n-grams (contiguous sequences of n words) present in both. The BLEU score ranges from 0 to 1, with a higher score indicating a better response / translation.
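A minimal sketch of a sentence-level BLEU computation with NLTK, assuming the nltk package is installed; the reference and candidate sentences are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "refund", "was", "issued", "within", "five", "days"]]
candidate = ["the", "refund", "was", "processed", "within", "five", "days"]

# Smoothing avoids a zero score when some higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference
```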

ROUGE → Recall-Oriented Understudy for Gisting Evaluation is a widely used metric for assessing the quality of summaries produced by an automatic text summarisation system. It measures the similarity between the generated summary and one or more ground-truth summaries by comparing their n-gram units; ROUGE focuses on recall but reports both a precision score and a recall score.
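A short sketch using the rouge-score package (one of several ROUGE implementations; its use here is an assumption), again with made-up strings.

```python
from rouge_score import rouge_scorer

reference = "the customer was refunded in full within five business days"
generated = "the customer received a full refund within five days"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

for name, s in scores.items():
    # Each entry reports precision, recall and F-measure over n-gram overlap.
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```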

METEOR → METEOR takes a more in-depth and broader approach: it evaluates translations by considering exact accuracy, synonymy, stemming, and word order. The metric paints a holistic picture of a model's translations by emphasising meaning preservation, so a high METEOR score indicates that our LLM generates meaningful, semantically sound responses. It aggregates exact matches, stemmed matches, and paraphrase matches, and the overall score is based on the harmonic mean of precision and recall over these matches, adjusted by a word-order penalty. While the correlation between METEOR and human judgments was measured for Chinese and Arabic and found to be significant, further experimentation is needed to check its correlation for other languages. Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multi-word units (e.g. bigrams) could improve its accuracy; this has been proposed in more recent publications on the subject.
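A hedged sketch using NLTK's METEOR implementation, which needs the WordNet data for synonym matching; the sentences are illustrative only.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # needed for synonym matching

reference = ["the", "refund", "was", "issued", "within", "five", "days"]
candidate = ["the", "refund", "was", "processed", "within", "five", "days"]

# METEOR aligns exact, stemmed and synonym matches, then combines precision
# and recall (harmonic mean) with a fragmentation (word-order) penalty.
print(f"METEOR: {meteor_score([reference], candidate):.3f}")
```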

Zero-Shot Evaluation

When assessing the performance of language models, traditional evaluation metrics such as perplexity or accuracy on specific datasets might only partially capture their capabilities or generalisation power. This is where zero-shot evaluation comes into play. Zero-shot learning refers to the ability of a model to understand and perform tasks it has never seen during its training phase. In the context of large language models, zero-shot evaluation means assessing the model's capability to handle prompts or questions not explicitly represented in the training data. This kind of evaluation speaks to the following evaluation criteria (a minimal sketch of the setup follows the list below):

  • Reliability: Users can interact with models in ways never anticipated during training, but zero-shot evaluation checks for reliable responses to unforeseen inputs, creating a trustworthy experience for end-users.
  • Intelligence & Capability: Zero-shot metrics are essential for evaluating how well a model can apply its training to new tasks, especially for transfer-learning models. In addition, this evaluation is unbiased because it does not rely on the model being fine-tuned on a specific dataset, which can result in overfitting; instead, it showcases the model's inherent understanding and capability to solve diverse problems without tailored training.
  • Safety: Zero-shot evaluations can reveal biases in a model's responses, which may reflect biases in the training data. Identifying these biases is crucial for improving models and recognising potential safety concerns.
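Here is a minimal sketch of a zero-shot evaluation loop. The ask_llm function, the prompt wording and the tiny labelled set are hypothetical stand-ins for your own model client and evaluation data; the key property is that the task is described only in the prompt, with no fine-tuning and no in-prompt examples.

```python
def ask_llm(prompt: str) -> str:
    """Stand-in for a real model call (OpenAI client, local model, etc.).

    Replace this with your provider's API; the stub just lets the script run.
    """
    return "refund"  # a real model would generate this from the prompt

# A tiny, hand-written labelled set the model was never fine-tuned on.
eval_set = [
    ("I want my money back for last month's order.", "refund"),
    ("Which plan includes priority support?",        "sales"),
    ("The app crashes every time I log in.",         "technical"),
]

correct = 0
for text, expected in eval_set:
    prompt = (
        "Classify the customer message as one of: refund, sales, technical.\n"
        f"Message: {text}\nAnswer with a single word."
    )
    if ask_llm(prompt).strip().lower() == expected:
        correct += 1

print(f"zero-shot accuracy: {correct / len(eval_set):.0%}")
```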

Top 5 LLM Evaluation Frameworks

To conclude, stellar quality assurance and a battle-tested evaluation process for the notorious problems LLMs face are the crux of a naturally responsive voice AI solution.
