
Offered By: IBMSkillsNetwork

Use DeepEval and Traditional Metrics to Assess RAG Responses

Explore Large Language Model (LLM) evaluation techniques in this hands-on project that compares LLaMA and Granite for Retrieval-Augmented Generation (RAG) and textual analysis. Leverage Hugging Face's Evaluate library for computing traditional metrics and DeepEval, a modern, LLM-based framework for evaluating complex metrics. Through step-by-step guidance, you'll set up RAG and metric evaluation pipelines, interpret the results, and discover how modular metrics adapt to any LLM use case. Enroll now to gain essential data science expertise and confidently deploy robust RAG applications.


Guided Project

Artificial Intelligence

At a Glance


Evaluating large language models (LLMs) is crucial for ensuring they deliver accurate, reliable, and contextually appropriate outputs. By comparing two models (LLaMA and Granite) in a Retrieval-Augmented Generation (RAG) setup, you'll gain hands-on experience with the end-to-end evaluation process. You'll also learn to leverage DeepEval, an LLM-based evaluation framework, alongside traditional metrics (ROUGE, BERTScore) computed via Hugging Face's Evaluate library. These skills will empower you to rigorously assess any LLM pipeline, identify its strengths and weaknesses, and iterate toward better model and overall application performance.
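To make "traditional metrics" concrete before you start: ROUGE-1 scores how much a generated answer overlaps with a reference answer at the unigram level. In the project you'll compute it with Hugging Face's Evaluate library; the simplified, hand-rolled sketch below (the function name and sample sentences are illustrative, not from the project) shows what the metric actually measures.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count unigrams that appear in both texts (clipped by occurrence count).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "The model retrieved the correct passage.",
    "The model retrieved the relevant passage.",
)
# 5 of 6 unigrams overlap in each direction, so F1 = 5/6 ≈ 0.83
```

The real library version (`evaluate.load("rouge")`) handles stemming, ROUGE-2, and ROUGE-L as well, but the intuition is the same: higher overlap with the reference means a higher score.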

A Look at the Project Ahead

In this guided project, you will:

  • Set Up a RAG Pipeline: Integrate LLaMA and Granite with vector stores to retrieve relevant context for narrative QA.
  • Compute and Compare Metrics: Apply ROUGE and BERTScore to quantify model and retrieval quality, then interpret results.
  • Implement Evaluation Workflows: Use DeepEval to orchestrate human-like judgments alongside automatic metrics.
  • Explore Modularity: See how easily you can swap in new models, datasets, or metrics for future experiments.
  • Visualize and Interpret Results: Plot computed scores in comprehensive graphs to compare model performance on different metrics.
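The modularity mentioned above boils down to one idea: treat every metric as an interchangeable scoring function. A minimal sketch of such a harness is shown below; all names (`Metric`, `evaluate_model`, `exact_match`) are hypothetical and not part of the project's code, but the same shape lets you plug in a ROUGE scorer, BERTScore, or a DeepEval-backed LLM judge without changing the loop.

```python
from typing import Callable, Dict, List

# Hypothetical type alias: a metric maps (prediction, reference) -> score.
Metric = Callable[[str, str], float]

def evaluate_model(
    outputs: List[str],
    references: List[str],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Average each metric over all (output, reference) pairs."""
    return {
        name: sum(fn(o, r) for o, r in zip(outputs, references)) / len(outputs)
        for name, fn in metrics.items()
    }

# Any scorer with the right signature plugs in: traditional or LLM-judged.
def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

scores = evaluate_model(
    ["Paris", "berlin"],
    ["Paris", "Berlin"],
    {"exact_match": exact_match},
)
# scores["exact_match"] == 1.0
```

Swapping in a new model then just means regenerating `outputs`; swapping in a new metric means adding one entry to the dictionary.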

By the end of this project, you will be able to:

  • Design and deploy a retrieval-augmented generation pipeline using popular open-source LLMs.
  • Build a flexible evaluation framework that combines automatic scoring with LLM-driven judgment, and analyze metric outputs to guide model selection.
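The retrieval step of a RAG pipeline can be illustrated in a few lines. The toy sketch below ranks documents by bag-of-words cosine similarity to the query; it is a stand-in for the vector-store retrieval you will build in the project (which uses learned embeddings), and the sample documents and function names are illustrative only.

```python
import math
from collections import Counter
from typing import List

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: cosine_sim(q, Counter(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

docs = [
    "Granite is a family of IBM foundation models.",
    "LLaMA is a family of Meta foundation models.",
]
top = retrieve("Which models did IBM release?", docs)
```

In the full pipeline, the retrieved passages are then prepended to the prompt so the LLM can ground its answer in them; evaluation measures both how well this retrieval works and how faithfully the model uses it.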

What You'll Need

  • Basic Python proficiency: Comfortable with common data structures and writing simple scripts.
  • Modern web browser: Latest version of Chrome, Edge, Firefox, or Safari for the optimal notebook experience.
  • (Optional) Library knowledge: Basic familiarity with the Pandas DataFrame data structure and Matplotlib visualization.

Estimated Effort

20 Mins

Level

Beginner

Skills You Will Learn

BERTScore, DeepEval, Generative AI, LLM Evaluation, RAG, ROUGE

Language

English

Course Code

GPXX05H7EN


