Offered By: IBM Skills Network
Use DeepEval and Traditional Metrics to Assess RAG Responses
Guided Project
Artificial Intelligence
At a Glance
Explore Large Language Model (LLM) evaluation techniques in this hands-on project that compares LLaMA and Granite for Retrieval-Augmented Generation (RAG) and textual analysis. Leverage HuggingFace's Evaluate library to compute traditional metrics, and DeepEval, a modern LLM-based framework, to score responses on more complex, judgment-based metrics. Through step-by-step guidance, you'll set up RAG and metric evaluation pipelines, interpret the results, and discover how modular metrics adapt to any LLM use case. Enroll now to gain essential data science expertise and confidently deploy robust RAG applications.
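To give a feel for what a "traditional metric" computes, here is a minimal pure-Python sketch of ROUGE-1 (unigram overlap between a candidate and a reference). This is only an illustration of the idea; the project itself uses the implementation from HuggingFace's Evaluate library.

```python
# Toy ROUGE-1 illustration: unigram precision/recall/F1 between a
# reference and a candidate. Not the HuggingFace `evaluate` version.
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # matched unigrams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(round(scores["f1"], 2))  # → 0.83
```

Because ROUGE only counts surface-level word overlap, a paraphrased but correct answer can score poorly; that gap is exactly what BERTScore and DeepEval's LLM-based judgments are meant to close.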
A Look at the Project Ahead
- Set Up a RAG Pipeline: Integrate LLaMA and Granite with vector stores to retrieve relevant context for narrative QA.
- Compute and Compare Metrics: Apply ROUGE and BERTScore to quantify model and retrieval quality, then interpret results.
- Implement Evaluation Workflows: Use DeepEval to orchestrate human-like judgments alongside automatic metrics.
- Explore Modularity: See how easily you can swap in new models, datasets, or metrics for future experiments.
- Visualize and Interpret Results: Plot computed scores in comprehensive graphs to compare model performance on different metrics.
- Design and deploy a retrieval-augmented generation pipeline using popular open-source LLMs.
- Build a flexible evaluation framework that combines automatic scoring with LLM-driven judgment, and analyze metric outputs to guide model selection.
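The retrieval step in the first bullet can be sketched in a few lines. This toy version ranks documents against a query with bag-of-words cosine similarity; the actual project uses dense embeddings and a vector store, so treat the function names here as illustrative only.

```python
# Minimal sketch of RAG retrieval: rank documents by cosine similarity
# to the query. Real pipelines embed text with a model and query a
# vector store; bag-of-words vectors are used here only for illustration.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = Counter(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]  # top-k passages become the LLM's context

docs = [
    "Granite is a family of IBM foundation models.",
    "LLaMA is a family of Meta foundation models.",
    "ROUGE measures n-gram overlap with a reference.",
]
print(retrieve("what does ROUGE measure", docs))
```

The retrieved passages are then prepended to the prompt, and it is the combined model-plus-retrieval output that the ROUGE, BERTScore, and DeepEval metrics evaluate.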
What You'll Need
- Basic Python proficiency: Comfortable with common data structures and writing simple scripts.
- Modern web browser: Latest version of Chrome, Edge, Firefox, or Safari for the optimal notebook experience.
- (Optional) Library knowledge: Basic familiarity with Pandas DataFrames and Matplotlib visualizations.
Estimated Effort
20 Mins
Level
Beginner
Skills You Will Learn
BERTScore, DeepEval, Generative AI, LLM Evaluation, RAG, ROUGE
Language
English
Course Code
GPXX05H7EN