Evaluate anything you want | Creating advanced evaluators with LLMs

With the rapid advancement of LLM "chains", "agents", chatbots and other text-generation applications, evaluating the performance of language models is crucial for understanding their capabilities and limitations. It is just as crucial to be able to adapt those metrics to your business goals.
While standard metrics like perplexity, BLEU scores and sentence distance provide a general indication of model performance, in my experience they often fail to capture the nuances and specific requirements of real-world applications.
For example, take a simple RAG QA application. When building a question-answering system, factors such as those of the so-called "RAG Triad" (context relevance, groundedness in facts) and language consistency between the query and the response matter just as much. Standard metrics simply cannot capture these nuanced aspects effectively.
This is where LLM-based "blackbox" metrics come in handy. While the idea can sound naive, the concept is quite compelling: these metrics utilise the power of large language models themselves to evaluate the quality and other aspects of the generated text. By using a pre-trained language model as a "judge", we can assess the generated text against pre-defined criteria, relying on the judge model's understanding of language.
In this article, I will walk through an end-to-end example of constructing the prompt, running the evaluation and tracking the results.
Since LangChain is the de-facto most popular framework for building chatbots and RAG, I will build the example application on it: it is easy to integrate into an MVP and it ships with simple evaluation capabilities. However, you can use any other framework or build your own. The main value of this article is the pipeline and the prompts.
How to do it?
Let's dive into the code and explore the process of creating custom evaluators. We'll walk through a few key examples and discuss their implementations.
Example #1 | Translation Quality
Let's take a simple translation LLM chain like this:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI

sys_msg = """You are a helpful assistant that translates English to French.
Your task is to translate as a fluent speaker of both languages.
Translate this sentence from English to French: {sentence}
"""
template = PromptTemplate.from_template(sys_msg)

translator_chain = LLMChain(llm=OpenAI(), prompt=template)
translator_chain.invoke({"sentence": "Hello, how are you?"})
>> {'sentence': 'Hello, how are you?', 'text': '\nBonjour, comment vas-tu ?'}
Ok, now we want to evaluate it using a "smarter" LLM, for example GPT-4 or Claude Opus. First and foremost, we need a proper prompt.
Prompting
Creating a proper evaluation prompt is 80% of the success. Key things to remember (an example template follows the list):
- Specify the criteria
- Define a numeric scoring scale with clear meanings
- Request score justification for insight into reasoning
- Provide query and context in prompt for reference
- Require strict response format for easy parsing

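To make this concrete, here is an illustrative template for the "Language Consistency" criterion we will implement below. The wording, the 1-5 scale and the {query}/{result} variable names are my own choices; adapt them to your criteria, but keep the strict SCORE/REASONING format so the parser below can read the answer.
LANGUAGE_CONSISTENCY_TEMPLATE = """You are an expert linguist grading the output of a translation system.

Criterion: the RESPONSE must be written entirely in the target language requested in the QUERY,
with no untranslated fragments and no mixing of languages.

Scoring scale:
1 - the response is in the wrong language or heavily mixes languages
3 - the response is mostly in the target language but contains untranslated fragments
5 - the response is fully and consistently in the target language

QUERY: {query}
RESPONSE: {result}

Answer strictly in the following format:
SCORE: <an integer from 1 to 5>
REASONING: <one or two sentences justifying the score>
"""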
Coding
First of all, we need to parse the score and reasoning out of the LLM's answer.
import re

def _parse_string_eval_output(text: str) -> dict:
    # Extract the numeric score and the free-text reasoning from the judge's answer
    score_pattern = r"SCORE: (\d+)"
    reasoning_pattern = r"REASONING: (.*)"
    score_match = re.search(score_pattern, text)
    reasoning_match = re.search(reasoning_pattern, text)
    score = int(score_match.group(1)) if score_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None
    return {"score": score, "reasoning": reasoning}
Since we will use an LLM for evaluation, we need a simple wrapper around it that parses the output scores.
from typing import Any, Optional
from langchain.callbacks.manager import Callbacks
from langchain.evaluation.schema import LLMEvalChain, StringEvaluator
from langchain.schema import RUN_KEY

class BaseEvalChain(LLMChain, StringEvaluator, LLMEvalChain):
    def _prepare_output(self, result: dict) -> dict:
        # Parse the raw LLM answer into a score and reasoning
        parsed_result = _parse_string_eval_output(result[self.output_key])
        if RUN_KEY in result:
            parsed_result[RUN_KEY] = result[RUN_KEY]
        return parsed_result
Now we can subclass it to create our custom eval chain with the proper evaluation name, input keys and so on.
class LanguageConsistencyEvalChain(BaseEvalChain):
    @property
    def evaluation_name(self) -> str:
        return "Language Consistency"

    def _evaluate_strings(self, *, prediction: str, reference: Optional[str] = None, input: Optional[str] = None, callbacks: Callbacks = None, include_run_info: bool = False, **kwargs: Any) -> dict:
        result = self({"query": input, "result": prediction}, callbacks=callbacks, include_run_info=include_run_info)
        return self._prepare_output(result)
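Here is a minimal usage sketch, assuming the LANGUAGE_CONSISTENCY_TEMPLATE from the "Prompting" section and GPT-4 as the judge; the output shown is illustrative.
from langchain_openai import ChatOpenAI

# Wire the evaluation prompt into the chain and let a stronger model play the judge
eval_prompt = PromptTemplate.from_template(LANGUAGE_CONSISTENCY_TEMPLATE)
evaluator_chain = LanguageConsistencyEvalChain(llm=ChatOpenAI(model="gpt-4"), prompt=eval_prompt)

evaluator_chain.evaluate_strings(
    input="Translate this sentence from English to French: Hello, how are you?",
    prediction="Bonjour, comment vas-tu ?",
)
>> {'score': 5, 'reasoning': 'The response is written entirely in French, as requested.'}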
You can actually use this evaluation chain as-is. However, you will probably want to add things like:
- K runs with score averaging to account for the LLM's randomness
- Integration with an evaluation framework to utilise most of its benefits (optional)
In this case, as in the previous ones, we can subclass the default RunEvaluator and customise it: k async runs with score averaging, input names and so on.
import asyncio
from typing import Optional
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run

class BaseEvaluator(RunEvaluator):
    ...
    async def evaluate_run_async(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        # Launch k independent judge calls concurrently
        tasks = []
        for _ in range(self.k):
            task = asyncio.create_task(self._evaluate_run_async(run, example))
            tasks.append(task)
        evaluations = await asyncio.gather(*tasks)
        scores = [evaluation["score"] for evaluation in evaluations]
        reasonings = [evaluation["reasoning"] for evaluation in evaluations]
        # Average the scores and keep the reasoning whose score is closest to the average
        avg_score = sum(scores) / len(scores)
        closest_reasoning_index = min(range(len(scores)), key=lambda i: abs(scores[i] - avg_score))
        closest_reasoning = reasonings[closest_reasoning_index]
        return EvaluationResult(
            key=self.evaluator.evaluation_name.lower().replace(" ", "_"),
            score=avg_score,
            comment=closest_reasoning,
        )

    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        return asyncio.run(self.evaluate_run_async(run, example))
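A possible way to wire it up, assuming the elided __init__ stores the wrapped eval chain as self.evaluator and the number of repetitions as self.k (the constructor signature below is hypothetical):
# Hypothetical construction; adjust to however your __init__ is defined
language_consistency_evaluator = BaseEvaluator(
    evaluator=LanguageConsistencyEvalChain(llm=ChatOpenAI(model="gpt-4"), prompt=eval_prompt),
    k=3,  # judge each run three times and average the scores
)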
Let's run