Evaluate anything you want | Creating advanced evaluators with LLMs

With the rapid advancement of LLM "chains", "agents", chatbots and other text-generation applications, evaluating the performance of language models is crucial for understanding their capabilities and limitations. It is just as crucial to be able to adapt those metrics to your business goals.
While standard metrics like perplexity, BLEU scores and sentence distance provide a general indication of model performance, in my experience they often fail to capture the nuances and specific requirements of real-world applications.
For example, take a simple RAG QA application. When building a question-answering system, factors such as those of the so-called "RAG Triad" (context relevance, groundedness in facts) and language consistency between the query and the response matter just as much. Standard metrics simply cannot capture these nuanced aspects effectively.
This is where LLM-based "blackbox" metrics come in handy. While the idea can sound naive, the concept is quite compelling: these metrics utilise the power of large language models themselves to evaluate the quality and other aspects of the generated text. By using a pre-trained language model as a "judge", we can assess the generated text against pre-defined criteria, relying on the judge model's understanding of language.
In this article, I will walk through an end-to-end example of constructing the prompt, running the evaluation and tracking the results.
Since LangChain is the de-facto most popular framework for building chatbots and RAG, I will build the example application on it: it is easy to integrate into an MVP and it ships with simple evaluation capabilities. However, you can use any other framework or build your own. The main value of this article is the pipeline and the prompts.
How to do it?
Let's dive into the code and explore the process of creating custom evaluators. We'll walk through a few key examples and discuss their implementations.
Example #1 | Translation Quality
Let's take a simple translation LLM chain like this:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI

sys_msg = """You are a helpful assistant that translates English to French.
Your task is to translate as a fluent speaker of both languages.
Translate this sentence from English to French: {sentence}
"""
template = PromptTemplate.from_template(sys_msg)

translator_chain = LLMChain(llm=OpenAI(), prompt=template)
translator_chain.invoke({"sentence": "Hello, how are you?"})
>> {'sentence': 'Hello, how are you?', 'text': '\nBonjour, comment vas-tu ?'}
Ok, now we want to evaluate it using a "smarter" LLM, for example GPT-4 or Claude Opus. First and foremost, we need a proper prompt.
Prompting
Creating a proper evaluation prompt is 80% of the success. Key things to remember (an example template follows the list):
- Specify the criteria
- Define a numeric scoring scale with clear meanings
- Request score justification for insight into reasoning
- Provide query and context in prompt for reference
- Require strict response format for easy parsing

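To make this concrete, here is an illustrative template for the "Language Consistency" criterion we will implement below. The wording, the 1-5 scale and the {query}/{result} variable names are my own choices; adapt them to your criteria, but keep the strict SCORE/REASONING format so the parser below can read the answer.
LANGUAGE_CONSISTENCY_TEMPLATE = """You are an expert linguist grading the output of a translation system.

Criterion: the RESPONSE must be written entirely in the target language requested in the QUERY,
with no untranslated fragments and no mixing of languages.

Scoring scale:
1 - the response is in the wrong language or heavily mixes languages
3 - the response is mostly in the target language but contains untranslated fragments
5 - the response is fully and consistently in the target language

QUERY: {query}
RESPONSE: {result}

Answer strictly in the following format:
SCORE: <an integer from 1 to 5>
REASONING: <one or two sentences justifying the score>
"""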
Coding
First of all, we need to parse the score and reasoning out of the LLM's answer.
import re

def _parse_string_eval_output(text: str) -> dict:
    # Extract the numeric score and the free-text reasoning from the judge's answer
    score_pattern = r"SCORE: (\d+)"
    reasoning_pattern = r"REASONING: (.*)"
    score_match = re.search(score_pattern, text)
    reasoning_match = re.search(reasoning_pattern, text)
    score = int(score_match.group(1)) if score_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None
    return {"score": score, "reasoning": reasoning}
Since we will use an LLM for evaluation, we need a simple wrapper around it that parses the output scores.
from typing import Any, Optional
from langchain.callbacks.manager import Callbacks
from langchain.evaluation.schema import LLMEvalChain, StringEvaluator
from langchain.schema import RUN_KEY

class BaseEvalChain(LLMChain, StringEvaluator, LLMEvalChain):
    def _prepare_output(self, result: dict) -> dict:
        # Parse the raw LLM answer into a score and reasoning
        parsed_result = _parse_string_eval_output(result[self.output_key])
        if RUN_KEY in result:
            parsed_result[RUN_KEY] = result[RUN_KEY]
        return parsed_result
Now we can subclass it to create our custom eval chain with the proper evaluation name, input keys and so on.
class LanguageConsistencyEvalChain(BaseEvalChain):
    @property
    def evaluation_name(self) -> str:
        return "Language Consistency"

    def _evaluate_strings(self, *, prediction: str, reference: Optional[str] = None, input: Optional[str] = None, callbacks: Callbacks = None, include_run_info: bool = False, **kwargs: Any) -> dict:
        result = self({"query": input, "result": prediction}, callbacks=callbacks, include_run_info=include_run_info)
        return self._prepare_output(result)
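Here is a minimal usage sketch, assuming the LANGUAGE_CONSISTENCY_TEMPLATE from the "Prompting" section and GPT-4 as the judge; the output shown is illustrative.
from langchain_openai import ChatOpenAI

# Wire the evaluation prompt into the chain and let a stronger model play the judge
eval_prompt = PromptTemplate.from_template(LANGUAGE_CONSISTENCY_TEMPLATE)
evaluator_chain = LanguageConsistencyEvalChain(llm=ChatOpenAI(model="gpt-4"), prompt=eval_prompt)

evaluator_chain.evaluate_strings(
    input="Translate this sentence from English to French: Hello, how are you?",
    prediction="Bonjour, comment vas-tu ?",
)
>> {'score': 5, 'reasoning': 'The response is written entirely in French, as requested.'}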
You can actually use this evaluation chain as-is. However, you will probably want to add things like:
- K runs with score averaging to account for the LLM's randomness
- Integration with an evaluation framework to utilise most of its benefits (optional)
In this case, as in the previous ones, we can subclass the default RunEvaluator and customise it: k async runs with score averaging, input names and so on.
import asyncio
from typing import Optional
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run

class BaseEvaluator(RunEvaluator):
    ...
    async def evaluate_run_async(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        # Launch k independent judge calls concurrently
        tasks = []
        for _ in range(self.k):
            task = asyncio.create_task(self._evaluate_run_async(run, example))
            tasks.append(task)
        evaluations = await asyncio.gather(*tasks)
        scores = [evaluation["score"] for evaluation in evaluations]
        reasonings = [evaluation["reasoning"] for evaluation in evaluations]
        # Average the scores and keep the reasoning whose score is closest to the average
        avg_score = sum(scores) / len(scores)
        closest_reasoning_index = min(range(len(scores)), key=lambda i: abs(scores[i] - avg_score))
        closest_reasoning = reasonings[closest_reasoning_index]
        return EvaluationResult(
            key=self.evaluator.evaluation_name.lower().replace(" ", "_"),
            score=avg_score,
            comment=closest_reasoning,
        )

    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        return asyncio.run(self.evaluate_run_async(run, example))
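A possible way to wire it up, assuming the elided __init__ stores the wrapped eval chain as self.evaluator and the number of repetitions as self.k (the constructor signature below is hypothetical):
# Hypothetical construction; adjust to however your __init__ is defined
language_consistency_evaluator = BaseEvaluator(
    evaluator=LanguageConsistencyEvalChain(llm=ChatOpenAI(model="gpt-4"), prompt=eval_prompt),
    k=3,  # judge each run three times and average the scores
)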
Let's run