Introduction
We are witnessing a significant proliferation of AI assistants and agents designed to interact with humans across all domains. This rapid advancement has brought a critical challenge: the lack of rigorous, systematic approaches to evaluating these AI systems.
Current industry practices for evaluating Large Language Model (LLM) systems often rely on domain-expert evaluations. While human evaluation may be reliable, it is prohibitively expensive, time-consuming, and challenging to scale. Moreover, if not carefully executed, such assessments can be prone to subjectivity, potentially compromising the evaluations.
A recent trend has emerged to address these limitations: LLM-as-a-judge, or Language Model (LM) evaluators, which simulate human evaluation and automatically assess the quality of responses from LLM-powered systems.
Frontier models, such as GPT-4 and Claude 3.5 Sonnet, have become the de facto standards for LLM-based evaluations. However, using proprietary models as evaluators of AI applications raises significant concerns.
The release of Llama 3.1 405B Instruct [1] opened the door for a frontier open-source model that could be used as a powerful evaluator. However, the model's size makes running evaluations impractical and expensive.
While there have been significant strides in the open-source community to train robust LM evaluators, notably the Prometheus models and other alternatives [2-5], there remains room for innovation and research, especially in LLM system evaluations. This is particularly true for flexible, small LM evaluators that target LLM systems rather than general model evaluations.
Over the past months, we have been building the first small open LM evaluator trained on evaluation data for LLM systems.
Flow Judge: an open small language model for LLM system evaluations
Today, we introduce Flow Judge, an open, smaller-scale LM evaluator that achieves comparable performance to larger models such as GPT-4o and Claude 3.5 Sonnet, and popular open models like Llama 3.1 8B Instruct [1] on a variety of evaluation benchmarks.
Our model offers several unique advantages:
- Small but mighty: Flow Judge is a 3.8B model – much smaller than existing models used for LM-based evaluations – making it more accessible and easier to deploy in various environments. Despite its smaller size, Flow Judge achieves a high correlation with standard evaluators like GPT-4o and Claude 3.5 Sonnet, rivaling the performance of larger models in our held-out dataset and out-of-domain benchmarks such as RAGTruth [6] and HaluEval [7].
- Customizability: Flow Judge can follow custom evaluation criteria established by domain experts, enabling tailored assessments across various fields.
- Supports various scoring scales: Flow Judge can grade responses using rubrics on three different scoring scales: Pass / Fail (error detection), 3-Likert, and 5-Likert.
- Qualitative feedback: Flow Judge not only detects errors and grades outputs but also provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric, highlighting problematic parts of the responses.
- Structured evaluation outputs: Flow Judge was trained to produce structured evaluations with <feedback> and <score> tags.
- Open and accessible model: Flow Judge is released under the Apache 2.0 license, so it is free for any developer or company to use. Given its small size, it is perfect for anyone who wants to run evaluations cheaply and quickly with their own rubrics.
Flow Judge can contribute to the community by improving LM-based evaluations and establishing evaluation strategies that can systematically produce better LLM systems.
Dataset construction
Our dataset construction process draws inspiration from [2,3,8-11], specifically focusing on aspects crucial for LLM system evaluations in generative AI products rather than model evaluations.
Previous work [2,3] has demonstrated that evaluation rubrics can be used to synthetically generate evaluations for fine-tuning LM evaluators. However, the datasets used in that work centered on model evaluations, containing single instructions and responses from a language model.
We devoted significant effort to synthetically generating training instances with multiple inputs beyond a single instruction, such as user queries and retrieved contextual information. We also generated custom rubrics that AI practitioners could utilize for AI products in different industry verticals, such as legal and manufacturing.
To minimize human intervention and reduce ambiguity in the creation of the evaluations, we employed a dual-evaluation strategy with consensus to create the evaluation data used for training and assessment of Flow Judge.
In the subsequent sections, we briefly explore the curation of seed rubrics, followed by the synthetic generation of domain-adapted rubrics and training data. Finally, we present our consensus method.
Manual data curation of seeds
Our data creation method relies on curating seed metrics and rubrics, which serve as the basis for creating domain-adapted evaluation criteria and rubrics. These seeds also serve as the starting point for generating input data for different domains and tasks.
After conducting extensive research on the most commonly used evaluation metrics in LLM systems, we grouped the metrics relevant to LLM systems into the following categories (the full list of seed metrics and their definitions is in Appendix A): Response Quality, Retrieval Quality, Bias and Toxicity, Style and Guideline Adherence, Language, and Miscellaneous.
We set a consistent format for the seeds, typically as one-line statements that measure specific aspects of performance. These descriptions were designed to be domain-agnostic, concise, and clear, avoiding aggregated metrics that combine multiple concepts.
A small team of human annotators was provided with formatting instructions that aimed for clarity and consistency across metrics, emphasizing simplicity and avoiding ambiguity.
Three scoring rubrics, one for each supported scoring scale, were curated for each metric. To enhance the quality and efficiency of this process, strong LLMs were employed to augment human capabilities in curating the seed rubrics. However, it's important to note that all seed rubrics underwent a careful human review to ensure the high quality of the curation process.
The curation process yielded 31 distinct metrics with three scoring rubrics each, for a total of 93 unique data points for subsequent augmentation with domain metadata.
Seen and unseen rubrics
To prevent data leakage and ensure robust evaluation of our held-out dataset, we split our dataset into train, validation, and test sets. By creating the splits at this stage in the data creation pipeline, we guaranteed that the validation and test sets contained entirely unseen metrics and rubrics that the model would not encounter during training.
Domain-adapted metrics
Building upon the work in [9,11], we used a systematic approach to generate high-quality, domain-adapted evaluation criteria and rubrics. Our approach utilizes predefined domains as metadata to transform generic seed rubrics into domain-tailored versions.
We chose the domains based on the likelihood of generative AI systems being deployed in them, aiming to tailor each generic seed metric to its target domain and thereby create a more diverse set. We selected 14 domains: Legal, Healthcare, Finance, Education, Customer service, Marketing, Human resources, E-commerce, Travel and Tourism, Technical support, Personal assistant, Biomedical, Manufacturing, and Logistics.
We employed strong language models to adapt the seed metrics and rubrics to specific domains while preserving the core evaluation criteria.
To ensure diversity in the augmented metrics, we randomly sampled the seed metrics and routed them to GPT-4o and Claude 3.5 Sonnet with temperatures ranging from 0.7 to 0.9.
The augmentation process was applied to each of the 93 metrics across the 14 domains, resulting in a dataset containing 1,302 domain-adapted metrics and rubrics.
Increasing the complexity and diversity
During our systematic quality assurance review, we identified that the synthetically generated domain-adapted criteria and rubrics were usually of low to moderate complexity compared to the evaluation needs in the real world. We made this assessment based on our extensive experience building LLM systems.
Previous work [8] demonstrates that varying complexity levels in the training dataset can yield better results, so we decided to enhance the complexity and language diversity of our evaluation metrics. We used LLMs to rewrite evaluation criteria and rubrics to include more complex evaluation setups, and then combined these newly generated criteria and rubrics with the original dataset.
The complexity evolution was designed to generate more intricate data points. The language diversity evolution sought to include various formulations of evaluation criteria and rubrics, enhancing the language diversity of the dataset.
The prompts in [8] were unsuitable for our case, so we carefully crafted prompts that instructed the models to rewrite the criteria and rubric.
We divided our metrics dataset, allocating 50% of the metrics to complexity evolution and the remaining 50% to language diversity evolution.
For the complexity evolution, we leveraged the strong reasoning capabilities of GPT-4o and Claude 3.5 Sonnet, randomly routing samples to these models with consistent temperature settings to ensure controlled variability. For the language diversity evolution, we utilized Llama 3.1 70B Instruct [1] as a paraphraser, since that task requires less reasoning.
This data creation step resulted in an additional 1,302 metrics and rubrics. These were then mixed with the original 1,302 domain-specific metrics, yielding a dataset of 2,604 metrics and rubrics.
These are the splits of the final metrics dataset:
Synthetic generation of instances from rubrics
We then implemented the next step in the data creation process, which used all the metrics as seeds to generate data across the 14 domains and for different tasks. Each instance was derived from a specific metric, incorporating the evaluation criteria, rubric, metric description, and the input-output specifications of the generative task to be evaluated.
One of the primary challenges in creating a synthetic dataset with evaluations is the need to prompt language models to produce outputs of varying quality, including those that would receive lower scores according to the rubric. To address this, we adopted an approach similar to that described in [2] for creating the Feedback Collection dataset:
- We incorporated a target score in the generation prompt.
- For each score level defined in the rubric, we generated a corresponding instance, including inputs and output.
We carefully engineered prompts to instruct Claude 3.5 Sonnet to generate realistic examples that include both inputs and responses. This step required a few iterations, including human reviews, to engineer a prompt that consistently generated the type of data we had envisioned.
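A sketch of this score-conditioned generation loop is shown below; the call_generator helper and the prompt wording are illustrative stand-ins for our actual generation prompts, not the prompts used in the pipeline:

```python
# Sketch of score-conditioned instance generation. `call_generator` is a hypothetical
# stand-in for an API call to the generator model (Claude 3.5 Sonnet in our pipeline).
from typing import Callable

def generate_instances_for_metric(
    metric_name: str,
    criteria: str,
    rubric: dict[int, str],          # e.g. {1: "description of score 1", ..., 5: "..."}
    call_generator: Callable[[str], str],
) -> list[dict]:
    """Generate one synthetic (inputs, output) pair per score level in the rubric."""
    instances = []
    for target_score, description in rubric.items():
        prompt = (
            f"You are creating evaluation data for the metric '{metric_name}'.\n"
            f"Evaluation criteria: {criteria}\n"
            f"Scoring rubric: {rubric}\n\n"
            f"Generate a realistic task input and a response that would deserve a score "
            f"of {target_score} ({description}). Make the example self-contained."
        )
        instances.append({
            "metric": metric_name,
            "target_score": target_score,
            "generation": call_generator(prompt),
        })
    return instances
```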
The result was a substantial dataset of 7,531 instances spanning different metrics, domains, and scoring scales.
Addressing verbosity biases
Previous research [12, 13] has demonstrated that LLMs tend to prefer longer, more verbose outputs. To account for this, we conducted a thorough analysis of our generated dataset to identify whether there were strong correlations between high scores and longer responses.
Our analysis revealed significant verbosity biases across different scoring scales. We observed a consistent positive correlation between higher scores and longer text outputs or responses, suggesting that the models generally rated more detailed and extensive responses higher. Conversely, concise outputs correlated with lower scores.
To address these verbosity biases, we implemented a straightforward yet effective transformation:
- We prompted a language model to increase the length of outputs with scores below the highest by a specific number of words while preserving the content and style of the response.
- The number of additional words was determined by calculating the difference in median output lengths between each score level and the highest score.
Since the difference was usually between 20 and 40 words, the risk of altering the quality of the response through this transformation was low. This verbosity bias mitigation strategy sought to train our model to focus on the substantive aspects of responses rather than being influenced by verbosity.
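A sketch of the length-gap computation, assuming the generated instances are held in a pandas DataFrame with score and output columns:

```python
# For each score level, compute how many words separate its median output length from the
# median length of the highest-scoring outputs. These gaps drive the rewriting prompts.
import pandas as pd

def verbosity_gaps(df: pd.DataFrame) -> dict[int, int]:
    lengths = df.assign(n_words=df["output"].str.split().str.len())
    median_by_score = lengths.groupby("score")["n_words"].median()
    top_median = median_by_score.loc[median_by_score.index.max()]
    # Positive gap => outputs at this score level are shorter than top-scoring ones
    # and should be lengthened by roughly this many words (content and style preserved).
    return {int(score): int(top_median - median) for score, median in median_by_score.items()}
```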
Synthetically generating evaluations for training
The final step in our synthetic data creation process involved creating the actual evaluation data for training Flow Judge. We prompted LLMs to produce verbal qualitative feedback before assigning the final score to the response, since this has been shown to have a positive effect similar to chain-of-thought prompting techniques [2,3]. This sequence aims to emulate human evaluation processes, where reasoning precedes numerical scoring.
The two components of the evaluations are:
- Feedback: A detailed paragraph containing the LM evaluator's reasoning for the assigned score. This feedback also highlights any problematic areas in the generation, providing a rich context for the evaluation.
- Score: A numerical score derived from the specific rubric associated with the evaluation metric.
Below is an illustrative example of the evaluation output format (the content is synthetic and shown only to demonstrate the structure):
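```
<feedback>
The response answers the user's question and stays grounded in the retrieved context for the most part. However, the claim made in the second sentence is not supported by any of the provided documents, which lowers its faithfulness according to the rubric.
</feedback>
<score>
3
</score>
```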
We prompted the LLMs to produce structured outputs using XML tags so that the evaluations could be parsed reliably.
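A minimal parser sketch for this tag format, using only the Python standard library:

```python
import re

FEEDBACK_RE = re.compile(r"<feedback>\s*(.*?)\s*</feedback>", re.DOTALL)
SCORE_RE = re.compile(r"<score>\s*(\d+)\s*</score>")

def parse_evaluation(raw: str) -> dict:
    """Extract the feedback paragraph and numeric score from a structured evaluation."""
    feedback = FEEDBACK_RE.search(raw)
    score = SCORE_RE.search(raw)
    if feedback is None or score is None:
        raise ValueError("Evaluation is missing <feedback> or <score> tags")
    return {"feedback": feedback.group(1), "score": int(score.group(1))}
```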
To reduce ambiguity in our training labels, we implemented a dual-evaluation strategy. We obtained two independent evaluations: one from GPT-4o and another from Claude 3.5 Sonnet. Both models were configured with a temperature of 0.0 to promote consistency in their outputs.
The final label for each evaluation was determined through a consensus method, combining the insights from both evaluators. We established specific criteria for handling cases of significant disagreement. For instance, assessments with completely opposing scores were discarded.
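A sketch of this consensus rule for the Likert scales, assuming each instance carries a score from both evaluators; the handling of partial disagreements shown here is illustrative rather than our exact criteria:

```python
from typing import Optional

def consensus_score(score_a: int, score_b: int, scale_max: int) -> Optional[int]:
    """Combine two independent evaluations (e.g. GPT-4o and Claude 3.5 Sonnet) into one label.

    Returns None when the evaluators give completely opposing scores, in which case the
    instance is discarded. The tie-breaking rule for partial disagreement is illustrative.
    """
    if score_a == score_b:
        return score_a
    if {score_a, score_b} == {1, scale_max}:  # opposite ends of the rubric: discard
        return None
    return min(score_a, score_b)  # illustrative: fall back to the more conservative score
```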
In the final phase of our data creation process, we analyzed score distributions within our dataset. This revealed score imbalances in all scoring scales. To address this issue and create a more balanced dataset, we applied undersampling techniques to reduce the prevalence of majority classes. This step was crucial in mitigating potential biases in the training data and ensuring that Flow Judge would be equally adept at identifying and evaluating responses across the entire spectrum of scores.
After applying the consensus method and discarding conflicting evaluations, we obtained a total of 5,102 evaluations. Below is a breakdown of the dataset.
Human review to ensure quality
After each step in the data generation pipeline, we randomly selected 100 data points for human review. This process focused on identifying problematic patterns or inconsistencies in the synthetically generated data.
We employed an iterative approach: first, we generated a small batch of data, reviewed it, addressed any identified issues, and then proceeded to generate the entire dataset. This method allowed us to catch and rectify potential problems early in the process, ensuring higher overall quality.
We integrated Argilla's user interface into our workflow to streamline the review process. By implementing this focused QA process, we maintained high data quality standards throughout the creation of our synthetic dataset.
Model Information
Architecture inherited from Phi-3.5-mini
Flow Judge is based on the Phi-3.5-mini architecture, and the base model checkpoint used is specifically its instruct version. The model uses the same tokenizer, supports MQA and Flash Attention 2, and has weights in bfloat16 precision.
However, the model's multilingual support and long-context behavior have not been fully tested post-finetuning. Due to the specialized Supervised Fine-Tuning (SFT), Flow Judge may show different benchmark results from the base model and supports a maximum context length of 8,192 tokens, shorter than the base model's.
Base model information
Phi-3.5-mini, a Transformer-based language model with 3.8 billion parameters, was unveiled by Microsoft in August 2024. This iteration supports a 128K token context length, enhancing the previous Phi-3 Mini version from June 2024. [14]
Microsoft's efforts build on the "Textbooks Are All You Need" approach, leveraging high-quality training data to boost the performance of smaller language models, diverging from typical scaling laws. [15]
A notable feature is its size and, therefore, its capability to run on resource-constrained devices. The phi-3-mini model can be quantized to 4-bits, reducing its memory footprint to roughly 1.8GB. Tests showed the quantized model efficiently runs on an iPhone 14 with an A16 Bionic chip, achieving a processing rate of over 12 tokens per second offline and natively. [16]
Post pre-training, the model underwent chat fine-tuning through Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), significantly improving its multilingual, multi-turn conversational quality and reasoning abilities. [17]
The chat template is of the following standard format:
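A minimal sketch that renders this template with the transformers tokenizer (assuming the microsoft/Phi-3.5-mini-instruct tokenizer is available); the trailing comments show the rough shape of the rendered prompt:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the retrieved context."},
]

# add_generation_prompt=True appends the assistant header so the model knows to respond.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Rendered prompt (roughly):
# <|system|>
# You are a helpful assistant.<|end|>
# <|user|>
# Summarize the retrieved context.<|end|>
# <|assistant|>
```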
Phi-3-mini shares a structural similarity with Llama-2, utilizing the same tokenizer with a vocabulary size of 32,064 tokens. This compatibility ensures that Llama-2 packages can be easily adapted for phi-3-mini. The model architecture includes a 3072 hidden dimension, 32 heads, and 32 layers. [18]
Base model architecture
Phi-3.5-mini has 3.8B parameters and is a dense decoder-only Transformer model using the same tokenizer as Phi-3 Mini.
Fine-tuning
Data preparation
Raw inputs formatted and string-interpolated into the Phi-3.5 prompt template
We started with raw input data where each required element (['inputs', 'output', 'criteria', 'rubric', 'feedback', 'score']) resides in a separate column. These columns are formatted and interpolated into the final prompt strings for supervised fine-tuning.
Since phi-3.5-mini-instruct is already fine-tuned for chat, we adopted the same chat template structure in our finetuning.
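A simplified sketch of this interpolation, assuming one dataset row per evaluation and a stand-in template that is much shorter than the real one (the full prompt templates are listed in Appendix B):

```python
# Turn one dataset row into a chat-format training example.
# The real user prompt template (see Appendix B) is more detailed than this stand-in.
def row_to_chat_example(row: dict) -> dict:
    user_message = (
        "# Inputs\n{inputs}\n\n"
        "# Output to evaluate\n{output}\n\n"
        "# Evaluation criteria\n{criteria}\n\n"
        "# Scoring rubric\n{rubric}"
    ).format(**row)

    # The assistant turn is the target the model learns to produce:
    # structured feedback followed by the score.
    assistant_message = (
        f"<feedback>\n{row['feedback']}\n</feedback>\n"
        f"<score>\n{row['score']}\n</score>"
    )

    return {
        "messages": [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": assistant_message},
        ]
    }
```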
Reading the input datasets, preprocessing & double-checking results
Once the datasets were formatted into the required chat format, the axolotl fine-tuning framework read them in with the correct configuration, merging and shuffling the dataset splits (3-Likert, 5-Likert, and binary) separately for both training and validation. [19]
Here is an example of how we read the chat data in Axolotl:
Before starting fine-tuning, we used axolotl's preprocessing to ensure the input training data was consistent. We loaded snapshots of the preprocessed datasets and manually inspected them in a code editor for accuracy.
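A sketch of that inspection step, assuming the preprocessed snapshot is stored in Hugging Face datasets format under axolotl's prepared-data directory (the path and column names are assumptions for this sketch):

```python
# Load an axolotl-preprocessed snapshot and decode a sample back to text so the rendered
# prompt and masked labels can be checked by hand. The path is a placeholder.
from datasets import load_from_disk
from transformers import AutoTokenizer

dataset = load_from_disk("last_run_prepared/<dataset-hash>")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

sample = dataset[0]
print(tokenizer.decode(sample["input_ids"]))
# Tokens with label -100 are ignored by the loss; only the assistant turn should be trained on.
trained_on = [t for t, l in zip(sample["input_ids"], sample["labels"]) if l != -100]
print(tokenizer.decode(trained_on))
```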
Training process
We started by loading the base model microsoft/Phi-3.5-mini-instruct in half-precision bfloat16. For supervised finetuning, we used rsLoRA with a rank of 256 and an alpha of 128. [20,21]
Our finetuning ran for five epochs, employing a cosine scheduler with an extended warmup period. We opted for a low learning rate and incorporated LoRA dropout and slight weight decay to prevent overfitting. The finetuning process used the dataset's training split, while the validation split was used to monitor evaluation loss. Post-training, we tested on a separate held-out test split that remained unseen during training.
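For reference, a sketch of the equivalent adapter configuration expressed with the PEFT library; our run was configured through axolotl, and the target modules and dropout value shown here are assumptions rather than the exact settings:

```python
# rsLoRA adapter configuration matching the settings described above (rank 256, alpha 128,
# rank-stabilized scaling). Target modules and dropout are assumptions for this sketch.
from peft import LoraConfig

lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    use_rslora=True,          # scale by alpha / sqrt(r) instead of alpha / r [20]
    lora_dropout=0.05,        # illustrative value; a small dropout was used against overfitting
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # assumed for Phi-3.5
    task_type="CAUSAL_LM",
)
```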
We also experimented with the DoRA method but found no improvement over rsLoRA. Due to hardware limitations, we had to run DoRA with a smaller batch size, which may have negatively impacted the results. Additionally, DoRA is 20-30% slower than rsLoRA, limiting the number of experiments we could conduct within a given timeframe. [22,23]
Post-training
After finetuning, we benchmarked the checkpoints using our held-out and out-of-domain datasets to identify the best-performing model. We selected the checkpoint at finetuning step 231, corresponding to approximately 3.7 epochs out of the total 5 training epochs.
Some checkpoints showed complementary performance on the held-out and out-of-domain datasets. We attempted to merge these checkpoints using the PEFT library's TIES and DARE merging methods. However, this did not result in clear performance improvements, so we did not pursue it further. [24-26]
We used the PEFT and transformers libraries to merge the trained LoRA adapter, stored in FP32, into the bfloat16 base model, resulting in a bfloat16 model. At this step, we explicitly set flash_attention_2 as the attention backend. [27-29]
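A sketch of this merge step with the PEFT and transformers libraries (the adapter and output paths are placeholders):

```python
# Merge the trained LoRA adapter into the bfloat16 base model and save a standalone checkpoint.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # explicitly set, as described above [28,29]
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # adapter weights in FP32
merged = model.merge_and_unload()                                # fold the adapter into the base weights

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
merged.save_pretrained("flow-judge-merged-bf16")
tokenizer.save_pretrained("flow-judge-merged-bf16")
```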
Quantization
We offer two quantization options: AWQ and GGUF. To create the quantized models, we used the AutoAWQ library and Llama.cpp. [30-33]
Additionally, Microsoft provides comprehensive instructions for quantizing the Phi 3.5 model architecture, which also apply to Flow Judge. [34]
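A sketch of the AWQ step with the AutoAWQ library; the quantization settings shown are AutoAWQ's common 4-bit GEMM configuration and, like the paths, are assumptions for this sketch:

```python
# Quantize the merged bfloat16 checkpoint to 4-bit AWQ with the GEMM kernel.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "flow-judge-merged-bf16"          # placeholder: merged checkpoint from the previous step
quant_path = "flow-judge-awq-4bit-gemm"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```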
Inference
Hardware requirements
To run Flow Judge efficiently, ensure your hardware meets the following requirements:
- Modern GPU with at least 4 GB VRAM (e.g., NVIDIA RTX series)
- Minimum of 8 GB of system memory
- At least 10 GB of free storage for model files and dependencies
By default, the model uses flash attention, which requires certain GPU hardware types to run. If you want to run the model on NVIDIA V100 or earlier-generation GPUs, choose attn_implementation="eager".
Inference: how to run Flow Judge
To run the Flow Judge model for inference, you can use our flow-judge library. Here's a quick-start example:
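The snippet below is a minimal sketch of such a quick start; the class and metric names (FlowJudge, EvalInput, Hf, RESPONSE_FAITHFULNESS_5POINT) are assumptions that may differ from the library's actual API, so consult the flow-judge repository for the canonical example:

```python
# Illustrative sketch only: names are assumptions based on the flow-judge repository's
# documented usage and may differ from the library's actual API.
from flow_judge import FlowJudge, EvalInput, Hf
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT

# Load Flow Judge with the Hugging Face Transformers backend (a vLLM backend is also available).
model = Hf()
judge = FlowJudge(metric=RESPONSE_FAITHFULNESS_5POINT, model=model)

# Package the LLM system's inputs and output for evaluation.
eval_input = EvalInput(
    inputs=[
        {"query": "What is the refund window for online orders?"},
        {"context": "Our policy allows refunds within 30 days of delivery."},
    ],
    output="You can request a refund within 30 days of delivery.",
)

result = judge.evaluate(eval_input)
print(result.feedback)  # qualitative reasoning
print(result.score)     # numeric score from the rubric
```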
flow-judge supports multiple model types, including Hugging Face Transformers and vLLM, and offers various pre-defined evaluation metrics. For more advanced usage, custom metrics, and batch evaluations, check out the flow-judge repository.
Performance benchmarks
Flow Judge, running in bfloat16 precision, delivers robust performance on a single NVIDIA 4090 GPU. We benchmarked the model on the halubench_halueval dataset in lm-evaluation-harness, utilizing vllm 0.6.1.post2 and vllm-flash-attn 2.6.1 libraries. [35]
The model processes at an average speed of around 1,000 tokens per second, reaching peak input speeds of 5,870 tokens per second and peak output speeds of 1,045 tokens per second, with throughput accelerating towards the end. It completes the dataset in 21 minutes and 46 seconds.
We find that the AWQ version, quantized to 4-bit using the GEMM kernel, runs in 18 minutes and 35 seconds on the same setup and achieves similar scores on the halubench_halueval task, displaying only very slight degradation.
Inference and evaluation results comparison
Flow Judge in bfloat16:
Evaluation metrics for bfloat16 in halubench_halueval task:
Flow Judge in AWQ 4-bit (GEMM) quantized:
Evaluation metrics for AWQ 4-bit (GEMM) in halubench_halueval task:
Overall, the performance of both the unquantized and quantized versions showcases Flow Judge's capability to process extensive datasets with ease.
Performance
Datasets
We evaluated Flow Judge on our held-out dataset and a combination of publicly available relevant benchmarks for evaluating LM judges.
Held-out test set
We evaluated Flow Judge on a held-out split of our dataset. This split contains metrics unseen during training and measures the correlation of Flow Judge with GPT-4o and Claude 3.5 Sonnet.
We evaluated separately on the 3 different scoring scales: Pass / Fail, 3-Likert, and 5-Likert.
RAGTruth
RAGTruth [6] is an academic dataset that studies word-level hallucinations in LLM applications using Retrieval-Augmented Generation (RAG) frameworks. It contains about 18,000 responses generated by various LLMs using RAG across different domains and tasks. These responses have been carefully annotated manually, both at the response and word levels, including assessments of hallucination intensity.
The dataset is split into three RAG tasks: Question Answering (QA), Data-to-text writing, and News Summarization. The QA and News Summarization test splits contain similar data to our training data, while the data-to-text split contains JSON inputs.
We formatted the Pass / Fail evaluations as described in the original paper and evaluated the ability of our model to detect hallucinations at the response level.
HaluEval
HaluEval [7] is another dataset designed to evaluate the ability of LM evaluators to detect hallucinations. It consists of a large collection of generated and human-annotated hallucinated samples to evaluate the performance of LLMs in recognizing hallucinations.
It contains general user queries with LLM-generated responses and task-specific examples spanning question-answering, knowledge-grounded dialogue, and text summarization.
We used the HaluEval subset from HaluBench [4], which contains 10k questions with knowledge from Wikipedia, as well as question text and ground-truth answers collected from HotpotQA.
PubMedQA
PubMedQA [36] is a specialized dataset for biomedical question-answering tasks. It is compiled from PubMed's abstracts, a comprehensive medical literature database. The dataset presents research questions that require responses in the form of "yes," "no," or "maybe." Additionally, each answer is accompanied by an extended explanation that draws supporting evidence from the provided context, offering a more detailed justification for a given response.
We used the subset from HaluBench [4] formatted as Pass / Fail, which contains additional perturbations to the original dataset to generate hallucinated answers that appear plausible but are not faithful to the context.
Covid-QA
The COVID-QA [37] dataset consists of 2k question-answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19.
We also used the subset from HaluBench [4] formatted as Pass / Fail, which also contains perturbations.
Feedback bench
Feedback Bench is a dataset generated using the same procedure as the Feedback Collection dataset published in [2]. It contains 1,000 evaluations on a 5-Likert scale generated by GPT-4.
We evaluated the correlation of Flow Judge with a frontier model on fine-grained evaluations and compared its performance to prometheus-7b-v2.0 [3].
Baselines
To demonstrate the effectiveness of our fine-tuning process, we used the checkpoint of Phi-3.5-mini-instruct [17] as our primary baseline. Additionally, we selected two larger popular open-source instruction LMs as strong baselines: Llama-3.1-8B-Instruct [38] and Mistral-Nemo-Instruct-2407 [39]. We also compared our model to gpt-4o-mini.
For comparison, we obtained results using the same prompt template, formatting requirements, and hyperparameter settings employed for Flow Judge.
To evaluate Flow Judge's generalization capabilities, we also evaluated its performance on publicly available out-of-domain datasets. We compared these results to those of recent LM evaluators, including open models like Prometheus 2 [3] and Lynx [4] or closed-source models like Luna [5]. For these models, we report the metrics as published in their respective original papers.
Metrics
For comparisons on Pass / Fail datasets, we use precision, recall, and F1-score. Additionally, we report accuracy for HaluEval [7] and PubMedQA [36] since that was the metric reported in the Lynx [4] paper.
For comparisons on Likert scales, we use Pearson, Spearman, and Kendall-Tau correlations, as reported in the Prometheus 2 [3] paper.
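All of these metrics are available off the shelf; a minimal sketch with scikit-learn and SciPy, assuming predicted and reference scores are given as integer lists (Pass/Fail encoded as 0/1):

```python
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def pass_fail_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Precision, recall, F1, and accuracy for binary Pass/Fail labels (1 = Pass)."""
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy_score(y_true, y_pred)}

def likert_metrics(y_true: list[int], y_pred: list[int]) -> dict:
    """Correlation metrics for Likert-scale scores."""
    return {
        "pearsonr": pearsonr(y_true, y_pred)[0],
        "spearmanr": spearmanr(y_true, y_pred)[0],
        "kendalltau": kendalltau(y_true, y_pred)[0],
    }
```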
Results
We ran our evaluations using our fork of the lm-evaluation-harness [40] by EleutherAI. The evaluation code and results are publicly available in our fork of the harness.
The instructions for reproducing the evaluations can be found in the flow_judge_eval task folder. We utilized the vLLM engine to run the evaluations. The results generated by the lm-evaluation-harness are included in the results/ folder.
Held-out test sets
We evaluated our models and baselines on the held-out test sets. Since the evaluations in our dataset are obtained from consensus between GPT-4o and Claude 3.5 Sonnet, the results are a measure of the correlation or agreement of the models with these reference evaluators.
Flow Judge achieved the best performance on the pass/fail held-out test set, significantly outperforming its foundation model, Phi-3.5-mini-instruct [17], and larger models like Meta-Llama-3.1-8B-Instruct [38] and Mistral-Nemo-Instruct-2407 [39]. Notably, Flow Judge obtained the highest precision and F1 score among all models tested, including the smaller version of the GPT family.
For the 3-Likert and 5-Likert scales, we measured correlations with the reference evaluators GPT-4o and Claude 3.5 Sonnet. Flow Judge shows strong correlations with these models, surpassing its base model by an average of 18%. Despite its significantly smaller size and more limited training data, it also marginally outperforms larger open models, including the popular and very capable prometheus-eval/prometheus-7b-v2.0 [3]. Our model also shows comparable performance to gpt-4o-mini on these test sets.
RAGTruth
In the RAGTruth QA task [6], Flow Judge achieved comparable performance to larger open models. It also performs at par with its foundation model, microsoft/Phi-3.5-mini-instruct [17].
In the RAGTruth Data-to-Text task [6], Flow Judge shows significant underperformance compared to specialized models like Luna [5], which is fine-tuned specifically for this type of data. This performance gap was anticipated, given that our training dataset did not incorporate structured inputs and outputs.
In RAGTruth Summarization [6], Flow Judge also achieves comparable performance to its baselines, achieving the highest precision score.
These results suggest that Flow Judge can generalize to tasks where the distribution of inputs and outputs is not significantly different from the training data. However, our model faces challenges in domains requiring structured data interpretation and mathematical reasoning, highlighting potential areas for future model improvements and dataset diversification.
Accuracy for Lynx models on the RAGTruth QA subset is reported in [4]. Lynx 8B achieves 80.0% accuracy while Lynx 70B achieves 80.2%. Flow Judge matches the performance of these two models with a slightly higher accuracy of 81.2%, despite being smaller and not trained on the RAGTruth QA train set.
HaluEval, Covid-QA and PubMedQA
Our model demonstrates significant progress in the HaluEval [7] benchmark, surpassing its base Phi-3.5-mini-instruct [17] by approximately 8%. This improvement indicates enhanced capabilities in hallucination detection post fine-tuning. Flow Judge also outperforms the larger Mistral NeMo [39] and closely ranks under Llama 3.1 8B instruct [38] and gpt-4o-mini.
When compared to Lynx [4], a model fine-tuned on hallucination detection data, Flow Judge exhibits performance comparable to Lynx's smaller version.
In the Covid-QA subset of HaluBench [4,37], Flow Judge maintains its superior performance over the open-source baselines. However, the Lynx family of models outperforms Flow Judge in terms of accuracy on this benchmark.
Flow Judge shows lower recall performance on the PubMedQA subset of HaluBench [4,36] compared to baselines. This is likely due to PubMedQA's long contexts and answers, as well as its focus on biomedical research requiring quantitative reasoning. These challenges highlight areas for improving the next models.
Feedback bench
Flow Judge demonstrates strong performance on the Feedback Bench dataset [2], outperforming the established baselines Phi-3.5-mini-instruct [17], Meta-Llama-3.1-8B-Instruct [38], and Mistral-Nemo-Instruct-2407 [39] across all correlation metrics. Notably, it achieves results comparable to gpt-4o-mini, showcasing its competitiveness with advanced language models.
While Prometheus-7b-v2.0 achieves the highest scores, it's important to note that Feedback Bench [2,3] is an in-domain dataset for this model, potentially giving it an advantage.
Summary
We demonstrate that Flow Judge achieves strong correlations with standard proprietary evaluators like GPT-4o and Claude 3.5 Sonnet on our held-out test sets, outperforming larger models in pass/fail and Likert scale evaluations, despite being a compact model fine-tuned on a small synthetic dataset.
We also showcased its ability to generalize on other benchmarks such as RAGTruth [6], subsets of HaluBench [4] and Feedback Bench [2].
While it shows some limitations in specialized tasks not emphasized in its training, Flow Judge's overall performance highlights the potential of small fine-tuned LMs as evaluators.
License
We opted for the Apache 2.0 license for Flow Judge to provide the community with an open, small yet powerful LM evaluator. Our goal is to support the wider adoption of rigorous evaluation techniques in LLM system development, making them more accessible to practitioners and researchers.
Limitations and future work
Multilingual evaluation: Flow Judge has been fine-tuned exclusively on English data. While the foundation model (Phi-3.5-mini-instruct [17]) may possess multilingual capabilities, we have not systematically evaluated Flow Judge performance in non-English contexts. We plan to explore multi-lingual LM evaluators in the future.
Long context and structured inputs: Our training dataset encompasses a wide range of custom metrics relevant to evaluating LLM systems. However, it does not include examples with long context inputs or structured data formats such as JSON, since these are harder to synthetically generate. This limitation may impact Flow Judge's performance when evaluating responses that require processing extensive context or parsing structured input. Extending our model’s capabilities to handle these input types represents an important area for future research.
Math and coding: The current version has not been trained on specific task domains such as arithmetic problems or code evaluation. As a result, its performance in these specialized areas may be limited. Future iterations of the model should address these gaps.
Domain-specific knowledge and complex multi-step evaluations: Flow Judge may struggle with highly specialized domain knowledge or proprietary data outside the training scope of its foundation model. Additionally, evaluation tasks requiring multi-step reasoning or complex logical processes may challenge the model's capabilities. We strongly recommend conducting meta-evaluations of the model performance before deploying it in specialized or highly complex evaluation scenarios.
What’s next for Flow Judge?
We have plans to keep upgrading our model and releasing new improved versions that address the limitations described in this report.
We are also keen to receive feedback from the community and to work together on creating more reliable, open, and specialized LM evaluators.
Together with the model, we will release tutorials on using Flow Judge in AI application development with frameworks like LlamaIndex, as well as a tutorial on using Flow Judge for real-time monitoring.
Acknowledgments
We would like to express our gratitude to the developers in the community who created the tools that significantly contributed to the success of this project:
- The Unsloth library provided essential optimizations for fine-tuning, enhancing the efficiency of our model training process.
- Hugging Face ecosystem, which offers an invaluable foundation for model building and dataset management.
- Argilla team (now HF) for its user-friendly interface, which facilitated efficient data annotation and review. Also, for building and open-sourcing distilabel, which was instrumental in enabling the creation of synthetic data pipelines.
- Maintainers of axolotl for streamlining the fine-tuning of large language models.
- Microsoft Phi team for their contributions to the Phi-3.5-mini architecture and their work on synthetic data generation. The "Textbooks Are All You Need" approach significantly influenced our model's development.
References
[1] A. Dubey et al., ‘The Llama 3 Herd of Models’, ArXiv, vol. abs/2407.21783, 2024.
[2] S. Kim et al., "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models," arXiv preprint arXiv:2310.08491, 2023.
[3] S. Kim et al., "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models," arXiv preprint arXiv:2405.01535, 2024.
[4] S. S. Ravi, B. Mielczarek, A. Kannappan, D. Kiela, and R. Qian, "Lynx: An Open Source Hallucination Evaluation Model," arXiv preprint arXiv:2407.08488, 2024.
[5] M. Belyi, R. Friel, S. Shao, and A. Sanyal, "Luna: An Evaluation Foundation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost," arXiv preprint arXiv:2406.00975, 2024.
[6] Y. Wu et al., "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models," in Annual Meeting of the Association for Computational Linguistics, 2023.
[7] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J.-R. Wen, "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models," arXiv preprint arXiv:2305.11747, 2023.
[8] C. Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions," arXiv preprint arXiv:2304.12244, 2023.
[9] Z. Wang et al., "CodecLM: Aligning Language Models with Tailored Synthetic Data," arXiv preprint arXiv:2404.05875, 2024.
[10] C. Xu et al., "WizardLM: Empowering Large Language Models to Follow Complex Instructions," arXiv preprint arXiv:2304.12244, 2023.
[11] N. Ding et al., "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations," in Conference on Empirical Methods in Natural Language Processing, 2023.
[12] K. Saito, A. Wachi, K. Wataoka, and Y. Akimoto, ‘Verbosity Bias in Preference Labeling by Large Language Models’, ArXiv, vol. abs/2310.10076, 2023.
[13] Y. Dubois, B. Galambosi, P. Liang, and T. Hashimoto, ‘Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators’, ArXiv, vol. abs/2404.04475, 2024.
[14] “Phi-3CookBook/md/01.Introduce/Phi3Family.md at c53fa9fda5df6a42476dd8ba5f1ccb446dd1608c · microsoft/Phi-3CookBook,” GitHub. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/microsoft/Phi-3CookBook/blob/c53fa9fda5df6a42476dd8ba5f1ccb446dd1608c/md/01.Introduce/Phi3Family.md
[15] S. Gunasekar et al., “Textbooks Are All You Need,” Oct. 02, 2023, arXiv: arXiv:2306.11644. doi: 10.48550/arXiv.2306.11644.
[16] M. Abdin et al., “Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone,” Aug. 30, 2024, arXiv: arXiv:2404.14219. doi: 10.48550/arXiv.2404.14219.
[17] “microsoft/Phi-3.5-mini-instruct · Hugging Face.” Accessed: Sep. 16, 2024. [Online]. Available: https://huggingface.co/microsoft/Phi-3.5-mini-instruct
[18] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” Feb. 27, 2023, arXiv: arXiv:2302.13971. doi: 10.48550/arXiv.2302.13971.
[19] “axolotl-ai-cloud/axolotl: Go ahead and axolotl questions.” Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/axolotl-ai-cloud/axolotl
[20] D. Kalajdzievski, “A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA,” Nov. 27, 2023, arXiv: arXiv:2312.03732. doi: 10.48550/arXiv.2312.03732.
[21] S. Raschka, “Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation).” Accessed: Sep. 16, 2024. [Online]. Available: https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
[22] S. Raschka, “Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch.” Accessed: Sep. 16, 2024. [Online]. Available: https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch
[23] S.-Y. Liu et al., “DoRA: Weight-Decomposed Low-Rank Adaptation,” Jul. 09, 2024, arXiv: arXiv:2402.09353. doi: 10.48550/arXiv.2402.09353.
[24] huggingface/peft. (Sep. 16, 2024). Python. Hugging Face. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/huggingface/peft
[25] L. Yu, B. Yu, H. Yu, F. Huang, and Y. Li, “Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch,” Jun. 13, 2024, arXiv: arXiv:2311.03099. doi: 10.48550/arXiv.2311.03099.
[26] P. Yadav, D. Tam, L. Choshen, C. Raffel, and M. Bansal, “TIES-Merging: Resolving Interference When Merging Models,” Oct. 26, 2023, arXiv: arXiv:2306.01708. doi: 10.48550/arXiv.2306.01708.
[27] T. Wolf et al., Transformers: State-of-the-Art Natural Language Processing. (Oct. 2020). Python. Association for Computational Linguistics. Accessed: Sep. 16, 2024. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[28] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” Jun. 23, 2022, arXiv: arXiv:2205.14135. doi: 10.48550/arXiv.2205.14135.
[29] Dao-AILab/flash-attention. (Sep. 16, 2024). Python. Dao AI Lab. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/Dao-AILab/flash-attention
[30] J. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” Jul. 18, 2024, arXiv: arXiv:2306.00978. doi: 10.48550/arXiv.2306.00978.
[31] C. Hansen, casper-hansen/AutoAWQ. (Sep. 16, 2024). Python. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/casper-hansen/AutoAWQ
[32] G. Gerganov, ggerganov/ggml. (Sep. 16, 2024). C++. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/ggerganov/ggml
[33] “GGUF.” Accessed: Sep. 16, 2024. [Online]. Available: https://huggingface.co/docs/hub/en/gguf
[34] “Phi-3CookBook/md/08.Update/Phi35/021.UsingLlamacppQuantifyingPhi35.md at main · microsoft/Phi-3CookBook,” GitHub. Accessed: Sep. 16, 2024. [Online]. Available: https://github.com/microsoft/Phi-3CookBook/blob/main/md/08.Update/Phi35/021.UsingLlamacppQuantifyingPhi35.md
[35] vllm-project/vllm. (Sep. 17, 2024). Python. vLLM. Accessed: Sep. 17, 2024. [Online]. Available: https://github.com/vllm-project/vllm
[36] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu, "PubMedQA: A Dataset for Biomedical Research Question Answering," in Conference on Empirical Methods in Natural Language Processing, 2019.
[37] T. Möller, A. Reina, R. Jayakumar, and M. Pietsch, ‘COVID-QA: A Question Answering Dataset for COVID-19’, 2020.
[38] “meta-llama/Meta-Llama-3.1-8B-Instruct · Hugging Face.” Accessed: Sep. 16, 2024. [Online]. Available: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
[39] “mistralai/Mistral-Nemo-Instruct-2407 · Hugging Face.” Accessed: Sep. 16, 2024. [Online]. Available: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
[40] L. Gao et al., ‘A framework for few-shot language model evaluation’. Zenodo, Jul. 2024.
Appendix A. Full list of seed metrics and their definitions
Response Quality
Retrieval Quality
Bias and Toxicity
Style and Guideline Adherence
Language
Miscellaneous
Appendix B. Prompt template and hyperparameters
Flow Judge user prompt template with inputs
Flow Judge user prompt template without inputs
Hyperparameters chosen for evaluation for Flow Judge and baselines
Appendix C. Fine-tuning details
Hardware and software versions of the setup
For fine-tuning, we used the Axolotl framework from Wing Lian.
Our hardware setup comprised three 4090 GPUs.
Key software versions included:
Training hyperparameters
The following hyperparameters were used during training:
And for rsLoRA:
September 17, 2024
Flow AI