Introducing Flow Judge, an open small language model for evaluations

Build AI with precision

Say goodbye to prototypes – elevate your AI system with open language model judges and merge your own proprietary models.

Let's talk

Building AI systems is broken.
Flow AI is the system for evaluating and improving your LLM application.

fai_problem_memo.docx

Aligned & open LM judges

Create fast, cheap and controllable LM evaluators that are always aligned with your criteria.

Evaluate across your LLM stack

Deploy your custom evaluators with our API and measure your AI features across their entire lifecycle, from the prototyping phase to production.

Automatic LM selection & development

With your criteria and evaluations, our system develops a unique LM for your use case. Deploy the model behind an API or download it for external use.

Flow AI
Corporate Vision Department
07/07/2024

The tale of AI evaluation: the birth of Flow AI

Before the inception of Flow AI, a small team of AI engineers faced a significant problem: how to effectively evaluate the outputs of their LLM-powered product. Understanding how the outputs changed across multiple model and system updates was proving difficult.

Initially, the team relied on manual evaluations, scoring each output against a set of criteria.

However, this approach quickly revealed major drawbacks:

Scalability: Manual evaluation took too much time and resources, which slowed down the iteration cycles.
Subjectivity: Human evaluators often gave inconsistent scores, which led to potential bias.

To find a better solution, the team thought about using LLMs to evaluate their own system.

They explored various tools on the market, but none met their specific needs. The available options were too generic and the evaluators lacked meta-evaluation, which resulted in misalignments between human and LM evaluations.

The team also wanted to switch from closed-source models to custom open models to address privacy concerns and achieve cheaper, faster inference. However, finding a model as powerful as GPT-4 proved to be challenging. Fine-tuning a model would have required a significant investment of resources: half of the team's capacity for six months. A huge risk for a startup.

Exploring experimental community LMs revealed more problems. Many models lacked clear lineage history and transparency about previous changes.

In response to these challenges, the team came up with a solution that combined their needs for automated evaluation and specialized model development. This solution was named Flow AI.

Flow AI

At Flow AI, our mission is to empower modern AI teams with advanced tools for evaluating generative AI products across various domains and use cases. Our approach offers a controllable, transparent, and cost-effective LM-as-a-judge alternative to labor-intensive human evaluations and proprietary model-based evaluations.

Additionally, we aim to revolutionize the development of generative products by promoting the use of smaller, specialized LMs over large, general-purpose proprietary models. We know that the current process of selecting and refining these smaller models is complex and time-consuming, which is a barrier to widespread adoption.

We seek to remove these barriers by automating the selection and enhancement of specialized models. We use rapid, cost-effective, and aligned evaluation techniques, along with model merging for developing new LMs. This makes open LMs more accessible for companies with limited engineering resources and budgets.

Flow AI was born from a blend of necessity and ingenuity. We are here to redefine the standards of AI evaluation and model development, paving the way for a new era of generative AI.

Outperform the competition.
Flow AI has the tools and techniques for building superior AI products.

flow-eval

Evaluation module

Our open evaluation model specializes in evaluating LLM systems. It handles custom criteria and accommodates different evaluation paradigms, including pairwise ranking and direct assessment with scoring scales.

Open judge models

Meta-evaluation for maximal correlation

Automatic generation of evaluation criteria

Model and judge version history

Deployment
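
The two evaluation paradigms named in the evaluation module description, direct assessment and pairwise ranking, can be pictured with a small sketch. This is only an illustration: the class names, fields, and the 1 to 5 scale below are assumptions made for the example and are not taken from the Flow AI SDK.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DirectAssessment:
    """One response scored on a fixed scale (hypothetical shape)."""
    system_id: str
    score: int  # assumed scale: 1 (poor) to 5 (excellent)


@dataclass
class PairwiseRanking:
    """Two responses to the same query and the judge's preference (hypothetical shape)."""
    system_a: str
    system_b: str
    winner: str  # id of the preferred system


def mean_score(results):
    """Aggregate direct assessments into an average score."""
    return mean(r.score for r in results)


def win_rate(results, system_id):
    """Share of pairwise comparisons involving system_id that it won."""
    relevant = [r for r in results if system_id in (r.system_a, r.system_b)]
    if not relevant:
        return 0.0
    return sum(r.winner == system_id for r in relevant) / len(relevant)


direct = [DirectAssessment("v1", 4), DirectAssessment("v1", 5)]
pairwise = [
    PairwiseRanking("v1", "v2", "v1"),
    PairwiseRanking("v1", "v2", "v2"),
    PairwiseRanking("v1", "v2", "v1"),
    PairwiseRanking("v1", "v2", "v1"),
]

v1_mean = mean_score(direct)
v1_win_rate = win_rate(pairwise, "v1")
```

Direct assessment gives an absolute score per response, which suits tracking one system over time. Pairwise ranking only says which of two responses is better, which suits comparing two versions side by side.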

flow-merge

Model creation module

With revolutionary model merging techniques, our engine develops new LMs that maximize the key evaluation metrics of your system. Create new model variations without the need for expensive GPU resources or extensive training.

Selection of evals and benchmarks

Popular merge methods included

User-defined LM selection criteria

Piecewise assembly

Open-source merging library
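
To give a feel for why merging needs no training run, here is a minimal sketch of one widely used merge method, linear weight averaging. It is not a description of the Flow AI engine or its merging library: the function name is hypothetical, and parameters are plain Python lists standing in for real tensors.

```python
def linear_merge(state_dicts, weights):
    """Merge models by taking a weighted average of matching parameters.

    state_dicts: list of {parameter_name: list of floats}, one per model.
    weights: one non-negative number per model; normalized to sum to 1.
    """
    if not state_dicts or len(state_dicts) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    coeffs = [w / total for w in weights]

    # All models must share the same architecture (same parameter names).
    names = set(state_dicts[0])
    if any(set(sd) != names for sd in state_dicts):
        raise ValueError("models have different parameter names")

    merged = {}
    for name in state_dicts[0]:
        params = [sd[name] for sd in state_dicts]
        merged[name] = [
            sum(c * p[i] for c, p in zip(coeffs, params))
            for i in range(len(params[0]))
        ]
    return merged


# Toy "models" with a single two-element parameter each.
model_a = {"layer.weight": [1.0, 2.0]}
model_b = {"layer.weight": [3.0, 4.0]}

merged = linear_merge([model_a, model_b], weights=[1, 1])
```

The merge is pure arithmetic over existing weights, so it runs without gradient updates. Choosing the weights is where evaluation comes in: each candidate merge can be scored and the best one kept.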

Flow AI product visualization of a scoring rubric

Leave the guesswork to us

Skip the research papers and get to building your custom judge in no time. We ensure your judges perform to the highest standards with minimal effort on your part.

Flow AI product visualization of a metric comparison

Human-like accuracy

By design, your custom LM judges align closely with your human judgement. Our meta-evaluation process maximizes the correlation between human evals and LM judge outputs.
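
As a rough illustration of what measuring this correlation can look like, the sketch below computes Spearman rank correlation between human scores and judge scores for the same outputs. The choice of Spearman, the function names, and the sample scores are assumptions made for the example; the source does not say which correlation measure Flow AI uses.

```python
from statistics import mean


def rank(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))


# Illustrative scores for five outputs, one human score and one judge score each.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 4, 5]

agreement = spearman(human_scores, judge_scores)
```

A value near 1 means the judge orders outputs the way the human raters do. A low value signals that the judge's criteria or prompt need revisiting before its scores are trusted.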


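import os
from flow_ai.client import FlowAI

# Point the SDK at the Flow AI API.
os.environ["FLOWAI_BASE_URL"] = "https://flowai.run/v1"

# The API token is read from the FLOWAI_TOKEN environment variable.
client = FlowAI(token=os.environ["FLOWAI_TOKEN"])

# Score a batch of query/response pairs with your custom LM judge.
evals = client.batched_eval(
    model="your-custom-lm-judge-id",
    inputs=[
        {"query": "...", "response": "..."},
        {"query": "...", "response": "..."},
    ],
)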

Integrate effortlessly

Run completions from your judges with ease. With our user-friendly SDK and compatibility with existing frameworks, you can streamline your workflow and achieve results faster.

Open evaluator models

Our open evaluator models provide enhanced control, faster performance, better privacy, and reduced costs compared to closed-source LLMs.

Content for AI builders.
Been there, done that.

Prior to Flow AI, our team pioneered the use of LLMs for email communication. Now, we want to help you overcome the challenges that you’re facing while building your AI products.

AI system evaluation

Open LLMs

Model Merging

Research reviews

A/B testing

Domain adaptation

Ready to elevate your AI system?