QualityFlow: A New Workflow in Which LLMs Collaborate to Raise Program Quality

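Program synthesis, the automatic generation of programs with LLMs (large language models), is attracting a lot of attention. Services such as GitHub Copilot and ChatGPT can generate code from natural-language instructions, and many developers report substantial gains in productivity.
However, generated code still often contains bugs and frequently fails to cover edge cases.

Against this backdrop, a new system called QualityFlow has been proposed.

QualityFlow proposes a workflow in which multiple LLM agents cooperate, each with a specialized role, to improve program quality. As a result, the success rate of program synthesis (pass@1 accuracy) improves substantially, and the authors report new best scores on benchmarks such as MBPP and HumanEval.

This article walks through how QualityFlow works and what the experiments found.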

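The Basic Idea Behind QualityFlow

QualityFlow is built from the following five kinds of agents.

Code Generator
Generates an initial program from the problem statement and test cases.
Test Designer
Generates additional test cases for validating the program.
Self-Debugger
Runs the program, analyzes why test cases failed, and fixes the code where needed.
Quality Checker
Judges whether the generated program is correct and decides whether to move on to the next step.
Problem Clarifier
When repeated fixes do not solve the problem, considers whether the problem statement itself was misinterpreted and offers a new reading.

The key feature is that the agents work together, trying a variety of generation strategies while raising accuracy.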

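Overall Flow: Iterative Generation and Quality Checks

The Code Generator produces the first version of the code
The Quality Checker judges whether the program's quality is sufficient
- If it passes, stop (the program becomes the final output)
- If it fails, continue
The Test Designer generates new test cases
The Self-Debugger fixes the code using the additional test cases
The Quality Checker evaluates the code again
- If it passes, stop
- If it fails several times in a row, the Problem Clarifier reinterprets the problem statement
- If that still does not work, fall back to the initial program and try again

The loop repeats in this way until the program reaches satisfactory quality.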
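
The control flow above can be sketched in a few lines. This is only an illustration under assumptions: each agent is passed in as a plain callable, and the names `quality_flow` and `max_rounds` are hypothetical rather than taken from the paper.

```python
from typing import Callable, List

def quality_flow(
    problem: str,
    tests: List[str],
    generate: Callable[[str, List[str]], str],      # Code Generator
    is_good: Callable[[str, List[str]], bool],      # Quality Checker
    design_tests: Callable[[str, str], List[str]],  # Test Designer
    debug: Callable[[str, str, List[str]], str],    # Self-Debugger
    clarify: Callable[[str], str],                  # Problem Clarifier
    max_rounds: int = 3,                            # hypothetical retry limit
) -> str:
    initial = generate(problem, tests)
    if is_good(initial, tests):
        return initial
    code = initial
    for _ in range(max_rounds):
        extra = design_tests(problem, code)
        code = debug(problem, code, tests + extra)
        if is_good(code, tests):
            return code
    # Debugging stalled: reinterpret the problem and generate once more.
    code = generate(clarify(problem), tests)
    if is_good(code, tests):
        return code
    # Nothing was accepted: revert to the initial program.
    return initial
```

Passing the agents in as callables keeps the loop itself free of any LLM dependency, so it can be exercised with simple stubs before real API calls are wired in.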

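The Role of Each Agent

1. Code Generator

Taking only the problem statement and test cases as input, it first generates an initial program. Think of it as simply asking an LLM such as ChatGPT or Claude to "output code that solves this problem."

2. Test Designer

Test cases are essential for actually exercising a program. QualityFlow generates up to 50 additional test cases and tries to cover both extreme edge cases and common cases in a balanced way.
Interestingly, the experiments suggest that deliberately enriching the common cases tends to yield higher final program quality. It is tempting to fixate on edge cases, but test cases that are "moderately diverse" and faithful to the problem statement are reportedly the most effective.

3. Self-Debugger

The Self-Debugger runs the test cases, collects errors and outputs, and analyzes the cause. Rather than simply regenerating the code from scratch, it is designed to understand the existing code and make targeted, diff-aware fixes.
Here the LLM reasons step by step (Chain-of-Thought) about why a test failed, which steers it toward a more reasonable fix.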

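4. Quality Checker

The Quality Checker is arguably the linchpin of QualityFlow. Besides judging whether a program is correct, it also validates the quality of the generated test cases.
Some benchmarks, MBPP among them, have a rule that the evaluation test cases must not be executed directly against the program. For that reason the checker uses a technique called **Imagined Execution**: the LLM, much like a human, "runs the program in its head" through step-by-step reasoning and checks whether the output would be correct.

5. Problem Clarifier

Sometimes the problem statement is ambiguous, and no amount of debugging leads to the right answer. For those cases, the Problem Clarifier reinterprets what the problem statement was actually asking for.
By consulting the history of debugging and test cases and interpreting the problem from a different angle, it is designed to get a stalled process moving again.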

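The Effect of Diverse Generation Strategies

Unlike the conventional Self-Consistency approach, which sends the same instruction many times, QualityFlow prepares several slightly different prompts up front and runs them in parallel.
Thanks to this diversity, the system avoids getting stuck on a single approach, which raises the probability of eventually reaching a correct answer.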
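
A minimal sketch of this idea follows. The prompt wordings are hypothetical (the paper's actual prompts are not reproduced here), `llm` and `accept` are stand-in callables for an LLM call and a quality check, and the variants are tried one after another for simplicity rather than in parallel.

```python
from typing import Callable, Optional, Sequence

# Hypothetical prompt variants, written only for this illustration.
PROMPT_VARIANTS = (
    "Solve the problem directly:\n{problem}",
    "Think about edge cases first, then solve:\n{problem}",
    "Write a simple brute-force solution:\n{problem}",
)

def diversified_generate(
    problem: str,
    llm: Callable[[str], str],
    accept: Callable[[str], bool],
    variants: Sequence[str] = PROMPT_VARIANTS,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if none is accepted."""
    for template in variants:
        candidate = llm(template.format(problem=problem))
        if accept(candidate):
            return candidate
    return None
```

The approach only pays off when `accept` is reliable: with a weak checker, trying more variants mostly adds more chances to accept a wrong program.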

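Experimental Results

High pass@1 Accuracy on Benchmarks

MBPP: 94.2% (+4.8% over the previous best)
HumanEval: 98.8%

The improvement is in pass@1 (whether correct code is generated on the first attempt), which makes it especially valuable in practice.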
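
For reference, pass@k is commonly computed with the unbiased estimator popularized alongside the HumanEval benchmark: draw n samples per problem, count the c correct ones, and estimate the chance that at least one of k draws is correct. The sketch below assumes that definition. Whether the QualityFlow authors computed their numbers exactly this way is not stated in this article.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimate reduces to the plain success rate c / n.
print(pass_at_k(n=10, c=3, k=1))
```

With a single sample per problem (n = k = 1), the benchmark score is simply the fraction of problems whose one generated program passes.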

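Quality Checker Accuracy of About 98%

The MBPP experiments showed that the Quality Checker can identify correct programs with about 98% accuracy.
This strong judgment ability was the key to supporting diverse program generation strategies.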
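
To make the figure concrete, here is how such a number is typically computed, assuming it is read as precision (the share of accepted programs that are actually correct). The counts below are made-up illustrations, not data from the paper.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of accepted programs that are actually correct."""
    accepted = true_pos + false_pos
    return true_pos / accepted if accepted else 0.0

def recall(true_pos: int, false_neg: int) -> float:
    """Share of correct programs that the checker accepted."""
    correct = true_pos + false_neg
    return true_pos / correct if correct else 0.0

# Hypothetical counts: 98 correct programs accepted, 2 wrong ones accepted,
# 10 correct programs rejected.
print(precision(98, 2))  # 0.98
print(recall(98, 10))
```

High precision is what lets the workflow stop early with confidence. Low recall would only cost extra debugging rounds, not correctness.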

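Test Case Quality Also Matters a Great Deal

About 62% of the generated test cases are invalid (on MBPP)
The test quality checker can remove about 80% of those invalid tests

The higher the quality of the test cases, the more likely the Self-Debugger's fixes move in the right direction. Depending on the LLM's capability, this checker has also been reported to backfire in some cases, so tuning it to the LLM and the domain remains future work.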
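
A quick back-of-the-envelope calculation shows what these two figures imply together. It assumes, purely for illustration, that the filter never discards a valid test (`good_loss_rate=0.0`). The article does not report that rate, so the result is an optimistic bound rather than a measured value.

```python
def bad_share_after_filter(
    bad_rate: float,
    removal_rate: float,
    good_loss_rate: float = 0.0,  # assumption: no valid test is discarded
) -> float:
    """Fraction of the remaining tests that are still invalid after filtering."""
    bad_left = bad_rate * (1.0 - removal_rate)
    good_left = (1.0 - bad_rate) * (1.0 - good_loss_rate)
    total = bad_left + good_left
    return bad_left / total if total else 0.0

share = bad_share_after_filter(bad_rate=0.62, removal_rate=0.80)
print(f"{share:.1%}")  # 24.6%
```

Even after filtering, roughly one remaining test in four would still be invalid under these assumptions, which is consistent with the observation that test quality strongly affects the Self-Debugger.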

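The Problem Clarifier and Revert Mechanism Also Help

Turning off the Problem Clarifier or the revert mechanism (falling back to the initial code) lowers HumanEval accuracy by 0.78% to 2.44%
Reinterpreting ambiguous problem statements is not a flashy feature, but the data confirms that it contributes to performance.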

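Sample Notebook for Colab

Open Colab and create a new notebook
Run the cells below in order
Have an OpenAI API key ready beforehand and edit the relevant line before use

1. Install and Set Up the Required Libraries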

!pip install openai

import os
import openai
import re
import traceback

# ★★★ Set your API key here ★★★
openai.api_key = "sk-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

Note: for security reasons, also consider setting the key through Colab's notebook environment variables or using a Secret Manager instead of hard-coding it.
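
One way to follow that advice is to read the key from an environment variable. The helper below is a small sketch. The function name `load_api_key` and the variable name `OPENAI_API_KEY` are choices made for this example.

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from an environment variable instead of hard-coding it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this in place, the assignment in the setup cell could become `openai.api_key = load_api_key()`. In Colab, a key stored in the Secrets panel can instead be read with `google.colab.userdata.get("OPENAI_API_KEY")`, provided notebook access to that secret has been enabled.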

2. Prompt Definitions

CODE_GENERATOR_PROMPT = """\
You are a helpful coding assistant. 
Given the following problem statement and test cases, generate a Python function or code to solve the problem.

Problem Statement:
{problem_statement}

Test Cases (example inputs/outputs):
{test_cases}

Constraints:
- Return only the code (no extra explanations).
"""

TEST_DESIGNER_PROMPT = """\
You are a test-case designer. We want more test cases for the following problem:

Problem Statement:
{problem_statement}

Current Code (for reference):
{current_code}

Current Test Cases:
{current_test_cases}

Please generate up to 10 new test cases in Python's unittest-like format (just the function calls and expected values).
"""

SELF_DEBUGGER_PROMPT = """\
You are a self-debugging agent. 
We have a problem statement and some test results (with errors or mismatches).
Analyze the error messages and the current code, then propose a corrected code.

Problem Statement:
{problem_statement}

Current Code:
{current_code}

Test Failures (error outputs or mismatch details):
{test_failures}

Now provide a revised version of the code that fixes these issues.
"""

QUALITY_CHECKER_PROMPT = """\
You are a code quality checker. You will perform an imagined execution of the following code against each test case and reason step-by-step about the output.
Only answer "PASS" if you are absolutely certain the code would produce the correct outputs for all test cases. Otherwise, answer "FAIL".

Code:
{current_code}

Test Cases (input -> expected):
{test_cases}

Your step-by-step reasoning:
"""

TEST_QUALITY_CHECKER_PROMPT = """\
You are a test case quality checker. 
We have newly generated test cases for the following problem:

Problem Statement:
{problem_statement}

Test Cases:
{test_cases}

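Analyze if these test cases are valid, relevant, and correctly framed to test the solution.
For each invalid or irrelevant test case, output one line that starts with "Invalid Test Case: " followed by the test case copied verbatim, then a brief explanation on the next line.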
"""

PROBLEM_CLARIFIER_PROMPT = """\
You are a clarifier agent. We have repeatedly failed to solve the following problem. 
Review the original problem statement, and consider alternative interpretations or missing details.

Problem Statement:
{problem_statement}

History of attempts or known issues:
{history}

Provide a revised or clarified problem statement that may resolve the ambiguities.
"""

3. Agent Class Definitions

# Thin wrapper around the OpenAI Chat Completions API (openai>=1.0 interface)
def call_openai_api(prompt, model="gpt-3.5-turbo", temperature=0.7):
    completion = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    # content can be None, so fall back to an empty string before stripping
    return (completion.choices[0].message.content or "").strip()

class CodeGenerator:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def generate_code(self, problem_statement: str, test_cases: str) -> str:
        prompt = CODE_GENERATOR_PROMPT.format(
            problem_statement=problem_statement,
            test_cases=test_cases
        )
        response = self.api_caller(prompt)
        return response

class TestDesigner:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def generate_test_cases(self, problem_statement: str, current_code: str, current_test_cases: str) -> str:
        prompt = TEST_DESIGNER_PROMPT.format(
            problem_statement=problem_statement,
            current_code=current_code,
            current_test_cases=current_test_cases
        )
        response = self.api_caller(prompt)
        return response

class SelfDebugger:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def debug_code(self, problem_statement: str, current_code: str, test_failures: str) -> str:
        prompt = SELF_DEBUGGER_PROMPT.format(
            problem_statement=problem_statement,
            current_code=current_code,
            test_failures=test_failures
        )
        response = self.api_caller(prompt)
        return response

class QualityChecker:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def check_code_quality(self, current_code: str, test_cases: str) -> bool:
        prompt = QUALITY_CHECKER_PROMPT.format(
            current_code=current_code,
            test_cases=test_cases
        )
        response = self.api_caller(prompt)
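        # Simplified verdict: use the last PASS/FAIL token, since the reasoning may mention either word earlier
        verdicts = re.findall(r"\b(PASS|FAIL)\b", response)
        return bool(verdicts) and verdicts[-1] == "PASS"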

class TestQualityChecker:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def check_test_quality(self, problem_statement: str, test_cases: str):
        prompt = TEST_QUALITY_CHECKER_PROMPT.format(
            problem_statement=problem_statement,
            test_cases=test_cases
        )
        response = self.api_caller(prompt)
        # Simple approach: search for lines of the form "Invalid Test Case: ..."
        invalid_cases = re.findall(r"Invalid Test Case: (.+)", response)
        return invalid_cases

class ProblemClarifier:
    def __init__(self, api_caller):
        self.api_caller = api_caller

    def clarify_problem(self, problem_statement: str, history: str) -> str:
        prompt = PROBLEM_CLARIFIER_PROMPT.format(
            problem_statement=problem_statement,
            history=history
        )
        response = self.api_caller(prompt)
        return response
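
One practical wrinkle: chat models often wrap their answer in a Markdown code fence even when told to return only code, and passing such a response to `exec()` raises a `SyntaxError`. A small helper like the following can strip the fence first. It is a hypothetical addition for this notebook, not part of QualityFlow itself.

```python
import re

# Built from pieces so the pattern itself does not contain a literal fence.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(_FENCE + r"[\w+-]*[ \t]*\n(.*?)" + _FENCE, re.DOTALL)

def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the whole response if unfenced."""
    match = _FENCED_BLOCK.search(response)
    return (match.group(1) if match else response).strip()
```

For example, `CodeGenerator.generate_code` and `SelfDebugger.debug_code` could return `extract_code(response)` instead of the raw response.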

4. Defining the Main QualityFlow Class

class QualityFlow:
    def __init__(
        self,
        code_generator,
        test_designer,
        self_debugger,
        quality_checker,
        test_quality_checker,
        problem_clarifier,
        max_attempts=3,
        max_clarify=1
    ):
        self.code_generator = code_generator
        self.test_designer = test_designer
        self.self_debugger = self_debugger
        self.quality_checker = quality_checker
        self.test_quality_checker = test_quality_checker
        self.problem_clarifier = problem_clarifier

        self.max_attempts = max_attempts
        self.max_clarify = max_clarify

    def run(self, problem_statement, initial_test_cases):
        """
        Main entry point that runs QualityFlow.
        :param problem_statement: the problem statement as a string
        :param initial_test_cases: the test cases as a (formatted) string
        :return: the program that was finally obtained
        """

        # (1) Code Generator: generate the initial code
        current_code = self.code_generator.generate_code(problem_statement, initial_test_cases)

        attempt_count = 0
        clarify_count = 0
        history_log = []

        while True:
            # (2) Quality Checker: judge correctness via Imagined Execution
            is_passing = self.quality_checker.check_code_quality(current_code, initial_test_cases)

            if is_passing:
                print("[INFO] Code passed the quality check. Returning result.")
                return current_code  # Passed, so stop here

            # Quality check failed
            attempt_count += 1
            history_log.append(f"Attempt {attempt_count} failed.")
            print(f"[WARN] Quality check failed at attempt {attempt_count}.")

            if attempt_count > self.max_attempts:
                # Too many failures: call the Problem Clarifier
                clarify_count += 1
                if clarify_count > self.max_clarify:
                    print("[ERROR] Reached max clarify limit. Returning the latest code as fallback.")
                    return current_code  # Compromise: return the latest code and stop
                else:
                    print("[WARN] Invoking ProblemClarifier...")
                    # Revise the problem statement and start over
                    revised_problem = self.problem_clarifier.clarify_problem(problem_statement, "\n".join(history_log))
                    history_log.append("Problem was clarified.")
                    # Regenerate the code from scratch with the revised problem statement
                    current_code = self.code_generator.generate_code(revised_problem, initial_test_cases)
                    attempt_count = 0
                    problem_statement = revised_problem
                    continue

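            # (3) Test Designer: generate new test cases
            new_test_cases = self.test_designer.generate_test_cases(problem_statement, current_code, initial_test_cases)
            invalid_cases = self.test_quality_checker.check_test_quality(problem_statement, new_test_cases)

            # Drop the lines that the test quality checker flagged as invalid (simple substring match)
            flagged = [case.strip() for case in invalid_cases if case.strip()]
            kept_lines = [
                line for line in new_test_cases.splitlines()
                if not any(case in line for case in flagged)
            ]
            combined_test_cases = f"{initial_test_cases}\n# Additional test cases:\n" + "\n".join(kept_lines)

            # (4) Self Debugger: run the tests, collect failures, then fix the code
            test_failures_info = self._run_tests_and_collect_failures(current_code, combined_test_cases)
            current_code = self.self_debugger.debug_code(problem_statement, current_code, test_failures_info)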

    def _run_tests_and_collect_failures(self, code_str, test_str):
        """
        Simplified implementation: runs the code with exec() and tests it. Mind the security risk.
        test_str is expected to contain lines of the form "function_call -> expected".
        """
        test_failures = []

        try:
            local_env = {}
            exec(code_str, local_env)  # Load function definitions and other top-level names
        except Exception:
            return f"[ERROR] Runtime error in code:\n{traceback.format_exc()}"

        lines = test_str.strip().splitlines()
        for line in lines:
            line = line.strip()
            if "->" in line:
                parts = line.split("->", 1)  # split only on the first "->"
                func_call = parts[0].strip()
                expected = parts[1].strip()
                try:
                    result = eval(func_call, local_env)
                    # How strictly to compare depends on the use case; here we compare strings.
                    if str(result) != expected:
                        test_failures.append(f"Test failed: {func_call}, expected={expected}, got={result}")
                except Exception as e:
                    test_failures.append(f"Runtime error: {func_call}, error={str(e)}")

        if not test_failures:
            return "All tests passed."
        else:
            return "\n".join(test_failures)
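
The string comparison used above is fragile: a function returning the string `abc` never matches an expected value written as `'abc'`, because `str(result)` drops the quotes. One possible refinement, sketched here with a hypothetical helper named `outputs_match`, is to parse the expected text as a Python literal first and fall back to text comparison only when that fails.

```python
import ast

def outputs_match(result, expected_text: str) -> bool:
    """Compare a result with expected text, parsed as a Python literal when possible."""
    text = expected_text.strip()
    try:
        expected = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to comparing the text form.
        return str(result) == text
    return result == expected
```

In `_run_tests_and_collect_failures`, the condition `str(result) != expected` could then be replaced by `not outputs_match(result, expected)`.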

5. Running QualityFlow

# Create the agent instances
code_generator = CodeGenerator(call_openai_api)
test_designer = TestDesigner(call_openai_api)
self_debugger = SelfDebugger(call_openai_api)
quality_checker = QualityChecker(call_openai_api)
test_quality_checker = TestQualityChecker(call_openai_api)
problem_clarifier = ProblemClarifier(call_openai_api)

# Assemble QualityFlow
flow = QualityFlow(
    code_generator,
    test_designer,
    self_debugger,
    quality_checker,
    test_quality_checker,
    problem_clarifier,
    max_attempts=3,
    max_clarify=1
)

# Problem statement (example: an addition function)
problem_statement = """\
Write a Python function `add_two_numbers(a, b)` that returns the sum of a and b.
If a or b is not an integer, try to convert them to integer before summing.
"""

# Initial test cases (example)
initial_test_cases = """\
add_two_numbers(1, 2) -> 3
add_two_numbers(10, 5) -> 15
"""

# Run
final_code = flow.run(problem_statement, initial_test_cases)

print("=== Final Code Output ===")
print(final_code)
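
Before spending API credits, it can help to dry-run the loop with a stand-in for `call_openai_api`. The sketch below is a hypothetical helper that answers from a keyword table. The keywords are phrases taken from the prompts defined in step 2, and the canned replies are invented for this example.

```python
from typing import Callable, Dict

def make_fake_caller(replies: Dict[str, str], default: str = "FAIL") -> Callable[[str], str]:
    """Build a stand-in for call_openai_api that answers from a keyword table."""
    def fake_caller(prompt: str) -> str:
        for keyword, reply in replies.items():
            if keyword in prompt:
                return reply
        return default
    return fake_caller

fake = make_fake_caller({
    "helpful coding assistant": "def add_two_numbers(a, b):\n    return int(a) + int(b)",
    "code quality checker": "PASS",
})
```

Passing `fake` in place of `call_openai_api` when constructing the agents lets the whole flow run end to end offline, which is handy for checking the control logic.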

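How a Run Proceeds

The CodeGenerator first generates code,
and the QualityChecker judges PASS or FAIL through "imagined execution."
On FAIL, the TestDesigner creates additional tests and the SelfDebugger fixes the code based on the execution results.
This repeats up to max_attempts times (3 in the example); if it still does not succeed, the ProblemClarifier is called to reinterpret the problem statement and the flow tries again.
In the end, the flow returns either code that passed (PASS) or, after repeated failure, a best-effort fallback.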

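Caveats

Every attempt triggers OpenAI API calls, so keep an eye on credit consumption.
Running code with exec or eval carries serious security risks; in production, use a proper sandbox instead.
Without careful prompt tuning, the model may fail to give correct answers or may produce irrelevant output.
In the "Imagined Execution" step, nothing is actually executed: the verdict is a self-reported check produced by the LLM in response to the prompt, with no guarantee that the reasoning is faithful. It is not well suited to rigorous verification.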
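
As a small step beyond in-process `exec`/`eval`, generated code can at least be run in a separate Python process with a timeout. The sketch below (the function name `run_isolated` is hypothetical) only guards against crashes and infinite loops. The child process still has full access to the file system and network, so it is not a real sandbox.

```python
import subprocess
import sys

def run_isolated(code: str, expression: str, timeout_s: float = 5.0) -> str:
    """Run `code`, then print repr(<expression>) in a child Python process."""
    script = f"{code}\nprint(repr({expression}))"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    if proc.returncode != 0:
        # Keep only the last stderr line, which normally names the exception.
        lines = proc.stderr.strip().splitlines()
        return "ERROR: " + (lines[-1] if lines else "unknown")
    return proc.stdout.strip()

print(run_isolated("def f(x):\n    return x + 1", "f(2)"))  # 3
```

For anything beyond experiments, container-based or dedicated sandboxing remains the safer choice.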

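Discussion: Potential Beyond Program Synthesis

At its core, the QualityFlow framework is the idea that multiple agents, each with its own specialty, finish a task in stages. The study focuses on program synthesis, but the same idea could plausibly be extended to other LLM applications such as document generation or data analysis.

The Quality Checker is also notable because it goes beyond a simple binary verdict and traces the code step by step through imagined execution. This is an approach in which the LLM predicts the execution result by itself, without tools, and it appears to help with more detailed error detection in addition to working around benchmark constraints.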

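Summary

QualityFlow is a workflow in which multiple LLM agents cooperate to substantially improve the accuracy and reliability of program synthesis. In particular, elements such as

High-accuracy correctness judgments by the Quality Checker
Step-by-step self-repair by the Self-Debugger
Reinterpretation of ambiguity by the Problem Clarifier
Parallel execution of diverse generation strategies

combine to achieve top-tier pass@1 accuracy on the benchmarks.

In coding assistants and automated code generation, there is surely strong demand for obtaining more reliable code quickly. Approaches like QualityFlow, in which agents cross-check one another to raise quality, are likely to attract growing attention.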

References

QualityFlow: An Agentic Workflow for Program Synthesis Controlled by LLM Quality Checks
Authors: Yaojie Hu, Qiang Zhou, Qihong Chen, Xiaopeng Li, Linbo Liu, Dejiao Zhang, Amit Kachroo, Talha Oz, Omer Tripp
Affiliations: Iowa State University, Amazon Web Services, University of California, Irvine
arXiv:2501.17167
