LLM Scoring

LLM-based scoring engine using the OpenAI Responses API.

Builds a role-specific prompt from the role's persona, its prompt profile, and the rubric, sends it to OpenAI with a strict JSON schema, and parses the structured response into an `AgentScore`.
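The private `_build_score_schema` helper is not shown on this page, so the following is only a minimal sketch of what a strict schema for per-dimension scores might look like. The function name `build_score_schema`, the top-level `scores` key, and the dimension ids `accuracy` and `completeness` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def build_score_schema(dimension_ids: List[str]) -> Dict[str, Any]:
    """Hypothetical stand-in for the private ``_build_score_schema`` helper."""
    # One integer property per rubric dimension, restricted to the 1-5 scale.
    dimension_props = {
        dim_id: {"type": "integer", "enum": [1, 2, 3, 4, 5]}
        for dim_id in dimension_ids
    }
    return {
        "type": "object",
        "properties": {
            "scores": {
                "type": "object",
                "properties": dimension_props,
                # Every dimension must be present; extra keys are rejected.
                "required": list(dimension_ids),
                "additionalProperties": False,
            }
        },
        "required": ["scores"],
        "additionalProperties": False,
    }


# Illustrative dimension ids, not taken from the real rubric.
schema = build_score_schema(["accuracy", "completeness"])
```

Listing every dimension under `required` and disabling `additionalProperties` is what would make such a schema strict: the model can neither omit a dimension nor invent one.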

The prompt includes global scoring anchors (1-5 scale), hard constraints (e.g. missing evidence → score ≤ 2), and the role's full prompt profile so the model applies role-appropriate evaluation priorities.
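As a rough illustration of how those pieces could be assembled into one instruction string, here is a sketch. The real `_build_instructions` helper is not shown on this page; `build_instructions`, its parameters, and the anchor wording below are hypothetical placeholders. Only the missing-evidence constraint comes from the description above.

```python
from typing import List

# Placeholder wording: the pipeline's actual anchor text is not shown here.
SCORING_ANCHORS = "Score each dimension on an integer scale from 1 to 5."
# This constraint paraphrases the example given in the description.
HARD_CONSTRAINTS = [
    "If supporting evidence is missing, the score must be 2 or lower.",
]


def build_instructions(
    persona: str, priorities: List[str], dimension_ids: List[str]
) -> str:
    """Hypothetical sketch: join persona, anchors, constraints, and priorities."""
    sections = [
        persona,
        "Scoring anchors:\n" + SCORING_ANCHORS,
        "Hard constraints:\n" + "\n".join(f"- {c}" for c in HARD_CONSTRAINTS),
        "Role priorities:\n" + "\n".join(f"- {p}" for p in priorities),
        "Dimensions to score: " + ", ".join(dimension_ids),
    ]
    # Blank lines keep the sections visually distinct for the model.
    return "\n\n".join(sections)


text = build_instructions(
    "You are an ICU nurse reviewing a handoff summary.",
    ["Medication safety", "Pending orders"],
    ["accuracy", "completeness"],
)
```

Keeping the global anchors and constraints in shared constants while passing the role-specific parts as arguments mirrors the split described above: the scale is the same for every role, and only the priorities change.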

`score_summary_llm(summary, role, rubric, *, model='gpt-4o-mini', temperature=0.2)`

Score a clinical summary using an LLM via the OpenAI Responses API.

Parameters:

Name	Type	Description	Default
`summary`	`str`	The clinical summary text to evaluate.	required
`role`	`RoleProfile`	The clinical role whose perspective to apply.	required
`rubric`	`Rubric`	The evaluation rubric (defines which dimensions to score).	required
`model`	`str`	OpenAI model identifier.	`'gpt-4o-mini'`
`temperature`	`float`	Sampling temperature (lower = more deterministic).	`0.2`

Returns:

Type	Description
`AgentScore`	An `AgentScore` with per-dimension integer scores and a weighted overall. Rationales are not returned by this engine (only the heuristic engine produces them).
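The weighted overall is produced by `compute_overall_score`, whose implementation is not shown on this page. A plausible reading is a weight-normalized mean over the rubric dimensions; the sketch below assumes that form and assumes the role's weights are a mapping from dimension id to weight. The function name `weighted_overall` and the sample values are hypothetical.

```python
from typing import Dict, List


def weighted_overall(
    scores: Dict[str, int],
    weights: Dict[str, float],
    dimension_ids: List[str],
) -> float:
    """Assumed form: sum(w_d * s_d) / sum(w_d) over the rubric dimensions."""
    total_weight = sum(weights[d] for d in dimension_ids)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive value.")
    weighted_sum = sum(weights[d] * scores[d] for d in dimension_ids)
    return weighted_sum / total_weight


# Illustrative values: (0.75 * 4 + 0.25 * 2) / 1.0 = 3.5
overall = weighted_overall(
    {"accuracy": 4, "completeness": 2},
    {"accuracy": 0.75, "completeness": 0.25},
    ["accuracy", "completeness"],
)
```

Dividing by the total weight keeps the result on the same 1-5 scale as the individual scores even when the weights do not sum to one.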

Raises:

Type	Description
`OpenAIClientError`	If the API call fails, the response is malformed, or any score is missing, cannot be converted to an integer, or falls outside the 1-5 range.
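The score-related failure cases can be exercised in isolation with a small stand-alone validator. This is a sketch that mirrors the checks described here, not the pipeline's own code: `ScoreValidationError` is a stand-in for `OpenAIClientError`, and `normalize_scores` is a hypothetical name.

```python
from typing import Any, Dict, List


class ScoreValidationError(Exception):
    """Stand-in for OpenAIClientError, which lives in the real package."""


def normalize_scores(raw: Any, dimension_ids: List[str]) -> Dict[str, int]:
    """Validate raw model scores and return them as integers in 1-5."""
    if not isinstance(raw, dict):
        raise ScoreValidationError("Model output missing scores object.")

    normalized: Dict[str, int] = {}
    for dim_id in dimension_ids:
        value = raw.get(dim_id)
        if value is None:
            raise ScoreValidationError(f"Missing score for dimension: {dim_id}")
        try:
            score = int(value)
        except (TypeError, ValueError) as exc:
            raise ScoreValidationError(
                f"Invalid score for {dim_id}: {value}"
            ) from exc
        if not 1 <= score <= 5:
            raise ScoreValidationError(f"Out-of-range score for {dim_id}: {score}")
        normalized[dim_id] = score
    return normalized


# Numeric strings are coerced by int(); unknown keys are ignored.
result = normalize_scores({"accuracy": "4", "extra": 1}, ["accuracy"])
```

Because the conversion uses `int()`, a numeric string such as `"4"` is accepted and a float such as `3.7` is truncated to `3` rather than rejected. Callers who need exact integers would have to add a type check before the conversion.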

Source code in `src/grading_pipeline/llm_scoring.py`

def score_summary_llm(
    summary: str,
    role: RoleProfile,
    rubric: Rubric,
    *,
    model: str = "gpt-4o-mini",
    temperature: float = 0.2,
) -> AgentScore:
    """Score a clinical summary using an LLM via the OpenAI Responses API.

    Args:
        summary: The clinical summary text to evaluate.
        role: The clinical role whose perspective to apply.
        rubric: The evaluation rubric (defines which dimensions to score).
        model: OpenAI model identifier.
        temperature: Sampling temperature (lower = more deterministic).

    Returns:
        An ``AgentScore`` with per-dimension integer scores and a weighted
        overall.  Rationales are not returned by this engine (only the
        heuristic engine produces them).

    Raises:
        OpenAIClientError: If the API call fails, the response is malformed,
            or any score is missing / out of range.
    """
    schema = _build_score_schema(rubric)
    instructions = _build_instructions(role, rubric)

    response = create_response(
        model=model,
        instructions=instructions,
        input_text=summary,
        json_schema=schema,
        temperature=temperature,
    )

    data = extract_json_output(response)

    # Accept either "score" or "scores" as the top-level key.
    scores = data.get("score") or data.get("scores")
    if not isinstance(scores, dict):
        raise OpenAIClientError("Model output missing 'score' object.")

    normalized: Dict[str, int] = {}
    for dim_id in rubric.dimension_ids:
        value = scores.get(dim_id)
        if value is None:
            raise OpenAIClientError(f"Missing score for dimension: {dim_id}")
        try:
            normalized[dim_id] = int(value)
        except (TypeError, ValueError) as exc:
            raise OpenAIClientError(f"Invalid score for {dim_id}: {value}") from exc

        if normalized[dim_id] < 1 or normalized[dim_id] > 5:
            raise OpenAIClientError(
                f"Out-of-range score for {dim_id}: {normalized[dim_id]}"
            )

    overall = compute_overall_score(normalized, role.w_prior, rubric.dimension_ids)

    return AgentScore(
        role_id=role.id,
        scores=normalized,
        overall_score=overall,
    )