LLM Scoring

LLM-based scoring engine using the OpenAI Responses API.

Builds role-specific prompts from persona + profile + rubric, sends them to OpenAI with a strict JSON schema, and parses the structured response into an AgentScore.
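
As an illustration of the "strict JSON schema" step, the sketch below builds a schema with one required integer field per rubric dimension. It is a hypothetical stand-in, not the module's private `_build_score_schema`: the function name, the `scores` wrapper key, and the example dimension IDs are assumptions made for this example.

```python
from typing import Any, Dict, List


def build_score_schema(dimension_ids: List[str]) -> Dict[str, Any]:
    """Hypothetical schema builder: one required 1-5 integer per dimension."""
    dimension_props = {
        # An enum keeps the model on the 1-5 integer scale.
        dim_id: {"type": "integer", "enum": [1, 2, 3, 4, 5]}
        for dim_id in dimension_ids
    }
    return {
        "type": "object",
        "properties": {
            "scores": {
                "type": "object",
                "properties": dimension_props,
                "required": list(dimension_ids),
                "additionalProperties": False,
            }
        },
        "required": ["scores"],
        "additionalProperties": False,
    }


# Example dimension IDs are made up for this sketch.
schema = build_score_schema(["accuracy", "completeness"])
```

Listing every dimension under `required` and setting `additionalProperties` to `False` means a response with a missing or extra dimension fails schema validation before any parsing code runs.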

The prompt includes global scoring anchors (1-5 scale), hard constraints (e.g. missing evidence → score ≤ 2), and the role's full prompt profile so the model applies role-appropriate evaluation priorities.
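
One way such a prompt could be assembled is sketched below. The function name, the argument names, and the wording of the anchor and constraint strings are all assumptions for illustration; the real text lives in the module's private `_build_instructions`.

```python
from typing import List

# Illustrative wording only; the real anchors and constraints are defined in the module.
SCORING_ANCHORS = "Score every dimension as an integer from 1 to 5."
HARD_CONSTRAINTS = ["If evidence is missing, the score must be 2 or lower."]


def build_instructions(persona: str, prompt_profile: str, dimension_ids: List[str]) -> str:
    """Hypothetical prompt builder: persona, anchors, constraints, profile, dimensions."""
    constraint_lines = "\n".join(f"- {rule}" for rule in HARD_CONSTRAINTS)
    dimension_lines = "\n".join(f"- {dim_id}" for dim_id in dimension_ids)
    return "\n\n".join(
        [
            f"Persona: {persona}",
            f"Scoring anchors: {SCORING_ANCHORS}",
            f"Hard constraints:\n{constraint_lines}",
            f"Role profile:\n{prompt_profile}",
            f"Dimensions to score:\n{dimension_lines}",
        ]
    )


# Persona, profile text, and dimension IDs are made up for this sketch.
prompt = build_instructions(
    persona="Emergency physician",
    prompt_profile="Prioritize time-critical findings.",
    dimension_ids=["accuracy", "completeness"],
)
```

Keeping the global anchors and hard constraints separate from the role text lets every role share the same scale while still receiving its own evaluation priorities.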

score_summary_llm(summary, role, rubric, *, model='gpt-4o-mini', temperature=0.2)

Score a clinical summary using an LLM via the OpenAI Responses API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| summary | str | The clinical summary text to evaluate. | required |
| role | RoleProfile | The clinical role whose perspective to apply. | required |
| rubric | Rubric | The evaluation rubric (defines which dimensions to score). | required |
| model | str | OpenAI model identifier. | 'gpt-4o-mini' |
| temperature | float | Sampling temperature (lower = more deterministic). | 0.2 |

Returns:

| Type | Description |
| --- | --- |
| AgentScore | An AgentScore with per-dimension integer scores and a weighted overall. Rationales are not returned by this engine (only the heuristic engine produces them). |

Raises:

| Type | Description |
| --- | --- |
| OpenAIClientError | If the API call fails, the response is malformed, or any score is missing / out of range. |
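
The missing-score and out-of-range checks can be tried in isolation. The sketch below applies the same three checks to a plain dictionary, using a stand-in `ScoreError` class in place of `OpenAIClientError` so that it runs without the package installed. The helper name `normalize_scores` and the example dimension IDs are assumptions.

```python
from typing import Any, Dict, Iterable


class ScoreError(Exception):
    """Stand-in for OpenAIClientError so the sketch runs on its own."""


def normalize_scores(scores: Dict[str, Any], dimension_ids: Iterable[str]) -> Dict[str, int]:
    """Return integer scores for every dimension, or raise ScoreError."""
    normalized: Dict[str, int] = {}
    for dim_id in dimension_ids:
        value = scores.get(dim_id)
        if value is None:
            raise ScoreError(f"Missing score for dimension: {dim_id}")
        try:
            number = int(value)
        except (TypeError, ValueError) as exc:
            raise ScoreError(f"Invalid score for {dim_id}: {value}") from exc
        if not 1 <= number <= 5:
            raise ScoreError(f"Out-of-range score for {dim_id}: {number}")
        normalized[dim_id] = number
    return normalized


# Numeric strings are accepted because int() can parse them.
clean = normalize_scores({"accuracy": "4", "completeness": 3}, ["accuracy", "completeness"])
```

Note that `int()` truncates floats, so a value such as 3.9 would be accepted as 3. A caller that wants to reject non-integer values would need an extra type check before the conversion.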

Source code in src/grading_pipeline/llm_scoring.py
def score_summary_llm(
    summary: str,
    role: RoleProfile,
    rubric: Rubric,
    *,
    model: str = "gpt-4o-mini",
    temperature: float = 0.2,
) -> AgentScore:
    """Score a clinical summary using an LLM via the OpenAI Responses API.

    Args:
        summary: The clinical summary text to evaluate.
        role: The clinical role whose perspective to apply.
        rubric: The evaluation rubric (defines which dimensions to score).
        model: OpenAI model identifier.
        temperature: Sampling temperature (lower = more deterministic).

    Returns:
        An ``AgentScore`` with per-dimension integer scores and a weighted
        overall.  Rationales are not returned by this engine (only the
        heuristic engine produces them).

    Raises:
        OpenAIClientError: If the API call fails, the response is malformed,
            or any score is missing / out of range.
    """
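    # Build the strict JSON schema and the role-specific instructions.
    schema = _build_score_schema(rubric)
    instructions = _build_instructions(role, rubric)

    # Send the summary to the model, constrained to the schema.
    response = create_response(
        model=model,
        instructions=instructions,
        input_text=summary,
        json_schema=schema,
        temperature=temperature,
    )

    # Parse the structured JSON payload out of the raw response.
    data = extract_json_output(response)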

    # Accept either "score" or "scores" as the key of the per-dimension object.
    scores = data.get("score") or data.get("scores")
    if not isinstance(scores, dict):
        raise OpenAIClientError("Model output missing 'score' object.")

    normalized: Dict[str, int] = {}
    for dim_id in rubric.dimension_ids:
        value = scores.get(dim_id)
        if value is None:
            raise OpenAIClientError(f"Missing score for dimension: {dim_id}")
        try:
            normalized[dim_id] = int(value)
        except (TypeError, ValueError) as exc:
            raise OpenAIClientError(f"Invalid score for {dim_id}: {value}") from exc

        if normalized[dim_id] < 1 or normalized[dim_id] > 5:
            raise OpenAIClientError(
                f"Out-of-range score for {dim_id}: {normalized[dim_id]}"
            )

    overall = compute_overall_score(normalized, role.w_prior, rubric.dimension_ids)

    return AgentScore(
        role_id=role.id,
        scores=normalized,
        overall_score=overall,
    )
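
The listing above delegates the overall score to `compute_overall_score`, whose body is not shown on this page. A minimal sketch of one plausible behaviour follows, assuming `role.w_prior` is a mapping from dimension ID to a non-negative weight and that the overall is a weighted mean. The function name `weighted_overall` and both assumptions are illustrative and not confirmed by the source.

```python
from typing import Dict, Sequence


def weighted_overall(
    scores: Dict[str, int],
    weights: Dict[str, float],
    dimension_ids: Sequence[str],
) -> float:
    """Hypothetical weighted mean of per-dimension scores."""
    # Dimensions without an explicit weight contribute nothing (an assumption).
    total_weight = sum(weights.get(dim_id, 0.0) for dim_id in dimension_ids)
    if total_weight <= 0:
        raise ValueError("At least one dimension needs a positive weight.")
    weighted_sum = sum(scores[dim_id] * weights.get(dim_id, 0.0) for dim_id in dimension_ids)
    return weighted_sum / total_weight


# (4 * 3.0 + 2 * 1.0) / (3.0 + 1.0) = 14 / 4 = 3.5
overall = weighted_overall(
    {"accuracy": 4, "completeness": 2},
    {"accuracy": 3.0, "completeness": 1.0},
    ["accuracy", "completeness"],
)
```

Dividing by the total weight keeps the result on the same 1-5 scale as the individual scores, whether or not the weights sum to one.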