
Scoring

Heuristic (keyword-based) scoring engine for clinical summaries.

This module provides a fast, deterministic baseline scorer that requires no API key. Each of the 8 rubric dimensions has its own scoring function that inspects the summary text for keyword hits, structural markers, sentence statistics, and word count.

Scoring flow for one role
  1. Run all 8 dimension scorers independently.
  2. Apply role-specific adjustments (_apply_role_adjustments).
  3. Clamp all scores to [1, 5].
  4. Compute a weighted overall using the role's w_prior weights.

Score scale: 1 (worst) to 5 (best) per dimension.
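Steps 2 and 3 of the flow above (role adjustment, then clamping) can be sketched as follows. The bump values here are hypothetical and do not reflect the real `_apply_role_adjustments` table:

```python
# Illustrative sketch of steps 2-3: apply a role-specific adjustment,
# then clamp each dimension score to the [1, 5] rubric scale.
# The role_bumps values are hypothetical, not the real adjustment table.

def clamp(score: int) -> int:
    """Clamp a dimension score to the 1-5 rubric scale."""
    return max(1, min(5, int(score)))

raw_scores = {
    "factual_accuracy": 4,
    "timeline_evolution": 6,
    "clarity_readability_formatting": 0,
}
role_bumps = {"timeline_evolution": -1}  # hypothetical adjustment for some role

adjusted = {dim: clamp(s + role_bumps.get(dim, 0)) for dim, s in raw_scores.items()}
print(adjusted)
# {'factual_accuracy': 4, 'timeline_evolution': 5, 'clarity_readability_formatting': 1}
```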

AgentScore dataclass

Container for a single role's scoring output.

Attributes:

    role_id (str): Which clinical role produced this score.
    scores (Dict[str, int]): Mapping of dimension_id → integer score (1-5).
    rationales (Dict[str, str] | None): Optional per-dimension textual justification.
    evidence (Dict[str, List[str]] | None): Optional per-dimension list of matched keywords/evidence.
    overall_notes (str): Free-text note about the role's perspective.
    warnings (List[str] | None): Any warnings generated during scoring (e.g. empty input).
    overall_score (float | None): Weighted average across dimensions (computed post-scoring).

Source code in src/grading_pipeline/scoring.py
@dataclass(frozen=True)
class AgentScore:
    """Container for a single role's scoring output.

    Attributes:
        role_id: Which clinical role produced this score.
        scores: Mapping of dimension_id → integer score (1-5).
        rationales: Optional per-dimension textual justification.
        evidence: Optional per-dimension list of matched keywords/evidence.
        overall_notes: Free-text note about the role's perspective.
        warnings: Any warnings generated during scoring (e.g. empty input).
        overall_score: Weighted average across dimensions (computed post-scoring).
    """
    role_id: str
    scores: Dict[str, int]
    rationales: Dict[str, str] | None = None
    evidence: Dict[str, List[str]] | None = None
    overall_notes: str = ""
    warnings: List[str] | None = None
    overall_score: float | None = None

    def to_dict(self) -> Dict:
        payload = {
            "role_id": self.role_id,
            "scores": self.scores,
            "score": self.scores,
        }
        if self.rationales is not None:
            payload["rationales"] = self.rationales
        if self.evidence is not None:
            payload["evidence"] = self.evidence
        if self.overall_notes:
            payload["overall_notes"] = self.overall_notes
        if self.overall_score is not None:
            payload["overall_score"] = self.overall_score
        if self.warnings:
            payload["warnings"] = self.warnings
        return payload

compute_overall_score(scores, weights, dimension_ids)

Compute a weighted average score across dimensions.

If total weight is zero (or all weights missing), falls back to a simple unweighted mean. Result is rounded to 2 decimal places.

Source code in src/grading_pipeline/scoring.py
def compute_overall_score(
    scores: Dict[str, int], weights: Dict[str, float], dimension_ids: List[str]
) -> float:
    """Compute a weighted average score across dimensions.

    If total weight is zero (or all weights missing), falls back to a
    simple unweighted mean.  Result is rounded to 2 decimal places.
    """
    total_weight = sum(weights.get(dim, 0.0) for dim in dimension_ids)
    if total_weight <= 0:
        total_weight = float(len(dimension_ids) or 1)
        return round(sum(scores[dim] for dim in dimension_ids) / total_weight, 2)
    weighted_sum = sum(scores[dim] * weights.get(dim, 0.0) for dim in dimension_ids)
    return round(weighted_sum / total_weight, 2)
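A quick usage sketch, with the function copied verbatim from the source above; the dimension names and weight values are illustrative:

```python
from typing import Dict, List

def compute_overall_score(
    scores: Dict[str, int], weights: Dict[str, float], dimension_ids: List[str]
) -> float:
    # Copied from the source above.
    total_weight = sum(weights.get(dim, 0.0) for dim in dimension_ids)
    if total_weight <= 0:
        total_weight = float(len(dimension_ids) or 1)
        return round(sum(scores[dim] for dim in dimension_ids) / total_weight, 2)
    weighted_sum = sum(scores[dim] * weights.get(dim, 0.0) for dim in dimension_ids)
    return round(weighted_sum / total_weight, 2)

dims = ["factual_accuracy", "clarity_readability_formatting"]
scores = {"factual_accuracy": 5, "clarity_readability_formatting": 3}

# Weighted: (5 * 0.75 + 3 * 0.25) / 1.0 = 4.5
print(compute_overall_score(scores, {"factual_accuracy": 0.75, "clarity_readability_formatting": 0.25}, dims))  # 4.5
# Empty weights trigger the fallback: unweighted mean (5 + 3) / 2 = 4.0
print(compute_overall_score(scores, {}, dims))  # 4.0
```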

score_summary_heuristic(summary, role, rubric)

Score a clinical summary using the keyword-based heuristic engine.

Runs all 8 dimension scorers, applies role-specific adjustments, clamps to [1, 5], and computes the weighted overall score.

Parameters:

    summary (str, required): The clinical summary text to evaluate.
    role (RoleProfile, required): The clinical role whose perspective to apply.
    rubric (Rubric, required): The evaluation rubric (defines which dimensions to score).

Returns:

    AgentScore: Per-dimension scores, rationales, evidence, and a weighted overall score.

Source code in src/grading_pipeline/scoring.py
def score_summary_heuristic(summary: str, role: RoleProfile, rubric: Rubric) -> AgentScore:
    """Score a clinical summary using the keyword-based heuristic engine.

    Runs all 8 dimension scorers, applies role-specific adjustments,
    clamps to [1, 5], and computes the weighted overall score.

    Args:
        summary: The clinical summary text to evaluate.
        role: The clinical role whose perspective to apply.
        rubric: The evaluation rubric (defines which dimensions to score).

    Returns:
        An ``AgentScore`` with per-dimension scores, rationales, evidence,
        and a weighted overall score.
    """
    summary = summary.strip()
    warnings: List[str] = []
    if not summary:
        warnings.append("Empty summary provided.")

    scores: Dict[str, int] = {}
    rationales: Dict[str, str] = {}
    evidence: Dict[str, List[str]] = {}

    factual_score, factual_rationale, factual_evidence = _score_factual_accuracy(summary)
    scores["factual_accuracy"] = factual_score
    rationales["factual_accuracy"] = factual_rationale
    evidence["factual_accuracy"] = factual_evidence

    chronic_score, chronic_rationale, chronic_evidence = _score_chronic_coverage(summary)
    scores["relevant_chronic_problem_coverage"] = chronic_score
    rationales["relevant_chronic_problem_coverage"] = chronic_rationale
    evidence["relevant_chronic_problem_coverage"] = chronic_evidence

    org_score, org_rationale, org_evidence = _score_organized(summary)
    scores["organized_by_condition"] = org_score
    rationales["organized_by_condition"] = org_rationale
    evidence["organized_by_condition"] = org_evidence

    timeline_score, timeline_rationale, timeline_evidence = _score_timeline(summary)
    scores["timeline_evolution"] = timeline_score
    rationales["timeline_evolution"] = timeline_rationale
    evidence["timeline_evolution"] = timeline_evidence

    recent_score, recent_rationale, recent_evidence = _score_recent_changes(summary)
    scores["recent_changes_highlighted"] = recent_score
    rationales["recent_changes_highlighted"] = recent_rationale
    evidence["recent_changes_highlighted"] = recent_evidence

    word_count = _word_count(summary)
    focus_score, focus_rationale = _score_focus_by_length(word_count)
    scores["focused_not_cluttered"] = focus_score
    rationales["focused_not_cluttered"] = focus_rationale
    evidence["focused_not_cluttered"] = [f"word_count={word_count}"]

    decision_score, decision_rationale, decision_evidence = _score_decision_usefulness(summary)
    scores["usefulness_for_decision_making"] = decision_score
    rationales["usefulness_for_decision_making"] = decision_rationale
    evidence["usefulness_for_decision_making"] = decision_evidence

    clarity_score, clarity_rationale, clarity_evidence = _score_clarity(summary)
    scores["clarity_readability_formatting"] = clarity_score
    rationales["clarity_readability_formatting"] = clarity_rationale
    evidence["clarity_readability_formatting"] = clarity_evidence

    _apply_role_adjustments(role.id, scores, rationales)

    overall_notes = f"Role perspective: {role.name}. Summary length {word_count} words."

    for dim_id in rubric.dimension_ids:
        scores[dim_id] = max(1, min(5, int(scores[dim_id])))

    overall_score = compute_overall_score(scores, role.w_prior, rubric.dimension_ids)

    return AgentScore(
        role_id=role.id,
        scores=scores,
        rationales=rationales,
        evidence=evidence,
        overall_score=overall_score,
        overall_notes=overall_notes,
        warnings=warnings or None,
    )
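Each private `_score_*` helper (bodies not shown on this page) is called as returning a `(score, rationale, evidence)` triple. A hypothetical minimal keyword scorer in that shape, with illustrative keywords and thresholds that are not the real implementation:

```python
from typing import List, Tuple

# Hypothetical example of the (score, rationale, evidence) contract the
# private _score_* helpers follow. Keywords and the hit-count-to-score
# mapping are illustrative only.
RECENT_KEYWORDS = ["recent", "new", "worsening", "improved", "since last visit"]

def score_recent_changes_demo(summary: str) -> Tuple[int, str, List[str]]:
    text = summary.lower()
    hits = [kw for kw in RECENT_KEYWORDS if kw in text]
    # Map hit count onto the 1-5 scale: no hits scores 2, saturating at 5.
    score = min(5, 2 + len(hits)) if hits else 2
    rationale = f"Matched {len(hits)} recency keyword(s)."
    return score, rationale, hits

score, rationale, evidence = score_recent_changes_demo(
    "Worsening dyspnea since last visit; new diuretic started."
)
print(score, evidence)  # 5 ['new', 'worsening', 'since last visit']
```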