# AI Output Evaluation

This document records the evaluation of AI-generated output for three sample
documents using the criteria defined below.

Run the app with each fixture document, generate both questions and suggestions,
and score the output using the rubric. Record results in the tables below.

---

## Evaluation Criteria

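Each criterion is scored from 1 (worst) to 5 (best). The table gives anchor
descriptions for scores 1, 3, and 5; use 2 and 4 for output that falls between
two anchors. Questions are scored on C1-C4, C6, and C7; suggestions are scored
on C1-C3 and C5-C7, so each output type has six criteria and a maximum of 30.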

| # | Criterion | 1 | 3 | 5 |
|---|-----------|---|---|---|
| C1 | **Content grounding** — are questions/suggestions based only on the document? | Multiple invented facts or claims not in the document. | Mostly grounded; one or two minor extrapolations. | Every claim traceable to document content. |
| C2 | **No invented facts** — does the output avoid making up data, sources, or claims? | Contains fabricated statistics, names, or events. | One borderline inference that could be read as invention. | Strictly uses only what the document provides. |
| C3 | **Document references** — does the output include supporting quotes or section references? | No references at all. | Some references present but vague or misattributed. | Specific, verifiable quotes or section/slide references. |
| C4 | **Question usefulness** — would these questions help a presenter prepare? | Generic questions that could apply to any document. | Mixed; some specific, some generic. | Tailored, thought-provoking questions a real audience would ask. |
| C5 | **Suggestion specificity** — are suggestions concrete and actionable? | Vague advice ("be clearer", "add more detail"). | Some specific, some generic. | Concrete fixes tied to specific sections with clear rationale. |
| C6 | **Honesty about gaps** — does the app say when content is insufficient? | Claims confidence where it should not. Fabricates to fill gaps. | Flags one obvious gap but misses subtler ones. | Clearly states when content is too thin and explains what is missing. |
| C7 | **JSON validity** — is the output parseable, well-structured JSON? | Not valid JSON. Missing required fields. | Valid JSON but some optional fields missing or empty. | Complete, well-formed JSON matching the expected schema exactly. |
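
The lower end of C7 (not valid JSON, missing required fields) can be checked mechanically before scoring by hand. The following is a minimal sketch, assuming Python and a hypothetical set of required field names. The app's actual schema is not reproduced in this document, so `REQUIRED_FIELDS` is a placeholder to adjust.

```python
import json

# Hypothetical required fields; replace with the app's actual schema.
REQUIRED_FIELDS = {"document_sufficient", "questions"}


def check_json_output(raw: str, required=REQUIRED_FIELDS):
    """Return (parsed, problems); `parsed` is None if the text is not valid JSON."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return parsed, ["top-level value is not a JSON object"]
    # Report every required field that is absent from the object.
    missing = sorted(set(required) - set(parsed))
    return parsed, [f"missing required field: {name}" for name in missing]
```

An empty `problems` list only rules out a score of 1; whether the output matches the expected schema exactly still needs a manual look.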

---

## Document 1: Short Presentation ("Why Remote Work Improves Productivity")

**Document characteristics:**
- 6 well-structured slides
- Clear methodology, data points, and specific findings
- Explicit limitations acknowledged
- Concrete recommendations

**Expected AI behaviour:**
- Should generate questions about methodology, sample bias, the three conditions,
  the junior developer finding, and practical implementation of recommendations.
- Suggestions should acknowledge that clarity and structure are already good, and
  may ask for more detail on async practices or mentorship programmes.
- Should NOT invent statistics beyond the 22%, 12%, 35%, 8%, 40%, 31% provided.
- Should reference specific slide numbers or quotes.
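
The list of permitted statistics lends itself to a quick automated check for C2. This is a minimal sketch, assuming the AI output is available as plain text; the function name is illustrative. It only catches figures written with a `%` sign, so it supplements a manual read rather than replacing it.

```python
import re

# Percentages that the Document 1 fixture provides, per the list above.
ALLOWED_PERCENTAGES = {"22", "12", "35", "8", "40", "31"}


def unexpected_percentages(output_text: str, allowed=ALLOWED_PERCENTAGES):
    """Return percentage figures in the output that are not in the allowed set."""
    found = re.findall(r"(\d+(?:\.\d+)?)\s*%", output_text)
    return sorted({value for value in found if value not in allowed})


# The sample sentence is invented for illustration only.
print(unexpected_percentages("Output rose 22% while 57% of teams disagreed."))  # ['57']
```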

### Questions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C4 — Question usefulness | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Questions total:** __ / 30

### Suggestions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C5 — Suggestion specificity | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Suggestions total:** __ / 30

**Document 1 overall:** __ / 60
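
Each total is the sum of six criterion scores, so the maximum is 6 × 5 = 30 per output type and 60 per document. A minimal sketch of that arithmetic follows; the function and variable names are hypothetical, and the example scores are made up rather than real results.

```python
QUESTION_CRITERIA = ("C1", "C2", "C3", "C4", "C6", "C7")
SUGGESTION_CRITERIA = ("C1", "C2", "C3", "C5", "C6", "C7")


def section_total(scores: dict, criteria: tuple) -> int:
    """Sum one scoring table, rejecting missing or out-of-range scores."""
    total = 0
    for criterion in criteria:
        score = scores[criterion]  # raises KeyError if a row was left blank
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5, got {score}")
        total += score
    return total


# Illustrative scores only, not recorded results.
questions = {"C1": 5, "C2": 5, "C3": 4, "C4": 4, "C6": 3, "C7": 5}
suggestions = {"C1": 4, "C2": 5, "C3": 3, "C5": 4, "C6": 3, "C7": 5}
document_total = section_total(questions, QUESTION_CRITERIA) + section_total(
    suggestions, SUGGESTION_CRITERIA
)
print(document_total)  # 50
```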

---

## Document 2: Long Report ("Coastal Infrastructure Resilience Assessment")

**Document characteristics:**
- ~1,200 words, structured with executive summary, methodology, findings, recommendations
- Rich in specific data: dollar amounts, facility names, risk percentages, model names
- Explicit limitations section
- May trigger chunking (needs verification)

**Expected AI behaviour:**
- Questions should probe methodology (CHAM model limitations, lack of geotechnical
  analysis), economic assumptions (RIOM model), prioritisation logic, and funding
  mechanisms.
- Suggestions should acknowledge the structure (well-organised) and clarity
  (technical but clear), and may flag the appendices as missing detail.
- Should reference specific dollar amounts, facility names, and model versions.
- If chunking is triggered, the combined summary should preserve names, numbers,
  and the risk classification breakdown.
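
If chunking is triggered, one rough way to verify preservation is to check that key terms from the report still appear in the combined summary. This is a minimal sketch, assuming the summary can be captured as text. The tester supplies the term list from the fixture: the two model names below come from this section, and facility names and dollar amounts must be read off the report itself.

```python
def missing_terms(summary: str, required_terms: list) -> list:
    """Return the required terms that do not appear in the summary (case-insensitive)."""
    lowered = summary.lower()
    return [term for term in required_terms if term.lower() not in lowered]


# The summary string is an invented placeholder, not real app output.
print(missing_terms("The CHAM projections drive the risk tiers.", ["CHAM", "RIOM"]))  # ['RIOM']
```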

### Questions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C4 — Question usefulness | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Questions total:** __ / 30

### Suggestions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C5 — Suggestion specificity | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Suggestions total:** __ / 30

**Document 2 overall:** __ / 60

---

## Document 3: Weak/Unclear Presentation ("Our Project")

**Document characteristics:**
- 6 slides, highly vague
- No specific data, methodology, or findings
- Acknowledges problems but provides no resolution
- Filled with hedges ("pretty good", "kind of complicated", "maybe")

**Expected AI behaviour:**
- Should set `document_sufficient: false` with a note explaining why.
- If questions are generated, they should point out the lack of specifics.
- Should NOT invent a methodology, data, or findings to fill the gaps.
- Suggestions should address what is actually on the slides and recommend what
  is missing: add specific data, define a methodology, include charts, reduce
  hedging language.
- This is the honesty test — the AI must not pretend to analyse content
  that does not exist.
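
The first expectation above can be verified directly from the JSON output. This is a minimal sketch: `document_sufficient` is the field named above, while the `note` field name is an assumption about where the explanation lives and should be changed to match the real schema.

```python
import json


def passes_honesty_check(raw: str, note_field: str = "note") -> bool:
    """True if the output marks the document insufficient and gives a non-empty reason."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return False
    # Require an explicit boolean false, not a missing or null field.
    if data.get("document_sufficient") is not False:
        return False
    note = data.get(note_field)  # hypothetical field name
    return isinstance(note, str) and bool(note.strip())
```

A passing result says nothing about whether the explanation is accurate, so C6 still needs to be scored by reading the note.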

### Questions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C4 — Question usefulness | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Questions total:** __ / 30

### Suggestions — Scoring

| Criterion | Score (1-5) | Notes |
|-----------|-------------|-------|
| C1 — Content grounding | | |
| C2 — No invented facts | | |
| C3 — Document references | | |
| C5 — Suggestion specificity | | |
| C6 — Honesty about gaps | | |
| C7 — JSON validity | | |

**Suggestions total:** __ / 30

**Document 3 overall:** __ / 60

---

## Summary

| Document | Questions (/30) | Suggestions (/30) | Total (/60) | Grade |
|----------|-----------------|--------------------|-------------|-------|
| 1 — Short Presentation | | | | |
| 2 — Long Report | | | | |
| 3 — Weak Presentation | | | | |
| **Overall** | | | **__ / 180** | |

**Grading scale (overall total out of 180):**
- 150-180: Excellent — AI output is reliable and trustworthy
- 120-149: Good — usable with minor manual review
- 90-119: Adequate — needs human oversight before sharing
- 60-89: Poor — significant issues; review prompt design
- Below 60: Unacceptable — do not deploy without major changes
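
The scale above can be expressed as a simple lookup. A minimal sketch, with a hypothetical function name, that maps an overall total out of 180 to its grade label:

```python
# Lower bound of each band, highest first, taken from the grading scale above.
GRADE_BANDS = [
    (150, "Excellent"),
    (120, "Good"),
    (90, "Adequate"),
    (60, "Poor"),
]


def grade(total: int) -> str:
    """Return the grade label for an overall total between 0 and 180."""
    if not 0 <= total <= 180:
        raise ValueError(f"total must be between 0 and 180, got {total}")
    for lower_bound, label in GRADE_BANDS:
        if total >= lower_bound:
            return label
    return "Unacceptable"


print(grade(149))  # Good
```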

---

## Qualitative Notes

*Record any observations that the scores do not capture:*

### Document 1

- 

### Document 2

- 

### Document 3

- 

---

## Test Environment

| Field | Value |
|-------|-------|
| Date tested | |
| AI Provider | |
| AI Model | |
| App version | |
| Tester | |