Ollama Qwen3:8B Test
Student Summary Generation: Test and Results — March 29, 2026
Context: Model Scale
Frontier models in early 2026 typically have total parameter counts in the hundreds of billions to low trillions, but due to sparse (MoE) architectures, the active parameter count per token is often much lower, closer to tens or hundreds of billions. This test was run using an 8 billion parameter model running locally on a MacBook Pro via Ollama, and represents a first attempt at generating student performance summaries with this prompt and model combination. Neither the prompting strategy nor the model configuration has been optimized; better results can be achieved with further iteration. The three-paragraph output format was retained from earlier tests run using the Claude API for comparison purposes, and alternative output formats will be explored.
Hardware
MacBook Pro 14" M5 Pro, 24GB
1. Test Objective
This test evaluates whether the Qwen3:8B model, running locally via Ollama on consumer hardware, can generate accurate, consistent, teacher-facing student performance summaries from Mission HydroSci progress data.
The same prompt was run three times to test both output quality (accuracy, tone, rule compliance) and consistency (whether repeated runs produce stable results). The responses were then evaluated against the actual student data to identify factual errors, rule violations, and patterns of model behavior.
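The report does not specify how the repeated runs were invoked; one way to script them is against Ollama's local HTTP API (the `/api/generate` endpoint and `qwen3:8b` model tag are standard Ollama; the prompt file name is a placeholder):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def run_once(prompt: str, model: str = "qwen3:8b") -> str:
    """Send one non-streaming generation request to the local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Run the same prompt three times to measure output stability:
# prompt = open("summary_prompt.txt").read()   # placeholder file name
# responses = [run_once(prompt) for _ in range(3)]
```

With default sampling settings the three responses need not be identical; the near-identical outputs observed here are an empirical result, not a guarantee.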
2. Prompt Design
The prompt is composed of four sections, each designed to constrain a specific aspect of the model's output:
System Instructions
Defines the model's role as an educational assessment specialist. Specifies writing requirements (2-4 paragraphs, prose, teacher-facing tone), accuracy requirements (data-only claims, no over-interpretation of passed points), and interpretation requirements (use curriculum context, distinguish completed/struggling/not-started).
Curriculum Context
Provides the domain knowledge the model needs to interpret performance data: core skill areas (scientific argumentation, hydrology), how to interpret passed/flagged/not-started statuses, common difficulty patterns by topic, and instructional interpretation principles.
Student Data
The actual grade data for one student: current unit, status of all 26 progress points across 5 units, with durations, metrics, and reason codes for flagged points. This is the only data the model should reference.
Final Output Rules
Hard constraints: no bullet points, no IDs, no reason codes, no metric keys, no bracketed data. Translation rules for converting system data to plain English. Accuracy guardrails against causal hallucination, overgeneralization, and future prediction. Self-check requirement before output.
Full Prompt
ROLE AND INSTRUCTIONS:
You are an educational assessment specialist analyzing student performance
data from Mission HydroSci, a science adventure game that teaches hydrology
and scientific argumentation.
Write a clear, professional performance summary for a teacher or
instructional leader based only on the student data provided and the
curriculum context below.
Your summary must:
- describe the student's overall progress through the game
- highlight areas of strength
- identify areas of concern
- connect flagged results to likely learning gaps using the curriculum context
- suggest instructional focus areas based on patterns in the data
- write 2 to 4 paragraphs in Markdown
- use standard prose, not bullet points
- refer to the learner only as "this student" or "the student"
- if referencing a progress point, use its descriptive name in natural
language rather than any unit or point ID
- avoid generic praise and generic concern statements
- be specific, but concise
- use a professional teacher-facing tone
- use only information directly supported by the provided data
- do not describe a passed progress point as a failure or misunderstanding
unless the data clearly supports that interpretation
- when a point is passed with minor mistakes, treat it as partial success
or developing understanding if relevant
- prioritize patterns across multiple flagged or difficult points over
isolated issues
- if you identify a learning gap, explain it in plain English
- do not predict future difficulty in later units unless the current data
provides a clear and direct basis for that prediction
- do not include raw counts or numeric metrics unless they materially
strengthen the teacher-facing interpretation
- use curriculum context to interpret what flagged or difficult performance
likely means about student understanding
- treat flagged results as evidence of difficulty, but do not overstate
certainty
- distinguish between completed progress, current struggles, and
not-yet-started content
- do not treat not-started future units as weaknesses
CURRICULUM CONTEXT:
# Mission HydroSci: Curriculum Context
Mission HydroSci is a science adventure game where students learn hydrology
and scientific argumentation through five units and 26 progress points.
## Core Skill Areas
### Scientific Argumentation
Students develop the ability to:
- identify claims
- distinguish evidence and reasoning
- construct complete arguments
- evaluate and counter claims
### Hydrology Concepts
Students learn:
- topographic maps and elevation (Unit 2)
- watersheds and flow relationships (Unit 2)
- water flow direction (Unit 3)
- dissolved material transport (Unit 3)
- groundwater and water table (Unit 4)
- soil infiltration and water movement (Unit 4)
- evaporation, condensation, and water cycle (Unit 5)
## How to Interpret Performance
Passed: student reached required understanding
- low mistakes = strong understanding
- higher mistakes = developing understanding
Flagged: difficulty, evidence of misunderstanding
- repeated incorrect attempts = unstable understanding
- failure to reach success condition = missing core concept
Not Started: student has not yet reached that content
- should not be treated as a weakness
## Meaning of Common Difficulty Patterns
Topographic Maps: trouble with contour lines, spatial reasoning
Watersheds: misunderstanding flow relationships, evidence selection
Water Flow Direction: cannot determine flow from elevation
Dissolved Materials: misunderstanding pollutant/nutrient transport
Soil and Groundwater: confusion about infiltration, water table
Water Cycle: misunderstanding evaporation/condensation
## Instructional Interpretation Principles
- patterns across multiple flagged points > isolated issues
- repeated difficulty in related skills = conceptual gap
- successful performance after struggle = partial understanding
- students may complete tasks without full concept mastery
STUDENT DATA:
Current Unit: unit3
## unit1: Orientation and Scientific Argumentation Basics
- u1p1 (Space Legs): passed [duration: 245s] [metrics: mistakeCount=0]
- u1p2 (Info & Intros): passed [duration: 180s] [metrics: mistakeCount=0]
- u1p3 (Defend Expedition): passed [duration: 120s] [metrics: mistakeCount=0]
- u1p4 (What Was That?): passed [duration: 95s]
## unit2: Topographic Maps and Watersheds
- u2p1 (Escape the Ruin): passed [duration: 310s] [metrics: mistakeCount=0]
- u2p2 (Foraged Forging): flagged [reason: BAD_FEEDBACK] [metrics: mistakeCount=6]
- u2p3 (Band Together II): passed [duration: 350s] [metrics: mistakeCount=2]
- u2p4 (Investigate Temple): passed [duration: 280s] [metrics: mistakeCount=1]
- u2p5 (Classified Info): passed [metrics: posCount=5, mistakeCount=1, score=4.67]
- u2p6 (Which Watershed? I): flagged [reason: MISSING_SUCCESS_NODE] [metrics: mistakeCount=3]
- u2p7 (Which Watershed? II): passed [duration: 180s] [metrics: mistakeCount=2]
## unit3: Water Flow Direction and Dissolved Materials
- u3p1 (Supply Run): flagged [reason: TOO_MANY_NEGATIVES] [metrics: mistakeCount=3]
- u3p2 (Pollution Solution I): Not started
- u3p3 (Pollution Solution II): Not started
- u3p4 (Forsaken Facility): Not started
- u3p5 (Balanced Ecosystem): Not started
## unit4-5: Not started (all points)
FINAL OUTPUT RULES:
- write only teacher-facing prose in 2 to 4 paragraphs
- do not use bullet points or lists
- refer to the learner only as "this student" or "the student"
- use descriptive names only, never unit or progress point IDs
- strictly forbid reason codes, metric keys, bracketed data
- convert flagged results into plain English
- do not describe passed points as failures
- do not use passed points as causes of learning gaps
- do not treat not-started content as weakness
- avoid strong predictions about future performance
- do not overgeneralize from a single flagged result
- self-check: scan for forbidden codes before returning
- return only the final summary text
3. Student Data
The test uses grade data for a single student currently in Unit 3. This student has completed all of Unit 1 and Unit 2, with three flagged progress points:
Unit 1: Orientation and Scientific Argumentation Basics
All 4 progress points passed with zero mistakes. Strong foundation in game mechanics and scientific argumentation basics.
Unit 2: Topographic Maps and Watersheds
5 of 7 progress points passed. Two were flagged:
- Foraged Forging — flagged (BAD_FEEDBACK), 6 mistakes. Significant difficulty with topographic map interpretation.
- Which Watershed? I — flagged (MISSING_SUCCESS_NODE), 3 mistakes. Did not reach the success condition for watershed comparison.
The student subsequently passed Which Watershed? II with 2 mistakes, suggesting partial recovery.
Unit 3: Water Flow Direction and Dissolved Materials
1 of 5 progress points attempted:
- Supply Run — flagged (TOO_MANY_NEGATIVES), 3 mistakes. Difficulty with water flow direction.
Remaining 4 progress points not yet started.
Units 4-5
Not started (all 10 progress points).
4. Model Responses
The same prompt was run three times. All three responses were nearly identical, differing only in minor formatting (bold markup on "Supply Run" in runs 2 and 3). The content, sentence structure, and claims are the same across all runs.
This student has demonstrated strong foundational skills in scientific argumentation and basic hydrology, particularly in Units 1 and 2. Their mastery of claim identification (Defend the Expedition) and narrative engagement (Info and Intros) indicates a solid grasp of argumentation basics and game mechanics. In Unit 2, they successfully navigated topographic maps (Escape the Ruin) and constructed coherent scientific arguments (Classified Information), showing proficiency in connecting terrain features to watershed dynamics. However, their progress in Unit 3 is stalled, with significant gaps in understanding water flow direction and dissolved material transport.
The student's flagged results in Supply Run—marked by repeated errors and zero successful sends—highlight a critical misunderstanding of how elevation data translates to water flow direction. This suggests they struggle to apply topographic map interpretation to predict water movement, a skill essential for later units. Additionally, their incomplete progress in Unit 3 (Pollution Solution I–Balanced Ecosystem) indicates a need to reinforce foundational hydrological concepts before advancing. While they passed U2P4 (Investigate Temple) with minimal mistakes, their failure to connect watershed size to flow rate in U2P4 (which was passed with one mistake) may have created a conceptual gap that now affects their ability to predict water behavior in Unit 3.
Instructional focus should prioritize reinforcing the relationship between contour lines and water flow direction through visual modeling and guided practice in Unit 3. Additionally, revisiting U2P2 (Foraged Forging) and U2P6 (Which Watershed? I) could help solidify their ability to translate topographic data into spatial reasoning. Addressing these gaps will better prepare the student to tackle dissolved material transport and argumentation tasks in later units. Without mastery of water flow principles, their progress in Units 4 and 5—particularly the solar still challenge in Unit 5—will remain hindered.
Runs 2 and 3 reproduced this text verbatim apart from the bold markup on "Supply Run" noted above, and are not repeated here.
5. ChatGPT Analysis
The three responses were submitted to ChatGPT for evaluation against the actual student data and prompt rules. The analysis identified high consistency but several accuracy and rule-compliance issues.
Consistency: 9.5 / 10 (nearly identical across 3 runs)
Accuracy: 7 / 10 (several rule violations detected)
Consistency: Excellent
All three responses are essentially the same output with only minor formatting differences (bolding of "Supply Run" in runs 2 and 3). This indicates the prompt is highly deterministic and the model locks into a stable interpretation. Production-ready consistency.
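The "identical up to bold markup" claim can be verified mechanically by normalizing the outputs before comparison. A sketch, using short stand-in strings in place of the full saved responses:

```python
import re

def normalize(text: str) -> str:
    """Strip Markdown bold markers and collapse whitespace before comparing runs."""
    text = text.replace("**", "")
    return re.sub(r"\s+", " ", text).strip()

# Stand-ins for the three saved responses; runs 2 and 3 add bold markup.
run1 = "flagged results in Supply Run highlight a misunderstanding"
run2 = "flagged results in **Supply Run** highlight a misunderstanding"
run3 = "flagged results in **Supply Run** highlight  a misunderstanding"

normalized = {normalize(r) for r in (run1, run2, run3)}
print(len(normalized))  # 1 when all runs are identical after normalization
```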
What the Model Gets Right
Unit 1 Interpretation: Accurate
All passed, zero mistakes, correctly described as a strong foundation.
Unit 2 Strengths: Mostly Accurate
Escape the Ruin and Classified Information correctly identified as strong. General argumentation skills correctly noted.
Key Struggle Identification: Accurate
Supply Run flagged and correctly interpreted as difficulty understanding water flow direction.
Instructional Recommendation: Strong
Focus on contour lines, flow direction, and spatial reasoning is well-aligned with curriculum context.
What the Model Gets Wrong
Issue 1: Hallucinated causal chain.
"their failure to connect watershed size to flow rate in Investigate the Temple..." — This point was passed with only 1 mistake. There is no evidence of conceptual failure. The model fabricated a narrative bridge between an earlier point and a later difficulty.
Violates: "do not reinterpret a passed progress point as misunderstanding"
Issue 2: Using a passed point as a cause.
"may have created a conceptual gap..." — The model uses a passed progress point to explain a later difficulty, which is explicitly forbidden.
Violates: "do not use a passed progress point as the cause of a learning gap"
Issue 3: Overgeneralization from a single flagged point.
"significant gaps in understanding water flow direction and dissolved material transport" — Only one point is flagged in Unit 3. Dissolved materials have not been attempted. No data supports this claim.
Violates: "do not overgeneralize from a single flagged result"
Issue 4: Future prediction (too strong).
"will remain hindered" — This is deterministic and unsupported by data.
Violates: "avoid strong predictions about future performance"
Issue 5: ID leakage.
"U2P4 (Investigate Temple)" — Progress point IDs appear in the output despite explicit prohibition.
Violates: "never include unit or progress point IDs"
Root Cause
The model optimizes for telling a coherent story rather than strict evidence-based reporting. LLMs tend to create narrative bridges ("earlier weakness led to later failure") even when the earlier weakness does not exist in the data. This is a well-known behavior pattern that requires stronger prompt constraints to suppress.
Bottom Line
What Works
- Consistent outputs
- Strong structure
- Good instructional language
What Remains
- Narrative overreach
- Causal hallucination
- Future prediction
Suggested Prompt Fixes
ChatGPT recommended three additions to the Final Output Rules to address the remaining issues:
- do not introduce causal explanations between progress points
unless both are flagged or show clear repeated difficulty
- do not attribute current difficulties to earlier passed progress points
- do not describe unattempted content (e.g., not started points)
as areas of difficulty or misunderstanding
- when only one progress point is flagged in a skill area,
describe it as an isolated difficulty, not a broad conceptual gap
6. Claude Analysis
Independent analysis of the same three responses, evaluated against the student data and prompt rules. This analysis confirms ChatGPT's findings and identifies additional issues.
Agreement with ChatGPT Analysis
ChatGPT's five identified issues are all valid. The hallucinated causal chain (Issue 1), misuse of passed data (Issue 2), overgeneralization (Issue 3), deterministic future prediction (Issue 4), and ID leakage (Issue 5) are all clearly present in the output and clearly violate the prompt rules. The consistency score of 9.5/10 is also accurate: three runs produced effectively identical content.
Additional Issues Identified
Issue 6: False confidence on ambiguous metrics.
The model claims "zero successful sends" based on the metric
count=0 in the Supply Run data. However, the meaning of count is not defined in the prompt or curriculum context. The model invents a specific interpretation ("successful sends") for an ambiguous field and presents it as fact. A data-grounded summary should not interpret undefined metrics with this level of specificity.
Issue 7: "Stalled" overstates the situation.
The model describes progress in Unit 3 as "stalled," implying the student has stopped advancing. The data shows only that the student is currently in Unit 3 and has attempted one progress point. This is a snapshot, not evidence of stalling. The word choice injects a judgment that the data does not support.
Issue 8: Not-started points treated as evidence.
"Their incomplete progress in Unit 3 (Pollution Solution I–Balanced Ecosystem) indicates a need to reinforce foundational hydrological concepts before advancing." These are not-started points being used as evidence of a learning need. The prompt explicitly forbids treating not-started content as a weakness, yet the model frames them as reinforcing a deficit narrative.
Issue 9: Partial recovery ignored.
The student was flagged on Which Watershed? I but then passed Which Watershed? II with only 2 mistakes. This is evidence of developing understanding and partial recovery, which the prompt asks the model to recognize. The model never acknowledges this recovery, instead using only the flagged point to build its deficit narrative. This is a significant omission that skews the summary toward a more negative interpretation than the data warrants.
Issue 10: Additional ID leakage.
Beyond "U2P4" noted by ChatGPT, the model also outputs "U2P2 (Foraged Forging)" and "U2P6 (Which Watershed? I)" in the third paragraph. The prompt rule against IDs is violated in multiple locations, not just one.
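Leaks like these are mechanically detectable before a summary is accepted. A sketch of a pre-acceptance scan, where the pattern set is an assumption derived from this prompt's Final Output Rules rather than a fixed specification:

```python
import re

# Patterns the Final Output Rules forbid: point IDs, reason codes,
# raw metric keys, and bracketed data fragments.
FORBIDDEN = [
    re.compile(r"\bu\d+p\d+\b", re.IGNORECASE),            # progress point IDs (U2P4, u2p6, ...)
    re.compile(r"\b[A-Z]+(?:_[A-Z]+)+\b"),                 # reason codes (BAD_FEEDBACK, ...)
    re.compile(r"\b(?:mistakeCount|posCount|score)\s*="),  # raw metric keys
    re.compile(r"\[[^\]]*\]"),                             # bracketed data
]

def find_violations(summary: str) -> list[str]:
    """Return every forbidden fragment found in a generated summary."""
    hits = []
    for pattern in FORBIDDEN:
        hits.extend(pattern.findall(summary))
    return hits

sample = "Revisiting U2P2 (Foraged Forging) and U2P6 (Which Watershed? I) could help."
print(find_violations(sample))  # ['U2P2', 'U2P6']
```

A non-empty result can trigger a retry or a deterministic cleanup pass instead of relying on the model's self-check.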
Structural Observation
The model produces a three-paragraph structure: strengths, concerns, recommendations. This is a reasonable structure, but it creates a narrative arc that pressures the model into escalation: paragraph 1 establishes competence, paragraph 2 introduces problems, paragraph 3 prescribes fixes. This arc incentivizes the model to overstate the severity of concerns in paragraph 2 to justify the recommendations in paragraph 3. A flatter structure (e.g., progress summary, then specific observations without a prescribed fix) might reduce the tendency toward narrative overreach.
Revised Scores
Consistency: 9.5 / 10 (agrees with ChatGPT)
Accuracy: 5.5 / 10 (lower than ChatGPT's 7/10)
The accuracy score is lower than ChatGPT's assessment because the additional issues (ambiguous metric interpretation, not-started-as-evidence, ignored recovery, multiple ID leaks) compound the original five problems. The output contains correct high-level observations but builds an unsupported deficit narrative that a teacher could act on inappropriately. For a system intended to inform instructional decisions, accuracy at the claim level matters more than getting the overall shape right.
Bottom Line
Production-Ready
- Consistency across runs
- Professional tone and structure
- Correct identification of strengths
- Appropriate use of curriculum context
- Good instructional recommendations (when grounded)
Not Yet Production-Ready
- Fabricated causal chains between progress points
- Confident interpretation of ambiguous data
- Not-started content used as deficit evidence
- Partial recovery evidence ignored
- Deterministic future predictions
- Persistent ID leakage despite explicit prohibition
- Narrative arc incentivizes overstating severity
Recommendations
ChatGPT's suggested prompt fixes are a good starting point. Beyond those, I would recommend:
- do not interpret metric fields beyond what is explicitly defined
in the curriculum context; if a metric's meaning is unclear,
do not reference it
- when a student is flagged on one progress point but passes a
closely related subsequent point, acknowledge this as evidence
of developing understanding
- do not escalate severity to justify recommendations;
recommendations should be proportional to the evidence
- after completing the draft, perform a second pass to remove
any text matching the pattern u[0-9]p[0-9] or U[0-9]P[0-9]
in any capitalization
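That second pass can also be enforced deterministically outside the model rather than requested in the prompt. A minimal post-processing sketch:

```python
import re

# Matches leaked progress point IDs in any capitalization, plus trailing space.
ID_PATTERN = re.compile(r"\bu\d+p\d+\b\s*", re.IGNORECASE)

def scrub_ids(summary: str) -> str:
    """Drop leaked progress point IDs, leaving the descriptive names intact."""
    return ID_PATTERN.sub("", summary)

print(scrub_ids("Revisiting U2P2 (Foraged Forging) could help."))
# Revisiting (Foraged Forging) could help.
```

A deterministic scrub guarantees the rule holds even when the model's self-check fails, as it did in all three runs here.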
The most impactful structural change would be to test whether a shorter output (2 paragraphs instead of 3) reduces narrative overreach by removing the pressure to build a three-act arc. The current prompt allows 2-4 paragraphs; testing at the lower bound may produce more disciplined output.