
Ollama Qwen3:8B Test

Student Summary Generation: Test and Results — March 29, 2026
Context: Model Scale
Frontier models in early 2026 typically have total parameter counts in the hundreds of billions to low trillions, but because of sparse mixture-of-experts (MoE) architectures, the active parameter count per token is often much lower, closer to tens or hundreds of billions. This test was run using an 8-billion-parameter model running locally on a MacBook Pro via Ollama, and represents a first attempt at generating student performance summaries with this prompt and model combination. Neither the prompting strategy nor the model configuration has been optimized; better results can be achieved with further iteration. The three-paragraph output format was retained from earlier tests run using the Claude API for comparison, and alternative output formats will be explored.
Model: qwen3:8b
Runtime: Ollama (local)
Hardware: MacBook Pro 14" M5 Pro, 24GB

1. Test Objective

This test evaluates whether the Qwen3:8B model, running locally via Ollama on consumer hardware, can generate accurate, consistent, teacher-facing student performance summaries from Mission HydroSci progress data.

The same prompt was run three times to test both output quality (accuracy, tone, rule compliance) and consistency (whether repeated runs produce stable results). The responses were then evaluated against the actual student data to identify factual errors, rule violations, and patterns of model behavior.
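For reference, here is a minimal sketch of the kind of harness used for this test; the script and file name are illustrative, not the exact code used. It posts the same prompt to the local Ollama server three times via the /api/generate endpoint and saves each run:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3:8b"

# Full prompt from section 2; the path is illustrative.
with open("mhs_summary_prompt.txt") as f:
    prompt = f.read()

for run in range(1, 4):
    # stream=False returns the entire completion as one JSON object
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    text = resp.json()["response"]
    with open(f"run_{run}.md", "w") as out:
        out.write(text)
    print(f"run {run}: {len(text)} characters")
```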

2. Prompt Design

The prompt is composed of four sections, each designed to constrain a specific aspect of the model's output:

System Instructions

Defines the model's role as an educational assessment specialist. Specifies writing requirements (2-4 paragraphs, prose, teacher-facing tone), accuracy requirements (data-only claims, no over-interpretation of passed points), and interpretation requirements (use curriculum context, distinguish completed/struggling/not-started).

Curriculum Context

Provides the domain knowledge the model needs to interpret performance data: core skill areas (scientific argumentation, hydrology), how to interpret passed/flagged/not-started statuses, common difficulty patterns by topic, and instructional interpretation principles.

Student Data

The actual grade data for one student: current unit, status of all 26 progress points across 5 units, with durations, metrics, and reason codes for flagged points. This is the only data the model should reference.

Final Output Rules

Hard constraints: no bullet points, no IDs, no reason codes, no metric keys, no bracketed data. Translation rules for converting system data to plain English. Accuracy guardrails against causal hallucination, overgeneralization, and future prediction. Self-check requirement before output.
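Because the four sections are independent, the prompt can be assembled programmatically, which makes it easy to swap in a different student's data or tighten the output rules between tests. A minimal sketch, assuming each section lives in its own file (the file names are hypothetical, not the project's actual layout):

```python
from pathlib import Path

# One file per prompt section, in the order described above.
SECTIONS = [
    "system_instructions.txt",
    "curriculum_context.txt",
    "student_data.txt",
    "final_output_rules.txt",
]

def build_prompt(section_dir: str = "prompt_sections") -> str:
    # Join the sections with blank lines between them.
    return "\n\n".join(
        Path(section_dir, name).read_text() for name in SECTIONS
    )
```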

Full Prompt

You are an educational assessment specialist analyzing student performance data from Mission HydroSci, a science adventure game that teaches hydrology and scientific argumentation.

Write a clear, professional performance summary for a teacher or instructional leader based only on the student data provided and the curriculum context below.

Your summary must:
- describe the student's overall progress through the game
- highlight areas of strength
- identify areas of concern
- connect flagged results to likely learning gaps using the curriculum context
- suggest instructional focus areas based on patterns in the data

Writing requirements:
- write 2 to 4 paragraphs in Markdown
- use standard prose, not bullet points
- refer to the learner only as "this student" or "the student"
- if referencing a progress point, use its descriptive name in natural language rather than any unit or point ID
- avoid generic praise and generic concern statements
- be specific, but concise
- use a professional teacher-facing tone

Accuracy requirements:
- use only information directly supported by the provided data
- do not describe a passed progress point as a failure or misunderstanding unless the data clearly supports that interpretation
- when a point is passed with minor mistakes, treat it as partial success or developing understanding if relevant
- prioritize patterns across multiple flagged or difficult points over isolated issues
- if you identify a learning gap, explain it in plain English
- do not predict future difficulty in later units unless the current data provides a clear and direct basis for that prediction
- do not include raw counts or numeric metrics unless they materially strengthen the teacher-facing interpretation

Interpretation requirements:
- use curriculum context to interpret what flagged or difficult performance likely means about student understanding
- treat flagged results as evidence of difficulty, but do not overstate certainty
- distinguish between completed progress, current struggles, and not-yet-started content
- do not treat not-started future units as weaknesses

# Mission HydroSci: Curriculum Context

Mission HydroSci is a science adventure game where students learn hydrology and scientific argumentation through five units and 26 progress points.

## Core Skill Areas

### Scientific Argumentation
Students develop the ability to:
- identify claims
- distinguish evidence and reasoning
- construct complete arguments
- evaluate and counter claims

### Hydrology Concepts
Students learn:
- topographic maps and elevation (Unit 2)
- watersheds and flow relationships (Unit 2)
- water flow direction (Unit 3)
- dissolved material transport (Unit 3)
- groundwater and water table (Unit 4)
- soil infiltration and water movement (Unit 4)
- evaporation, condensation, and water cycle (Unit 5)

## How to Interpret Performance

Passed: student reached required understanding
- low mistakes = strong understanding
- higher mistakes = developing understanding

Flagged: difficulty, evidence of misunderstanding
- repeated incorrect attempts = unstable understanding
- failure to reach success condition = missing core concept

Not Started: student has not yet reached that content
- should not be treated as a weakness

## Meaning of Common Difficulty Patterns

Topographic Maps: trouble with contour lines, spatial reasoning
Watersheds: misunderstanding flow relationships, evidence selection
Water Flow Direction: cannot determine flow from elevation
Dissolved Materials: misunderstanding pollutant/nutrient transport
Soil and Groundwater: confusion about infiltration, water table
Water Cycle: misunderstanding evaporation/condensation

## Instructional Interpretation Principles
- patterns across multiple flagged points > isolated issues
- repeated difficulty in related skills = conceptual gap
- successful performance after struggle = partial understanding
- students may complete tasks without full concept mastery

Current Unit: unit3

## unit1: Orientation and Scientific Argumentation Basics
- u1p1 (Space Legs): passed [duration: 245s] [metrics: mistakeCount=0]
- u1p2 (Info & Intros): passed [duration: 180s] [metrics: mistakeCount=0]
- u1p3 (Defend Expedition): passed [duration: 120s] [metrics: mistakeCount=0]
- u1p4 (What Was That?): passed [duration: 95s]

## unit2: Topographic Maps and Watersheds
- u2p1 (Escape the Ruin): passed [duration: 310s] [metrics: mistakeCount=0]
- u2p2 (Foraged Forging): flagged [reason: BAD_FEEDBACK] [metrics: mistakeCount=6]
- u2p3 (Band Together II): passed [duration: 350s] [metrics: mistakeCount=2]
- u2p4 (Investigate Temple): passed [duration: 280s] [metrics: mistakeCount=1]
- u2p5 (Classified Info): passed [metrics: posCount=5, mistakeCount=1, score=4.67]
- u2p6 (Which Watershed? I): flagged [reason: MISSING_SUCCESS_NODE] [metrics: mistakeCount=3]
- u2p7 (Which Watershed? II): passed [duration: 180s] [metrics: mistakeCount=2]

## unit3: Water Flow Direction and Dissolved Materials
- u3p1 (Supply Run): flagged [reason: TOO_MANY_NEGATIVES] [metrics: mistakeCount=3]
- u3p2 (Pollution Solution I): Not started
- u3p3 (Pollution Solution II): Not started
- u3p4 (Forsaken Facility): Not started
- u3p5 (Balanced Ecosystem): Not started

## unit4-5: Not started (all points)

- write only teacher-facing prose in 2 to 4 paragraphs
- do not use bullet points or lists
- refer to the learner only as "this student" or "the student"
- use descriptive names only, never unit or progress point IDs
- strictly forbid reason codes, metric keys, bracketed data
- convert flagged results into plain English
- do not describe passed points as failures
- do not use passed points as causes of learning gaps
- do not treat not-started content as weakness
- avoid strong predictions about future performance
- do not overgeneralize from a single flagged result
- self-check: scan for forbidden codes before returning
- return only the final summary text

3. Student Data

The test uses grade data for a single student currently in Unit 3. This student has worked through all of Units 1 and 2 and begun Unit 3, with three flagged progress points in total:

Unit 1: Orientation and Scientific Argumentation Basics

All 4 progress points passed with zero mistakes. Strong foundation in game mechanics and scientific argumentation basics.

Unit 2: Topographic Maps and Watersheds

5 of 7 progress points passed. Two flagged:

  • Foraged Forging: flagged (BAD_FEEDBACK) with 6 mistakes
  • Which Watershed? I: flagged (MISSING_SUCCESS_NODE) with 3 mistakes

The student subsequently passed Which Watershed? II with 2 mistakes, suggesting partial recovery.

Unit 3: Water Flow Direction and Dissolved Materials

1 of 5 progress points attempted:

  • Supply Run: flagged (TOO_MANY_NEGATIVES) with 3 mistakes

Remaining 4 progress points not yet started.

Units 4-5

Not started (all 10 progress points).

4. Model Responses

The same prompt was run three times. All three responses were nearly identical, differing only in minor formatting (bold markup on "Supply Run" in runs 2 and 3). The content, sentence structure, and claims are the same across all runs.
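This near-identity can be quantified with a pairwise similarity check over the saved runs; the standard library's difflib is sufficient. A sketch, assuming the run_*.md files produced by the harness in section 1:

```python
import difflib
from itertools import combinations
from pathlib import Path

runs = {i: Path(f"run_{i}.md").read_text() for i in (1, 2, 3)}

for (i, a), (j, b) in combinations(runs.items(), 2):
    # ratio() is 1.0 for identical strings; the bold markup on "Supply Run"
    # in runs 2 and 3 should only cost a fraction of a percent.
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"run {i} vs run {j}: {ratio:.4f}")
```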

Response 1a (Run 1)

This student has demonstrated strong foundational skills in scientific argumentation and basic hydrology, particularly in Units 1 and 2. Their mastery of claim identification (Defend the Expedition) and narrative engagement (Info and Intros) indicates a solid grasp of argumentation basics and game mechanics. In Unit 2, they successfully navigated topographic maps (Escape the Ruin) and constructed coherent scientific arguments (Classified Information), showing proficiency in connecting terrain features to watershed dynamics. However, their progress in Unit 3 is stalled, with significant gaps in understanding water flow direction and dissolved material transport.

The student's flagged results in Supply Run—marked by repeated errors and zero successful sends—highlight a critical misunderstanding of how elevation data translates to water flow direction. This suggests they struggle to apply topographic map interpretation to predict water movement, a skill essential for later units. Additionally, their incomplete progress in Unit 3 (Pollution Solution I–Balanced Ecosystem) indicates a need to reinforce foundational hydrological concepts before advancing. While they passed U2P4 (Investigate Temple) with minimal mistakes, their failure to connect watershed size to flow rate in U2P4 (which was passed with one mistake) may have created a conceptual gap that now affects their ability to predict water behavior in Unit 3.

Instructional focus should prioritize reinforcing the relationship between contour lines and water flow direction through visual modeling and guided practice in Unit 3. Additionally, revisiting U2P2 (Foraged Forging) and U2P6 (Which Watershed? I) could help solidify their ability to translate topographic data into spatial reasoning. Addressing these gaps will better prepare the student to tackle dissolved material transport and argumentation tasks in later units. Without mastery of water flow principles, their progress in Units 4 and 5—particularly the solar still challenge in Unit 5—will remain hindered.

Response 1b (Run 2)

This student has demonstrated strong foundational skills in scientific argumentation and basic hydrology, particularly in Units 1 and 2. Their mastery of claim identification (Defend the Expedition) and narrative engagement (Info and Intros) indicates a solid grasp of argumentation basics and game mechanics. In Unit 2, they successfully navigated topographic maps (Escape the Ruin) and constructed coherent scientific arguments (Classified Information), showing proficiency in connecting terrain features to watershed dynamics. However, their progress in Unit 3 is stalled, with significant gaps in understanding water flow direction and dissolved material transport.

The student's flagged results in Supply Run—marked by repeated errors and zero successful sends—highlight a critical misunderstanding of how elevation data translates to water flow direction. This suggests they struggle to apply topographic map interpretation to predict water movement, a skill essential for later units. Additionally, their incomplete progress in Unit 3 (Pollution Solution I–Balanced Ecosystem) indicates a need to reinforce foundational hydrological concepts before advancing. While they passed U2P4 (Investigate Temple) with minimal mistakes, their failure to connect watershed size to flow rate in U2P4 (which was passed with one mistake) may have created a conceptual gap that now affects their ability to predict water behavior in Unit 3.

Instructional focus should prioritize reinforcing the relationship between contour lines and water flow direction through visual modeling and guided practice in Unit 3. Additionally, revisiting U2P2 (Foraged Forging) and U2P6 (Which Watershed? I) could help solidify their ability to translate topographic data into spatial reasoning. Addressing these gaps will better prepare the student to tackle dissolved material transport and argumentation tasks in later units. Without mastery of water flow principles, their progress in Units 4 and 5—particularly the solar still challenge in Unit 5—will remain hindered.

Response 1c (Run 3)

This student has demonstrated strong foundational skills in scientific argumentation and basic hydrology, particularly in Units 1 and 2. Their mastery of claim identification (Defend the Expedition) and narrative engagement (Info and Intros) indicates a solid grasp of argumentation basics and game mechanics. In Unit 2, they successfully navigated topographic maps (Escape the Ruin) and constructed coherent scientific arguments (Classified Information), showing proficiency in connecting terrain features to watershed dynamics. However, their progress in Unit 3 is stalled, with significant gaps in understanding water flow direction and dissolved material transport.

The student's flagged results in Supply Run—marked by repeated errors and zero successful sends—highlight a critical misunderstanding of how elevation data translates to water flow direction. This suggests they struggle to apply topographic map interpretation to predict water movement, a skill essential for later units. Additionally, their incomplete progress in Unit 3 (Pollution Solution I–Balanced Ecosystem) indicates a need to reinforce foundational hydrological concepts before advancing. While they passed U2P4 (Investigate Temple) with minimal mistakes, their failure to connect watershed size to flow rate in U2P4 (which was passed with one mistake) may have created a conceptual gap that now affects their ability to predict water behavior in Unit 3.

Instructional focus should prioritize reinforcing the relationship between contour lines and water flow direction through visual modeling and guided practice in Unit 3. Additionally, revisiting U2P2 (Foraged Forging) and U2P6 (Which Watershed? I) could help solidify their ability to translate topographic data into spatial reasoning. Addressing these gaps will better prepare the student to tackle dissolved material transport and argumentation tasks in later units. Without mastery of water flow principles, their progress in Units 4 and 5—particularly the solar still challenge in Unit 5—will remain hindered.


5. ChatGPT Analysis

The three responses were submitted to ChatGPT for evaluation against the actual student data and prompt rules. The analysis identified high consistency but several accuracy and rule-compliance issues.
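The rule-compliance portion of this evaluation is mechanizable: the Final Output Rules forbid specific surface patterns (point IDs, reason codes, metric keys, bracketed data, bullet points), all of which are detectable with regular expressions. A sketch of such a pre-screen, with patterns derived from the data format in section 2 (the evaluation reported here was done manually):

```python
import re

# Surface-level violations of the Final Output Rules; the pattern
# list is illustrative, based on the data shown in section 2.
CHECKS = {
    "point ID": re.compile(r"\bu\d+p\d+\b", re.IGNORECASE),
    "reason code": re.compile(r"\b(BAD_FEEDBACK|MISSING_SUCCESS_NODE|TOO_MANY_NEGATIVES)\b"),
    "metric key": re.compile(r"\b(mistakeCount|posCount|duration)\b"),
    "bracketed data": re.compile(r"\[[^\]]*:[^\]]*\]"),
    "bullet point": re.compile(r"^\s*[-*•]\s", re.MULTILINE),
}

def screen(summary: str) -> list[str]:
    """Return the forbidden patterns found in a generated summary."""
    violations = []
    for name, pattern in CHECKS.items():
        match = pattern.search(summary)
        if match:
            violations.append(f"{name}: {match.group(0)!r}")
    return violations
```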

Scores

Consistency: 9.5 / 10 (nearly identical across 3 runs)
Accuracy: 7 / 10 (several rule violations detected)

Consistency: Excellent. All three responses are essentially the same output with only minor formatting differences (bolding of "Supply Run" in runs 2 and 3). This indicates the prompt is highly deterministic and the model locks into a stable interpretation; consistency is production-ready.

What the Model Gets Right

Unit 1 Interpretation: Accurate. All passed with zero mistakes, correctly described as a strong foundation.
Unit 2 Strengths: Mostly accurate. Escape the Ruin and Classified Information correctly identified as strong; general argumentation skills correctly noted.
Key Struggle Identification: Accurate. Supply Run flagged and correctly interpreted as difficulty understanding water flow direction.
Instructional Recommendation: Strong. The focus on contour lines, flow direction, and spatial reasoning is well aligned with the curriculum context.

What the Model Gets Wrong

Issue 1: Hallucinated causal chain. The summary claims an earlier passed point created a conceptual gap that "now affects" Unit 3 performance, a causal link with no basis in the data.
Issue 2: Misuse of passed data. Investigate Temple, which was passed with one mistake, is described as a "failure to connect watershed size to flow rate."
Issue 3: Overgeneralization. A single flagged point (Supply Run) is escalated into a "critical misunderstanding" of water flow direction.
Issue 4: Deterministic future prediction. The claim that progress in Units 4 and 5 "will remain hindered" violates the rule against strong predictions, and the referenced "solar still challenge" appears nowhere in the data or curriculum context.
Issue 5: ID leakage. Progress point IDs (U2P4, U2P2, U2P6) appear in all three responses despite the explicit prohibition.

Root Cause

The model optimizes for telling a coherent story rather than strict evidence-based reporting. LLMs tend to create narrative bridges ("earlier weakness led to later failure") even when the earlier weakness does not exist in the data. This is a well-known behavior pattern that requires stronger prompt constraints to suppress.

Bottom Line

What Works

  • Consistent outputs
  • Strong structure
  • Good instructional language

What Remains

  • Narrative overreach
  • Causal hallucination
  • Future prediction

Suggested Prompt Fixes

ChatGPT recommended three additions to the Final Output Rules to address the remaining issues:

# Highest-impact fix:
- do not introduce causal explanations between progress points unless both are flagged or show clear repeated difficulty

# Additional high-impact fixes:
- do not attribute current difficulties to earlier passed progress points
- do not describe unattempted content (e.g., not-started points) as areas of difficulty or misunderstanding

# Optional (fixes overgeneralization):
- when only one progress point is flagged in a skill area, describe it as an isolated difficulty, not a broad conceptual gap

6. Claude Analysis

Independent analysis of the same three responses, evaluated against the student data and prompt rules. This analysis confirms ChatGPT's findings and identifies additional issues.

Agreement with ChatGPT Analysis

ChatGPT's five identified issues are all valid. The hallucinated causal chain (Issue 1), misuse of passed data (Issue 2), overgeneralization (Issue 3), deterministic future prediction (Issue 4), and ID leakage (Issue 5) are all clearly present in the output and clearly violate the prompt rules. The consistency score of 9.5/10 is also accurate: three runs produced effectively identical content.

Additional Issues Identified

Ambiguous metric interpretation: the summaries assert "repeated errors and zero successful sends" for Supply Run, reading meaning into metrics that the curriculum context never defines.
Not-started content as evidence: the unattempted Unit 3 points are cited as showing "a need to reinforce foundational hydrological concepts," violating the rule against treating not-started content as weakness.
Ignored recovery: the pass on Which Watershed? II after the flag on Which Watershed? I is never acknowledged, although the curriculum context defines successful performance after struggle as partial understanding.
Multiple ID leaks: U2P4, U2P2, and U2P6 all appear, so the leakage is systematic rather than a single slip.

Structural Observation

The model produces a three-paragraph structure: strengths, concerns, recommendations. This is a reasonable structure, but it creates a narrative arc that pressures the model into escalation: paragraph 1 establishes competence, paragraph 2 introduces problems, paragraph 3 prescribes fixes. This arc incentivizes the model to overstate the severity of concerns in paragraph 2 to justify the recommendations in paragraph 3. A flatter structure (e.g., progress summary, then specific observations without a prescribed fix) might reduce the tendency toward narrative overreach.

Revised Scores

Consistency: 9.5 / 10 (agrees with ChatGPT)
Accuracy: 5.5 / 10 (lower than ChatGPT's 7/10)

The accuracy score is lower than ChatGPT's assessment because the additional issues (ambiguous metric interpretation, not-started-as-evidence, ignored recovery, multiple ID leaks) compound the original five problems. The output contains correct high-level observations but builds an unsupported deficit narrative that a teacher could act on inappropriately. For a system intended to inform instructional decisions, accuracy at the claim level matters more than getting the overall shape right.

Bottom Line

Production-Ready

  • Consistency across runs
  • Professional tone and structure
  • Correct identification of strengths
  • Appropriate use of curriculum context
  • Good instructional recommendations (when grounded)

Not Yet Production-Ready

  • Fabricated causal chains between progress points
  • Confident interpretation of ambiguous data
  • Not-started content used as deficit evidence
  • Partial recovery evidence ignored
  • Deterministic future predictions
  • Persistent ID leakage despite explicit prohibition
  • Narrative arc incentivizes overstating severity

Recommendations

ChatGPT's suggested prompt fixes are a good starting point. Beyond those, I would recommend:

# Fix for ambiguous metrics:
- do not interpret metric fields beyond what is explicitly defined in the curriculum context; if a metric's meaning is unclear, do not reference it

# Fix for ignored recovery:
- when a student is flagged on one progress point but passes a closely related subsequent point, acknowledge this as evidence of developing understanding

# Fix for narrative escalation:
- do not escalate severity to justify recommendations; recommendations should be proportional to the evidence

# Fix for ID leakage (stronger enforcement):
- after completing the draft, perform a second pass to remove any text matching the pattern u[0-9]p[0-9] or U[0-9]P[0-9] in any capitalization
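The ID-leakage rule in particular is easier to enforce outside the model than through self-checking. A sketch of a post-processing pass in the harness, exploiting the observed leak shape of an ID followed by the descriptive name in parentheses (e.g., "U2P4 (Investigate Temple)"):

```python
import re

# An ID followed by its descriptive name in parentheses: keep the name, drop the ID.
ID_WITH_NAME = re.compile(r"\b[uU]\d+[pP]\d+\s*\(([^)]+)\)")
# A bare ID with no recoverable name: flag the output for regeneration
# rather than deleting, since deletion can leave a broken sentence behind.
BARE_ID = re.compile(r"\b[uU]\d+[pP]\d+\b")

def scrub_ids(summary: str) -> tuple[str, bool]:
    """Return (cleaned text, needs_rerun)."""
    cleaned = ID_WITH_NAME.sub(r"\1", summary)
    return cleaned, bool(BARE_ID.search(cleaned))
```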

The most impactful structural change would be to test whether a shorter output (2 paragraphs instead of 3) reduces narrative overreach by removing the pressure to build a three-act arc. The current prompt allows 2-4 paragraphs; testing at the lower bound may produce more disciplined output.
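Testing at the lower bound requires changing only the length instruction in the writing requirements before re-running the harness; a minimal sketch, reusing the illustrative prompt file from section 1:

```python
# Swap the length instruction from section 2 and regenerate
# with the same three-run harness.
prompt = open("mhs_summary_prompt.txt").read()
two_paragraph_prompt = prompt.replace(
    "write 2 to 4 paragraphs in Markdown",
    "write exactly 2 paragraphs in Markdown",
)
```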