Running Open-Weight LLMs for Student Performance Summaries
Infrastructure evaluation — cloud GPU providers, on-site hardware, and the recommended hybrid architecture
Executive Summary
This document explains how the team can run large language models (LLMs) for student-performance summaries on either:
Cloud-hosted GPUs using providers such as Runpod or Lambda, or
On-site purchased hardware such as NVIDIA DGX Spark.
The best long-term architecture for our project is:
Primary inference on hardware we control on site, using open-weight models we choose and tune,
Cloud GPU as backup and burst capacity, with Runpod and Lambda as the primary options to evaluate.
That approach gives us the best balance of:
Predictable cost
Data control
Repeatability of outputs
Full-time availability (owned hardware can run 24x7 at no additional usage cost)
Resilience when local hardware is unavailable or demand spikes
Why We Are Evaluating This
We are currently using LLM APIs to generate summaries of student performance in Mission HydroSci. That works, but API cost can rise quickly because summaries are generated repeatedly across students, activities, and reporting cycles.
The question is not just whether an API can do the work. It is whether we can run our own model stack in a way that is:
affordable,
privacy-conscious,
IRB-friendly,
stable enough to tune once and use repeatedly,
and operationally realistic for a university research project.
Key lesson: Different models can produce different summaries even with the same prompt, context, and data. Our real goal should be to standardize on a primary model environment, tune that environment carefully, and avoid constant model-switching so the results remain predictable.
This is one of the strongest arguments for owning the primary inference path rather than relying entirely on shifting external APIs.
Recommendation
Recommended Architecture
Primary: On-site hardware we control (e.g., NVIDIA DGX Spark)
Backup / Burst: Cloud GPUs — Runpod and Lambda as the two most relevant providers
Why this is the best fit
Data control is strongest on-site. Student data stays inside infrastructure we manage directly.
Output consistency is better. We can pick one model, one quantization strategy, one prompt template, and tune for that environment.
Costs become more predictable. Hardware is a capital purchase rather than ongoing per-token spend.
Cloud still covers risk. If local hardware is down, overloaded, or temporarily insufficient, we can fail over to Runpod or Lambda.
The project keeps the asset. Unlike service spend, purchased hardware remains available after the initial budget period.
Own the primary inference path. Rent backup compute when needed.
The Main Ways These Systems Can Be Used
There are several distinct operating modes. These are often conflated, but they are materially different.
1. API Model
The current pattern with hosted LLM APIs. We send prompt + context + student data to a vendor-managed model endpoint and get back a summary.
Pros
Easiest to start
Strongest frontier models
No infrastructure to manage
Cons
Recurring usage cost
Changing model behavior over time
Lower control over inference stack
More concern about student data leaving project-controlled infrastructure
2. Cloud GPU Instance Running Our Own Model
We rent a GPU machine from a cloud provider, install or launch an inference server such as vLLM, and serve an open-weight model ourselves.
Pros
Much more control than API use
Can use open-weight models from Hugging Face
Can expose an OpenAI-compatible API
Often dramatically cheaper per summary
Cons
Still rely on third-party infrastructure
Must manage model serving, storage, access control, monitoring
Some providers have capacity variation
3. On-Site Hardware Running Our Own Model
Same architecture as the cloud-GPU case, except the hardware is in our own environment.
Pros
Strongest privacy/control position
Stable model environment
Retained institutional asset
No per-request inference bill
Best fit for long-lived model stack
Cons
Upfront hardware purchase
Maintenance responsibility
Limited local redundancy
4. Hybrid Architecture (Recommended)
Primary: On-site inference
Secondary: Cloud failover or cloud burst capacity
Optional Tertiary: Premium API only for exceptional cases
This gives us privacy and control by default, resilience when hardware is unavailable, and flexibility if demand grows unexpectedly.
Why Standardizing on One Primary Model Environment Matters
The same prompt, context, and data do not produce the same style or quality of summary across different models.
We have already observed this with different hosted models. The same will also be true when switching among API models, cloud-hosted open models, and local/open models on owned hardware.
This means the real product is not just “the model.” It is a summary engine, consisting of:
the model family,
quantization level,
inference runtime,
prompt template,
context construction,
output schema,
and validation/evaluation rules.
The more we switch model backends, the more tuning and evaluation noise we introduce. That is one of the strongest reasons to propose one primary local/on-site environment and cloud as backup, not as a constantly changing primary.
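One lightweight way to keep that engine stable is to treat it as explicit, versioned configuration that is logged alongside every output. The sketch below is illustrative only; the class and field names are assumptions, not our actual code:

```python
from dataclasses import dataclass

# Hypothetical spec capturing the "summary engine" as one versioned
# configuration, so any change to a component is explicit, not accidental.
@dataclass(frozen=True)
class SummaryEngineSpec:
    model: str             # open-weight instruct model identifier
    quantization: str      # e.g. "fp16", "8bit", "4bit"
    runtime: str           # inference engine, e.g. "vllm"
    prompt_template: str   # versioned prompt template identifier
    schema_version: str    # output schema the application validates against

    def fingerprint(self) -> str:
        """Stable identifier to log with every generated summary."""
        return "|".join([self.model, self.quantization, self.runtime,
                         self.prompt_template, self.schema_version])

primary = SummaryEngineSpec(
    model="example-org/example-8b-instruct",  # placeholder name
    quantization="4bit",
    runtime="vllm",
    prompt_template="student-summary-v1",
    schema_version="1.0",
)
```

Storing the fingerprint with each summary makes it possible to tell, later, which engine configuration produced which outputs.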
Cloud GPU Provider Landscape
The full GPU-cloud landscape is broad. Major options in 2026 include hyperscalers such as AWS, Azure, Google Cloud, Oracle Cloud, and IBM Cloud, plus AI-native providers such as Runpod, Lambda, CoreWeave, Crusoe, DigitalOcean/Paperspace, Replicate, Vast.ai, and others.[1]
For our purposes, the most relevant providers for serious evaluation are:
Runpod
Lambda
Those map well to our needs: open-weight model serving, cost-sensitive inference, relatively straightforward deployment, and realistic use for research workloads.
Runpod Deep Dive
Runpod is an AI-focused GPU platform offering both Pods and Serverless products.[2]
Main Operating Modes
A. Pods
The closest thing to renting a GPU machine. You choose a GPU, create a pod, and run your own software stack on it. Ideal when we want a stable inference target, SSH access, explicit control over the runtime, and the ability to install or launch vLLM ourselves.
B. Serverless
Runpod supports serverless vLLM deployments for open-source models.[8] Ideal for lower idle cost, scale-to-zero behavior, and event-driven or batch workflows. The tradeoff is cold starts and more operational abstraction.
C. Flex vs Active Workers
Flex: Scale up only when traffic arrives; best for bursty workloads
Active: Always-on warm workers; avoids cold starts but costs more[2]
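The Flex-vs-Active tradeoff is easy to sanity-check with arithmetic. In the sketch below, every rate and duration is a made-up planning number, not Runpod pricing:

```python
# Illustrative break-even sketch: Flex (pay only while processing) vs Active
# (always-on) workers. All numbers are assumptions for planning, not quotes.
def flex_cost(requests_per_day: int, seconds_per_request: float,
              rate_per_hour: float) -> float:
    """Daily cost when the worker only bills while handling requests."""
    busy_hours = requests_per_day * seconds_per_request / 3600
    return busy_hours * rate_per_hour

def active_cost(rate_per_hour: float) -> float:
    """Daily cost of one always-on worker."""
    return 24 * rate_per_hour

# With bursty, low-volume summary traffic, Flex is far cheaper per day:
daily_flex = flex_cost(requests_per_day=500, seconds_per_request=8, rate_per_hour=2.0)
daily_active = active_cost(rate_per_hour=2.0)
```

For our batchable, intermittent workload, this is why Flex-style scaling is the natural default and Active workers only make sense if cold starts become a real problem.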
Typical GPUs Relevant to Us
GPU | VRAM | Notes
A100 | 80 GB | Safest general choice for LLM inference
L40S | 48 GB | Potentially attractive lower-cost option
H100 | 80 GB | Usually more than we need unless throughput is very high
For storage, Runpod offers ephemeral container disks, per-pod volumes, and network volumes. For our work, the right choice is usually a network volume, because model weights persist independently of the pod and do not need to be re-downloaded every time.
Capacity Behavior
Runpod is flexible and cost-effective, but it is not guaranteed reserved enterprise capacity. When a stopped Pod is restarted, it may show “Zero GPU Pods” if the original GPU is no longer available.[10, 11]
Practical takeaway: As long as a Pod is running, the reserved GPU is stable. Once stopped, that exact GPU may no longer be available. This is manageable for our batchable workload, but is still a reason not to rely on Runpod as the only production path.
Security and Privacy
Pods and workers run with containerized isolation in a multi-tenant environment
Secure Cloud operates in enterprise-grade data centers
Host policies prohibit providers from inspecting Pod/worker data
Can use vetted infrastructure partners meeting SOC 2, ISO 27001, and PCI DSS[9]
SOC 2 Type II certification announced October 2025[7]
Runpod Summary
Strengths
Flexible, cost-efficient
Supports pods and serverless
Straightforward for open-source inference
Good fit for bursty workloads
Weaknesses
Capacity can vary
Secure Cloud and storage choices require deliberate configuration
Multi-tenant by default unless we choose the right deployment pattern
Lambda Deep Dive
Lambda is an AI-focused cloud company with roots in deep-learning workstations and servers. Today it provides cloud GPUs, on-demand instances, clusters, private cloud, and an enterprise trust/security program.[12, 13]
Main Operating Modes
A. On-Demand Cloud Instances
Individual Linux-based GPU-backed virtual machines.[15] The closest counterpart to Runpod Pods. Ideal for stable inference servers, SSH-based administration, running vLLM or Ollama, and predictable hands-on control.
B. 1-Click Clusters
Production-ready clusters of 16 to 512 H100 GPUs.[15, 13] Far more than we need for student-summary generation, but it shows the platform can scale.
C. Private Cloud / Enterprise
Emphasizes single-tenant, shared-nothing architecture, SOC 2 Type II certification, and isolated/caged clusters.[12] Relevant because our team is concerned about student data, IRB, and institutional review.
Choose Runpod for lower cost and flexibility. Choose Lambda if institutional comfort with security messaging matters more than raw cost. It is reasonable to test both.
What Is Involved in Actually Running an LLM on Cloud GPUs
These are the mechanics that stakeholders often do not see.
The Pieces
To run an open-weight LLM on cloud GPUs, we typically need:
A GPU machine (Pod, instance, or cluster)
A model source
An inference engine
Persistent storage for model caching
An HTTP API surface our application can call
Inference Engine
The most important serving engine for our purposes is vLLM, which is designed for high-throughput LLM serving and exposes an OpenAI-compatible API. vLLM supports a large range of open-source models and explicitly supports deployment on Runpod.[17, 18]
Deployment Pattern
Launch pod/instance
Mount or attach persistent storage
Authenticate to Hugging Face if the model requires a license grant or token
Start vllm serve with the chosen model
Expose an internal or protected HTTP endpoint
Send structured student-summary requests to that endpoint
Log outputs and evaluation metadata under our control
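The last three steps of that pattern can be sketched as a minimal client for vLLM's OpenAI-compatible `/v1/chat/completions` route. The endpoint URL, model name, and message contents below are placeholders, not our deployment:

```python
import json
from urllib import request

# Placeholder endpoint for a vLLM server started with `vllm serve <model>`,
# which exposes an OpenAI-compatible chat-completions route.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_summary_request(model: str, student_signals: dict) -> dict:
    """Package structured performance signals into a chat-completions payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You write concise teacher-facing summaries of student performance."},
            {"role": "user",
             "content": json.dumps(student_signals)},
        ],
        "temperature": 0.2,  # low temperature for consistent summaries
        "max_tokens": 300,   # keep output length controlled for cost
    }

def send(payload: dict) -> dict:
    """POST the payload to the protected internal endpoint."""
    req = request.Request(ENDPOINT, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Because the interface is OpenAI-compatible, the same application code can later point at a cloud backup endpoint by changing only the URL.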
Key distinction: This architecture is materially different from sending student data directly to an external API vendor’s model endpoint. We control the model, inference server, prompt layer, and retention behavior much more directly.
Open-Weight Model Sources and What Is Available
The main source for open-weight models today is Hugging Face Hub, which hosts over 2 million models.[19]
Hugging Face Hub: Main repository for downloading and versioning model checkpoints[19, 20]
Vendor org pages: Meta Llama[21], Qwen[22], Mistral[23], Google Gemma[24], Microsoft Phi[25]
Vendor download pages: e.g., Meta provides Llama access through its official download process[26]
Ollama library: Convenient distribution path for local experimentation[27, 28]
Model Families Likely Most Relevant to Us
For student-performance summaries, strong candidates include:
Llama family (widely supported, common default)
Qwen family (strong open-model performance)
Mistral / Ministral family
Gemma family
Phi family
Note on terminology: Many teams say “open source models” when they really mean open-weight models. The important practical question is: can we download the weights, run them ourselves, and keep inference under our own control?
Model Size, GPU Requirements, and Quantization
Model Size vs GPU Memory
Model Size | VRAM (FP16) | VRAM (4-bit) | Typical Deployment
7B | ~14 GB | ~4–6 GB | Runs anywhere (local, DGX, cloud)
13B | ~26 GB | ~8–10 GB | Ideal for DGX and most GPUs
30B | ~60 GB | ~15–20 GB | DGX (quantized) or A100-class GPUs
70B | ~140 GB | ~35–45 GB | Multi-GPU or high-end cloud only
What This Means for DGX Spark
Comfortable range: 7B–13B models (full precision or quantized)
Not ideal for: 70B models unless heavily optimized or distributed
Quantization Levels
Quantization | Memory Usage | Quality Impact
FP16 (full precision) | 100% | Highest fidelity
8-bit | ~50% | Nearly identical in most cases
4-bit | ~25% | Small quality reduction
2-bit | ~12% | Noticeable degradation
Recommended: 4-bit quantization is typically the best balance of memory efficiency, speed, and output quality for structured summarization tasks.
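The memory columns in the tables above follow from a simple weights-only estimate: parameter count times bits per parameter, divided by 8. This is a sketch of that arithmetic; real runtime usage adds KV cache and activation overhead on top, which is why the table ranges run a bit higher than the bare numbers:

```python
# Weights-only VRAM estimate behind the sizing tables above.
# Runtime use (KV cache, activations) adds overhead beyond this figure.
def weights_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Gigabytes needed to hold model weights alone."""
    return params_billions * bits_per_param / 8

# e.g. a 7B model at FP16 (16 bits) needs ~14 GB for weights;
# the same model at 4-bit needs ~3.5 GB before overhead.
```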
Key insight: For structured summarization tasks, smaller quantized models (7B–30B) often perform sufficiently well, making them practical for on-site deployment without requiring large-scale GPU infrastructure. This is why our proposed architecture does not depend on frontier-scale models.
Can Purchased On-Site Hardware Actually Do the Job?
Yes—very likely.
The summarization task we care about is not the same as frontier general reasoning. We are asking the model to:
digest structured student-performance data,
identify strengths and struggles,
and produce clear teacher-facing language.
That is a good fit for smaller and mid-sized instruct models.
Student summaries are especially compatible with smaller instruct models, 4-bit quantization, careful prompt design, and structured input. The biggest quality lever is often not raw model size; it is input organization, prompt design, and consistent deployment.
That reinforces the case for one primary on-site model stack.
Approximate Cost Thinking
These are not procurement quotes; they are planning-level approximations.
Cloud GPU Cost Characteristics
Per-summary cost on cloud GPU can be very low when we batch requests, keep output lengths controlled, use an appropriate open model, and avoid premium API pricing. The main variables are model size, actual input length, output token count, and whether the endpoint is warm or cold.
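A planning-level sketch of that arithmetic follows; every input here (GPU rate, throughput, batch size) is an assumption for illustration, not a quote:

```python
# Planning-level per-summary cost sketch. All inputs are assumptions,
# not provider pricing or measured throughput.
def cost_per_summary(gpu_rate_per_hour: float,
                     output_tokens: int,
                     tokens_per_second: float,
                     batch_size: int = 1) -> float:
    """Approximate dollar cost of generating one summary on a rented GPU."""
    generation_seconds = output_tokens / tokens_per_second
    gpu_seconds_per_summary = generation_seconds / batch_size
    return gpu_rate_per_hour * gpu_seconds_per_summary / 3600

# e.g. a $2/hr GPU producing 300-token summaries at 50 tokens/sec,
# batched 8 at a time, lands well under a tenth of a cent per summary.
estimate = cost_per_summary(2.0, output_tokens=300, tokens_per_second=50, batch_size=8)
```

The sketch makes the levers visible: batching and controlled output length matter far more than small differences in hourly GPU rates.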
Why Cloud Is Still Useful Even If We Buy Hardware
Failover
Overflow workloads
Trying models before standardizing
Benchmarking
Why On-Site Becomes Attractive Over Time
For recurring workloads, owned hardware changes the economics from “pay every time the model runs” into “pay once for the machine, then mostly pay power/ops.”
This is especially attractive when the institution can keep using the hardware after the initial project period.
Data Privacy, IRB, and Student Data
This is one of the most important sections for internal discussion.
The Strongest Privacy Position
Run the model on-site, on hardware we control, inside network and access controls we already manage. That gives the clearest answer to:
Where did the student data go?
Who had access to it?
Was it sent to a public API?
Was it retained in someone else’s system?
Why Cloud-Hosted Open Models Are Different from API Use
Using a cloud GPU to run our own open-weight model is not the same as using a third-party hosted LLM API. With cloud-hosted open models:
we choose the model,
we run the inference server,
we can minimize or disable application-side logging,
we can control what is persisted,
and there is no automatic model training on our prompts.
However, the infrastructure still belongs to a third party, the deployment must be configured correctly, and institutional review may still be required.
What SOC 2 Type II Certification Means — and What It Does Not
What it is
SOC 2 is an audit framework developed by the AICPA for evaluating controls related to security, availability, processing integrity, confidentiality, and privacy. A Type II report evaluates whether controls were operating effectively over a period of time.[29, 30]
What it helps with
Evidence that the provider has access control, change management, incident response, and data handling procedures reviewed by an independent auditor
Helpful in conversations with university IT, research administration, compliance reviewers, and project stakeholders
A positive trust signal
What it does NOT mean
Automatically approved for our student data
Automatically IRB-approved
Automatically FERPA-compliant for our exact use case
That data cannot be exposed if we misconfigure the service
That institutional review is unnecessary
Bottom line: SOC 2 Type II certification is best understood as a security/compliance maturity signal, not as an automatic green light for sensitive student-data processing.
Practical Privacy Recommendations
Primary summaries run on-site whenever possible.
Cloud is backup, not default.
If cloud is used, use the provider’s more secure configuration (Runpod Secure Cloud, Lambda single-tenant/private options).
Send only the minimum required data to the summarization service.
Prefer structured performance signals over raw logs when possible.
Log and retain outputs deliberately, not by default.
Review IRB / institutional requirements before production use of student records in cloud environments.
We can get the benefits of LLM summarization without making cloud or public APIs the default home of student data.
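To make "minimum required data" concrete, a hypothetical minimization step might reduce raw event logs to aggregate signals before anything is sent to a summarization service. The event and field names below are illustrative, not our actual schema:

```python
# Hypothetical data-minimization step: reduce raw event logs to the structured
# signals the summarizer needs, deliberately excluding identifiers, timestamps,
# and free text. Field names are illustrative, not our actual schema.
def minimize_for_summary(events: list[dict]) -> dict:
    attempts = [e for e in events if e.get("type") == "attempt"]
    correct = sum(1 for e in attempts if e.get("correct"))
    return {
        "attempt_count": len(attempts),
        "accuracy": round(correct / len(attempts), 2) if attempts else None,
        "activities_seen": sorted({e["activity"] for e in attempts}),
        # deliberately excluded: student name/ID, timestamps, raw chat text
    }
```

The summarizer then sees only aggregate performance signals, which also tends to improve output consistency because the input is uniform across students.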
Why the Hybrid On-Site + Cloud-Backup Architecture Is Best
Problem | How the Hybrid Architecture Solves It
Consistency | Tune prompts and context for one primary model environment and keep it stable
Privacy | Sensitive student data stays on-site by default
Reliability | If on-site hardware is unavailable, fail over to Runpod or Lambda
Procurement / Value | The institution retains the hardware asset while preserving operational flexibility
Cost | Avoid making every summary depend on a paid external API call
Growth | If demand spikes, cloud covers overflow
Proposed Architecture for the Team
Primary Environment
On-site DGX Spark or similar purchased hardware
One standardized open-weight instruct model
One inference engine (preferably vLLM)
One carefully tuned prompt/context pipeline
Internal protected API endpoint for the summarization service
Secondary / Failover Environment
Runpod or Lambda
Same model family if possible
Same prompt contract
Same output schema
Same evaluation checks
Operational Pattern
Student performance data is preprocessed into structured features/signals
Summarization requests go to the on-site inference service first
If on-site inference is unavailable or backlogged, the request is routed to the backup cloud environment
Summaries are stored under our normal project controls
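Steps 2 and 3 of that pattern amount to a simple ordered-failover rule, sketched here with placeholder backends standing in for the on-site and cloud endpoints:

```python
# Ordered-failover routing sketch: try the primary (on-site) backend first,
# fall back to cloud on failure. The backends are placeholder callables, not
# our actual service clients.
def route_summary_request(payload: dict, backends: list) -> dict:
    """backends: ordered list of callables, primary (on-site) first."""
    last_error = None
    for backend in backends:
        try:
            result = backend(payload)
            # record which environment served the request, for auditing
            result["served_by"] = getattr(backend, "__name__", "unknown")
            return result
        except Exception as exc:  # backend unavailable or backlogged
            last_error = exc
    raise RuntimeError("All summarization backends failed") from last_error
```

Recording `served_by` with each summary lets us audit how often the cloud path was actually used, which also feeds the cost discussion above.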
Concrete Recommendation to Present to the Team
Run open-weight LLMs primarily on-site on purchased hardware, and maintain a cloud-based backup/failover path using Runpod or Lambda.
Recommendation Details
Use on-site hardware as the default inference environment.
Standardize on one primary open-weight model family and optimize prompts for that environment.
Use Runpod as the most cost-flexible burst/backup option.
Evaluate Lambda as the more enterprise/security-oriented backup option.
Avoid making premium hosted APIs the primary system of record for student summaries.
Treat cloud as backup and overflow, not default.
Why This Is the Best Approach
Because it gives us:
a repeatable summary engine,
stronger privacy positioning,
long-term cost control,
retained institutional value,
and resilience when local hardware is insufficient or down.
This is the most defensible architecture technically, operationally, and institutionally.
Sources
[1] DigitalOcean, “10 Leading AI Cloud Providers for Developers in 2026.” digitalocean.com