Running Open-Weight LLMs for Student Performance Summaries
Infrastructure evaluation — cloud GPU providers, on-site hardware, and the recommended hybrid architecture
Executive Summary
This document explains how the team can run large language models (LLMs) for student-performance summaries on either:
Cloud-hosted GPUs using providers such as Runpod or Lambda, or
On-site purchased hardware such as NVIDIA DGX Spark.
The best long-term architecture for our project is:
Primary inference on hardware we control on site, using open-weight models we choose and tune,
Cloud GPU as backup and burst capacity, with Runpod and Lambda as the primary options to evaluate.
That approach gives us the best balance of:
Predictable cost
Data control
Repeatability of outputs
Full-time availability (owned hardware can run 24x7 at no additional usage cost)
Resilience when local hardware is unavailable or demand spikes
Why We Are Evaluating This
We are currently using LLM APIs to generate summaries of student performance in Mission HydroSci. That works, but API cost can rise quickly because summaries are generated repeatedly across students, activities, and reporting cycles.
The question is not just whether an API can do the work. It is whether we can run our own model stack in a way that is:
affordable,
privacy-conscious,
IRB-friendly,
stable enough to tune once and use repeatedly,
and operationally realistic for a university research project.
Key lesson: Different models can produce different summaries even with the same prompt, context, and data. Our real goal should be to standardize on a primary model environment, tune that environment carefully, and avoid constant model-switching so the results remain predictable.
This is one of the strongest arguments for owning the primary inference path rather than relying entirely on shifting external APIs.
Recommendation
Recommended Architecture
Primary: On-site hardware we control (e.g., NVIDIA DGX Spark)
Backup / Burst: Cloud GPUs — Runpod and Lambda as the two most relevant providers
Why this is the best fit
Data control is strongest on-site. Student data stays inside infrastructure we manage directly.
Output consistency is better. We can pick one model, one quantization strategy, one prompt template, and tune for that environment.
Costs become more predictable. Hardware is a capital purchase rather than ongoing per-token spend.
Cloud still covers risk. If local hardware is down, overloaded, or temporarily insufficient, we can fail over to Runpod or Lambda.
The project keeps the asset. Unlike service spend, purchased hardware remains available after the initial budget period.
Own the primary inference path. Rent backup compute when needed.
The Main Ways These Systems Can Be Used
There are several distinct operating modes. These are often conflated, but they are materially different.
1. API Model
The current pattern with hosted LLM APIs. We send prompt + context + student data to a vendor-managed model endpoint and get back a summary.
Pros
Easiest to start
Strongest frontier models
No infrastructure to manage
Cons
Recurring usage cost
Changing model behavior over time
Lower control over inference stack
More concern about student data leaving project-controlled infrastructure
2. Cloud GPU Instance Running Our Own Model
We rent a GPU machine from a cloud provider, install or launch an inference server such as vLLM, and serve an open-weight model ourselves.
Pros
Much more control than API use
Can use open-weight models from Hugging Face
Can expose an OpenAI-compatible API
Often dramatically cheaper per summary
Cons
Still rely on third-party infrastructure
Must manage model serving, storage, access control, monitoring
Some providers have capacity variation
3. On-Site Hardware Running Our Own Model
Same architecture as the cloud-GPU case, except the hardware is in our own environment.
Pros
Strongest privacy/control position
Stable model environment
Retained institutional asset
No per-request inference bill
Best fit for long-lived model stack
Cons
Upfront hardware purchase
Maintenance responsibility
Limited local redundancy
4. Hybrid Architecture (Recommended)
Primary: On-site inference
Secondary: Cloud failover or cloud burst capacity
Optional Tertiary: Premium API only for exceptional cases
This gives us privacy and control by default, resilience when hardware is unavailable, and flexibility if demand grows unexpectedly.
Why Standardizing on One Primary Model Environment Matters
The same prompt, context, and data do not produce the same style or quality of summary across different models.
We have already observed this with different hosted models. The same will also be true when switching among API models, cloud-hosted open models, and local/open models on owned hardware.
This means the real product is not just “the model.” It is a summary engine, consisting of:
the model family,
quantization level,
inference runtime,
prompt template,
context construction,
output schema,
and validation/evaluation rules.
The more we switch model backends, the more tuning and evaluation noise we introduce. That is one of the strongest reasons to propose one primary local/on-site environment and cloud as backup, not as a constantly changing primary.
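One lightweight way to keep that engine stable is to treat it as explicit, versioned configuration that is logged alongside every output. The sketch below is illustrative only; the class and field names are assumptions, not our actual code:

```python
from dataclasses import dataclass

# Hypothetical spec capturing the "summary engine" as one versioned
# configuration, so any change to a component is explicit, not accidental.
@dataclass(frozen=True)
class SummaryEngineSpec:
    model: str             # open-weight instruct model identifier
    quantization: str      # e.g. "fp16", "8bit", "4bit"
    runtime: str           # inference engine, e.g. "vllm"
    prompt_template: str   # versioned prompt template identifier
    schema_version: str    # output schema the application validates against

    def fingerprint(self) -> str:
        """Stable identifier to log with every generated summary."""
        return "|".join([self.model, self.quantization, self.runtime,
                         self.prompt_template, self.schema_version])

primary = SummaryEngineSpec(
    model="example-org/example-8b-instruct",  # placeholder name
    quantization="4bit",
    runtime="vllm",
    prompt_template="student-summary-v1",
    schema_version="1.0",
)
```

Storing the fingerprint with each summary makes it possible to tell, later, which engine configuration produced which outputs.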
Cloud GPU Provider Landscape
The full GPU-cloud landscape is broad. Major options in 2026 include hyperscalers such as AWS, Azure, Google Cloud, Oracle Cloud, and IBM Cloud, plus AI-native providers such as Runpod, Lambda, CoreWeave, Crusoe, DigitalOcean/Paperspace, Replicate, Vast.ai, and others.[1]
For our purposes, the most relevant providers for serious evaluation are:
Runpod
Lambda
Those map well to our needs: open-weight model serving, cost-sensitive inference, relatively straightforward deployment, and realistic use for research workloads.
Runpod Deep Dive
Runpod is an AI-focused GPU platform offering both Pods and Serverless products.[2]
Main Operating Modes
A. Pods
The closest thing to renting a GPU machine. You choose a GPU, create a pod, and run your own software stack on it. Ideal when we want a stable inference target, SSH access, explicit control over the runtime, and the ability to install or launch vLLM ourselves.
B. Serverless
Runpod supports serverless vLLM deployments for open-source models.[8] Ideal for lower idle cost, scale-to-zero behavior, and event-driven or batch workflows. The tradeoff is cold starts and more operational abstraction.
C. Flex vs Active Workers
Flex: Scale up only when traffic arrives; best for bursty workloads
Active: Always-on warm workers; avoids cold starts but costs more[2]
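The Flex-vs-Active tradeoff is easy to sanity-check with arithmetic. In the sketch below, every rate and duration is a made-up planning number, not Runpod pricing:

```python
# Illustrative break-even sketch: Flex (pay only while processing) vs Active
# (always-on) workers. All numbers are assumptions for planning, not quotes.
def flex_cost(requests_per_day: int, seconds_per_request: float,
              rate_per_hour: float) -> float:
    """Daily cost when the worker only bills while handling requests."""
    busy_hours = requests_per_day * seconds_per_request / 3600
    return busy_hours * rate_per_hour

def active_cost(rate_per_hour: float) -> float:
    """Daily cost of one always-on worker."""
    return 24 * rate_per_hour

# With bursty, low-volume summary traffic, Flex is far cheaper per day:
daily_flex = flex_cost(requests_per_day=500, seconds_per_request=8, rate_per_hour=2.0)
daily_active = active_cost(rate_per_hour=2.0)
```

For our batchable, intermittent workload, this is why Flex-style scaling is the natural default and Active workers only make sense if cold starts become a real problem.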
Typical GPUs Relevant to Us
GPU | VRAM | Notes
A100 | 80 GB | Safest general choice for LLM inference
L40S | 48 GB | Potentially attractive lower-cost option
H100 | 80 GB | Usually more than we need unless throughput is very high
For storage, Runpod offers ephemeral container disks, per-pod volumes, and network volumes. For our work, the right choice is usually a network volume, because model weights persist independently of the pod and do not need to be re-downloaded every time.
Capacity Behavior
Runpod is flexible and cost-effective, but it is not guaranteed reserved enterprise capacity. When a stopped Pod is restarted, it may show “Zero GPU Pods” if the original GPU is no longer available.[10, 11]
Practical takeaway: As long as a Pod is running, the reserved GPU is stable. Once stopped, that exact GPU may no longer be available. This is manageable for our batchable workload, but is still a reason not to rely on Runpod as the only production path.
Security and Privacy
Pods and workers run with containerized isolation in a multi-tenant environment
Secure Cloud operates in enterprise-grade data centers
Host policies prohibit providers from inspecting Pod/worker data
Can use vetted infrastructure partners meeting SOC 2, ISO 27001, and PCI DSS[9]
SOC 2 Type II certification announced October 2025[7]
Runpod Summary
Strengths
Flexible, cost-efficient
Supports pods and serverless
Straightforward for open-source inference
Good fit for bursty workloads
Weaknesses
Capacity can vary
Secure Cloud and storage choices require deliberate configuration
Multi-tenant by default unless we choose the right deployment pattern
Lambda Deep Dive
Lambda is an AI-focused cloud company with roots in deep-learning workstations and servers. Today it provides cloud GPUs, on-demand instances, clusters, private cloud, and an enterprise trust/security program.[12, 13]
Main Operating Modes
A. On-Demand Cloud Instances
Individual Linux-based GPU-backed virtual machines.[15] The closest counterpart to Runpod Pods. Ideal for stable inference servers, SSH-based administration, running vLLM or Ollama, and predictable hands-on control.
B. 1-Click Clusters
Production-ready clusters of 16 to 512 H100 GPUs.[15, 13] Far more than we need for student-summary generation, but it shows the platform can scale.
C. Private Cloud / Enterprise
Emphasizes single-tenant, shared-nothing architecture, SOC 2 Type II certification, and isolated/caged clusters.[12] Relevant because our team is concerned about student data, IRB, and institutional review.
Choose Runpod for lower cost and flexibility. Choose Lambda if institutional comfort with security messaging matters more than raw cost. It is reasonable to test both.
What Is Involved in Actually Running an LLM on Cloud GPUs
These are the mechanics that stakeholders often do not see.
The Pieces
To run an open-weight LLM on cloud GPUs, we typically need:
A GPU machine (Pod, instance, or cluster)
A model source
An inference engine
Persistent storage for model caching
An HTTP API surface our application can call
Inference Engine
The most important serving engine for our purposes is vLLM, which is designed for high-throughput LLM serving and exposes an OpenAI-compatible API. vLLM supports a large range of open-source models and explicitly supports deployment on Runpod.[17, 18]
Deployment Pattern
Launch pod/instance
Mount or attach persistent storage
Authenticate to Hugging Face if the model requires a license grant or token
Start vllm serve with the chosen model
Expose an internal or protected HTTP endpoint
Send structured student-summary requests to that endpoint
Log outputs and evaluation metadata under our control
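The last three steps of that pattern can be sketched as a minimal client for vLLM's OpenAI-compatible `/v1/chat/completions` route. The endpoint URL, model name, and message contents below are placeholders, not our deployment:

```python
import json
from urllib import request

# Placeholder endpoint for a vLLM server started with `vllm serve <model>`,
# which exposes an OpenAI-compatible chat-completions route.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_summary_request(model: str, student_signals: dict) -> dict:
    """Package structured performance signals into a chat-completions payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You write concise teacher-facing summaries of student performance."},
            {"role": "user",
             "content": json.dumps(student_signals)},
        ],
        "temperature": 0.2,  # low temperature for consistent summaries
        "max_tokens": 300,   # keep output length controlled for cost
    }

def send(payload: dict) -> dict:
    """POST the payload to the protected internal endpoint."""
    req = request.Request(ENDPOINT, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Because the interface is OpenAI-compatible, the same application code can later point at a cloud backup endpoint by changing only the URL.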
Key distinction: This architecture is materially different from sending student data directly to an external API vendor’s model endpoint. We control the model, inference server, prompt layer, and retention behavior much more directly.
Open-Weight Model Sources and What Is Available
The main source for open-weight models today is Hugging Face Hub, which hosts over 2 million models.[19]
Hugging Face Hub: Main repository for downloading and versioning model checkpoints[19, 20]
Vendor org pages: Meta Llama[21], Qwen[22], Mistral[23], Google Gemma[24], Microsoft Phi[25]
Vendor download pages: e.g., Meta provides Llama access through its official download process[26]
Ollama library: Convenient distribution path for local experimentation[27, 28]
Model Families Likely Most Relevant to Us
For student-performance summaries, strong candidates include:
Llama family (widely supported, common default)
Qwen family (strong open-model performance)
Mistral / Ministral family
Gemma family
Phi family
Note on terminology: Many teams say “open source models” when they really mean open-weight models. The important practical question is: can we download the weights, run them ourselves, and keep inference under our own control?
Model Size, GPU Requirements, and Quantization
Model Size vs GPU Memory
Model Size | VRAM (FP16) | VRAM (4-bit) | Typical Deployment
7B | ~14 GB | ~4–6 GB | Runs anywhere (local, DGX, cloud)
13B | ~26 GB | ~8–10 GB | Ideal for DGX and most GPUs
30B | ~60 GB | ~15–20 GB | DGX (quantized) or A100-class GPUs
70B | ~140 GB | ~35–45 GB | Multi-GPU or high-end cloud only
What This Means for DGX Spark
Comfortable range: 7B–13B models (full precision or quantized)
Not ideal for: 70B models unless heavily optimized or distributed
Quantization Levels
Quantization | Memory Usage | Quality Impact
FP16 (full precision) | 100% | Highest fidelity
8-bit | ~50% | Nearly identical in most cases
4-bit | ~25% | Small quality reduction
2-bit | ~12% | Noticeable degradation
Recommended: 4-bit quantization is typically the best balance of memory efficiency, speed, and output quality for structured summarization tasks.
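The memory columns in the tables above follow from a simple weights-only estimate: parameter count times bits per parameter, divided by 8. This is a sketch of that arithmetic; real runtime usage adds KV cache and activation overhead on top, which is why the table ranges run a bit higher than the bare numbers:

```python
# Weights-only VRAM estimate behind the sizing tables above.
# Runtime use (KV cache, activations) adds overhead beyond this figure.
def weights_vram_gb(params_billions: float, bits_per_param: int) -> float:
    """Gigabytes needed to hold model weights alone."""
    return params_billions * bits_per_param / 8

# e.g. a 7B model at FP16 (16 bits) needs ~14 GB for weights;
# the same model at 4-bit needs ~3.5 GB before overhead.
```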
Key insight: For structured summarization tasks, smaller quantized models (7B–30B) often perform sufficiently well, making them practical for on-site deployment without requiring large-scale GPU infrastructure. This is why our proposed architecture does not depend on frontier-scale models.
Can Purchased On-Site Hardware Actually Do the Job?
Yes—very likely.
The summarization task we care about is not the same as frontier general reasoning. We are asking the model to:
digest structured student-performance data,
identify strengths and struggles,
and produce clear teacher-facing language.
That is a good fit for smaller and mid-sized instruct models.
Student summaries are especially compatible with smaller instruct models, 4-bit quantization, careful prompt design, and structured input. The biggest quality lever is often not raw model size; it is input organization, prompt design, and consistent deployment.
That reinforces the case for one primary on-site model stack.
Approximate Cost Thinking
These are not procurement quotes; they are planning-level approximations.
Cloud GPU Cost Characteristics
Per-summary cost on cloud GPU can be very low when we batch requests, keep output lengths controlled, use an appropriate open model, and avoid premium API pricing. The main variables are model size, actual input length, output token count, and whether the endpoint is warm or cold.
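A planning-level sketch of that arithmetic follows; every input here (GPU rate, throughput, batch size) is an assumption for illustration, not a quote:

```python
# Planning-level per-summary cost sketch. All inputs are assumptions,
# not provider pricing or measured throughput.
def cost_per_summary(gpu_rate_per_hour: float,
                     output_tokens: int,
                     tokens_per_second: float,
                     batch_size: int = 1) -> float:
    """Approximate dollar cost of generating one summary on a rented GPU."""
    generation_seconds = output_tokens / tokens_per_second
    gpu_seconds_per_summary = generation_seconds / batch_size
    return gpu_rate_per_hour * gpu_seconds_per_summary / 3600

# e.g. a $2/hr GPU producing 300-token summaries at 50 tokens/sec,
# batched 8 at a time, lands well under a tenth of a cent per summary.
estimate = cost_per_summary(2.0, output_tokens=300, tokens_per_second=50, batch_size=8)
```

The sketch makes the levers visible: batching and controlled output length matter far more than small differences in hourly GPU rates.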
Why Cloud Is Still Useful Even If We Buy Hardware
Failover
Overflow workloads
Trying models before standardizing
Benchmarking
Why On-Site Becomes Attractive Over Time
For recurring workloads, owned hardware changes the economics from “pay every time the model runs” into “pay once for the machine, then mostly pay power/ops.”
This is especially attractive when the institution can keep using the hardware after the initial project period.
Data Privacy, IRB, and Student Data
This is one of the most important sections for internal discussion.
The Strongest Privacy Position
Run the model on-site, on hardware we control, inside network and access controls we already manage. That gives the clearest answer to:
Where did the student data go?
Who had access to it?
Was it sent to a public API?
Was it retained in someone else’s system?
Why Cloud-Hosted Open Models Are Different from API Use
Using a cloud GPU to run our own open-weight model is not the same as using a third-party hosted LLM API. With cloud-hosted open models:
we choose the model,
we run the inference server,
we can minimize or disable application-side logging,
we can control what is persisted,
and there is no automatic model training on our prompts.
However, the infrastructure still belongs to a third party, the deployment must be configured correctly, and institutional review may still be required.
What SOC 2 Type II Certification Means — and What It Does Not
What it is
SOC 2 is an audit framework developed by the AICPA for evaluating controls related to security, availability, processing integrity, confidentiality, and privacy. A Type II report evaluates whether controls were operating effectively over a period of time.[29, 30]
What it helps with
Evidence that the provider has access control, change management, incident response, and data handling procedures reviewed by an independent auditor
Helpful in conversations with university IT, research administration, compliance reviewers, and project stakeholders
A positive trust signal
What it does NOT mean
Automatically approved for our student data
Automatically IRB-approved
Automatically FERPA-compliant for our exact use case
That data cannot be exposed if we misconfigure the service
That institutional review is unnecessary
Bottom line: SOC 2 Type II certification is best understood as a security/compliance maturity signal, not as an automatic green light for sensitive student-data processing.
Practical Privacy Recommendations
Primary summaries run on-site whenever possible.
Cloud is backup, not default.
If cloud is used, use the provider’s more secure configuration (Runpod Secure Cloud, Lambda single-tenant/private options).
Send only the minimum required data to the summarization service.
Prefer structured performance signals over raw logs when possible.
Log and retain outputs deliberately, not by default.
Review IRB / institutional requirements before production use of student records in cloud environments.
We can get the benefits of LLM summarization without making cloud or public APIs the default home of student data.
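To make "minimum required data" concrete, a hypothetical minimization step might reduce raw event logs to aggregate signals before anything is sent to a summarization service. The event and field names below are illustrative, not our actual schema:

```python
# Hypothetical data-minimization step: reduce raw event logs to the structured
# signals the summarizer needs, deliberately excluding identifiers, timestamps,
# and free text. Field names are illustrative, not our actual schema.
def minimize_for_summary(events: list[dict]) -> dict:
    attempts = [e for e in events if e.get("type") == "attempt"]
    correct = sum(1 for e in attempts if e.get("correct"))
    return {
        "attempt_count": len(attempts),
        "accuracy": round(correct / len(attempts), 2) if attempts else None,
        "activities_seen": sorted({e["activity"] for e in attempts}),
        # deliberately excluded: student name/ID, timestamps, raw chat text
    }
```

The summarizer then sees only aggregate performance signals, which also tends to improve output consistency because the input is uniform across students.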
Why the Hybrid On-Site + Cloud-Backup Architecture Is Best
Problem | How the Hybrid Architecture Solves It
Consistency | Tune prompts and context for one primary model environment and keep it stable
Privacy | Sensitive student data stays on-site by default
Reliability | If on-site hardware is unavailable, fail over to Runpod or Lambda
Procurement / Value | The institution retains the hardware asset while preserving operational flexibility
Cost | Avoid making every summary depend on a paid external API call
Growth | If demand spikes, cloud covers overflow
Proposed Architecture for the Team
Primary Environment
On-site DGX Spark or similar purchased hardware
One standardized open-weight instruct model
One inference engine (preferably vLLM)
One carefully tuned prompt/context pipeline
Internal protected API endpoint for the summarization service
Secondary / Failover Environment
Runpod or Lambda
Same model family if possible
Same prompt contract
Same output schema
Same evaluation checks
Operational Pattern
Student performance data is preprocessed into structured features/signals
Summarization requests go to the on-site inference service first
If on-site inference is unavailable or backlogged, the request is routed to the backup cloud environment
Summaries are stored under our normal project controls
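Steps 2 and 3 of that pattern amount to a simple ordered-failover rule, sketched here with placeholder backends standing in for the on-site and cloud endpoints:

```python
# Ordered-failover routing sketch: try the primary (on-site) backend first,
# fall back to cloud on failure. The backends are placeholder callables, not
# our actual service clients.
def route_summary_request(payload: dict, backends: list) -> dict:
    """backends: ordered list of callables, primary (on-site) first."""
    last_error = None
    for backend in backends:
        try:
            result = backend(payload)
            # record which environment served the request, for auditing
            result["served_by"] = getattr(backend, "__name__", "unknown")
            return result
        except Exception as exc:  # backend unavailable or backlogged
            last_error = exc
    raise RuntimeError("All summarization backends failed") from last_error
```

Recording `served_by` with each summary lets us audit how often the cloud path was actually used, which also feeds the cost discussion above.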
Concrete Recommendation to Present to the Team
Run open-weight LLMs primarily on-site on purchased hardware, and maintain a cloud-based backup/failover path using Runpod or Lambda.
Recommendation Details
Use on-site hardware as the default inference environment.
Standardize on one primary open-weight model family and optimize prompts for that environment.
Use Runpod as the most cost-flexible burst/backup option.
Evaluate Lambda as the more enterprise/security-oriented backup option.
Avoid making premium hosted APIs the primary system of record for student summaries.
Treat cloud as backup and overflow, not default.
Why This Is the Best Approach
Because it gives us:
a repeatable summary engine,
stronger privacy positioning,
long-term cost control,
retained institutional value,
and resilience when local hardware is insufficient or down.
This is the most defensible architecture technically, operationally, and institutionally.
Sources
[1] DigitalOcean, “10 Leading AI Cloud Providers for Developers in 2026.” digitalocean.com