Intelligence Builders

Running Open-Weight LLMs for Student Performance Summaries

Infrastructure evaluation — cloud GPU providers, on-site hardware, and the recommended hybrid architecture

Executive Summary

This document explains how the team can run large language models (LLMs) for student-performance summaries on either:

  1. Cloud-hosted GPUs using providers such as Runpod or Lambda, or
  2. On-site purchased hardware such as NVIDIA DGX Spark.

The best long-term architecture for our project is a hybrid: primary inference on on-site hardware we control, with cloud GPUs (Runpod or Lambda) as backup and burst capacity.

That approach gives us the best balance of cost predictability, data control, output consistency, and reliability.

Why We Are Evaluating This

We are currently using LLM APIs to generate summaries of student performance in Mission HydroSci. That works, but API cost can rise quickly because summaries are generated repeatedly across students, activities, and reporting cycles.

The question is not just whether an API can do the work. It is whether we can run our own model stack in a way that is affordable, privacy-preserving, consistent in its outputs, and sustainable beyond the initial budget period.

Key lesson: Different models can produce different summaries even with the same prompt, context, and data. Our real goal should be to standardize on a primary model environment, tune that environment carefully, and avoid constant model-switching so the results remain predictable.

This is one of the strongest arguments for owning the primary inference path rather than relying entirely on shifting external APIs.

Recommendation

Recommended Architecture

Primary: On-site hardware we control (e.g., NVIDIA DGX Spark)
Backup / Burst: Cloud GPUs — Runpod and Lambda as the two most relevant providers

Why this is the best fit

  1. Data control is strongest on-site. Student data stays inside infrastructure we manage directly.
  2. Output consistency is better. We can pick one model, one quantization strategy, one prompt template, and tune for that environment.
  3. Costs become more predictable. Hardware is a capital purchase rather than ongoing per-token spend.
  4. Cloud still covers risk. If local hardware is down, overloaded, or temporarily insufficient, we can fail over to Runpod or Lambda.
  5. The project keeps the asset. Unlike service spend, purchased hardware remains available after the initial budget period.

Own the primary inference path. Rent backup compute when needed.

The Main Ways These Systems Can Be Used

There are several distinct operating modes. These are often conflated, but they are materially different.

1. API Model

The current pattern with hosted LLM APIs. We send prompt + context + student data to a vendor-managed model endpoint and get back a summary.

Pros

  • Easiest to start
  • Strongest frontier models
  • No infrastructure to manage

Cons

  • Recurring usage cost
  • Changing model behavior over time
  • Lower control over inference stack
  • More concern about student data leaving project-controlled infrastructure

2. Cloud GPU Instance Running Our Own Model

We rent a GPU machine from a cloud provider, install or launch an inference server such as vLLM, and serve an open-weight model ourselves.

Pros

  • Much more control than API use
  • Can use open-weight models from Hugging Face
  • Can expose an OpenAI-compatible API
  • Often dramatically cheaper per summary

Cons

  • Still rely on third-party infrastructure
  • Must manage model serving, storage, access control, monitoring
  • Some providers have capacity variation

3. On-Site Hardware Running Our Own Model

Same architecture as the cloud-GPU case, except the hardware is in our own environment.

Pros

  • Strongest privacy/control position
  • Stable model environment
  • Retained institutional asset
  • No per-request inference bill
  • Best fit for long-lived model stack

Cons

  • Upfront hardware purchase
  • Maintenance responsibility
  • Limited local redundancy

4. Hybrid Architecture (Recommended)

Primary: On-site inference
Secondary: Cloud failover or cloud burst capacity
Optional tertiary: Premium API, only for exceptional cases

This gives us privacy and control by default, resilience when hardware is unavailable, and flexibility if demand grows unexpectedly.

Why Standardizing on One Primary Model Environment Matters

The same prompt, context, and data do not produce the same style or quality of summary across different models.

We have already observed this with different hosted models. The same will also be true when switching among API models, cloud-hosted open models, and local/open models on owned hardware.

This means the real product is not just “the model.” It is a summary engine, consisting of the model itself, the quantization and serving configuration, the prompt template, the structure of the input data, and the evaluation process around the outputs.

The more we switch model backends, the more tuning and evaluation noise we introduce. That is one of the strongest reasons to propose one primary local/on-site environment and cloud as backup, not as a constantly changing primary.

Cloud GPU Provider Landscape

The full GPU-cloud landscape is broad. Major options in 2026 include hyperscalers such as AWS, Azure, Google Cloud, Oracle Cloud, and IBM Cloud, plus AI-native providers such as Runpod, Lambda, CoreWeave, Crusoe, DigitalOcean/Paperspace, Replicate, Vast.ai, and others.1

For our purposes, the two most relevant providers for serious evaluation are Runpod and Lambda.

Those map well to our needs: open-weight model serving, cost-sensitive inference, relatively straightforward deployment, and realistic use for research workloads.


Runpod Deep Dive

Runpod is an AI-focused GPU platform offering both Pods and Serverless products.2

Main Operating Modes

A. Pods

The closest thing to renting a GPU machine. You choose a GPU, create a pod, and run your own software stack on it. Ideal when we want a stable inference target, SSH access, explicit control over the runtime, and the ability to install or launch vLLM ourselves.

B. Serverless

Runpod supports serverless vLLM deployments for open-source models.8 Ideal for lower idle cost, scale-to-zero behavior, and event-driven or batch workflows. The tradeoff is cold starts and more operational abstraction.

C. Flex vs Active Workers

Within Serverless, Flex workers scale to zero when idle and incur cold starts, while Active workers stay warm continuously at a discounted per-second rate (compare the Flex and Active rows in the pricing table below). Flex suits bursty or batch workloads; Active suits steady traffic.

Typical GPUs Relevant to Us

GPU    VRAM    Notes
A100   80 GB   Safest general choice for LLM inference
L40S   48 GB   Potentially attractive lower-cost option
H100   80 GB   Usually more than we need unless throughput is very high
4090   24 GB   Budget option for smaller models

Pricing

Configuration                   Price          Source
Serverless A100 80GB (Flex)     $0.00076/sec   4
Serverless A100 80GB (Active)   $0.00060/sec   4
Serverless H100 80GB (Flex)     $0.00116/sec   4
Serverless H100 80GB (Active)   $0.00093/sec   4
On-demand A100 PCIe 80GB        $1.19/hr       3
On-demand H100 PCIe 80GB        $1.99/hr       3
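The serverless rates are quoted per second, which obscures the comparison with on-demand Pods. A quick conversion using the rates above:

```python
SECONDS_PER_HOUR = 3600

# Serverless per-second rates from the pricing table above ($/sec).
serverless_rates = {
    "A100 Flex": 0.00076,
    "A100 Active": 0.00060,
    "H100 Flex": 0.00116,
    "H100 Active": 0.00093,
}

for name, per_sec in serverless_rates.items():
    print(f"{name}: ${per_sec * SECONDS_PER_HOUR:.2f}/hr equivalent")

# A100 Flex works out to ~$2.74/hr versus $1.19/hr on-demand, so serverless
# only wins when the endpoint is scaled to zero most of the time.
```

The takeaway: serverless is the cheaper mode only for workloads that are idle most of the time; a continuously busy endpoint is cheaper as an on-demand Pod.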

Storage Pricing

Type                    Price            Notes
Network volume (<1TB)   $0.07/GB/month   Model weights persist independently of the Pod
Network volume (>1TB)   $0.05/GB/month
Volume disk (running)   $0.10/GB/month
Volume disk (stopped)   $0.20/GB/month
Container disk          Temporary        Erased when a Pod stops

For our work, the right choice is usually a network volume, because model weights persist independently of the pod and do not need to be downloaded every time.
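To put the network-volume rate in context, caching model weights is cheap relative to GPU time (the 150 GB figure is an illustrative assumption, roughly several quantized checkpoints):

```python
# Monthly cost of caching model weights on a network volume (<1TB rate).
rate_per_gb_month = 0.07   # $/GB/month, from the table above
weights_gb = 150           # illustrative: several quantized checkpoints
monthly_cost = rate_per_gb_month * weights_gb
print(f"${monthly_cost:.2f}/month")  # $10.50/month
```

At roughly ten dollars a month, persistent weight storage is a rounding error next to GPU rental, so there is little reason to re-download weights on every launch.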

Capacity Behavior

Runpod is flexible and cost-effective, but it is not guaranteed reserved enterprise capacity. When a stopped Pod is restarted, it may show “Zero GPU Pods” if the original GPU is no longer available.10, 11

Practical takeaway: As long as a Pod is running, the reserved GPU is stable. Once stopped, that exact GPU may no longer be available. This is manageable for our batchable workload, but is still a reason not to rely on Runpod as the only production path.

Security and Privacy

Runpod has announced SOC 2 Type II certification and publishes security and compliance documentation.7, 9 Its default capacity is multi-tenant community infrastructure, with a Secure Cloud tier available for more controlled deployments, so configuration choices matter for student data.

Runpod Summary

Strengths

  • Low cost, especially for bursty or batch workloads
  • Flexible operating modes (Pods, Serverless, Flex/Active workers)
  • First-class serverless vLLM support
  • SOC 2 Type II certification announced7

Weaknesses

  • Capacity can vary; a stopped Pod may lose its exact GPU
  • Multi-tenant by default unless Secure Cloud is used
  • More operational abstraction in serverless mode


Lambda Deep Dive

Lambda is an AI-focused cloud company with roots in deep-learning workstations and servers. Today it provides cloud GPUs, on-demand instances, clusters, private cloud, and an enterprise trust/security program.12, 13

Main Operating Modes

A. On-Demand Cloud Instances

Individual Linux-based GPU-backed virtual machines.15 The closest counterpart to Runpod Pods. Ideal for stable inference servers, SSH-based administration, running vLLM or Ollama, and predictable hands-on control.

B. 1-Click Clusters

Production-ready clusters of 16 to 512 H100 GPUs.15, 13 Far more than we need for student-summary generation, but shows the platform can scale.

C. Private Cloud / Enterprise

Emphasizes single-tenant, shared-nothing architecture, SOC 2 Type II certification, and isolated/caged clusters.12 Relevant because our team is concerned about student data, IRB, and institutional review.

Pricing

Configuration            Price      Source
H100 (cluster pricing)   $2.76/hr   13
A100 PCIe 40GB           $1.99/hr   14
A6000 48GB               $1.09/hr   14
H100 SXM 80GB (8-GPU)    $3.99/hr   14

Lambda is generally positioned as a more stable, more enterprise-oriented AI cloud than Runpod, but usually not the cheapest option.

Security and Privacy

Lambda emphasizes single-tenant, shared-nothing architecture, SOC 2 Type II certification, and a public Trust Portal for security documentation.12, 16 That posture maps well onto institutional review of student-data handling.

Lambda Summary

Strengths

  • Enterprise-oriented trust and security posture (SOC 2 Type II, single-tenant options)
  • Stable, instance-oriented capacity behavior
  • Scales from single instances to large clusters if demand ever requires it

Weaknesses

  • Generally higher prices than Runpod
  • More “traditional cloud” cost profile, with fewer low-cost burst options


Runpod vs Lambda

In One Sentence

Runpod is a flexible, low-cost GPU utility; Lambda is a more stable, enterprise-oriented AI cloud.

Practical Comparison

Dimension           Runpod                                                Lambda
Best mental model   Flexible GPU utility                                  Stable AI cloud
Primary modes       Pods, Serverless, Flex/Active                         On-Demand instances, clusters, private cloud
Cost profile        Often lower, especially bursty                        Often higher, more “traditional cloud”
Capacity behavior   Can vary; stopped pods can lose GPUs                  More instance-oriented and stable
Security posture    Multi-tenant by default; stronger with Secure Cloud   Strong trust/security, single-tenant messaging
Best fit for us     Burst, backup, experiments                            Backup, stable cloud serving, enterprise discussions

Which Is Better for Our Project?

Choose Runpod for lower cost and flexibility. Choose Lambda if institutional comfort with security messaging matters more than raw cost. It is reasonable to test both.

What Is Involved in Actually Running an LLM on Cloud GPUs

This section covers the mechanics that stakeholders often do not see.

The Pieces

To run an open-weight LLM on cloud GPUs, we typically need:

  1. A GPU machine (Pod, instance, or cluster)
  2. A model source (typically Hugging Face Hub)
  3. An inference engine (typically vLLM)
  4. Persistent storage for model caching
  5. An HTTP API surface our application can call

Inference Engine

The most important serving engine for our purposes is vLLM, which is designed for high-throughput LLM serving and exposes an OpenAI-compatible API. vLLM supports a large range of open-source models and explicitly supports deployment on Runpod.17, 18

Deployment Pattern

  1. Launch pod/instance
  2. Mount or attach persistent storage
  3. Authenticate to Hugging Face if the model requires a license grant or token
  4. Start vllm serve with the chosen model
  5. Expose an internal or protected HTTP endpoint
  6. Send structured student-summary requests to that endpoint
  7. Log outputs and evaluation metadata under our control

Key distinction: This architecture is materially different from sending student data directly to an external API vendor’s model endpoint. We control the model, inference server, prompt layer, and retention behavior much more directly.
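The endpoint in steps 5–6 speaks the standard OpenAI chat-completions format. A sketch of the application-side request body (the model id, field names, and limits are illustrative, not our production values):

```python
import json

def build_summary_request(model: str, student_signals: dict) -> str:
    """Build an OpenAI-compatible chat-completions body for a vLLM endpoint."""
    body = {
        "model": model,  # e.g. "meta-llama/Llama-3.1-8B-Instruct" (example id)
        "messages": [
            {"role": "system",
             "content": "You write concise student performance summaries "
                        "from structured data. Do not invent facts."},
            {"role": "user", "content": json.dumps(student_signals)},
        ],
        "max_tokens": 300,    # keep output length controlled
        "temperature": 0.2,   # low temperature for consistent summaries
    }
    return json.dumps(body)

payload = build_summary_request(
    "meta-llama/Llama-3.1-8B-Instruct",
    {"activities_completed": 12, "avg_score": 0.83},
)
# POST payload to http://<inference-host>/v1/chat/completions
```

Because the format matches the OpenAI API, our application code can switch between the on-site endpoint, a cloud endpoint, and a hosted API by changing only the base URL.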

Open-Weight Model Sources and What Is Available

The main source for open-weight models today is Hugging Face Hub, which hosts over 2 million models.19

Common Sources and Registries

Source                  Description
Hugging Face Hub        Main repository for downloading and versioning model checkpoints19, 20
Vendor org pages        Meta Llama,21 Qwen,22 Mistral,23 Google Gemma,24 Microsoft Phi25
Vendor download pages   E.g., Meta provides Llama access through its official download process26
Ollama library          Convenient distribution path for local experimentation27, 28

Model Families Likely Most Relevant to Us

For student-performance summaries, strong candidates include instruction-tuned open-weight families such as Meta Llama, Qwen, Mistral, Google Gemma, and Microsoft Phi.

Note on terminology: Many teams say “open source models” when they really mean open-weight models. The important practical question is: can we download the weights, run them ourselves, and keep inference under our own control?

Model Size, GPU Requirements, and Quantization

Model Size vs GPU Memory

Model Size   VRAM (FP16)   VRAM (4-bit)   Typical Deployment
7B           ~14 GB        ~4–6 GB        Runs anywhere (local, DGX, cloud)
13B          ~26 GB        ~8–10 GB       Ideal for DGX and most GPUs
30B          ~60 GB        ~15–20 GB      DGX (quantized) or A100-class GPUs
70B          ~140 GB       ~35–45 GB      Multi-GPU or high-end cloud only
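The VRAM figures above follow a simple rule of thumb: parameter count times bytes per parameter, plus headroom for KV cache and activations. A rough estimator (the 20% overhead factor is an illustrative assumption):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM: params * bytes-per-param * overhead (illustrative)."""
    return params_billion * (bits / 8) * overhead

print(f"7B  @ FP16 : {estimate_vram_gb(7, 16):.1f} GB")   # ~16.8 GB
print(f"13B @ 4-bit: {estimate_vram_gb(13, 4):.1f} GB")   # ~7.8 GB
print(f"70B @ 4-bit: {estimate_vram_gb(70, 4):.1f} GB")   # ~42.0 GB
```

Real memory use also depends on context length and batch size, so treat these as planning numbers, not procurement specs.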

What This Means for DGX Spark

With its large unified memory, a DGX Spark-class machine can serve 7B and 13B models at full or 8-bit precision and 30B-class models at 4-bit quantization. That covers the model sizes our summarization task realistically needs; only 70B-class models would force us onto multi-GPU or cloud capacity.

Quantization Levels

Quantization            Memory Usage   Quality Impact
FP16 (full precision)   100%           Highest fidelity
8-bit                   ~50%           Nearly identical in most cases
4-bit                   ~25%           Small quality reduction
2-bit                   ~12%           Noticeable degradation

Recommended: 4-bit quantization is typically the best balance of memory efficiency, speed, and output quality for structured summarization tasks.

Key insight: For structured summarization tasks, smaller quantized models (7B–30B) often perform sufficiently well, making them practical for on-site deployment without requiring large-scale GPU infrastructure. This is why our proposed architecture does not depend on frontier-scale models.

Can Purchased On-Site Hardware Actually Do the Job?

Yes—very likely.

The summarization task we care about is not the same as frontier general reasoning. We are asking the model to read structured performance data, follow a fixed template, and produce a clear, consistent, factual summary.

That is a good fit for smaller and mid-sized instruct models.

Student summaries are especially compatible with smaller instruct models, 4-bit quantization, careful prompt design, and structured input. The biggest quality lever is often not raw model size; it is input organization, prompt design, and consistent deployment.
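“Structured input plus a fixed template” can be as simple as formatting pre-computed signals into one canonical prompt. A sketch (field names and wording are invented for illustration, not our production schema):

```python
# One fixed template; only the structured signal values vary per student.
TEMPLATE = (
    "Summarize this student's performance in 3-4 sentences.\n"
    "Units completed: {units_completed}\n"
    "Average score: {avg_score:.0%}\n"
    "Areas of difficulty: {difficulties}\n"
)

def build_prompt(signals: dict) -> str:
    return TEMPLATE.format(**signals)

prompt = build_prompt({
    "units_completed": 5,
    "avg_score": 0.72,
    "difficulties": "claim-evidence linking",
})
```

Keeping the template in one place is what makes output consistency tunable: we evaluate and revise one prompt, not a different prompt per caller.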

That reinforces the case for one primary on-site model stack.


Approximate Cost Thinking

These are not procurement quotes; they are planning-level approximations.

Cloud GPU Cost Characteristics

Per-summary cost on cloud GPU can be very low when we batch requests, keep output lengths controlled, use an appropriate open model, and avoid premium API pricing. The main variables are model size, actual input length, output token count, and whether the endpoint is warm or cold.
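A planning-level sanity check of per-summary cost (the throughput and token counts are assumptions for illustration, not benchmarks):

```python
# Back-of-envelope: cost per summary on a rented on-demand GPU.
gpu_cost_per_hour = 1.19        # on-demand A100 rate from the Runpod section
tokens_per_summary = 400        # input + output tokens per summary (assumed)
throughput_tok_per_sec = 1000   # batched serving throughput (assumed)

summaries_per_hour = throughput_tok_per_sec * 3600 / tokens_per_summary
cost_per_summary = gpu_cost_per_hour / summaries_per_hour
print(f"~{summaries_per_hour:,.0f} summaries/hr at ${cost_per_summary:.5f} each")
```

Even if real throughput were a tenth of the assumed figure, the per-summary cost would remain a fraction of a cent, which is the core economic contrast with per-token API pricing.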

Why Cloud Is Still Useful Even If We Buy Hardware

Cloud GPUs remain valuable as failover when on-site hardware is down, as burst capacity when demand spikes, and as a low-commitment way to benchmark models and GPU configurations before committing to a purchase.

Why On-Site Becomes Attractive Over Time

For recurring workloads, owned hardware changes the economics from “pay every time the model runs” into “pay once for the machine, then mostly pay power/ops.”

This is especially attractive when the institution can keep using the hardware after the initial project period.


Data Privacy, IRB, and Student Data

This is one of the most important sections for internal discussion.

The Strongest Privacy Position

Run the model on-site, on hardware we control, inside network and access controls we already manage. That gives the clearest answers to the questions reviewers will ask: where does student data go, who can access it, and how long is it retained?

Why Cloud-Hosted Open Models Are Different from API Use

Using a cloud GPU to run our own open-weight model is not the same as using a third-party hosted LLM API. With cloud-hosted open models, we choose the model, we control the inference server and its logging, and student data is processed on a machine we administer rather than retained by a model vendor.

However, the infrastructure still belongs to a third party, the deployment must be configured correctly, and institutional review may still be required.

What SOC 2 Type II Certification Means — and What It Does Not

What it is

SOC 2 is an audit framework developed by the AICPA for evaluating controls related to security, availability, processing integrity, confidentiality, and privacy. A Type II report evaluates whether controls were operating effectively over a period of time.29, 30

What it helps with

  • Vendor due diligence and institutional security review
  • Evidence that a provider’s security controls operated effectively over an audit period
  • Procurement and trust conversations with stakeholders

What it does NOT mean

  • It is not compliance with education-specific privacy rules (e.g., FERPA) or IRB approval
  • It does not authorize any particular use of student records
  • It does not guarantee that our specific deployment is configured securely

Bottom line: SOC 2 Type II certification is best understood as a security/compliance maturity signal, not as an automatic green light for sensitive student-data processing.

Practical Privacy Recommendations

  1. Primary summaries run on-site whenever possible.
  2. Cloud is backup, not default.
  3. If cloud is used, use the provider’s more secure configuration (Runpod Secure Cloud, Lambda single-tenant/private options).
  4. Send only the minimum required data to the summarization service.
  5. Prefer structured performance signals over raw logs when possible.
  6. Log and retain outputs deliberately, not by default.
  7. Review IRB / institutional requirements before production use of student records in cloud environments.
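Recommendation 4 can be enforced mechanically with an allowlist applied before any record leaves project infrastructure. A sketch (field names are hypothetical):

```python
# Keep only allowlisted, non-identifying fields; drop names, IDs, raw logs.
ALLOWED_FIELDS = {"units_completed", "avg_score", "difficulties"}

def minimize(record: dict) -> dict:
    """Return a copy containing only explicitly allowlisted fields."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "student_name": "Jane Doe",  # identifying: dropped
    "student_id": "S-1042",      # identifying: dropped
    "units_completed": 5,
    "avg_score": 0.72,
}
safe = minimize(raw)  # {'units_completed': 5, 'avg_score': 0.72}
```

An allowlist fails closed: any new field added upstream is excluded until someone deliberately approves it, which is the right default for student data.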

We can get the benefits of LLM summarization without making cloud or public APIs the default home of student data.


Why the Hybrid On-Site + Cloud-Backup Architecture Is Best

Problem               How the Hybrid Architecture Solves It
Consistency           Tune prompts and context for one primary model environment and keep it stable
Privacy               Sensitive student data stays on-site by default
Reliability           If on-site hardware is unavailable, fail over to Runpod or Lambda
Procurement / Value   The institution retains the hardware asset while preserving operational flexibility
Cost                  Avoid making every summary depend on a paid external API call
Growth                If demand spikes, cloud covers overflow

Proposed Architecture for the Team

Primary Environment

Primary: On-site hardware we control (e.g., NVIDIA DGX Spark) running one standardized open-weight model behind an OpenAI-compatible vLLM endpoint.

Secondary / Failover Environment

Backup: Runpod or Lambda GPU capacity running the same model and serving stack, used only when the on-site service is unavailable or backlogged.

Operational Pattern

  1. Student performance data is preprocessed into structured features/signals
  2. Summarization requests go to the on-site inference service first
  3. If on-site inference is unavailable or backlogged, the request is routed to the backup cloud environment
  4. Summaries are stored under our normal project controls
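Steps 2–3 amount to a primary-then-backup routing chain. A minimal sketch with stand-in callables in place of real HTTP clients:

```python
# Try each backend in order; return the first success.
def route_summary(request: dict, backends: list) -> str:
    last_error = None
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # treat any failure as unavailable/backlogged
            last_error = exc
    raise RuntimeError(f"all backends failed: {last_error}")

def onsite(request):        # stand-in for the on-site vLLM endpoint
    raise ConnectionError("on-site inference down")

def cloud_backup(request):  # stand-in for a Runpod/Lambda endpoint
    return "summary from backup"

result = route_summary({"signals": {"units_completed": 5}}, [onsite, cloud_backup])
# result == "summary from backup"
```

Because both backends expose the same API surface, the router needs no per-backend logic beyond ordering; a real implementation would also add timeouts and logging of which backend served each request.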

Concrete Recommendation to Present to the Team

Run open-weight LLMs primarily on-site on purchased hardware, and maintain a cloud-based backup/failover path using Runpod or Lambda.

Recommendation Details

  1. Purchase on-site hardware (e.g., NVIDIA DGX Spark) as the primary inference environment.
  2. Standardize on one open-weight model, one quantization level, and one prompt template.
  3. Maintain a tested failover deployment of the same stack on Runpod or Lambda.
  4. Send only minimized, structured data to any off-site endpoint.

Why This Is the Best Approach

Because it gives us data control, output consistency, cost predictability, reliability, and a retained institutional asset.

This is the most defensible architecture technically, operationally, and institutionally.

Sources

  1. DigitalOcean, “10 Leading AI Cloud Providers for Developers in 2026.” digitalocean.com
  2. Runpod pricing page. runpod.io/pricing
  3. Runpod GPU pricing page. runpod.io/gpu-pricing
  4. Runpod serverless pricing docs. docs.runpod.io/serverless/pricing
  5. Runpod storage/network volume pricing docs. docs.runpod.io/pods/pricing and docs.runpod.io/storage/network-volumes
  6. Runpod serverless vLLM docs. docs.runpod.io/serverless/vllm
  7. Runpod SOC 2 Type II announcement. runpod.io/blog
  8. Runpod serverless vLLM docs. docs.runpod.io
  9. Runpod security/compliance docs. docs.runpod.io/references/security-and-compliance
  10. Runpod “Zero GPU Pods on restart.” docs.runpod.io
  11. Runpod Pod migration docs. docs.runpod.io
  12. Lambda home page / trust & security. lambda.ai
  13. Lambda pricing page. lambda.ai/pricing
  14. Lambda pricing page, public instance examples. lambda.ai/pricing
  15. Lambda public cloud docs. docs.lambda.ai/public-cloud
  16. Lambda Trust Portal. trust.lambda.ai
  17. vLLM supported models docs. docs.vllm.ai
  18. vLLM Runpod deployment docs. docs.vllm.ai
  19. Hugging Face Hub documentation. huggingface.co/docs/hub
  20. Hugging Face Model Hub docs. huggingface.co/docs/hub/models
  21. Meta Llama on Hugging Face. huggingface.co/meta-llama
  22. Qwen on Hugging Face. huggingface.co/Qwen
  23. Mistral AI on Hugging Face. huggingface.co/mistralai
  24. Google Gemma on Hugging Face. huggingface.co/google/gemma
  25. Microsoft Phi on Hugging Face. huggingface.co/microsoft/phi-4
  26. Meta official Llama downloads. llama.com/llama-downloads
  27. Ollama home page. ollama.com
  28. Ollama model library. ollama.com/library
  29. AICPA SOC 2 overview. aicpa.org
  30. AWS SOC 2 explanation. aws.amazon.com