Ollama Modelfile List

This is a local Ollama installation running on a powerful local GPU.

This page serves as a comprehensive technical registry and capabilities guide for the locally installed Ollama Large Language Models (LLMs). It outlines their core architecture, performance profiles, memory usage, and safety alignment status.

Local Hardware Performance Overview

The following estimates assume full GPU offloading utilizing a high-end local GPU layout. Splitting models onto system RAM will heavily degrade these numbers.

Scope / Scale	Parameter Range	Avg. VRAM Footprint	Speed (Tokens/sec)	Context Latency (TTFT)
High-Capability	20B – 31B	16 GB – 24 GB	20 – 35 tok/s	Moderate (~1.5s - 3.0s)
Mid-Range / Specialist	8B – 16B	6 GB – 14 GB	40 – 65 tok/s	Low (~0.8s - 1.5s)
Low-Resource / Edge	1B – 7B	2 GB – 6 GB	70 – 120+ tok/s	Near-Instant (<0.5s)

---

Capability Comparison Matrix

Ratings are practical local-workflow estimates on a 1-5 scale (5 = strongest). Speech and Image scores reflect end-to-end usefulness when paired with local ASR/TTS or VLM pipelines.

Model	Text	Speech	Python	Image	Speed	Notes
gemma-4-31b-uncensored	5	2	4	1	2	Top-tier long-form reasoning and writing depth.
qwen3-coder:30b	4	1	5	1	2	Best overall local coding model for complex repos.
qwen3.6:27b	5	2	4	1	3	Strong generalist for reasoning and multilingual tasks.
gemma4:26b	5	2	4	3	3	High-quality aligned model; useful in multimodal stacks.
gpt-oss:20b	4	2	4	1	3	Fast all-purpose assistant behavior.
hf.co/mradermacher/Mistral-Nemo-Instruct-2407-abliterated-i1-GGUF:Q5_K_M	4	1	3	1	4	Fast and permissive instruction-following.
hf.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF:Q5_K_M	4	2	3	1	4	Stable, predictable 12B instruction model.
MFDoom/deepseek-coder-v2-tool-calling:16b	3	1	5	1	4	Great for agent/tool-call workflows.
deepseek-coder-v2:16b	3	1	5	1	4	Excellent raw coding throughput.
qwen3:14b	4	2	4	1	4	Balanced mid-size general model.
qwen2.5:14b	4	2	4	1	4	Reliable instruction quality with strong consistency.
mistral-nemo:12b	4	2	3	1	4	Strong multilingual and summarization performance.
qwen2.5-coder:14b	3	1	5	1	4	Mature coding model with high syntax reliability.
gemma3:12b	4	2	3	1	4	Efficient everyday productivity model.
gemma4:e4b	4	2	3	3	4	Lightweight Gemma 4 variant with multimodal utility.
gemma2:9b	4	2	3	1	4	Fast, low-overhead generalist. Fast grammar check
hf.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF:Q5_K_M	4	2	3	1	4	Creative and permissive conversational behavior.
gemma4-8b-uncensored:latest	4	2	3	2	4	Unrestricted 8B with good speed/quality balance.
dolphin-2.9-8b:latest	3	1	2	1	4	Unfiltered assistant style; weaker on precise coding.
hf.co/Qwen/Qwen3-8B-GGUF:Q4_K_M	3	2	3	1	5	Lean baseline for low-latency local chat.
qwen2.5-coder:7b	2	1	4	1	5	Great speed for IDE autocomplete loops.
qwen7b-32k:latest	3	1	2	1	4	Useful when very long context is required.
llama3.2:latest	3	2	2	1	5	Ultra-fast small general assistant.
qwen2.5-coder:3b	2	1	3	1	5	Compact coding model for constrained hardware.
qwen3b-high-ctx:latest	3	1	2	1	5	Good small-model option for bigger contexts.
deepseek-r1:1.5b	2	1	1	1	5	Tiny reasoning model; best for lightweight logic tests.
smollm2-uncensored:latest	2	1	1	1	5	Minimal-footprint uncensored utility model.

---

High-Capability & Large-Scale Models (15B - 31B)

gemma-4-31b-uncensored

Purpose: A highly capable, large-scale model based on the Google Gemma 4 architecture, modified to bypass standard safety guardrails.
Best For: Complex reasoning, deep creative writing, philosophical exploration, and processing intricate multi-step prompts without refusal barriers.
Resource Profile:
- Parameters: ~31 Billion
- Memory/VRAM Usage: ~20.5 GB (Quantized)
- Performance & Latency: Generates ~22-26 tokens/sec. High reasoning overhead can cause a slightly delayed Time-To-First-Token (TTFT) of 2-3 seconds.
Censorship Status: [No] Uncensored / Abliterated – All standard safety alignment filters have been removed.

qwen3-coder:30b

Purpose: Alibaba's flagship open-weights coding model, optimized for enterprise-grade software engineering, architecture design, and complex debugging.
Best For: Writing full-stack applications, handling multi-file repository contexts, explaining complex algorithms, and optimizing legacy code.
Resource Profile:
- Parameters: ~30 Billion
- Memory/VRAM Usage: ~19.8 GB
- Performance & Latency: Achieves ~25-30 tokens/sec. Processing long context inputs (codebases) will cause pre-fill latency to scale upward.
Censorship Status: [Yes] Censored – Standard alignment remains active to prevent malicious code generation (e.g., malware design).

qwen3.6:27b

Purpose: A premium, heavy-duty generalist model designed for advanced reasoning, mathematical logic, translation, and analytical workflows.
Best For: High-fidelity data analysis, complex summarizing, multi-language translation, and acting as a central orchestration agent.
Resource Profile:
- Parameters: ~27 Billion
- Memory/VRAM Usage: ~18.2 GB
- Performance & Latency: ~28-33 tokens/sec. Consistent, stable token execution pacing.
Censorship Status: [Yes] Censored – Adheres to strict ethical, helpful, and harmless boundaries.

gemma4:26b

Purpose: The standard, fully-aligned iteration of Google's 26B Gemma 4 model, balancing massive parameter depth with strict safety guardrails.
Best For: Enterprise deployments, academic research assistance, and standard corporate productivity tools where safety compliance is required.
Resource Profile:
- Parameters: ~26 Billion
- Memory/VRAM Usage: ~17.5 GB
- Performance & Latency: ~30 tokens/sec. Safety evaluations add a marginal execution latency overhead to the initial response generation.
Censorship Status: [Yes] Strictly Censored – Contains default Google RLHF guardrails against sensitive, harmful, or controversial topics.

gpt-oss:20b

Purpose: An open-source generalist model trained to mimic commercial GPT-style interactions across diverse task types.
Best For: General brainstorming, text transformation, data cleaning, and everyday agentic tasks.
Resource Profile:
- Parameters: ~20 Billion
- Memory/VRAM Usage: ~14.0 GB
- Performance & Latency: Highly responsive at ~35 tokens/sec with low overall processing latency.
Censorship Status: [Partial] Lightly Censored – Usually features basic ethical boundaries but is significantly more permissive than strict corporate models.

---

Mid-Range & Specialist Models (8B - 16B)

hf.co/mradermacher/Mistral-Nemo-Instruct-2407-abliterated-i1-GGUF:Q5_K_M

Purpose: A precision-tuned 12B parameter model combining Mistral's architecture with "abliteration" techniques to erase negative refusal weights.
Best For: Unrestricted academic research, writing intense fiction, roleplay, and analyzing controversial text datasets.
Resource Profile:
- Parameters: 12.2 Billion
- Memory/VRAM Usage: ~8.9 GB (Q5_K_M layout)
- Performance & Latency: Swift ~45-52 tokens/sec response generation. Near-instant text generation initialization.
Censorship Status: [No] Uncensored / Abliterated – Chemically stripped of refusal behaviors while retaining structural coherence.

hf.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF:Q5_K_M

Purpose: The cleanly quantized, official instruction-tuned variant of the highly regarded Mistral-Nemo 12B model.
Best For: General instruction following, structured text generation, multi-lingual translation, and logic puzzles.
Resource Profile:
- Parameters: 12.2 Billion
- Memory/VRAM Usage: ~8.9 GB (Q5_K_M layout)
- Performance & Latency: Highly predictable ~45-52 tokens/sec generation speed with negligible pre-fill lag.
Censorship Status: [Yes] Censored – Retains default Mistral safety alignments.

MFDoom/deepseek-coder-v2-tool-calling:16b

Purpose: A custom Mixture-of-Experts (MoE) fine-tune specifically optimized to act as an agentic backend capable of executing external function calls.
Best For: AI agents, local tool integration, and automated pipeline execution.
Resource Profile:
- Parameters: 16 Billion Total (uses ~3.3B active parameters per token)
- Memory/VRAM Usage: ~11.2 GB
- Performance & Latency: Blazing fast inference speeds due to MoE architecture, averaging ~55-65 tokens/sec. High initial layout parsing efficiency.
Censorship Status: [Yes] Censored – Aligned to prevent execution of destructive or malicious local commands.

deepseek-coder-v2:16b

Purpose: The foundational Mixture-of-Experts (MoE) coding model from DeepSeek, prized for high-efficiency programming generation.
Best For: Continuous inline code completion, rapid prototyping, code refactoring, and general script writing.
Resource Profile:
- Parameters: 16 Billion Total
- Memory/VRAM Usage: ~11.2 GB
- Performance & Latency: ~55-65 tokens/sec. Optimized for sub-second text stream delivery loops.
Censorship Status: [Yes] Censored – Standard guardrails against generating malware or exploits.

qwen3:14b

Purpose: The standard mid-tier iteration of the Qwen 3 general-purpose intelligence framework.
Best For: Summarizing lengthy documents, conversational search support, copy editing, and medium-complexity script writing.
Strengths:
Resource Profile:
- Parameters: ~14 Billion
- Memory/VRAM Usage: ~10.1 GB
- Performance & Latency: Runs fluidly at ~42-48 tokens/sec. Moderate latency scaling during lengthy prompt ingestion phases.
Censorship Status: [Yes] Censored – Fully aligned with default safety training.

qwen2.5:14b

Purpose: A mature 14B general-purpose model from the Qwen 2.5 family focused on reliable instruction following and balanced reasoning.
Best For: General assistant tasks, drafting and rewriting, translation, and broad knowledge Q&A.
Resource Profile:
- Parameters: ~14 Billion
- Memory/VRAM Usage: ~9.8 GB
- Performance & Latency: Consistent ~40-48 tokens/sec with modest prompt prefill latency on longer contexts.
Censorship Status: [Yes] Censored – Standard alignment behavior.

mistral-nemo:12b

Purpose: The native Ollama distribution of Mistral-Nemo 12B, optimized for instruction following and strong multilingual quality.
Best For: General chat, document transformation, summarization, translation, and medium-depth reasoning.
Resource Profile:
- Parameters: ~12 Billion
- Memory/VRAM Usage: ~8.5 GB
- Performance & Latency: Typically ~44-52 tokens/sec with fast startup and steady streaming.
Censorship Status: [Yes] Censored – Retains default instruction-tuned safety alignment.

qwen2.5-coder:14b

Purpose: The established, highly mature 14B coding engine from the Qwen 2.5 generation.
Best For: Stable software development environments requiring predictable, reliable syntax generation.
Resource Profile:
- Parameters: 14.8 Billion
- Memory/VRAM Usage: ~10.5 GB
- Performance & Latency: Stable ~42-48 tokens/sec. Excellent processing consistency throughout extended syntax blocks.
Censorship Status: [Yes] Censored – Features standard safety controls.

gemma3:12b

Purpose: A legacy mid-tier general model from Google’s third-generation open weights release.
Best For: Everyday office automation tasks, document formatting, and general QA datasets.
Resource Profile:
- Parameters: 12.2 Billion
- Memory/VRAM Usage: ~8.8 GB
- Performance & Latency: ~45 tokens/sec. Low baseline generation latency.
Censorship Status: [Yes] Censored – Governed by standard Google safety alignment.

gemma4:e4b

Purpose: An experimental or early-quantized/preview variant of the Gemma 4 framework optimized for edge environments.
Best For: Comparing structural generation changes between Gemma versions or running low-overhead general tasks.
Resource Profile:
- Parameters: ~12 Billion Equivalent
- Memory/VRAM Usage: ~7.9 GB
- Performance & Latency: Spits out tokens highly efficiently at ~50 tokens/sec. Rapid initialization cycles.
Censorship Status: [Yes] Censored – Standard guardrails apply.

gemma2:9b

Purpose: Google’s highly successful 9B parameter generalist model from the Gemma 2 era.
Best For: Low-resource conversational assistance, flashcard generation, and quick text summarization.
Resource Profile:
- Parameters: 9.2 Billion
- Memory/VRAM Usage: ~6.8 GB
- Performance & Latency: Sharp, snappy output reaching ~55 tokens/sec. Minimal processing delay.
Censorship Status: [Yes] Censored – Fully aligned.

hf.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF:Q5_K_M

Purpose: A premium, creative fine-tune of Llama 3.1 8B by Nous Research, tailored for advanced roleplay, agentic steps, and complex instruction following.
Best For: Creative writing, world-building, intricate multi-turn roleplay, and agentic workflows.
Resource Profile:
- Parameters: 8.0 Billion
- Memory/VRAM Usage: ~5.9 GB (Q5_K_M execution footprint)
- Performance & Latency: Highly responsive ~60 tokens/sec stream velocity. Instant initial output response behavior.
Censorship Status: [Partial] Highly Permissive – While not aggressively abliterated, it is fine-tuned to be neutral, non-preachy, and almost entirely free of false-positive refusals.

gemma4-8b-uncensored

Purpose: A modified 8B Gemma 4 base designed to offer modern reasoning power without any topic restriction.
Best For: Running unfiltered writing experiments or analysis tasks on low-spec hardware.
Resource Profile:
- Parameters: ~8 Billion
- Memory/VRAM Usage: ~5.5 GB
- Performance & Latency: Snappy ~58-64 tokens/sec. Zero latency blocks.
Censorship Status: [No] Uncensored – Safety tuning bypassed.

dolphin-2.9-8b:latest

Purpose: Eric Hartford's iconic Dolphin fine-tune applied to an 8B base, explicitly optimized to be helpful, harmless, and completely unbiased/unfiltered.
Best For: Unrestricted hacking/penetration testing research, unfiltered creative prose, and raw data transformations.
Resource Profile:
- Parameters: 8.0 Billion
- Memory/VRAM Usage: ~5.3 GB
- Performance & Latency: Fast ~60 tokens/sec stream output rate. Exceptionally low prompt parsing delays.
Censorship Status: [No] Uncensored – Fully uncensored by design.

hf.co/Qwen/Qwen3-8B-GGUF:Q4_K_M

Purpose: The lean, highly quantized 8B baseline of the Qwen 3 general series.
Best For: Basic chat utilities, lightweight translation scripts, and low-latency local assistants.
Resource Profile:
- Parameters: 8.2 Billion
- Memory/VRAM Usage: ~5.2 GB (Q4_K_M sweet spot footprint)
- Performance & Latency: Swift ~65 tokens/sec throughput with immediate text streaming characteristics.
Censorship Status: [Yes] Censored – Standard default alignment.

---

Low-Resource & Edge Models (1B - 7B)

qwen2.5-coder:7b

Purpose: A highly optimized 7B parameter programming specialist designed to run efficiently on standard laptops.
Best For: Real-time IDE integration, autocomplete loops, and small-scale scripting.
Resource Profile:
- Parameters: 7.6 Billion
- Memory/VRAM Usage: ~4.8 GB
- Performance & Latency: Blazing fast performance at ~75-85 tokens/sec. Perfect for immediate inline coding block autocomplete behaviors.
Censorship Status: [Yes] Censored – Basic coding safety limits apply.

qwen7b-32k:latest

Purpose: A specialized 7B variant configured specifically to ingest and remember massive text inputs up to a 32,000 token window.
Best For: Ingesting whole research papers, large source-code files, or extensive chat histories in a single prompt.
Resource Profile:
- Parameters: 7.2 Billion
- Memory/VRAM Usage: ~4.7 GB base allocation. Maxing out the context window to 32k tokens causes the pre-allocated Key-Value (KV) cache memory to expand significantly (up to an additional 4-8 GB of VRAM dynamically).
- Performance & Latency: ~70 tokens/sec under standard usage. Heavy prompt ingestion will scale initial TTFT up to several seconds during token processing phases.
Censorship Status: [Yes] Censored – Standard Qwen alignment.

llama3.2:latest

Purpose: Meta's highly popular, lightweight 3B generalist model designed for edge computing and mobile deployment.
Best For: Ultra-fast everyday text tasks, basic email formatting, and maintaining a low-footprint background desktop assistant.
Resource Profile:
- Parameters: 3.2 Billion
- Memory/VRAM Usage: ~2.5 GB
- Performance & Latency: Extreme text throughput speeds averaging ~100-120+ tokens/sec. Instantaneous response execution.
Censorship Status: [Yes] Censored – Adheres strictly to Meta's safety guidelines.

qwen2.5-coder:3b

Purpose: An ultra-compact coding model tailored for low-resource hardware, suitable for embedded systems or background IDE plugins.
Best For: Quick syntax checks, simple function writing, and short utility scripts.
Resource Profile:
- Parameters: 3.1 Billion
- Memory/VRAM Usage: ~2.4 GB
- Performance & Latency: ~110 tokens/sec. Text returns arrive immediately with zero typing lag.
Censorship Status: [Yes] Censored – Basic alignment active.

qwen3b-high-ctx:latest

Purpose: A 3B parameter model optimized specifically for handling enlarged context lengths on constrained hardware.
Best For: Reading lengthy logs or documentation files on machines lacking dedicated GPU hardware.
Resource Profile:
- Parameters: ~3 Billion
- Memory/VRAM Usage: ~2.3 GB base (expands based on operational context volume loading).
- Performance & Latency: ~95-105 tokens/sec under typical loads.
Censorship Status: [Yes] Censored – Standard alignment.

deepseek-r1:1.5b

Purpose: A tiny, distilled reasoning model featuring internal chain-of-thought ("thinking") capabilities.
Best For: Basic logical problem solving, simple math validation, and testing deep reasoning architectures on extremely weak hardware.
Resource Profile:
- Parameters: 1.5 Billion
- Memory/VRAM Usage: ~1.1 GB
- Performance & Latency: Outrageously fast formatting speeds up to ~130 tokens/sec. However, note that total latency is lengthened because the model generates internal thinking tokens before revealing the raw target answer text.
Censorship Status: [Yes] Censored – Retains default alignment protocols during its thinking phase.

smollm2-uncensored:latest

Purpose: An ultra-compact 1.7B parameter model optimized for mobile or CPU-only setups, stripped of systemic safety refusals.
Best For: Unfiltered basic text generation, edge-device testing, and quick offline note restructuring.
Resource Profile:
- Parameters: 1.7 Billion
- Memory/VRAM Usage: ~1.3 GB
- Performance & Latency: Absolute maximum speed configuration running at ~130+ tokens/sec with sub-millisecond start delivery times.
Censorship Status: [No] Uncensored – Safety mechanisms fully removed.