|
|
| Line 1: |
Line 1: |
| '''This is a local Ollama installation running on a powerful local GPU.'''
| | Only on Test |
| | |
| __NOTOC__
| |
| This page serves as a comprehensive technical registry and capabilities guide for the locally installed Ollama Large Language Models (LLMs). It outlines their core architecture, performance profiles, memory usage, and safety alignment status.
| |
| | |
| == Local Hardware Performance Overview ==
| |
| The following estimates assume full GPU offloading utilizing a high-end local GPU layout. Splitting models onto system RAM will heavily degrade these numbers.
| |
| | |
| {| class="wikibase" style="width:100%; border:1px solid #ccc; border-collapse:collapse; text-align:left;"
| |
| ! Scope / Scale
| |
| ! Parameter Range
| |
| ! Avg. VRAM Footprint
| |
| ! Speed (Tokens/sec)
| |
| ! Context Latency (TTFT)
| |
| |-
| |
| | '''High-Capability'''
| |
| | 20B – 31B
| |
| | 16 GB – 24 GB
| |
| | 20 – 35 tok/s
| |
| | Moderate (~1.5s - 3.0s)
| |
| |-
| |
| | '''Mid-Range / Specialist'''
| |
| | 8B – 16B
| |
| | 6 GB – 14 GB
| |
| | 40 – 65 tok/s
| |
| | Low (~0.8s - 1.5s)
| |
| |-
| |
| | '''Low-Resource / Edge'''
| |
| | 1B – 7B
| |
| | 2 GB – 6 GB
| |
| | 70 – 120+ tok/s
| |
| | Near-Instant (<0.5s)
| |
| |}
| |
| | |
| ---
| |
| | |
| == High-Capability & Large-Scale Models (15B - 31B) ==
| |
| | |
| === gemma-4-31b-uncensored ===
| |
| * '''Purpose:''' A highly capable, large-scale model based on the Google Gemma 4 architecture, modified to bypass standard safety guardrails.
| |
| * '''Best For:''' Complex reasoning, deep creative writing, philosophical exploration, and processing intricate multi-step prompts without refusal barriers.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~31 Billion
| |
| ** '''Memory/VRAM Usage:''' ~20.5 GB (Quantized)
| |
| ** '''Performance & Latency:''' Generates ~22-26 tokens/sec. High reasoning overhead can cause a slightly delayed Time-To-First-Token (TTFT) of 2-3 seconds.
| |
| * '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored / Abliterated</span> – All standard safety alignment filters have been removed.
| |
| | |
| === qwen3-coder:30b ===
| |
| * '''Purpose:''' Alibaba's flagship open-weights coding model, optimized for enterprise-grade software engineering, architecture design, and complex debugging.
| |
| * '''Best For:''' Writing full-stack applications, handling multi-file repository contexts, explaining complex algorithms, and optimizing legacy code.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~30 Billion
| |
| ** '''Memory/VRAM Usage:''' ~19.8 GB
| |
| ** '''Performance & Latency:''' Achieves ~25-30 tokens/sec. Processing long context inputs (codebases) will cause pre-fill latency to scale upward.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard alignment remains active to prevent malicious code generation (e.g., malware design).
| |
| | |
| === qwen3.6:27b ===
| |
| * '''Purpose:''' A premium, heavy-duty generalist model designed for advanced reasoning, mathematical logic, translation, and analytical workflows.
| |
| * '''Best For:''' High-fidelity data analysis, complex summarizing, multi-language translation, and acting as a central orchestration agent.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~27 Billion
| |
| ** '''Memory/VRAM Usage:''' ~18.2 GB
| |
| ** '''Performance & Latency:''' ~28-33 tokens/sec. Consistent, stable token execution pacing.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Adheres to strict ethical, helpful, and harmless boundaries.
| |
| | |
| === gemma4:26b ===
| |
| * '''Purpose:''' The standard, fully-aligned iteration of Google's 26B Gemma 4 model, balancing massive parameter depth with strict safety guardrails.
| |
| * '''Best For:''' Enterprise deployments, academic research assistance, and standard corporate productivity tools where safety compliance is required.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~26 Billion
| |
| ** '''Memory/VRAM Usage:''' ~17.5 GB
| |
| ** '''Performance & Latency:''' ~30 tokens/sec. Safety evaluations add a marginal execution latency overhead to the initial response generation.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Strictly Censored</span> – Contains default Google RLHF guardrails against sensitive, harmful, or controversial topics.
| |
| | |
| === gpt-oss:20b ===
| |
| * '''Purpose:''' An open-source generalist model trained to mimic commercial GPT-style interactions across diverse task types.
| |
| * '''Best For:''' General brainstorming, text transformation, data cleaning, and everyday agentic tasks.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~20 Billion
| |
| ** '''Memory/VRAM Usage:''' ~14.0 GB
| |
| ** '''Performance & Latency:''' Highly responsive at ~35 tokens/sec with low overall processing latency.
| |
| * '''Censorship Status:''' <span style="color:orange; font-weight:bold;">[Partial] Lightly Censored</span> – Usually features basic ethical boundaries but is significantly more permissive than strict corporate models.
| |
| | |
| ---
| |
| | |
| == Mid-Range & Specialist Models (8B - 16B) ==
| |
| | |
| === hf.co/mradermacher/Mistral-Nemo-Instruct-2407-abliterated-i1-GGUF:Q5_K_M ===
| |
| * '''Purpose:''' A precision-tuned 12B parameter model combining Mistral's architecture with "abliteration" techniques to erase negative refusal weights.
| |
| * '''Best For:''' Unrestricted academic research, writing intense fiction, roleplay, and analyzing controversial text datasets.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 12.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~8.9 GB (Q5_K_M layout)
| |
| ** '''Performance & Latency:''' Swift ~45-52 tokens/sec response generation. Near-instant text generation initialization.
| |
| * '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored / Abliterated</span> – Chemically stripped of refusal behaviors while retaining structural coherence.
| |
| | |
| === hf.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF:Q5_K_M ===
| |
| * '''Purpose:''' The cleanly quantized, official instruction-tuned variant of the highly regarded Mistral-Nemo 12B model.
| |
| * '''Best For:''' General instruction following, structured text generation, multi-lingual translation, and logic puzzles.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 12.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~8.9 GB (Q5_K_M layout)
| |
| ** '''Performance & Latency:''' Highly predictable ~45-52 tokens/sec generation speed with negligible pre-fill lag.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Retains default Mistral safety alignments.
| |
| | |
| === MFDoom/deepseek-coder-v2-tool-calling:16b ===
| |
| * '''Purpose:''' A custom Mixture-of-Experts (MoE) fine-tune specifically optimized to act as an agentic backend capable of executing external function calls.
| |
| * '''Best For:''' AI agents, local tool integration, and automated pipeline execution.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 16 Billion Total (uses ~3.3B active parameters per token)
| |
| ** '''Memory/VRAM Usage:''' ~11.2 GB
| |
| ** '''Performance & Latency:''' Blazing fast inference speeds due to MoE architecture, averaging ~55-65 tokens/sec. High initial layout parsing efficiency.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Aligned to prevent execution of destructive or malicious local commands.
| |
| | |
| === deepseek-coder-v2:16b ===
| |
| * '''Purpose:''' The foundational Mixture-of-Experts (MoE) coding model from DeepSeek, prized for high-efficiency programming generation.
| |
| * '''Best For:''' Continuous inline code completion, rapid prototyping, code refactoring, and general script writing.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 16 Billion Total
| |
| ** '''Memory/VRAM Usage:''' ~11.2 GB
| |
| ** '''Performance & Latency:''' ~55-65 tokens/sec. Optimized for sub-second text stream delivery loops.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard guardrails against generating malware or exploits.
| |
| | |
| === qwen3:14b ===
| |
| * '''Purpose:''' The standard mid-tier iteration of the Qwen 3 general-purpose intelligence framework.
| |
| * '''Best For:''' Summarizing lengthy documents, conversational search support, copy editing, and medium-complexity script writing.
| |
| * '''Strengths:'''
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~14 Billion
| |
| ** '''Memory/VRAM Usage:''' ~10.1 GB
| |
| ** '''Performance & Latency:''' Runs fluidly at ~42-48 tokens/sec. Moderate latency scaling during lengthy prompt ingestion phases.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Fully aligned with default safety training.
| |
| | |
| === qwen2.5-coder:14b ===
| |
| * '''Purpose:''' The established, highly mature 14B coding engine from the Qwen 2.5 generation.
| |
| * '''Best For:''' Stable software development environments requiring predictable, reliable syntax generation.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 14.8 Billion
| |
| ** '''Memory/VRAM Usage:''' ~10.5 GB
| |
| ** '''Performance & Latency:''' Stable ~42-48 tokens/sec. Excellent processing consistency throughout extended syntax blocks.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Features standard safety controls.
| |
| | |
| === gemma3:12b ===
| |
| * '''Purpose:''' A legacy mid-tier general model from Google’s third-generation open weights release.
| |
| * '''Best For:''' Everyday office automation tasks, document formatting, and general QA datasets.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 12.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~8.8 GB
| |
| ** '''Performance & Latency:''' ~45 tokens/sec. Low baseline generation latency.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Governed by standard Google safety alignment.
| |
| | |
| === gemma4:e4b ===
| |
| * '''Purpose:''' An experimental or early-quantized/preview variant of the Gemma 4 framework optimized for edge environments.
| |
| * '''Best For:''' Comparing structural generation changes between Gemma versions or running low-overhead general tasks.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~12 Billion Equivalent
| |
| ** '''Memory/VRAM Usage:''' ~7.9 GB
| |
| ** '''Performance & Latency:''' Spits out tokens highly efficiently at ~50 tokens/sec. Rapid initialization cycles.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard guardrails apply.
| |
| | |
| === gemma2:9b ===
| |
| * '''Purpose:''' Google’s highly successful 9B parameter generalist model from the Gemma 2 era.
| |
| * '''Best For:''' Low-resource conversational assistance, flashcard generation, and quick text summarization.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 9.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~6.8 GB
| |
| ** '''Performance & Latency:''' Sharp, snappy output reaching ~55 tokens/sec. Minimal processing delay.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Fully aligned.
| |
| | |
| === hf.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF:Q5_K_M ===
| |
| * '''Purpose:''' A premium, creative fine-tune of Llama 3.1 8B by Nous Research, tailored for advanced roleplay, agentic steps, and complex instruction following.
| |
| * '''Best For:''' Creative writing, world-building, intricate multi-turn roleplay, and agentic workflows.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 8.0 Billion
| |
| ** '''Memory/VRAM Usage:''' ~5.9 GB (Q5_K_M execution footprint)
| |
| ** '''Performance & Latency:''' Highly responsive ~60 tokens/sec stream velocity. Instant initial output response behavior.
| |
| * '''Censorship Status:''' <span style="color:orange; font-weight:bold;">[Partial] Highly Permissive</span> – While not aggressively abliterated, it is fine-tuned to be neutral, non-preachy, and almost entirely free of false-positive refusals.
| |
| | |
| === gemma4-8b-uncensored ===
| |
| * '''Purpose:''' A modified 8B Gemma 4 base designed to offer modern reasoning power without any topic restriction.
| |
| * '''Best For:''' Running unfiltered writing experiments or analysis tasks on low-spec hardware.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~8 Billion
| |
| ** '''Memory/VRAM Usage:''' ~5.5 GB
| |
| ** '''Performance & Latency:''' Snappy ~58-64 tokens/sec. Zero latency blocks.
| |
| * '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Safety tuning bypassed.
| |
| | |
| === dolphin-2.9-8b:latest ===
| |
| * '''Purpose:''' Eric Hartford's iconic Dolphin fine-tune applied to an 8B base, explicitly optimized to be helpful, harmless, and completely unbiased/unfiltered.
| |
| * '''Best For:''' Unrestricted hacking/penetration testing research, unfiltered creative prose, and raw data transformations.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 8.0 Billion
| |
| ** '''Memory/VRAM Usage:''' ~5.3 GB
| |
| ** '''Performance & Latency:''' Fast ~60 tokens/sec stream output rate. Exceptionally low prompt parsing delays.
| |
| * '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Fully uncensored by design.
| |
| | |
| === hf.co/Qwen/Qwen3-8B-GGUF:Q4_K_M ===
| |
| * '''Purpose:''' The lean, highly quantized 8B baseline of the Qwen 3 general series.
| |
| * '''Best For:''' Basic chat utilities, lightweight translation scripts, and low-latency local assistants.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 8.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~5.2 GB (Q4_K_M sweet spot footprint)
| |
| ** '''Performance & Latency:''' Swift ~65 tokens/sec throughput with immediate text streaming characteristics.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard default alignment.
| |
| | |
| ---
| |
| | |
| == Low-Resource & Edge Models (1B - 7B) ==
| |
| | |
| === qwen2.5-coder:7b ===
| |
| * '''Purpose:''' A highly optimized 7B parameter programming specialist designed to run efficiently on standard laptops.
| |
| * '''Best For:''' Real-time IDE integration, autocomplete loops, and small-scale scripting.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 7.6 Billion
| |
| ** '''Memory/VRAM Usage:''' ~4.8 GB
| |
| ** '''Performance & Latency:''' Blazing fast performance at ~75-85 tokens/sec. Perfect for immediate inline coding block autocomplete behaviors.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Basic coding safety limits apply.
| |
| | |
| === qwen7b-32k:latest ===
| |
| * '''Purpose:''' A specialized 7B variant configured specifically to ingest and remember massive text inputs up to a 32,000 token window.
| |
| * '''Best For:''' Ingesting whole research papers, large source-code files, or extensive chat histories in a single prompt.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 7.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~4.7 GB base allocation. Maxing out the context window to 32k tokens causes the pre-allocated Key-Value (KV) cache memory to expand significantly (up to an additional 4-8 GB of VRAM dynamically).
| |
| ** '''Performance & Latency:''' ~70 tokens/sec under standard usage. Heavy prompt ingestion will scale initial TTFT up to several seconds during token processing phases.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard Qwen alignment.
| |
| | |
| === llama3.2:latest ===
| |
| * '''Purpose:''' Meta's highly popular, lightweight 3B generalist model designed for edge computing and mobile deployment.
| |
| * '''Best For:''' Ultra-fast everyday text tasks, basic email formatting, and maintaining a low-footprint background desktop assistant.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 3.2 Billion
| |
| ** '''Memory/VRAM Usage:''' ~2.5 GB
| |
| ** '''Performance & Latency:''' Extreme text throughput speeds averaging ~100-120+ tokens/sec. Instantaneous response execution.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Adheres strictly to Meta's safety guidelines.
| |
| | |
| === qwen2.5-coder:3b ===
| |
| * '''Purpose:''' An ultra-compact coding model tailored for low-resource hardware, suitable for embedded systems or background IDE plugins.
| |
| * '''Best For:''' Quick syntax checks, simple function writing, and short utility scripts.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 3.1 Billion
| |
| ** '''Memory/VRAM Usage:''' ~2.4 GB
| |
| ** '''Performance & Latency:''' ~110 tokens/sec. Text returns arrive immediately with zero typing lag.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Basic alignment active.
| |
| | |
| === qwen3b-high-ctx:latest ===
| |
| * '''Purpose:''' A 3B parameter model optimized specifically for handling enlarged context lengths on constrained hardware.
| |
| * '''Best For:''' Reading lengthy logs or documentation files on machines lacking dedicated GPU hardware.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' ~3 Billion
| |
| ** '''Memory/VRAM Usage:''' ~2.3 GB base (expands based on operational context volume loading).
| |
| ** '''Performance & Latency:''' ~95-105 tokens/sec under typical loads.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard alignment.
| |
| | |
| === deepseek-r1:1.5b ===
| |
| * '''Purpose:''' A tiny, distilled reasoning model featuring internal chain-of-thought ("thinking") capabilities.
| |
| * '''Best For:''' Basic logical problem solving, simple math validation, and testing deep reasoning architectures on extremely weak hardware.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 1.5 Billion
| |
| ** '''Memory/VRAM Usage:''' ~1.1 GB
| |
| ** '''Performance & Latency:''' Outrageously fast formatting speeds up to ~130 tokens/sec. However, note that total latency is lengthened because the model generates internal thinking tokens before revealing the raw target answer text.
| |
| * '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Retains default alignment protocols during its thinking phase.
| |
| | |
| === smollm2-uncensored:latest ===
| |
| * '''Purpose:''' An ultra-compact 1.7B parameter model optimized for mobile or CPU-only setups, stripped of systemic safety refusals.
| |
| * '''Best For:''' Unfiltered basic text generation, edge-device testing, and quick offline note restructuring.
| |
| * '''Resource Profile:'''
| |
| ** '''Parameters:''' 1.7 Billion
| |
| ** '''Memory/VRAM Usage:''' ~1.3 GB
| |
| ** '''Performance & Latency:''' Absolute maximum speed configuration running at ~130+ tokens/sec with sub-millisecond start delivery times.
| |
| * '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Safety mechanisms fully removed.
| |
| | |
| [[Category:Local AI Models]]
| |
| [[Category:Ollama Infrastructure]]
| |