@@ Line 1: / Line 1: @@
-'''This is a local Ollama installation running on a powerful local GPU.'''
+Only on Test
-__NOTOC__
-This page serves as a comprehensive technical registry and capabilities guide for the locally installed Ollama Large Language Models (LLMs). It outlines their core architecture, performance profiles, memory usage, and safety alignment status.
-== Local Hardware Performance Overview ==
-The following estimates assume full GPU offloading utilizing a high-end local GPU layout. Splitting models onto system RAM will heavily degrade these numbers.
-{| class="wikibase" style="width:100%; border:1px solid #ccc; border-collapse:collapse; text-align:left;"
-! Scope / Scale
-! Parameter Range
-! Avg. VRAM Footprint
-! Speed (Tokens/sec)
-! Context Latency (TTFT)
-|-
-| '''High-Capability'''
-| 20B – 31B
-| 16 GB – 24 GB
-| 20 – 35 tok/s
-| Moderate (~1.5s - 3.0s)
-|-
-| '''Mid-Range / Specialist'''
-| 8B – 16B
-| 6 GB – 14 GB
-| 40 – 65 tok/s
-| Low (~0.8s - 1.5s)
-|-
-| '''Low-Resource / Edge'''
-| 1B – 7B
-| 2 GB – 6 GB
-| 70 – 120+ tok/s
-| Near-Instant (<0.5s)
-|}
----
-== High-Capability & Large-Scale Models (15B - 31B) ==
-=== gemma-4-31b-uncensored ===
-* '''Purpose:''' A highly capable, large-scale model based on the Google Gemma 4 architecture, modified to bypass standard safety guardrails.
-* '''Best For:''' Complex reasoning, deep creative writing, philosophical exploration, and processing intricate multi-step prompts without refusal barriers.
-* '''Resource Profile:'''
-** '''Parameters:''' ~31 Billion
-** '''Memory/VRAM Usage:''' ~20.5 GB (Quantized)
-** '''Performance & Latency:''' Generates ~22-26 tokens/sec. High reasoning overhead can cause a slightly delayed Time-To-First-Token (TTFT) of 2-3 seconds.
-* '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored / Abliterated</span> – All standard safety alignment filters have been removed.
-=== qwen3-coder:30b ===
-* '''Purpose:''' Alibaba's flagship open-weights coding model, optimized for enterprise-grade software engineering, architecture design, and complex debugging.
-* '''Best For:''' Writing full-stack applications, handling multi-file repository contexts, explaining complex algorithms, and optimizing legacy code.
-* '''Resource Profile:'''
-** '''Parameters:''' ~30 Billion
-** '''Memory/VRAM Usage:''' ~19.8 GB
-** '''Performance & Latency:''' Achieves ~25-30 tokens/sec. Processing long context inputs (codebases) will cause pre-fill latency to scale upward.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard alignment remains active to prevent malicious code generation (e.g., malware design).
-=== qwen3.6:27b ===
-* '''Purpose:''' A premium, heavy-duty generalist model designed for advanced reasoning, mathematical logic, translation, and analytical workflows.
-* '''Best For:''' High-fidelity data analysis, complex summarizing, multi-language translation, and acting as a central orchestration agent.
-* '''Resource Profile:'''
-** '''Parameters:''' ~27 Billion
-** '''Memory/VRAM Usage:''' ~18.2 GB
-** '''Performance & Latency:''' ~28-33 tokens/sec. Consistent, stable token execution pacing.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Adheres to strict ethical, helpful, and harmless boundaries.
-=== gemma4:26b ===
-* '''Purpose:''' The standard, fully-aligned iteration of Google's 26B Gemma 4 model, balancing massive parameter depth with strict safety guardrails.
-* '''Best For:''' Enterprise deployments, academic research assistance, and standard corporate productivity tools where safety compliance is required.
-* '''Resource Profile:'''
-** '''Parameters:''' ~26 Billion
-** '''Memory/VRAM Usage:''' ~17.5 GB
-** '''Performance & Latency:''' ~30 tokens/sec. Safety evaluations add a marginal execution latency overhead to the initial response generation.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Strictly Censored</span> – Contains default Google RLHF guardrails against sensitive, harmful, or controversial topics.
-=== gpt-oss:20b ===
-* '''Purpose:''' An open-source generalist model trained to mimic commercial GPT-style interactions across diverse task types.
-* '''Best For:''' General brainstorming, text transformation, data cleaning, and everyday agentic tasks.
-* '''Resource Profile:'''
-** '''Parameters:''' ~20 Billion
-** '''Memory/VRAM Usage:''' ~14.0 GB
-** '''Performance & Latency:''' Highly responsive at ~35 tokens/sec with low overall processing latency.
-* '''Censorship Status:''' <span style="color:orange; font-weight:bold;">[Partial] Lightly Censored</span> – Usually features basic ethical boundaries but is significantly more permissive than strict corporate models.
----
-== Mid-Range & Specialist Models (8B - 16B) ==
-=== hf.co/mradermacher/Mistral-Nemo-Instruct-2407-abliterated-i1-GGUF:Q5_K_M ===
-* '''Purpose:''' A precision-tuned 12B parameter model combining Mistral's architecture with "abliteration" techniques to erase negative refusal weights.
-* '''Best For:''' Unrestricted academic research, writing intense fiction, roleplay, and analyzing controversial text datasets.
-* '''Resource Profile:'''
-** '''Parameters:''' 12.2 Billion
-** '''Memory/VRAM Usage:''' ~8.9 GB (Q5_K_M layout)
-** '''Performance & Latency:''' Swift ~45-52 tokens/sec response generation. Near-instant text generation initialization.
-* '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored / Abliterated</span> – Chemically stripped of refusal behaviors while retaining structural coherence.
-=== hf.co/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF:Q5_K_M ===
-* '''Purpose:''' The cleanly quantized, official instruction-tuned variant of the highly regarded Mistral-Nemo 12B model.
-* '''Best For:''' General instruction following, structured text generation, multi-lingual translation, and logic puzzles.
-* '''Resource Profile:'''
-** '''Parameters:''' 12.2 Billion
-** '''Memory/VRAM Usage:''' ~8.9 GB (Q5_K_M layout)
-** '''Performance & Latency:''' Highly predictable ~45-52 tokens/sec generation speed with negligible pre-fill lag.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Retains default Mistral safety alignments.
-=== MFDoom/deepseek-coder-v2-tool-calling:16b ===
-* '''Purpose:''' A custom Mixture-of-Experts (MoE) fine-tune specifically optimized to act as an agentic backend capable of executing external function calls.
-* '''Best For:''' AI agents, local tool integration, and automated pipeline execution.
-* '''Resource Profile:'''
-** '''Parameters:''' 16 Billion Total (uses ~3.3B active parameters per token)
-** '''Memory/VRAM Usage:''' ~11.2 GB
-** '''Performance & Latency:''' Blazing fast inference speeds due to MoE architecture, averaging ~55-65 tokens/sec. High initial layout parsing efficiency.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Aligned to prevent execution of destructive or malicious local commands.
-=== deepseek-coder-v2:16b ===
-* '''Purpose:''' The foundational Mixture-of-Experts (MoE) coding model from DeepSeek, prized for high-efficiency programming generation.
-* '''Best For:''' Continuous inline code completion, rapid prototyping, code refactoring, and general script writing.
-* '''Resource Profile:'''
-** '''Parameters:''' 16 Billion Total
-** '''Memory/VRAM Usage:''' ~11.2 GB
-** '''Performance & Latency:''' ~55-65 tokens/sec. Optimized for sub-second text stream delivery loops.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard guardrails against generating malware or exploits.
-=== qwen3:14b ===
-* '''Purpose:''' The standard mid-tier iteration of the Qwen 3 general-purpose intelligence framework.
-* '''Best For:''' Summarizing lengthy documents, conversational search support, copy editing, and medium-complexity script writing.
-* '''Strengths:'''
-* '''Resource Profile:'''
-** '''Parameters:''' ~14 Billion
-** '''Memory/VRAM Usage:''' ~10.1 GB
-** '''Performance & Latency:''' Runs fluidly at ~42-48 tokens/sec. Moderate latency scaling during lengthy prompt ingestion phases.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Fully aligned with default safety training.
-=== qwen2.5-coder:14b ===
-* '''Purpose:''' The established, highly mature 14B coding engine from the Qwen 2.5 generation.
-* '''Best For:''' Stable software development environments requiring predictable, reliable syntax generation.
-* '''Resource Profile:'''
-** '''Parameters:''' 14.8 Billion
-** '''Memory/VRAM Usage:''' ~10.5 GB
-** '''Performance & Latency:''' Stable ~42-48 tokens/sec. Excellent processing consistency throughout extended syntax blocks.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Features standard safety controls.
-=== gemma3:12b ===
-* '''Purpose:''' A legacy mid-tier general model from Google’s third-generation open weights release.
-* '''Best For:''' Everyday office automation tasks, document formatting, and general QA datasets.
-* '''Resource Profile:'''
-** '''Parameters:''' 12.2 Billion
-** '''Memory/VRAM Usage:''' ~8.8 GB
-** '''Performance & Latency:''' ~45 tokens/sec. Low baseline generation latency.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Governed by standard Google safety alignment.
-=== gemma4:e4b ===
-* '''Purpose:''' An experimental or early-quantized/preview variant of the Gemma 4 framework optimized for edge environments.
-* '''Best For:''' Comparing structural generation changes between Gemma versions or running low-overhead general tasks.
-* '''Resource Profile:'''
-** '''Parameters:''' ~12 Billion Equivalent
-** '''Memory/VRAM Usage:''' ~7.9 GB
-** '''Performance & Latency:''' Spits out tokens highly efficiently at ~50 tokens/sec. Rapid initialization cycles.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard guardrails apply.
-=== gemma2:9b ===
-* '''Purpose:''' Google’s highly successful 9B parameter generalist model from the Gemma 2 era.
-* '''Best For:''' Low-resource conversational assistance, flashcard generation, and quick text summarization.
-* '''Resource Profile:'''
-** '''Parameters:''' 9.2 Billion
-** '''Memory/VRAM Usage:''' ~6.8 GB
-** '''Performance & Latency:''' Sharp, snappy output reaching ~55 tokens/sec. Minimal processing delay.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Fully aligned.
-=== hf.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF:Q5_K_M ===
-* '''Purpose:''' A premium, creative fine-tune of Llama 3.1 8B by Nous Research, tailored for advanced roleplay, agentic steps, and complex instruction following.
-* '''Best For:''' Creative writing, world-building, intricate multi-turn roleplay, and agentic workflows.
-* '''Resource Profile:'''
-** '''Parameters:''' 8.0 Billion
-** '''Memory/VRAM Usage:''' ~5.9 GB (Q5_K_M execution footprint)
-** '''Performance & Latency:''' Highly responsive ~60 tokens/sec stream velocity. Instant initial output response behavior.
-* '''Censorship Status:''' <span style="color:orange; font-weight:bold;">[Partial] Highly Permissive</span> – While not aggressively abliterated, it is fine-tuned to be neutral, non-preachy, and almost entirely free of false-positive refusals.
-=== gemma4-8b-uncensored ===
-* '''Purpose:''' A modified 8B Gemma 4 base designed to offer modern reasoning power without any topic restriction.
-* '''Best For:''' Running unfiltered writing experiments or analysis tasks on low-spec hardware.
-* '''Resource Profile:'''
-** '''Parameters:''' ~8 Billion
-** '''Memory/VRAM Usage:''' ~5.5 GB
-** '''Performance & Latency:''' Snappy ~58-64 tokens/sec. Zero latency blocks.
-* '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Safety tuning bypassed.
-=== dolphin-2.9-8b:latest ===
-* '''Purpose:''' Eric Hartford's iconic Dolphin fine-tune applied to an 8B base, explicitly optimized to be helpful, harmless, and completely unbiased/unfiltered.
-* '''Best For:''' Unrestricted hacking/penetration testing research, unfiltered creative prose, and raw data transformations.
-* '''Resource Profile:'''
-** '''Parameters:''' 8.0 Billion
-** '''Memory/VRAM Usage:''' ~5.3 GB
-** '''Performance & Latency:''' Fast ~60 tokens/sec stream output rate. Exceptionally low prompt parsing delays.
-* '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Fully uncensored by design.
-=== hf.co/Qwen/Qwen3-8B-GGUF:Q4_K_M ===
-* '''Purpose:''' The lean, highly quantized 8B baseline of the Qwen 3 general series.
-* '''Best For:''' Basic chat utilities, lightweight translation scripts, and low-latency local assistants.
-* '''Resource Profile:'''
-** '''Parameters:''' 8.2 Billion
-** '''Memory/VRAM Usage:''' ~5.2 GB (Q4_K_M sweet spot footprint)
-** '''Performance & Latency:''' Swift ~65 tokens/sec throughput with immediate text streaming characteristics.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard default alignment.
----
-== Low-Resource & Edge Models (1B - 7B) ==
-=== qwen2.5-coder:7b ===
-* '''Purpose:''' A highly optimized 7B parameter programming specialist designed to run efficiently on standard laptops.
-* '''Best For:''' Real-time IDE integration, autocomplete loops, and small-scale scripting.
-* '''Resource Profile:'''
-** '''Parameters:''' 7.6 Billion
-** '''Memory/VRAM Usage:''' ~4.8 GB
-** '''Performance & Latency:''' Blazing fast performance at ~75-85 tokens/sec. Perfect for immediate inline coding block autocomplete behaviors.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Basic coding safety limits apply.
-=== qwen7b-32k:latest ===
-* '''Purpose:''' A specialized 7B variant configured specifically to ingest and remember massive text inputs up to a 32,000 token window.
-* '''Best For:''' Ingesting whole research papers, large source-code files, or extensive chat histories in a single prompt.
-* '''Resource Profile:'''
-** '''Parameters:''' 7.2 Billion
-** '''Memory/VRAM Usage:''' ~4.7 GB base allocation. Maxing out the context window to 32k tokens causes the pre-allocated Key-Value (KV) cache memory to expand significantly (up to an additional 4-8 GB of VRAM dynamically).
-** '''Performance & Latency:''' ~70 tokens/sec under standard usage. Heavy prompt ingestion will scale initial TTFT up to several seconds during token processing phases.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard Qwen alignment.
-=== llama3.2:latest ===
-* '''Purpose:''' Meta's highly popular, lightweight 3B generalist model designed for edge computing and mobile deployment.
-* '''Best For:''' Ultra-fast everyday text tasks, basic email formatting, and maintaining a low-footprint background desktop assistant.
-* '''Resource Profile:'''
-** '''Parameters:''' 3.2 Billion
-** '''Memory/VRAM Usage:''' ~2.5 GB
-** '''Performance & Latency:''' Extreme text throughput speeds averaging ~100-120+ tokens/sec. Instantaneous response execution.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Adheres strictly to Meta's safety guidelines.
-=== qwen2.5-coder:3b ===
-* '''Purpose:''' An ultra-compact coding model tailored for low-resource hardware, suitable for embedded systems or background IDE plugins.
-* '''Best For:''' Quick syntax checks, simple function writing, and short utility scripts.
-* '''Resource Profile:'''
-** '''Parameters:''' 3.1 Billion
-** '''Memory/VRAM Usage:''' ~2.4 GB
-** '''Performance & Latency:''' ~110 tokens/sec. Text returns arrive immediately with zero typing lag.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Basic alignment active.
-=== qwen3b-high-ctx:latest ===
-* '''Purpose:''' A 3B parameter model optimized specifically for handling enlarged context lengths on constrained hardware.
-* '''Best For:''' Reading lengthy logs or documentation files on machines lacking dedicated GPU hardware.
-* '''Resource Profile:'''
-** '''Parameters:''' ~3 Billion
-** '''Memory/VRAM Usage:''' ~2.3 GB base (expands based on operational context volume loading).
-** '''Performance & Latency:''' ~95-105 tokens/sec under typical loads.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Standard alignment.
-=== deepseek-r1:1.5b ===
-* '''Purpose:''' A tiny, distilled reasoning model featuring internal chain-of-thought ("thinking") capabilities.
-* '''Best For:''' Basic logical problem solving, simple math validation, and testing deep reasoning architectures on extremely weak hardware.
-* '''Resource Profile:'''
-** '''Parameters:''' 1.5 Billion
-** '''Memory/VRAM Usage:''' ~1.1 GB
-** '''Performance & Latency:''' Outrageously fast formatting speeds up to ~130 tokens/sec. However, note that total latency is lengthened because the model generates internal thinking tokens before revealing the raw target answer text.
-* '''Censorship Status:''' <span style="color:red;">[Yes] Censored</span> – Retains default alignment protocols during its thinking phase.
-=== smollm2-uncensored:latest ===
-* '''Purpose:''' An ultra-compact 1.7B parameter model optimized for mobile or CPU-only setups, stripped of systemic safety refusals.
-* '''Best For:''' Unfiltered basic text generation, edge-device testing, and quick offline note restructuring.
-* '''Resource Profile:'''
-** '''Parameters:''' 1.7 Billion
-** '''Memory/VRAM Usage:''' ~1.3 GB
-** '''Performance & Latency:''' Absolute maximum speed configuration running at ~130+ tokens/sec with sub-millisecond start delivery times.
-* '''Censorship Status:''' <span style="color:green; font-weight:bold;">[No] Uncensored</span> – Safety mechanisms fully removed.
-[[Category:Local AI Models]]
-[[Category:Ollama Infrastructure]]

Ollama Modelfile List: Difference between revisions

Revision as of 20:49, 26 May 2026

Navigation menu