Understanding GGUF and GGML in Machine Learning

By Youssef B.

What Are GGUF and GGML?

GGUF, or GGML Universal File, is a file format for storing machine learning models, particularly LLMs, for fast and efficient inference. It’s built to work with GGML, a tensor library that enables running complex models on everyday hardware like CPUs. GGUF is the newer, more flexible format, replacing GGML’s older file format, which was less extensible. Think of GGUF as a neatly packed suitcase containing everything a model needs—weights, metadata, and more—in one file, making it easy to use and share.

GGML, on the other hand, started as both a library and a file format. As a library, it provides the tools to run models efficiently, while its file format was used to store models before GGUF came along. GGML’s focus was on accessibility, letting people run powerful models without fancy GPUs, but it struggled with adding new features without breaking older models.

How Do They Work Together?

GGML is like the engine that powers the model, handling computations during inference. GGUF is the fuel tank, storing the model’s data in a way that GGML can quickly access. Models are often trained in frameworks like PyTorch, then converted to GGUF for use with GGML. This setup is great for running models on your laptop or even a phone, as it doesn’t demand high-end hardware.
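For instance, a Hugging Face checkpoint is usually converted with the conversion script that ships with llama.cpp. The sketch below assumes a local clone of llama.cpp; the script name (convert_hf_to_gguf.py) and its flags vary between versions, so check your copy of the repository.

import subprocess

# Convert a local Hugging Face checkpoint to a single GGUF file.
# Script name and flags are assumptions based on recent llama.cpp releases.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/hf-model",              # directory holding the original checkpoint
        "--outfile", "model-f16.gguf",   # single-file GGUF output
        "--outtype", "f16",              # keep 16-bit weights; quantize afterwards if needed
    ],
    check=True,
)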

Using GGUF and GGML in Python

To use GGUF models in Python, you can rely on the llama-cpp-python library, which connects to GGML’s inference engine. You can download GGUF models from platforms like Hugging Face and run them locally. Here’s a basic example to help you begin.

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download a GGUF model from Hugging Face
model_name = "TheBloke/Llama-2-7B-GGUF"
model_file = "llama-2-7b.Q4_K_M.gguf"
model_path = hf_hub_download(repo_id=model_name, filename=model_file)

# Initialize the Llama model
llm = Llama(model_path=model_path)

# Generate text
prompt = "What is the capital of France?"
output = llm(prompt, max_tokens=50)
print(output["choices"][0]["text"])

This code downloads a quantized Llama-2 model in GGUF format, loads it, and generates a response to a prompt. You can tweak parameters like max_tokens or add GPU support by setting n_gpu_layers=-1 if you have a compatible GPU.
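As a rough sketch, here is what a slightly more tuned setup might look like. The parameter names (n_ctx, n_gpu_layers, temperature, stop) come from llama-cpp-python, and sensible values depend on your model and hardware.

from llama_cpp import Llama

# Load the model with a larger context window and full GPU offload.
llm = Llama(
    model_path=model_path,   # path returned by hf_hub_download above
    n_ctx=2048,              # context window in tokens
    n_gpu_layers=-1,         # offload all layers to a compatible GPU (0 = CPU only)
    verbose=False,
)

# Generate with a few sampling controls.
output = llm(
    "Explain GGUF in one sentence.",
    max_tokens=64,
    temperature=0.7,
    stop=["\n\n"],           # stop at the first blank line
)
print(output["choices"][0]["text"])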

Why Are They Important?

GGUF and GGML make AI more accessible by letting you run sophisticated models without expensive hardware. GGUF’s single-file design and GGML’s efficient computations are perfect for developers, hobbyists, or anyone wanting to experiment with LLMs locally. They’re especially popular for quantized models, which are smaller and faster, ideal for resource-constrained environments.


Deep Dive into GGUF: The GGML Universal File Format

GGUF, or GGML Universal File, is a binary file format designed for storing machine learning models, particularly large language models (LLMs), for efficient inference. It works seamlessly with GGML, a tensor library, and other GGML-based executors. GGUF aims to:

  • Enable fast loading and saving of models.
  • Simplify usage with minimal dependencies.
  • Ensure extensibility for future compatibility.

As a single-file format, GGUF encapsulates all model data—tensors and metadata—making it ideal for deployment on edge devices and consumer-grade hardware like CPUs and Apple Silicon.

GGML: The Predecessor and Engine

GGML, short for “GG’s Machine Learning,” is both a tensor library and an older file format developed by Georgi Gerganov. The library optimizes model inference on standard hardware, while the GGML file format was used to store models before GGUF. Its key strengths included:

  • High performance on CPUs.
  • Accessibility for non-GPU environments.

However, GGML’s file format lacked flexibility, often requiring manual adjustments for new features, which led to compatibility issues. GGUF was introduced to overcome these limitations.

Key Features of GGUF

GGUF’s design prioritizes efficiency and flexibility, making it a robust choice for modern machine learning workflows. Its features include:

  • Single-File Deployment: All model data is stored in one file, eliminating external dependencies.
  • Extensibility: Additional features can be introduced without affecting existing compatibility.
  • Fast Loading and Saving: Optimized for quick inference tasks.
  • Memory-Mapped Files: Supports mmap for faster loading.
  • Minimal Dependencies: Requires no external libraries, simplifying implementation.

These characteristics make GGUF a good fit for deploying LLMs on devices with limited resources.

Naming Convention

GGUF files follow a structured naming convention: <BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf. Components are hyphen-separated, with at least BaseName, SizeLabel, and Version required; a small parsing sketch follows the examples below. Key components include:

  • BaseName: Model architecture (e.g., “Mixtral”).
  • SizeLabel: Parameter weight class (e.g., “8x7B” for 8 experts, 7 billion parameters).
  • FineTune: Fine-tuning goal (e.g., “Chat”).
  • Version: Formatted as v<Major>.<Minor> (defaults to v1.0).
  • Encoding: Weights encoding scheme (e.g., “Q4_0”).
  • Type: File type (e.g., “LoRA” for adapters).
  • Shard: For multi-shard models (e.g., “00001-of-00009”).

Examples:

  • Mixtral-8x7B-v0.1-KQ2.gguf: Mixtral, 8x7B parameters, version 0.1, KQ2 encoding.
  • Hermes-2-Pro-Llama-3-8B-F16.gguf: Hermes 2 Pro Llama 3, 8B parameters, F16 encoding.
  • Grok-100B-v1.0-Q4_0-00003-of-00009.gguf: Grok, 100B parameters, Q4_0 encoding, shard 3 of 9.
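To make the convention more concrete, here is a small illustrative parser that splits a filename into shard, encoding, version, and the remaining base/size/fine-tune part. It is a heuristic sketch of the convention above, not an official parser.

import re

def parse_gguf_name(filename: str) -> dict:
    # Strip the extension and split into hyphen-separated components.
    stem = filename.removesuffix(".gguf")
    parts = stem.split("-")
    info = {}
    # Shard marker, e.g. "00003-of-00009", occupies the last three parts.
    if len(parts) >= 3 and parts[-2] == "of" and parts[-1].isdigit() and parts[-3].isdigit():
        info["shard"] = "-".join(parts[-3:])
        parts = parts[:-3]
    # Encoding, e.g. "Q4_0", "KQ2", "F16".
    if parts and re.fullmatch(r"[A-Z]+\d[\w]*|F16|F32|BF16", parts[-1]):
        info["encoding"] = parts.pop()
    # Version, e.g. "v0.1".
    if parts and re.fullmatch(r"v\d+\.\d+", parts[-1]):
        info["version"] = parts.pop()
    # Whatever remains covers BaseName, SizeLabel, and optional FineTune.
    info["rest"] = "-".join(parts)
    return info

print(parse_gguf_name("Grok-100B-v1.0-Q4_0-00003-of-00009.gguf"))
# -> {'shard': '00003-of-00009', 'encoding': 'Q4_0', 'version': 'v1.0', 'rest': 'Grok-100B'}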

File Structure

GGUF files are structured for efficiency:

  • Header: Contains metadata (version, alignment).
  • Tensor Infos: Details each tensor (name, type, dimensions).
  • Padding: Aligns data (e.g., to 32 bytes) for mmap performance.
  • Tensor Data: Stores model weights.

Key elements:

  • Global Alignment: Set by general.alignment, padded with 0x00.
  • Endianness: Little-endian by default, big-endian supported.
  • Data Types: 29 tensor types (e.g., GGML_TYPE_F32) and 13 metadata types (e.g., strings).

This structure ensures compact, fast-loading files.
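If you want to see this structure for yourself, the gguf Python package (published alongside llama.cpp) can read the header metadata and tensor infos directly. The attribute names below reflect recent versions of that package and may shift over time.

from gguf import GGUFReader

reader = GGUFReader("llama-2-7b.Q4_K_M.gguf")

# Metadata key-value pairs from the header.
for key in reader.fields:
    print(key)

# Tensor infos: name, shape, and quantization type for each tensor.
for tensor in reader.tensors[:5]:
    print(tensor.name, list(tensor.shape), tensor.tensor_type)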

Standardized Key-Value Pairs

GGUF uses metadata key-value pairs to describe models comprehensively. Categories include:

  • Required Metadata:
    • general.architecture: Model type (e.g., “llama”).
    • general.quantization_version: Quantization version.
    • general.alignment: File alignment (default 32).
  • General Metadata:
    • general.name, general.author, general.version, general.organization.
    • general.basename, general.finetune, general.description.
    • general.quantized_by, general.size_label, general.license.
    • general.url, general.doi, general.uuid, general.repo_url.
    • general.tags, general.languages, general.datasets, general.file_type.
  • Source Metadata:
    • general.source.url, general.source.doi, general.source.uuid.
    • general.source.repo_url, general.base_model.count.
  • LLM-Specific Metadata:
    • [llm].context_length, [llm].embedding_length, [llm].block_count.
    • [llm].feed_forward_length, [llm].use_parallel_residual.
    • [llm].tensor_data_layout, [llm].expert_count, [llm].expert_used_count.
  • Attention Metadata:
    • [llm].attention.head_count, [llm].attention.head_count_kv.
    • [llm].attention.max_alibi_bias, [llm].attention.clamp_kqv.
    • [llm].attention.layer_norm_epsilon, [llm].attention.key_length.
  • RoPE Metadata:
    • [llm].rope.dimension_count, [llm].rope.freq_base.
    • [llm].rope.scaling.type, [llm].rope.scaling.factor.
  • SSM Metadata:
    • [llm].ssm.conv_kernel, [llm].ssm.inner_size.
    • [llm].ssm.state_size, [llm].ssm.time_step_rank.

These pairs ensure rich model annotation.
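As an illustration of how these keys map onto code, the gguf package also exposes a GGUFWriter whose helper methods correspond to the standardized pairs above (for example, add_context_length sets [llm].context_length). Exact method names and signatures may differ between package versions, so treat this as an outline rather than a recipe.

import numpy as np
from gguf import GGUFWriter

writer = GGUFWriter("tiny.gguf", arch="llama")   # sets general.architecture
writer.add_name("tiny-demo")                     # general.name
writer.add_context_length(2048)                  # llama.context_length
writer.add_embedding_length(64)                  # llama.embedding_length
writer.add_block_count(2)                        # llama.block_count

# A single made-up tensor, just to produce a complete file.
writer.add_tensor("token_embd.weight", np.zeros((64, 16), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()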

Version History

GGUF has evolved:

  • v1: Initial version.
  • v2: Changed uint32 to uint64 for larger models.
  • v3: Added big-endian support.

This ensures compatibility and scalability.

Historical Context

GGUF succeeded earlier formats:

  • GGML: Unversioned, lacked extensibility.
  • GGMF: Single version, limited flexibility.
  • GGJT: Versions v1-v3, improved mmap, but had breaking changes.

GGUF addresses:

  • Architecture identification.
  • Extensibility without breaking changes.
  • Comprehensive metadata.

Why GGUF Over Other Formats?

GGUF was chosen because:

  • Avoids additional dependencies.
  • Supports 4-bit quantization.
  • Aligns with community workflows.
  • Embeds vocabularies.

GGUF vs. GGML: A Comparison

The following table compares GGUF and GGML:

Aspect | GGML | GGUF
Purpose | Tensor library and file format | File format for inference
Flexibility | Limited, compatibility issues | Extensible, backward-compatible
Efficiency | Optimized for CPUs | Fast loading/saving, mmap support
Metadata | Basic, manual adjustments | Standardized, comprehensive
Use Case | Early LLM deployment, now deprecated | Modern inference, quantized models

GGUF builds on GGML’s strengths while addressing its shortcomings.

Community and Adoption

GGUF is widely adopted, with many models available on Hugging Face. This facilitates local deployment and sharing, appealing to developers and hobbyists.

Security Considerations

Malformed GGUF files may pose risks like memory corruption. Best practices include:

  • Using trusted model sources and verifying published checksums (see the sketch after this list).
  • Keeping software updated.
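One concrete precaution is to compare a downloaded file against the SHA-256 checksum published by the model provider before loading it. The sketch below uses only the standard library; the expected hash is a placeholder.

import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large GGUF files don't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "0000...placeholder...0000"   # value from the model card or release notes
actual = sha256_of("llama-2-7b.Q4_K_M.gguf")
if actual != expected:
    raise ValueError(f"Checksum mismatch: {actual}")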

Comparison with Other Formats

Unlike PyTorch’s .pt or TensorFlow’s SavedModel, GGUF excels in:

  • Inference on standard hardware.
  • Extensibility.
  • Single-file deployment.

However, it’s primarily for inference, not training.

Future Developments

GGUF may evolve with:

  • New quantization schemes.
  • Metadata for emerging architectures.
  • Faster loading mechanisms.

Conclusion

GGUF and GGML help make AI more accessible by allowing efficient LLM inference on everyday hardware. GGUF’s extensible, single-file design and GGML’s robust library make them a powerful duo for deploying models locally. With growing community support and tools like llama-cpp-python, GGUF is poised to shape the future of accessible AI.
