
Transformers.cpp

C++ · LibTorch · Transformers · Attention · LLMs · Tokenizers · Inference

Overview

Reconstructing modern multi-modal and mixed-modal Transformer architectures with LibTorch in C++, without Python dependencies.

A C++ library for Transformer-based LLMs, featuring multi-modal and mixed-modal support, optimized attention, and hybrid CPU/GPU inference for high-speed, low-memory deployment.
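
To illustrate the hybrid CPU/GPU idea, here is a minimal, hypothetical sketch of device placement with LibTorch; the two Linear layers stand in for the real encoder and decoder modules, whose names and interfaces in Transformers.cpp may differ.

```cpp
#include <torch/torch.h>
#include <iostream>

// Minimal sketch of hybrid CPU/GPU placement with LibTorch.
// The Linear layers are stand-ins for the actual encoder/decoder modules.
int main() {
    torch::Device cpu(torch::kCPU);
    torch::Device gpu = torch::cuda::is_available()
                            ? torch::Device(torch::kCUDA)
                            : torch::Device(torch::kCPU);  // fall back to CPU-only

    // Keep one stage on the CPU and the compute-heavy stage on the GPU.
    torch::nn::Linear encoder(512, 512);
    torch::nn::Linear decoder(512, 32000);
    encoder->to(cpu);
    decoder->to(gpu);

    torch::NoGradGuard no_grad;
    auto hidden = encoder->forward(torch::rand({1, 512}));
    auto logits = decoder->forward(hidden.to(gpu));  // move activations between devices
    std::cout << logits.sizes() << std::endl;        // [1, 32000]
}
```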

Features

  • Manual LibTorch implementations of Llama, Llava, Mistral, DeepseekV3, LlavaNext, and CLIPVision
  • Optimized attention with multi-query (MQA) and grouped-query (GQA) attention (a sketch follows this list)
  • Dynamic key-value (KV) cache support
  • Multi-modal and mixed-modal hybrid CPU/GPU inference
  • Flexible tokenization: BPE, SentencePiece, and HuggingFace-compatible tokenizers
  • Supports Vision Transformers using OpenCV
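
To make the attention bullet concrete, below is a hedged, self-contained sketch of grouped-query attention in LibTorch; the function name, tensor layout, and the omission of masking and KV-cache handling are simplifications for illustration, not the library's actual interface.

```cpp
#include <torch/torch.h>
#include <cmath>

// Sketch of grouped-query attention (GQA): several query heads share one K/V head.
// q:    [batch, n_q_heads,  seq, head_dim]
// k, v: [batch, n_kv_heads, seq, head_dim], where n_kv_heads divides n_q_heads.
// With n_kv_heads == 1 this reduces to multi-query attention (MQA).
torch::Tensor grouped_query_attention(torch::Tensor q,
                                      torch::Tensor k,
                                      torch::Tensor v) {
    const int64_t groups = q.size(1) / k.size(1);

    // Expand the shared K/V heads so each query head sees its group's K/V.
    k = k.repeat_interleave(groups, /*dim=*/1);
    v = v.repeat_interleave(groups, /*dim=*/1);

    const double scale = 1.0 / std::sqrt(static_cast<double>(q.size(-1)));
    auto scores = torch::matmul(q, k.transpose(-2, -1)) * scale;  // [b, h, seq, seq]
    auto probs  = torch::softmax(scores, /*dim=*/-1);
    return torch::matmul(probs, v);                               // [b, h, seq, head_dim]
}
```

Fewer K/V heads also mean a proportionally smaller KV cache, which is what makes MQA and GQA attractive for low-memory inference.
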
GitHub Repository

Researching the Internals

Before starting my internship at BSC, I spent long nights researching the internals of Transformer architectures. The papers pinned on the board behind me are my handwritten diagrams of attention flows, embedding layers, and decoder stacks. This early exploration laid the foundation for the work I would later pursue at BSC and eventually expand into my Transformers.cpp project.

No Rest, Just Transformers.cpp

The day after I finished my internship at BSC in Barcelona, I flew back to Turkey and immediately started working on the Transformers.cpp project at the Kasırga Microprocessors Lab. No pause, no break, just straight into building C++ implementations of LLaMA, LLaVA, and more. This photo at the tech center marks the very first morning of that new chapter.

First LLaMA Milestone

This night marks the moment I successfully implemented the full LLaMA architecture in my Transformers.cpp project. After weeks of debugging attention, RoPE, and the KV cache system, everything finally came together. Once I saw the outputs match Hugging Face's reference, I took a break to breathe and enjoy the night, a small pause after a huge milestone.

Let them speak.

People talked a lot, but in the end the results stayed, and that's what really mattered.
