When Less Is More: A Technical Overview of LLMs and the Strength of Smaller Models


In generative AI, the largest large language models (LLMs) often dominate the headlines, hailed as the best solutions for the most complex and diverse tasks. While they certainly have their place, are they the best option for every enterprise use case?

Smaller language models are gaining traction for their ability to deliver high performance with lower cost and resource requirements. These models are quicker, easier to fine-tune, and better suited for targeted business needs, making them an attractive alternative for many organizations.


In this session, we will:

- Explore the technical structure and content of LLMs.
- Discuss how smaller, purpose-built models can be more efficient and effective for enterprise tasks, including how model optimization techniques can boost performance even more.
- Demonstrate how smaller LLMs can provide faster, more cost-effective solutions while still meeting the demands of specialized use cases.

This talk was presented at AI Coding Summit 2026.

FAQ

The presentation is a technical overview of large language models and the strengths of smaller models.

Common Crawl is a nonprofit that crawls the web and hosts a large open archive of internet pages and their data, widely used for model training.

Data is collected from the internet, filtered for quality, deduplicated, and then tokenized into mathematical representations the model can train on.

Tokenization is the process of converting human language into mathematical symbols that a neural network can process.

Model inference is the process of using a trained AI model to generate outputs from new inputs, often requiring an inference engine.

Smaller models are faster and cheaper to run, and can be used on local hardware for data privacy and control.

Quantization is a technique to reduce the size of AI models by decreasing the precision of their parameters, maintaining accuracy while improving efficiency.
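As a toy illustration of the idea (a sketch only, not any particular library's scheme), symmetric int8 quantization maps each floating-point weight to an 8-bit integer using a single per-tensor scale:

```python
# Toy symmetric int8 quantization -- illustrative only.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The round-trip error is bounded by half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing each weight in one byte instead of four cuts model size roughly 4x, which is why the restored values staying within half a step of the originals is usually an acceptable trade.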

AI models can be found on platforms like Hugging Face, which offers models compatible with various inference engines.

GuideLLM and LM Eval Harness are tools mentioned for benchmarking AI models and their performance.

Legare Kerrison is a developer advocate at Red Hat AI.

Legare Kerrison
11 min
26 Feb, 2026

Video Summary and Transcription
Legare Kerrison from Red Hat AI discusses the technical aspects of large language models, including data collection, tokenization, and neural network internals. Model training involves converting human language to mathematical representations and iteratively adjusting parameters in a complex, parameter-rich environment. Inference engines like vLLM aid in deploying models for rapid data processing. Optimizing model size for efficiency without sacrificing accuracy is crucial, with quantization reducing model size while maintaining precision. Local deployment offers privacy and control, and smaller purpose-driven models can enhance workflows and experimentation.

1. Technical Overview of Language Models

Short description:

Legare Kerrison, developer advocate at Red Hat AI, discusses large language models, focusing on data collection, tokenization, neural network internals, and model inferencing, while touching on the strengths of smaller models. The pre-training process includes web scraping, data filtering, conversion of language into mathematical representations, and tokenization.

Hi, I'm Legare Kerrison. I'm a developer advocate at Red Hat AI, and today we're going to talk through a technical overview of large language models and the strengths of smaller models. Here's what we're going to touch on over the next 10 minutes: data collection for the pre-training of these models, the tokenization of that data, what the internals of a neural network look like, and then inferencing these models once it's time to get them into production. Along the way, we'll touch on the strengths of smaller models.

So first up, data collection for pre-training. If you've ever posted on the internet, you've probably helped train these models. Here we can see Common Crawl's statistical graph of what a web hierarchy looks like. Common Crawl is a web crawler that hosts a huge collection of the internet's pages and the data on them. Every lab trains its models on some dataset similar to what Common Crawl has captured.

So once you pull from the websites, people will typically filter out the URLs that will predictably lead to bad results. They'll pull the text from these websites, ignore anything that isn't text, and filter for the mix of languages they want; maybe you want it to be 65% English and some percentage of other languages. At the end of the day, it's all going to be converted to mathematical representations. From there, you remove duplicates and, hopefully, any personally identifiable information, such as social security numbers, passwords, and so on. Then we tokenize that data.
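The filtering steps described above can be sketched as a toy pipeline; the blocklist, PII pattern, and dedup strategy here are illustrative placeholders, far simpler than anything a real lab uses:

```python
import re

# Toy pre-training data filter -- illustrative only.
BLOCKLIST = ("spam.example",)                   # hosts expected to yield bad text
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # crude PII pattern (US SSN shape)

def clean(pages):
    """pages: list of (url, text). Returns filtered, deduplicated texts."""
    seen, out = set(), []
    for url, text in pages:
        if any(host in url for host in BLOCKLIST):
            continue                            # drop predictably bad sources
        text = SSN_RE.sub("[REDACTED]", text)   # scrub obvious PII
        if text not in seen:                    # remove exact duplicates
            seen.add(text)
            out.append(text)
    return out

pages = [
    ("https://spam.example/a", "buy now"),
    ("https://ok.example/b", "Alice was beginning to get very tired."),
    ("https://ok.example/c", "Alice was beginning to get very tired."),
    ("https://ok.example/d", "SSN: 123-45-6789"),
]
docs = clean(pages)
```

Real pipelines layer on quality classifiers, fuzzy deduplication, and language identification, but they follow this same shape: drop bad sources, scrub sensitive strings, deduplicate what remains.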

2. Model Training and Inference

Short description:

Converting human language into math through encoding and tokenization. Training models adjust parameters iteratively to reflect data patterns, operating in a complex parameter-rich environment. Inference engines like VLLM facilitate the production deployment of models for rapid data processing and new data generation.

So this is us converting human language into math. If we see an excerpt from Alice in Wonderland, we can convert it to binaries with UTF-8 encoding, grouping them into 8 bits. This process, combined with byte-pair encoding, results in a tokenized version of the human language, associating words with corresponding tokens.
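A minimal sketch of that pipeline: encode the text as UTF-8 bytes, then merge the most frequent adjacent pair into a new token, which is the core move of byte-pair encoding (a single merge step here; real tokenizers apply thousands of learned merges):

```python
from collections import Counter

def bpe_merge_once(ids):
    """One byte-pair-encoding step: merge the most frequent adjacent pair."""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids, None
    top = pairs.most_common(1)[0][0]
    new_id = max(ids) + 1                       # fresh token id for the merged pair
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
            merged.append(new_id)               # replace the pair with one token
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged, top

text = "Alice was beginning to get very tired"
ids = list(text.encode("utf-8"))                # human language -> bytes -> integers
shorter, merged_pair = bpe_merge_once(ids)
```

Repeating the merge step builds up a vocabulary where common character sequences become single tokens, which is why frequent words usually map to one token while rare words split into several.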

During training, the model adjusts its parameters to reflect patterns in the data it was trained on. This involves tweaking weights through an iterative process to reduce loss and make the model more representative of the dataset. The complexity lies in the billions of parameters models typically have, with frontier models like ChatGPT trained on trillions of tokens, creating an intricate vector embedding space.
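The adjust-parameters-to-reduce-loss loop, stripped down to a single weight fit by gradient descent (a deliberately tiny sketch, nothing like a real training run):

```python
# Fit y = w * x to toy data by gradient descent: the same
# adjust-parameters-to-reduce-loss loop, with one weight instead of billions.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # true relationship: y = 2x
w, lr = 0.0, 0.05                               # initial weight, learning rate

def loss(w):
    """Mean squared error of the current weight over the toy dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

start = loss(w)
for _ in range(200):
    # Gradient of the MSE with respect to w, computed analytically.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad                              # step against the gradient
end = loss(w)
```

A real training run does the same thing across billions of weights, with gradients computed by backpropagation over batches of tokenized text rather than a closed-form derivative.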

Inference involves putting models into production to process user inputs quickly and generate new data. An inference engine, like vLLM, is vital for this process, taking inputs such as a config, a tokenizer, and safetensors weight files. vLLM, supported by Red Hat, stands out for its user-friendly compatibility with the wide range of models available on Hugging Face.
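At its core, inference is asking the model for the next token over and over. In this toy sketch a hypothetical bigram lookup table stands in for the neural network, but the generate-one-token-at-a-time loop is the same job an engine like vLLM performs at scale:

```python
# Toy greedy decoding loop. NEXT is a made-up bigram table standing in for
# a trained model; a real engine would run the network to score every token.
NEXT = {
    "<s>": "alice",
    "alice": "was",
    "was": "beginning",
    "beginning": "to",
    "to": "get",
    "get": "tired",
}

def generate(prompt, max_tokens=10):
    """Greedily extend the prompt one token at a time."""
    tokens = [prompt]
    for _ in range(max_tokens):
        nxt = NEXT.get(tokens[-1])
        if nxt is None:                         # no continuation: stop generating
            break
        tokens.append(nxt)
    return tokens[1:]                           # drop the start symbol

out = generate("<s>")
```

Serving engines optimize exactly this loop, batching many users' requests together and caching intermediate results so each next-token step stays fast.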
