As AI continues to transform how we interact with data, text-to-SQL models are emerging as powerful tools to bridge the gap between natural language and structured queries. Instead of manually writing SQL, imagine asking a model:
“What is the total revenue generated from organic products in California in the last six months?”
…and instantly receiving a well-formed SQL query!
This project was an exciting journey where I fine-tuned a large language model (LLM) to generate SQL queries from natural language. Throughout this process, I explored fine-tuning techniques, evaluation challenges, prompt engineering, and efficiency optimizations. In this blog, I will walk you through my approach, key concepts like LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), the hurdles I faced, and what I learned.
While general-purpose LLMs can generate SQL, they often struggle with:
🔹 Domain-Specific Queries — They lack industry-specific nuances.
🔹 Schema Awareness — Without knowledge of database structures, they may generate invalid queries.
🔹 Execution Accuracy — A query might be syntactically correct but logically incorrect.
Fine-tuning an LLM on structured datasets with diverse query types can significantly improve SQL generation accuracy, making the model more context-aware and adaptable to real-world use cases.
Model Selection: Why Unsloth’s Gemma-2-2B & LLaMA 3.2?
Since fine-tuning large models can be resource-intensive, I opted for Unsloth’s Gemma-2-2B and LLaMA 3.2 (4-bit quantized), which are optimized for efficient training. Key reasons:
- LLaMA 3.2 — A powerful open-weight LLM with strong reasoning capabilities.
- Lightweight yet powerful — Gemma-2-2B is manageable on consumer GPUs.
- Supports LoRA/QLoRA — Enabling fine-tuning with minimal memory consumption.
- Faster training & inference — Thanks to 4-bit quantization, making it feasible without expensive GPUs.
Dataset Overview
This project uses Gretel’s Synthetic Text-to-SQL dataset, the largest open-source dataset of its kind, containing over 100,000 high-quality synthetic Text-to-SQL samples. It’s designed to enhance AI model training by providing diverse and realistic SQL queries.
Dataset Structure
- sql_prompt — Natural language query (e.g., “Retrieve active defense contracts.”)
- sql_context — Schema information (e.g., “Table: contracts, Columns: contract_id, start_date, status”)
- sql — Expected SQL output
- sql_explanation — Explanation of SQL logic
Load and Format the Dataset:
- The Gretel synthetic text-to-SQL dataset is loaded using the load_dataset function.
- The formatting_prompts_func is applied to the dataset to format the prompts.
- The dataset is split into training and testing sets, with 80% of the data used for training and 20% for testing.
- The train_test_split method is used to perform the split, ensuring reproducibility with a fixed seed.
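The split step above can be sketched in plain Python (the real project uses Hugging Face’s load_dataset and the dataset’s built-in train_test_split; the toy records and seed value here are illustrative stand-ins):

```python
import random

# Toy stand-ins for Gretel records; real data comes from load_dataset(...).
records = [
    {"sql_prompt": f"question {i}", "sql_context": "Table: t",
     "sql": "SELECT 1;", "sql_explanation": "demo"}
    for i in range(100)
]

def train_test_split(data, test_size=0.2, seed=3407):
    """80/20 split with a fixed seed for reproducibility,
    mirroring the semantics of datasets' train_test_split."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(records)
print(len(train), len(test))  # 80 20
```

Fixing the seed means every run produces the same split, so evaluation numbers stay comparable across experiments.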
Formatting Prompts for Fine-Tuning
The alpaca_prompt template structures each training example to include the schema context, the natural-language question, the target SQL query, and its explanation. A formatting function applies this template to every record of Gretel’s synthetic text-to-SQL dataset before fine-tuning.
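As a rough illustration, here is what an Alpaca-style template and formatting function might look like (the exact template wording and field mapping in the project may differ):

```python
# Sketch of an Alpaca-style template; the project's exact wording may differ.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(example):
    """Fill the template with one Gretel record; the explanation is appended to the SQL."""
    text = alpaca_prompt.format(
        "Convert the question to SQL using the given schema.\n" + example["sql_context"],
        example["sql_prompt"],
        example["sql"] + "\n\nExplanation: " + example["sql_explanation"],
    )
    return {"text": text}

sample = {
    "sql_prompt": "Retrieve active defense contracts.",
    "sql_context": "Table: contracts, Columns: contract_id, start_date, status",
    "sql": "SELECT * FROM contracts WHERE status = 'active';",
    "sql_explanation": "Filters contracts whose status is active.",
}
print(formatting_prompts_func(sample)["text"])
```

Keeping the schema in the Instruction section and the question in the Input section gives the model a consistent place to look for each piece of context.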
LoRA (Low-Rank Adaptation) — The Key to Efficient Fine-Tuning
Instead of updating all model weights (which is computationally expensive), LoRA modifies only a subset of parameters. This reduces memory usage while still achieving fine-tuning benefits.
How it works:
🔹 LoRA adds small trainable adapter layers while freezing the pre-trained model.
🔹 Only low-rank updates are applied, making it much more efficient.
🔹 Drastically reduces VRAM requirements compared to full fine-tuning.
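The efficiency claim is easy to verify with back-of-the-envelope arithmetic. LoRA replaces the full weight update for a matrix W with a low-rank product B·A, so only B and A are trained. Assuming a hypothetical hidden size of 4096 and LoRA rank 16:

```python
# For a weight matrix W of shape (d, d), LoRA learns B (d x r) and A (r x d)
# and applies W + (alpha / r) * B @ A, keeping W itself frozen.
d, r = 4096, 16  # hypothetical hidden size and LoRA rank

full_update_params = d * d        # parameters touched by full fine-tuning
lora_params = d * r + r * d       # parameters LoRA actually trains

print(full_update_params)  # 16777216
print(lora_params)         # 131072
print(f"{lora_params / full_update_params:.2%} of the full update")  # 0.78%
```

Under one percent of the trainable parameters per matrix is why LoRA fits on consumer GPUs.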
QLoRA (Quantized LoRA) — Taking Efficiency Further
While LoRA is great, QLoRA takes it a step further by using 4-bit quantization. This means:
✅ Lower memory footprint — Model weights are compressed to 4-bit precision.
✅ Faster training & inference — Less computational overhead.
✅ Accuracy close to full-precision fine-tuning — thanks to precision-preserving optimizations such as NF4 quantization.
Using QLoRA, I fine-tuned the model on consumer-grade GPUs, making it highly cost-effective and scalable.
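For reference, here is roughly what the Unsloth setup looks like. This is a non-runnable configuration sketch: the model name, rank, and target modules are illustrative choices, and actually running it requires a GPU.

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model (QLoRA-style).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-bnb-4bit",  # illustrative model name
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```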
Structuring the Input Prompt
To get accurate SQL queries, I designed a structured prompt that pairs the database schema with the user’s question and asks the model to explain its answer. This ensures:
✅ The model understands the query intent.
✅ The database schema helps generate schema-aware SQL.
✅ The model explains the SQL, improving interpretability.
Example input-output:
Input:
“Find the total revenue from orders placed in the last 30 days.”
Generated SQL:
SELECT SUM(order_total)
FROM orders
WHERE order_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);
Explanation:
“This query calculates the total revenue by summing order_total for all orders placed within the last 30 days using DATE_SUB.”
This structured format significantly improved output coherence.
❌ Before Fine-Tuning:
The model generated SQL with missing conditions and incorrect table references.
✅ After Fine-Tuning:
- Queries were more accurate, with correct table joins.
- The model better understood database context and produced SQL closer to human-written queries.
Unlike classification tasks, evaluating SQL correctness is non-trivial. Here’s how I tackled it:
1️⃣ Exact Match Accuracy
- Compare the generated SQL with the ground-truth SQL.
- Works well for simple queries but fails when different SQL formulations are logically equivalent.
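A minimal sketch of exact-match scoring, with light normalization so pure formatting differences don’t count as mismatches (the normalization rules here are my own illustrative choices):

```python
import re

def normalize_sql(sql: str) -> str:
    """Crude normalization: lowercase, collapse whitespace, drop trailing semicolon."""
    sql = re.sub(r"\s+", " ", sql.strip().lower())
    return sql.rstrip(";").strip()

def exact_match(predicted: str, gold: str) -> bool:
    return normalize_sql(predicted) == normalize_sql(gold)

gold = "SELECT SUM(order_total) FROM orders;"
print(exact_match("select sum(order_total)\nFROM orders", gold))     # True: only formatting differs
print(exact_match("SELECT SUM(o.order_total) FROM orders o", gold))  # False, though logically equivalent
```

The second call shows the weakness: an aliased but logically identical query scores zero, which motivates the execution-based check below.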
2️⃣ Execution-Based Evaluation
- Run both generated SQL and expected SQL on a test database.
- Compare execution results instead of raw SQL text.
- More robust since different SQL formulations can yield the same results.
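Execution-based comparison can be sketched with Python’s built-in sqlite3 module (the toy schema and queries below are illustrative; the project’s test database may differ):

```python
import sqlite3

def execution_match(pred_sql: str, gold_sql: str, setup_sql: str) -> bool:
    """Run both queries against a throwaway in-memory database and compare result sets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    try:
        pred = sorted(conn.execute(pred_sql).fetchall())
        gold = sorted(conn.execute(gold_sql).fetchall())
        return pred == gold
    except sqlite3.Error:
        return False  # invalid SQL can never match
    finally:
        conn.close()

setup = """
CREATE TABLE orders (order_id INTEGER, order_total REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 15.5), (3, 4.5);
"""
# Differently written but logically equivalent queries agree under execution.
print(execution_match("SELECT SUM(order_total) FROM orders",
                      "SELECT TOTAL(order_total) FROM orders", setup))  # True
```

Sorting the fetched rows makes the comparison order-insensitive, which matters for queries without an ORDER BY clause.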
To make the model interactive, I built a Gradio-based UI where users can:
🔹 Enter a natural language query
🔹 Choose between the baseline & fine-tuned model
🔹 Receive SQL + Explanation instantly
This makes the model easily accessible, allowing users to test SQL generation accuracy in real-time!
🚀 QLoRA + 4-bit quantization = highly efficient fine-tuning with minimal computational cost.
📌 Prompt design matters — well-structured prompts significantly improved SQL accuracy.
⚡ Evaluation is non-trivial — correct SQL syntax ≠ correct execution, so query validation is crucial.
💡 Fine-tuned models outperform general-purpose LLMs on domain-specific SQL generation.
This project deepened my understanding of efficient fine-tuning techniques and practical AI model evaluation — and I’d love to discuss this with others working on LLMs, text-to-SQL, or fine-tuning!
The ability to fine-tune LLMs for specific tasks is a powerful skill in modern AI development. Text-to-SQL conversion is just one use case — these techniques can be extended to financial data extraction, legal document analysis, and more.
I hope this blog helps anyone exploring efficient LLM fine-tuning, Text-to-SQL models, and QLoRA techniques. Let me know your thoughts, and feel free to reach out if you’re working on similar projects! 🚀
#LLM #FineTuning #TextToSQL #QLoRA #AI #MachineLearning #Unsloth #LoRA #NLP #DataScience