Building an AI-Powered Smart Contract Security Auditor: From Fine-Tuning to Deployment
--
How I fine-tuned a 7B LLM on the SmartBugs dataset and deployed a fully functional Solidity vulnerability detector — completely for free.
The Problem
Smart contract security is one of the most critical challenges in the blockchain industry. Since 2016, over $3 billion has been lost to smart contract exploits — reentrancy attacks, integer overflows, access control flaws. Most of these vulnerabilities are well-known, well-documented, and yet developers keep shipping them to production.
The traditional solution is manual auditing — expensive, slow, and not scalable. Tools like Slither and MythX help, but they’re either purely rule-based or require paid subscriptions.
I wanted to build something different: an AI auditor that understands code semantically, not just through regex patterns.
The Architecture
The system has three layers working together:
[Solidity Code]
│
▼
[Pattern Analysis] ←── Fast, always-on regex layer
│
▼
[LLM Analysis] ←── Fine-tuned DeepSeek-Coder-7B
│
▼
[Audit Report] ←── Risk level, vulnerabilities, recommendations

Why two layers? Pattern analysis is instant and reliable for known vulnerability signatures. The LLM adds semantic understanding — it can reason about why something is vulnerable and suggest specific fixes. The combination is more robust than either alone.
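The pattern layer is simple by design. Here is a minimal sketch of what such a regex layer can look like (the rules and names are illustrative, not the project's actual rule set); findings in this shape feed straight into the training-example builder shown later:

import re

# Illustrative vulnerability signatures (not the project's real rule set)
PATTERNS = [
    {'name': 'Low-level call with value (possible reentrancy)',
     'severity': 'HIGH',
     'regex': re.compile(r'\.call\{value:')},
    {'name': 'tx.origin authorization',
     'severity': 'MEDIUM',
     'regex': re.compile(r'tx\.origin')},
    {'name': 'Timestamp dependence',
     'severity': 'LOW',
     'regex': re.compile(r'block\.timestamp|\bnow\b')},
]

def pattern_scan(code: str) -> list:
    """Return a finding for every known signature present in the code."""
    return [{'name': p['name'], 'severity': p['severity']}
            for p in PATTERNS if p['regex'].search(code)]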
The Dataset: SmartBugs
I used the SmartBugs Curated Dataset — a collection of vulnerable Solidity contracts annotated with vulnerability categories. It contains contracts with real-world vulnerabilities including:
- Reentrancy (like the DAO hack)
- Integer overflow/underflow
- Unchecked external calls
- Access control issues
- Timestamp dependence
- tx.origin authorization
For each contract, I built a training example pairing the raw Solidity code with a structured audit report generated by the pattern analyzer. This gave me ~2,000 labeled examples.
def build_prompt(code, findings):
    if findings:
        report = '\n'.join(f"- [{f['severity']}] {f['name']}"
                           for f in findings)
        answer = (
            f"## Security Audit Report\n\n"
            f"⚠️ Vulnerabilities detected:\n{report}\n\n"
            f"Please review each finding and apply the recommended "
            f"fixes before deploying to production."
        )
    else:
        # No findings from the pattern layer: emit a clean report
        answer = "## Security Audit Report\n\n✅ No issues detected."
    return {'input': f"Analyze this contract:\n```solidity\n{code}\n```",
            'output': answer}

Fine-Tuning with LoRA
I chose DeepSeek-Coder-7B-Instruct as the base model. It’s specifically trained on code, understands Solidity syntax, and at 7B parameters fits comfortably on Google Colab Pro’s A100 GPU with 4-bit quantization.
LoRA (Low-Rank Adaptation) made fine-tuning feasible on a single GPU. Instead of updating all 7 billion parameters, LoRA adds small trainable matrices to the attention layers — reducing trainable parameters from 7B to just ~10M.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

Training took about 25 minutes on an A100 GPU with these settings:
- 5 epochs
- Batch size 4 with gradient accumulation
- bf16 precision (A100 native)
- Cosine learning rate schedule
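For reference, those settings map onto a standard Hugging Face TrainingArguments roughly like this (a sketch: the learning rate, accumulation steps, and output path are my assumptions, not values from the notebook):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='auditor-lora',        # assumed output path
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # assumed value
    learning_rate=2e-4,               # assumed; a common LoRA default
    lr_scheduler_type='cosine',
    bf16=True,                        # A100-native precision
    logging_steps=10,
)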
The Critical Step: Merging LoRA Weights
This is where I hit my first major issue. After training, I pushed the LoRA adapter to Hugging Face Hub — but the HF Inference API returned a 404: Cannot POST /models/....
The reason: LoRA adapters are differential weights — they require the base model to be loaded first. HF Inference API can’t handle this automatically for custom models.
The solution is to merge the adapter into the base model before pushing:
# ❌ What I pushed first (doesn't work with Inference API):
# adapter_config.json + adapter_model.safetensors

# ✅ What you need to push (complete merged model):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",  # CPU to avoid OOM during merge
)
model_with_lora = PeftModel.from_pretrained(base_model, checkpoint_path)
merged_model = model_with_lora.merge_and_unload()  # ← key step
merged_model.push_to_hub(repo_id)

Important: Do this on CPU, not GPU. After training, the GPU is nearly full. Loading the base model again (in fp16, without 4-bit quantization) on the same GPU causes OOM. Loading on CPU uses RAM instead.
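If the training objects are still alive in the same notebook, it also helps to free them explicitly before the merge (the variable names below are assumptions about what the training cell defined):

import gc
import torch

# Drop references to the training-time objects, then release cached GPU memory
del model, trainer  # names assumed from the training cell
gc.collect()
torch.cuda.empty_cache()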
Deployment: Fully Free Stack
Getting to zero cost required some creativity:
Training → Google Colab Pro
A100 GPU, ~25 minutes per training run. Not permanently free but affordable.
Model Storage → Hugging Face Hub
Free unlimited storage for public models.
API + UI → Hugging Face Spaces
This is where it gets interesting. HF Spaces provides free CPU containers. The 7B model with 4-bit quantization (~4GB) loads and runs inference on CPU — slowly (~3–5 minutes per request on CPU basic), but it works.
For production speed, I upgraded to a paid GPU Space, but the CPU version is fully functional for demos.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    low_cpu_mem_usage=True,
)

Lessons Learned
1. Always merge LoRA before deploying. The adapter-only approach works for local inference where you control the environment (see the sketch after this list), but any production API expects a complete model.
2. Free GPU tiers have hard memory limits. Colab Pro’s A100 has 40GB, which sounds like a lot until you’re loading a 13B model twice (once for training, once for merging). Always run del model; torch.cuda.empty_cache() before the merge step.
3. Pattern analysis is underrated. The regex-based layer catches ~80% of common vulnerabilities instantly, with zero latency and zero cost. The LLM adds value by explaining why something is vulnerable and suggesting fixes, but don’t underestimate simple pattern matching.
4. 7B > 13B for deployment. CodeLlama-13B produced slightly better audit reports, but the deployment friction was not worth it. DeepSeek-Coder-7B hits a sweet spot of quality, speed, and deployability.
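For contrast with the merged push earlier, this is roughly what the adapter-only route from lesson 1 looks like for local inference (a sketch: ADAPTER_REPO is a placeholder, and BASE_MODEL is the same base-model id used during training):

from peft import PeftModel
from transformers import AutoModelForCausalLM

ADAPTER_REPO = "you/your-lora-adapter"  # hypothetical adapter repo id

# Adapter-only loading: fine locally, where you pair the adapter with
# its base model yourself; the hosted Inference API won't do this for you
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(base, ADAPTER_REPO)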
Results
The deployed system can detect:
╔════════════════════════╦════════════════════╦════════════════╗
║ Vulnerability ║ Severity ║Detection Method║
╠════════════════════════╬════════════════════╬════════════════╣
║ Reentrancy ║ 🚨 HIGH ║ Pattern + LLM ║
║ Integer Overflow ║ 🚨 HIGH ║ Pattern + LLM ║
║ Unchecked Call Returns ║ ⚠️ MEDIUM ║ Pattern + LLM ║
║ tx.origin Authorization║ ⚠️ MEDIUM ║ Pattern + LLM ║
║ Missing Access Control ║ ⚠️ MEDIUM ║ Pattern + LLM ║
║ Timestamp Dependence ║ ℹ️ LOW ║ Pattern + LLM ║
╚════════════════════════╩════════════════════╩════════════════╝

For the classic reentrancy contract (VulnerableBank), the model correctly identifies the vulnerability and outputs:
“The withdraw function sends ETH before updating the balance. An attacker can recursively call withdraw() before the balance is decremented, draining the contract. Use the checks-effects-interactions pattern and consider OpenZeppelin’s ReentrancyGuard.”
Try It Yourself
The full project is available on Hugging Face Spaces: https://parsa2025ai-auditagent.hf.space
The complete codebase — training notebook, FastAPI backend, and frontend — is modular and reusable. All free to use.
What’s Next
- Expanding the training dataset with more recent exploits (Euler Finance, Ronin Bridge)
- Adding support for multi-file contract analysis
- Integrating with Foundry and Hardhat as a pre-deployment hook
- Exploring smaller, faster models (Phi-3, Qwen2.5-Coder-1.5B) for lower latency on CPU
If you found this useful, the model is at huggingface.co/Parsa2025AI/smart-contract-auditor. Feedback and contributions welcome.