June 16, 2026
We are nine software engineering students at the Egyptian Chinese University in Cairo. When we got our project brief, we noticed a gap that bothered us: Python is the most widely used language in AI development, yet almost every security tool out there was built for C and C++. The tools that do exist for Python rely on rule-based pattern matching, a technique that has not changed meaningfully in years.
So we built one ourselves.
We called it Code Security Identifier — CSI. Instead of matching patterns like existing tools, CSI understands code structure. We split the work across nine people, each owning a specific piece of the system: dataset engineering, model architecture, loss function design, adversarial training, hyperparameter optimization, and deployment. None of us had built a production security tool before. By the end, we had one running.
This post documents what we built, the decisions we made, the things that did not work, and what we learned.
The Problem: Python Security Is Underserved
Python powers 70% of AI workloads and 45% of enterprise backends. As AI-assisted code generation becomes standard practice, the volume of Python code being written and deployed is growing faster than any team can manually audit. GitHub Copilot, ChatGPT, and similar tools generate thousands of lines of Python daily. Much of it is never reviewed for security.
The tools that exist were not built for this reality. Bandit, the industry standard for Python static analysis, uses rule-based pattern matching over each file’s syntax tree. Its F1 score on vulnerability detection is approximately 0.62. F1 blends precision and recall, so it is not a literal catch rate, but as a rough intuition: at a recall of 0.62, Bandit would catch 62 of every 100 real vulnerabilities in a Python codebase and miss 38. In a production system handling user data, financial transactions, or infrastructure commands, every missed vulnerability is a potential exploit.
The deeper problem is architectural. Pattern-matching tools flag code that looks suspicious based on local patterns. They cannot trace how data flows through a program. They cannot reason about whether an untrusted input reaches a dangerous execution point. They catch obvious cases and miss subtle ones, which are exactly the cases that matter most in real-world exploits.
We set out to build something better.
Why Token-Only Models Also Fail
The first generation of deep learning approaches to vulnerability detection treated source code the same way NLP models treat text: as a sequence of tokens. Models like CodeBERT learn statistical co-occurrences. They learn that SELECT, WHERE, and execute appear together. They learn that os.system and subprocess appear near command-like strings. These patterns are suspicious. But suspicion is not detection.
What “Token Co-Occurrence” Actually Means
To be concrete: a token model doesn’t read code, it reads a flat sequence of sub-word units, the same way it would read a sentence. It has no built-in notion of “this token is a function argument,” “this token is the return value of that call,” or “this variable was assigned three lines up and is now being used here.” Everything is positional and statistical. During pretraining, the model learns that certain tokens tend to appear near other tokens — execute tends to follow strings that look like SQL, eval tends to appear near user-controlled-looking variable names, os.system tends to sit close to subprocess or shell=True. These are real correlations in code, and they give the model some signal. But a correlation between tokens is not the same as understanding what the code actually does with those tokens.
The Causal Chain a Vulnerability Actually Is
A real SQL injection vulnerability is not a collection of SQL-adjacent tokens. It is a causal chain: an untrusted value enters through a function parameter, passes through one or more assignments and string operations, and reaches a database execution call without sanitization. Concretely, that chain might look like: user_id arrives as a request parameter → it gets assigned to a local variable → that variable is interpolated into an f-string → the f-string is passed as the query argument to cursor.execute(). Each of these steps, on its own, is completely unremarkable Python. Assigning a variable is not dangerous. Building an f-string is not dangerous. Calling execute() is not dangerous. The vulnerability exists only in the connection between these steps: an untrusted value reaches a sensitive sink without anything sanitizing it along the way.
Token models cannot see this chain. They see the tokens at each step (user_id, f"…", cursor.execute) and they may even have learned that this combination of tokens is statistically associated with vulnerable code. But “statistically associated with” is not the same as “I can trace that this specific value, from this specific source, reaches this specific sink.” The model has no mechanism for following a variable across lines, across function boundaries, or through transformations. It is reasoning about which words appear, not what happens to the data.
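To make the chain concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, the payload, and the function names get_user_vulnerable and get_user_safe are illustrative assumptions, not code from the CSI dataset.

```python
import sqlite3

def make_db():
    # Tiny in-memory database with two users (illustrative data only).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return conn

def get_user_vulnerable(conn, user_id):
    uid = user_id                                        # plain assignment
    query = f"SELECT name FROM users WHERE id = {uid}"   # untrusted value interpolated
    return conn.execute(query).fetchall()                # tainted string reaches the sink

def get_user_safe(conn, user_id):
    # Same source, same sink, but the value travels as a bound parameter.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()

conn = make_db()
payload = "1 OR 1=1"
leaked = get_user_vulnerable(conn, payload)   # the injected condition matches every row
blocked = get_user_safe(conn, payload)        # the payload is compared as a literal value
```

Both functions contain nearly the same tokens. The only difference is the path the untrusted value takes to reach execute().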
Two Failure Modes, Same Root Cause
This single limitation — no data-flow reasoning — produces two distinct failure modes, and both are costly in a real security context.
The first is false positives on safe code. Plenty of legitimate, secure code uses tokens that a token model has learned to associate with danger. A function that builds a SQL query using parameterized queries (the correct, safe way to do it) still contains tokens like query, execute, and variable names that look like user input — because they often are user input, just handled safely via placeholders and bound parameters instead of string concatenation. A token model, lacking the ability to distinguish “this value is interpolated directly into the query string” from “this value is passed as a separate, escaped parameter,” may flag both patterns identically. In practice, this is exactly the kind of false positive that erodes trust in a security tool — if a scanner flags safe, well-written code as vulnerable often enough, developers start ignoring its output entirely, which defeats the purpose of having it.
The second, more dangerous failure mode is false negatives on real injections that don’t match the training distribution. The token-level patterns a model learns during pretraining are necessarily a reflection of the kinds of vulnerable code that were common in its training data: typical variable names like user_input, query, cmd, and typical function calls like os.system, eval, cursor.execute. But real-world code, especially AI-generated code, doesn’t always follow these conventions. A variable might be named x, payload, data_5, or something entirely project-specific. A dangerous sink might be wrapped in a thin custom helper function with an unfamiliar name that itself calls subprocess.run three layers down. If the surface tokens don’t match what the model has seen before, but the underlying data-flow path (untrusted input to dangerous sink) is identical to a thousand vulnerabilities the model has seen, a token model has no way to recognize that. It misses the pattern not because the vulnerability is novel, but because it was only ever looking at the wrong thing: the words, not the wiring between them.
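Both failure modes can be reproduced with a deliberately crude stand-in for a token-level detector. The keyword list and the function token_flag below are hypothetical. A real token model is statistical rather than a fixed keyword list, but it shares the same blind spot: it looks at surface tokens only.

```python
import re

# Hypothetical "suspicious token" list standing in for learned token correlations.
SUSPICIOUS = re.compile(r"\b(user_input|cursor\.execute|os\.system|eval)\b")

def token_flag(source: str) -> bool:
    """Flag code purely on which tokens appear, ignoring data flow."""
    return bool(SUSPICIOUS.search(source))

# Safe: the value is passed as a bound parameter, yet the tokens look dangerous.
safe_parameterized = '''
def get_user(cursor, user_input):
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_input,))
'''

# Vulnerable: untrusted x reaches a shell through a wrapper (assume helper
# eventually runs its argument as a command), but no familiar token appears.
vulnerable_renamed = '''
def run(x):
    helper(f"ping {x}")
'''

false_positive = token_flag(safe_parameterized)   # flagged although safe
false_negative = token_flag(vulnerable_renamed)   # missed although vulnerable
```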
Why This Matters for CSI’s Design
Both failure modes point to the same underlying gap: vulnerability detection is fundamentally a question about the paths data takes through a program, not about which words appear in the program’s source. A model that wants to close this gap needs access to something a flat token sequence cannot provide — an explicit representation of how data flows from one point in the code to another, independent of what the variables along that path happen to be named. This is exactly the gap GraphCodeBERT’s data-flow graph is designed to fill, and it’s the reason we built CSI around it rather than around a purely token-based model like CodeBERT.
Our Foundation: GraphCodeBERT
Microsoft’s GraphCodeBERT addresses the structural blindness of token models by parsing source code into three complementary graph representations and attending over all three simultaneously during pretraining.
Abstract Syntax Tree (AST). The AST captures the syntactic structure of code: how functions are defined, how expressions compose, how variables are declared relative to their scope. It gives the model a hierarchical view of code that token sequences cannot provide.
Data Flow Graph (DFG). The DFG is the critical representation for vulnerability detection. It traces how values propagate through a program: which variables receive which values, how those values are transformed, and where they ultimately flow. For an injection vulnerability, the DFG makes the taint path explicit: user_id → query → db.execute(). This is the path that token models cannot see.
Control Flow Graph (CFG). The CFG maps which code paths execute under which conditions. It captures branch logic, loops, and exception handlers: the execution context that determines whether a tainted value can actually reach a dangerous sink in practice.
Together, these three representations give GraphCodeBERT a structural understanding of code that enables meaningful vulnerability detection. For a SQL injection, it traces the full semantic chain from untrusted input through concatenation to database execution. Token models see three words at that point. GraphCodeBERT sees a taint flow.
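The taint path user_id → query → db.execute() can be recovered mechanically from source code. The sketch below is a toy, flow-insensitive version built on Python's stdlib ast module. It is a stand-in for illustration, not the extraction code GraphCodeBERT or CSI uses, and the helper names dataflow_edges, tainted_names, and reaches_sink are assumptions.

```python
import ast

def dataflow_edges(tree):
    """Collect (source_variable, target_variable) edges from simple assignments."""
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            sources = sorted({n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)})
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges.extend((src, target.id) for src in sources)
    return edges

def tainted_names(edges, entry):
    """Propagate taint from an entry variable along the edges until nothing changes."""
    tainted = {entry}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if src in tainted and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted

def reaches_sink(source, entry, sink="execute"):
    """True if a tainted name appears in the first argument of a .<sink>() call."""
    tree = ast.parse(source)
    tainted = tainted_names(dataflow_edges(tree), entry)
    for node in ast.walk(tree):
        is_sink = (isinstance(node, ast.Call)
                   and isinstance(node.func, ast.Attribute)
                   and node.func.attr == sink)
        if is_sink and node.args:
            names = {n.id for n in ast.walk(node.args[0]) if isinstance(n, ast.Name)}
            if names & tainted:
                return True
    return False

VULNERABLE = '''
def lookup(db, user_id):
    uid = user_id
    query = f"SELECT * FROM users WHERE id = {uid}"
    db.execute(query)
'''

SAFE = '''
def lookup(db, user_id):
    uid = user_id
    db.execute("SELECT * FROM users WHERE id = ?", (uid,))
'''
```

Calling reaches_sink(VULNERABLE, "user_id") follows user_id through uid into query and on to execute, while the parameterized variant keeps the tainted value out of the query string.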
Making It Trainable: LoRA Parameter Efficiency
GraphCodeBERT has roughly 124 million parameters. Full fine-tuning on a domain-specific dataset at this scale requires significant GPU memory, long training times, and a dataset large enough to update 124M parameters meaningfully without overfitting. We had approximately 4,000 training samples and access to Google Colab.
We applied LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that injects small trainable adapter matrices into the query, key, and value projection layers of each transformer attention block while keeping all backbone weights frozen. The adapter for a weight matrix W is parameterized as two low-rank matrices B and A, so the effective weight becomes W + (α/r) × BA, where (α/r) × BA is the learned update. With rank r=16 and scaling factor α=32, the number of trainable parameters drops from 124M to 2.07M (0.24% of the full model).
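The effective-weight formula is easy to check by hand on a toy matrix. The sketch below uses plain Python lists so the arithmetic stays visible. The helper names matmul, lora_effective_weight, and lora_param_count are illustrative, and the 2×2 example is far smaller than a real 768×768 projection.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, B, A, alpha, r):
    """Return W + (alpha / r) * (B @ A); W stays frozen, only B and A are trained."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters of one adapter: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight
B = [[1.0], [0.0]]             # d_out x r with r = 1
A = [[0.5, 0.0]]               # r x d_in
adapted = lora_effective_weight(W, B, A, alpha=2, r=1)

# With B at zero the adapter is a no-op, so the adapted layer starts out
# identical to the pretrained one.
untouched = lora_effective_weight(W, [[0.0], [0.0]], A, alpha=2, r=1)

per_matrix = lora_param_count(768, 768, 16)   # one adapted 768 x 768 projection
full_matrix = 768 * 768                       # the frozen projection itself
```

For a single 768×768 projection at r=16, the adapter trains 24,576 values against 589,824 in the frozen matrix.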
This is not a compromise on performance. The LoRA constraint actively prevents overfitting on small datasets by limiting the effective model capacity. Our CSI-GCB model achieved F1 = 0.7012 after 30 training epochs, with validation F1 improving monotonically across all epochs — no overfitting, no degradation. The parameter-efficient constraint was a feature, not a limitation.
| Metric | Value |
|---|---|
| Base model parameters | 124M |
| Trainable parameters (LoRA) | 2.07M (0.24%) |
| LoRA rank (r) | 16 |
| LoRA scaling factor (α) | 32 |
| Training epochs | 30 |
| Optimizer | AdamW, lr=2e-5 |
| Best validation F1 | 0.7012 |
Building the Dataset: Three Real-World Sources
We could not use existing C/C++ vulnerability datasets. Cross-language transfer from C/C++ to Python is problematic: graph structures differ, tokenization differs, and the vulnerability patterns that dominate C/C++ (buffer overflows, memory corruption) are largely irrelevant in Python. We needed Python-native training data.
We unified approximately 4,000 deduplicated Python functions from three complementary real-world sources.
Source 1: AI-Generated Vulnerable Code (121 records). A curated dataset of AI-generated Python functions, each labeled with its CWE identifier. Every record pairs a natural-language prompt with the insecure Python function produced by the AI model, covering 68 unique CWE types. This source directly targets the threat model motivating CSI: AI-assisted code generation introducing security vulnerabilities that no one audits.
Source 2: GitHub Security Commits (2,173 records). Commit-level vulnerability pairs extracted from real GitHub security fix commits. Pre-patch function = vulnerable, post-patch function = safe. Labels were verified using GPT-4 at approximately 94% accuracy, compared to 40–51% accuracy for automated commit-only labeling strategies. Our GPT-4 verification step was essential for training signal quality.
Source 3: Raw GitHub Diff Files (~300 records). Approximately 300 raw GitHub diff records across seven vulnerability types: XSS, SQL injection, command injection, open redirect, path disclosure, RCE, and XSRF. Incorporated as augmentation for underrepresented CWE categories.
One Critical Insight: Commit-Stratified Splitting
Random train/test splits leak information when applied to commit-level data. Functions from the same Git commit share context: the same bug fix, the same coding style, the same changeset patterns. Published research shows this inflates F1 scores by up to 40 percentage points. Our solution: entire commits assigned to a single partition. No commit ever split across train and test.
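A commit-stratified split can be sketched in a few lines: group functions by commit, shuffle the commits, and cut the commit list rather than the function list. The record layout (a dict with a "commit" key) and the function name commit_stratified_split are assumptions for illustration. Note that the ratios here apply to commits, so the function-level proportions are only approximate.

```python
import random
from collections import defaultdict

def commit_stratified_split(records, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole commits to train/val/test so no commit spans two partitions."""
    by_commit = defaultdict(list)
    for rec in records:
        by_commit[rec["commit"]].append(rec)

    commits = sorted(by_commit)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(commits)

    cut1 = int(len(commits) * ratios[0])
    cut2 = cut1 + int(len(commits) * ratios[1])
    groups = {"train": commits[:cut1], "val": commits[cut1:cut2], "test": commits[cut2:]}
    return {name: [r for c in cs for r in by_commit[c]] for name, cs in groups.items()}

# Hypothetical data: 20 commits with two functions each.
records = [{"commit": f"c{i}", "func": f"f{i}_{j}"} for i in range(20) for j in range(2)]
splits = commit_stratified_split(records)
```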
The Preprocessing Pipeline
Six stages transform raw data into model-ready tensors.
- Stage 1 — Parse and Unify: Each source normalized into a unified schema: source code, binary label, CWE identifier, provenance tag.
- Stage 2 — Label Encoding: CWE-to-integer mapping constructed. Categories with fewer than 5 samples discarded.
- Stage 3 — Negative Sampling: Safe samples drawn from post-patch functions and CodeSearchNet. Target ratio: 1:1 vulnerable-to-safe, correcting the natural 8:1 imbalance.
- Stage 4 — Class Weighting: Per-class weights via scikit-learn compute_class_weight. Positive weight pos_weight = n_neg/n_pos for the binary head.
- Stage 5 — Mid-Truncation Tokenization: Max 512 tokens. First 128 (function signature, entry logic) + last 384 (return statements, taint sinks) retained. Standard head truncation discards function tails — exactly where SQL and command injection sinks most commonly appear.
- Stage 6 — Commit-Stratified Split: 70/15/15 train/validation/test. All functions from the same commit in the same partition.
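Stages 4 and 5 above reduce to a few lines of arithmetic. This is a minimal sketch assuming tokens arrive as a plain list and labels as 0/1 integers. A real tokenizer also reserves positions for special tokens, which this ignores, and the function names mid_truncate and pos_weight are illustrative.

```python
def mid_truncate(tokens, max_len=512, head=128):
    """Keep the first `head` tokens and the last `max_len - head`, dropping the middle."""
    if len(tokens) <= max_len:
        return list(tokens)
    tail = max_len - head
    return list(tokens[:head]) + list(tokens[-tail:])

def pos_weight(labels):
    """Weight for the positive class in the binary head: n_neg / n_pos."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_neg / n_pos

tokens = list(range(1000))             # stand-in for a 1,000-token function
kept = mid_truncate(tokens)            # 128 head tokens + 384 tail tokens
weight = pos_weight([1, 0, 0, 0])      # three negatives per positive
```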
Data Augmentation
We applied two augmentation techniques to the training set.
Variable and function name normalization replaces all identifier tokens with abstract symbolic tokens (VAR_1, FUNC_1, etc.), adopted from the DetectVul preprocessing strategy. A SQL injection through a variable named user_input and one through a variable named x are the same vulnerability. The model should treat them identically.
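A rough version of this normalization can be written with the stdlib ast module (Python 3.9+ for ast.unparse). This is a sketch of the idea, not the DetectVul or CSI implementation: it renames function names, parameters, and locally assigned variables, and leaves attribute names, builtins, and other free names alone.

```python
import ast

class Normalizer(ast.NodeTransformer):
    """Rename functions to FUNC_n and local variables to VAR_n, in order of appearance."""

    def __init__(self):
        self.vars, self.funcs = {}, {}

    def _var(self, name):
        return self.vars.setdefault(name, f"VAR_{len(self.vars) + 1}")

    def visit_FunctionDef(self, node):
        node.name = self.funcs.setdefault(node.name, f"FUNC_{len(self.funcs) + 1}")
        self.generic_visit(node)          # visits parameters first, then the body
        return node

    def visit_arg(self, node):
        node.arg = self._var(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names being assigned, and later uses of already-renamed names.
        if isinstance(node.ctx, ast.Store) or node.id in self.vars:
            node.id = self._var(node.id)
        return node

def normalize(source: str) -> str:
    tree = ast.parse(source)
    Normalizer().visit(tree)
    return ast.unparse(tree)

a = "def lookup(db, user_input):\n    q = 'SELECT ' + user_input\n    db.execute(q)\n"
b = "def fetch(conn, x):\n    s = 'SELECT ' + x\n    conn.execute(s)\n"
same = normalize(a) == normalize(b)   # the two spellings collapse to one form
```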
Dead-code insertion and minor refactoring variants of each vulnerable function were generated to increase intra-class diversity. This was motivated by a known failure mode in GNN-based detectors: models trained to distinguish vulnerable code from its fixed version perform near-randomly because security patches introduce minimal code differences. Increasing intra-class diversity forces the model to learn structural patterns rather than diff signatures.
The Architecture: Dual-Encoder Fusion
The architecture decision was one of the first major forks in the project, and it shaped almost everything that came after it. Early on, we had to decide: do we build one encoder that does everything, or do we combine two encoders that each bring something the other lacks? We went with the second option, but getting there — and then getting the two halves to actually work together — took several iterations.
Why Two Encoders At All
The case for a single encoder is simplicity: one model, one set of weights to fine-tune, fewer moving parts to debug. But GraphCodeBERT and VulBERTa are good at fundamentally different things. GraphCodeBERT understands structure — how data moves through a function, how control flow branches, how an AST is shaped. VulBERTa understands vulnerability semantics — it was pre-trained exclusively on NVD entries and CVE-linked code, so it has effectively memorized what dangerous code idioms look like: unsanitized input patterns, risky function calls, structures that resemble known CVEs.
A function can be structurally unremarkable — a simple, shallow control flow, nothing exotic in its data flow graph — and still be dangerous because of what it does with a specific input. Conversely, a function can have a complex data flow graph and still be perfectly safe. Structure alone doesn’t tell you “this looks like a CVE I’ve seen before,” and vulnerability-pattern memorization alone doesn’t tell you “this input actually reaches this sink.” We wanted both signals available to the classification heads simultaneously, which meant running both encoders on every input and combining their outputs — rather than picking one.
Encoder 1: GraphCodeBERT + LoRA
The first encoder is GraphCodeBERT, adapted with LoRA as described earlier. For every input function, GraphCodeBERT processes two things at once: the tokenized source code itself, and the data-flow graph edges extracted from the function’s AST. Internally, its attention layers attend across both — a token can attend not just to nearby tokens in the sequence, but to other tokens it has a data-flow relationship with, even if they’re far apart in the raw text. This is what lets the model “see” that a variable assigned on line 3 flows into a database call on line 40, even though those two lines are nowhere near each other as tokens.
The LoRA adapters sit on the query, key, and value projection matrices of every attention layer in this encoder. Everything else in GraphCodeBERT is frozen. After the full forward pass, we take the per-token output and apply mean pooling across the sequence dimension — averaging every token’s final representation into a single vector. The result is a 768-dimensional embedding that we think of as the structural signal: it encodes how this specific function is built, how its data moves, and how its control flow is organized.
Encoder 2: VulBERTa (Frozen)
The second encoder is VulBERTa, used as a fixed feature extractor. Unlike GraphCodeBERT, VulBERTa receives no adapters and no gradient updates at all — its weights are exactly as they came from pretraining on NVD and CVE-linked code. We made this choice deliberately: VulBERTa’s value to us is precisely the vulnerability-domain knowledge baked into its pretraining, and fine-tuning it on our comparatively small dataset risked overwriting that knowledge faster than it could learn anything useful from 4,000 samples — a classic catastrophic forgetting problem.
For every input function, VulBERTa runs its own tokenization (it uses RoBERTa’s BPE tokenizer, separate from GraphCodeBERT’s tokenizer — these are two different views of the same source code, tokenized differently) and produces a sequence of hidden states. Rather than mean pooling, we take the CLS token’s final representation — the standard approach for classification-style embeddings in BERT-family models — giving us a second 768-dimensional embedding. We think of this as the vulnerability-domain signal: it encodes how similar this function “feels” to the vulnerable and CVE-linked code VulBERTa was trained on, independent of the function’s own internal structure.
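The two pooling strategies differ only in how a sequence of per-token vectors is reduced to one vector. The sketch below uses plain lists. Excluding padding positions through an attention mask is an assumption on our part here, since the text above only says the tokens are averaged.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of real (non-padding) tokens, dimension by dimension."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(hidden_states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cls_pool(hidden_states):
    """Use the first position (the CLS token) as the sequence embedding."""
    return list(hidden_states[0])

# Toy sequence: three positions, two dimensions, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

structural = mean_pool(hidden, mask)   # how the GraphCodeBERT output is reduced
domain = cls_pool(hidden)              # how the VulBERTa output is reduced
```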
Fusion Layer: Combining Two 768-Dimensional Views
At this point we have two 768-dimensional vectors describing the same input function from two different angles. The fusion layer’s job is to combine them into a single representation that downstream classification heads can use.
The simplest possible approach — and the one we settled on — is concatenation: stack the two 768-dimensional vectors end to end into a single 1,536-dimensional vector. We considered alternatives (element-wise addition, learned gating, cross-attention between the two embeddings) but concatenation has one major advantage: it loses no information. Addition or gating require the two vectors to already be in a compatible space, which they aren’t — they come from different models with different pretraining objectives. Concatenation defers that reconciliation to a layer that’s actually trained for it.
That reconciliation happens in the FusionProjectionLayer immediately after concatenation: a Linear layer projects the 1,536-dimensional concatenated vector back down to 768 dimensions, followed by LayerNorm, a GELU activation, and Dropout. This is the layer that actually learns how to weigh and combine the structural signal from GraphCodeBERT against the vulnerability-domain signal from VulBERTa — effectively learning, per-feature, how much to trust each encoder’s contribution. The output is a single 768-dimensional fused representation that both downstream heads consume.
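The sequence concatenate → Linear → LayerNorm → GELU can be traced on toy vectors. The sketch below is framework-free so each step is explicit. It uses 4-dimensional embeddings instead of 768, omits LayerNorm's learnable scale and shift, and leaves out Dropout (which is an identity at inference time). The function names are illustrative, not the FusionProjectionLayer code itself.

```python
import math
import random

def linear(x, weight, bias):
    """y = W x + b, with W given as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable scale/shift here)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]

def fuse(structural, domain, weight, bias):
    concatenated = list(structural) + list(domain)   # 2d-dimensional
    return gelu(layer_norm(linear(concatenated, weight, bias)))

rng = random.Random(0)
d = 4                                                # toy size; the post uses 768
structural = [rng.gauss(0, 1) for _ in range(d)]
domain = [rng.gauss(0, 1) for _ in range(d)]
weight = [[rng.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
bias = [0.0] * d

fused = fuse(structural, domain, weight, bias)       # back down to d dimensions
```

Each output feature is a learned weighted sum over both encoders' features, which is where the per-feature weighting of the two signals happens.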
Classification Heads: Two Tasks, One Shared Representation
The fused representation feeds two separate heads, trained jointly.
The Binary Head is intentionally minimal: a single Linear(768 → 1) layer followed by a sigmoid, producing a vulnerable/safe probability. We kept this head simple because the binary task is, relatively speaking, the easier of the two. Most of the discriminative work needed for “is this vulnerable at all” is already present in the fused representation, and adding more layers here mainly risked overfitting on a task that didn’t need it.
The CWE Head is deliberately deeper: Linear(768 → 384) → GELU → Dropout → Linear(384 → 8). Classifying which of seven CWE categories a vulnerability belongs to (plus an eighth “unknown” class) is a harder, more fine-grained task than the binary one. It requires distinguishing between vulnerability types that can share a lot of surface-level structure: an XSS and a command injection can look superficially similar in terms of “untrusted input flows somewhere dangerous,” but the kind of dangerous matters for classification. The extra hidden layer gives the head room to learn these finer distinctions from the same shared representation, without needing a separate encoder pass.
One detail that mattered in practice: for safe samples, the CWE label is set to −1 and masked out of the CWE loss entirely. A safe function has no CWE to predict, and including it in the CWE loss with some placeholder label would inject noise into a head that’s already working with a smaller, more imbalanced label space than the binary head. Masking keeps the CWE head’s gradient signal coming only from samples where a CWE label is actually meaningful.
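The masking can be sketched as a cross-entropy that simply skips samples labeled −1. The function names below are illustrative. In PyTorch, the same effect is available through the ignore_index argument of CrossEntropyLoss.

```python
import math

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def masked_cwe_loss(batch_logits, cwe_labels, ignore_index=-1):
    """Average the CWE loss over vulnerable samples only; safe samples contribute nothing."""
    losses = [cross_entropy(lg, y)
              for lg, y in zip(batch_logits, cwe_labels)
              if y != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0

# Two samples: the first is vulnerable with CWE class 0, the second is safe (-1).
logits = [[2.0, 0.0], [0.0, 5.0]]
labels = [0, -1]
loss = masked_cwe_loss(logits, labels)            # only the first sample counts
all_safe = masked_cwe_loss(logits, [-1, -1])      # a batch with no CWE labels
```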
Where the Two-Head Design Came From
This two-head structure wasn’t the original plan — early versions of the architecture experimented with a single combined output space (CWE categories plus an explicit “safe” class, predicted by one head). We moved to separate binary and CWE heads after running into a familiar problem: a single combined classifier tends to behave like a generalist binary detector with poor sensitivity to specific weakness types, because the “safe” class dominates the label distribution and pulls the decision boundary toward itself. Splitting the binary detection objective from the CWE classification objective let each head specialize — one for the broad “is this dangerous” question, one for the fine-grained “what kind of dangerous” question — while still sharing the same upstream encoders and fusion layer, so neither head requires its own separate feature extraction.
Training Strategy: Composite Loss and Adversarial Training
Getting a model to F1 = 0.7012 on a 4,000-sample dataset with eight unevenly distributed classes is not a single-loss problem. A model trained with plain cross-entropy on this data converges quickly to predicting the dominant classes and essentially ignores rare CWE categories — the validation F1 looks acceptable on paper while the model is functionally blind to the vulnerability types that matter most. We addressed this by combining four loss components, each solving a different failure mode we hit during early experiments.
Focal Loss: Fixing the Class Imbalance Problem
Our CWE distribution is heavily skewed — some categories have hundreds of examples, others barely clear the five-sample minimum from Stage 2 of preprocessing. With standard cross-entropy, the gradient signal from the dominant classes drowns out the rare ones, and the model learns to be “confidently correct” on easy majority-class examples while never improving on the hard minority-class examples.
Focal Loss adds a modulating term, (1 − p_t)^γ, to the standard cross-entropy loss. When the model is already confident and correct on an example (p_t close to 1), this term shrinks toward zero and the loss contribution from that example is suppressed. When the model is wrong or uncertain (p_t low), the term stays close to 1 and the full loss applies. In practice, this redirects the gradient budget toward the examples the model is actually struggling with — which, in our case, were almost always the underrepresented CWE categories. We ran an ablation over γ ∈ {1, 2, 3} to find the value that best balanced this trade-off without destabilizing training on the majority classes.
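The effect of the modulating term is easiest to see on single examples. This is a minimal sketch for one sample, where p_t is the probability the model assigns to the true class. Per-class weighting is left out for brevity.

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_ce = focal_loss(0.9, gamma=0.0)      # gamma = 0 recovers plain cross-entropy
easy_focal = focal_loss(0.9, gamma=2.0)   # confident and correct: loss scaled by 0.01
hard_ce = focal_loss(0.1, gamma=0.0)
hard_focal = focal_loss(0.1, gamma=2.0)   # wrong or uncertain: loss scaled by 0.81
```

At γ = 2, the easy example keeps 1% of its cross-entropy loss while the hard example keeps 81%, which is how the gradient budget shifts toward hard cases.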
SCL-CVD: Making the Embedding Space Class-Aware
Focal Loss fixes the gradient imbalance, but it doesn’t directly address a separate problem: two functions with the same CWE label can look extremely different at the token level, while two functions with different CWE labels can look superficially similar (a few lines of diff apart). Without an explicit signal to organize the embedding space, the classification head has to do all the work of separating classes from a representation that wasn’t built with that goal in mind.
Supervised Contrastive Learning for Code Vulnerability Detection (SCL-CVD) adds a second objective directly on the embeddings, before the classification heads. For every anchor sample in a batch, it pulls embeddings from the same CWE class closer together in representation space and pushes embeddings from different classes apart, using a temperature parameter (tau) to control how sharply similarity is weighted. The result is an embedding space where same-class functions cluster together even if their surface code differs substantially, and where the classification head’s decision boundaries become easier to learn because the classes are already partially separated upstream.
This was one of the more iteration-heavy components of the project. We ran a direct SCL vs. no-SCL F1 comparison to confirm the contrastive objective was actually helping (it was), then separately tuned the temperature tau and the SCL loss weight alpha — two parameters that interact with each other and with the rest of the composite loss in non-obvious ways. Too high a weight on SCL and the model over-prioritizes embedding geometry at the expense of classification accuracy; too low and the contrastive signal gets lost in the noise of the other three losses.
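For reference, here is a sketch of the standard supervised contrastive loss on a toy batch. It follows the common formulation (L2-normalized embeddings, similarities divided by tau, positives averaged per anchor) and is not necessarily the exact SCL-CVD variant used in CSI. The name sup_con_loss and the 2-dimensional embeddings are illustrative.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sup_con_loss(embeddings, labels, tau=0.1):
    """Pull same-label embeddings together, push different-label ones apart."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue                      # anchors without a positive are skipped
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
                for j in range(n) if j != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = sup_con_loss(embeddings, [0, 0, 1, 1])   # labels agree with geometry
scattered = sup_con_loss(embeddings, [0, 1, 0, 1])   # labels cut across geometry
```

A lower tau sharpens the softmax over similarities, so the loss concentrates on the hardest negatives.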
R-Drop: Consistency Under Dropout
Dropout is applied during training for regularization, but it introduces a subtle problem: the same input, passed through the model twice, can produce noticeably different output distributions depending on which neurons happen to be dropped each time. For a model that needs to make confident, stable predictions about whether a specific line of code is vulnerable, this stochasticity is undesirable — it means the model’s “opinion” about a given function can shift run to run without any change to the input.
R-Drop addresses this directly. Each training sample is passed through the model twice, with two independent dropout masks, producing two output distributions. The loss adds a KL divergence term between these two distributions, on top of the standard task loss. This forces the model to produce consistent predictions regardless of which dropout mask is active — effectively training the model to be robust to its own regularization noise. We tested this on small batches first to confirm the KL term behaved as expected (it shouldn’t dominate the loss or collapse the distributions to a degenerate point) before integrating it into the full training loop.
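The consistency term itself is a symmetric KL divergence between the two output distributions. The sketch below takes the two distributions as given rather than running a model twice, and the weight parameter and function names are assumptions for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def r_drop_term(p1, p2):
    """Symmetric KL between the outputs of two dropout passes."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))

def r_drop_loss(task_loss_1, task_loss_2, p1, p2, weight=1.0):
    """Average task loss over both passes plus the weighted consistency term."""
    return 0.5 * (task_loss_1 + task_loss_2) + weight * r_drop_term(p1, p2)

pass_a = [0.70, 0.30]   # prediction under the first dropout mask
pass_b = [0.55, 0.45]   # prediction under the second dropout mask

disagreement = r_drop_term(pass_a, pass_b)   # positive when the passes disagree
agreement = r_drop_term(pass_a, pass_a)      # zero when they match
```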
EDAT: Adversarial Robustness with Syntactic Guarantees
The fourth component, Embedding-Disturbed Adversarial Training (EDAT), was the most involved to build, and the one that touched the most hands on the team.
The motivation is straightforward: vulnerability detection models are often brittle to small, semantically meaningless changes in code — renaming a variable, adding a comment, reordering independent statements. A model that flips its prediction because of a cosmetic change isn’t actually reasoning about the vulnerability; it’s keying on superficial patterns. EDAT trains against this by generating adversarial perturbations in the model’s embedding space using Projected Gradient Descent (PGD) — small, gradient-directed nudges to the embedding designed to push the model toward a wrong prediction — and then training the model to be robust to those nudges via an adversarial KL loss between the clean and perturbed predictions.
The catch is that perturbations applied carelessly in embedding space can correspond to nothing — there’s no guarantee that a perturbed embedding still maps back to anything resembling valid Python. That’s where AST constraint checking comes in: before a perturbed sample is accepted into the adversarial training loop, it’s checked against AST-level constraints to ensure the perturbation corresponds to a syntactically valid transformation, not an out-of-distribution artifact the model could “cheat” against.
Building this pipeline required several distinct pieces working together: tree-sitter-based identifier extraction to identify which tokens in a function are safe to perturb without breaking syntax, the AST constraint checker itself, the PGD perturbation loop, the adversarial KL loss term, and a tunable epsilon controlling the perturbation magnitude. Each of these had to be validated independently on small batches before being wired into the full training run, because a bug in any one component (a perturbation that’s too large, an AST check that’s too permissive, an epsilon that’s miscalibrated) can silently degrade training without throwing an error — the model still trains, it just gets worse, and that’s much harder to debug than a crash.
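The PGD loop at the center of this pipeline can be illustrated on a toy model. The sketch below attacks the embedding input of a simple logistic classifier, where the gradient has a closed form. It is an assumption-laden miniature: the real pipeline perturbs transformer embeddings and adds the AST constraint check and adversarial KL loss, neither of which is shown. The step size, epsilon, and all names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(w, embedding, y):
    """Binary cross-entropy of a logistic classifier with weights w."""
    p = sigmoid(dot(w, embedding))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def sign(g):
    return (g > 0) - (g < 0)

def pgd_perturb(w, embedding, y, epsilon=0.1, step=0.03, steps=5):
    """Take signed gradient-ascent steps on the loss, projecting back into the
    L-infinity ball of radius epsilon after each step."""
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        perturbed = [e + d for e, d in zip(embedding, delta)]
        p = sigmoid(dot(w, perturbed))
        grad = [(p - y) * wi for wi in w]           # d(loss)/d(embedding) for this model
        delta = [d + step * sign(g) for d, g in zip(delta, grad)]
        delta = [max(-epsilon, min(epsilon, d)) for d in delta]   # projection
    return [e + d for e, d in zip(embedding, delta)]

w = [1.0, -2.0, 0.5]
clean = [0.2, 0.1, -0.3]
adversarial = pgd_perturb(w, clean, y=1)

clean_loss = bce_loss(w, clean, 1)
adversarial_loss = bce_loss(w, adversarial, 1)      # higher: the nudge hurts the model
```

Epsilon bounds how far any single embedding coordinate may move, which is why a miscalibrated value can quietly change what the model is trained against.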
Who Built What
This was genuinely distributed work, and each piece depended on the one before it. Hend Elhout built the tree-sitter identifier extraction that underpins EDAT — the foundation everything else in the adversarial pipeline depends on. Jomana Mekheimar implemented the AST constraint checking on top of that, and later ran the full EDAT training and the EDAT vs. no-EDAT F1 comparison that validated the whole approach was worth the added complexity. Menna Reda built the PGD perturbation loop and the adversarial KL loss, and ran the small-batch tests that caught early calibration issues. Youstina Adel tuned the epsilon parameter and did the final integration of EDAT into the main training loop. Separately, Farida Hassan implemented and tested both SCL-CVD and R-Drop, and later ran the full GraphCodeBERT training run that produced our final F1 = 0.7012 result.
The composite loss that resulted from all of this — Focal Loss, SCL-CVD, R-Drop, and EDAT, combined and weighted together — was the single largest factor separating our early-epoch results (F1 ≈ 0.54) from the final 0.7012. No individual component alone got us there; it was the combination, tuned iteratively, that did.
Results
| Model | Method | Language | F1 | Notes |
|---|---|---|---|---|
| Bandit | Rule-based pattern matching | Python | ~0.62 | Industry standard |
| DetectVul | Dual-BERT, full fine-tune | Python | 0.7447 | Prior SOTA |
| CSI-GCB | GraphCodeBERT + LoRA | Python | 0.7012 | 0.24% of params trained |
| CSI-Dual | GCB + VulBERTa fusion | Python | 0.6630 | Faster early convergence |
CSI-GCB outperforms Bandit by about 8 F1 points (0.7012 vs. ~0.62), a relative improvement of roughly 13%. The single-encoder LoRA model outperformed the dual-encoder fusion model by 3.82 points. CSI-Dual showed faster early convergence (F1 = 0.6099 at epoch 1 vs. ~0.541 for CSI-GCB) but plateaued earlier, which we attribute to frozen VulBERTa being unable to adapt to our seven-class CWE taxonomy. At ~4,000 training samples, the fusion layer’s complexity outweighs its gains. We read this as a data scale problem, not an architecture problem.
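Because F1 is the harmonic mean of precision and recall, the same score can come from quite different detection rates, which is worth keeping in mind when comparing rows in the table. The sketch below recomputes the gaps from the reported scores. The precision/recall pairs are hypothetical, chosen only to show two ways of reaching F1 ≈ 0.62.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.62, 0.62)      # hypothetical: precision and recall both 0.62
skewed = f1_score(0.50, 0.8158)      # hypothetical: lower precision, higher recall

bandit_f1, csi_gcb_f1, csi_dual_f1 = 0.62, 0.7012, 0.6630   # scores from the table

gap_vs_bandit = (csi_gcb_f1 - bandit_f1) * 100      # absolute gap in F1 points
relative_gain = (csi_gcb_f1 - bandit_f1) / bandit_f1
gap_vs_dual = (csi_gcb_f1 - csi_dual_f1) * 100
```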
Why This Matters
For security teams: a tool that traces taint flows rather than flagging token patterns, catching injections that Bandit misses entirely.
For ML practitioners: validation that LoRA is viable for production security tasks. 0.24% of parameters updated, F1 = 0.7012. Meaningful security tooling without large GPU infrastructure.
For researchers: a reproducible, Python-specific, multi-task vulnerability detection baseline demonstrating that parameter-efficient single-encoder fine-tuning can approach full fine-tuning performance at significantly lower compute cost.
| Metric | Result |
|---|---|
| Training Time | ~30 epochs on A100 |
| Trainable Parameters | 2.07M (0.24%) |
| F1 Score | 0.7012 |
Building the Dataset: Three Sources, One Pipeline
We unified approximately 4,000 deduplicated Python functions from three real-world sources:
- AI-Generated Code (121 records): GPT-assisted Python functions labeled with CWE identifiers, spanning 68 unique CWE types. Primary training signal for the CWE classification head.
- GitHub Security Commits (2,173 records): Real commit-level vulnerability pairs where pre-patch = vulnerable and post-patch = safe, verified by GPT-4 at ~94% label accuracy — vs. 40–51% for automated commit-only labeling.
- Raw Diff Files (~300 records): GitHub diffs across seven vulnerability types: XSS, SQL injection, command injection, open redirect, path disclosure, RCE, and XSRF. Used as augmentation for underrepresented CWE categories.
One Critical Insight: Commit-Stratified Evaluation
Random data splits leak information. Functions in the same Git commit share context. If the model sees part of a commit during training and the rest during testing, it learns commit-specific signatures — metrics inflate by up to 40 percentage points. Solution: entire commits are assigned to either train or test, never split across partitions.
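A minimal sketch of a commit-level split, assuming each record carries a commit identifier. The `"commit"` and `"func"` keys, the function name, and the 25% test fraction are illustrative choices, not our actual pipeline code. The point is only that the shuffle happens over commits, not over individual functions.

```python
import random
from collections import defaultdict


def commit_stratified_split(records, test_fraction=0.2, seed=0):
    """Split records so that every commit lands wholly in train or in test.

    `records` is a list of dicts with a (hypothetical) "commit" key.
    The fraction applies to the number of commits, not the number of records.
    """
    by_commit = defaultdict(list)
    for record in records:
        by_commit[record["commit"]].append(record)

    commits = sorted(by_commit)  # fixed order first, so the seed is reproducible
    random.Random(seed).shuffle(commits)

    n_test = max(1, round(len(commits) * test_fraction))
    test_commits = set(commits[:n_test])

    train = [r for c in commits if c not in test_commits for r in by_commit[c]]
    test = [r for c in commits if c in test_commits for r in by_commit[c]]
    return train, test


# Toy data: four commits with a varying number of functions each.
records = [
    {"commit": "a1", "func": "parse_args"},
    {"commit": "a1", "func": "run_query"},
    {"commit": "b2", "func": "render_page"},
    {"commit": "c3", "func": "load_config"},
    {"commit": "c3", "func": "open_file"},
    {"commit": "c3", "func": "exec_cmd"},
    {"commit": "d4", "func": "redirect"},
    {"commit": "d4", "func": "check_token"},
]

train, test = commit_stratified_split(records, test_fraction=0.25, seed=42)
train_commits = {r["commit"] for r in train}
test_commits = {r["commit"] for r in test}

# No commit may appear on both sides of the split.
assert not train_commits & test_commits
print("train commits:", sorted(train_commits))
print("test commits:", sorted(test_commits))
```

Because commits differ in size, the record-level ratio will drift from the requested fraction. On a small dataset it is worth checking the resulting class balance after splitting.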
The Architecture: Dual-Encoder Fusion
CSI runs two encoders in parallel on every input:
- GraphCodeBERT + LoRA: captures AST structure, data flow between variables, and token relationships. Outputs a 768-dimensional embedding.
- VulBERTa (frozen): a RoBERTa model pre-trained exclusively on NVD entries and CVE-linked code. Captures dangerous code idioms, unsanitized input patterns, and similarity to known vulnerable code. Outputs a 768-dimensional embedding.
Both embeddings are concatenated into a 1,536-dimensional vector, passed through a FusionProjectionLayer (Linear → LayerNorm → GELU → Dropout), then routed to two task heads: a binary head (Linear 1536→1) predicting vulnerable vs. safe, and a CWE head (Linear 1536→384 → GELU → Dropout → Linear 384→8) classifying across seven CWE categories plus unknown.
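To make the dimensions concrete, here is a small shape-only trace of the layers just described. It models no weights and is not the training code. The `Linear` stand-in and the `trace` helper are hypothetical, and the projection width is an assumption: we take the FusionProjectionLayer to keep the vector at 1,536 features, since both heads are described as taking 1,536 inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    """Stand-in for a dense layer; only the shapes are modeled."""
    in_features: int
    out_features: int


# Element-wise or normalizing layers keep the feature dimension unchanged.
SHAPE_PRESERVING = {"LayerNorm", "GELU", "Dropout"}


def trace(in_dim, layers):
    """Walk a layer list and return the output width, checking each Linear input."""
    dim = in_dim
    for layer in layers:
        if isinstance(layer, Linear):
            if layer.in_features != dim:
                raise ValueError(f"expected {layer.in_features} features, got {dim}")
            dim = layer.out_features
        elif layer not in SHAPE_PRESERVING:
            raise ValueError(f"unknown layer: {layer}")
    return dim


EMBED_DIM = 768            # per-encoder embedding width, from the text
fused_dim = 2 * EMBED_DIM  # concatenation of the two encoder outputs

# Assumption: the projection preserves the 1,536-wide vector.
fusion = [Linear(fused_dim, fused_dim), "LayerNorm", "GELU", "Dropout"]
binary_head = [Linear(1536, 1)]
cwe_head = [Linear(1536, 384), "GELU", "Dropout", Linear(384, 8)]

projected = trace(fused_dim, fusion)
print("fused width:", fused_dim)
print("binary logits:", trace(projected, binary_head))
print("CWE logits:", trace(projected, cwe_head))
```

The eight CWE logits correspond to the seven CWE categories plus the unknown class.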
Training used a composite loss: Focal Loss for class imbalance, SCL-CVD for intra-class compactness, R-Drop for output consistency, and EDAT adversarial perturbation on embeddings with AST constraint checking to preserve program semantics.
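Two of these terms are simple enough to sketch in plain Python for a single example. The code below follows the standard published definitions of binary focal loss and the R-Drop symmetric KL term. It is a simplified illustration, not our training code: the `alpha`, `gamma`, and `w_rdrop` defaults are assumptions, and the SCL-CVD and EDAT terms are left out entirely.

```python
import math


def focal_loss(p_vulnerable, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p_vulnerable: predicted probability of the vulnerable class, in (0, 1).
    label: 1 for vulnerable, 0 for safe.
    """
    p_t = p_vulnerable if label == 1 else 1.0 - p_vulnerable
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss on examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def symmetric_kl(p, q):
    """R-Drop consistency term: mean of KL(p||q) and KL(q||p).

    p and q are probability vectors with strictly positive entries.
    """
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def composite_loss(pass1, pass2, label, w_rdrop=1.0, gamma=2.0):
    """Focal loss averaged over two dropout passes plus a weighted R-Drop term.

    pass1, pass2: [p_safe, p_vulnerable] from two forward passes of one input.
    w_rdrop is a hypothetical weight, not a tuned value.
    """
    focal = 0.5 * (
        focal_loss(pass1[1], label, gamma) + focal_loss(pass2[1], label, gamma)
    )
    return focal + w_rdrop * symmetric_kl(pass1, pass2)


# A confidently correct prediction contributes far less than a confident miss.
print("easy example:", focal_loss(0.9, label=1))
print("hard example:", focal_loss(0.1, label=1))

# Two dropout passes that disagree are penalized by the consistency term.
print("composite:", composite_loss([0.3, 0.7], [0.4, 0.6], label=1))
```

With `gamma=0` and `alpha=0.5` the focal term reduces to half the ordinary cross-entropy, which is a quick sanity check when ablating gamma.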
Results: Outperforming Baselines
| Model | Language | Method | F1 |
|---|---|---|---|
| Bandit | Python | Regex | ~0.62 |
| DetectVul | Python | Dual-BERT | 0.7447 |
| CSI-GCB | Python | GraphCodeBERT+LoRA | 0.7012 |
| CSI-Dual | Python | GCB+VulBERTa | 0.6630 |
CSI-GCB outperforms Bandit by 8 percentage points on Python. The single-encoder LoRA approach outperforms the dual-encoder fusion under current data scale — dual-encoder architectures require larger corpora to realize their complementary representation advantage.
Why This Matters
Security teams get a tool that understands code semantics, not surface patterns. ML practitioners see validation of parameter-efficient fine-tuning (LoRA) on a real security task. Microsoft’s GraphCodeBERT + PEFT ecosystem proves viable for production Python security tooling.
Meet the Team
CSI was built by a 9-person team at the Egyptian Chinese University.
| Member | Tasks Owned | Specific Contributions |
|---|---|---|
| Anas Abuelhaag | Project lead, architecture, training infrastructure | GitHub repo setup, Drive infrastructure, CWE classification head design, checkpoint save/load, validation loop (F1/precision/recall), SCL integration into training loop, overall system architecture |
| Sohaila Tamer | Graph extraction, SCL tuning, VulBERTa go/no-go | AST/CFG/DFG graph extraction, SCL vs no-SCL F1 comparison, temperature tau tuning, SCL weight alpha tuning, results logging, VulBERTa go/no-go decision |
| Farida Hassan | Tokenization, loss functions, full training run | GraphCodeBERT tokenization cache, checkpoint co-development, validation loop co-development, supervised contrastive loss implementation and testing, R-Drop KL divergence loss, full GraphCodeBERT training run |
| Hend Elhout | EDAT identifier extraction, Streamlit UI | Tree-sitter identifier extraction for EDAT, Streamlit line highlighting display, 7 fix suggestion texts, example code snippets, loading spinner, edge case handling, VulBERTa fusion layer implementation |
| Jomana Mekheimar | AST constraints, hyperparameter search, VulBERTa tokenization | EDAT AST constraint checking, full EDAT training run, EDAT vs no-EDAT F1 comparison, LoRA rank ablation (4 vs 8 vs 16), focal loss gamma ablation (1 vs 2 vs 3), VulBERTa BPE tokenization |
| Menna Reda | PGD adversarial training, VulBERTa training | EDAT PGD perturbation loop, adversarial KL loss, small-batch EDAT testing, VulBERTa dual-encoder training run |
| Youstina Adel | Epsilon tuning, VulBERTa LoRA | EDAT epsilon tuning, EDAT integration into training loop, VulBERTa LoRA adapter implementation |
| MennatAllah Amr | Hyperparameter optimization, VulBERTa forward pass | Batch size ablation (4 vs 8), learning rate ablation (1e-5, 2e-5, 5e-5), hyperparameter results compilation and best config selection, VulBERTa dual-input forward pass update |
| Hesham Elshimy | Streamlit app, deployment, demo | Streamlit UI layout design, hardcoded data testing, app.py with code input, 4 metric cards, CWE description section, 7 CWE descriptions, model download integration, model connection, safe/vulnerable code testing, demo video production |
REFERENCES
[1] PyVul Team, “PyVul: A Real-World Python Vulnerability Benchmark with LLM-Assisted Data Cleansing,” arXiv:2404.15687, 2024.
[2] Y. Chen et al., “DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning-Based Vulnerability Detection,” in Proc. ACM RAID, 2023, pp. 1-16.
[3] H. Husain et al., “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search,” arXiv:1909.09436, 2019.
[4] Y. Feng et al., “CodeBERT: A Pre-Trained Model for Programming and Natural Languages,” in Proc. EMNLP Findings, 2020.
[5] M. T. Tran et al., “DetectVul: Statement-Level Python Vulnerability Detection Using Dual-BERT,” Future Generation Computer Systems, 2025.
[6] R. Mussabayev, “Structure-Aware Code Vulnerability Analysis with Graph Neural Networks,” arXiv:2307.11454, 2023.
[7] Anonymous, “From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection,” arXiv:2408.02329, 2024.
[8] H. Hanif and S. Maffeis, “VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection,” in Proc. IJCNN, 2022.
[9] J. Liu et al., “Vul-LMGNNs: Vulnerability Detection by Fusing Language Models and Online-Distilled Graph Neural Networks,” arXiv:2404.14719, 2025.
[10] X. Wen et al., “AMPLE: Vulnerability Detection with Graph Simplification and Enhanced Graph Representation Learning,” arXiv:2302.04675, 2023.
[11] L. Peng et al., “ANGEL: Accurate Vulnerability Detection for Large Code Graphs,” arXiv:2412.10164, 2024.
[12] J. de Kraker, H. Vranken, and A. Hommersom, “MultiGLICE: GNNs with Program Slicing for Multiclass Vulnerability Detection,” Computers, vol. 14, no. 3, p. 98, 2025.
[13] Anonymous, “Vignat: Vulnerability Identification by Learning Code Semantics via Graph Attention Networks,” arXiv:2310.20067, 2023.
[14] Anonymous, “Enhancing Vulnerability Detection Using Code Property Graphs and CNNs,” in Proc. ACM CCS Workshop, 2023.
[15] Y. Hu et al., “Interpreters for GNN-Based Vulnerability Detection: Are We There Yet?,” in Proc. ACM ISSTA, 2023.
[16] K. Wartschinski et al., “VUDENC: Vulnerability Detection with Deep Learning for a Large Codebase in Python,” Computers, vol. 14, no. 3, 2025.
[17] C. Liang et al., “Source Code Vulnerability Analysis Based on Deep Learning: A Survey,” Computers & Security, vol. 148, 2025.
[18] Y. Hu et al., “SoK: Automated Vulnerability Repair, Methods, Tools, and Assessments,” arXiv:2506.11697, 2025.
[19] G. Bhandari, P. Gavric, and A. Shalaginov, “PatchLM: Generating Vulnerability Security Fixes with Code Language Models,” Information and Software Technology, vol. 185, 2025.
[20] D. Guo et al., “GraphCodeBERT: Pre-Training Code Representations with Data Flow,” in Proc. International Conference on Learning Representations (ICLR), 2021.