Zoho has this sprawling campus in Guduvancheri, Chennai where thousands of engineers build products that millions of people depend on. In 2017, I was on the security team there. My job was to audit the codebase before each release and find vulnerabilities before they became a breach.
An internal team had built a code search tool on top of Lucene. It was painfully slow. I'd type a search query, hit enter, go grab a coffee, and come back ten minutes later to see the results. Some days the server was just down entirely. That's when I realized how much code search matters. Long before the agentic era, before LLMs could read your codebase, search was the only way to navigate a million-line monorepo. I loved what Sourcegraph was building and eventually joined them. But even with fast, precise search, the fundamental problem remains: code search finds patterns. It doesn't understand flows.
Every audit followed the same ritual. I'd open two browser tabs side by side. In the first tab, I'd search for request.getParameter to find where user input enters the application. In the second tab, I'd search for statement.execute to find where SQL queries run. Then I'd stare at the results and try to answer one question: can data from tab one actually reach tab two?
That question sounds simple. It isn't.
The input might arrive in a servlet in UserController.java. It gets assigned to a local variable, passed to a service method in UserService.java, which calls a DAO method in UserRepository.java, which finally builds a query string and hands it to statement.execute. Three files. Three function boundaries. One vulnerability that only exists if the chain is unbroken.
I'd trace these flows by hand. Coming from Android development, I was used to IntelliJ's code navigation where you could jump to definitions and trace call hierarchies effortlessly. But even IntelliJ, as good as it is at navigation, doesn't answer the security question: does this user input actually reach that SQL query through all these layers? I'd open each file, read the method signatures, follow the variable names, check if anything got reassigned or sanitized along the way. Some days I'd spend an hour tracing a path through six files only to discover the variable got escaped two hops in. Other days I'd clear a module as safe and later find out I missed a flow because it went through a utility function I never thought to search for.
The frustrating part was that I knew exactly what I wanted: give me a source, give me a sink, tell me if there's a connected path. That's the entire question. The only tool that could technically do this was CodeQL, but the licensing cost, the database build times, and the learning curve of writing QL queries meant that by the time you figured out the syntax, you'd already lost the motivation to finish the audit.
It took years of building Code Pathfinder to get there. But that exact question, "does data flow from A to B across my project?", is now something the engine answers in seconds.
Here's a cross-file SQL injection rule running against a Flask app split across two files. User input enters in app.py, gets passed through a helper function, and lands in cursor.execute() in db.py. Hit "Run Analysis" and watch the engine trace the full chain.
What's in v2.0
Here's what it actually does. You give it three things: where untrusted data enters (sources), where it becomes dangerous (sinks), and what neutralizes it (sanitizers). The engine traces every path from source to sink. Tainted variable hits a sink without passing through a sanitizer? That's a finding. Sanitized along the way? Silence.
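None of the engine's API is needed to see the distinction. Here's the source/sink/sanitizer idea as a plain-Python sketch, using shlex.quote as the sanitizer (the same one the rules below use):

```python
import shlex

def build_unsafe(user_input: str) -> str:
    # Tainted data reaches the command string untouched: a finding.
    return "ls " + user_input

def build_safe(user_input: str) -> str:
    # shlex.quote neutralizes shell metacharacters: the path passes
    # through a sanitizer, so the engine stays silent.
    return "ls " + shlex.quote(user_input)

print(build_unsafe("; rm -rf /"))   # → ls ; rm -rf /
print(build_safe("; rm -rf /"))     # → ls '; rm -rf /'
```

Same source, same sink; the only difference is whether the path runs through the sanitizer.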
This isn't pattern matching. Most SAST tools on the market still work by matching regex patterns against source code. They'll flag every call to execute() regardless of whether user input actually reaches it. That's why teams drown in false positives and eventually stop looking at findings altogether.
And honestly, it's not entirely the tooling's fault. Most engines don't expose their type systems through the rule API. Security engineers writing custom rules don't get access to type resolution, call graphs, or dataflow primitives. So they write what they can: regex patterns. The tool has an AST internally but the rule language is basically grep with extra steps.
Code Pathfinder's QueryType system replaces pattern matching with type-aware resolution. The engine builds a call graph, resolves types, and tracks assignments across function boundaries. It either proves a vulnerable path exists or stays quiet. No grep. No guessing.
Three kinds of code in your project
The engine has to understand what it's looking at. Your project has three distinct categories, and they need different treatment.
Your code. Code Pathfinder parses it fully. AST, call graph, type inference, every variable assignment. All of it is available for dataflow traversal.
Standard library. os.system(), subprocess.run(), and open() don't live in your codebase, but the engine needs to know their behavior: which parameters are dangerous, how data flows through them. The entire Python stdlib has been extracted, parsed, and annotated. When the engine sees os.system(cmd), it knows argument 0 is a command injection sink without analyzing CPython source.
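The post doesn't show the annotation format, but conceptually each stdlib sink boils down to a small record. This is a hypothetical shape, not the engine's actual data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SinkSpec:
    # Hypothetical record shape: it just captures the three facts
    # the text describes, not Code Pathfinder's internal format.
    fqn: str    # fully qualified name of the callable
    arg: int    # which argument position is dangerous
    cwe: str    # vulnerability class

STDLIB_SINKS = {
    "os.system": SinkSpec("os.system", 0, "CWE-78"),
    "subprocess.run": SinkSpec("subprocess.run", 0, "CWE-78"),
}

# Seeing os.system(cmd), the engine can look up the annotation instead of
# analyzing CPython source: argument 0 is a command injection sink.
print(STDLIB_SINKS["os.system"].arg)   # → 0
```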
Third-party libraries. Flask, Django, SQLAlchemy, and requests expose methods that are sources, sinks, or propagators. Code Pathfinder resolves them through typeshed annotations and type inference. When it sees cursor.execute(query), it figures out that cursor is a sqlite3.Cursor (or psycopg2.extensions.cursor, or any of half a dozen MySQL variants) and knows argument 0 is a SQL sink.
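Here's the kind of code under analysis, as a runnable sqlite3 snippet. The type resolution described above is what lets the engine tell the two execute() arguments apart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()   # type inference resolves this to sqlite3.Cursor
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO users VALUES (?, ?)", (1, "alice"))

# Because `cursor` resolves to sqlite3.Cursor via typeshed, the engine
# knows argument 0 of execute() is the SQL sink; the parameter tuple at
# argument 1 is the safe, parameterized path.
cursor.execute("SELECT id FROM users WHERE name = ?", ("alice",))
print(cursor.fetchone())   # → (1,)
```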
Here's what gets interesting: this lets you verify third-party library internals. Picture a library that takes user input and internally calls eval(). Most static analysis tools treat library calls as black boxes. Code Pathfinder is working toward tracing through those boundaries. Future versions will index site-packages directly for libraries without typeshed coverage.
Writing a flow analysis rule
Enough theory. Here's how you actually write one.
Define your types
The QueryType system is how you tell the engine what objects you care about. Instead of matching function name strings (which break the moment someone aliases an import), you define fully qualified names and let type resolution do its job:
```python
from codepathfinder import QueryType

class FlaskRequest(QueryType):
    fqns = ["flask.wrappers.Request", "flask.Request"]

class OSModule(QueryType):
    fqns = ["os"]
```
FlaskRequest matches anything the engine resolves to Flask's Request class. Doesn't matter if the code says "from flask import request" or "import flask" or "import flask as f". Type resolution handles it.
Where do you get the FQN? Open the library source or type stubs. Flask's request object is flask.wrappers.Request, but most code imports it as flask.Request. List both. For stdlib modules like os, the FQN is just "os". The engine resolves the import chain and bridges your FQN to whatever alias the developer used.
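The aliasing problem is easy to demonstrate with the stdlib alone. All three spellings below bind the same object, which is why matching on the resolved FQN survives renamed imports where string matching breaks:

```python
import os
import os as o
from os import system

# Three surface spellings, one resolved object. FQN-based matching keys on
# the resolved name ("os.system"), so all three call sites hit the same rule.
assert os.system is o.system is system
print("all aliases resolve to os.system")
```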
Define sources, sinks, and sanitizers
Use .method() to pick which methods matter, and .tracks() to say which parameter position carries tainted data:
```python
sources = [
    FlaskRequest.method("args.get", "form.get", "values.get", "get_json"),
]

sinks = [
    # .tracks(0) = only argument 0 is dangerous
    # os.system(cmd) where cmd is at position 0
    OSModule.method("system", "popen").tracks(0),
]

sanitizers = [
    calls("shlex.quote"),
]
```
.tracks(0) matters a lot. It tells the engine that only the first argument to os.system() is a command injection sink. Without it, any tainted value anywhere in the arguments would trigger a finding.
Wire it together
```python
from codepathfinder import flows
from codepathfinder.presets import PropagationPresets

@python_rule(
    id="MY-CUSTOM-001",
    name="Command Injection via os.system",
    severity="CRITICAL",
    cwe="CWE-78",
    message="User input flows to os.system() without sanitization.",
)
def detect_command_injection():
    return flows(
        from_sources=[
            FlaskRequest.method("args.get", "form.get", "values.get"),
        ],
        to_sinks=[
            OSModule.method("system", "popen").tracks(0),
        ],
        sanitized_by=[
            calls("shlex.quote"),
        ],
        propagates_through=PropagationPresets.standard(),
        scope="global",
    )
```
scope="global" is the cross-file part. The engine follows tainted data through function calls, across files, through return values.
Most commercial SAST vendors lock cross-file analysis behind their enterprise tier. Single-file scanning is free, but the moment you need to trace data across a function boundary, that's a premium feature. Which is wild, because real vulnerabilities almost always cross file boundaries. Nobody writes their entire Flask app in one file. Code Pathfinder ships cross-file analysis in the open source version. No feature gates.
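For a feel of what scope="global" has to follow, here's the Flask example condensed into one runnable module, with comments marking the original file boundaries and the request object replaced by a plain dict. This is a simplification for illustration; the real demo spans app.py and db.py:

```python
import sqlite3

# --- db.py: the sink side ---
def run_query(cursor, sql):
    cursor.execute(sql)                     # sink: argument 0 of execute()
    return cursor.fetchall()

# --- helpers.py: taint propagates through the return value ---
def build_lookup(name):
    return f"SELECT id FROM users WHERE name = '{name}'"

# --- app.py: the source side ---
def handle(request_args, cursor):
    name = request_args.get("name")         # source: untrusted input
    return run_query(cursor, build_lookup(name))   # three hops to the sink

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'alice')")
print(handle({"name": "alice"}, cur))       # → [(1,)]
```

Source in one "file", propagation in a second, sink in a third: the chain only shows up if the engine follows taint across all three boundaries.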
Run it:
```shell
pathfinder scan --rules my_rule.py --project ./my-flask-app
```
Output:
```
[critical] [Taint-Global] MY-CUSTOM-001: Command Injection via os.system
CWE-78 | A03:2021
routes/admin.py:42

  39 | @app.route('/admin/exec')
  40 | def admin_exec():
  41 |     cmd = request.args.get('command')
> 42 |     os.system(cmd)

Flow: cmd (line 41) -> os.system(cmd) (line 42)
Tainted variable 'cmd' reaches dangerous sink without sanitization
```
Cross-file SQL injection
Here's a more realistic example. The source and sink live in different files, connected through a helper function:
```python
class DBCursor(QueryType):
    fqns = [
        "sqlite3.Cursor",
        "psycopg2.extensions.cursor",
        "mysql.connector.cursor.MySQLCursor",
        "pymysql.cursors.Cursor",
    ]
    patterns = ["*Cursor"]
    match_subclasses = True

@python_rule(
    id="PYTHON-FLASK-SEC-003",
    name="Flask SQL Injection via Tainted String",
    severity="CRITICAL",
    cwe="CWE-89",
    message="User input flows to SQL execution without parameterization.",
)
def detect_flask_sql_injection():
    return flows(
        from_sources=[
            FlaskRequest.method(
                "args.get", "form.get", "values.get",
                "get_json", "cookies.get", "headers.get",
            ),
        ],
        to_sinks=[
            DBCursor.method("execute", "executemany").tracks(0),
        ],
        sanitized_by=[
            calls("escape"),
            calls("escape_string"),
        ],
        propagates_through=PropagationPresets.standard(),
        scope="global",
    )
```
Two things worth noting. DBCursor uses patterns = ["*Cursor"] with match_subclasses = True, so it covers sqlite3, psycopg2, PyMySQL, mysql-connector, and any cursor subclass you've written. One rule handles all database drivers.
.tracks(0) on execute means only the SQL string at argument 0 triggers findings. cursor.execute(sql, (params,)) where the tuple at argument 1 is the parameterized values? That's the safe pattern, and the engine ignores it. This single distinction kills an entire class of false positives that plague most static analysis tools.
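The argument-position distinction is easy to reproduce with sqlite3: the same payload is an injection at argument 0 and an inert literal at argument 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"

# Tainted string at argument 0: the classic injection, every row leaks.
cur.execute(f"SELECT id FROM users WHERE name = '{payload}'")
print(cur.fetchall())   # → [(1,), (2,)]

# Same payload at argument 1: bound as a literal, nothing matches.
cur.execute("SELECT id FROM users WHERE name = ?", (payload,))
print(cur.fetchall())   # → []
```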
190+ rules ship out of the box
You don't have to write everything from scratch. There are 190+ rules covering Flask, Django, JWT, cryptography, AWS Lambda, deserialization, Docker, Docker Compose, and core Python. Every taint rule uses scope="global". Every sink uses .tracks() for parameter targeting. Every rule includes sanitizers.
You can browse all of them, see the detection logic, and try each one in the interactive playground at the rule registry.
The AI angle
Here's the thing I keep thinking about. The point isn't just making rule writing easy. It's making dataflow analysis and cross-file reasoning cheap enough that an LLM or an AI agent like Claude Code can leverage it directly. Write a rule, execute it against the engine, gather context from the results, and reason about the next step. All in a loop.
Think about what that means. An agent doesn't need to read every file in your project to understand a security flow. It writes a targeted rule, the engine traces the dataflow in seconds, and the agent gets back a precise answer: this variable flows from here to there, through these functions, across these files. That's structured context. Way more useful than dumping 50 files into a prompt window and hoping the model follows the thread.
Can LLMs just do the dataflow analysis themselves by reading code? Sometimes. I've seen Claude Sonnet 4.6 catch multi-hop flows across files, tracing taint through four or five function boundaries. Genuinely impressive. But it's non-deterministic. Run the same prompt twice, you get different results. Some sessions it catches everything. Other sessions it misses a one-liner. When I guide it through the flow step by step, it works. When I don't, it's a coin flip. You can't build a CI pipeline on a coin flip. Code Pathfinder's engine runs in seconds and gives the same answer every time.
I've been experimenting with LoRA adapters on open models, training them specifically for dataflow reasoning, using them to handle context sensitivity and path sensitivity where the deterministic engine has gaps.
What's ahead
This is v1 of cross-file dataflow. It works, and it ships with 190+ rules. I have a server that continuously benchmarks the engine against open source repositories. It runs scans, then uses an LLM as a judge to identify gaps, cases where the engine missed something a human reviewer would catch. Those gaps get recorded as reproducible test cases and fed back into the engine. It's a loop: scan, judge, reproduce, fix, repeat.
Still on the roadmap:
- Go language support for dataflow analysis across Go codebases, covering net/http handlers, database/sql, os/exec, and the Go module ecosystem
- Context sensitivity so the same function can behave differently depending on call context
- Path sensitivity to prune impossible paths based on branch conditions
- Full site-packages indexing to model any third-party library, not just the ones with typeshed coverage
Get started:
```shell
pip install codepathfinder
pathfinder scan --ruleset python/flask --project .
```
CI integration:
```shell
pathfinder ci --ruleset python/flask --project . --fail-on critical,high
```
Source code is on GitHub. All 190+ rules are browsable in the rule registry with interactive playgrounds.
If you find this useful, star the repo. If you want to write rules or improve the engine, check the contributing guide. Found a bug? Open an issue. Have ideas about where this should go? Start a discussion.
I'm genuinely excited about what cross-file analysis unlocks. More precise rule writing. LLMs that can reason about dataflow with a deterministic engine backing them up. And the whole thing is open source. Not locked behind a paid tier. Not gated by an enterprise license. Just install it and scan.