# PYTHON-LANG-SEC-046: dill Deserialization Detected

> **Severity:** HIGH | **CWE:** CWE-502 | **OWASP:** A08:2021

- **Language:** Python
- **Category:** Python Core
- **URL:** https://codepathfinder.dev/registry/python/lang/PYTHON-LANG-SEC-046
- **Detection:** `pathfinder scan --ruleset python/PYTHON-LANG-SEC-046 --project .`

## Description

dill is a Python package that extends the standard pickle module with broader serialization
capabilities, supporting lambda functions, generator functions, closures, nested functions, and
other Python objects that pickle cannot serialize. dill builds on pickle's serialization
mechanism and therefore inherits all of pickle's security issues.

Like pickle, dill can execute arbitrary Python code during deserialization. dill also
extends the attack surface compared to standard pickle because it can serialize and
deserialize additional constructs, including lambda functions and closures that carry
executable code.

dill is commonly used in scientific computing (multiprocessing with lambdas, distributed
computing) and machine learning (serializing model training functions). These use cases
typically involve trusted internal data, but calling dill.loads() on external data is just
as dangerous as calling pickle.loads().
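
To make the load-time execution concrete, here is a minimal sketch of the `__reduce__` mechanism that dill inherits from pickle. The `Payload` class is hypothetical, and `int("42")` is a harmless stand-in for the command a real attacker would run. The sketch assumes dill is installed and falls back to the standard pickle module otherwise, since both honor `__reduce__` in the same way.

```python
import pickle

try:
    import dill as serializer  # third-party; assumed to be installed
except ImportError:
    serializer = pickle  # fallback so the sketch still runs without dill


class Payload:
    """Hypothetical object whose serialized form tells the loader to call a function."""

    def __reduce__(self):
        # A real attack would return something like (os.system, ("...",)).
        # int("42") is a harmless stand-in that makes the call observable.
        return (int, ("42",))


blob = serializer.dumps(Payload())

# Loading calls int("42") and returns its result; no Payload object is rebuilt.
result = serializer.loads(blob)
print(type(result).__name__, result)
```

The value returned by `loads()` is whatever the callable chosen by the payload author returns, which is why inspecting the deserialized object afterwards is too late: the call has already happened.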


## Vulnerable Code

```python
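import dill

# SEC-046: dill.loads() on caller-supplied bytes runs whatever the
# payload instructs the unpickler to do
def load_job(received_bytes: bytes):
    return dill.loads(received_bytes)

# SEC-046: dill.load() on an untrusted file has the same problem
def load_model(path: str):
    with open(path, "rb") as f:
        return dill.load(f)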
```

## Secure Code

```python
import json

# INSECURE: dill.loads() on external or network data
# import dill
# obj = dill.loads(received_bytes)
# func = dill.load(open("untrusted_model.dill", "rb"))

# SECURE: Use JSON for data interchange
def deserialize_request(data: bytes) -> dict:
    return json.loads(data.decode("utf-8"))

# SECURE: For ML function serialization, use explicit function registries
# instead of serializing function objects
ALLOWED_TRANSFORMS = {
    "normalize": lambda x: (x - x.mean()) / x.std(),
    # Replaces negative values with 0 (the body clamps; it does not take a logarithm)
    "clip_negative": lambda x: x.apply(lambda v: max(0, v)),
}

def get_transform(name: str):
    if name not in ALLOWED_TRANSFORMS:
        raise ValueError(f"Transform not allowed: {name}")
    return ALLOWED_TRANSFORMS[name]

# SECURE: For distributed computing, use function references by name
# rather than serialized function objects
def submit_task(func_name: str, args: list):
    ALLOWED_FUNCTIONS = {"process_batch", "validate_record", "aggregate_results"}
    if func_name not in ALLOWED_FUNCTIONS:
        raise ValueError(f"Function not allowed: {func_name}")
    # Submit by reference, not by serialization

```

## Detection Rule (Python SDK)

```python
from rules.python_decorators import python_rule
from codepathfinder import calls, QueryType

class DillModule(QueryType):
    fqns = ["dill"]


@python_rule(
    id="PYTHON-LANG-SEC-046",
    name="dill Deserialization Detected",
    severity="HIGH",
    category="lang",
    cwe="CWE-502",
    tags="python,dill,deserialization,rce,CWE-502",
    message="dill.loads/load detected. dill extends pickle and can execute arbitrary code.",
    owasp="A08:2021",
)
def detect_dill():
    """Detects dill.loads/load usage."""
    return DillModule.method("loads", "load")
```

## How to Fix

- Never use dill.loads() or dill.load() on data from external sources, including network payloads, file uploads, or user-provided files.
- For distributed computing task serialization, restrict task definitions to developer-controlled function references rather than serialized closures from user input.
- For ML model portability, use format-specific safe serialization (ONNX, TorchScript, SavedModel) instead of dill-serialized Python function objects.
- If dill must be used for internal distributed computing, ensure task payloads are signed with HMAC and only processed within a trusted network boundary.
- Audit all dill usage in data science and ML pipelines to confirm no external data flows through dill.loads().

## Security Implications

- **Extended Code Execution via Closures and Lambdas:** dill can serialize lambda functions, closures, and generator functions, including
their bytecode. An attacker crafting a dill payload can embed malicious lambda functions
or closures in it, in addition to all of pickle's existing attack vectors, such as
`__reduce__`-based payloads that run code while the data is being loaded.

- **Distributed Computing Attack Surface:** dill and the closely related cloudpickle are commonly used to serialize functions for
distribution across workers: multiprocess and pathos use dill, while Ray, Dask, and
PySpark use cloudpickle. If an attacker can inject serialized payloads into the task
queue, they can execute code on every worker node that loads them.

- **ML Model Poisoning:** Machine learning pipelines that serialize model training functions, preprocessing
steps, or custom loss functions using dill are vulnerable to model poisoning if
the serialized files can be replaced. Loading a malicious dill file from an
untrusted model repository triggers code execution.

- **Lambda-based Payload Evasion:** dill's ability to serialize lambdas and closures enables more sophisticated attack
payloads that may evade signature-based detection designed to look for common pickle
gadget chains, since the malicious code is embedded in function bytecode rather than
class instantiation sequences.


## FAQ

**Q: Why is dill more dangerous than standard pickle?**

Standard pickle already allows arbitrary code execution on load, so dill is at least as
dangerous. In addition, standard pickle cannot serialize lambda functions, closures, or
nested functions, while dill can serialize all of these, expanding the set of possible
attack payloads. An attacker can embed malicious code in a lambda or closure, potentially
evading defenses designed for standard pickle gadget chains.


**Q: Is dill used for legitimate purposes?**

Yes. dill is widely used for multiprocessing (to serialize lambda functions for worker
processes), distributed computing (to ship function closures to remote workers), and
ML/scientific computing (to checkpoint model training functions). These are all
legitimate use cases involving trusted internal data, not external input.


**Q: How does dill relate to cloudpickle, multiprocess, and pathos?**

multiprocess is a fork of the standard multiprocessing module that uses dill for
serialization, and pathos builds on multiprocess. cloudpickle (used by PySpark, Ray, and
Dask) is a separate library that extends pickle in a similar way to serialize functions.
All share the same security risk: loading a serialized object from an untrusted source
can execute arbitrary code. The same guidance applies: never deserialize externally
sourced data with these libraries.


**Q: Can I use dill.loads() if I hash-verify the data first?**

A plain hash is not enough on its own: an attacker who can replace the payload can usually
replace an accompanying hash as well. HMAC signing with a secret key before transmission,
with verification before deserialization, provides a reasonable mitigation for trusted
sender scenarios. If the sender is trusted and the secret key is properly protected, this
reduces the risk to that of a key compromise. However, it requires careful key management
and is error-prone.
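
As an illustration of the verify-before-deserialize order described above, here is a minimal sketch using only the standard library. The helper names `sign` and `verify` and the tag-then-payload layout are assumptions made for this example, not part of dill or of any particular framework.

```python
import hashlib
import hmac
import secrets

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for HMAC-SHA256


def sign(payload: bytes, key: bytes) -> bytes:
    """Prefix the payload with an HMAC-SHA256 tag (hypothetical wire layout)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def verify(blob: bytes, key: bytes) -> bytes:
    """Return the payload only if its tag is valid; raise ValueError otherwise."""
    if len(blob) < TAG_SIZE:
        raise ValueError("blob too short to contain a tag")
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the position of the first mismatch via timing
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload


key = secrets.token_bytes(32)  # in practice, loaded from a secret store
blob = sign(b"serialized task bytes", key)

# Only bytes that pass verification should ever reach dill.loads():
#     obj = dill.loads(verify(blob, key))
trusted_bytes = verify(blob, key)
```

Verification must happen on the raw bytes before any call to `dill.loads()`; checking a signature stored inside the deserialized object would run the payload first.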


**Q: What is the safe alternative for ML model sharing?**

For model weights: save with PyTorch's torch.save() and load with
torch.load(..., weights_only=True), or use ONNX for cross-framework portability. For
preprocessing pipelines: serialize parameters (not functions) as JSON and reconstruct the
pipeline deterministically. For custom layers: keep model architecture code in
version-controlled source files, not in serialized function objects.
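
The "parameters, not functions" approach for preprocessing pipelines can be sketched as follows. The step names (`scale`, `clip`), the builder functions, and the JSON layout are all hypothetical choices for this example; the point is that the file carries only data, and the code that runs comes from the application's own source.

```python
import json


# Hypothetical step builders: each turns plain parameters into a function.
def make_scale(factor: float):
    return lambda values: [v * factor for v in values]


def make_clip(lower: float, upper: float):
    return lambda values: [min(max(v, lower), upper) for v in values]


# Only steps listed here can ever be constructed from a config file.
STEP_BUILDERS = {"scale": make_scale, "clip": make_clip}


def load_pipeline(config_json: str):
    """Rebuild a pipeline from JSON parameters using only known builders."""
    steps = []
    for spec in json.loads(config_json):
        name = spec["name"]
        if name not in STEP_BUILDERS:
            raise ValueError(f"Unknown step: {name}")
        steps.append(STEP_BUILDERS[name](**spec.get("params", {})))

    def run(values):
        for step in steps:
            values = step(values)
        return values

    return run


config = json.dumps([
    {"name": "scale", "params": {"factor": 2.0}},
    {"name": "clip", "params": {"lower": 0.0, "upper": 5.0}},
])
pipeline = load_pipeline(config)
print(pipeline([-1.0, 1.0, 4.0]))  # [0.0, 2.0, 5.0]
```

A tampered config file can at worst select a different allowed step or supply unexpected parameters, which the builders reject with a `TypeError` for unknown keyword names; it cannot introduce new code.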


**Q: Are there any dill-compatible safe deserialization approaches?**

dill does not provide a safe subset or restricted loader like PyYAML's SafeLoader.
The only safe approach with dill is to ensure the data being deserialized comes
from a trusted source with cryptographic integrity protection. For untrusted data,
there is no safe way to use dill.loads().


## References

- [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html)
- [dill documentation](https://dill.readthedocs.io/)
- [Python docs: pickle, Restricting Globals](https://docs.python.org/3/library/pickle.html#restricting-globals)
- [OWASP Deserialization Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Deserialization_Cheat_Sheet.html)
- [OWASP Top 10 A08:2021 Software and Data Integrity Failures](https://owasp.org/Top10/A08_2021-Software_and_Data_Integrity_Failures/)

