# PYTHON-LANG-SEC-040: Pickle Deserialization of Untrusted Data

> **Severity:** HIGH | **CWE:** CWE-502 | **OWASP:** A08:2021

- **Language:** Python
- **Category:** Python Core
- **URL:** https://codepathfinder.dev/registry/python/lang/PYTHON-LANG-SEC-040
- **Detection:** `pathfinder scan --ruleset python/PYTHON-LANG-SEC-040 --project .`

## Description

Python's pickle module serializes and deserializes Python objects by encoding them as a
stream of opcodes that are executed by a virtual stack machine during unpickling. An
object's `__reduce__()` and `__reduce_ex__()` methods can specify an arbitrary callable
and arguments to be invoked when the object is deserialized.

This means that deserializing a pickle stream from an untrusted source is equivalent to
executing arbitrary Python code. An attacker who can control the pickled data can achieve
full remote code execution, read files, spawn processes, and exfiltrate data. There is no
safe subset of pickle operations — the entire pickle format is a code execution vector.

The Python documentation explicitly warns: "The pickle module is not secure. Only unpickle
data you trust." Use JSON, MessagePack, or Protocol Buffers for data interchange with
untrusted parties.
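The mechanism is easy to demonstrate. The sketch below is a harmless illustration of how `__reduce__` turns unpickling into a function call; a real payload would substitute os.system or subprocess.Popen with attacker-chosen arguments:

```python
import pickle

class Payload:
    def __reduce__(self):
        # (callable, args): pickle stores this pair, and unpickling CALLS it.
        # An attacker would return (os.system, ("malicious command",)) instead.
        return (eval, ("__import__('math').factorial(5)",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # executes eval(...) during deserialization
print(result)  # 120 -- arbitrary code ran; no Payload instance was ever created
```

Note that the deserialized value is the *return value* of the injected call, not a Payload object: the class definition is irrelevant once the stream exists.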


## Vulnerable Code

```python
import pickle

# SEC-040: every pickle entry point executes attacker-controlled opcodes
data = pickle.loads(b"malicious")       # bytes from network/user input

with open("data.pkl", "rb") as f:
    obj = pickle.load(f)                # file may be attacker-supplied

with open("data.pkl", "rb") as f:
    obj = pickle.Unpickler(f).load()    # equivalent risk via the Unpickler API
```

## Secure Code

```python
import json
import struct

# INSECURE: pickle.loads() on data from network/file/user
# obj = pickle.loads(received_data)

# SECURE: Use JSON for data interchange
def deserialize_config(json_bytes: bytes) -> dict:
    data = json.loads(json_bytes.decode("utf-8"))
    # Validate structure and types
    if not isinstance(data, dict):
        raise ValueError("Expected dict")
    return data

# SECURE: Use a schema-validated format for complex objects
from pydantic import BaseModel

class UserData(BaseModel):
    username: str
    age: int
    email: str

def parse_user_data(json_str: str) -> UserData:
    return UserData.model_validate_json(json_str)

# SECURE: For ML models, use format-specific safe loaders
# PyTorch: torch.load(f, map_location=...) with weights_only=True
# Scikit-learn: use joblib with caution, or export to ONNX
# NumPy arrays: np.load(f, allow_pickle=False)
```

## Detection Rule (Python SDK)

```python
from rules.python_decorators import python_rule
from codepathfinder import calls, QueryType

class PickleModule(QueryType):
    fqns = ["pickle", "_pickle", "cPickle"]


@python_rule(
    id="PYTHON-LANG-SEC-040",
    name="Pickle Deserialization Detected",
    severity="HIGH",
    category="lang",
    cwe="CWE-502",
    tags="python,pickle,deserialization,rce,OWASP-A08,CWE-502",
    message="pickle.loads/load detected. Pickle can execute arbitrary code. Use json or msgpack instead.",
    owasp="A08:2021",
)
def detect_pickle():
    """Detects pickle.loads/load/Unpickler usage."""
    return PickleModule.method("loads", "load", "Unpickler")
```

## How to Fix

- Replace pickle with JSON, MessagePack, or Protocol Buffers for data received from any external source.
- For ML model serialization, use safe, format-specific options: ONNX, TensorFlow SavedModel, TorchScript, or weights_only=True for PyTorch's torch.load().
- If pickle must be used for internal IPC, sign all pickle payloads with HMAC using a secret key and verify the signature before deserializing.
- Never accept pickle data in file uploads, API endpoints, message queues, or any interface reachable by external parties.
- For scientific data (NumPy, pandas), use safe alternatives: np.load() with allow_pickle=False, df.to_parquet()/pd.read_parquet(), or HDF5.
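The HMAC mitigation above can be sketched as follows. SECRET_KEY and the helper names are illustrative; in practice the key would come from a secret store, not a literal:

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"rotate-me"  # hypothetical; load from a secret store in production

def sign_pickle(obj) -> bytes:
    """Prefix the pickle payload with an HMAC-SHA256 tag."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload

def verify_and_load(blob: bytes):
    """Verify the tag BEFORE calling pickle.loads; reject on mismatch."""
    tag, payload = blob[:32], blob[32:]  # SHA-256 digest is 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("signature mismatch -- refusing to unpickle")
    return pickle.loads(payload)

blob = sign_pickle({"job": "reindex", "priority": 3})
assert verify_and_load(blob) == {"job": "reindex", "priority": 3}
```

The essential property is that verification happens before any byte of the payload reaches the unpickler; this protects integrity between trusted endpoints but does nothing for data whose origin is untrusted.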

## Security Implications

- **Arbitrary Code Execution via `__reduce__`:** The pickle `__reduce__` protocol allows any pickleable object to specify a callable and
arguments to be invoked during deserialization. An attacker crafts a pickle stream
that calls os.system(), subprocess.Popen(), or exec() with malicious arguments,
achieving RCE simply by having the pickle stream deserialized.

- **No Sanitization Is Possible:** Unlike SQL injection or XSS where sanitization can be effective, there is no way to
safely sanitize or validate a pickle stream before deserializing it. Parsing the pickle
stream to check for dangerous opcodes requires implementing a pickle interpreter, which
can itself be bypassed by encoding techniques.

- **Session and Cache Poisoning:** Applications that store pickled objects in Redis, Memcached, or cookies for session
management are vulnerable if an attacker can write to those stores. Session poisoning
via pickle injection in shared cache stores has been used in real attacks.

- **File Upload and Deserialization Chain:** Applications that accept file uploads and deserialize them with pickle (e.g., ML model
files, scientific data, serialized objects) are vulnerable to malicious uploads that
execute code on the server when the file is loaded.


## FAQ

**Q: Can I make pickle safe by restricting globals with a custom Unpickler?**

Python's documentation suggests subclassing pickle.Unpickler and overriding find_class()
to restrict which classes can be deserialized. This provides some protection but is
difficult to implement correctly and has been bypassed in practice using creative opcode
sequences. For truly untrusted data, use a different serialization format entirely.
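A minimal sketch of that restricted-Unpickler approach, assuming a hypothetical allow-list (and with the caveat above that allow-listing has been bypassed in practice):

```python
import io
import pickle

# Hypothetical allow-list: only these (module, name) globals may be resolved.
ALLOWED = {("builtins", "range")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the stream tries to load; deny by default.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

safe = restricted_loads(pickle.dumps(range(3)))  # allowed: builtins.range
# restricted_loads(pickle.dumps(Exception))      # raises UnpicklingError
```

Plain containers (lists, dicts, strings, numbers) never go through find_class, so they always load; only global lookups are gated, which is exactly why the protection is partial.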


**Q: Is pickle safe for data stored in my own database?**

Pickle is safer when stored data is only written by your own code and the storage
system is protected from external writes. However, if an attacker can inject data
into your database (via SQL injection, for example), they can plant malicious pickle
payloads. Defense in depth suggests using JSON even for internal storage.


**Q: What about signing pickle data with HMAC?**

Signing pickled data with HMAC and verifying the signature before deserializing is
a valid mitigation for trusted sender scenarios (e.g., signed cookies in web frameworks).
Django's signed cookie framework does this. The key must be kept secret and rotated
if compromised. This is not a substitute for avoiding pickle with truly untrusted data.


**Q: Is joblib.load() safer than pickle.load() for ML models?**

No. joblib uses pickle internally and carries the same code execution risk. For scikit-learn
models, use ONNX export via skl2onnx (the sklearn-onnx project) for deployment. For NumPy
arrays, use np.save()/np.load() with allow_pickle=False. For pandas DataFrames, use Parquet
or CSV with explicit dtypes.


**Q: How do I safely load PyTorch model files?**

PyTorch .pt/.pth files are pickle-based and can execute code. Use
torch.load(f, weights_only=True) to load only tensor data without executing arbitrary
pickle opcodes; the weights_only parameter was added in PyTorch 1.13 and became the
default in 2.6. For third-party model files, use ONNX Runtime or model format-specific
safe loaders.


**Q: What serialization format should I use to replace pickle?**

For structured data interchange: JSON (universal), MessagePack (binary, compact),
Protocol Buffers (schema-validated, efficient). For Python-specific types with schema
validation: Pydantic models with JSON. For scientific data: Arrow/Parquet, HDF5 with
h5py, NumPy's npz format with allow_pickle=False. Choose based on your type requirements
and performance needs.
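As a stdlib-only sketch of that last option, Python-specific types such as set and datetime can be round-tripped through tagged JSON instead of pickle; the `__type__` tag convention below is illustrative, not a standard:

```python
import json
from datetime import datetime, timezone

def encode(obj):
    """json.dumps 'default' hook: tag non-JSON types explicitly."""
    if isinstance(obj, set):
        return {"__type__": "set", "items": sorted(obj)}
    if isinstance(obj, datetime):
        return {"__type__": "datetime", "iso": obj.isoformat()}
    raise TypeError(f"unserializable: {type(obj).__name__}")

def decode(d):
    """json.loads 'object_hook': rebuild tagged values; pure data, no code."""
    if d.get("__type__") == "set":
        return set(d["items"])
    if d.get("__type__") == "datetime":
        return datetime.fromisoformat(d["iso"])
    return d

doc = {"tags": {"a", "b"}, "seen": datetime(2024, 1, 2, tzinfo=timezone.utc)}
blob = json.dumps(doc, default=encode)
restored = json.loads(blob, object_hook=decode)
assert restored == doc  # data round-trips with no code execution possible
```

Unlike pickle, the worst an attacker can do with this format is supply unexpected *data*, which the decode hook and downstream validation can reject.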


## References

- [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html)
- [Python docs: pickle security warning](https://docs.python.org/3/library/pickle.html#restricting-globals)
- [OWASP Deserialization Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Deserialization_Cheat_Sheet.html)
- [Exploiting Python pickles (research)](https://davidhamann.de/2020/04/05/exploiting-python-pickle/)
- [OWASP Top 10 A08:2021 Software and Data Integrity Failures](https://owasp.org/Top10/A08_2021-Software_and_Data_Integrity_Failures/)

---

Source: https://codepathfinder.dev/registry/python/lang/PYTHON-LANG-SEC-040
Code Pathfinder — Open source, type-aware SAST with cross-file dataflow analysis
