# PYTHON-CRYPTO-SEC-010: Insecure MD5 Hash (cryptography)

> **Severity:** MEDIUM | **CWE:** CWE-327, CWE-328 | **OWASP:** A02:2021

- **Language:** Python
- **Category:** Cryptography
- **URL:** https://codepathfinder.dev/registry/python/cryptography/PYTHON-CRYPTO-SEC-010
- **Detection:** `pathfinder scan --ruleset python/PYTHON-CRYPTO-SEC-010 --project .`

## Description

Detects usage of MD5 via the `cryptography` library's hazmat primitives interface
(`hashes.MD5()`). MD5 produces a 128-bit digest and has been considered
cryptographically broken since 2004, when Wang et al. demonstrated practical collision
attacks. By 2008, a rogue CA certificate had been forged using MD5 chosen-prefix
collisions. Today, identical-prefix MD5 collisions can be produced in seconds on
commodity hardware.
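
As a quick check of the digest lengths involved, the following sketch compares MD5 with the two replacements this rule recommends. It uses the standard library's `hashlib` rather than the `cryptography` package purely to stay dependency-free; the helper name `digest_bits` is hypothetical, and `usedforsecurity=False` assumes Python 3.9 or later.

```python
import hashlib


def digest_bits(name: str) -> int:
    """Return the digest length in bits for a hashlib algorithm name."""
    # usedforsecurity=False lets MD5 be constructed on restricted (e.g. FIPS) builds.
    return hashlib.new(name, usedforsecurity=False).digest_size * 8


for algorithm in ("md5", "sha256", "sha3_256"):
    print(f"{algorithm}: {digest_bits(algorithm)} bits")
```

A longer digest alone does not make a hash secure, but MD5's weakness is structural: its collisions are far cheaper to find than the generic bound for a 128-bit digest would suggest.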

MD5 must not be used for digital signatures, certificate validation, HMAC-based
authentication, or data integrity verification in security contexts. It remains acceptable
for non-security purposes such as cache keys, file deduplication, or content-addressable
storage where collision resistance is not a security requirement.

This rule specifically targets instantiation of `cryptography.hazmat.primitives.hashes.MD5`.
The hazmat ("Hazardous Materials") layer assumes the caller understands the risks, but
MD5 remains dangerous regardless of the API used to reach it.


## Vulnerable Code

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.backends import default_backend

# SEC-010: MD5 in cryptography lib
digest = hashes.Hash(hashes.MD5(), backend=default_backend())
```

## Secure Code

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import os

# SECURE: SHA-256 for general integrity checking
digest = hashes.Hash(hashes.SHA256())
digest.update(b"data to hash")
result = digest.finalize()

# SECURE: SHA-3 for stronger collision resistance
digest = hashes.Hash(hashes.SHA3_256())
digest.update(b"data to hash")
result = digest.finalize()

# SECURE: PBKDF2 with SHA-256 for password-derived keys
salt = os.urandom(16)
kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=600000)
key = kdf.derive(b"my password")

# OK for non-security use only (deduplication, cache keys), NOT integrity checks
# import hashlib; hashlib.md5(data).hexdigest()  # only if collisions are not a concern
```

## Detection Rule (Python SDK)

```python
from rules.python_decorators import python_rule
from codepathfinder import calls, flows, QueryType
from codepathfinder.presets import PropagationPresets

class CryptoHashes(QueryType):
    fqns = ["cryptography.hazmat.primitives.hashes"]


@python_rule(
    id="PYTHON-CRYPTO-SEC-010",
    name="Insecure MD5 Hash (cryptography)",
    severity="MEDIUM",
    category="cryptography",
    cwe="CWE-327",
    tags="python,cryptography,md5,weak-hash,CWE-327",
    message="MD5 is cryptographically broken. Use SHA-256 or SHA-3 instead.",
    owasp="A02:2021",
)
def detect_md5_hash_crypto():
    """Detects MD5 usage in cryptography library."""
    return CryptoHashes.method("MD5")
```

## How to Fix

- Replace `hashes.MD5()` with `hashes.SHA256()` or `hashes.SHA3_256()` for all integrity and signing use cases.
- For password hashing, do not use any raw hash function. Use a purpose-built password hashing function such as Argon2 (`argon2-cffi`), scrypt, or bcrypt instead.
- For HMAC authentication, use HMAC with SHA-256 or SHA-3 (`cryptography.hazmat.primitives.hmac` with `hashes.SHA256()`).
- MD5 may remain in place for purely non-security uses (cache keys, file deduplication) where collision resistance carries no security consequence. Document this explicitly.
- When migrating existing MD5-hashed data (e.g., stored checksums), re-hash with SHA-256 on first verified access and deprecate the MD5 path.
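
The migration step in the last bullet could look roughly like the sketch below. The `StoredChecksum` record layout and the function names are hypothetical, and the example uses the standard library's `hashlib` to stay dependency-free; adapt it to however your checksums are actually stored.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass
class StoredChecksum:
    """Hypothetical record layout: algorithm name plus hex digest."""
    algorithm: str  # "md5" (legacy) or "sha256"
    hexdigest: str


def _hexdigest(algorithm: str, data: bytes) -> str:
    if algorithm == "md5":
        # Legacy path only: kept so existing records can be checked one last time.
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    if algorithm == "sha256":
        return hashlib.sha256(data).hexdigest()
    raise ValueError(f"unsupported algorithm: {algorithm}")


def verify_and_upgrade(record: StoredChecksum, data: bytes) -> bool:
    """Verify data against the record; upgrade MD5 records to SHA-256 on success."""
    expected = _hexdigest(record.algorithm, data)
    if not hmac.compare_digest(expected, record.hexdigest):
        return False
    if record.algorithm == "md5":
        # First verified access: replace the MD5 digest with a SHA-256 digest.
        record.algorithm = "sha256"
        record.hexdigest = _hexdigest("sha256", data)
    return True
```

Note that the upgraded SHA-256 digest only inherits the trust the MD5 check provided. If an attacker could already have swapped in colliding data, re-validate against a trusted source instead of upgrading in place.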

## FAQ

**Q: Is MD5 ever safe to use?**

MD5 is safe for non-security checksums such as file deduplication, cache invalidation keys, or content-addressable storage, where finding a collision gives an attacker no advantage. It must not be used for digital signatures, certificate hashing, HMAC, password storage, or any context where collision resistance matters.

**Q: Why not just use SHA-256 for password hashing too?**

SHA-256, like every raw hash function, is designed to be fast, and that speed helps an attacker running brute-force or dictionary attacks. Password hashing requires a deliberately slow function, ideally a memory-hard one such as Argon2 or scrypt; bcrypt is also widely accepted. PBKDF2 with SHA-256 is acceptable when Argon2 is unavailable, but it needs a high iteration count: the OWASP Password Storage Cheat Sheet recommends at least 600,000 iterations for PBKDF2-HMAC-SHA256.
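
A minimal sketch of password hashing with a memory-hard function, using `hashlib.scrypt` from the standard library to avoid extra dependencies. The function names and the cost parameters in `SCRYPT_PARAMS` are illustrative assumptions, not tuned recommendations; choose parameters based on current guidance and your hardware.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters (assumption); tune for your environment.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key); store both alongside the user record."""
    salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(key, expected_key)
```

In practice, a maintained library such as `argon2-cffi` also handles parameter encoding and upgrades for you, which this sketch does not.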

**Q: I need MD5 for a legacy protocol or format — what should I do?**

If MD5 is mandated by an external specification you cannot change, document it clearly, isolate the usage, and add compensating controls (e.g., an outer integrity layer using SHA-256 HMAC). Flag the dependency for removal when the protocol allows migration.
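
One way to add such an outer integrity layer is sketched below. The message layout, field names, and function names are hypothetical, and the example uses the standard library's `hmac` module; the MD5 field stands in for whatever checksum the legacy format mandates.

```python
import hashlib
import hmac


def wrap_legacy_message(payload: bytes, key: bytes) -> dict:
    """Build a message with the mandated MD5 field plus an HMAC-SHA256 tag."""
    return {
        "payload": payload,
        # Required by the (hypothetical) legacy format; not trusted for integrity.
        "md5": hashlib.md5(payload, usedforsecurity=False).hexdigest(),
        # Compensating control: keyed tag over the payload.
        "hmac_sha256": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_message(message: dict, key: bytes) -> bool:
    """Accept the message only if the HMAC-SHA256 tag matches."""
    expected = hmac.new(key, message["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, message["hmac_sha256"])
```

The receiver bases its trust decision on the HMAC tag alone; the MD5 field is carried only for compatibility with the legacy consumer.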

**Q: Does this rule fire on hashlib.md5() from the standard library?**

No. This rule targets the `cryptography` library's hazmat primitives. For `hashlib.md5()` detection, see the hashlib-specific rules in this ruleset.

**Q: How do I run this rule in CI/CD?**

Run `code-pathfinder scan --ruleset python/cryptography/PYTHON-CRYPTO-SEC-010 --path ./src` in your pipeline. Add `--format sarif` to produce SARIF output compatible with GitHub Advanced Security and similar platforms.

**Q: What is the severity and why MEDIUM rather than HIGH?**

MEDIUM reflects that the risk from MD5 is context-dependent: collision attacks are practical, but they require the attacker to influence the data being signed or hashed. Rules targeting MD4 and MD2 are rated HIGH because those algorithms offer no practical security even in constrained scenarios.

**Q: Can the cryptography library's hazmat MD5 be used safely for non-cryptographic purposes?**

Technically yes, but using the hazmat interface for non-security purposes adds unnecessary complexity. Prefer `hashlib.md5()` for checksums to make the non-security intent explicit. The hazmat interface signals cryptographic use, which increases the chance of future misuse.
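
A short sketch of what an explicitly non-security MD5 use might look like with `hashlib`. The helper name `cache_key` is hypothetical, and the `usedforsecurity=False` keyword assumes Python 3.9 or later.

```python
import hashlib


def cache_key(data: bytes) -> str:
    """Non-security fingerprint for cache lookups or deduplication."""
    # usedforsecurity=False documents the intent and keeps the call working
    # on builds that restrict MD5 for security purposes.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


print(cache_key(b"example content"))
```

Keeping the call in a small, clearly named helper makes it easier to audit that MD5 never drifts into an integrity or authentication path.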

## References

- [CWE-327: Use of a Broken or Risky Cryptographic Algorithm](https://cwe.mitre.org/data/definitions/327.html)
- [CWE-328: Use of Weak Hash](https://cwe.mitre.org/data/definitions/328.html)
- [Wang et al. 2004: How to Break MD5 and Other Hash Functions](https://link.springer.com/chapter/10.1007/978-3-540-28628-8_19)
- [Stevens et al. 2009: Short Chosen-Prefix Collisions for MD5](https://link.springer.com/chapter/10.1007/978-3-642-03356-8_8)
- [NIST SP 800-131A Rev 2: Transitioning the Use of Cryptographic Algorithms](https://csrc.nist.gov/publications/detail/sp/800-131a/rev-2/final)
- [NIST SP 800-107: Recommendation for Applications Using Approved Hash Algorithms](https://csrc.nist.gov/publications/detail/sp/800-107/rev-1/final)
- [OWASP Cryptographic Failures (A02:2021)](https://owasp.org/Top10/A02_2021-Cryptographic_Failures/)
- [cryptography library hazmat hashes documentation](https://cryptography.io/en/latest/hazmat/primitives/cryptographic-hashes/)

