MD5 Collision Attacks

G5 Cyber Security

1 month ago

TL;DR

You can create two different inputs that produce the same MD5 hash (a collision). This is possible because MD5 isn’t a strong hashing algorithm anymore. If you have some control over one of the inputs, it makes finding a collision much easier than if you had to find it completely randomly.

Understanding the Problem

MD5 creates a 128-bit hash value from any input data. A ‘collision’ happens when two different pieces of data produce the same MD5 hash. Because MD5 is known to have weaknesses, collisions can be found relatively easily using techniques like length extension attacks and precomputed rainbow tables (though these are less common now). The fact that you have partial control over one input significantly simplifies things.

Solution: Length Extension Attack

The most practical approach when you have some control over the initial message is a length extension attack. This allows you to append data to an existing hash and create a new, colliding hash without knowing the original message.

Understand the Setup: You know part of the message (message1) and its MD5 hash (hash1). You also control some additional data you want to append (append_data). The goal is to find a message2 that has the same MD5 hash as message1.
Padding: MD5 requires specific padding. This involves appending ‘1’ followed by enough ‘0’s so the message length becomes a multiple of 512 bits, plus a 64-bit representation of the original message length in little-endian format. You don’t need to calculate this manually; libraries handle it for you.
Length Extension Calculation: The core idea is to use the initial hash (hash1), the original message length, and your append_data to compute a new hash that will collide with the original. This requires understanding how MD5 internally processes data in blocks. Libraries abstract this complexity.

Implementation Example (Python): The following example uses the hashlib library and demonstrates the concept. Note: This is a simplified illustration; real-world implementations might require more robust handling of message lengths and padding.

import hashlib

def length_extension_attack(message1, hash1, append_data):
  original_length = len(message1) * 8 # Length in bits
  new_message = message1 + append_data
  h = hashlib.md5()
  h.update(message1.encode('utf-8'))
  h.update(bytes([0x80])) # Append '1' bit
  # Pad with zeros to reach 512 bits (adjust as needed)
  padding_length = (448 - original_length) % 512
  h.update(b' ' * padding_length)
  h.update(original_length.to_bytes(8, 'little')) # Append original length in bits
  intermediate_hash = h.digest()

  # Now append the new data and calculate the final hash.
  h2 = hashlib.md5(intermediate_hash + append_data.encode('utf-8'))
  final_hash = h2.hexdigest()
  return final_hash

message1 = "hello"
hash1 = hashlib.md5(message1.encode('utf-8')).hexdigest()
append_data = "world"
new_hash = length_extension_attack(message1, hash1, append_data)
print(f"Original Message: {message1}")
print(f"Original Hash: {hash1}")
print(f"Appended Data: {append_data}")
print(f"New Hash (should collide): {new_hash}")

Verification: After calculating the new hash, verify that it matches the original hash. If they match, you’ve successfully created a collision. In practice, finding *useful* collisions might require more sophisticated techniques and tools. The example above is for demonstration only; the ‘world’ append data will not necessarily create a useful collision.

Important Considerations

MD5 is Deprecated: MD5 is considered cryptographically broken. Do *not* use it for security-sensitive applications like password storage or digital signatures. Use stronger hashing algorithms like SHA-256 or SHA-3.
Library Support: Most programming languages provide libraries that handle the complexities of MD5 padding and internal calculations. Use these libraries instead of trying to implement MD5 yourself.
Collision Resistance vs. Preimage Resistance: Length extension attacks exploit weaknesses in MD5’s collision resistance (finding two inputs with the same hash). They don’t directly break preimage resistance (finding an input that produces a given hash), but they can be used to construct attacks against systems relying on MD5 for integrity checks.