TL;DR
You can create two different inputs that produce the same MD5 hash (a collision). This is practical because MD5's collision resistance is broken. If you control part of the inputs, finding a collision is far easier than searching for one at random.
Understanding the Problem
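MD5 creates a 128-bit hash value from any input data. A collision happens when two different pieces of data produce the same MD5 hash. Because MD5 has known structural weaknesses, collisions can be found cheaply with differential techniques such as identical-prefix and chosen-prefix collision attacks. Length extension attacks and rainbow tables are often mentioned alongside these, but they solve different problems: the first computes the hash of an extended message, the second inverts hashes of likely inputs, and neither yields two inputs with the same hash. Having partial control over the inputs is what makes the prefix-based collision attacks applicable.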
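To make the definition of a collision concrete, here is a minimal sketch that finds one by brute force on a deliberately truncated MD5 (only the first 3 bytes of the digest). The truncation and the helper names `truncated_md5` and `find_truncated_collision` are assumptions made for illustration; they are not part of any attack on full MD5.

```python
import hashlib
from itertools import count

def truncated_md5(data: bytes, nbytes: int = 3) -> bytes:
    """Return only the first nbytes of the MD5 digest (a deliberately weak hash)."""
    return hashlib.md5(data).digest()[:nbytes]

def find_truncated_collision(nbytes: int = 3):
    """Birthday search: hash distinct inputs until two share a truncated digest."""
    seen = {}  # truncated digest -> first message that produced it
    for i in count():
        msg = f"msg-{i}".encode()
        tag = truncated_md5(msg, nbytes)
        if tag in seen:
            return seen[tag], msg
        seen[tag] = msg

a, b = find_truncated_collision()
print(a, b, truncated_md5(a).hex())
```

A generic birthday search like this needs on the order of 2^(n/2) hashes for an n-bit output: a few thousand for 24 bits, but roughly 2^64 for the full 128-bit digest. That gap is why practical MD5 collisions rely on structural attacks rather than brute force.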
Solution: Length Extension Attack
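When you know the hash of a message and can choose data to append to it, the most practical technique is a length extension attack. It lets you compute the valid MD5 hash of the extended message without knowing the original message, only its length. Note that the new hash differs from the original one, so the result is a forged hash for a longer message rather than a collision in the strict sense.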
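- Understand the Setup: You know the MD5 hash (hash1) of a message (message1) and the length of message1, but not necessarily its content. You also control some additional data you want to append (append_data). The goal is to compute the MD5 hash of message2 = message1 + padding + append_data without reading message1.
- Padding: MD5 pads every message before hashing. It appends a single 1 bit (the byte 0x80), then 0 bits until the length is congruent to 448 modulo 512 bits, then the original message length in bits as a 64-bit little-endian integer. Libraries apply this automatically when hashing, but for a length extension you must reproduce this padding yourself, because it becomes part of message2.
- Length Extension Calculation: An MD5 digest is the internal state (four 32-bit words) after the last block has been processed. The core idea is to load hash1 as the starting state and then process append_data, followed by the final padding for the total length. Standard libraries such as hashlib do not let you set the internal state, so this step needs a custom MD5 compression function or a dedicated tool.
- Implementation Example (Python): The following example implements the compression function in plain Python so that the state can be seeded from hash1; hashlib is used only to produce hash1 and to check the result. Note: This is a simplified illustration; it assumes you know the exact byte length of message1.

```python
import hashlib
import math
import struct

# MD5 per-round shift amounts and sine-derived constants (RFC 1321).
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]

def compress(state, block):
    """Apply the MD5 compression function to one 64-byte block."""
    a, b, c, d = state
    m = struct.unpack("<16I", block)
    for i in range(64):
        if i < 16:
            f, g = (b & c) | (~b & d), i
        elif i < 32:
            f, g = (d & b) | (~d & c), (5 * i + 1) % 16
        elif i < 48:
            f, g = b ^ c ^ d, (3 * i + 5) % 16
        else:
            f, g = c ^ (b | ~d), (7 * i) % 16
        f = (f + a + K[i] + m[g]) & 0xFFFFFFFF
        rotated = ((f << S[i]) | (f >> (32 - S[i]))) & 0xFFFFFFFF
        a, d, c, b = d, c, b, (b + rotated) & 0xFFFFFFFF
    return [(x + y) & 0xFFFFFFFF for x, y in zip(state, (a, b, c, d))]

def md5_padding(length_bytes):
    """Return the padding MD5 appends to a message of the given byte length."""
    zeros = (55 - length_bytes) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", length_bytes * 8)

def length_extension_attack(hash1, original_length, append_data):
    """Return (glue_padding, new_hash) without access to the original message."""
    state = list(struct.unpack("<4I", bytes.fromhex(hash1)))  # resume from hash1
    glue = md5_padding(original_length)
    tail = append_data + md5_padding(original_length + len(glue) + len(append_data))
    for i in range(0, len(tail), 64):
        state = compress(state, tail[i:i + 64])
    return glue, struct.pack("<4I", *state).hex()

message1 = b"hello"
hash1 = hashlib.md5(message1).hexdigest()
append_data = b"world"
glue, new_hash = length_extension_attack(hash1, len(message1), append_data)
message2 = message1 + glue + append_data

print(f"Original hash: {hash1}")
print(f"Extended hash: {new_hash}")
print(f"Forgery valid: {new_hash == hashlib.md5(message2).hexdigest()}")
print(f"Collision:     {new_hash == hash1}")
```

- Verification: After calculating the new hash, verify that it equals the MD5 hash of message2 computed directly. It will not match hash1: the extended message has a different hash, so the "Forgery valid" line should print True and the "Collision" line False. Producing two inputs with an identical MD5 hash requires dedicated collision-finding tools such as fastcoll or HashClash, which exploit your control over the message contents. The example above is for demonstration only.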
Important Considerations
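- MD5 is Deprecated: MD5 is considered cryptographically broken. Do *not* use it for security-sensitive applications like password storage or digital signatures. Use stronger hashing algorithms like SHA-256 or SHA-3 for signatures and integrity checks, and a dedicated password-hashing function such as Argon2 or bcrypt for password storage.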
- Library Support: Most programming languages provide libraries that handle the complexities of MD5 padding and internal calculations. Use these libraries instead of trying to implement MD5 yourself.
- Collision Resistance vs. Preimage Resistance: Collision attacks break MD5's collision resistance (finding two inputs with the same hash). They don't break preimage resistance (finding an input that produces a given hash). Length extension is a separate property of MD5's Merkle-Damgård construction and breaks neither, but it can be used against systems that rely on MD5(secret + message) for integrity checks.
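For keyed integrity checks, a construction such as HMAC is not affected by length extension. The sketch below uses Python's standard `hmac` module with SHA-256; the key, the messages, and the helper names `sign` and `verify` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> str:
    """Return an HMAC-SHA256 tag for the message as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"hypothetical-secret-key"  # assumption: a secret shared by signer and verifier
tag = sign(key, b"amount=100")

print(verify(key, b"amount=100", tag))             # the original message is accepted
print(verify(key, b"amount=100&admin=true", tag))  # an extended message is rejected
```

Using `hmac.compare_digest` rather than `==` avoids leaking, through timing differences, how many leading characters of a guessed tag are correct.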

