TL;DR
This guide shows you how to remove potential malware from a PDF file by ‘bursting’ it into its individual components and then rebuilding it. This process can discard malicious code that might be hidden within the file structure.
Steps
- Install Required Tools
- PDFtk Server: A command-line tool for manipulating PDFs. Download it from PDF Labs (choose the correct version for your operating system). You may need to install it using a package manager like apt on Linux or by running the installer on Windows.
- Ghostscript: A PostScript and PDF interpreter. Download it from Ghostscript (again, choose the version for your OS). Ensure it’s added to your system’s PATH environment variable so you can run gs commands from any directory.
- Burst the PDF
- Inspect Extracted Components (Optional but Recommended)
- Images: Open images in an image editor and look for hidden data or unusual patterns.
- Fonts: Be wary of fonts from unknown sources. You can use font inspection tools online to check their properties.
- JavaScript Files (if any): Examine JavaScript files carefully for malicious code. Use a text editor or an online JavaScript analyser.
- Rebuild the PDF
- Verify the Rebuilt PDF
- Open and Test: Open the rebuilt PDF in a PDF viewer and test all its features (forms, links, buttons) to ensure everything works as expected.
- Scan with Anti-Virus: Scan the rebuilt PDF file with your anti-virus software for any remaining threats.
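Before working through the steps above, it can help to confirm that both tools are actually reachable from the command line. The following is a minimal sketch for a POSIX shell; the function name check_tools is illustrative and not part of either tool.

```shell
#!/bin/sh
# Report which of the named commands cannot be found on PATH.
# Returns 0 when all are present, 1 otherwise.
# check_tools is a hypothetical helper name chosen for this example.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_tools pdftk gs prints nothing and returns success when both are installed. On Windows the Ghostscript console executable is usually named gswin64c rather than gs, so the name to check may differ there.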
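The goal here is to split the PDF into smaller pieces that can be checked and then rebuilt. Use PDFtk Server for this. Its burst operation writes each page of the input to its own single-page PDF file; it does not extract images, fonts, or scripts as separate files.
pdftk input.pdf burst output output_folder/pg_%04d.pdf
Replace input.pdf with the name of your potentially infected file and output_folder with a new folder where you want to save the pages. Create the folder before running the command. The %04d in the output pattern is replaced by the page number, so this will create files like pg_0001.pdf, pg_0002.pdf, etc. To examine images or fonts individually, extract them from these pages with a dedicated tool (for example, pdfimages from Poppler for images).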
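The burst step can be wrapped in a small helper so that the output folder always exists before PDFtk runs. This is a sketch under stated assumptions: it relies on pdftk's burst operation writing one PDF per page to a printf-style output pattern, and the names burst_pdf and pg_%04d.pdf are illustrative choices.

```shell
#!/bin/sh
# Split INPUT into one PDF per page inside OUT_DIR.
# burst_pdf is a hypothetical helper; the pg_%04d.pdf naming is an assumption.
burst_pdf() {
    input=$1
    out_dir=$2
    # Create the target folder first so pdftk has somewhere to write
    mkdir -p "$out_dir" || return 1
    # pdftk replaces %04d with the zero-padded page number
    pdftk "$input" burst output "$out_dir/pg_%04d.pdf"
}
```

For example, burst_pdf input.pdf output_folder should leave pg_0001.pdf, pg_0002.pdf, and so on in output_folder. Zero-padded names are used so that the files sort in page order later.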
Before rebuilding, it’s a good idea to check the extracted components for anything suspicious. This is especially important if you have reason to believe specific types of malware might be present.
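One quick check that needs no extra tools is to count raw occurrences of PDF dictionary keys that commonly accompany active content, such as /JavaScript or /OpenAction. The sketch below is only a heuristic. It assumes a grep that supports the -a and -o options (GNU and BSD grep do), and the function name scan_pdf_keys and the key list are illustrative.

```shell
#!/bin/sh
# Heuristic scan: count raw occurrences of PDF keys linked to active content.
# A count of 0 does not prove the file is clean, because keys can sit inside
# compressed object streams or be written with #xx escapes. Short keys such
# as /AA can also match longer, harmless names.
# scan_pdf_keys is a hypothetical helper name chosen for this example.
scan_pdf_keys() {
    file=$1
    for key in /JavaScript /JS /OpenAction /AA /Launch /EmbeddedFile; do
        # -a reads binary data as text, -o prints each match on its own line,
        # -F matches the key literally rather than as a pattern
        count=$(grep -a -o -F -e "$key" "$file" | wc -l | tr -d ' ')
        printf '%s %s\n' "$key" "$count"
    done
}
```

Calling scan_pdf_keys input.pdf prints one line per key with its count. Any non-zero count for /JavaScript, /JS, /OpenAction, or /Launch is a reason to look more closely before opening the file in a viewer.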
Now, rebuild the PDF using Ghostscript. Its pdfwrite device re-interprets each page and writes a fresh file, so content that is not part of the rendered pages, such as document-level scripts and attachments, is generally not carried over.
gs -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -dQUIET -sOutputFile=new.pdf output_folder/pg_*.pdf
Replace new.pdf with the desired name for your cleaned PDF file. Adjust the pg_*.pdf part of the command to match the actual file names in your output_folder. The -dSAFER option restricts file access while Ghostscript processes the untrusted input; recent versions enable it by default.
Important: The order of files passed to Ghostscript matters, because pages appear in the rebuilt PDF in the order given. Zero-padded names such as pg_0001.pdf sort correctly when the shell expands the wildcard.
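The rebuild can be sketched as a helper that feeds the page files to Ghostscript in sorted order and refuses to run when none are found. It assumes the pages are single-page PDFs named with a zero-padded pg_ prefix; that naming and the function name rebuild_pdf are illustrative assumptions.

```shell
#!/bin/sh
# Re-distill every single-page PDF in PAGES_DIR into one OUTPUT file.
# rebuild_pdf is a hypothetical helper; pg_*.pdf is an assumed naming scheme.
rebuild_pdf() {
    pages_dir=$1
    output=$2
    # The shell expands the glob in sorted order, so zero-padded names
    # (pg_0001.pdf, pg_0002.pdf, ...) keep the original page sequence.
    set -- "$pages_dir"/pg_*.pdf
    # An unmatched glob stays literal, so check that the first entry exists
    if [ ! -e "$1" ]; then
        printf 'no page files found in %s\n' "$pages_dir" >&2
        return 1
    fi
    gs -dSAFER -dBATCH -dNOPAUSE -dQUIET \
       -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -sOutputFile="$output" "$@"
}
```

For example, rebuild_pdf output_folder new.pdf writes the merged result to new.pdf. Relying on glob order rather than listing the files by hand avoids accidentally shuffling pages, but it only works when the page numbers are zero-padded to the same width.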
Disclaimer: While this method can help remove embedded malware, it is not foolproof. Sophisticated malware may still be present. Always exercise caution when opening PDFs from untrusted sources.

