Automatically scanning for malicious user-uploaded files

If you run a service that accepts file uploads from users and then lets other users download those files again (images, for example), your service is at risk of becoming a system for distributing malware. Without safeguards in place, bad actors could use your service to upload harmful files with the intention of them being downloaded by other users.

Services like Google Drive and some email providers will automatically scan files for malicious payloads, but if you - like many people - rely on more basic object storage for storing files for your apps, then there may be less default protection available.

Luckily there are a number of methods available for addressing this.

Overview

Whilst the concepts are mostly generic and framework/infrastructure agnostic, in this post I’ll focus on a process that leverages Amazon S3, Lambda, and ClamAV. ClamAV is open-source antimalware software that can be executed as a binary without requiring a GUI.

In this post I won’t include a complete implementation (for brevity), but I will walk through the key stages, which are:

Periodic refresh of malware/virus definitions
Running the antimalware check upon new file upload
Denying downloads of “infected” files

Managing and refreshing virus definitions

This stage keeps ClamAV’s virus definitions up to date so that it can recognise files that might be infected. It involves three steps:

Obtain a binary capable of downloading new definitions
Write a function to run the binary
Schedule the function to be run periodically

ClamAV provides FreshClam - a tool for updating and managing a local database of virus signatures.

In order to obtain a freshclam binary for use in Lambda, I recommend installing ClamAV on an EC2 instance running Amazon Linux 2, and then extracting the tool from the filesystem (you can locate it by running which freshclam). Note that the binary may also depend on some shared library files, which need to be copied along with it. If any are missing, the errors you see when running the binary will name them, and running ldd against the binary lists the libraries it links to.

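Once you have the freshclam binary, create and upload a Lambda Layer containing the binary. Layers are extracted under /opt in the function’s filesystem, so the path at which you store the binary inside the Layer determines where your function will find it (for example, a file stored under bin/ in the Layer ends up in /opt/bin, which is on the default PATH). Give your Layer a suitable name, such as freshclamLayer.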
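As a rough illustration of the packaging step, the sketch below builds a Layer archive in memory with the binary stored under bin/ and marked as executable. The function name build_layer_zip and the placeholder bytes are hypothetical and only there for illustration. In practice you would read the real binary (and any shared libraries, which I'd assume belong under lib/) from the files copied off the EC2 instance.

```python
import io
import zipfile


def build_layer_zip(binary_name, binary_bytes):
    """Return the bytes of a Layer zip holding one executable under bin/."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        info = zipfile.ZipInfo(f"bin/{binary_name}")
        info.create_system = 3  # 3 = Unix, so the permission bits are honoured
        info.external_attr = 0o100755 << 16  # regular file, rwxr-xr-x
        info.compress_type = zipfile.ZIP_DEFLATED
        archive.writestr(info, binary_bytes)
    return buffer.getvalue()


# Placeholder bytes stand in for the real binary copied from EC2.
layer_bytes = build_layer_zip("freshclam", b"placeholder")
print(f"layer archive is {len(layer_bytes)} bytes")
```

Setting the permission bits explicitly matters because a binary that loses its executable flag during zipping cannot be run by the function.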

Next, we need to create a Lambda function. It doesn’t matter what runtime you use, so long as you can use it to execute the freshclam binary from the filesystem as some form of subprocess. Once ready, upload it to Lambda and reference the freshclamLayer Layer so that the binary can be made available to the function.

The function should output the generated signature database to an S3 bucket. As such, your function will need to have an IAM role that enables write access to your chosen bucket.
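To make the refresh function more concrete, here is a minimal Python sketch of one way it could look. Several things here are my assumptions rather than requirements: the binary living at /opt/bin/freshclam, a freshclam.conf file bundled with the function code, /tmp/clamav as the working directory, and a DEFINITIONS_BUCKET environment variable naming the target bucket. The S3 client is passed in as a parameter so that the helpers can be exercised without AWS access.

```python
import os
import subprocess

# Assumed locations (not prescribed by ClamAV or Lambda beyond /opt and /tmp).
FRESHCLAM_PATH = "/opt/bin/freshclam"
DEFINITIONS_DIR = "/tmp/clamav"  # /tmp is writable inside Lambda


def build_freshclam_command(config_path, data_dir=DEFINITIONS_DIR):
    """Return the argument list that downloads definitions into data_dir."""
    return [
        FRESHCLAM_PATH,
        f"--config-file={config_path}",
        f"--datadir={data_dir}",
    ]


def upload_definitions(s3_client, bucket, data_dir=DEFINITIONS_DIR):
    """Upload each file in data_dir to the bucket and return the keys written."""
    uploaded = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            s3_client.upload_file(path, bucket, name)
            uploaded.append(name)
    return uploaded


def handler(event, context):
    import boto3  # imported here so the helpers above work without boto3

    os.makedirs(DEFINITIONS_DIR, exist_ok=True)
    # Hypothetical config file shipped alongside the function code.
    command = build_freshclam_command("/var/task/freshclam.conf")
    subprocess.run(command, check=True)  # raises if freshclam fails
    bucket = os.environ["DEFINITIONS_BUCKET"]
    return {"uploaded": upload_definitions(boto3.client("s3"), bucket)}
```

Using check=True means a failed download surfaces as a failed invocation rather than silently uploading stale or partial definitions.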

Finally, use CloudWatch Events (now part of Amazon EventBridge) to schedule your function to be run periodically. For example, you could update the signatures once or twice per day, depending on your needs.
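If you prefer to set up the schedule from code, a sketch along these lines might work. It assumes a boto3 "events" client is passed in, and the rule name, target Id, and function ARN are all placeholders of my own. Note that rate expressions use a singular unit for a value of 1 and a plural unit otherwise, which the helper takes care of.

```python
def rate_expression(value, unit):
    """Build a schedule expression such as 'rate(12 hours)' or 'rate(1 day)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def schedule_refresh(events_client, rule_name, function_arn, every_hours=12):
    """Create (or update) a rule that invokes the function on a fixed interval."""
    expression = rate_expression(every_hours, "hour")
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=expression,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "refresh-definitions" is an arbitrary, hypothetical target Id.
        Targets=[{"Id": "refresh-definitions", "Arn": function_arn}],
    )
    return expression
```

The function also needs a resource-based permission allowing the rule to invoke it, which this sketch leaves out.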

Running the antimalware check on new files

This stage uses the virus definitions to decide whether a newly-uploaded file might be infected. It involves two key steps:

Create a Lambda function that runs the virus scanner
Create an S3 trigger that runs the function when new files are uploaded

In addition to FreshClam, ClamAV also provides a tool called ClamScan which can be invoked on specific files or directories in order to check them for malicious content.

Obtain the clamscan binary in the same way as freshclam in the previous stage, and bundle it into a Layer (again, as above), named something like clamscanLayer.

Next, create a new Lambda function (again, choose a runtime that can invoke binary subprocesses). The function should check the event passed to it in order to determine the bucket and key of the file that was uploaded, download both that file and the previously-uploaded virus signatures from S3 to local storage (such as /tmp), and then run clamscan against the target file. The exit code from the binary tells you whether the file is suspected to be malicious: clamscan exits with 0 when no virus is found, 1 when one or more are found, and 2 when an error occurred.
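The core of the scanning function can be sketched in Python as below. This is only an outline under a few assumptions: the binary is at /opt/bin/clamscan, and the uploaded file and the definitions have already been downloaded to local paths (the download itself is left out). The helper names are my own. The run parameter exists purely so the logic can be tried without a real clamscan binary.

```python
import subprocess
from urllib.parse import unquote_plus

# Assumption: the clamscanLayer stores the binary under bin/.
CLAMSCAN_PATH = "/opt/bin/clamscan"


def extract_uploaded_objects(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def build_clamscan_command(target_path, definitions_dir):
    """Return the argument list that scans one file with local definitions."""
    return [
        CLAMSCAN_PATH,
        f"--database={definitions_dir}",
        "--no-summary",
        target_path,
    ]


def is_infected(returncode):
    """Map clamscan's exit code to a boolean, raising on scanner errors."""
    if returncode == 0:
        return False  # no virus found
    if returncode == 1:
        return True  # at least one virus found
    raise RuntimeError(f"clamscan failed with exit code {returncode}")


def scan_file(target_path, definitions_dir, run=subprocess.run):
    """Run clamscan on target_path and report whether it looks infected."""
    command = build_clamscan_command(target_path, definitions_dir)
    result = run(command, capture_output=True, text=True)
    return is_infected(result.returncode)
```

Treating exit code 2 as an exception rather than as "clean" avoids quietly passing files that were never actually scanned.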

Once clamscan has finished, use an S3 object tag to record the result. For example, you could create a tag named Infected and set its value to true or false depending on the scan’s output. For this, your function’s IAM role needs permission to tag objects in the uploads bucket, as well as read access to both that bucket and the bucket where the virus definitions were stored in the previous stage.
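The tagging step might look something like the following sketch, which assumes a boto3 S3 client is passed in and that the scan result is already available as a boolean. The helper names are hypothetical.

```python
def build_infected_tagging(infected):
    """Return the Tagging structure recording the scan result."""
    value = "true" if infected else "false"
    return {"TagSet": [{"Key": "Infected", "Value": value}]}


def tag_scan_result(s3_client, bucket, key, infected):
    """Write the Infected tag onto the scanned object."""
    # Note: put_object_tagging replaces the object's whole tag set, so any
    # existing tags would need to be merged in first if they must be kept.
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging=build_infected_tagging(infected),
    )
```

Tag values are strings, so the boolean is written as the literal text true or false.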

Also make sure that your function uses your clamscanLayer Layer so that it can access the relevant binary.

Finally, configure the S3 bucket that holds user uploads with a new trigger that invokes your scanning function each time a new object is put to the bucket.
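The trigger can be created in the console, but if you would rather script it, a sketch like this could serve as a starting point. It assumes a boto3 S3 client is passed in. The configuration Id and the helper names are placeholders of mine.

```python
def build_notification_configuration(function_arn):
    """Return a configuration that invokes the function for every new object."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "scan-new-uploads",  # hypothetical identifier
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def attach_scan_trigger(s3_client, bucket, function_arn):
    """Apply the notification configuration to the uploads bucket."""
    # Note: this call replaces any notification configuration already
    # present on the bucket.
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_configuration(function_arn),
    )
```

S3 must also be allowed to invoke the function (via the function's resource-based policy), which the console normally sets up for you and which this sketch does not handle.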

Restrict downloads of infected files

The final stage involves telling S3 to forbid access to infected files. To do so, create (or modify) the bucket policy for the user uploads bucket so that it denies the s3:GetObject action for any object whose Infected tag has the value true. S3 makes an object’s tags available to policy conditions through the s3:ExistingObjectTag/<key> condition key.
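As an illustration, the policy statement could be generated like this. The bucket name in the example is a placeholder, and the Sid is an arbitrary label I've chosen. If your bucket already has a policy, the statement would need to be merged into it rather than replacing it.

```python
import json


def build_deny_infected_policy(bucket_name, tag_key="Infected", tag_value="true"):
    """Return a bucket policy denying downloads of objects tagged as infected."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDownloadOfInfectedObjects",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


# "my-user-uploads" is a hypothetical bucket name.
print(json.dumps(build_deny_infected_policy("my-user-uploads"), indent=2))
```

Because an explicit Deny takes precedence over any Allow, this blocks downloads of tagged objects even for principals that otherwise have read access to the bucket.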

Conclusion

In this post I’ve provided a rough overview of a process that allows for scanning user-uploaded files for malicious content. Hopefully this might help if you’re looking to make your services more secure for your users!