User Guide

Requirements

  • A relatively recent Linux computer with more than 4 GB of RAM

We do not currently support Windows or Mac. (You should be able to run this on Windows using Ubuntu under [WSL](https://docs.microsoft.com/en-us/windows/wsl/about).)

Installation

Sunbeam has multiple options for installation. For development work on Sunbeam, use git. For standard usage, we recommend installing each version of Sunbeam that you need from a tarball into its own directory. For example, if you want versions 3 and 4 installed, repeat the tar install process below once for sunbeam3.1.1 and once for sunbeam4.0.0 (or whichever specific versions you want). If you are comfortable with Docker, the containerized version of Sunbeam is also an option.

On a Linux machine, download the tarball for the Sunbeam version you want (sunbeamX.X.X), then unpack and install it.

wget https://github.com/sunbeam-labs/sunbeam/releases/latest/download/sunbeam.tar.gz
mkdir sunbeam4.0.0
tar -zxf sunbeam.tar.gz -C sunbeam4.0.0
cd sunbeam4.0.0 && ./install.sh

The installer will check for and install the three components necessary for Sunbeam to work. The first is Conda, a system for downloading and managing software environments. The second is the Sunbeam environment, which will contain all the core dependencies. The third is the Sunbeam library, which provides the necessary commands to run Sunbeam.

If Conda was not already installed before you ran the installer, you will need to add a line (displayed during install) to your shell config file (usually ~/.bashrc or ~/.profile). Restart your terminal after installation for this to take effect.

Testing

We’ve included tests that verify all the dependencies are installed and that Sunbeam can run properly. We strongly recommend running them after installing or updating Sunbeam:

pytest tests/ -vvl

If the tests fail, refer to our troubleshooting guide or file an issue on our GitHub page.

Updating

Sunbeam follows semantic versioning practices. In short, this means that the version has three numbers: major, minor and patch. For instance, a version number of 1.2.1 has 1 as the major version, 2 as the minor, and 1 as the patch.
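As a rough illustration of the numbering scheme described above, the following sketch splits a version string into its three parts. The names parse_version and same_major are hypothetical helpers written for this guide, not part of Sunbeam; same_major reflects the general semantic versioning convention that only a change in the major number signals a breaking change.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Split a 'major.minor.patch' string into its three integer parts."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return Version(major, minor, patch)


def same_major(installed: str, candidate: str) -> bool:
    """Return True when both versions share the same major number."""
    return parse_version(installed).major == parse_version(candidate).major


print(parse_version("1.2.1"))        # Version(major=1, minor=2, patch=1)
print(same_major("1.0.0", "1.1.0"))  # True: only the minor number changed
print(same_major("3.1.1", "4.0.0"))  # False: the major number changed
```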

When we update Sunbeam in a way that keeps your config files and environment compatible, we increment only the patch or minor number (e.g. 1.0.0 -> 1.1.0). All you need to do is the following:

The tar-based install can’t be upgraded, so you’ll need to download the latest tarball and install it alongside your current version. You can then remove the old version if you want to (or keep it so you know you have a working version to fall back on).

Sunbeam is designed so that a new installation does not interfere with an existing one. This means multiple versions of Sunbeam can be installed on the same machine in different repositories and different environments (or containers).

Uninstalling or reinstalling

If things go awry and updating doesn’t work, simply uninstall and reinstall Sunbeam.

source deactivate
conda remove -n sunbeamX.X.X --all
cd ../ && rm -rf sunbeam/

Then follow the installation instructions above.

Tip

If you’re using the Docker image, you can simply remove the image and container(s) and pull the latest version.

Installing Sunbeam extensions

As of version 3.0, Sunbeam extensions can be installed by running sunbeam extend followed by the URL of the extension’s GitHub repo:

sunbeam extend https://github.com/sunbeam-labs/sbx_mapping/

For Sunbeam versions prior to 3.0, follow the legacy installation instructions provided with the extension. They should look something like:

git clone https://github.com/sunbeam-labs/sbx_mapping.git extensions/sbx_mapping
cat extensions/sbx_mapping/config.yml >> /path/to/project/sunbeam_config.yml

Setup

Tip

From this point on, all instructions assume either a git or tar install. If you’re using the Docker image, you’ll need to run all commands within the container. They should look much the same, provided you mount the necessary directories to your container. Note that the conda environment does not need to be activated within the container.

Activating Sunbeam

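Almost all commands from this point forward require the Sunbeam conda environment to be active. Activate it with conda activate followed by the environment name, for example conda activate sunbeamX.X.X.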

You should see ‘(SUNBEAM_ENV_NAME)’ in your prompt when you’re in the environment. To leave the environment, run source deactivate or close the terminal.

Tip

You can see a list of installed sunbeam environments using the command conda env list.

Creating a new project using local data

We provide a utility, sunbeam init, to create a new config file, profile, and sample list for a project. The utility takes one required argument: a path to your project folder. This folder will be created if it doesn’t exist. You can also specify the path to your gzipped fastq files, and Sunbeam will try to guess how your samples are named, and whether they’re paired.

In the project directory, sunbeam init creates a new config file and a new sample list (by default named sunbeam_config.yml and samplelist.csv, respectively) as well as a profile file (named config.yaml). Edit the config and profile files in your favorite text editor. All the keys for the config are described below.

Note

Sunbeam will do its best to determine how your samples are named in the data_fp you specify. It assumes they are named something regular, like MP66_S109_L008_R1.fastq.gz and MP66_S109_L008_R2.fastq.gz. In this case, the sample name would be ‘MP66_S109_L008’ and the read pair indicator would be ‘1’ and ‘2’. Thus, the filename format would look like {sample}_R{rp}.fastq.gz, where {sample} defines the sample name and {rp} defines the 1 or 2 in the read pair.

If you have single-end reads, you can pass --single_end to sunbeam init and it will not try to identify read pairs.

If the guessing doesn’t work as expected, you can manually specify the filename format after the --format option in sunbeam init.

Finally, if you don’t have your data ready yet, simply omit the --data_fp option. You can create a sample list later with sunbeam list_samples > samples.csv.
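To make the {sample}_R{rp}.fastq.gz format from the note above concrete, here is a minimal sketch of how such a template could be matched against filenames. It is an illustration only and not Sunbeam's own guessing code; the function names format_to_regex and group_reads are hypothetical.

```python
import re


def format_to_regex(fmt):
    """Turn a template such as '{sample}_R{rp}.fastq.gz' into a compiled regex."""
    pieces = []
    # Splitting on a capturing group keeps the placeholders in the result.
    for part in re.split(r"(\{sample\}|\{rp\})", fmt):
        if part == "{sample}":
            pieces.append(r"(?P<sample>.+)")
        elif part == "{rp}":
            pieces.append(r"(?P<rp>[12])")
        else:
            pieces.append(re.escape(part))  # literal text, e.g. '_R' or '.fastq.gz'
    return re.compile("".join(pieces))


def group_reads(filenames, fmt):
    """Map each sample name to a {read pair indicator: filename} dict."""
    regex = format_to_regex(fmt)
    samples = {}
    for name in filenames:
        match = regex.fullmatch(name)
        if match is None:
            continue  # skip files that do not fit the template
        samples.setdefault(match["sample"], {})[match["rp"]] = name
    return samples


files = ["MP66_S109_L008_R1.fastq.gz", "MP66_S109_L008_R2.fastq.gz", "notes.txt"]
print(group_reads(files, "{sample}_R{rp}.fastq.gz"))
```

With the two example files from the note, both reads end up under the sample name MP66_S109_L008, keyed by '1' and '2'.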

If some config values are always the same for all projects (e.g. paths to shared databases), you can put these keys in a file and auto-populate your config file with them during initialization. For instance, if you have a custom trimmomatic adapter template located at /home/user/adapter.fa, you could have a file containing the following called common_values.yml:

When you make a new Sunbeam project, pass --defaults common_values.yml as part of the init command.
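Conceptually, shared defaults override the matching keys in a freshly generated config while leaving everything else untouched. The sketch below shows one way such a merge could work on nested dictionaries. It is not Sunbeam's actual implementation: apply_defaults is a hypothetical helper, the placement of adapter_template under qc is assumed from the configuration sections described later in this guide, and "<autofilled>" is a placeholder value.

```python
import copy


def apply_defaults(config, defaults):
    """Return a copy of config with values from defaults merged in, section by section."""
    merged = copy.deepcopy(config)
    for key, value in defaults.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = apply_defaults(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical generated config; "<autofilled>" stands in for the real value.
generated = {"qc": {"suffix": "qc", "adapter_template": "<autofilled>"}}

# Assumed content of common_values.yml after loading it into a dict.
defaults = {"qc": {"adapter_template": "/home/user/adapter.fa"}}

merged = apply_defaults(generated, defaults)
print(merged["qc"])
```

Only adapter_template changes; the other keys in the qc section keep their generated values.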

If you have Sunbeam extensions installed, in Sunbeam >= 3.0, the extension config options will be automatically included in new config files generated by sunbeam init.

If you want to customize options in the profile instead, you can create a custom profile template named sunbeamlib/data/custom_profile.yaml and fill it with whatever options you want included in each Sunbeam run. Snakemake maintains a curated list of common profiles for working with HPC platforms and job schedulers. Sunbeam ships with a default profile and a slurm profile. To use your custom profile, pass --profile custom as part of the init command.

Further usage information is available by typing sunbeam init --help.

Configuration

Sunbeam has lots of configuration options, but most don’t need individual attention. Below, each is described by section.

Sections

all

  • root: The root project folder, used to resolve any relative paths in the rest of the config file.

  • output_fp: Path to where the Sunbeam outputs will be stored.

  • samplelist_fp: Path to a comma-separated file where each row contains a sample name and one or two paths (if single- or paired-end) to raw gzipped fastq files. This can be created for you by sunbeam init or sunbeam list_samples.

  • paired_end: ‘true’ or ‘false’ depending on whether you are using paired- or single-end reads.

  • version: Automatically added for you by sunbeam init. Ensures compatibility with the right version of Sunbeam.
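The following sketch illustrates two of the keys above: how a relative path could be resolved against root, and how a sample list with one or two paths per row could be read. It is an illustration under stated assumptions, not Sunbeam's own code: resolve and read_samplelist are hypothetical names, the sample list is assumed to have no header row, and the example paths are made up.

```python
import csv
import io
from pathlib import PurePosixPath


def resolve(root, path):
    """Resolve path against root; absolute paths are returned unchanged."""
    return str(PurePosixPath(root) / path)


def read_samplelist(handle):
    """Map each sample name to its list of one or two fastq paths."""
    samples = {}
    for row in csv.reader(handle):
        if not row:
            continue  # ignore blank lines
        name, *paths = [field.strip() for field in row]
        paths = [p for p in paths if p]  # a single-end row may leave the second field empty
        if len(paths) not in (1, 2):
            raise ValueError(f"{name}: expected 1 or 2 paths, found {len(paths)}")
        samples[name] = paths
    return samples


# Made-up paired-end sample list, held in memory so the example is self-contained.
example = io.StringIO(
    "sampleA,/data/sampleA_R1.fastq.gz,/data/sampleA_R2.fastq.gz\n"
    "sampleB,/data/sampleB_R1.fastq.gz,/data/sampleB_R2.fastq.gz\n"
)
samples = read_samplelist(example)
print(samples["sampleA"])
print(resolve("/home/user/project", "sunbeam_output"))
```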

qc

  • suffix: the name of the subfolder to hold outputs from the quality-control steps

  • leading: (trimmomatic) remove the leading bases of a read if below this quality

  • trailing: (trimmomatic) remove the trailing bases of a read if below this quality

  • slidingwindow: (trimmomatic) the [width, avg. quality] of the sliding window

  • minlength: (trimmomatic) drop reads smaller than this length

  • adapter_template: (trimmomatic) path to the Illumina paired-end adaptors (templated with $CONDA_ENV) (autofilled)

  • fwd_adapters: (cutadapt) custom forward adaptor sequences to remove using cutadapt. Replace with "" to skip.

  • rev_adapters: (cutadapt) custom reverse adaptor sequences to remove using cutadapt. Replace with "" to skip.

  • cutadapt_opts: (cutadapt) options to pass to cutadapt. Replace with "" to pass no extra options.

  • kz_threshold: a value between 0 and 1 to determine the low-complexity boundary (1 is most stringent). Ignored if not masking low-complexity sequences.

  • pct_id: the percent identity threshold for filtering mapped reads.

  • frac: the minimum fraction of the read that must be mapped to the reference to be kept.

  • host_fp: the path to the folder with host/contaminant genomes (ending in *.fasta)
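To give a feel for how the leading, trailing, slidingwindow and minlength options interact, here is a deliberately simplified sketch operating on a list of per-base quality scores. It is not Trimmomatic's actual algorithm (which handles window edges differently), the order of the steps is an assumption, and the default values in the signature are illustrative only. The name trim_read is hypothetical.

```python
def trim_read(quals, leading=3, trailing=3, window=(4, 15), minlength=36):
    """Return the (start, end) slice kept after trimming, or None if the read is dropped."""
    start, end = 0, len(quals)

    # leading: strip bases from the 5' end while they are below the threshold
    while start < end and quals[start] < leading:
        start += 1

    # trailing: strip bases from the 3' end while they are below the threshold
    while end > start and quals[end - 1] < trailing:
        end -= 1

    # slidingwindow: cut the read at the first window whose mean quality is too low
    width, min_avg = window
    for i in range(start, end - width + 1):
        if sum(quals[i:i + width]) / width < min_avg:
            end = i
            break

    # minlength: drop reads that have become too short
    if end - start < minlength:
        return None
    return start, end


quals = [2, 2] + [30] * 10 + [5] * 6 + [2]
print(trim_read(quals, minlength=5))   # (2, 11)
print(trim_read([30] * 10))            # None: shorter than minlength
```

In the first call, the two low-quality leading bases and the final base are stripped, and the window scan then cuts the read where the mean quality of four consecutive bases first drops below 15.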

classify

  • suffix: the name of the subfolder to hold outputs from the taxonomic classification steps

assembly

  • suffix: the name of the folder to hold outputs from the assembly steps

annotation

  • suffix: the name of the folder to hold contig annotation results

blastdbs

  • root_fp: path to a directory containing BLAST databases (if they’re all in the same place)

mapping

  • suffix: the name of the subfolder to create for mapping output (bam files, etc)

benchmarks

  • suffix: the name of the subfolder to create for benchmark data

logs

  • suffix: the name of the subfolder to create for logs

Building Databases

A detailed discussion on building databases for tools used by Sunbeam, while important, is beyond the scope of this document. Please see the following resources for more details:

Tip

These were all moved to extensions in sunbeam v4. Some vestiges remain in the main pipeline for compatibility with extensions but these should be considered deprecated and will be removed in future versions.

Running

To run Sunbeam, make sure you’ve activated the Sunbeam environment. Then run:

sunbeam run --profile path/to/project/

There are many options that you can use to determine which outputs you want. By default, if nothing is specified, this runs the entire pipeline. However, each section is broken up into subsections that can be called individually, and will only execute the steps necessary to get their outputs. These are specified after the command above and consist of the following:

  • all_qc: basic quality control on all reads

  • all_decontam: quality control and host read removal on all samples

To use one of these options, simply run it like so:

sunbeam run --profile path/to/project/ all_qc

In addition, since Sunbeam is really just a set of snakemake rules, all the (many) snakemake options apply here as well. Some useful ones are:

  • -n performs a dry run, and will just list which rules are going to be executed without actually doing so.

  • -k allows the workflow to continue with unrelated rules if one produces an error (useful for malformed samples).

  • -p prints the actual shell command executed for each rule, which is very helpful for debugging purposes.

  • --cores specifies the total number of cores used by Sunbeam. For example, if you run Sunbeam with --cores 100 and each rule/processing step uses 20 threads, it will run 5 rules at once.
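The arithmetic behind the --cores example can be written out as a small helper. This is a simplified model that assumes every rule requests the same number of threads; concurrent_rules is a hypothetical name, not a Snakemake or Sunbeam function. It also reflects Snakemake's behaviour of capping a rule's threads at the number of available cores.

```python
def concurrent_rules(total_cores: int, threads_per_rule: int) -> int:
    """Estimate how many identical rules can run at once under a core budget."""
    # A rule never gets more threads than there are cores in total.
    effective_threads = min(threads_per_rule, total_cores)
    return total_cores // effective_threads


print(concurrent_rules(100, 20))  # 5, matching the example above
print(concurrent_rules(8, 20))    # 1: the rule is scaled down to 8 threads
```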

Cluster options

Sunbeam inherits its cluster abilities from Snakemake. There’s nothing special about installing Sunbeam on a cluster, but in order to distribute work to cluster nodes, you have to install and enable the Snakemake cluster executor of your choice.

sunbeam init /path/to/cluster/project/ --data_fp /path/to/big/dataset/ --profile slurm
pip install snakemake-executor-plugin-slurm
sunbeam run --profile /path/to/cluster/project/

Edit any options set in the profile as if they are snakemake command line arguments.

Tip

Snakemake cluster executors can be installed with pip. See snakemake’s executor docs for more information.

Outputs

This section describes all the outputs from Sunbeam. Here is an example output directory.

├ stats
└ sunbeam_output
   ├ benchmarks
   ├ logs
   └ qc

Quality control

qc
├ 00_samples
├ 01_cutadapt
├ 02_trimmomatic
├ 03_komplexity
├ cleaned
├ decontam
├ log
│  ├ decontam
│  └ komplexity
└ reports

The cleaned folder contains the trimmed, low-complexity-filtered reads. The decontam folder contains the cleaned reads that did not map to any contaminant or host genomes. In general, most downstream steps should reference the decontam reads.