Developing Sunbeam

Getting involved with developing Sunbeam can be a little daunting at first. This doc will try to break down the constituent parts from a developer’s perspective.

sunbeam (Python package)

Sunbeam is a Python package designed to facilitate reproducible bioinformatics workflows. The core of the package is located in the sunbeam/ directory. This is where the main code for Sunbeam lives, and it is organized into subdirectories based on functionality.

bfx

The bfx/ directory contains bioinformatics utilities, used in the pipeline for transformation and reporting.

sunbeam.bfx.filter_ids(fp_in: Path, fp_out: Path, ids: Set[str], log: TextIO) → None

Filter FASTQ records based on a set of IDs to remove.

Args:
fp_in (Path): Path to the input FASTQ file.
fp_out (Path): Path to the output FASTQ file.
ids (Set[str]): Set of IDs to filter.
log (TextIO): TextIO object to write log messages.

Returns:
None: This function does not return anything.

Raises:
AssertionError: If the number of removed IDs does not match the expected count.
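
As a rough illustration of the behavior described above, the sketch below filters FASTQ records by ID using only the standard library. It is not Sunbeam's implementation: filter_fastq_ids and its return value are hypothetical, and it assumes four-line FASTQ records whose ID is the first whitespace-delimited token after the leading "@".

```python
import io
from typing import Set, TextIO


def filter_fastq_ids(f_in: TextIO, f_out: TextIO, ids: Set[str]) -> int:
    """Copy FASTQ records to f_out, skipping IDs in `ids`; return the number removed."""
    removed = 0
    while True:
        header = f_in.readline()
        if not header:
            break  # end of input
        seq, plus, qual = f_in.readline(), f_in.readline(), f_in.readline()
        record_id = header[1:].split()[0]  # drop '@', keep the first token
        if record_id in ids:
            removed += 1
            continue
        f_out.write(header + seq + plus + qual)
    # Mirror the documented check: every requested ID should have been removed.
    assert removed == len(ids), f"expected to remove {len(ids)}, removed {removed}"
    return removed


fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
out = io.StringIO()
n_removed = filter_fastq_ids(io.StringIO(fastq), out, {"r2"})
```

Working on open file handles rather than paths keeps the sketch easy to test with in-memory data; the real function takes paths and a log handle, as documented above.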

sunbeam.bfx.get_mapped_reads(fp: str, min_pct_id: float, min_len_frac: float) → Iterator[str]: Takes a SAM file and returns an iterator of the names of reads that are mapped.

sunbeam.bfx.parse_decontam_log(f: TextIO) → OrderedDict[str, str]

sunbeam.bfx.parse_fasta(f: TextIO) → Iterator[Tuple[str, str]]
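
To give a feel for what a parser with this signature does, here is a minimal standalone sketch that yields (header, sequence) tuples from a FASTA handle. The name iter_fasta is hypothetical, and details such as stripping the ">" and joining wrapped sequence lines are assumptions, not a description of Sunbeam's own parser.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fasta(f: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining sequences wrapped over several lines."""
    header, chunks = None, []
    for line in f:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # start a new record
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # flush the final record


records = list(iter_fasta(io.StringIO(">a desc\nAC\nGT\n>b\nTT\n")))
```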

sunbeam.bfx.parse_fastq(f: TextIO) → Iterator[Tuple[str, str, str, str]]

sunbeam.bfx.parse_fastqc_quality(filename: str) → DataFrame

sunbeam.bfx.parse_komplexity_log(f: TextIO) → OrderedDict[str, str]

sunbeam.bfx.parse_sam(f: TextIO) → Iterator[Dict[str, int | float | str | Tuple[int, str]]]

sunbeam.bfx.parse_trim_summary_paired(f: TextIO) → OrderedDict[str, str]

sunbeam.bfx.parse_trim_summary_single(f: TextIO) → OrderedDict[str, str]

sunbeam.bfx.remove_pair_id(id: str, log: TextIO = sys.stdout) → str

Removes the pair identifier from the given ID.

Args:
id (str): The ID to remove the pair identifier from.
log (TextIO): The log file to write any messages to.

Returns:
str: The ID with the pair identifier removed.
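
The sketch below shows one plausible reading of "pair identifier": a trailing "/1" or "/2" on the read ID. That convention, the function name strip_pair_suffix, and the log message are all assumptions made for illustration; check the Sunbeam source for the formats it actually handles.

```python
import io
import sys
from typing import TextIO


def strip_pair_suffix(read_id: str, log: TextIO = sys.stdout) -> str:
    """Drop a trailing '/1' or '/2' from a read ID, logging when one is removed."""
    read_id = read_id.strip()
    if read_id.endswith(("/1", "/2")):
        log.write(f"Removed pair suffix from {read_id}\n")
        return read_id[:-2]
    return read_id  # nothing to strip


log = io.StringIO()
forward = strip_pair_suffix("read7/1", log)
unpaired = strip_pair_suffix("read7", log)
```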

sunbeam.bfx.summarize_qual_decontam(tfile: str, dfile: str, kfile: str, paired_end: bool) → DataFrame: Return a DataFrame of summary information for the Trimmomatic and decontam rules.

sunbeam.bfx.write_fasta(record: Tuple[str, str], f: TextIO) → None

sunbeam.bfx.write_fastq(record: Tuple[str, str, str, str], f: TextIO) → None

configs

The configs/ directory contains sample configuration files for the pipeline. These are used by sunbeam init to set up a project and define the parameters for the workflow.

extensions

This is the default location for extensions, although it can be configured by setting $SUNBEAM_EXTENSIONS.

project

The project/ directory contains project management utilities, used for initializing and managing Sunbeam projects.

class sunbeam.project.SampleList(fp: Path = None, paired_end: bool = True, format_str: str = None)

check(): Check the sample list for duplicates, missing values, and invalid values.

static format_string_to_regex(format_str: str) → Pattern: Convert a format string like ‘{sample}_R{rp}.fastq.gz’ to a regex pattern.

generate_subset(func: callable) → SampleList: Generate a subset of the sample list based on a function. The function takes three args: sample_name, r1, and r2, and should return True if the sample should be included in the subset.

get_samples(): Getting samples in a format that is backwards compatible Convert “r1” and “r2” to “1” and “2”, make sure all fields are strings

guess_format_string(fnames: list[Path]) → Pattern

load_from_dir(fp: Path, format_str: str = None) → dict[str, dict[str, str]]: Load a sample list from a directory.

load_from_file(fp: Path) → dict[str, dict[str, str]]: Load a sample list from a file.

to_file(fp: Path): Write the sample list to a file.

class sunbeam.project.SunbeamConfig(config: Dict[str, str | Dict] = {})

Sunbeam configuration file

Defining samples:

Run ‘sunbeam list_samples <data_dir>’ to create a list of samples and associated fastq.gz files. Samples must be in gzipped fastq format.

Paths:

Paths are resolved through the following rules:

If the path is absolute, the path is parsed as-is
If the path is not absolute, it is resolved relative to the path at ‘root’
If the path is not ‘output_fp’, the path is checked to ensure it exists
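
The three rules above can be sketched as a small helper. This is an illustration under stated assumptions rather than the code SunbeamConfig uses: resolve_fp is a hypothetical name, and raising FileNotFoundError for a missing path is a guess at how the existence check might surface.

```python
import tempfile
from pathlib import Path


def resolve_fp(key: str, value: str, root: Path) -> Path:
    """Resolve one config path following the three rules listed above."""
    path = Path(value)
    if not path.is_absolute():
        path = root / path  # relative paths are anchored at 'root'
    if key != "output_fp" and not path.exists():
        # The output directory may not exist yet; every other path must.
        raise FileNotFoundError(f"{key}: {path} does not exist")
    return path


root = Path(tempfile.mkdtemp())
resolved = resolve_fp("output_fp", "sunbeam_output", root)
```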

Suffixes:

Each subsection contains a ‘suffix’ key that defines the folder under ‘output_fp’ where the results of that section are put.

fill_missing(extensions_dir: Path = None): Fill in missing extension config values with defaults

classmethod from_file(config_fp: Path) → SunbeamConfig: Create a SunbeamConfig object from a file

classmethod from_template(template_fp: Path, root_fp: Path, extensions_dir: Path = None) → SunbeamConfig: Create a SunbeamConfig object from a template file

static get_extension_rules(extension_fp: Path) → list[Path]: Find all .smk and .rules files in the extension directory using glob

static get_extensions(extensions_dir: Path = None) → dict[str, Path]: Get a list of all extensions in the extensions directory

modify(change_str: str): Modify the config file with the specified changes. change_str should be a string in the format “root_key: {sub_key: value}”.

resolved_paths() → dict[str, Path | str]: Resolve all paths in the config file (any field ending in “_fp”). Relative paths are resolved relative to the ‘root’ key.

to_file(config_fp: Path): Write the config to a file

class sunbeam.project.SunbeamProfile(config: dict = {})

classmethod from_template(template_fp: Path) → SunbeamProfile

to_file(config_fp: Path)

sunbeam.project.output_subdir(cfg: dict[str, dict[str, str]], section: str) → Path: Get the output subdirectory for a given section. Here mostly for backwards compatibility.
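
A helper like this presumably joins the project's output path with the section's ‘suffix’ key described earlier. The sketch below shows that idea; the function name section_output_dir and the assumption that ‘output_fp’ lives under a top-level "all" section are illustrative only and should be checked against your own config file.

```python
from pathlib import Path


def section_output_dir(cfg: dict, section: str) -> Path:
    """Join the project's output path with the section's configured suffix."""
    # Assumed layout: cfg["all"]["output_fp"] holds the output root.
    return Path(cfg["all"]["output_fp"]) / cfg[section]["suffix"]


cfg = {"all": {"output_fp": "sunbeam_output"}, "qc": {"suffix": "qc"}}
qc_dir = section_output_dir(cfg, "qc")
```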

scripts

The scripts/ directory contains scripts for running the pipeline and managing the workflow.

sunbeam.scripts.Config(argv)

sunbeam.scripts.Extend(argv=sys.argv)

sunbeam.scripts.Init(argv=sys.argv)

sunbeam.scripts.Run(argv: list[str] = sys.argv): CLI entry point for running Sunbeam.

sunbeam.scripts.Sunbeam()

workflow

The core of the work done by Sunbeam is handled by the Snakemake workflow, located in workflow/. Once a project is set up properly, the workflow can be run with all the benefits of Snakemake. The core of the workflow is defined in workflow/Snakefile. Refer to the Snakemake docs for help with Snakemake itself; they’re very good. From this core Snakefile, we import more .smk files from workflow/rules/ and extensions/sbx_*/.

Important Variables (Python)

Variables that are defined in the Python library and can optionally be imported into Snakemake:

__version__: str - The version of Sunbeam being run.
EXTENSIONS_DIR: () -> Path - A function that lazily loads the Path to the extensions directory.
WORKFLOW_DIR: Path - The Path to the workflow directory.
CONFIGS_DIR: Path - The Path to the configs directory.
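
The reason EXTENSIONS_DIR is a function rather than a constant can be shown with a small sketch: resolving the directory at call time means a change to $SUNBEAM_EXTENSIONS is picked up even after the module has been imported. The default path below is a placeholder, and extensions_dir is a hypothetical stand-in, not Sunbeam's actual code.

```python
import os
from pathlib import Path

# Placeholder default for illustration; Sunbeam's real default location may differ.
DEFAULT_EXTENSIONS_DIR = Path("extensions")


def extensions_dir() -> Path:
    """Resolve the extensions directory lazily, honoring $SUNBEAM_EXTENSIONS."""
    return Path(os.environ.get("SUNBEAM_EXTENSIONS", DEFAULT_EXTENSIONS_DIR))


current = extensions_dir()  # evaluated now, not at import time
```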

Important Variables (Snakemake)

Variables defined in the main Snakefile can be accessed throughout the workflow. Some important variables include:

Samples: Dict[str, Dict[str, str]] - A dictionary where keys are sample names and values are dictionaries of read pairs mapping to file paths (Samples[sample] = {"1": r1, "2": r2}).
Pairs: List[str] - Either ["1", "2"] or ["1"], depending on whether the project is paired end or not.
Cfg: Dict[str, Dict[str, str]] - The YAML config converted into dictionary form.
MIN_MEM_MB: int - A minimum value of the number of megabytes of memory to request for each job. This will only apply for jobs that rely on Sunbeam to guess their memory requirements.
MIN_RUNTIME: int - A minimum value of the number of minutes to request for each job. This will only apply for jobs that rely on Sunbeam to guess their runtime requirements.
HostGenomes: List[str] - A list of host genomes that are used for decontaminating reads.
HostGenomeFiles: List[str] - A list of files with host genomes that are used for decontaminating reads (not to be confused with sbx_mapping’s GenomeFiles variable, which it uses to track reference genome files).
QC_FP: Path - The Path to the project’s quality control output directory.
BENCHMARK_FP: Path - The Path to the project’s benchmarking output directory.
LOG_FP: Path - The Path to the project’s log output directory.
ASSEMBLY_FP: Path - The Path to the project’s assembly output directory.
ANNOTATION_FP: Path - The Path to the project’s annotation output directory.
CLASSIFY_FP: Path - The Path to the project’s classification output directory.
MAPPING_FP: Path - The Path to the project’s mapping output directory.
VIRUS_FP: Path - The Path to the project’s virus output directory.
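
To show how these variables fit together, the sketch below builds stand-in values for Samples, Pairs, and QC_FP in plain Python and derives one target path per sample and read pair, much as a rule's input function might. The sample names, file paths, and the "cleaned" subdirectory are invented for illustration; inside a real Snakefile these variables are already defined for you.

```python
from pathlib import Path

# Stand-ins with the shapes documented above; all values are made up.
Samples = {
    "stool_A": {"1": "/data/stool_A_R1.fastq.gz", "2": "/data/stool_A_R2.fastq.gz"},
    "stool_B": {"1": "/data/stool_B_R1.fastq.gz", "2": "/data/stool_B_R2.fastq.gz"},
}
Pairs = ["1", "2"]
QC_FP = Path("sunbeam_output") / "qc"

# One target per (sample, read pair); the directory layout is hypothetical.
targets = [
    QC_FP / "cleaned" / f"{sample}_{rp}.fastq.gz"
    for sample in Samples
    for rp in Pairs
]
```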

Environment Variables

SUNBEAM_EXTS_INCLUDE: str - If set, will include the given extension in the workflow (and exclude the rest). This is useful for testing individual extensions. Can also be set in the args to sunbeam run.
SUNBEAM_EXTS_EXCLUDE: str - If set, will exclude the given extension from the workflow. This is useful for when namespaces between extensions collide (same rule name multiple times). Can also be set in the args to sunbeam run.
SUNBEAM_SKIP: str - If set, will skip either ‘qc’ or ‘decontam’. Can also be set in the args to sunbeam run.
SUNBEAM_DOCKER_TAG: str - If set, will use the given tag for the Docker image instead of the default. Can also be set in the args to sunbeam run.
SUNBEAM_MIN_MEM_MB: int - If set, will override the default minimum memory value.
SUNBEAM_MIN_RUNTIME: int - If set, will override the default minimum runtime value.
SUNBEAM_NO_ADAPTER: bool - If set, will not check that the adapter template file exists.
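
The override behavior of the integer variables above can be sketched as follows. The helper int_from_env is hypothetical, and the default values are placeholders chosen for the example, not Sunbeam's real defaults.

```python
import os


def int_from_env(name: str, default: int) -> int:
    """Return the integer value of an environment variable, or `default` if unset or empty."""
    raw = os.environ.get(name, "")
    return int(raw) if raw.strip() else default


# Placeholder defaults for illustration only.
MIN_MEM_MB = int_from_env("SUNBEAM_MIN_MEM_MB", 1000)
MIN_RUNTIME = int_from_env("SUNBEAM_MIN_RUNTIME", 15)
```

From a shell, an override for a single run would then look like SUNBEAM_MIN_MEM_MB=4000 sunbeam run followed by your usual arguments.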

Contributing

Sunbeam is an open-source project, and we welcome contributions from the community. If you would like to contribute to Sunbeam, please follow these steps:

Fork the repository on GitHub.
Create a new branch for your changes.
Make your changes and test them locally.
Commit your changes and push them to your fork.
Create a pull request against the main branch of the Sunbeam repository (tagging any relevant issues).
Update the documentation if necessary.
Wait for feedback from the maintainers and make any necessary changes.
Once your changes are approved, they will be merged into the main branch and included in the next release!

Testing locally

To lint and test Sunbeam locally, you can use the following commands:

Automated tests

Tests will be run automatically on every push to the repository. The tests are run using GitHub Actions and are defined in the .github/workflows/ directory. They test across multiple Python versions and environment managers for more thorough coverage. You can see the results in the PR on GitHub.

Updating docs

The docs/ directory contains the documentation for Sunbeam. The documentation is written in reStructuredText and is built with Sphinx.

Misc

Random tips and tricks that I don’t have a good home for. Hopefully these might save you a headache or two.

Importing `sunbeam` in scripts

Parts of the sunbeam package are meant to be imported into Snakemake and scripts. Things like parse_fastq() should be pretty general-use functions for DIY scripts. However, you should not import sunbeam into a rule that uses the conda keyword. Under certain configurations, like having sunbeam installed in a venv and running the rule in a conda env, the sunbeam module won’t be available. Instead, put the logic that needs sunbeam and the logic that needs the conda-managed dependencies into separate rules.