Developing Sunbeam
Getting involved with developing Sunbeam can be a little daunting at first. This doc will try to break down the constituent parts from a developer’s perspective.
sunbeam (Python package)
Sunbeam is a Python package designed to help facilitate reproducible bioinformatics workflows. The core of the package is located in the sunbeam/ directory. This is where the main code for Sunbeam lives, and it is organized into subdirectories based on functionality.
bfx
The bfx/ directory contains bioinformatics utilities, used in the pipeline for transformation and reporting.
- sunbeam.bfx.filter_ids(fp_in: Path, fp_out: Path, ids: Set[str], log: TextIO) None
Filter FASTQ records based on a set of IDs to remove.
- Args:
fp_in (Path): Path to the input FASTQ file.
fp_out (Path): Path to the output FASTQ file.
ids (Set[str]): Set of IDs to filter.
log (TextIO): TextIO object to write log messages.
- Returns:
None: This function does not return anything.
- Raises:
AssertionError: If the number of removed IDs does not match the expected count.
- sunbeam.bfx.get_mapped_reads(fp: str, min_pct_id: float, min_len_frac: float) Iterator[str]
Takes a SAM file and returns an iterator over the names of the reads that are mapped.
- sunbeam.bfx.parse_decontam_log(f: TextIO) OrderedDict[str, str]
- sunbeam.bfx.parse_fasta(f: TextIO) Iterator[Tuple[str, str]]
- sunbeam.bfx.parse_fastq(f: TextIO) Iterator[Tuple[str, str, str, str]]
- sunbeam.bfx.parse_fastqc_quality(filename: str) DataFrame
- sunbeam.bfx.parse_komplexity_log(f: TextIO) OrderedDict[str, str]
- sunbeam.bfx.parse_sam(f: TextIO) Iterator[Dict[str, int | float | str | Tuple[int, str]]]
- sunbeam.bfx.parse_trim_summary_paired(f: TextIO) OrderedDict[str, str]
- sunbeam.bfx.parse_trim_summary_single(f: TextIO) OrderedDict[str, str]
- sunbeam.bfx.remove_pair_id(id: str, log: TextIO = sys.stdout) str
Removes the pair identifier from the given ID.
- Args:
id (str): The ID to remove the pair identifier from.
log (TextIO): The log file to write any messages to.
- Returns:
str: The ID with the pair identifier removed.
- sunbeam.bfx.summarize_qual_decontam(tfile: str, dfile: str, kfile: str, paired_end: bool) DataFrame
Return a DataFrame of summary information for the trimmomatic and decontam rules.
- sunbeam.bfx.write_fasta(record: Tuple[str, str], f: TextIO) None
- sunbeam.bfx.write_fastq(record: Tuple[str, str, str, str], f: TextIO) None
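The signatures above suggest that FASTQ records travel through the pipeline as 4-tuples of strings. The following is a minimal sketch of that idea, not Sunbeam's implementation: it assumes each record occupies exactly four lines and that the tuple order is (header, sequence, separator, quality). The names read_fastq_records and write_fastq_record are hypothetical stand-ins for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple

# Assumed field order: (header, sequence, separator, quality).
FastqRecord = Tuple[str, str, str, str]


def read_fastq_records(f: TextIO) -> Iterator[FastqRecord]:
    """Yield one 4-tuple per four-line FASTQ record (hypothetical helper)."""
    while True:
        lines = [f.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            # An empty header line means end of file (or a trailing blank line).
            return
        yield (lines[0], lines[1], lines[2], lines[3])


def write_fastq_record(record: FastqRecord, f: TextIO) -> None:
    """Write a record back out as four newline-terminated lines."""
    f.write("\n".join(record) + "\n")


# Round-trip two made-up records through the reader and the writer.
raw = "@read1/1\nACGT\n+\nIIII\n@read2/1\nTTGA\n+\nFFFF\n"
records = list(read_fastq_records(io.StringIO(raw)))
out = io.StringIO()
for record in records:
    write_fastq_record(record, out)
```

Keeping records as plain tuples means a DIY script can filter or transform them with ordinary Python before writing them back out.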
configs
The configs/ directory contains sample configuration files for the pipeline. These are used by sunbeam init to set up a project and define the parameters for the workflow.
extensions
This is the default location for extensions, although it can be configured by setting $SUNBEAM_EXTENSIONS.
project
The project/ directory contains project management utilities, used for initializing and managing Sunbeam projects.
- class sunbeam.project.SampleList(fp: Path = None, paired_end: bool = True, format_str: str = None)
- check()
Check the sample list for duplicates, missing values, and invalid values.
- static format_string_to_regex(format_str: str) Pattern
Convert a format string like ‘{sample}_R{rp}.fastq.gz’ to a regex pattern.
- generate_subset(func: callable) SampleList
Generate a subset of the sample list based on a function. The function takes three args: sample_name, r1, and r2, and should return True if the sample should be included in the subset.
- get_samples()
Get samples in a format that is backwards compatible: convert “r1” and “r2” to “1” and “2”, and make sure all fields are strings.
- guess_format_string(fnames: list[Path]) Pattern
- load_from_dir(fp: Path, format_str: str = None) dict[str, dict[str, str]]
Load a sample list from a directory.
- load_from_file(fp: Path) dict[str, dict[str, str]]
Load a sample list from a file.
- to_file(fp: Path)
Write the sample list to a file.
- class sunbeam.project.SunbeamConfig(config: Dict[str, str | Dict] = {})
Sunbeam configuration file
- Defining samples:
Run ‘sunbeam list_samples <data_dir>’ to create a list of samples and associated fastq.gz files. Samples must be in gzipped fastq format.
- Paths:
- Paths are resolved through the following rules:
If the path is absolute, the path is parsed as-is
If the path is not absolute, it is resolved relative to the path at ‘root’
If the path is not ‘output_fp’, the path is checked to ensure it exists
- Suffixes:
- Each subsection contains a ‘suffix’ key that defines the folder under
‘output_fp’ where the results of that section are put.
- fill_missing(extensions_dir: Path = None)
Fill in missing extension config values with defaults
- classmethod from_file(config_fp: Path) SunbeamConfig
Create a SunbeamConfig object from a file
- classmethod from_template(template_fp: Path, root_fp: Path, extensions_dir: Path = None) SunbeamConfig
Create a SunbeamConfig object from a template file
- static get_extension_rules(extension_fp: Path) list[Path]
Find all .smk and .rules files in the extension directory using glob
- static get_extensions(extensions_dir: Path = None) dict[str, Path]
Get a list of all extensions in the extensions directory
- modify(change_str: str)
Modify the config file with the specified changes. change_str should be a string in the format “root_key: {sub_key: value}”.
- resolved_paths() dict[str, Path | str]
Resolve all paths in the config file (any field ending in “_fp”). Relative paths are resolved relative to the ‘root’ key.
- to_file(config_fp: Path)
Write the config to a file
- class sunbeam.project.SunbeamProfile(config: dict = {})
- classmethod from_template(template_fp: Path) SunbeamProfile
- to_file(config_fp: Path)
- sunbeam.project.output_subdir(cfg: dict[str, dict[str, str]], section: str) Path
Get the output subdirectory for a given section. Here mostly for backwards compatibility.
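The path rules listed under SunbeamConfig above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind resolved_paths(): the function name resolve_config_path and the key samplelist_fp are hypothetical, and a missing path is assumed to raise FileNotFoundError.

```python
import tempfile
from pathlib import Path


def resolve_config_path(key: str, value: str, root: Path) -> Path:
    """Resolve one config path following the three documented rules (sketch)."""
    path = Path(value)
    # Rules 1 and 2: absolute paths are kept as-is, relative ones are joined onto root.
    if not path.is_absolute():
        path = root / path
    # Rule 3: every path except output_fp must already exist.
    if key != "output_fp" and not path.exists():
        raise FileNotFoundError(f"{key}: {path} does not exist")
    return path


# Usage with a throwaway project root; "samplelist_fp" is a hypothetical key.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "samples.csv").write_text("")
    samplelist = resolve_config_path("samplelist_fp", "samples.csv", root)
    output = resolve_config_path("output_fp", "sunbeam_output", root)
    resolved_ok = (
        samplelist == root / "samples.csv"
        and output == root / "sunbeam_output"
    )
```

The output directory is exempt from the existence check because it is typically created by the workflow itself.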
scripts
The scripts/ directory contains scripts for running the pipeline and managing the workflow.
- sunbeam.scripts.Config(argv)
- sunbeam.scripts.Extend(argv=sys.argv)
- sunbeam.scripts.Init(argv=sys.argv)
- sunbeam.scripts.Run(argv: list[str] = sys.argv)
CLI entry point for running Sunbeam.
- sunbeam.scripts.Sunbeam()
workflow
The core of the work done by Sunbeam is handled by the Snakemake workflow, located in workflow/. Once a project is set up properly, the workflow can be run with all the benefits of Snakemake. The core of the workflow is defined in workflow/Snakefile. Refer to the Snakemake docs for help understanding Snakemake; they’re very good. From this core Snakefile, we import more .smk files from workflow/rules/ and extensions/sbx_*/.
Important Variables (Python)
Variables that are defined in the Python library and can optionally be imported into Snakemake:
__version__: str - The version of Sunbeam being run.
EXTENSIONS_DIR: () -> Path - A function that lazily loads the Path to the extensions directory.
WORKFLOW_DIR: Path - The Path to the workflow directory.
CONFIGS_DIR: Path - The Path to the configs directory.
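To illustrate why EXTENSIONS_DIR is a function rather than a constant, here is a minimal sketch of a lazy lookup that honors $SUNBEAM_EXTENSIONS. It is not Sunbeam's code: the name extensions_dir and the fallback DEFAULT_EXTENSIONS_DIR are hypothetical placeholders.

```python
import os
from pathlib import Path

# Hypothetical fallback for illustration; the real default is the package's
# own extensions/ directory.
DEFAULT_EXTENSIONS_DIR = Path("sunbeam") / "extensions"


def extensions_dir() -> Path:
    """Look up the extensions directory at call time, not at import time."""
    return Path(os.environ.get("SUNBEAM_EXTENSIONS", DEFAULT_EXTENSIONS_DIR))
```

Because the environment is read on every call, setting or changing $SUNBEAM_EXTENSIONS after the module has been imported still takes effect.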
Important Variables (Snakemake)
Variables defined in the main Snakefile can be accessed throughout the workflow. Some important variables include:
Samples: Dict[str, Dict[str, str]] - A dictionary where keys are sample names and values are dictionaries of read pairs mapping to file paths (Samples[sample] = {"1": r1, "2": r2}).
Pairs: List[str] - Either ["1", "2"] or ["1"], depending on whether the project is paired-end.
Cfg: Dict[str, Dict[str, str]] - The YAML config converted into dictionary form.
MIN_MEM_MB: int - A minimum value of the number of megabytes of memory to request for each job. This will only apply for jobs that rely on Sunbeam to guess their memory requirements.
MIN_RUNTIME: int - A minimum value of the number of minutes to request for each job. This will only apply for jobs that rely on Sunbeam to guess their runtime requirements.
HostGenomes: List[str] - A list of host genomes that are used for decontaminating reads.
HostGenomeFiles: List[str] - A list of files with host genomes that are used for decontaminating reads (not to be confused with sbx_mapping’s GenomeFiles variable, which it uses to track reference genome files).
QC_FP: Path - The Path to the project’s quality control output directory.
BENCHMARK_FP: Path - The Path to the project’s benchmarking output directory.
LOG_FP: Path - The Path to the project’s log output directory.
ASSEMBLY_FP: Path - The Path to the project’s assembly output directory.
ANNOTATION_FP: Path - The Path to the project’s annotation output directory.
CLASSIFY_FP: Path - The Path to the project’s classification output directory.
MAPPING_FP: Path - The Path to the project’s mapping output directory.
VIRUS_FP: Path - The Path to the project’s virus output directory.
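As a rough illustration of the Samples and Pairs shapes described above, the sketch below builds a small dictionary by hand and flattens it into a list of input files. The sample names, file paths, and the helper input_files are made up for the example; in a real project these variables are populated by the Snakefile.

```python
from typing import Dict, List

# Hypothetical sample names and paths, following the documented layout
# Samples[sample] = {"1": r1, "2": r2}.
Samples: Dict[str, Dict[str, str]] = {
    "sampleA": {"1": "/data/sampleA_R1.fastq.gz", "2": "/data/sampleA_R2.fastq.gz"},
    "sampleB": {"1": "/data/sampleB_R1.fastq.gz", "2": "/data/sampleB_R2.fastq.gz"},
}
Pairs: List[str] = ["1", "2"]  # would be ["1"] for a single-end project


def input_files(samples: Dict[str, Dict[str, str]], pairs: List[str]) -> List[str]:
    """List every read file, ordered by sample name and then by read pair."""
    return [samples[name][rp] for name in sorted(samples) for rp in pairs]
```

Iterating over Pairs instead of hard-coding "1" and "2" lets the same logic serve both paired-end and single-end projects.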
Environment Variables
SUNBEAM_EXTS_INCLUDE: str - If set, will include the given extension in the workflow (and exclude the rest). This is useful for testing individual extensions. Can also be set in the args to sunbeam run.
SUNBEAM_EXTS_EXCLUDE: str - If set, will exclude the given extension from the workflow. This is useful for when namespaces between extensions collide (same rule name multiple times). Can also be set in the args to sunbeam run.
SUNBEAM_SKIP: str - If set, will skip either ‘qc’ or ‘decontam’. Can also be set in the args to sunbeam run.
SUNBEAM_DOCKER_TAG: str - If set, will use the given tag for the Docker image instead of the default. Can also be set in the args to sunbeam run.
SUNBEAM_MIN_MEM_MB: int - If set, will override the default minimum memory value.
SUNBEAM_MIN_RUNTIME: int - If set, will override the default minimum runtime value.
SUNBEAM_NO_ADAPTER: bool - If set, will not check that the adapter template file exists.
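One way to picture how an override such as SUNBEAM_MIN_MEM_MB could interact with a guessed memory request is sketched below. This is an assumption-laden illustration rather than Sunbeam's logic: the default of 8000 is a placeholder, the helper names are hypothetical, and treating the minimum as a floor on the guess is an interpretation of the MIN_MEM_MB description.

```python
import os
from typing import Mapping

# Placeholder default for illustration only; not Sunbeam's actual value.
HYPOTHETICAL_DEFAULT_MIN_MEM_MB = 8000


def min_mem_mb(env: Mapping[str, str] = os.environ) -> int:
    """Return the minimum memory in MB, honoring SUNBEAM_MIN_MEM_MB if set."""
    return int(env.get("SUNBEAM_MIN_MEM_MB", HYPOTHETICAL_DEFAULT_MIN_MEM_MB))


def requested_mem_mb(guess: int, env: Mapping[str, str] = os.environ) -> int:
    """Assumed behavior: a guessed requirement is never allowed below the minimum."""
    return max(guess, min_mem_mb(env))
```

Passing the environment in as a mapping keeps the helpers easy to exercise without modifying the real process environment.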
Contributing
Sunbeam is an open-source project, and we welcome contributions from the community. If you would like to contribute to Sunbeam, please follow these steps:
Fork the repository on GitHub.
Create a new branch for your changes.
Make your changes and test them locally.
Commit your changes and push them to your fork.
Create a pull request against the main branch of the Sunbeam repository (tagging any relevant issues).
Update the documentation if necessary.
Wait for feedback from the maintainers and make any necessary changes.
Once your changes are approved, they will be merged into the main branch and included in the next release!
Testing locally
To lint and test Sunbeam locally, you can use the following commands:
Automated tests
Tests will be run automatically on every push to the repository. The tests are run using GitHub Actions and are defined in the .github/workflows/ directory. They test across multiple Python versions and environment managers for more thorough coverage. You can see the results in the PR on GitHub.
Updating docs
The docs/ directory contains the documentation for Sunbeam. The documentation is written in reStructuredText and is built with Sphinx.
Misc
Random tips and tricks that I don’t have a good home for. Hopefully these might save you a headache or two.
Importing sunbeam in scripts
Parts of the sunbeam package are meant to be imported into Snakemake and scripts. Things like parse_fastq() should be pretty general-use functions for DIY scripts. HOWEVER, you should not import sunbeam into a rule that uses the conda keyword. Under certain configurations, like having sunbeam installed in a venv and running the rule in a conda env, the sunbeam module won’t be available. Split logic that needs sunbeam and logic that needs other dependencies into separate rules.