
Importing a Custom Pipeline into Quark

This documentation serves as a technical manual for bioinformaticians onboarding high-performance computational workflows, such as the EvoNB mutation prediction model, into Quark.

The Custom flow allows granular configuration of specialized Docker environments, GPU resource allocation, and inline execution logic.

In this documentation, the EvoNB model is used as an example to illustrate various steps.



Overview

Quark offers bioinformaticians an orchestration service where they can connect standalone scripts (including existing pipelines) into a unified, production-ready custom pipeline.

By defining custom steps and container capacities, you can transform tools like EvoNB into scalable workflows that automate the staging of large-scale ESM-2 model checkpoints and the rendering of 3D protein structures (to visualise the structural impact of predicted mutations).

Before You Start

Ensure the following technical components are ready for your workflow:

  • Containerization: A Docker image URI (e.g., from AWS ECR) containing your environment (e.g., Python 3.8+, PyTorch, and the transformers library for EvoNB).
  • Reference Data: Large-scale sequence databases, or fine-tuned model checkpoints in the case of EvoNB (EvoNB_1 through EvoNB_5) and the base ESM-2 (esm2_t33_650M_UR50D) weights, either pre-staged or uploaded to Quark directories as Filesystem Datasets.
  • Input Schema: A list of parameters (strings, integers, files) that your script expects as command-line arguments.
  • Environment Requirements: Knowledge of the required CPU, RAM, specific Docker images, and environment variables (Env). This includes benchmarked knowledge of the GPU VRAM your specific model requires. For example, the 650M-parameter ESM-2 model typically needs 16 GB+ of VRAM for efficient batching.
  • Execution Logic: For the EvoNB example, refer to the get_mutation.py script from the EvoNB repository to handle sequence masking and probability calculation.
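The masking-and-thresholding logic referenced above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual get_mutation.py: `propose_mutations` and `toy_model` are hypothetical names, and the toy scorer merely stands in for an ESM-2 forward pass so the example runs end to end.

```python
# Hedged sketch of the masking / probability logic get_mutation.py performs
# (names here are illustrative, not the real EvoNB API).
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(
    sequence: str,
    predict_probs: Callable[[str, int], Dict[str, float]],
    n: int = 5,
) -> List[Tuple[int, str, str]]:
    """Mask each position, query the model for per-residue probabilities,
    and keep substitutions whose probability exceeds n times the wild-type
    residue's probability (the '-n' threshold configured in Step 3)."""
    mutations = []
    for pos, wt in enumerate(sequence):
        probs = predict_probs(sequence, pos)  # model scores at the masked site
        wt_p = probs.get(wt, 1e-9)
        for aa in AMINO_ACIDS:
            if aa != wt and probs.get(aa, 0.0) > n * wt_p:
                mutations.append((pos, wt, aa))
    return mutations

# Toy stand-in for an ESM-2 forward pass, so the sketch is runnable.
def toy_model(seq: str, pos: int) -> Dict[str, float]:
    probs = {aa: 0.01 for aa in AMINO_ACIDS}
    probs[seq[pos]] = 0.02
    if pos == 0:
        probs["A"] = 0.5  # pretend the model strongly prefers A at position 0
    return probs

print(propose_mutations("QVQLVES", toy_model, n=5))  # → [(0, 'Q', 'A')]
```

In the real workflow, five such scorers (EvoNB_1 through EvoNB_5) are run independently and combined to minimize random errors.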

Step-by-Step Instructions

Start the Import

  1. Navigate to the My Pipelines tab on your dashboard.
  2. Click the New Pipeline button in the top-right corner. You will be given two options: Import Pipeline and Build Pipeline. Click on Import Pipeline.
  3. In the Select Pipeline Type window, choose Custom and click Continue.


Click on `Import Pipeline`


Select `Custom` and click `Continue`


Step 1: General Pipeline Details

Define your pipeline to provide identifying information for the rest of your team.

| Field | Description | Mandatory? | Bioinformatician Tip |
| --- | --- | --- | --- |
| Name | A user-friendly name. | Yes | Keep it versioned to track model updates. |
| Summary | A brief summary of the function. | Yes | Describe the biological objective. |
| Category | Used to group particular pipelines. | Yes | Select Others or Proteomics. |
| About | In-depth info using Markdown syntax. | No | List model weights, library versions, and tool citations. |
| Tags | Metadata filtering keys/values. | No | Use tags like model: esm-2 or gpu: required. |

For example, for an EvoNB workflow, you can provide the following:

  • Name: EvoNB-Mutation-Prediction. Summary: "A Protein Language Model-Based Workflow for Nanobody Mutation Prediction and Optimisation".
  • About: Provide a Markdown summary of the five independent models used to minimize random errors in mutation prediction.
  • Category: Select Immunology.
  • Tags: e.g., task: mutation-prediction, model: esm-2, gpu: required.



Step 2: Datasets (Mounts)

This step allows bioinformaticians to connect stable, pre-existing datasets (filesystems) required for the pipeline run.

Tools like EvoNB require access to model checkpoints; Quark uses Mounts for low-latency loading of weights.

Option 1:

  • Select Add New Dataset Mount to search for your pre-staged dataset (e.g., evonb or evonb-checkpoints from the Hugging Face repository for the EvoNB workflow).


  • Mount Configuration: Provide a recognizable Name for your mount. Specify the Sub Path (the path within the dataset that you want mounted) and the Mount Path (the location inside the container where the dataset will be mounted). For example, for the EvoNB workflow:

    • Name: MODEL_DIR.
    • Sub Path and Mount Path: /app/models/ (the location where your script expects the EvoNB_1 through EvoNB_5 directories).

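As a sanity check, a step can verify that the expected checkpoint directories are visible under the configured Mount Path before loading any weights. A minimal sketch, assuming the EvoNB_1..EvoNB_5 directory convention from the prerequisites (`find_checkpoints` is a hypothetical helper, not a Quark API):

```python
from pathlib import Path

def find_checkpoints(mount_path: str, prefix: str = "EvoNB_") -> list:
    """List checkpoint directories found under the configured Mount Path."""
    root = Path(mount_path)
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and p.name.startswith(prefix))

# Demonstration with a temporary stand-in for /app/models/:
import tempfile
with tempfile.TemporaryDirectory() as d:
    for i in range(1, 6):
        (Path(d) / f"EvoNB_{i}").mkdir()
    print(find_checkpoints(d))
    # → ['EvoNB_1', 'EvoNB_2', 'EvoNB_3', 'EvoNB_4', 'EvoNB_5']
```

An empty result here usually means the Sub Path or Mount Path in the mount configuration does not match what the script expects.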


Option 2:

Alternatively, because a run may need files that the user uploads (or results produced by another pipeline), Quark also allows bioinformaticians to specify mount path directories, such as My Files, My Results, User (Shared), Project (Shared), and EFS data locations, as upload targets for pipeline datasets.

  • Under User File System, provide a Mount Point Name (e.g., MODEL_DIR), a Sub Path (the specific subdirectory within the dataset, e.g., /app/), and a Mount Path (the absolute path where the data appears inside the container, e.g., /models/).


  • Once you've specified the mount paths for your datasets, click Next.

Step 3: Define Pipeline Parameters

Parameters allow bioinformaticians to customize their pipeline execution and map UI elements to the variables their script consumes as CLI arguments.

Quark offers 5 data types that can be chosen for a particular parameter: String, Integer, Float, Boolean, and File.

Common fields include Default Value, Optional, Hide Field, Help, and Info.

  • Info: When the user hovers over the info icon, the parameter's information is displayed.
  • Help: Information about the parameter is displayed beneath it.
  • Hide Field: Allows you to hide the parameter from the end user.

| Data Type | Variation | Description | Key Features |
| --- | --- | --- | --- |
| String | Input | A text-based input parameter. | Input: a free-form box accepting any string. Validation: can restrict the input (e.g., to one or two characters). |
| String | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select: select multiple options at once. Custom Delimiter: when multi-select is enabled, you can provide a custom delimiter (e.g., a semicolon instead of the default comma); all characters except the space are supported. |
| String | Key Value Pair | Shows one value to the end user (e.g., "test 1") but passes a different mapped value to the pipeline (e.g., "checkpoint 1"). | Multi-Select: select multiple keys at once; their respective values are passed. Custom Delimiter: when multi-select is enabled, you can provide a custom delimiter; all characters except the space are supported. |
| Integer | Input | A whole number input. | Range: define a range (e.g., between 0 and 100). Step: define a step value (e.g., 2, so only even numbers are accepted). |
| Integer | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select and Custom Delimiter, as for String dropdowns. |
| Float | Input | A decimal number input. | Range: define a range (e.g., between 0 and 1). |
| Float | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select and Custom Delimiter, as for String dropdowns. |
| Boolean | — | A simple true or false value. | Default value can be set to true (checked) or false (unchecked). |
| File | — | Allows the user to select a file or directory. | Browse: only allows selecting from exposed datasets. Directory Only: restricts selection to directories. File Types: restricts selection to specific extensions (e.g., .pdf, .fasta; executables are not allowed). |

Advanced Parameter Features

  • File Parameters: Toggle Directory Only if the script requires an entire folder path rather than a single file.
  • Scalar Variations: Use Dropdown to "guardrail" the pipeline against typos in model names or preset configurations.
  • Conditional Parameters: Configure fields to appear only when another parameter matches a value (e.g., show "Custom MSA" field only if "Skip MSA Generation" is toggled on).

EvoNB Example:

Map the CLI arguments from get_mutation.py to UI elements for the user.

| Parameter Type | CLI Argument | Description | Example Value |
| --- | --- | --- | --- |
| String (Dropdown) | -input_type | Choose between raw sequence or file processing. | fasta or csv |
| String | -sequence | The amino acid sequence to optimize. | QVQLVES... |
| File | -input_csv | A CSV file containing multiple VHH sequences. | nanobody_batch.csv |
| Integer | -n | The probability multiplier threshold for valid mutations. | 5 |
| String | -model_checkpoints | Concatenated list of models to use. | EvoNB_1+EvoNB_2+EvoNB_3 |
| String (Dropdown) | -device | Target hardware for inference. | cuda or cpu |
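The table above corresponds to an argparse interface along these lines. This is a sketch inferred from the listed flags; the real get_mutation.py may declare its arguments differently, and `build_parser` is a hypothetical name:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI surface mirroring the Step 3 parameter table (illustrative)."""
    p = argparse.ArgumentParser(description="EvoNB mutation prediction (sketch)")
    p.add_argument("-input_type", choices=["fasta", "csv"], default="fasta")
    p.add_argument("-sequence", help="Amino acid sequence to optimize")
    p.add_argument("-input_csv", help="CSV file of VHH sequences")
    p.add_argument("-n", type=int, default=5,
                   help="Probability multiplier threshold")
    p.add_argument("-model_checkpoints", default="EvoNB_1+EvoNB_2+EvoNB_3",
                   help="'+'-separated list of model checkpoints")
    p.add_argument("-device", choices=["cuda", "cpu"], default="cuda")
    return p

args = build_parser().parse_args(
    ["-input_type", "csv", "-input_csv", "nanobody_batch.csv", "-n", "5"]
)
print(args.model_checkpoints.split("+"))  # → ['EvoNB_1', 'EvoNB_2', 'EvoNB_3']
```

Each UI parameter you define in this step is substituted into the command line in Step 4, so the names here must match the `<<params.*>>` references exactly.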

EvoNB Example Parameters:


Step 4: Steps (Execution Configuration)

Define the computational "unit of work" where the script or model actually runs. For example, for EvoNB, you can use this step to define the GPU environment where the mutation screening occurs.

  1. Name: Provide a name for your step (e.g., mutations-prediction).
  2. Image Details: Provide the Docker image URI (e.g., docker.io/username/evonb:1.1.0).
  3. Capacity Requirements

    • CPU and Memory: Specify the CPUs and RAM. Ensure the memory allocation accounts for the overhead of loading large model weights.
    • GPU Checkbox: Specify whether the step requires a GPU to run. This is mandatory for folding models; for EvoNB, mark the GPU as mandatory.

    Bioinformatician Tip: In your Workstation, specify an instance with at least 16GB VRAM (e.g., NVIDIA T4 or A100).



  4. Commands and Args:

    • Command: Use proper bash format. You can reference parameters using the <<params.parameter_name>> syntax, as with <<params.model_checkpoints>> in the EvoNB example.
    • Execution Script (the input is a .fasta file in the following EvoNB example):

      /bin/bash

      -c

      python get_mutation.py \
        -input_type fasta \
        -input_data example.fasta \
        -output_csv ./results/out_mut.csv \
        -n 5 \
        -model_checkpoints <<params.model_checkpoints>> \
        -device cuda

    • Env: Define environment variables for your workflow.

    • Root Access: Toggle this if the step must write to system-level directories for caching.


Pro Tip: Use <<params.file_name.id>> to access a file's ID without the extension, or <<params.file_name.ext>> for just the extension.
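The `.id` / `.ext` split behaves like a standard stem/suffix split on the file name. A small illustration (this mimics the described behaviour, it is not Quark's implementation; `file_id_and_ext` is a hypothetical helper):

```python
from pathlib import Path

def file_id_and_ext(filename: str) -> tuple:
    """Mimic the <<params.file_name.id>> / <<params.file_name.ext>>
    split described above (illustrative only)."""
    p = Path(filename)
    return p.stem, p.suffix.lstrip(".")

print(file_id_and_ext("nanobody_batch.csv"))  # → ('nanobody_batch', 'csv')
```

This is useful for constructing output file names that reuse the input file's ID with a different extension.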


Step 5: Visualization App

Attach a viewer so users can interpret results directly in Quark. Structural biology results require 3D rendering.

  • Select Add New Visualization App.
  • Choose the App Name (e.g., Mol* or NGLView for 3D protein rendering).
  • Set the Display Name, which becomes a tab in the results dashboard (e.g., Mutant Structure Viewer).

Note: In the EvoNB pipeline, the output is a downloadable .csv file, from which the individual protein sequences (with predicted mutations) can be viewed using Mol*.
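Downstream of the run, the output CSV can be parsed to pull out the highest-scoring mutants before inspecting their structures in Mol*. A minimal sketch; the column names (`sequence`, `mutation`, `score`) are assumptions for illustration, not the actual EvoNB output schema:

```python
import csv
import io

# Hypothetical out_mut.csv content; real column names may differ.
sample = """sequence,mutation,score
QVQLVES,Q1A,0.52
QVQLVES,V2L,0.31
"""

def top_mutations(csv_text: str, limit: int = 1) -> list:
    """Return the mutation labels with the highest scores."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return [r["mutation"] for r in rows[:limit]]

print(top_mutations(sample))  # → ['Q1A']
```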



Step 6: Review and Submit

Review the GPU capacity, Arg mapping, and Mount Paths for accuracy. Once satisfied, click Submit. Your pipeline will now appear under My Pipelines.



Troubleshooting Guide

EvoNB Example:

| Problem | Potential Cause | Resolution |
| --- | --- | --- |
| "CUDA Out of Memory" | Large sequences or high batch sizes exceed the available VRAM. | Switch to a higher GPU tier in Step 4. |
| "Checkpoint not found" | Mount path mismatch. | Ensure the path in <<params.model_checkpoints>> matches the folder structure inside the Mount Path defined in Step 2. |
| "Invalid Input Type" | Parameters out of sync with the script. | Ensure the dropdown values in Step 3 match the script's expected strings (fasta vs csv) exactly. |
| "ModuleNotFoundError" | Docker image mismatch. | Verify the Docker image contains the transformers and esm libraries required for EvoNB. |
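The "Checkpoint not found" row above can be caught before inference starts with a preflight check that compares the '+'-separated checkpoint list against the mounted directory. A hedged sketch (`preflight` is a hypothetical helper, not part of EvoNB or Quark):

```python
from pathlib import Path

def preflight(mount_path: str, model_checkpoints: str) -> list:
    """Return the checkpoints from the '+'-separated list that are
    missing under the mount path (an empty list means all are present)."""
    root = Path(mount_path)
    return [c for c in model_checkpoints.split("+")
            if not (root / c).is_dir()]

# Demonstration with a temporary stand-in for the mount:
import tempfile
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "EvoNB_1").mkdir()
    print(preflight(d, "EvoNB_1+EvoNB_2"))  # → ['EvoNB_2']
```

Running this as the first line of the step's command fails fast with a readable message instead of a mid-run loader error.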

Next Step: Proceed to Version and Publish to make your pipeline available for your team.