
Importing a Custom Pipeline into Quark

This documentation serves as a technical manual for bioinformaticians onboarding high-performance computational workflows, such as the EvoNB mutation prediction model, into Quark.

The Custom flow allows granular configuration of specialized Docker environments, GPU resource allocation, and inline execution logic.

In this documentation, the EvoNB model is used as an example to illustrate various steps.



Overview

Quark offers bioinformaticians an orchestration service where they can connect standalone scripts (including existing pipelines) into a unified, production-ready custom pipeline.

By defining custom steps and container capacities, you can transform tools like EvoNB into scalable workflows that automate the staging of large-scale ESM-2 model checkpoints and the rendering of 3D protein structures (to visualise the structural impact of predicted mutations).

Before You Start

Ensure the following technical components are ready for your workflow:

  • Containerization: A Docker image URI (e.g., from AWS ECR) containing your environment (e.g., Python 3.8+, PyTorch, and the transformers library for EvoNB).
  • Reference Data: Large-scale sequence databases, or fine-tuned model checkpoints in the case of EvoNB (EvoNB_1 through EvoNB_5) and the base ESM-2 (esm2_t33_650M_UR50D) weights, either pre-staged or uploaded to Quark directories as Filesystem Datasets.
  • Input Schema: A list of parameters (strings, integers, files) that your script expects as command-line arguments.
  • Environment Requirements: Knowledge of the required CPU, RAM, specific Docker images, and environment variables (Env). This includes benchmarked knowledge of the GPU VRAM your specific model requires. For example, the 650M-parameter ESM-2 model typically needs 16 GB+ of VRAM for efficient batching.
  • Execution Logic: For the EvoNB example, refer to the get_mutation.py script from the EvoNB repository to handle sequence masking and probability calculation.
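The masking-and-thresholding logic referenced above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual get_mutation.py: `propose_mutations` and `toy_model` are hypothetical names, and the toy scorer merely stands in for an ESM-2 forward pass so the example runs end to end.

```python
# Hedged sketch of the masking / probability logic get_mutation.py performs
# (names here are illustrative, not the real EvoNB API).
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def propose_mutations(
    sequence: str,
    predict_probs: Callable[[str, int], Dict[str, float]],
    n: int = 5,
) -> List[Tuple[int, str, str]]:
    """Mask each position, query the model for per-residue probabilities,
    and keep substitutions whose probability exceeds n times the wild-type
    residue's probability (the '-n' threshold configured in Step 3)."""
    mutations = []
    for pos, wt in enumerate(sequence):
        probs = predict_probs(sequence, pos)  # model scores at the masked site
        wt_p = probs.get(wt, 1e-9)
        for aa in AMINO_ACIDS:
            if aa != wt and probs.get(aa, 0.0) > n * wt_p:
                mutations.append((pos, wt, aa))
    return mutations

# Toy stand-in for an ESM-2 forward pass, so the sketch is runnable.
def toy_model(seq: str, pos: int) -> Dict[str, float]:
    probs = {aa: 0.01 for aa in AMINO_ACIDS}
    probs[seq[pos]] = 0.02
    if pos == 0:
        probs["A"] = 0.5  # pretend the model strongly prefers A at position 0
    return probs

print(propose_mutations("QVQLVES", toy_model, n=5))  # → [(0, 'Q', 'A')]
```

In the real workflow, five such scorers (EvoNB_1 through EvoNB_5) are run independently and combined to minimize random errors.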

Step-by-Step Instructions

Start the Import

  1. Navigate to the My Pipelines tab on your dashboard.
  2. Click the New Pipeline button in the top-right corner. You will be given two options: Import Pipeline and Build Pipeline. Click on Import Pipeline.
  3. In the Select Pipeline Type window, choose Custom and click Continue.


Click on `Import Pipeline`


Select `Custom` and click `Continue`


Step 1: General Pipeline Details

Define your pipeline to provide identifying information for the rest of your team.

| Field | Description | Mandatory? | Bioinformatician Tip |
| --- | --- | --- | --- |
| Name | A user-friendly name. | Yes | Keep it versioned to track model updates. |
| Summary | A brief summary of the function. | Yes | Describe the biological objective. |
| Category | Used to group particular pipelines. | Yes | Select Others or Proteomics. |
| About | In-depth info using Markdown syntax. | No | List model weights, library versions, and tool citations. |
| Tags | Metadata filtering keys/values. | No | Use tags like model: esm-2 or gpu: required. |

For example, for an EvoNB workflow, you can provide the following:

  • Name: EvoNB-Mutation-Prediction. Summary: "A Protein Language Model-Based Workflow for Nanobody Mutation Prediction and Optimisation".
  • About: Provide a Markdown summary of the five independent models used to minimize random errors in mutation prediction.
  • Category: Select Immunology.
  • Tags: e.g., task: mutation-prediction, model: esm-2, gpu: required.



Step 2: Datasets (Mounts)

This step allows bioinformaticians to connect stable, pre-existing datasets (filesystems) required for the pipeline run.

Tools like EvoNB require access to model checkpoints; Quark uses Mounts for low-latency loading of weights.

Option 1:

  • Select Add New Dataset Mount to search for your pre-staged dataset (e.g., evonb or evonb-checkpoints from the Hugging Face repository for the EvoNB workflow).


  • Mount Configuration: Provide a recognizable Name for your mount. Specify the Sub Path (the path within the dataset that you want mounted) and the Mount Path (the location inside the container where the dataset will be mounted). For example, for the EvoNB workflow:

    • Name: MODEL_DIR.
    • Sub Path and Mount Path: /app/models/ (the location where your script expects the EvoNB_1 through EvoNB_5 directories).

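As a sanity check, a step can verify that the expected checkpoint directories are visible under the configured Mount Path before loading any weights. A minimal sketch, assuming the EvoNB_1..EvoNB_5 directory convention from the prerequisites (`find_checkpoints` is a hypothetical helper, not a Quark API):

```python
from pathlib import Path

def find_checkpoints(mount_path: str, prefix: str = "EvoNB_") -> list:
    """List checkpoint directories found under the configured Mount Path."""
    root = Path(mount_path)
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and p.name.startswith(prefix))

# Demonstration with a temporary stand-in for /app/models/:
import tempfile
with tempfile.TemporaryDirectory() as d:
    for i in range(1, 6):
        (Path(d) / f"EvoNB_{i}").mkdir()
    print(find_checkpoints(d))
    # → ['EvoNB_1', 'EvoNB_2', 'EvoNB_3', 'EvoNB_4', 'EvoNB_5']
```

An empty result here usually means the Sub Path or Mount Path in the mount configuration does not match what the script expects.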


Option 2:

Alternatively, because a run may need files that the user uploads (or results produced by another pipeline), Quark also allows bioinformaticians to specify mount path directories, such as My Files, My Results, User (Shared), Project (Shared), and EFS data locations, as upload targets for pipeline datasets.

  • Under User File System, provide a Mount Point Name (e.g., MODEL_DIR), a Sub Path (the specific subdirectory within the dataset, e.g., /app/), and a Mount Path (the absolute path where the data appears inside the container, e.g., /models/).


  • Once you've specified the mount paths for your datasets, click Next.

Step 3: Define Pipeline Parameters

Parameters allow bioinformaticians to customize their pipeline execution and map UI elements to the variables their script consumes as CLI arguments.

Quark offers 5 data types that can be chosen for a particular parameter: String, Integer, Float, Boolean, and File.

Common fields include Default Value, Optional, Hide Field, Help, and Info.

  • Info: When the user hovers over the info icon, the parameter's information is displayed.
  • Help: Information about the parameter is displayed beneath it.
  • Hide Field: Allows you to hide the parameter from the end user.

| Data Type | Variation | Description | Key Features |
| --- | --- | --- | --- |
| String | Input | A text-based input parameter. | Input: a free-form box accepting any string. Validation: can restrict the input (e.g., to one or two characters). |
| String | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select: select multiple options at once. Custom Delimiter: when multi-select is enabled, you can provide a custom delimiter (e.g., a semicolon instead of the default comma); all characters except the space are supported. |
| String | Key Value Pair | Shows one value to the end user (e.g., "test 1") but passes a different mapped value to the pipeline (e.g., "checkpoint 1"). | Multi-Select: select multiple keys at once; their respective values are passed. Custom Delimiter: when multi-select is enabled, you can provide a custom delimiter; all characters except the space are supported. |
| Integer | Input | A whole number input. | Range: define a range (e.g., between 0 and 100). Step: define a step value (e.g., 2, so only even numbers are accepted). |
| Integer | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select and Custom Delimiter, as for String dropdowns. |
| Float | Input | A decimal number input. | Range: define a range (e.g., between 0 and 1). |
| Float | Dropdown | Allows the user to select from multiple pre-defined options. | Multi-Select and Custom Delimiter, as for String dropdowns. |
| Boolean | — | A simple true or false value. | Default value can be set to true (checked) or false (unchecked). |
| File | — | Allows the user to select a file or directory. | Browse: only allows selecting from exposed datasets. Directory Only: restricts selection to directories. File Types: restricts selection to specific extensions (e.g., .pdf, .fasta; executables are not allowed). |

Advanced Parameter Features

  • File Parameters: Toggle Directory Only if the script requires an entire folder path rather than a single file.
  • Scalar Variations: Use Dropdown to "guardrail" the pipeline against typos in model names or preset configurations.
  • Conditional Parameters: Configure fields to appear only when another parameter matches a value (e.g., show "Custom MSA" field only if "Skip MSA Generation" is toggled on).

EvoNB Example:

Map the CLI arguments from get_mutation.py to UI elements for the user.

| Parameter Type | CLI Argument | Description | Example Value |
| --- | --- | --- | --- |
| String (Dropdown) | -input_type | Choose between raw sequence or file processing. | fasta or csv |
| String | -sequence | The amino acid sequence to optimize. | QVQLVES... |
| File | -input_csv | A CSV file containing multiple VHH sequences. | nanobody_batch.csv |
| Integer | -n | The probability multiplier threshold for valid mutations. | 5 |
| String | -model_checkpoints | Concatenated list of models to use. | EvoNB_1+EvoNB_2+EvoNB_3 |
| String (Dropdown) | -device | Target hardware for inference. | cuda or cpu |
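The table above corresponds to an argparse interface along these lines. This is a sketch inferred from the listed flags; the real get_mutation.py may declare its arguments differently, and `build_parser` is a hypothetical name:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI surface mirroring the Step 3 parameter table (illustrative)."""
    p = argparse.ArgumentParser(description="EvoNB mutation prediction (sketch)")
    p.add_argument("-input_type", choices=["fasta", "csv"], default="fasta")
    p.add_argument("-sequence", help="Amino acid sequence to optimize")
    p.add_argument("-input_csv", help="CSV file of VHH sequences")
    p.add_argument("-n", type=int, default=5,
                   help="Probability multiplier threshold")
    p.add_argument("-model_checkpoints", default="EvoNB_1+EvoNB_2+EvoNB_3",
                   help="'+'-separated list of model checkpoints")
    p.add_argument("-device", choices=["cuda", "cpu"], default="cuda")
    return p

args = build_parser().parse_args(
    ["-input_type", "csv", "-input_csv", "nanobody_batch.csv", "-n", "5"]
)
print(args.model_checkpoints.split("+"))  # → ['EvoNB_1', 'EvoNB_2', 'EvoNB_3']
```

Each UI parameter you define in this step is substituted into the command line in Step 4, so the names here must match the `<<params.*>>` references exactly.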

EvoNB Example Parameters:


Step 4: Steps (Execution Configuration)

Define the computational "unit of work" where the script or model actually runs. For example, for EvoNB, you can use this step to define the GPU environment where the mutation screening occurs.

  1. Name: Provide a name for your step (e.g., mutations-prediction).
  2. Image Details: Provide the Docker image URI (e.g., docker.io/username/evonb:1.1.0).
  3. Capacity Requirements

    • CPU and Memory: Specify the CPUs and RAM. Ensure the memory allocation accounts for the overhead of loading large model weights.
    • GPU Checkbox: Specify whether the step requires a GPU to run. This is mandatory for folding models; for EvoNB, mark the GPU as mandatory.

    Bioinformatician Tip: In your Workstation, specify an instance with at least 16GB VRAM (e.g., NVIDIA T4 or A100).



  4. Commands and Args:

    • Command: Use proper bash format. You can reference parameters using the <<params.parameter_name>> syntax, as with <<params.model_checkpoints>> in the EvoNB example.
    • Execution Script (the input is a .fasta file in the following EvoNB example):

      /bin/bash

      -c

      python get_mutation.py \
        -input_type fasta \
        -input_data example.fasta \
        -output_csv ./results/out_mut.csv \
        -n 5 \
        -model_checkpoints <<params.model_checkpoints>> \
        -device cuda

    • Env: Define environment variables for your workflow.

    • Root Access: Toggle this if the step must write to system-level directories for caching.


Pro Tip: Use <<params.file_name.id>> to access a file's ID without the extension, or <<params.file_name.ext>> for just the extension.
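The `.id` / `.ext` split behaves like a standard stem/suffix split on the file name. A small illustration (this mimics the described behaviour, it is not Quark's implementation; `file_id_and_ext` is a hypothetical helper):

```python
from pathlib import Path

def file_id_and_ext(filename: str) -> tuple:
    """Mimic the <<params.file_name.id>> / <<params.file_name.ext>>
    split described above (illustrative only)."""
    p = Path(filename)
    return p.stem, p.suffix.lstrip(".")

print(file_id_and_ext("nanobody_batch.csv"))  # → ('nanobody_batch', 'csv')
```

This is useful for constructing output file names that reuse the input file's ID with a different extension.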


Step 5: Visualization App

Attach a viewer so users can interpret results directly in Quark. Structural biology results require 3D rendering.

  • Select Add New Visualization App.
  • Choose the App Name (e.g., Mol* or NGLView for 3D protein rendering).
  • Set the Display Name, which becomes a tab in the results dashboard (e.g., Mutant Structure Viewer).

Note: In the EvoNB pipeline, the output is a downloadable .csv file, from which the individual protein sequences (with predicted mutations) can be viewed using Mol*.
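Downstream of the run, the output CSV can be parsed to pull out the highest-scoring mutants before inspecting their structures in Mol*. A minimal sketch; the column names (`sequence`, `mutation`, `score`) are assumptions for illustration, not the actual EvoNB output schema:

```python
import csv
import io

# Hypothetical out_mut.csv content; real column names may differ.
sample = """sequence,mutation,score
QVQLVES,Q1A,0.52
QVQLVES,V2L,0.31
"""

def top_mutations(csv_text: str, limit: int = 1) -> list:
    """Return the mutation labels with the highest scores."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return [r["mutation"] for r in rows[:limit]]

print(top_mutations(sample))  # → ['Q1A']
```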



Step 6: Review and Submit

Review the GPU capacity, Arg mapping, and Mount Paths for accuracy. Once satisfied, click Submit. Your pipeline will now appear under My Pipelines.



Troubleshooting Guide

EvoNB Example:

| Problem | Potential Cause | Resolution |
| --- | --- | --- |
| "CUDA Out of Memory" | Large sequences or high batch sizes exceed the available VRAM. | Switch to a higher GPU tier in Step 4. |
| "Checkpoint not found" | Mount path mismatch. | Ensure the path in <<params.model_checkpoints>> matches the folder structure inside the Mount Path defined in Step 2. |
| "Invalid Input Type" | Parameters out of sync with the script. | Ensure the dropdown values in Step 3 match the script's expected strings (fasta vs csv) exactly. |
| "ModuleNotFoundError" | Docker image mismatch. | Verify the Docker image contains the transformers and esm libraries required for EvoNB. |
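The "Checkpoint not found" row above can be caught before inference starts with a preflight check that compares the '+'-separated checkpoint list against the mounted directory. A hedged sketch (`preflight` is a hypothetical helper, not part of EvoNB or Quark):

```python
from pathlib import Path

def preflight(mount_path: str, model_checkpoints: str) -> list:
    """Return the checkpoints from the '+'-separated list that are
    missing under the mount path (an empty list means all are present)."""
    root = Path(mount_path)
    return [c for c in model_checkpoints.split("+")
            if not (root / c).is_dir()]

# Demonstration with a temporary stand-in for the mount:
import tempfile
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "EvoNB_1").mkdir()
    print(preflight(d, "EvoNB_1+EvoNB_2"))  # → ['EvoNB_2']
```

Running this as the first line of the step's command fails fast with a readable message instead of a mid-run loader error.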

Next Step: Proceed to Version and Publish to make your pipeline available for your team.