Skip to content

Reference Data

Overview

The Reference Data section is where the Infrastructure Administrator uploads and manages shared static files that pipelines across the platform can draw on — genome assemblies, model weights, annotation files, and any other dataset that needs to be centrally maintained and consistently versioned. Uploading reference data here makes it available to authorised projects without requiring individual users to manage their own copies.

Once uploaded, reference datasets also become visible in the Datasets → Catalog, where researchers can browse their contents and summary statistics.

Navigation: Select Reference Data from the left-hand navigation pane.

Screenshot: Reference Data landing page


The Reference Data Screen

The Reference Data screen displays all datasets that have been uploaded to the platform as cards. Each card shows the dataset name, a brief description, and when it was last updated.

Click any card to open a flash card summary showing the dataset's upload details — source, cloud account, bucket, and the project it has been made available under.

Screenshot: Reference Data flash card


Adding New Reference Data

To upload a new reference dataset to the platform:

  1. Click + Add New in the top-right corner of the Reference Data screen.
  2. A source selection screen appears. Choose the cloud storage source from which you want to upload the reference data — for example, Amazon S3.

Screenshot: Source selection screen

You will then be guided through two configuration steps before reviewing and creating the upload.


Step 1: General

Provide the identifying information for this reference dataset:

Field Description
Name A clear, descriptive name for the reference dataset as it will appear in the catalog (e.g., EvoNb-Model-Weights-v2).
Description A brief summary of what the dataset contains and what it is used for (e.g., EvoNb model weights for antibody sequence analysis).

Screenshot: General tab

Tip: A precise name and description makes reference data easy to identify when researchers browse the catalog. Include the version or date in the name if this dataset will be updated over time.


Step 2: Specs

Configure the source location and project scope for this dataset:

Field Description
Project The project under which this reference dataset will be made available. Only users in the selected project will be able to access it.
Cloud Account The connected cloud account that holds the source data. Defaults to default — your organisation's primary cloud account.
Bucket The specific storage bucket within the selected cloud account where the reference data resides. This field is populated based on the cloud source selected in the previous step (e.g., an S3 bucket name for Amazon S3 uploads).
Path (optional) A specific directory path within the bucket, if the reference data is located in a subdirectory rather than at the bucket root.

Screenshot: Specs tab


Step 3: Review and Create

Review all configuration details before finalising the upload.

  • Confirm the dataset name and description are correct.
  • Verify the project, cloud account, and bucket are accurate.
  • Click Create to upload the reference dataset to the platform.

Screenshot: Review screen

Once created, the dataset appears as a new card on the Reference Data screen and becomes accessible to users in the selected project.


Viewing Reference Data in the Catalog

After uploading, the reference dataset is also discoverable in the Datasets → Catalog section. To navigate there:

  1. Select Datasets from the left-hand navigation pane.
  2. Click the Catalog tab.
  3. Use the project selector dropdown in the top-right corner of the screen to confirm which project's datasets you are viewing. Select the project you made the reference data available under.

Screenshot: Datasets Catalog tab

Important: The person-level and specimen-level tables are visible to Infrastructure Administrators only. General users see the aggregate summary statistics on the dashboard, but the underlying record-level tables are hidden until they have received explicit approval. This ensures that sensitive patient and sample data is not exposed before appropriate access controls are in place.


What's Next

  • Datasets — Browse all datasets published to the platform catalog to specific projects.
  • Computes — Ensure the compute configurations that will run pipelines against this reference data are correctly provisioned.
  • Workstation Templates — If researchers need direct workstation access to reference files, confirm the relevant templates have the appropriate storage paths configured.