Xenium File Format Documentation

This document is an overview of the output files produced by a Xenium in-situ analysis. It describes the pre-release file formats, which may change slightly in the final version of the Xenium system. Please contact [email protected] for clarifications or additional information.

Xenium Experiment File (experiment.xenium)

Open this file in the Xenium Explorer browser to visualize experimental results. This is a small JSON manifest file that includes references to the other data files in the output folder.

Transcript Count Data

cell_feature_matrix

The cell-feature matrix files contain the number of times each gene was observed inside the segmented boundary of each cell. The matrix is stored in sparse mtx (Matrix Market Exchange) format, as a directory containing the count matrix in mtx format plus tsv.gz files with the feature labels (features.tsv.gz) and the cell identifiers (barcodes.tsv.gz). This is the primary output format for secondary analysis, including cell typing, differential expression, and comparison of spatial expression patterns. By default, the matrix includes only transcripts that pass the QV threshold of Q20.

This is the same format used by Cell Ranger, but it also contains negative control features that should be ignored for biological analysis. The “feature type” column of features.tsv.gz identifies them. There are two classes of negative controls. “Negative control probe” features are probes that exist in the panel but target ERCC or other non-biological sequences, so they can be used to assess the specificity of the assay. “Negative control codeword” features are valid codewords that no probe in the panel should produce, so they can be used to assess the specificity of the decoding algorithm. The QVs produced for each transcript call are calibrated using the negative control codeword features.
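As a minimal sketch of how the control features might be separated from the gene features, the following Python parses the feature table and splits the matrix row indices by feature type. It assumes that features.tsv.gz has three tab-separated columns (feature id, feature name, feature type) with no header row, as in Cell Ranger output, and that the feature type of every control starts with "Negative control". Both points are assumptions to check against your own files, and the function names are hypothetical.

```python
import csv
import gzip


def parse_features(lines):
    """Parse tab-separated lines into (feature_id, name, feature_type) tuples.

    Assumes three columns and no header row (an assumption of this sketch).
    """
    return [tuple(row[:3]) for row in csv.reader(lines, delimiter="\t") if row]


def read_features(path):
    """Read a gzipped features file, e.g. cell_feature_matrix/features.tsv.gz."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_features(handle)


def split_feature_indices(features, control_prefix="negative control"):
    """Return (gene_rows, control_rows) as lists of 0-based matrix row indices.

    A feature counts as a control when its feature type starts with
    control_prefix, ignoring case. The prefix is an assumption, not a
    documented constant.
    """
    gene_rows, control_rows = [], []
    for index, (_, _, feature_type) in enumerate(features):
        if feature_type.strip().lower().startswith(control_prefix):
            control_rows.append(index)
        else:
            gene_rows.append(index)
    return gene_rows, control_rows
```

The gene_rows list can then be used to subset the rows of the count matrix before biological analysis, while control_rows can be kept aside for specificity checks.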

cell_feature_matrix.h5

Cell-feature matrix in HDF5 (H5) format, a more efficient version of the mtx matrix files. It includes only transcripts that pass the default QV threshold of Q20, and it matches the HDF5 format used by Cell Ranger, which is described in the Cell Ranger documentation.

transcripts.csv.gz

Transcripts in gzipped CSV format. Contains one row for each decoded transcript, containing the following columns:

transcript_id: unique id of transcript
cell_id: unique id of cell
overlaps_nucleus: whether this transcript falls within the segmented nucleus of the cell
feature_name: gene or control name
x_location: X location in microns
y_location: Y location in microns
z_location: Z location in microns
qv: Phred-scaled quality value estimating the probability of incorrect call.
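Because the qv column is Phred-scaled, a value Q corresponds to an estimated error probability of 10^(-Q/10), so the default threshold of Q20 corresponds to an estimated 1% chance of an incorrect call. The sketch below shows that conversion and a simple streaming filter over the CSV rows. It assumes that the first line of transcripts.csv.gz is a header naming the columns listed above, and the function names are hypothetical.

```python
import csv
import gzip


def phred_to_error_probability(qv):
    """Convert a Phred-scaled quality value to an estimated error probability."""
    return 10.0 ** (-float(qv) / 10.0)


def filter_by_qv(rows, min_qv=20.0):
    """Yield transcript rows (dicts keyed by column name) with qv >= min_qv."""
    for row in rows:
        if float(row["qv"]) >= min_qv:
            yield row


def read_transcripts(path, min_qv=20.0):
    """Stream rows from a gzipped transcripts CSV that pass the QV threshold.

    Assumes the first line of the file is a header naming the columns.
    """
    with gzip.open(path, "rt", newline="") as handle:
        yield from filter_by_qv(csv.DictReader(handle), min_qv)
```

Streaming the rows avoids holding the whole file in memory, which may matter for large runs. Raising min_qv trades sensitivity for specificity.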

transcripts.parquet

Transcripts in Parquet format. Contains the same data as transcripts.csv.gz, but can be faster to load and filter with the appropriate reader method.

transcripts.zarr.zip

This is the representation of the transcript data used by Xenium Explorer. It is not recommended for downstream consumers other than Xenium Explorer.

cells.csv.gz

Cell summary file in gzipped CSV format. Contains one row for each cell, with the following columns:

cell_id: unique id of cell
x_centroid: cell centroid X location in microns
y_centroid: cell centroid Y location in microns
transcript_counts: molecule count of gene features
control_probe_counts: molecule count of negative control probes
control_codeword_counts: count of negative control codewords
total_counts: total counts, sum of previous 3 columns.
cell_area: 2D area covered by the cell in square micrometers
nucleus_area: 2D area covered by the nucleus in square micrometers
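The relationship between the count columns can be checked per cell. The following sketch, with hypothetical function names, verifies that total_counts equals the sum of the three preceding count columns and computes the fraction of a cell's counts that come from negative controls. It assumes each row is a dict keyed by the column names above, for example as produced by csv.DictReader.

```python
def _count(row, key):
    """Read a count column as an int, tolerating values such as '12' or '12.0'."""
    return int(float(row[key]))


def check_total_counts(row):
    """Return True if total_counts equals the sum of the three count columns."""
    parts = (
        _count(row, "transcript_counts")
        + _count(row, "control_probe_counts")
        + _count(row, "control_codeword_counts")
    )
    return parts == _count(row, "total_counts")


def control_fraction(row):
    """Fraction of a cell's counts from negative controls (0.0 for empty cells)."""
    total = _count(row, "total_counts")
    if total == 0:
        return 0.0
    controls = _count(row, "control_probe_counts") + _count(row, "control_codeword_counts")
    return controls / total
```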

cells.parquet

Cell summary file in Parquet format. Contains the same information as cells.csv.gz, but can be faster to load and filter with the appropriate reader method.

panel.tsv

The panel gene list, annotated with the cell type that expresses each gene.

analysis/

Secondary analysis outputs, intended for quick analysis and visualization in the analysis summary and in external tools. This directory contains one subdirectory of CSV files for each analysis: clustering (graph-based and k-means), differential expression, PCA (components, dispersion, features_selected, projection, and variance CSVs), and UMAP (projection.csv). These files match the corresponding Cell Ranger outputs, which are described in the Cell Ranger documentation.

analysis.zarr.zip

Secondary analysis outputs in zipped Zarr format. For use in visualization by Xenium Explorer. Includes cell clustering results.

Morphology images

The cell morphology images are DAPI images in OME-TIFF format. These files include a pyramid of resolutions and tiled chunks of image data, so they can be used to back efficient interactive image viewing.

The pixel size of Xenium images is 0.2125 microns. Coordinates in microns from cells.csv.gz and transcripts.csv.gz can be converted to pixel coordinates by dividing by the pixel size. The origin of the coordinate system is the upper left of the TIFF image.
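The conversion described above can be sketched as follows. The pixel size is taken from the text. The mapping of x to image columns and y to image rows, and the choice to floor to the containing pixel, are assumptions of this sketch rather than documented behavior, and the function names are hypothetical.

```python
import math

PIXEL_SIZE_UM = 0.2125  # microns per pixel, as stated in the text


def microns_to_pixels(x_um, y_um, pixel_size=PIXEL_SIZE_UM):
    """Convert micron coordinates to fractional (x, y) pixel coordinates."""
    return x_um / pixel_size, y_um / pixel_size


def pixel_index(x_um, y_um, pixel_size=PIXEL_SIZE_UM):
    """Return integer (row, column) indices, e.g. for image[row, column].

    Assumes x runs along columns and y along rows from the upper-left
    origin, and floors to the containing pixel.
    """
    x_px, y_px = microns_to_pixels(x_um, y_um, pixel_size)
    return math.floor(y_px), math.floor(x_px)
```

Note the swap in pixel_index: array indexing is usually (row, column), which corresponds to (y, x).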

The ome.tif files are 16-bit grayscale and are compressed with JPEG-2000. They can be loaded with the tifffile Python package or with the Bio-Formats Java library and command-line tools.

Example python code to read a layer of the image using the tifffile package:

import tifffile

# Open the OME-TIFF and read one page (layer) into a numpy array.
with tifffile.TiffFile("morphology_mip.ome.tif") as tif:
    image = tif.pages[1].asarray()

morphology_mip.ome.tif

Cell morphology image containing the maximum intensity projection (MIP) of the DAPI Z-stack.

morphology_focus.ome.tif

Cell morphology image containing the single best-focus Z-plane from the DAPI Z-stack.

he_image.ome.tiff

An H&E image of the tissue section, taken after the Xenium protocol, if available. Note that this image is acquired on a different microscope and is not registered to the morphology image, so some post-processing is required to register the H&E image with the DAPI morphology image.

Cell and Nucleus Segmentation

Nucleus boundaries are determined by a nucleus segmentation algorithm that runs on the DAPI morphology image. Cell boundaries are determined by expanding the nucleus boundaries by up to 15 µm, or until the expanded boundary hits another cell.

nucleus_boundaries.csv.gz and cell_boundaries.csv.gz

The CSV representation of the nucleus and cell boundaries. Each row represents one vertex in the boundary polygon of one cell. The boundary points for each cell appear in clockwise order, and the first and last points are duplicates to indicate a closed polygon. The columns of these files are:

cell_id: the id of the cell
vertex_x: the x-coordinate of the boundary point in µm
vertex_y: the y-coordinate of the boundary point in µm
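To illustrate how the per-vertex rows can be reassembled into polygons, the sketch below groups rows by cell_id, checks that each polygon is closed, and computes its area with the shoelace formula. It assumes rows are dicts keyed by the column names above (for example from csv.DictReader) and that the rows of each cell appear in boundary order. The function names are hypothetical.

```python
from collections import defaultdict


def group_boundaries(rows):
    """Group boundary rows into {cell_id: [(x, y), ...]}, preserving row order."""
    polygons = defaultdict(list)
    for row in rows:
        vertex = (float(row["vertex_x"]), float(row["vertex_y"]))
        polygons[row["cell_id"]].append(vertex)
    return dict(polygons)


def is_closed(vertices):
    """True if the polygon repeats its first vertex as its last vertex."""
    return len(vertices) >= 2 and vertices[0] == vertices[-1]


def polygon_area(vertices):
    """Unsigned polygon area via the shoelace formula, in squared input units."""
    # Pair each vertex with the next one, wrapping around at the end, so the
    # result is the same whether or not the closing duplicate is present.
    twice_area = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0
```

Since the vertex coordinates are in µm, the area is in square micrometers and can be compared with the cell_area and nucleus_area columns of cells.csv.gz, although small differences from the reported values are possible.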

nucleus_boundaries.parquet and cell_boundaries.parquet

Nucleus and cell boundaries in Parquet format. These files contain the same data as nucleus_boundaries.csv.gz and cell_boundaries.csv.gz, but can be faster to load and filter with the appropriate reader method.

cell_segmentation.zarr.zip

Cell boundaries (Zarr). A zipped Zarr file containing the results of the cell segmentation algorithm. Contains polygons and instance masks for 2 related segmentations. Polygon set 0 and mask 0 represent the segmented nuclei regions. Polygon set 1 and mask 1 represent the nucleus expanded with the expand_labels method derived from CellProfiler.

You can open a Zarr file using the zarr python package with this snippet:

import zarr

# Example path to the zipped Zarr file; open it read-only.
f = "hBreast_ffpe_rep1/cell_segmentation_dataset.zarr.zip"
zf = zarr.open(f, mode="r")

The Zarr file contains the following array datasets, which can be loaded into numpy arrays.

cell_summary

An array containing information about each cell & nucleus. Currently it contains the following columns, which correspond to the cells.csv.gz file documented above.

cell_centroid_x
cell_centroid_y
cell_area
nucleus_centroid_x
nucleus_centroid_y
nucleus_area

# Load the cell summary dataset - this is [num_cells, 6] array
# with metadata about each cell
cell_summary = zf["cell_summary"][:]

# The meaning of each column is available in the 'column_names' attribute:
col_names = zf["cell_summary"].attrs["column_names"]

polygon_vertices

Polygon vertices. There are two sets of polygons, indexed along the S axis of the array: 0 corresponds to the segmented nucleus boundaries and 1 corresponds to the segmented cell boundaries.

Shape (S, C, 2*V), where S is the number of polygon sets, C is the number of cells, and V is the maximum number of vertices supported per polygon. The last dimension holds the vertex coordinates laid out as (X1, Y1, …, XV, YV). Within each polygon, the first vertex is repeated at the end to close the polygon, and that last vertex is then repeated to fill out the remainder of the array.

polygon_num_vertices

The number of polygon vertices. Shape (S, C) where S is the number of polygon sets, C is the number of cells. Each element is the number of vertices for that polygon, INCLUDING the repetition of the initial vertex.

# Load the nucleus/cell polygon datasets
polygon_num_vertices = zf["polygon_num_vertices"][:]
polygon_vertices = zf["polygon_vertices"][:]

# Nucleus boundary polygons for first 5 cells
polygon_vertices[0,:5,:]
# Number of vertices in nucleus boundary for first 5 cells.
polygon_num_vertices[0, :5]
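Because each row of polygon_vertices is padded to a fixed length, the matching entry of polygon_num_vertices is needed to recover the actual polygon. The following sketch, using a hypothetical helper name, unpacks one flattened row into a list of (x, y) pairs and drops the padding. It works on plain sequences as well as numpy array rows.

```python
def unpack_polygon(flat_vertices, num_vertices):
    """Return [(x, y), ...] for one polygon, dropping the padding entries.

    flat_vertices is one row of polygon_vertices, laid out as
    (X1, Y1, X2, Y2, ...). num_vertices is the matching entry of
    polygon_num_vertices and includes the repeated first vertex.
    """
    n = int(num_vertices)
    return [
        (float(flat_vertices[2 * i]), float(flat_vertices[2 * i + 1]))
        for i in range(n)
    ]


# Example with the arrays loaded above (nucleus polygon of the first cell):
# nucleus_0 = unpack_polygon(polygon_vertices[0, 0], polygon_num_vertices[0, 0])
```

The returned list ends with a copy of its first vertex, which matches the closed-polygon convention of the boundary CSV files.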

masks

The instance masks corresponding to each polygon set.

homogeneous_transform

The 3x3 transform matrix. It is used to move the data into the physical coordinate space (in microns). Currently this is a trivial scaling by 1.0/pixel_size.
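As a sketch of how a 3x3 homogeneous transform is applied to a 2D point, the code below multiplies the matrix by the column vector (x, y, 1) and divides by the resulting homogeneous coordinate. The column-vector convention is an assumption of this sketch, and the helper names are hypothetical. The scaling_matrix helper exists only to illustrate a pure scaling of the kind described above and makes no claim about the direction of the stored transform.

```python
def apply_homogeneous_transform(matrix, x, y):
    """Apply a 3x3 homogeneous transform (nested sequence or array) to (x, y).

    Assumes the column-vector convention: result = matrix @ [x, y, 1].
    """
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    return xh / w, yh / w


def scaling_matrix(scale):
    """Build a pure-scaling 3x3 homogeneous matrix, for illustration only."""
    return [[scale, 0.0, 0.0], [0.0, scale, 0.0], [0.0, 0.0, 1.0]]


# Example with the Zarr file opened above:
# matrix = zf["homogeneous_transform"][:]
# x_out, y_out = apply_homogeneous_transform(matrix, 10.0, 20.0)
```

Using the full homogeneous form rather than a bare scale factor means the same code keeps working if the stored matrix later includes a translation or rotation.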