This is an overview of the output files for a Xenium in-situ analysis. This documents the pre-release file formats. The formats may change slightly in the final version of the Xenium system. Please contact [email protected] for clarifications or additional information.
Open this file in the Xenium Explorer browser to visualize experimental results. This is a small JSON manifest file that includes references to the other data files in the output folder.
The cell-feature matrix files contain the counts of the number of times each gene was observed inside the segmented boundary of each cell. This file is in sparse mtx (Matrix Market Exchange) format. The matrix is represented by a directory containing the count matrix in mtx format and tsv.gz files containing the feature labels (features.tsv.gz) and cell identifiers (barcodes.tsv.gz). This is the primary output file format to be used for secondary analysis, including cell typing, differential expression, and comparing spatial expression patterns. By default, the matrix only includes transcripts that pass the default QV threshold of Q20.
This is the same format as used for Cell Ranger, but contains negative control features that should be ignored for biological analysis. The “feature type” column of the features.tsv.gz file identifies them. There are two classes of negative controls: “Negative control probe” features are probes that exist in the panel, but target ERCC or other non-biological sequences, which can be used to assess the specificity of the assay. “Negative control codeword” features are valid codewords that do not have any probes that should yield that code, so they can be used to assess the specificity of the decoding algorithm. QVs produced for each transcript call are calibrated using the Negative control codeword features.
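As an illustration of ignoring the negative controls, the sketch below sums counts per cell while skipping control features, using only the Python standard library. It rests on several assumptions that are not stated above: the count matrix is stored as matrix.mtx.gz (the Cell Ranger convention), matrix rows are features and columns are cells, features.tsv.gz has three tab-separated columns (id, name, feature type) with no header row, and both control feature types begin with the text "Negative control". The function names are hypothetical.

```python
import csv
import gzip
import os


def read_feature_types(features_path):
    """Return the feature type of every row in features.tsv.gz, in file order."""
    with gzip.open(features_path, "rt", newline="") as handle:
        # Assumed layout: feature id, feature name, feature type (no header row).
        return [row[2] for row in csv.reader(handle, delimiter="\t")]


def per_cell_totals(matrix_dir, control_prefix="Negative control"):
    """Sum counts per cell, skipping features whose type marks a negative control."""
    types = read_feature_types(os.path.join(matrix_dir, "features.tsv.gz"))
    keep = [not t.startswith(control_prefix) for t in types]

    # "matrix.mtx.gz" is an assumed file name (Cell Ranger convention).
    with gzip.open(os.path.join(matrix_dir, "matrix.mtx.gz"), "rt") as handle:
        # Skip the Matrix Market banner and any comment lines (they start with '%').
        line = handle.readline()
        while line.startswith("%"):
            line = handle.readline()
        n_features, n_cells, _n_entries = (int(v) for v in line.split())
        if n_features != len(types):
            raise ValueError("features.tsv.gz does not match the matrix row count")

        totals = [0] * n_cells
        for entry in handle:
            if not entry.strip():
                continue
            row, col, value = entry.split()
            # Matrix Market indices are 1-based; rows are assumed to be features.
            if keep[int(row) - 1]:
                totals[int(col) - 1] += int(value)
    return totals
```

Calling per_cell_totals on the matrix directory returns one total per line of barcodes.tsv.gz, in the same order, provided the assumptions above hold for your output folder.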
Feature-cell matrix in HDF5 (H5) format. More efficient version of the mtx matrix file. Only includes transcripts that pass the default QV threshold of Q20. Matches the format used by Cell Ranger documented here
Transcripts in gzipped CSV format. Contains one row for each decoded transcript, containing the following columns:
Transcripts in parquet format. Contains same data as the transcripts.csv.gz, but can be faster to load and filter with the appropriate reader method.
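Because the transcript files contain every decoded transcript, a common first step is to apply the same Q20 threshold used for the cell-feature matrix. The sketch below streams transcripts.csv.gz with the standard library and keeps rows at or above a threshold. The column name "qv" is an assumption, as is the helper name; pass a different qv_column if your file labels it otherwise.

```python
import csv
import gzip


def iter_high_quality_transcripts(path, min_qv=20.0, qv_column="qv"):
    """Yield rows of transcripts.csv.gz whose quality value is at least min_qv."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Every value arrives as a string, so convert before comparing.
            if float(row[qv_column]) >= min_qv:
                yield row
```

Streaming row by row keeps memory use flat for large runs; for interactive filtering the parquet file with a columnar reader is likely the better fit.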
This is the representation of the transcript data used by Xenium Explorer. It is not recommended for downstream consumers other than Xenium Explorer.
Cell summary file in gzipped CSV format. Contains one row for each cell, with the following columns:
Cell summary file in Parquet format. Contains the same information as cells.csv.gz, but can be faster to load and filter with the appropriate reader method.
The panel gene list, with annotation of the cell type that expresses this gene.
Secondary analysis outputs, for quick analysis and visualization in the analysis summary and external tools. This directory contains subdirectories of CSV files for clustering (graph-based and k-means), differential expression, PCA (components, dispersion, features_selected, projection, variance CSVs), and UMAP (projection.csv). These files match the outputs of Cell Ranger documented here
Secondary analysis outputs in zipped Zarr format. For use in visualization by Xenium Explorer. Includes cell clustering results.
The cell morphology images are DAPI images in OME-TIFF format. These files include a pyramid of resolutions and tiled chunks of image data, so they can be used to back efficient interactive image viewing.
The pixel size of Xenium images is 0.2125 microns. Coordinates in microns from cells.csv.gz and transcripts.csv.gz can be converted to pixel coordinates by dividing by the pixel size. The origin of the coordinate system is the upper left of the TIFF image.
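The conversion described above can be sketched as follows. The helper names are hypothetical, and the mapping of x to image columns and y to image rows is an assumption based on the upper-left origin.

```python
import math

PIXEL_SIZE_UM = 0.2125  # microns per pixel, as stated above


def microns_to_pixels(x_um, y_um, pixel_size=PIXEL_SIZE_UM):
    """Convert micron coordinates to fractional (x, y) pixel coordinates."""
    return x_um / pixel_size, y_um / pixel_size


def pixel_index(x_um, y_um, pixel_size=PIXEL_SIZE_UM):
    """Return integer (row, column) indices for indexing a 2D image array."""
    x_px, y_px = microns_to_pixels(x_um, y_um, pixel_size)
    # Assumption: x runs along columns and y along rows, origin at the upper left.
    return math.floor(y_px), math.floor(x_px)
```

Note the swapped order in pixel_index: array indexing is usually (row, column), which corresponds to (y, x) under the assumption above.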
The ome.tif files are 16-bit grayscale and compressed with JPEG-2000. They can be loaded with the tifffile Python package or the bioformats Java library/CLI.
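Example Python code to read a layer of the image using the tifffile package:
import tifffile
# Open the OME-TIFF, read the page at index 1 into a numpy array,
# and close the file when done.
with tifffile.TiffFile("morphology_mip.ome.tif") as tif:
    image = tif.pages[1].asarray()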
Cell morphology DAPI with the maximum intensity projection (MIP) of the DAPI Z-stack image.
Cell morphology DAPI with the single best-focus Z-plane from the DAPI Z-stack image.
An H&E image of the tissue section, taken after the Xenium protocol, if available. Note, this image is acquired on a different microscope, and is not registered to the morphology image. Some post-processing will be required to register the H&E image with the DAPI morphology image.
Nucleus boundaries are determined by a nucleus segmentation algorithm that runs on the DAPI morphology image. Cell boundaries are determined by expanding the nucleus boundaries by up to 15 µm, or until the expanded boundary hits another cell.
The CSV representation of the nucleus and cell boundaries. Each row represents a vertex in the boundary polygon of one cell. The boundary points for each cell appear in clockwise order, and the first and last points are duplicates to indicate the closed polygon. The columns of these files are:
Nucleus and cell boundaries in parquet format. Contains same data as the nucleus_boundaries.csv.gz and cell_boundaries.csv.gz, but can be faster to load and filter with the appropriate reader method.
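Since the CSV boundary files store one vertex per row, they need to be grouped by cell before they can be used as polygons. The sketch below does this with the standard library and checks the closed-polygon property described above. The column names cell_id, vertex_x and vertex_y are assumptions, as are the function names; adjust the keyword arguments to match the header of your file.

```python
import csv
import gzip
from collections import defaultdict


def read_boundary_polygons(path, id_col="cell_id", x_col="vertex_x", y_col="vertex_y"):
    """Group boundary vertices by cell, preserving the order in which they appear."""
    polygons = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            polygons[row[id_col]].append((float(row[x_col]), float(row[y_col])))
    return dict(polygons)


def is_closed(vertices):
    """True when the first and last vertices coincide, marking a closed polygon."""
    return len(vertices) >= 2 and vertices[0] == vertices[-1]
```

The same function can be pointed at either nucleus_boundaries.csv.gz or cell_boundaries.csv.gz, since the two files are described as sharing one layout.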
Cell boundaries (Zarr). A zipped Zarr file containing the results of the cell segmentation algorithm. Contains polygons and instance masks for 2 related segmentations. Polygon set 0 and mask 0 represent the segmented nuclei regions. Polygon set 1 and mask 1 represent the nucleus expanded with the expand_labels method derived from CellProfiler.
You can open a Zarr file using the zarr Python package with this snippet:
import zarr
f = "hBreast_ffpe_rep1/cell_segmentation_dataset.zarr.zip"
# Open read-only so the zipped output file is never modified
zf = zarr.open(f, mode="r")
The Zarr file contains the following array datasets that can be loaded into numpy arrays.
An array containing information about each cell & nucleus. Currently it contains the following columns, which correspond to the cells.csv.gz file documented above.
# Load the cell summary dataset - this is [num_cells, 6] array
# with metadata about each cell
cell_summary = zf["cell_summary"][:]
# The meaning of each column is available in the 'column_names' attribute:
col_names = zf["cell_summary"].attrs["column_names"]
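To pull a single named column out of the cell summary array, the column_names attribute can be used as a lookup. The helper below is a hypothetical sketch that works on any row-major table, including the array loaded above; the column name you pass must be one that actually appears in column_names.

```python
def column_values(table, column_names, name):
    """Return one named column from a row-major table such as cell_summary."""
    # Raises ValueError if the name is not present in column_names.
    index = list(column_names).index(name)
    return [row[index] for row in table]
```

With a numpy array, slicing with cell_summary[:, index] gives the same values without building a Python list.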
Polygon vertices. There are two sets of polygons, corresponding to the S axis of the array, with 0 corresponding to the segmented nuclei boundaries and 1 corresponding to the segmented cell boundaries.
Shape (S, C, 2*V) where S is the number of polygon sets, C is the number of cells, and V is the maximum number of vertices per polygon. The last dimension is the vertex coordinates laid out as (X1, Y1, …, XV, YV). The first vertex is repeated at the end of each polygon, and the last vertex is then repeated to fill out the remainder of the array.
The number of polygon vertices. Shape (S, C) where S is the number of polygon sets, C is the number of cells. Each element is the number of vertices for that polygon, INCLUDING the repetition of the initial vertex.
# Load the nucleus/cell polygon datasets
polygon_num_vertices = zf["polygon_num_vertices"][:]
polygon_vertices = zf["polygon_vertices"][:]
# Nucleus boundary polygons for first 5 cells
polygon_vertices[0,:5,:]
# Number of vertices in nucleus boundary for first 5 cells.
polygon_num_vertices[0, :5]
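Because each polygon is stored as a flat, padded row of interleaved coordinates, it usually needs to be unpacked before use. The following sketch (the function name is hypothetical) returns the (x, y) pairs for one polygon and drops the padding, relying only on the layout described above. It accepts the arrays loaded above or plain nested lists.

```python
def polygon_points(polygon_vertices, polygon_num_vertices, set_index, cell_index):
    """Return the (x, y) vertices of one polygon, dropping the trailing padding."""
    flat = polygon_vertices[set_index][cell_index]
    # The stored count includes the repetition of the initial vertex.
    count = int(polygon_num_vertices[set_index][cell_index])
    # Coordinates are interleaved as x1, y1, x2, y2, ...
    return [(float(flat[2 * i]), float(flat[2 * i + 1])) for i in range(count)]
```

For example, polygon_points(polygon_vertices, polygon_num_vertices, 0, 0) gives the nucleus boundary of the first cell, with the first vertex appearing again as the last element.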
The instance masks corresponding to each polygon set.
The 3x3 transform matrix. It is used to move the data into the physical coordinate space (in microns). Currently this is a trivial scaling by 1.0/pixel_size.
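A 3x3 matrix of this kind is typically applied to 2D points in homogeneous coordinates. The sketch below shows that general technique; it assumes the column-vector convention (the matrix multiplies the vector (x, y, 1)), which is not confirmed above, and the function names are hypothetical. The scale value is left as a parameter so that no particular direction of conversion is implied.

```python
def apply_transform(matrix, x, y):
    """Apply a 3x3 homogeneous transform to a 2D point and return the new (x, y)."""
    # Assumed convention: matrix times the column vector (x, y, 1).
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    return xh / w, yh / w


def scaling_matrix(scale):
    """Build a pure scaling transform with no rotation or translation."""
    return [[scale, 0.0, 0.0], [0.0, scale, 0.0], [0.0, 0.0, 1.0]]
```

Writing the general form, instead of multiplying by a scalar, means the same code keeps working if a later release stores a transform that also includes translation or rotation.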