Reproducing BEELINE Results

This page describes how to reproduce the results from the original BEELINE paper (Pratapa et al., Nature Methods, 2020).

Environment Setup

Before running any experiments, complete the standard BEELINE setup described in Getting Started:

  1. Install and configure Docker.

  2. Create the BEELINE conda environment:

    . utils/setupAnacondaVENV.sh
    
  3. Pull the algorithm Docker images:

    . utils/initialize.sh
    

Downloading Data from Zenodo

The datasets used in the paper are available on Zenodo at https://doi.org/10.5281/zenodo.3378975.

Note

BEELINE v1.1 requires Zenodo dataset version v4; earlier versions of BEELINE should use Zenodo dataset version v3. If you are using an older version of BEELINE, refer to the documentation for that version rather than this page. The main change in v4 is a streamlined directory structure that makes it easier to replicate certain figures.

The archive contains two top-level items:

  • BEELINE-data/ — the expression, pseudotime, and ground truth files for the Synthetic and Curated datasets, as well as the raw scRNA-seq experimental data.

  • BEELINE-Networks/ — reference networks (ChIP-seq and STRING) and transcription factor lists used for preprocessing the experimental datasets.

Extracting the data

Copy the contents of BEELINE-data/inputs/ into the inputs/ directory at the root of this repository, and place BEELINE-Networks/ at the repository root as well. Move the config files from configs to the config-files directory. After extraction, the data directories should be laid out as follows:

inputs/
├── Curated/
│   ├── GSD/            (GroundTruthNetwork.csv + 10 run subdirectories)
│   ├── GSD-q50/        (50% dropout variant)
│   ├── GSD-q70/        (70% dropout variant)
│   ├── HSC/
│   ├── HSC-q50/
│   ├── HSC-q70/
│   ├── mCAD/
│   ├── mCAD-q50/
│   ├── mCAD-q70/
│   ├── VSC/
│   ├── VSC-q50/
│   └── VSC-q70/
├── Synthetic/
│   ├── dyn-BF/
│   │   ├── dyn-BF-100/ (10 run subdirectories)
│   │   ├── dyn-BF-200/
│   │   ├── dyn-BF-500/
│   │   ├── dyn-BF-2000/
│   │   └── dyn-BF-5000/
│   ├── dyn-BFC/        (same cell-count structure)
│   ├── dyn-CY/
│   ├── dyn-LI/
│   ├── dyn-LL/
│   └── dyn-TF/
└── scRNA-Seq/
    ├── hESC/
    ├── hHep/
    ├── mDC/
    ├── mESC/
    ├── mHSC-E/
    ├── mHSC-GM/
    └── mHSC-L/

BEELINE-Networks/
├── Networks/
│   ├── human/      (ChIP-seq and STRING networks)
│   └── mouse/      (ChIP-seq and STRING networks)
├── human-tfs.csv
└── mouse-tfs.csv
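
Before starting any runs, it can help to confirm that the extraction produced this layout. The following is a minimal sketch, not part of BEELINE: the helper name missing_paths is hypothetical, and it checks only the top-level paths shown in the tree above.

```python
from pathlib import Path

# Paths taken from the layout shown above, relative to the repository root.
EXPECTED_PATHS = [
    "inputs/Curated",
    "inputs/Synthetic",
    "inputs/scRNA-Seq",
    "BEELINE-Networks/Networks/human",
    "BEELINE-Networks/Networks/mouse",
    "BEELINE-Networks/human-tfs.csv",
    "BEELINE-Networks/mouse-tfs.csv",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root.

    Hypothetical helper for illustration; BEELINE does not ship it.
    """
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing after extraction:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected top-level paths are present.")
```

Run it from the repository root; an empty result means the top-level directories and TF lists are in place, but it does not verify the contents of each dataset directory.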

Synthetic and Curated Datasets (Figures 2 and 4)

Figures 2 and 4 evaluate algorithm performance on the Synthetic and Curated datasets, both of which were simulated with BoolODE.

Each of the four Curated datasets (GSD, HSC, mCAD, VSC) is split into three top-level directories by dropout level: no added dropout (e.g. GSD), 50% dropout (GSD-q50), and 70% dropout (GSD-q70). Each directory contains a shared GroundTruthNetwork.csv and 10 run subdirectories (e.g. GSD-2000-1 through GSD-2000-10), each with ExpressionData.csv and PseudoTime.csv.

Each of the six Synthetic networks (dyn-BF, dyn-BFC, dyn-CY, dyn-LI, dyn-LL, dyn-TF) contains five cell-count subdirectories (100, 200, 500, 2000, 5000 cells), each with 10 replicate run directories (e.g. dyn-BF/dyn-BF-100/dyn-BF-100-1/). Each leaf run directory contains ExpressionData.csv, PseudoTime.csv, and GroundTruthNetwork.csv.
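
The Synthetic naming scheme described above implies 6 × 5 × 10 = 300 run directories. As a rough illustration, the sketch below enumerates them and reports any that lack one of the three expected files. The function names are hypothetical and are not part of BEELINE; the directory pattern is taken from the example path dyn-BF/dyn-BF-100/dyn-BF-100-1/.

```python
from pathlib import Path

NETWORKS = ["dyn-BF", "dyn-BFC", "dyn-CY", "dyn-LI", "dyn-LL", "dyn-TF"]
CELL_COUNTS = [100, 200, 500, 2000, 5000]
NUM_RUNS = 10
RUN_FILES = ["ExpressionData.csv", "PseudoTime.csv", "GroundTruthNetwork.csv"]


def synthetic_run_dirs(inputs_root="inputs"):
    """List every Synthetic run directory implied by the naming scheme."""
    base = Path(inputs_root) / "Synthetic"
    dirs = []
    for net in NETWORKS:
        for cells in CELL_COUNTS:
            group = f"{net}-{cells}"  # e.g. dyn-BF-100
            for run in range(1, NUM_RUNS + 1):
                dirs.append(base / net / group / f"{group}-{run}")
    return dirs


def incomplete_runs(inputs_root="inputs"):
    """Return run directories that lack any of the three expected files."""
    return [
        d for d in synthetic_run_dirs(inputs_root)
        if not all((d / name).is_file() for name in RUN_FILES)
    ]
```

Calling incomplete_runs() from the repository root should return an empty list once the Synthetic data has been copied into inputs/.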

Step 1: Run the inference algorithms

Note

SCNS is not included in BEELINE v1.1 due to its long run time.

Config files for each dataset are in config-files/Curated/ and config-files/Synthetic/. Run BLRunner.py once for each config file:

python BLRunner.py --config config-files/Curated/GSD.yaml
python BLRunner.py --config config-files/Curated/HSC.yaml
python BLRunner.py --config config-files/Curated/mCAD.yaml
python BLRunner.py --config config-files/Curated/VSC.yaml
python BLRunner.py --config config-files/Synthetic/dyn-BF.yaml
python BLRunner.py --config config-files/Synthetic/dyn-BFC.yaml
python BLRunner.py --config config-files/Synthetic/dyn-CY.yaml
python BLRunner.py --config config-files/Synthetic/dyn-LI.yaml
python BLRunner.py --config config-files/Synthetic/dyn-LL.yaml
python BLRunner.py --config config-files/Synthetic/dyn-TF.yaml
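
The ten invocations above can also be driven from a short script. This is a sketch under the assumption that it is run from the repository root with the BEELINE conda environment active; runner_commands and run_all are hypothetical helper names, and only the BLRunner.py --config usage shown above is relied on.

```python
import subprocess
import sys

# Config names as listed above, grouped by collection.
CONFIGS = {
    "Curated": ["GSD", "HSC", "mCAD", "VSC"],
    "Synthetic": ["dyn-BF", "dyn-BFC", "dyn-CY", "dyn-LI", "dyn-LL", "dyn-TF"],
}


def runner_commands(python=sys.executable):
    """Build one BLRunner.py command per config file."""
    return [
        [python, "BLRunner.py", "--config",
         f"config-files/{collection}/{name}.yaml"]
        for collection, names in CONFIGS.items()
        for name in names
    ]


def run_all(dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for cmd in runner_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on the first failing config and stops the loop.
            subprocess.run(cmd, check=True)
```

Calling run_all() only prints the commands; run_all(dry_run=False) executes them one after another, stopping at the first failure so that a broken config is noticed before later datasets start.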

Step 2: Evaluate algorithm output

Use BLEvaluator.py with the plot config files, which aggregate results across all datasets for a given collection:

# Curated networks
python BLEvaluator.py --config config-files/Curated/PlotCuratedNetworks.yaml \
    --auc --spearman --epr --sepr

# Synthetic networks
python BLEvaluator.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
    --auc --spearman --epr --sepr

Step 3: Generate plots

Figure 2 — AUPRC ratio and Spearman stability summary heatmap:

python BLPlotter.py --config config-files/Curated/PlotCuratedNetworks.yaml \
    --output outputs/plots/Curated --summary

python BLPlotter.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
    --output outputs/plots/Synthetic --summary

Figure 4 — EPR and signed EPR summary heatmap:

python BLPlotter.py --config config-files/Curated/PlotCuratedNetworks.yaml \
    --output outputs/plots/Curated --epr-summary

python BLPlotter.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
    --output outputs/plots/Synthetic --epr-summary

Output PDFs are written to the specified --output directories as Summary.pdf and EPRSummary.pdf respectively.
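
To confirm that all four plotting commands completed, one option is to check for the PDFs named above. The snippet is a small sketch with a hypothetical helper name (missing_plots); it assumes the --output directories used in the commands above.

```python
from pathlib import Path

# File names as stated above; directories match the --output arguments.
EXPECTED_PDFS = [
    f"outputs/plots/{collection}/{name}"
    for collection in ("Curated", "Synthetic")
    for name in ("Summary.pdf", "EPRSummary.pdf")
]


def missing_plots(repo_root="."):
    """Return the expected PDF paths that are not present under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PDFS if not (root / p).is_file()]
```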

Experimental Datasets (Figures 5 and 6)

Figures 5 and 6 evaluate the algorithms on seven experimental scRNA-seq datasets, stored in inputs/scRNA-Seq/:

  • hESC: Human embryonic stem cells

  • hHep: Human mature hepatocytes

  • mDC: Mouse dendritic cells

  • mESC: Mouse embryonic stem cells

  • mHSC-E: Mouse hematopoietic stem cells, erythroid lineage

  • mHSC-GM: Mouse hematopoietic stem cells, granulocyte/monocyte lineage

  • mHSC-L: Mouse hematopoietic stem cells, lymphoid lineage

Each dataset directory contains ExpressionData.csv, PseudoTime.csv, and GeneOrdering.csv (a ranked list of genes with p-values and variance statistics). These datasets do not include a pre-built ground truth network file; reference networks must be sourced from BEELINE-Networks/.

Preprocessing with generateExpInputs.py

The script utils/generateExpInputs.py produces a filtered ExpressionData.csv by selecting the most variable and statistically significant genes. It can optionally restrict a reference network to only those genes. Run it once per dataset/configuration you want to evaluate. For example, for hESC with 500 non-TF genes against the ChIP-seq reference network:

python utils/generateExpInputs.py \
    --expFile inputs/scRNA-Seq/hESC/ExpressionData.csv \
    --geneOrderingFile inputs/scRNA-Seq/hESC/GeneOrdering.csv \
    --TFFile BEELINE-Networks/human-tfs.csv \
    --netFile BEELINE-Networks/Networks/human/hESC-ChIP-seq-network.csv \
    --pVal 0.01 \
    --numGenes 500 \
    --outPrefix inputs/Experimental/hESC/hESC-500

For mouse datasets, substitute BEELINE-Networks/mouse-tfs.csv and the appropriate file from BEELINE-Networks/Networks/mouse/. Run python utils/generateExpInputs.py --help for the full list of options.
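
When preprocessing several datasets or gene counts, it may be convenient to build these commands programmatically. The sketch below is illustrative only: exp_inputs_command is a hypothetical helper, the flags are the ones used in the hESC example above, and the species is inferred from the first letter of the dataset name (h for human, m for mouse), which is an assumption based on the dataset list on this page. The reference network path is passed in unchanged, since only the hESC ChIP-seq file name is given here.

```python
import sys

# TF lists from BEELINE-Networks/, keyed by the dataset name's first letter.
TF_FILES = {
    "h": "BEELINE-Networks/human-tfs.csv",
    "m": "BEELINE-Networks/mouse-tfs.csv",
}


def exp_inputs_command(dataset, net_file, num_genes=500, p_val=0.01,
                       python=sys.executable):
    """Build a generateExpInputs.py command for one dataset.

    net_file is used as given; look up the actual file under
    BEELINE-Networks/Networks/ rather than guessing its name.
    """
    tf_file = TF_FILES[dataset[0]]  # assumption: 'h' = human, 'm' = mouse
    src = f"inputs/scRNA-Seq/{dataset}"
    return [
        python, "utils/generateExpInputs.py",
        "--expFile", f"{src}/ExpressionData.csv",
        "--geneOrderingFile", f"{src}/GeneOrdering.csv",
        "--TFFile", tf_file,
        "--netFile", net_file,
        "--pVal", str(p_val),
        "--numGenes", str(num_genes),
        "--outPrefix", f"inputs/Experimental/{dataset}/{dataset}-{num_genes}",
    ]
```

For example, exp_inputs_command("hESC", "BEELINE-Networks/Networks/human/hESC-ChIP-seq-network.csv") reproduces the arguments of the command shown above, and the resulting list can be passed to subprocess.run.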

The output files should be placed in the appropriate subdirectory of inputs/ and referenced from a BEELINE config file as described in Config Files.

Note on reproducibility

Figures 5 and 6 are the result of a large number of runs across all seven experimental datasets with significantly customized configurations. Fully replicating these figures requires several weeks of compute time, and step-by-step instructions for reproducing them exactly are not available. A user familiar with BEELINE can configure the pipeline to reproduce these results from the data available on Zenodo.