Reproducing BEELINE Results
===========================

This page describes how to reproduce the results from the original BEELINE paper (Pratapa et al., *Nature Methods*, 2020).

Environment Setup
-----------------

Before running any experiments, complete the standard BEELINE setup described in :doc:`getting-started`:

1. Install and configure Docker.
2. Create the BEELINE conda environment:

   .. code:: bash

      . utils/setupAnacondaVENV.sh

3. Pull the algorithm Docker images:

   .. code:: bash

      . utils/initialize.sh

Downloading Data from Zenodo
----------------------------

The datasets used in the paper are available on Zenodo at `https://doi.org/10.5281/zenodo.3378975 <https://doi.org/10.5281/zenodo.3378975>`_.

.. note::

   **BEELINE v1.1 requires Zenodo dataset version v4.** Earlier versions of BEELINE should use Zenodo dataset version v3. If you are using an older version of BEELINE, please refer to the documentation for that version rather than this page. The significant change is a streamlined directory structure that allows users to easily replicate certain figures.

The archive contains two top-level items:

- ``BEELINE-data/``: the expression, pseudotime, and ground truth files for the Synthetic and Curated datasets, as well as the raw scRNA-seq experimental data.
- ``BEELINE-Networks/``: reference networks (ChIP-seq and STRING) and transcription factor lists used for preprocessing the experimental datasets.

**Extracting the data**

Copy the contents of ``BEELINE-data/inputs/`` into the ``inputs/`` directory at the root of this repository, and place ``BEELINE-Networks/`` at the repository root as well. Move the config files from ``configs`` into the ``config-files`` directory. After extraction, the data should be laid out as follows:

.. code:: text

   inputs/
   ├── Curated/
   │   ├── GSD/          (GroundTruthNetwork.csv + 10 run subdirectories)
   │   ├── GSD-q50/      (50% dropout variant)
   │   ├── GSD-q70/      (70% dropout variant)
   │   ├── HSC/
   │   ├── HSC-q50/
   │   ├── HSC-q70/
   │   ├── mCAD/
   │   ├── mCAD-q50/
   │   ├── mCAD-q70/
   │   ├── VSC/
   │   ├── VSC-q50/
   │   └── VSC-q70/
   ├── Synthetic/
   │   ├── dyn-BF/
   │   │   ├── dyn-BF-100/    (10 run subdirectories)
   │   │   ├── dyn-BF-200/
   │   │   ├── dyn-BF-500/
   │   │   ├── dyn-BF-2000/
   │   │   └── dyn-BF-5000/
   │   ├── dyn-BFC/      (same cell-count structure)
   │   ├── dyn-CY/
   │   ├── dyn-LI/
   │   ├── dyn-LL/
   │   └── dyn-TF/
   └── scRNA-Seq/
       ├── hESC/
       ├── hHep/
       ├── mDC/
       ├── mESC/
       ├── mHSC-E/
       ├── mHSC-GM/
       └── mHSC-L/

   BEELINE-Networks/
   ├── Networks/
   │   ├── human/        (ChIP-seq and STRING networks)
   │   └── mouse/        (ChIP-seq and STRING networks)
   ├── human-tfs.csv
   └── mouse-tfs.csv

Synthetic and Curated Datasets (Figures 2 and 4)
------------------------------------------------

Figures 2 and 4 evaluate algorithm performance on the Synthetic and Curated datasets, both of which were simulated with BoolODE.

Each of the four Curated datasets (GSD, HSC, mCAD, VSC) is split into three top-level directories by dropout level: no dropout (e.g. ``GSD``), 50% dropout (``GSD-q50``), and 70% dropout (``GSD-q70``). Each directory contains a shared ``GroundTruthNetwork.csv`` and 10 run subdirectories (e.g. ``GSD-2000-1`` through ``GSD-2000-10``), each with ``ExpressionData.csv`` and ``PseudoTime.csv``.

Each of the six Synthetic networks (dyn-BF, dyn-BFC, dyn-CY, dyn-LI, dyn-LL, dyn-TF) contains five cell-count subdirectories (100, 200, 500, 2000, 5000 cells), each with 10 replicate run directories (e.g. ``dyn-BF/dyn-BF-100/dyn-BF-100-1/``). Each leaf run directory contains ``ExpressionData.csv``, ``PseudoTime.csv``, and ``GroundTruthNetwork.csv``.

**Step 1: Run the inference algorithms**

.. note::

   SCNS is not included in BEELINE v1.1 due to its long run time.

Config files for each dataset are in ``config-files/Curated/`` and ``config-files/Synthetic/``.
Run ``BLRunner.py`` once for each config file:

.. code:: bash

   python BLRunner.py --config config-files/Curated/GSD.yaml
   python BLRunner.py --config config-files/Curated/HSC.yaml
   python BLRunner.py --config config-files/Curated/mCAD.yaml
   python BLRunner.py --config config-files/Curated/VSC.yaml

   python BLRunner.py --config config-files/Synthetic/dyn-BF.yaml
   python BLRunner.py --config config-files/Synthetic/dyn-BFC.yaml
   python BLRunner.py --config config-files/Synthetic/dyn-CY.yaml
   python BLRunner.py --config config-files/Synthetic/dyn-LI.yaml
   python BLRunner.py --config config-files/Synthetic/dyn-LL.yaml
   python BLRunner.py --config config-files/Synthetic/dyn-TF.yaml

**Step 2: Evaluate algorithm output**

Use ``BLEvaluator.py`` with the plot config files, which aggregate results across all datasets for a given collection:

.. code:: bash

   # Curated networks
   python BLEvaluator.py --config config-files/Curated/PlotCuratedNetworks.yaml \
       --auc --spearman --epr --sepr

   # Synthetic networks
   python BLEvaluator.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
       --auc --spearman --epr --sepr

**Step 3: Generate plots**

*Figure 2*: AUPRC ratio and Spearman stability summary heatmap:

.. code:: bash

   python BLPlotter.py --config config-files/Curated/PlotCuratedNetworks.yaml \
       --output outputs/plots/Curated --summary
   python BLPlotter.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
       --output outputs/plots/Synthetic --summary

*Figure 4*: EPR and signed EPR summary heatmap:

.. code:: bash

   python BLPlotter.py --config config-files/Curated/PlotCuratedNetworks.yaml \
       --output outputs/plots/Curated --epr-summary
   python BLPlotter.py --config config-files/Synthetic/PlotSyntheticNetworks.yaml \
       --output outputs/plots/Synthetic --epr-summary

Output PDFs are written to the specified ``--output`` directories as ``Summary.pdf`` and ``EPRSummary.pdf``, respectively.
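
The ten Step 1 invocations above can also be driven from a loop. The following is a minimal sketch and not part of BEELINE itself: the function names ``beeline_configs`` and ``run_all`` and the ``DRY_RUN`` variable are hypothetical helpers introduced here, while the config paths are the ones listed in Step 1. By default the sketch only prints the commands.

.. code:: bash

   # Hypothetical helper: print the ten config paths used in Step 1.
   beeline_configs() {
       for name in GSD HSC mCAD VSC; do
           echo "config-files/Curated/${name}.yaml"
       done
       for name in dyn-BF dyn-BFC dyn-CY dyn-LI dyn-LL dyn-TF; do
           echo "config-files/Synthetic/${name}.yaml"
       done
   }

   # Hypothetical helper: print (DRY_RUN=1, the default) or execute
   # (DRY_RUN=0) one BLRunner.py call per config, stopping at the first failure.
   run_all() {
       for cfg in $(beeline_configs); do
           if [ "${DRY_RUN:-1}" = "1" ]; then
               echo "python BLRunner.py --config $cfg"
           else
               python BLRunner.py --config "$cfg" || return 1
           fi
       done
   }

   run_all

To actually launch the runs, call ``DRY_RUN=0 run_all`` from the repository root with the BEELINE conda environment active.
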

Experimental Datasets (Figures 5 and 6)
---------------------------------------

Figures 5 and 6 include evaluation on seven real experimental scRNA-seq datasets from ``inputs/scRNA-Seq/``:

.. csv-table::
   :widths: 20, 80

   "**hESC**", "Human embryonic stem cells"
   "**hHep**", "Human mature hepatocytes"
   "**mDC**", "Mouse dendritic cells"
   "**mESC**", "Mouse embryonic stem cells"
   "**mHSC-E**", "Mouse hematopoietic stem cells (erythroid lineage)"
   "**mHSC-GM**", "Mouse hematopoietic stem cells (granulocyte/monocyte lineage)"
   "**mHSC-L**", "Mouse hematopoietic stem cells (lymphoid lineage)"

Each dataset directory contains ``ExpressionData.csv``, ``PseudoTime.csv``, and ``GeneOrdering.csv`` (a ranked list of genes with p-values and variance statistics). These datasets do **not** include a pre-built ground truth network file; reference networks must be sourced from ``BEELINE-Networks/``.

**Preprocessing with generateExpInputs.py**

The script ``utils/generateExpInputs.py`` produces a filtered ``ExpressionData.csv`` by selecting the most variable and statistically significant genes. It can optionally restrict a reference network to only those genes.

Run it once per dataset/configuration you want to evaluate. For example, for hESC with 500 non-TF genes against the ChIP-seq reference network:

.. code:: bash

   python utils/generateExpInputs.py \
       --expFile inputs/scRNA-Seq/hESC/ExpressionData.csv \
       --geneOrderingFile inputs/scRNA-Seq/hESC/GeneOrdering.csv \
       --TFFile BEELINE-Networks/human-tfs.csv \
       --netFile BEELINE-Networks/Networks/human/hESC-ChIP-seq-network.csv \
       --pVal 0.01 \
       --numGenes 500 \
       --outPrefix inputs/Experimental/hESC/hESC-500

For mouse datasets, substitute ``BEELINE-Networks/mouse-tfs.csv`` and the appropriate file from ``BEELINE-Networks/Networks/mouse/``. Run ``python utils/generateExpInputs.py --help`` for the full list of options.
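
Because the script is run once per configuration, the hESC example above can be wrapped in a loop over gene counts. This is only a sketch under stated assumptions: ``run_or_echo``, ``preprocess_hesc``, and ``DRY_RUN`` are hypothetical names, the second gene count (1000) is illustrative rather than a documented setting, and the flags and paths are copied from the example above. By default the commands are printed, not executed.

.. code:: bash

   # Hypothetical helper: print the command (DRY_RUN=1, the default)
   # or execute it (DRY_RUN=0).
   run_or_echo() {
       if [ "${DRY_RUN:-1}" = "1" ]; then
           echo "$*"
       else
           "$@"
       fi
   }

   # Hypothetical helper: preprocess hESC once per gene count given as an argument.
   # The output prefix follows the hESC-500 pattern from the example above.
   preprocess_hesc() {
       for n in "$@"; do
           run_or_echo python utils/generateExpInputs.py \
               --expFile inputs/scRNA-Seq/hESC/ExpressionData.csv \
               --geneOrderingFile inputs/scRNA-Seq/hESC/GeneOrdering.csv \
               --TFFile BEELINE-Networks/human-tfs.csv \
               --netFile BEELINE-Networks/Networks/human/hESC-ChIP-seq-network.csv \
               --pVal 0.01 \
               --numGenes "$n" \
               --outPrefix "inputs/Experimental/hESC/hESC-${n}" || return 1
       done
   }

   # Gene counts here are illustrative assumptions.
   preprocess_hesc 500 1000

Each gene count yields its own output prefix, so the resulting files do not overwrite one another.
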
The output files should be placed in the appropriate subdirectory of ``inputs/`` and referenced from a BEELINE config file as described in :ref:`configfiles`.

**Note on reproducibility**

Figures 5 and 6 are the result of a large number of runs across all seven experimental datasets with significantly customized configurations. Fully replicating these figures requires several weeks of compute time, and step-by-step instructions for reproducing them exactly are not available. A user familiar with BEELINE can configure the pipeline to reproduce these results from the data available on Zenodo.