  • PostScript 84.6%
  • TeX 13.8%
  • R 1.5%
  • Mathematica 0.1%
2020-03-29 11:46:07 +02:00
00_metadata Added packrat summary table 2018-08-01 15:07:02 +02:00
00_rawdata Added example on how to generate complex-protein mapping 2020-03-29 11:46:07 +02:00
01_data Analysis 5.1 (preliminary) 2017-10-02 17:16:27 +02:00
02_performance Analysis 5.1 (preliminary) 2017-10-02 17:16:27 +02:00
03_data Analysis 5.3 2017-10-20 11:30:31 +02:00
03_performance Analysis 5.3 2017-10-20 11:30:31 +02:00
04_topology Run 5.3 bis 2017-10-21 12:08:51 +02:00
05_mashup Added packrat and other files for reproducibility 2017-12-27 18:23:38 +01:00
10_data All diseases + logistic models 2017-11-27 10:38:18 +01:00
11_topology Disease clustering 2017-11-28 12:48:22 +01:00
12_performance Predictions using genetic scores as well 2017-12-11 14:20:06 +01:00
13_complexes quick fix 2017-12-21 18:16:31 +01:00
20_data Added genetic scores histogram 2018-08-02 17:28:15 +02:00
21_topology Exported data frames with topological properties 2018-03-29 12:56:15 +02:00
22_performance CV runs on STRING 2018-02-06 12:17:38 +01:00
23_boxplots Models on all networks 2018-02-09 15:24:01 +01:00
23_contrasts Models on all networks 2018-02-09 15:24:01 +01:00
23_models Models on all networks 2018-02-09 15:24:01 +01:00
40_data Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now. 2018-03-27 18:47:57 +02:00
42_performance CV runs OmniPath 2018-02-06 16:26:34 +01:00
43_boxplots Models on all networks 2018-02-09 15:24:01 +01:00
43_contrasts Models on all networks 2018-02-09 15:24:01 +01:00
43_models Models on all networks 2018-02-09 15:24:01 +01:00
45_mashup Generated dataset, network and features for OmniPath. Added biomaRt library 2018-01-30 16:06:18 +01:00
63_boxplots Modified method ranking figure 2018-08-23 10:06:47 +02:00
63_models Added plot on predictions and updated text files (probably emmeans vs lsmeans) 2018-08-01 11:00:10 +02:00
packrat Finally installed the GGally dependence 2018-07-30 11:51:48 +02:00
.gitignore Added packrat sources - experimental as they are large 2017-12-27 18:29:14 +01:00
.Renviron Added missing dependency 2017-12-27 19:51:55 +01:00
.Rprofile Added packrat and other files for reproducibility 2017-12-27 18:23:38 +01:00
00_packrat_table.R Added packrat summary table 2018-08-01 15:07:02 +02:00
01_preprocessing.Rmd Analysis 5.1 (preliminary) 2017-10-02 17:16:27 +02:00
02_diffusion_scores.Rmd Analysis 5.1 (preliminary) 2017-10-02 17:16:27 +02:00
03_config.R First descriptive statistics on complex data 2017-12-04 17:16:19 +01:00
03_multiple_disease.Rmd Analysis 5.3 2017-10-20 11:30:31 +02:00
03_preprocessing.Rmd Added scripts for analysing 4 diseases 2017-10-11 12:23:44 +02:00
04_positives_analysis.Rmd Run 5.3 bis 2017-10-21 12:08:51 +02:00
05_mashup.m All diseases + logistic models 2017-11-27 10:38:18 +01:00
05_mashup_features.Rmd All diseases + logistic models 2017-11-27 10:38:18 +01:00
10_preprocessing.Rmd Now genes are not filtered if no drugs or genetic association is known 2017-11-15 16:55:31 +01:00
11_positives_analysis.Rmd Disease clustering 2017-11-28 12:48:22 +01:00
11_upgma.R Disease clustering 2017-11-28 12:48:22 +01:00
12_multiple_disease.Rmd Predictions using genetic scores as well 2017-12-11 14:20:06 +01:00
13_complexes.Rmd Added simulated CV folds 2017-12-05 15:29:14 +01:00
13_pilot_cv_schemes.Rmd quick fix 2017-12-21 18:16:31 +01:00
20_config.R Added fold imbalance plot 2018-04-03 13:39:42 +02:00
20_preprocessing.Rmd Added genetic scores histogram 2018-08-02 17:28:15 +02:00
21_positives_analysis.Rmd Exported data frames with topological properties 2018-03-29 12:56:15 +02:00
22_performance.Rmd CV runs on STRING 2018-02-06 12:17:38 +01:00
23_models.Rmd Models on all networks 2018-02-09 15:24:01 +01:00
40_config.R Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now. 2018-03-27 18:47:57 +02:00
40_preprocessing.Rmd Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now. 2018-03-27 18:47:57 +02:00
42_performance.Rmd CV runs OmniPath 2018-02-06 16:26:34 +01:00
43_models.Rmd Models on all networks 2018-02-09 15:24:01 +01:00
45_mashup.m Generated dataset, network and features for OmniPath. Added biomaRt library 2018-01-30 16:06:18 +01:00
60_abbreviations.R Added topology analysis on STRING. Abbreviations are now in a config file 60_abbreviations.R 2018-03-28 11:48:45 +02:00
60_config.R Added new boxplots by disease/method 2018-03-29 11:36:32 +02:00
60_palette25.txt Added new boxplots by disease/method 2018-03-29 11:36:32 +02:00
63_models.Rmd Modified method ranking figure 2018-08-23 10:06:47 +02:00
config.R exploratory analysis 2017-09-18 19:43:15 +02:00
genease.Rproj exploratory analysis 2017-09-18 19:43:15 +02:00
LICENSE Create LICENSE 2018-10-23 17:51:20 +02:00
README.md Update README 2018-10-09 16:24:13 +02:00

Introduction

The genedise project aims to find druggable genes for a specific disease based on previously tried targets. Whether these targets were successful is not the primary concern: the fact that there was enough evidence to try them is enough for us. In this way, we aim to mimic the time-consuming task of proposing new, reasonable targets.

New disease genes are suggested by using OpenTargets data as seed gene lists and the STRING protein-protein interaction network to infer further genes.

The project is almost entirely coded in R. Some Matlab code was needed to include state-of-the-art approaches.

Structure

The files and directories of this project are prefixed with a number that indicates the chronological order of their execution. Scripts are stored in Rmd files. Their outputs are saved in folders sharing their prefix. The most relevant prefixes are:

  • 2X_: analysis on the STRING network
  • 4X_: analysis on the OmniPath network
  • 6X_: plots and models combining both networks (depends on the execution of the 2X and 4X scripts)

Reproducibility

Metadata files

The output of sessionInfo() is always stored in the directory 00_metadata to keep track of the package versions.
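
As an illustration, such a record could be written along the following lines. This is only a sketch: the helper `save_session_info` and the file name `sessionInfo.txt` are hypothetical and not taken from the project scripts.

```r
# Hypothetical helper: write the output of sessionInfo() to a text file
# inside a metadata directory (00_metadata in this project).
save_session_info <- function(dir_out, file_name = "sessionInfo.txt") {
  dir.create(dir_out, showWarnings = FALSE, recursive = TRUE)
  path_out <- file.path(dir_out, file_name)
  # capture.output() turns the printed session info into a character vector
  writeLines(utils::capture.output(utils::sessionInfo()), path_out)
  invisible(path_out)
}

# Usage: a temporary directory is used here so the repository is not touched
path_info <- save_session_info(file.path(tempdir(), "00_metadata"))
```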

Configuration files

There are configuration files, such as 03_config.R, that contain a comprehensive set of parameters, paths and file names. Generally, these parameters are sourced instead of being hardcoded in the scripts.
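
The pattern can be sketched as follows. All variable names and values in this example (`dir_data`, `file_scores`, `n_folds`) are illustrative assumptions and do not reproduce the contents of the real config files.

```r
# Hypothetical config file: in the project this role is played by files
# such as 03_config.R. All names and values below are illustrative.
config_file <- file.path(tempdir(), "example_config.R")
writeLines(c(
  "dir_data <- '03_data'",
  "file_scores <- file.path(dir_data, 'scores.rds')",
  "n_folds <- 3"
), config_file)

# Analysis scripts source the file instead of hardcoding the values.
# A dedicated environment keeps the parameters out of the global workspace.
cfg <- new.env()
source(config_file, local = cfg)

cfg$n_folds
cfg$file_scores
```

Sourcing into a separate environment is a design choice of this sketch; a plain `source()` call into the global environment works in the same way.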

Package management

Package versions are managed with packrat to ease portability between machines.

External files

Almost all the files in the project are included in the git repository at the moment. Exceptions:

  • STRING database files
  • Network kernel(s)

The paths to these files (on Sergi's machines) can be found in the config files.

Other

There are several set.seed calls throughout the code. Intermediate results are saved when the space required is not prohibitive.
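
A minimal sketch of how a fixed seed and a cached intermediate result can work together is shown below. The helper `cached_result` and the file name are hypothetical; the project scripts may organise this differently.

```r
# Hypothetical helper: compute a result once under a fixed seed and cache it
# on disk; later calls read the cached file instead of recomputing.
cached_result <- function(path, seed, compute) {
  if (file.exists(path)) return(readRDS(path))
  set.seed(seed)
  result <- compute()
  saveRDS(result, path)
  result
}

path_cache <- file.path(tempdir(), "example_intermediate.rds")
unlink(path_cache)  # start from a clean state for the demonstration

first <- cached_result(path_cache, seed = 1, compute = function() rnorm(5))
second <- cached_result(path_cache, seed = 1, compute = function() rnorm(5))
identical(first, second)  # the second call reads the cached file
```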

Workflow

Data preprocessing

  • Check OpenTargets data sanity
  • Choose network: compromise between coverage and size
  • Compute and store graph kernel on chosen network
  • Save cleaned data, mapped to the network of choice
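
To make the kernel step concrete, the sketch below computes a regularised Laplacian kernel on a toy network with base R and stores it on disk. The choice of kernel, the toy graph and all object names are assumptions made for illustration; the README does not state which kernel the project uses.

```r
# Toy undirected network with 4 genes (a path A - B - C - D); names are made up.
genes <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(genes, genes))
adj["A", "B"] <- adj["B", "C"] <- adj["C", "D"] <- 1
adj <- adj + t(adj)  # make the adjacency matrix symmetric

# Graph Laplacian L = D - A, with D the diagonal degree matrix
lap <- diag(rowSums(adj)) - adj

# Regularised Laplacian kernel K = (I + L)^(-1).
# Assumption: the kernel actually used in the project may differ.
kern <- solve(diag(nrow(adj)) + lap)
dimnames(kern) <- dimnames(adj)

# Store the kernel so that downstream scripts can load it instead of recomputing
path_kernel <- file.path(tempdir(), "example_kernel.rds")
saveRDS(kern, path_kernel)

# Diffusion scores: each gene receives the sum of its kernel similarities
# to the seed genes (here only A is a seed)
seeds <- c(A = 1, B = 0, C = 0, D = 0)
scores <- drop(kern %*% seeds)
```

On a real network the matrix inversion is the expensive part, which is one reason to compute the kernel once and reuse the stored copy.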

Topology analysis

  • Characterisation of disease genes in terms of network properties
  • Within-disease study
  • Between-disease study
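
The two kinds of study can be sketched on toy data as follows. The network, the disease gene sets and the use of node degree and Jaccard distance are illustrative assumptions; the only link to the repository is that average-linkage clustering corresponds to UPGMA, the method named in 11_upgma.R.

```r
# Toy network: a star centred on gene A plus one extra edge; names are made up.
genes <- c("A", "B", "C", "D", "E")
adj <- matrix(0, 5, 5, dimnames = list(genes, genes))
adj["A", c("B", "C", "D", "E")] <- 1
adj["B", "C"] <- 1
adj <- pmax(adj, t(adj))  # symmetrise

# Hypothetical disease gene sets
disease_genes <- list(d1 = c("A", "B"), d2 = c("A", "C"), d3 = c("D", "E"))

# Within-disease view: node degree of disease genes versus the remaining genes
degree <- rowSums(adj)
mean_degree <- sapply(disease_genes, function(g) {
  c(disease = mean(degree[g]), other = mean(degree[setdiff(genes, g)]))
})

# Between-disease view: Jaccard distance between gene sets, then UPGMA
# (average-linkage hierarchical clustering)
jaccard_dist <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))
n <- length(disease_genes)
dmat <- matrix(0, n, n,
               dimnames = list(names(disease_genes), names(disease_genes)))
for (i in seq_len(n)) for (j in seq_len(n)) {
  dmat[i, j] <- jaccard_dist(disease_genes[[i]], disease_genes[[j]])
}
tree <- hclust(as.dist(dmat), method = "average")
```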

Performance

  1. Load configuration files

  2. Load dataset

  3. Load network data

  4. Build CV folds

  5. Define functions for prediction

  6. Define performance metrics

  7. For each combination of disease, input_type and fold

    1. Define train and validation
    2. Predict for every method using train
    3. Compute performance metrics
    4. Write to disk
  8. Plot metrics

  9. Build statistical models for comparing methods
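
The steps above can be sketched for a single toy disease as follows. Everything in this example is an assumption made for illustration: the data are simulated, the two methods (`centroid` and `random`) are placeholders rather than the methods compared in the project, the input_type dimension is omitted, and the linear model in the last step is a simple stand-in for the statistical models in the 23_, 43_ and 63_ scripts.

```r
set.seed(1)

# Steps 2-3: toy dataset with 30 genes, 9 of them known targets.
# x is a hypothetical per-gene feature standing in for network information.
genes <- paste0("g", 1:30)
labels <- setNames(rep(c(1, 0), c(9, 21)), genes)
x <- setNames(rnorm(30, mean = labels), genes)

# Step 4: stratified folds, repeated n_rep times
make_folds <- function(labels, n_folds, n_rep) {
  lapply(seq_len(n_rep), function(r) {
    fold <- integer(length(labels))
    for (cl in unique(labels)) {
      idx <- which(labels == cl)
      fold[idx] <- sample(rep_len(seq_len(n_folds), length(idx)))
    }
    setNames(fold, names(labels))
  })
}

# Step 5: placeholder methods; each sees only the training labels
methods <- list(
  centroid = function(train) -abs(x - mean(x[train == 1])),
  random = function(train) setNames(runif(length(train)), names(train))
)

# Step 6: AUROC through the rank-sum (Mann-Whitney) formula
auroc <- function(scores, truth) {
  r <- rank(scores)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Step 7: loop over repetitions, folds and methods
n_folds <- 3
folds <- make_folds(labels, n_folds = n_folds, n_rep = 25)
results <- list()
for (r in seq_along(folds)) for (k in seq_len(n_folds)) {
  val <- folds[[r]] == k
  train <- labels
  train[val] <- 0  # hide the validation labels from the methods
  for (m in names(methods)) {
    scores <- methods[[m]](train)
    results[[length(results) + 1]] <- data.frame(
      method = m, rep = r, fold = k,
      auroc = auroc(scores[val], labels[val])
    )
  }
}
results <- do.call(rbind, results)

# Write to disk
path_results <- file.path(tempdir(), "example_performance.csv")
write.csv(results, path_results, row.names = FALSE)

# Steps 8-9: summarise and compare the methods
summary_auroc <- aggregate(auroc ~ method, data = results, FUN = mean)
fit <- lm(auroc ~ method, data = results)
```

With 9 positives and 3 folds, every validation fold contains exactly 3 positives, so the AUROC is always defined in this toy setting.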

System requirements

Hardware

The runs have been executed on the following hardware from the UPC:

Code profiling

Running the script is barely possible with 16 GB of RAM. We recommend 32 GB to avoid swapping during memory spikes.

For reference, executing all the diseases under a single repeated CV scheme (25 repetitions, 3 folds per repetition) on eko takes one week. For comparison, sun is twice as fast. The code mixes serial and parallel execution because not all the methods run in parallel.

The computationally intensive code was also run on a TORQUE-based cluster, but the parallel R package (part of base R) was unable to clean up the child processes. This led to memory exhaustion, which made the approach infeasible. Alternatives to tackle this while keeping reproducibility might be added in the future.