
PostScript 84.6%
TeX 13.8%
R 1.5%
Mathematica 0.1%

Sergi Picart fd87822a79 Added example on how to generate complex-protein mapping		2020-03-29 11:46:07 +02:00
00_metadata	Added packrat summary table	2018-08-01 15:07:02 +02:00
00_rawdata	Added example on how to generate complex-protein mapping	2020-03-29 11:46:07 +02:00
01_data	Analysis 5.1 (preliminary)	2017-10-02 17:16:27 +02:00
02_performance	Analysis 5.1 (preliminary)	2017-10-02 17:16:27 +02:00
03_data	Analysis 5.3	2017-10-20 11:30:31 +02:00
03_performance	Analysis 5.3	2017-10-20 11:30:31 +02:00
04_topology	Run 5.3 bis	2017-10-21 12:08:51 +02:00
05_mashup	Added packrat and other files for reproducibility	2017-12-27 18:23:38 +01:00
10_data	All diseases + logistic models	2017-11-27 10:38:18 +01:00
11_topology	Disease clustering	2017-11-28 12:48:22 +01:00
12_performance	Predictions using genetic scores as well	2017-12-11 14:20:06 +01:00
13_complexes	quick fix	2017-12-21 18:16:31 +01:00
20_data	Added genetic scores histogram	2018-08-02 17:28:15 +02:00
21_topology	Exported data frames with topological properties	2018-03-29 12:56:15 +02:00
22_performance	CV runs on STRING	2018-02-06 12:17:38 +01:00
23_boxplots	Models on all networks	2018-02-09 15:24:01 +01:00
23_contrasts	Models on all networks	2018-02-09 15:24:01 +01:00
23_models	Models on all networks	2018-02-09 15:24:01 +01:00
40_data	Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now.	2018-03-27 18:47:57 +02:00
42_performance	CV runs OmniPath	2018-02-06 16:26:34 +01:00
43_boxplots	Models on all networks	2018-02-09 15:24:01 +01:00
43_contrasts	Models on all networks	2018-02-09 15:24:01 +01:00
43_models	Models on all networks	2018-02-09 15:24:01 +01:00
45_mashup	Generated dataset, network and features for OmniPath. Added biomaRt library	2018-01-30 16:06:18 +01:00
63_boxplots	Modified method ranking figure	2018-08-23 10:06:47 +02:00
63_models	Added plot on predictions and updated text files (probably emmeans vs lsmeans)	2018-08-01 11:00:10 +02:00
packrat	Finally installed the GGally dependence	2018-07-30 11:51:48 +02:00
.gitignore	Added packrat sources - experimental as they are large	2017-12-27 18:29:14 +01:00
.Renviron	Added missing dependency	2017-12-27 19:51:55 +01:00
.Rprofile	Added packrat and other files for reproducibility	2017-12-27 18:23:38 +01:00
00_packrat_table.R	Added packrat summary table	2018-08-01 15:07:02 +02:00
01_preprocessing.Rmd	Analysis 5.1 (preliminary)	2017-10-02 17:16:27 +02:00
02_diffusion_scores.Rmd	Analysis 5.1 (preliminary)	2017-10-02 17:16:27 +02:00
03_config.R	First descriptive statistics on complex data	2017-12-04 17:16:19 +01:00
03_multiple_disease.Rmd	Analysis 5.3	2017-10-20 11:30:31 +02:00
03_preprocessing.Rmd	Added scripts for analysing 4 diseases	2017-10-11 12:23:44 +02:00
04_positives_analysis.Rmd	Run 5.3 bis	2017-10-21 12:08:51 +02:00
05_mashup.m	All diseases + logistic models	2017-11-27 10:38:18 +01:00
05_mashup_features.Rmd	All diseases + logistic models	2017-11-27 10:38:18 +01:00
10_preprocessing.Rmd	Now genes are not filtered if no drugs or genetic association is known	2017-11-15 16:55:31 +01:00
11_positives_analysis.Rmd	Disease clustering	2017-11-28 12:48:22 +01:00
11_upgma.R	Disease clustering	2017-11-28 12:48:22 +01:00
12_multiple_disease.Rmd	Predictions using genetic scores as well	2017-12-11 14:20:06 +01:00
13_complexes.Rmd	Added simulated CV folds	2017-12-05 15:29:14 +01:00
13_pilot_cv_schemes.Rmd	quick fix	2017-12-21 18:16:31 +01:00
20_config.R	Added fold imbalance plot	2018-04-03 13:39:42 +02:00
20_preprocessing.Rmd	Added genetic scores histogram	2018-08-02 17:28:15 +02:00
21_positives_analysis.Rmd	Exported data frames with topological properties	2018-03-29 12:56:15 +02:00
22_performance.Rmd	CV runs on STRING	2018-02-06 12:17:38 +01:00
23_models.Rmd	Models on all networks	2018-02-09 15:24:01 +01:00
40_config.R	Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now.	2018-03-27 18:47:57 +02:00
40_preprocessing.Rmd	Added OT network stats & complex stats. Updated complex stats, empty complexes are ignored now.	2018-03-27 18:47:57 +02:00
42_performance.Rmd	CV runs OmniPath	2018-02-06 16:26:34 +01:00
43_models.Rmd	Models on all networks	2018-02-09 15:24:01 +01:00
45_mashup.m	Generated dataset, network and features for OmniPath. Added biomaRt library	2018-01-30 16:06:18 +01:00
60_abbreviations.R	Added topology analysis on STRING. Abbreviations are now in a config file 60_abbreviations.R	2018-03-28 11:48:45 +02:00
60_config.R	Added new boxplots by disease/method	2018-03-29 11:36:32 +02:00
60_palette25.txt	Added new boxplots by disease/method	2018-03-29 11:36:32 +02:00
63_models.Rmd	Modified method ranking figure	2018-08-23 10:06:47 +02:00
config.R	exploratory analysis	2017-09-18 19:43:15 +02:00
genease.Rproj	exploratory analysis	2017-09-18 19:43:15 +02:00
LICENSE	Create LICENSE	2018-10-23 17:51:20 +02:00
README.md	Update README	2018-10-09 16:24:13 +02:00

README.md

Introduction

The genedise project aims to find druggable genes for a specific disease based on previously tested targets. Whether these targets were successful is not the primary concern: the fact that there was enough evidence to try them is sufficient for our purposes. In this way, we aim to mimic the time-consuming task of proposing reasonable new targets.

The suggestion of new disease genes uses data from OpenTargets as seed gene lists and the STRING protein-protein interaction network to infer new genes.

The project is written almost entirely in R. Some MATLAB code was needed to include state-of-the-art approaches.

Structure

The files and directories of this project are prefixed with a number that indicates the chronological order of their execution. Scripts are stored in Rmd files, and their outputs are saved in folders sharing their prefix. The most relevant prefixes are:

2X_: analysis on the STRING network
4X_: analysis on the OmniPath network
6X_: plots and models combining both networks (depends on the execution of the 2X and 4X scripts)
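
As a small illustration of the prefix convention, the sketch below selects the Rmd scripts of one stage and lists them in execution order. The file names are copied from the repository listing; the helper `scripts_for_stage` is hypothetical and not part of the project.

```r
# File names copied from the repository listing.
files <- c("20_config.R", "20_preprocessing.Rmd", "21_positives_analysis.Rmd",
           "22_performance.Rmd", "23_models.Rmd", "40_config.R",
           "40_preprocessing.Rmd", "42_performance.Rmd", "43_models.Rmd",
           "63_models.Rmd")

# Hypothetical helper: keep the Rmd scripts whose name starts with a given
# leading digit, sorted so that they appear in execution order.
scripts_for_stage <- function(files, digit) {
  pattern <- paste0("^", digit, "[0-9]_.*\\.Rmd$")
  sort(grep(pattern, files, value = TRUE))
}

string_scripts <- scripts_for_stage(files, 2)    # STRING analysis
omnipath_scripts <- scripts_for_stage(files, 4)  # OmniPath analysis
print(string_scripts)
```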

Reproducibility

Metadata files

The output of sessionInfo() is always stored in the directory 00_metadata to keep track of the package versions.
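
A minimal sketch of how such a record could be written is shown below. The helper name, the file naming scheme and the use of a temporary directory in place of 00_metadata are assumptions made for illustration; the actual scripts may store the output differently.

```r
# Hypothetical helper: write the printed output of sessionInfo() to a text file.
save_session_info <- function(dir, name) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, paste0(name, "_sessionInfo.txt"))
  writeLines(capture.output(print(sessionInfo())), path)
  invisible(path)
}

# A temporary directory stands in for 00_metadata in this sketch.
out <- save_session_info(file.path(tempdir(), "00_metadata"), "22_performance")
head(readLines(out), 3)
```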

Configuration files

There are configuration files, such as 03_config.R, that contain a comprehensive set of parameters, paths and file names. Generally, these parameters are sourced instead of being hardcoded in the scripts.
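
The pattern can be sketched as follows. The file 99_config.R and every variable in it are hypothetical stand-ins; the real configuration files define their own names and values.

```r
# Write a small stand-in configuration file. Every name and value in it is
# hypothetical and only illustrates the pattern.
config_file <- file.path(tempdir(), "99_config.R")
writeLines(c(
  'dir_data <- "99_data"',
  'file_dataset <- file.path(dir_data, "dataset.RData")',
  'n_folds <- 3',
  'n_repetitions <- 25'
), config_file)

# Source the file into its own environment, which makes it explicit which
# names come from the configuration.
config <- new.env()
source(config_file, local = config)

config$file_dataset
```

Sourcing into a dedicated environment is a design choice of this sketch; a plain `source(config_file)` into the global environment works as well.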

Package management

Package versions are managed with packrat to ease portability between machines.

External files

Almost all the files in the project are included in the git repository at the moment. Exceptions:

STRING database files
Network kernel(s)

The paths to these files (on Sergi's machines) can be found in the config files.

Other

There are several set.seed calls throughout the code. Intermediate results are saved when the space required is not prohibitive.
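
The two ideas can be combined as in the sketch below: a fixed seed makes a random step repeatable, and a stored intermediate result avoids recomputing it. The helper `cached` is hypothetical and not taken from the project.

```r
# Hypothetical helper: compute a result once and reuse it from disk afterwards.
cached <- function(path, compute) {
  if (file.exists(path)) return(readRDS(path))
  result <- compute()
  saveRDS(result, path)
  result
}

cache_file <- tempfile(fileext = ".rds")

set.seed(1)
first <- cached(cache_file, function() sample(100, 5))

# The second call reads the stored value; the function is not evaluated again.
second <- cached(cache_file, function() stop("should not be recomputed"))
identical(first, second)

# Resetting the seed reproduces the same random draw without the cache.
set.seed(1)
redrawn <- sample(100, 5)
identical(first, redrawn)
```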

Workflow

Data preprocessing

Check OpenTargets data sanity
Choose network: compromise between coverage and size
Compute and store graph kernel on chosen network
Save cleaned data, mapped to the network of choice
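
To make the kernel step concrete, here is a minimal sketch on a toy network using base R only. The README does not state which graph kernel is used, so the regularised Laplacian kernel, the value of `sigma2` and the gene names are all assumptions of this example.

```r
# Toy undirected network as an edge list. Gene names are made up.
edges <- matrix(c("g1", "g2",
                  "g2", "g3",
                  "g3", "g4",
                  "g4", "g5",
                  "g2", "g5"), ncol = 2, byrow = TRUE)
genes <- sort(unique(as.vector(edges)))

# Symmetric adjacency matrix, filled by indexing with gene names.
A <- matrix(0, length(genes), length(genes), dimnames = list(genes, genes))
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Regularised Laplacian kernel K = (I + sigma2 * L)^-1, with L = D - A.
# The kernel choice and the value of sigma2 are assumptions of this sketch.
sigma2 <- 1
L <- diag(rowSums(A)) - A
K <- solve(diag(length(genes)) + sigma2 * L)
dimnames(K) <- list(genes, genes)

# Store the kernel once so that later scripts can load it.
kernel_file <- tempfile(fileext = ".rds")
saveRDS(K, kernel_file)

# Diffusion scores for a seed list: sum of the kernel columns of the seeds.
seeds <- c("g1", "g2")
scores <- rowSums(K[, seeds, drop = FALSE])
sort(scores, decreasing = TRUE)
```

Inverting a dense matrix scales cubically with the number of nodes, which is one reason to compute the kernel once and store it rather than rebuilding it in every script.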

Topology analysis

Characterisation of disease genes in terms of network properties
Within-disease study
Between-disease study
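
As a rough illustration of a within-disease comparison, the sketch below contrasts the node degree of labelled genes with that of the remaining genes. The random network, the random labels and the choice of degree as the network property are assumptions; the README does not list the properties that were analysed.

```r
set.seed(42)
n <- 200

# Random symmetric adjacency matrix standing in for a real network.
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.05)
A <- A + t(A)

degree <- rowSums(A)

# Hypothetical labels: 20 nodes are flagged as disease genes at random.
is_disease <- seq_len(n) %in% sample(n, 20)

# Median degree of disease genes against all remaining genes.
tapply(degree, is_disease, median)

# Rank-based test of a degree difference between the two groups.
# Tied degrees can trigger a warning about exact p-values.
p_value <- wilcox.test(degree[is_disease], degree[!is_disease])$p.value
p_value
```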

Performance

Load configuration files
Load dataset
Load network data
Build CV folds
Define functions for prediction
Define performance metrics
For each disease, input_type, fold:
1. Define train and validation
2. Predict for every method using train
3. Compute performance metrics
4. Write to disk
Plot metrics
Build statistical models for comparing methods
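
The loop above can be sketched in base R as follows, for a single disease and a single input type. Everything here is a toy stand-in: the ring network, the kernel, the two methods, the AUROC metric and the helper names are assumptions, and only 5 repetitions are run to keep the example fast.

```r
set.seed(1)

# Toy data: 60 genes on a ring network, 12 of them labelled as positives.
n <- 60
genes <- paste0("g", seq_len(n))
A <- matrix(0, n, n, dimnames = list(genes, genes))
idx <- cbind(seq_len(n), c(2:n, 1))
A[idx] <- 1
A[idx[, 2:1]] <- 1
K <- solve(diag(n) + (diag(rowSums(A)) - A))  # regularised Laplacian kernel
dimnames(K) <- list(genes, genes)
positives <- genes[1:12]

# Build CV folds, stratified so that every fold holds positives and negatives.
make_folds <- function(genes, positives, n_folds) {
  is_pos <- genes %in% positives
  fold <- integer(length(genes))
  fold[is_pos] <- sample(rep_len(seq_len(n_folds), sum(is_pos)))
  fold[!is_pos] <- sample(rep_len(seq_len(n_folds), sum(!is_pos)))
  setNames(fold, genes)
}

# Prediction functions: kernel diffusion from the seeds and a random baseline.
methods <- list(
  diffusion = function(K, seeds) rowSums(K[, seeds, drop = FALSE]),
  random    = function(K, seeds) setNames(runif(nrow(K)), rownames(K))
)

# Performance metric: AUROC through the rank-sum formula.
auroc <- function(scores, labels) {
  r <- rank(scores)
  n1 <- sum(labels)
  n0 <- sum(!labels)
  (sum(r[labels]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

n_folds <- 3
n_repetitions <- 5
results <- list()
for (r_id in seq_len(n_repetitions)) {
  fold <- make_folds(genes, positives, n_folds)
  for (k in seq_len(n_folds)) {
    # 1. Define train and validation
    validation <- names(fold)[fold == k]
    train_seeds <- setdiff(positives, validation)
    for (m in names(methods)) {
      # 2. Predict using the training seeds only
      scores <- methods[[m]](K, train_seeds)
      # 3. Compute the metric on the held-out genes
      value <- auroc(scores[validation], validation %in% positives)
      results[[length(results) + 1]] <- data.frame(
        repetition = r_id, fold = k, method = m, auroc = value)
    }
  }
}
results <- do.call(rbind, results)

# 4. Write to disk
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)

aggregate(auroc ~ method, data = results, FUN = mean)
```

The full workflow would wrap this in further loops over diseases and input types, and would replace the final summary with the plots and statistical models.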

System requirements

Hardware

The runs have been executed on the following hardware from the UPC:

eko:
- 12 threads (Intel(R) Xeon(R) CPU E7310@1.60GHz)
- 32GB RAM
sun:
- 32 threads (Intel(R) Xeon(R) CPU E5-2450@2.10GHz)
- 32GB RAM

Code profiling

Running the scripts is barely possible with 16GB of RAM. We recommend using 32GB so that memory spikes do not cause swapping.

For reference, executing all the diseases under a single repeated CV scheme (25 repetitions, 3 folds per repetition) on eko takes one week. On sun, the same run is about twice as fast. The code mixes serial and parallel execution because not all the methods run in parallel.

The computationally intensive code was also run on a Torque-based cluster, but the parallel R package (part of base R) was unable to clean up its child processes. This led to memory exhaustion, which made the cluster runs infeasible. Alternatives that tackle this while keeping reproducibility might be added in the future.
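
One general pattern for keeping worker processes under control with the parallel package is sketched below. It has not been verified against the Torque problem described above, and the README does not say which functions of the package the project uses, so this is only an illustration. The wrapper `run_on_cluster` is hypothetical.

```r
library(parallel)

# Hypothetical wrapper: run a function over a list on a local cluster and
# always stop the workers, even if a task fails.
run_on_cluster <- function(x, fun, n_workers = 2) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)
  clusterSetRNGStream(cl, iseed = 1)  # reproducible random streams per worker
  parLapply(cl, x, fun)
}

squares <- run_on_cluster(1:4, function(i) i^2)
unlist(squares)
```

Registering `stopCluster` with `on.exit` ties the lifetime of the workers to the function call, and `clusterSetRNGStream` keeps random numbers reproducible across workers.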