Before we meet: Installation & readings

Installation & readings

RStudio is an IDE (Integrated Development Environment) that makes it easier to write and execute R code.

R is a programming language that is widely used for data analysis and statistics. We will introduce you to its usage. Time is short, so please do the following before we meet:

panes

Note: Should you want to know more, a guided tour of RStudio is available.

install.packages("tidyverse", dependencies = TRUE)
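After the installation finishes, it can be worth confirming that R can actually find the package. The following is a minimal sketch, not part of the official instructions; `is_installed` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: TRUE if a package can be found, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("tidyverse")) {
  message("tidyverse found, version ", as.character(packageVersion("tidyverse")))
} else {
  message("tidyverse not found; try running install.packages() again")
}
```

If the second message appears, reading the output of `install.packages()` for errors is usually a good first step.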

The following can be more challenging; try it if you have enough time:

You will need to create an SSH key. (The start is the most difficult! This is the last step, but it is worth it!)

PLEASE LET ME KNOW IF YOU DID NOT MANAGE THIS BEFORE WE MEET. We can, for example, have a look during an online meeting.

Why do we need to do this? It’s about reproducible science and open data science

This part will be discussed in the course.

The way we will work will help us to do reproducible research. It will help you (and us) to organize the data analysis work, and document what you have done, including the reasons behind your choices.

Three months from now YOU might not remember the reasoning behind your analyses or all the steps you took. Documenting what you are doing while you are doing it is very good practice.

This will save you time and struggles. What you have done will be essential information for publication and manuscript revision.

Moreover, working this way will allow you to start setting up your analyses BEFORE all the data are collected. You will be able to re-run all your code using updated data. This helps you to be proactive.
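As an illustration of re-running the same code on updated data, here is a minimal sketch. The data, the column name `weight`, and the function `summarise_weights` are all made up for this example and are not part of the course material.

```r
# Hypothetical analysis step, written once as a function
summarise_weights <- function(data) {
  data.frame(
    n = nrow(data),
    mean_weight = mean(data$weight, na.rm = TRUE)
  )
}

# Made-up pilot data, available before the full collection is finished
pilot <- data.frame(weight = c(10, 12, 14))
summarise_weights(pilot)

# Later: more (made-up) observations arrive, and the same code is simply re-run
updated <- rbind(pilot, data.frame(weight = c(16, 18)))
summarise_weights(updated)
```

Because the analysis lives in code rather than in manual steps, updating the results only requires pointing the same function at the new data.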

What are the requirements of reproducible research? The following article mentions:
10 Rules of reproducible research

  1. For Every Result, Keep Track of How It Was Produced
  2. Avoid Manual Data Manipulation Steps
  3. Archive the Exact Versions of All External Programs Used
  4. Version Control: Use it for all Customized Scripts
  5. Record All Intermediate Results, When Possible in Standardized Formats
  6. For Analyses That Include Randomness, Note Underlying Random Seeds
  7. Always Store Raw Data behind Plots
  8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
  9. Connect Textual Statements to Underlying Results
  10. Provide Public Access to Scripts, Runs, and Results
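A few of these rules map directly onto short R commands. The sketch below is only an illustration, using base R functions; the seed value, the variable names, and the output file location are arbitrary choices made for this example.

```r
# Rule 6: fix the random seed so that results involving randomness can be reproduced
set.seed(42)            # 42 is an arbitrary choice
first_draw <- rnorm(5)

set.seed(42)            # same seed again
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # the two draws are the same

# Rule 3: record the R version and the packages in use
sessionInfo()

# Rule 7: store the data behind a plot next to the plot itself
plot_data <- data.frame(index = seq_along(first_draw), value = first_draw)
out_file <- file.path(tempdir(), "plot_data.csv")  # hypothetical location
write.csv(plot_data, out_file, row.names = FALSE)
```

In a real project the file would go into the project folder rather than a temporary directory, so that it is kept together with the code.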

We will see that using R and RStudio makes it relatively easy to follow all the rules except 4 and 10. Those two require a version control system like Git and, possibly, an associated online platform like GitHub that allows you to store your code on the web. GitHub in turn can be used during publication and can facilitate the creation of a DOI via integration with Zenodo. This then allows people to cite your data analysis work and code!

WARNING: NEVER PUT YOUR DATA ON GITHUB, only the code to process the data

Additional resources you can read if you wish

We might not have time to go into detail with Git and GitHub, but we will try to.

If you feel like it, you can learn Git for version control (it can be used from the command line, i.e. the Terminal pane that is available in RStudio):
- Git-Novice is a good lesson to start with.

Preliminary training program

  • GitHub and R - get started with a project (understand the basic principles)
  • data wrangling
  • data exploration and visualization (ensure data quality, what do I have)
  • possibly … going further


