Chap 2. Basic R and github with Rstudio
1 Basic start up
1.1 Starting point
The detailed git setup instructions must have been followed:
- a github repository learning_R has been made in your account
- the github repository has been cloned locally on your PC
Now we will learn how to use git in Rstudio to back up your code to the cloud server: github. The git version control system can be used for much more, but learning how to back up your work in the cloud is enough for a start.
1.2 Create/Open a project in Rstudio
External resource: see also Project Management with RStudio
- Open Rstudio
- Browse to the directory where you cloned the learning_R git repository:
Note: to re-open the project afterwards you will need to do:
1.3 Git used to backup notes and code via Rstudio
DiagrammeR::mermaid(
"graph TB
A[R project directory] --git add--> B[Staging area: preparation]
B --git commit --> C[Local Repository : version control locally on PC]
C --push --> D[copy code to remote Repository : github]
D --pull --> A
")
Fig. A simple workflow to back up your code and notes to the github cloud server
In the git tab in one of the Rstudio panels:
The status (git status) allows you to check whether files have been modified, and whether they are under version control (tracked) or not.
- yellow ? : the file is not tracked = not currently version controlled and not ignored
- red D : deleted, the file has been deleted (or renamed outside of git)
- pink R : renamed, the file has been renamed
- blue M : the file has been modified
Staging the files means preparing them to be saved in the version control system.
- You can stage one or several files.
- It is usually best to stage several files together only when they form one logical block of work.
- You select (tick the box of) the file(s) to stage and then you can commit.
You need to write a commit message that explains what you have done since your previous commit.
Example: “Summary statistics for AMR detection in E.coli added.”
This creates a “saving” point in your version control system.
- Then you need to push: this sends the changes to the remote repository (github / cloud)
It is important to do this process often, because you create saving points that you can go back to, should you for some reason lose your code on your PC. Each time you commit, the commit message and the state of your files are registered in the version control system. The git log (you can look at it on github) can help you find the files again at the saving point you want.
If you work on different computers, or with different people, you might need to pull changes. Pull takes the changes from the cloud repository (github) and puts them in the local repository. Note that we likely won't have time to do this during this course, but it is good to know that it is possible if needed.
You can learn how to do that using the git for novice lesson
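The buttons of the git tab correspond to git commands. As a rough sketch only, the same steps could be run from the R console as shown here. It assumes that git is installed and that the working directory is your project; `run_git` is a hypothetical helper written for this illustration (it is not part of Rstudio or git), and the file name and commit message are example values.

```r
# Hypothetical helper: run one git command from R in the current
# working directory and return what git printed.
run_git <- function(...) {
  system2("git", args = c(...), stdout = TRUE, stderr = TRUE)
}

# The steps behind the Rstudio buttons. They are commented out because
# they change your repository; remove the leading "# " to run them.
# run_git("status")                                  # which files changed?
# run_git("add", "notes/2024-10-09_learningR.Rmd")   # stage (tick the box)
# run_git("commit", "-m", shQuote("Add notes from the first session"))
# run_git("push")                                    # send to github
# run_git("pull")                                    # fetch from github
```

In this course the Rstudio buttons are enough; the sketch is only meant to show which command each button stands for.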
1.4 Exercise to learn how to use git in Rstudio: modify .gitignore and push the changes to github
Note:
- We won't version control raw and intermediary data, because we do not want them to end up in the public repository (among other things).
- Raw data need to be backed up on their own.
- The raw data should NEVER be modified manually.
- It should be possible to recover the processed data (e.g. outliers removed, quality ensured) solely by re-running the pre-processing code on the raw data.
- Both the pre-processing code and the code that is further used for your analyses should be version controlled.
Modify or create a .gitignore file in your project directory.
A .gitignore file is a file that tells git which files the version control system should ignore (e.g. any files that contain raw data).
**/*.xlsx
**/*.csv
data/
results/
**.Rprofile
**.Rhistory
**.Rproj
**.sqlite
You will now be able to version control your code and notes. If you do not want a file to be tracked, you need to add its path (relative to the root of the git repository) to the .gitignore file.
You can look at the ignoring things lesson to go further on how to ignore files.
2 Using a notebook in Rstudio: anatomy, text, code and rendering
A Notebook (here Rmarkdown) is a way to write code and text (notes, publications, reports) in the same document.
The text can be easily formatted using the “Markdown syntax”. See those lessons to learn about Markdown syntax. For now, it will be sufficient to use this cheatsheet: have it open in a web browser and use it!
Rendering the document (knit button) lets you create different types of documents (html, pdf, word, slides).
Create an Rmarkdown document:
- Save it in your project directory, in a sub-directory called notes (as you will use it to take notes during this course). Choose e.g. date_learningR.Rmd as the file name, where date is the ISO date, e.g. 2024-10-09. Do not use spaces in the file name, use underscores (_) instead.
NB: the directory can be created at saving time: right-click with the mouse, choose new directory and write: notes
2.1 Exercise Using a R markdown document
Here is what we will now do:
PS:
- Note the importance of formatting in the YAML header
- We wrote the tutorial for the course using Markdown
- You can download the Rmd document (top right of the html file, under Code) and look at the code of this file in Rstudio to see how it was done.
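To make the anatomy of a notebook concrete, here is a minimal sketch that writes a very small Rmarkdown file from R. The title, date, text and file name are example values chosen for this illustration, and the file is written to a temporary directory so that nothing in your project is touched. The part between the two `---` lines is the YAML header: the spacing after each colon matters there.

```r
fence <- strrep("`", 3)  # three backticks open and close a code chunk

rmd_lines <- c(
  "---",                                # start of the YAML header
  'title: "Learning R notes"',
  'date: "2024-10-09"',
  "output: html_document",
  "---",                                # end of the YAML header
  "",
  "## First notes",                     # a Markdown heading
  "",
  "Text formatted with **Markdown**.",  # normal text, with bold
  "",
  paste0(fence, "{r}"),                 # start of an R code chunk
  'nchar("Hello World!")',
  fence                                 # end of the code chunk
)

# Write the file (ISO date, no spaces in the file name)
rmd_file <- file.path(tempdir(), "2024-10-09_learningR.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering can also be done with code instead of the knit button
# (needs the rmarkdown package, so it is left commented out here):
# rmarkdown::render(rmd_file, output_format = "word_document")
```

Changing `output: html_document` in the header, or the `output_format` argument, is what selects the type of document that is produced.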
We will do a series of small exercises to get you familiar with Rstudio and how code is written in R. It is important that you stop us if you do not understand what the code is doing.
See also Help in R studio
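# my_message is assumed to have been created earlier; example value:
my_message <- "Hello World!"
length(my_message)            # number of elements in the vector
nchar(my_message)             # number of characters in the string
nb_char <- nchar(my_message)  # store the result in an object
nb_char
nb_char + 10
nb_char * nchar(my_message)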
Below we show you a small introduction to:
- assignment of objects in R
- types of objects in R
- character vectors
- lists and sub-setting of lists using indexes
- R numbering starts at 1.
strsplit(my_message, split = "")
split_test <- strsplit(my_message, split = "")
split_test
typeof(split_test)
split_test[1]
split_test[[1]]
split_test[[1]][1]
split_test[[1]][1:3]
- transforming a string into a list of characters (types conversion)
- reassignment of objects: replacement in memory
unlist(split_test)
split_test <- unlist(split_test)
split_test
split_test[1]
typeof(split_test)
length(split_test)
- manipulations with functions that allow you to transform objects
- several operations are possible one after another
- the priority of brackets is respected, as in mathematics.
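The points above can be illustrated with a short sketch. The value of `my_message` is an example chosen for this illustration, and the object names are only there to keep the results.

```r
my_message <- "Hello World!"   # example value, assumed for this sketch

# Several operations one after another: the innermost call runs first
# (split into characters, then flatten the list, then count).
n_chars <- length(unlist(strsplit(my_message, split = "")))
n_chars            # 12

# Brackets change the order of a calculation, as in mathematics
no_brackets   <- 2 + 3 * 4     # multiplication first: 14
with_brackets <- (2 + 3) * 4   # brackets first: 20
no_brackets
with_brackets
```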
Important to note the difference between a list and a vector.
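A small sketch of that difference, using made-up example objects: all elements of a vector have the same type, while a list can hold elements of different types, and single versus double square brackets behave differently on a list.

```r
my_vector <- c("a", "b", "c")    # vector: all elements have the same type
my_list   <- list("a", 1, TRUE)  # list: elements can have different types

typeof(my_vector)   # "character"
typeof(my_list)     # "list"

my_vector[1]        # a character vector of length 1
my_list[1]          # still a list (of length 1)
my_list[[1]]        # the element itself: "a"

# Mixing types in a vector converts everything to one common type
mixed <- c("a", 1, TRUE)
mixed               # "a" "1" "TRUE"
```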
2.2 Tricks
Some tricks to make your life easier.
Do not worry if this goes fast. We will come back to that during the course. During the course you will have to pay attention to the keywords in bold, and figure out what they mean. Those words are useful when you need to search the web for how to do things further.
What you have to retain from here is that you need to be critical and check your results, to be sure the code is doing what you want it to do.
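One possible way to check a result is to compare it with something you already know. The sketch below uses an example message and checks that splitting it into characters did not lose or add anything; the object names are only illustrative.

```r
my_message <- "Hello World!"   # example value, assumed for this sketch
split_chars <- unlist(strsplit(my_message, split = ""))

# Check 1: there should be as many pieces as characters in the message
same_count <- length(split_chars) == nchar(my_message)
same_count

# Check 2: pasting the pieces back together should give the message again
same_text <- paste(split_chars, collapse = "") == my_message
same_text

# stopifnot() stops with an error as soon as one check is not TRUE
stopifnot(same_count, same_text)
```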
2.3 Do not forget
3 Installing packages
Packages are a way to extend the functionality of R.
# A comment
# Another way to get help on a function
?install.packages
help("install.packages")
# F1 on : install.packages()
- We can install the here package. It is a simple package that makes it easy to refer to files in the project by their path. It is also great for compatibility between Linux and Windows based systems.
library() is a function that loads a package into memory and makes the functions it contains available to you in the current R session.
The function here() gives the path of the project.
NB: You can also install packages using the Rstudio interface, BUT it is better to keep a trace of what you have installed, so please do that using code.
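A minimal sketch of these steps, under a few assumptions: the installation line is commented out so that the package is not reinstalled each time the code is run, the sub-directory and file name are example values that do not need to exist, and the `else` branch is only a base R fallback added so that the sketch also runs where here is not installed.

```r
# Install once per computer (remove the leading "# " to run it)
# install.packages("here")

if (requireNamespace("here", quietly = TRUE)) {
  library(here)    # load the package for the current R session
  print(here())    # path of the project
  # Build a path that works on both Linux and Windows
  notes_file <- here("notes", "2024-10-09_learningR.Rmd")
} else {
  # Fallback with base R when the here package is not installed
  notes_file <- file.path(getwd(), "notes", "2024-10-09_learningR.Rmd")
}
notes_file
```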
You can call functions from packages that are not loaded to memory if they are installed on your system. This can be convenient if you only want to use one function, this avoids cluttering the memory of your computer.
Calling functions while specifying the package can also be useful if you have loaded packages that have functions with the same name. It is generally good practice to do so (but it takes more time to write).
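The `package::function()` form can be sketched with packages that are always installed with R, so nothing needs to be installed for this example; the numbers are made-up example values.

```r
values <- c(4, 8, 15, 16, 23, 42)   # example values

# Call a function by naming its package, without library()
middle <- stats::median(values)
middle                               # 15.5

first_three <- utils::head(values, 3)
first_three                          # 4 8 15
```

A well-known case of two functions with the same name is `stats::filter()` and `dplyr::filter()`: writing the package name in front makes it clear which one is meant.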
3.1 Packages and different ways to do the same thing
There are many packages in R, and there are usually many ways to do the same thing. Some functions from different packages are also very similar.
library(dplyr)
# both functions allow to have a look at the structure of your data
glimpse(my_message)
dplyr::glimpse(my_message)
str(my_message)
library(stringr)
# both functions allow to split a string
str_split_1(my_message, pattern = "")
#stringr::str_split(my_message, pattern = "")
unlist(strsplit(my_message, split = ""))
#base::unlist(base::strsplit(my_message, split = ""))
3.2 Functions are objects
In packages, functions that do complex things are often built using several simpler functions.
Here we show that there are several ways to obtain the same results.
my_strsplit1 <- function(char_var){
temp <- strsplit(char_var, split = "")
return(unlist(temp))
}
my_strsplit1(my_message)
A different way to write it:
# return() is optional: a function returns the value of its last expression
my_strsplit2 <- function(char_var){
unlist(
strsplit(char_var, split = "")
)
}
my_strsplit2(my_message)
Using Pipes …
my_strsplit3 <- function(char_var){
# native R pipe (available from R 4.1)
strsplit(char_var, split = "") |>
unlist()
# with |> the brackets () after unlist are required
}
my_strsplit3(my_message)
magrittr, which is part of the tidyverse (a collection of packages that work well together), offers another type of pipe, which is the one that you will see most commonly used.
library(magrittr) # provides the %>% pipe
my_strsplit4 <- function(char_var){
# magrittr pipe
strsplit(char_var, split = "") %>%
unlist
# with %>% () can be omitted (though not advised)
}
my_strsplit4(my_message)
What you need to retain from here is:
- There are different ways to do the same things. R syntax is flexible and not always homogeneous.
- Do not learn a syntax by heart: understand it, and check in the help when in doubt.
- There are several ways to obtain the result you want. While some are prettier or more efficient than others, the most important thing is that the code does what you want it to do.
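When two ways of writing the code are supposed to give the same result, this can be checked rather than assumed. A minimal sketch with an example message, using only base R (the native pipe needs R 4.1 or newer):

```r
my_message <- "Hello World!"   # example value, assumed for this sketch

# Way 1: nested function calls
nested <- unlist(strsplit(my_message, split = ""))

# Way 2: the same steps written with the native pipe
piped <- strsplit(my_message, split = "") |> unlist()

# identical() returns TRUE only if both objects are exactly the same
same_result <- identical(nested, piped)
same_result
```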
3.3 Do not forget
3.4 Exercise: Printing your code in a word document
Unhide by clicking on the Code button for hints:
4 Summary
This was a lot for today. You got a first overview of what can be done and why it can be worth using this system. Ultimately, this will help you be more efficient and do reproducible research.
Now we will actually use some of those things and repeat them a lot to do data analysis.
5 When you are ready to go further
With Git (lessons) :
- Git novice lesson: teaches you how to use git on the command line in a terminal, how to use version control to recover lost data or to undo changes that broke your code, and introduces you to using the github remote repository through the graphical interface.
- Introduction to Open Data Science with R particularly the lesson 8 https://carpentries-incubator.github.io/open-science-with-r/08-git/index.html
- Git and Rstudio (short)
Using notebooks:
Quarto: a new system that is very similar to Rmarkdown. It is a good alternative; however, it might still be a bit young and workarounds are sometimes necessary.