We begin by looking at the Data Analysis workflow. The version above is a concept created by Hadley Wickham and Adapted from Bradley Boehmke.
The diagram shows the natural flow of how we work with data and perform research. We will begin to explore what this means as we continue.
The first steps we take in any Data Analysis is Data Wrangling. Before we can do any kind of analysis we need to be able to collect our data. Sometimes this comes in from one source but many times this comes from multiple data sources.
Once we have this data we find that very rarely is it ever in a useful form. In fact Dasau and Johnson suggest that this data preparation of cleaning may take up to 80% of the time.
This is where R has great power and by the end of this course you will be able to work with multiple data sources and wrangle data like a pro.
When it comes to importing your data R is very powerful. R can grab data from many courses including
Tidying Data is the process in making data useful. In this concept we
have ecah column of data represent a variable and each row of data
represents a single observation. This format is quite useful for data
analysis. In this course we will rely heavily on the tidyr
package.
Once we have data into R and begin to tidy the data we usually need to transform multiple aspects of the data. R has many tools that allow a user to manipulate and transform data.
R is one of the most capable languages to explore and analyze data. With over 10,000 packages it can be hard to find models or plots that do not already have multiple functions in R.
There are multiple ways to vizualize data in R. The base graphics are easy to use and outperform Stata, SAS and SPSS. In this course we will focus on using the ggplot2 package. This package is actually a language for grahpics and once a user becomes proficient you can create grahs like the one shown below which is created by Harvard Institute for Quantitative Social Science. The original plot came from the economist:
Then below can be created by ggplot without any outside graphics software.
Modeling Data ————-
Once data has been vizualized it is important to model the data. R can handle anything from a simple t-test to the working with data that is overa terrabyte in size.
In every field it is key to be able to communicate what we learn and publish this work so that it can be beneficial to others. It is also important that statisticians can collaborate with others as well. Rstudio provides many tools to do these things:
RStudio
connects easily to:
shiny
provides an interactive data visualization and java
script environment.plotly
allows for interactive graphics for webpages.RStudio
can make webpages, books, slides, and many other documents
that can help relay data. (In fact this course was entirely built in
RStudio using Github).