Introduction to R
Installing R, Packages in R, Data Types in R
Let's start the course!
But before we do, let us address a few questions.
1. What is data science?
2. Why data science?
3. Why R for data science?
Data Science
Well, the name itself gives you an idea, but then again, everything must have a complex definition like this:
"Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems."
In simpler words, it is a field that aims to extract meaning from vast amounts of complex data. Every day, tons and tons of data are generated by sensors, websites, social media, e-commerce and other sectors. A larger volume of data may give you better averages, but it also brings a lot of noise, i.e. useless data. Extracting useful information to enable decision making is a data scientist's job.
Now, there are several fields related to data science, such as data mining (also known as knowledge discovery in databases, or KDD), predictive analysis, statistical inference and machine learning. All of these are commonly grouped under the umbrella of Big Data.
To get an idea of how versatile the field and its applications are, go through the linked article, Stock Market Prediction with R, and watch the video below.
This leads to the next question: why data science?
Normally this is where websites tell you that data science produces valuable insights or provides statistical inferences. You will get to know these things as you proceed with the course. What we're going to give you instead is a very practical look at the paradigm shift that has occurred in the industry, and we will use it to answer the question of why data science.
According to the New York Times, data science "promises to revolutionize industries from business to government, healthcare to academia". Have a look at this: it is estimated that by 2018, 4 million to 5 million jobs in the United States will require data analysis skills, and a recent study from the McKinsey Global Institute found that "a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge". So let's sum up this answer in two words: DEMAND and OPPORTUNITY.
Now, there are a number of software tools available today that data scientists use for data analytics, such as R, Python and SAS. As we said, we will be using R as the medium of instruction, and there are a couple of important reasons for it:
- It is free (unlike SAS)
- It has a comprehensive set of packages
  - Data Access
  - Data Cleaning
  - Analysis
  - Data Reporting
- It has one of the best development environments (www.rstudio.com)
- It has an amazing ecosystem for developers
- Packages are easy to install and "play nicely together"
PACKAGES IN R
We'll be taking an unconventional path here and telling you what 'packages' in R are before we talk about programming in R. This comes from experience: after running a number of analytics sessions, we believe packages need to be discussed before entering the programming world of R. So, analysts, take out your pens and notepads. This is the point where you start scribbling.
Packages are something you'll be using in almost every piece of code written in R. They are collections of R functions, data and compiled code in a well-defined format. The directory where packages are stored is called the 'library'. R already comes with a standard set of packages; others are available for download and installation. The standard commands to download, install and load a package in R are as follows.
> install.packages("package_name")   # download and install the package
> library("package_name")            # load the package in the current R session
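The commands above can be tried without downloading anything. The sketch below is only an illustration: it uses the stats package, which ships with R, and a placeholder name, "somePackage", that is an assumption and not a real recommendation.

```r
# Packages that ship with R can be loaded without installing anything.
library(stats)

# requireNamespace() reports whether a package is installed,
# without attaching it to the session.
has_stats <- requireNamespace("stats", quietly = TRUE)

# A common pattern: only ask for an install when the package is missing.
# "somePackage" is a placeholder name, so nothing is installed here.
pkg <- "somePackage"
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Package '", pkg, "' is not installed; ",
          "run install.packages(\"", pkg, "\") to get it.")
}

# .libPaths() lists the library directories where packages are stored.
lib_dirs <- .libPaths()
print(lib_dirs)
```

Checking before installing avoids downloading a package again every time a script runs.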
We know you're having doubts at this point about how exactly packages are useful in R. Do not worry: you'll get to know more about them as we start using them in our code. And just so you know, by the end of this comprehensive R course you'll be surprised to find that you are well capable of creating your own packages in R. Now won't that be fun? People will be downloading your contributions to a programming language to make their coding easier!
For the complete list of packages, see CRAN Packages by Name.
Another useful website to visit if you want to know about R packages is R Documentation.
Some interesting packages in R:
1. forecast
Time series analysis is incomplete without this package. It has helped data scientists analyze time series data and make reliable predictions, and it lets us fit time series models such as ARIMA, ARMA and AR to our data.
2. plyr
The functions in this package are used as a substitute for the base R functions that split data, apply a function to each piece and combine the results, such as split() and the apply() family.
3. ggplot2
The perfect package for multivariate visualizations.
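To make the split, apply and combine idea behind plyr concrete, here is a minimal sketch that uses only base R and the built-in mtcars data set, so nothing needs to be installed. The choice of data set and columns is ours, purely for illustration.

```r
# Split: break the mpg column into groups by number of cylinders.
groups <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute the mean of each group.
group_means <- sapply(groups, mean)

# Combine: collect the results in one data frame.
summary_df <- data.frame(cyl = names(group_means),
                         mean_mpg = as.numeric(group_means))
print(summary_df)
```

plyr wraps this three-step pattern into single function calls, which is why it is described as a substitute for these base R functions.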
Let us explore the list based on the number of downloads!
1. Rcpp: Seamless R and C++ Integration (693288 downloads, 3.2/5 by 10 users)
2. ggplot2: An Implementation of the Grammar of Graphics (598484 downloads, 4.0/5 by 82 users)
3. stringr: Simple, Consistent Wrappers for Common String Operations (543434 downloads, 4.1/5 by 18 users)
4. plyr: Tools for Splitting, Applying and Combining Data (523220 downloads, 3.8/5 by 65 users)
5. digest: Create Cryptographic Hash Digests of R Objects (521344 downloads)
6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package (483065 downloads, 4.1/5 by 18 users)
7. colorspace: Color Space Manipulation (476304 downloads, 4.0/5 by 2 users)
8. RColorBrewer: ColorBrewer Palettes (453858 downloads, 4.0/5 by 17 users)
9. manipulate: Interactive Plots for RStudio (395232 downloads)
10. scales: Scale Functions for Visualization (394389 downloads)
Source: KDnuggets
Other interesting packages include RCurl (functions to compose general HTTP requests, plus convenient functions to fetch URLs, get and post forms, etc.) and bitops (functions for bitwise operations on integer vectors).
That's all about packages for now.
The next topic we'll be discussing is data types in R.
Everything we use in R is an object of some kind, as in many other programming languages, and these objects can store different types of data.
The objects we'll be using are:
- Lists
- Vectors
- Matrices
- Arrays
- Factors
- Data Frames
The basic data types they can store are:
- Character
- Numeric
- Logical (TRUE/FALSE)
- Integer
- Complex
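A quick way to see these basic data types is to ask R for the class of a value. The sketch below is only an illustration; the example values are arbitrary.

```r
# One value of each basic data type.
values <- list(
  character = "hello",
  numeric   = 3.14,
  logical   = TRUE,
  integer   = 42L,       # the L suffix makes the literal an integer
  complex   = 1 + 2i
)

# class() reports the type of each value.
types <- sapply(values, class)
print(types)
```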
- Vectors: A vector is a sequence of data elements of the same basic type. Members of a vector are called components. The following are some ways to initialize a vector. Try them out in your R console.
x <- c(2, 3, 5)             # a vector containing the components 2, 3, 5
x <- c(TRUE, FALSE, TRUE)   # a vector of logical components
x <- c("abc", "zyx")        # a vector of character components
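Because all components of a vector share one basic type, R converts mixed input to a common type. The short sketch below, with made-up values, shows this together with indexing and length().

```r
v <- c(10, 20, 30)
second <- v[2]          # indexing starts at 1, so this is 20
n_items <- length(v)    # number of components

# Mixing types forces a common type: here everything becomes character.
mixed <- c(1, "a", TRUE)
mixed_type <- class(mixed)
print(mixed)
```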
- Lists: A list is a generic vector that can contain other objects of different types. Below is a list containing three different vectors and a number.
n <- c(2, 3, 5)             # a numeric vector
s <- c("aa", "bb", "cc")    # a string vector
b <- c(TRUE, FALSE, TRUE)   # a logical vector
x <- list(n, s, b, 3)       # a list holding the three vectors and a number
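Once a list exists, its members can be pulled out by position or by name. This is a minimal sketch; the member names (numbers, strings, flags, single) are our own additions for illustration and are not part of the example above.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
x <- list(numbers = n, strings = s, flags = b, single = 3)

first_member <- x[[1]]    # double brackets return the member itself
by_name <- x$strings      # members can also be reached by name
sub_list <- x["flags"]    # single brackets return a smaller list
print(first_member)
```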
- Matrices: A matrix is a collection of data elements arranged in a rectangular layout of rows and columns.
A <- matrix(
  c(2, 4, 3, 1, 5, 7),   # the data elements
  nrow = 2,              # number of rows
  ncol = 3,              # number of columns
  byrow = TRUE)          # fill the matrix by rows
Setting byrow = FALSE, which is the default, fills the matrix by columns instead.
An element at the mth row, nth column of A can be accessed by the expression A[m, n].
A[2, 3]         # element at 2nd row, 3rd column
The entire nth column of A can be extracted as A[ , n].
A[ , 3]         # the 3rd column
To extract more than one column or row at the same time, pass a vector of indices.
A[ , c(1, 3)]   # the 1st and 3rd columns
Accessing elements is easier when we assign names to the dimensions of the matrix.
dimnames(A) <- list(
  c("row1", "row2"),           # row names
  c("col1", "col2", "col3"))   # column names
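With dimension names in place, elements can be addressed by name instead of position. The sketch below rebuilds the same matrix so that it runs on its own.

```r
A <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
dimnames(A) <- list(c("row1", "row2"), c("col1", "col2", "col3"))

element <- A["row2", "col3"]   # the same cell as A[2, 3]
column <- A[, "col1"]          # a named vector with one entry per row
print(A)
```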
Let's talk about some operations that are carried out on matrices.
The transpose of a matrix is obtained by interchanging its rows and columns.
For this we can use t(matrix_name).
We can also combine matrices using the functions rbind() and cbind().
However, keep in mind that cbind() requires the two matrices to have the same number of rows. Similarly, rbind() requires them to have the same number of columns.
cbind(matrix1, matrix2)
rbind(matrix1, matrix2)
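Here is a small runnable sketch of these operations. The second matrix B is made up for the example; both matrices are 2 x 3, so they can be combined either way.

```r
A <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)   # 2 x 3
B <- matrix(1:6, nrow = 2)                                 # 2 x 3

At <- t(A)             # transpose: 3 x 2
wide <- cbind(A, B)    # same number of rows: 2 x 6
tall <- rbind(A, B)    # same number of columns: 4 x 3

print(dim(At))
print(dim(wide))
print(dim(tall))
```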
Watch this video to know more about working with matrices in R.
- Arrays: Arrays in R are data objects that can store data in more than two dimensions. For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns. Like vectors and matrices, an array can store only one data type.
vector1 <- c(5, 9, 3)                  # create two vectors of different lengths
vector2 <- c(11, 23, 45, 33, 66, 7)
# take these as inputs to the array() function;
# the 9 values are recycled to fill both 3 x 3 matrices
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
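Elements of an array are accessed with one index per dimension. The sketch below repeats the array from the example so that it is self-contained.

```r
vector1 <- c(5, 9, 3)
vector2 <- c(11, 23, 45, 33, 66, 7)
result <- array(c(vector1, vector2), dim = c(3, 3, 2))

cell <- result[1, 2, 1]          # row 1, column 2 of the first matrix
second_matrix <- result[, , 2]   # the whole second 3 x 3 matrix
print(result)
```

Values are filled column by column, so the first column of each matrix is 5, 9, 3.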
- Factors: Factors are data objects used to categorize data and store it as levels. They can be built from both strings and integers. They are useful for columns that have a limited number of unique values, and are widely used in statistical modelling.
# Create a vector as input
data <- c("East", "West", "North", "East", "North", "West", "South")
print(data)
print(is.factor(data))
# Apply the factor function
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
The output of the code is as follows.
[1] "East"  "West"  "North" "East"  "North" "West"  "South"
[1] FALSE
[1] East  West  North East  North West  South
Levels: East North South West
[1] TRUE
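A few helper functions make it easier to see what a factor stores. The following is a minimal sketch using the same data; the custom level order at the end is an arbitrary choice for illustration.

```r
data <- c("East", "West", "North", "East", "North", "West", "South")
factor_data <- factor(data)

level_names <- levels(factor_data)   # sorted alphabetically by default
counts <- table(factor_data)         # how often each level occurs
codes <- as.integer(factor_data)     # position of each value in levels()

# Levels can also be given in a custom order.
ordered_data <- factor(data, levels = c("North", "East", "South", "West"))
print(counts)
```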
Watch the video to know more about factors.
- Data Frames: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following df is a data frame containing 3 vectors n, s, b.
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)   # df is a data frame
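Columns and rows of a data frame can be inspected and filtered as sketched below. The stringsAsFactors = FALSE argument is our addition, so that the s column stays a character vector on older versions of R as well.

```r
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

dims <- dim(df)          # number of rows, number of columns
s_column <- df$s         # a single column, returned as a vector
flagged <- df[df$b, ]    # only the rows where b is TRUE
str(df)                  # compact overview of the column types
```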
Here is more about data frames.
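We can also check for missing values using the is.na() function. Missing values are usually denoted by NA or NaN. Let us understand this with the help of an example.
x <- c(1, 2, 3, 4, NA)   # create a vector
is.na(x)                 # check for missing values
The output of the above operation is
[1] FALSE FALSE FALSE FALSE  TRUE
We also have an is.nan() function that is used to detect NaN values.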
Note:
NaN ("Not a Number") is the result of an undefined numeric operation such as 0/0.
NA ("Not Available") is generally interpreted as a missing value and has various forms: NA_integer_, NA_real_, etc.
Therefore NaN and NA are not the same thing, and R needs both.
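The difference between the two, and a common way of dealing with missing values, can be sketched as follows. The vector is the one from the example above; the variable names are ours.

```r
x <- c(1, 2, 3, 4, NA)
nan_value <- 0 / 0                    # an undefined operation gives NaN

na_count <- sum(is.na(x))             # TRUE counts as 1, so this counts the NAs
mean_all <- mean(x)                   # NA, because one value is missing
mean_known <- mean(x, na.rm = TRUE)   # ignore the missing value

nan_is_na <- is.na(nan_value)         # is.na() is also TRUE for NaN
na_is_nan <- is.nan(NA)               # is.nan() is FALSE for NA
print(mean_known)
```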
In the next blog post we will talk about some more topics such as the names attribute and summary. We will also cover reading tabular data and data formats. Feel free to share your questions in the comments section.