Introduction to R

Installing R, Packages in R, Data Types in R


Let's start the course!





But before we do, let us address a few questions.

1. What is data science?

2. Why data science?

3. Why R for data science?

Data Science

The name itself gives you an idea, but a formal definition usually reads something like this:

"Data science is a multidisciplinary blend of data inference, algorithm development, and technology in order to solve analytically complex problems."


In simpler words, it is a field that aims to extract meaning from vast amounts of complex data. Every day, enormous volumes of data are generated by sensors, websites, social media, e-commerce and other sectors. A larger volume of data can give you better averages, but it also brings a lot of noise, i.e. useless data. Extracting useful information to enable decision making is a data scientist's job.


Several fields are closely related to data science, such as data mining (also known as knowledge discovery in databases, or KDD), predictive analysis, statistical inference and machine learning. All of these are commonly applied to Big Data.


To get an idea of how versatile the field and its applications are, go through this link, Stock Market Prediction with R, and see the video below.



                       


This leads to the next question: WHY DATA SCIENCE?


Normally this is where websites tell you that data science produces valuable insights or provides statistical inferences. You will get to know these things eventually as you proceed with the course. What we're going to give you is a very practical look at the paradigm shift that has occurred in the industry, and use it to answer the question of why data science matters.

According to The New York Times, data science "promises to revolutionize industries from business to government, healthcare to academia". It is estimated that by 2018, 4 million to 5 million jobs in the United States will require data analysis skills, and a recent study from the McKinsey Global Institute found "a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge". So let's sum up this answer in two words: DEMAND and OPPORTUNITY.

A number of tools are available today that data scientists use for data analytics, such as R, Python and SAS. As we said, we will be using R as the medium of instruction here, and there are a few important reasons for that:




  • It is free (unlike SAS)
  • It has a comprehensive set of packages for:
    1. Data Access
    2. Data Cleaning
    3. Analysis
    4. Data Reporting
  • It has one of the best development environments (www.rstudio.com)
  • It has an amazing ecosystem for developers
  • Packages are easy to install and "play nicely together"
INSTALLING RSTUDIO

Now that we have covered what data science is, why it matters, and why R is a good fit for it, we have finally come to the point where we can jump into the course. Instead of walking you through the installation steps in text, here is a quick video of the complete procedure. Watch the video carefully. We hope there won't be any issues, but if you face any difficulty at any point, feel free to raise it in the comment section.



PACKAGES IN R

We'll be taking an unconventional path here and telling you what 'packages' in R are before we talk about programming in R. This comes from experience: after running a number of analytics sessions, we believe 'packages' need to be discussed before entering the programming world of R. So, analysts, take out your pens and notepads. This is the point where you start scribbling.

'Packages' in R are something you'll be using in almost every piece of code written in R. They are collections of R functions, data and compiled code in a well-defined format. The directory where packages are stored is called the 'library'. R already comes with a standard set of packages. Others are available for download and installation. The standard commands to install and load a package in R are as follows.

> install.packages("name_of_package")  # download and install the package
> library("name_of_package")           # load the package in the present R session
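
As a rough sketch of the same workflow, the snippet below checks whether a package is already installed before loading it. It uses the built-in stats package as a stand-in (an assumption made here so that nothing has to be downloaded); swap in any package name you actually need.

```r
# Directories where R looks for installed packages (the "library")
lib_dirs <- .libPaths()
print(lib_dirs)

# "stats" ships with R, so this example needs no internet access
pkg <- "stats"

# requireNamespace() returns TRUE if the package is installed
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Install only when the package is missing
if (!is_installed) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
```

Checking first avoids re-downloading a package every time a script runs.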
Here is a quick video on how to install and load packages in R.



We know you may have doubts at this point about how exactly packages are useful in R. Do not worry: you'll soon get to know more about them as we start using them in our code. And just to let you know, by the end of this comprehensive R course you'll be capable of creating your own packages in R. Now won't that be fun? People will be downloading your contributions to a programming language to make their coding easier!

For the complete list of packages, visit this link: CRAN Packages by Name

Another useful website to visit if you want to know about R packages is R Documentation.

Some interesting packages in R:


1. forecast

Time series analysis is incomplete without this package. It has helped data scientists analyze time series data and make reliable predictions. The package also allows us to fit time series models like ARIMA, ARMA, AR etc. to our data.

2. plyr

The functions in this package implement the split-apply-combine pattern and are often used in place of base R functions such as split() and the apply() family.

3. ggplot2

A widely used package for multivariate visualizations.


Let us explore the list based on the number of downloads!

1. Rcpp: Seamless R and C++ Integration (693288 downloads, 3.2/5 by 10 users)

2. ggplot2: An Implementation of the Grammar of Graphics (598484 downloads, 4.0/5 by 82 users)

3. stringr: Simple, Consistent Wrappers for Common String Operations (543434 downloads, 4.1/5 by 18 users)

4. plyr: Tools for Splitting, Applying and Combining Data (523220 downloads, 3.8/5 by 65 users)

5. digest: Create Cryptographic Hash Digests of R Objects (521344 downloads)

6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package (483065 downloads, 4.1/5 by 18 users)

7. colorspace: Color Space Manipulation (476304 downloads, 4.0/5 by 2 users)

8. RColorBrewer: ColorBrewer Palettes (453858 downloads, 4.0/5 by 17 users)

9. manipulate: Interactive Plots for RStudio (395232 downloads)

10. scales: Scale Functions for Visualization (394389 downloads)

Source: KDnuggets


Other interesting packages include RCurl (provides functions to compose general HTTP requests and convenient functions to fetch URLs, get and post forms, etc.) and bitops (functions for bitwise operations on integer vectors).

That's all about packages for now.

The next topic we'll be discussing is data types in R.

Everything that we are going to use in R is a kind of object, and these objects can store different types of data.

The objects we'll be using are:
  • Lists
  • Vectors
  • Matrices
  • Arrays
  • Factors
  • Data Frames
Now these objects can contain different types of data, which are known as "atomic" classes of objects. These are: 

  • Character
  • Numeric
  • Logical (TRUE/FALSE)
  • Integer
  • Complex
The atomic classes of objects are pretty simple. For example, the Character class includes single characters like "a" and "x" or strings like "jason" and "betty", and the Numeric class includes numbers such as 1, 2, 3 and so on. Hence it's easy to understand them from their names. More important, and more frequently used, are objects such as Lists, Vectors, Matrices, Arrays, Factors and Data Frames, so we'll explain them in detail.
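
To see the atomic classes for yourself, a small sketch like the one below can be pasted into the console. The variable names are chosen only for this illustration.

```r
# class() reports the atomic class of a value
char_class    <- class("jason")   # a string
numeric_class <- class(3.14)      # a number with a decimal part
plain_class   <- class(3)         # numbers are numeric by default
integer_class <- class(3L)        # the L suffix makes an integer
logical_class <- class(TRUE)      # TRUE or FALSE
complex_class <- class(2 + 3i)    # a complex number

print(c(char_class, numeric_class, plain_class,
        integer_class, logical_class, complex_class))
```

Notice that a plain 3 is stored as numeric; you need to write 3L to get an integer.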

  • Vectors: A vector is a sequence of data elements of the same basic type. Members of a vector are called components. The following are ways in which vectors can be initialized. Try this out in your R console.
    x <- 1:5                # a vector containing components 1,2,3,4,5
    x <- c(2,3,5)           # a vector containing components 2,3,5
    x <- c(TRUE,FALSE,TRUE) # a vector of logical components
    x <- c("abc","zyx")     # a vector of character components
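
Once a vector exists, you will usually want to pull components out of it. The short sketch below (variable names are just for illustration) shows indexing, component-wise arithmetic, and what happens when types are mixed.

```r
x <- c(2, 3, 5)

second  <- x[2]         # R counts from 1, so this is the second component
ends    <- x[c(1, 3)]   # first and third components
n       <- length(x)    # number of components
doubled <- x * 2        # arithmetic is applied to every component

# A vector holds one basic type, so mixed input is coerced
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)

print(doubled)
print(mixed)
```

Because a vector can hold only one basic type, the mixed vector ends up as character.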
  • Lists: A list is a generic vector containing other objects. Below is a list containing different vectors.
    n <- c(2,1,3)           # a numeric vector
    s <- c("aa","bb","cc")  # a string vector
    b <- c(TRUE,FALSE,TRUE) # a logical vector
    x <- list(n,s,b,3)      # initializing a list
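
Here is a small sketch of how members of that list can be retrieved. The named list at the end is an extra illustration that is not part of the example above.

```r
n <- c(2, 1, 3)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
x <- list(n, s, b, 3)

list_length <- length(x)    # the list has four members
second_item <- x[[2]]       # double brackets return the member itself
first_string <- x[[2]][1]   # then index into that member
slice_class <- class(x[2])  # single brackets return a smaller list

# Members can also be given names and accessed with $
named <- list(numbers = n, strings = s)
print(named$strings)
```

The difference between x[2] and x[[2]] trips up many beginners: the first is still a list, the second is the vector stored inside it.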


  • Matrices: A matrix is a collection of data elements arranged in a rectangular format, i.e. rows and columns.
    A = matrix(
      c(2, 4, 3, 1, 5, 7), # the data elements
      nrow = 2,            # number of rows
      ncol = 3,            # number of columns
      byrow = TRUE)        # fill matrix by rows

    Setting byrow = FALSE (the default) fills the matrix by columns instead.

    An element at the mth row, nth column of A can be accessed by the expression A[m, n].

    A[2, 3]      # element at 2nd row, 3rd column

    The entire nth column can be extracted as A[ ,n].

    A[ ,3]       # the 3rd column

    To extract more than one column or row at the same time:

    A[ ,c(1,3)]  # the 1st and 3rd columns

    Accessing elements is easier when we assign names to the dimensions of the matrix:

    dimnames(A) = list(
      c("row1", "row2"),         # row names
      c("col1", "col2", "col3")) # column names
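
To tie the matrix examples together, the sketch below rebuilds the same matrix, compares filling by rows with filling by columns, and then uses the dimension names for access. The matrix B is introduced only for the comparison.

```r
values <- c(2, 4, 3, 1, 5, 7)

# Filled row by row: first row is 2 4 3, second row is 1 5 7
A <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)

# Filled column by column (the default): first column is 2 4
B <- matrix(values, nrow = 2, ncol = 3)

dimnames(A) <- list(
  c("row1", "row2"),
  c("col1", "col2", "col3"))

by_position <- A[2, 3]             # access by row and column number
by_name     <- A["row2", "col3"]   # the same element, accessed by name
first_row   <- A["row1", ]         # a whole row, as a named vector

print(A)
print(B)
```

Names make code easier to read and keep working even if the column order changes later.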



Let's talk about some operations that are carried out on matrices.

The transpose of a matrix is obtained by interchanging its rows and columns.

For this we can use the code t(matrix_name)  # transpose of the matrix

We can also combine matrices using the functions rbind() and cbind().

However, keep in mind that cbind() is used when the two matrices have the same number of rows. Similarly, rbind() is used when the two matrices have the same number of columns.

  cbind(matrix1, matrix2)

  rbind(matrix1, matrix2)
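
A quick way to get comfortable with these operations is to watch how the dimensions change. In the sketch below, matrix1 and matrix2 are small made-up matrices of the same shape.

```r
matrix1 <- matrix(1:6,  nrow = 2)   # 2 rows, 3 columns
matrix2 <- matrix(7:12, nrow = 2)   # 2 rows, 3 columns

transposed <- t(matrix1)                 # rows become columns
side_by_side <- cbind(matrix1, matrix2)  # needs equal numbers of rows
stacked <- rbind(matrix1, matrix2)       # needs equal numbers of columns

print(dim(transposed))    # 3 2
print(dim(side_by_side))  # 2 6
print(dim(stacked))       # 4 3
```

If the shapes do not line up, cbind() and rbind() stop with an error rather than guessing.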

Watch this video to know more about working with matrices in R.




  • Arrays: Arrays in R are data objects that can store data in more than two dimensions. For example, if we create an array with dimensions (2,3,4), it contains 4 rectangular matrices, each with 2 rows and 3 columns. Like vectors and matrices, an array can store only one data type.
    An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array. Below is the code to create an array.

    # create two vectors of different lengths
    vector1 <- c(5,9,3)
    vector2 <- c(11,23,45,33,66,7)

    # take these as inputs to the function; dim = c(3,3,2) gives two 3x3 matrices,
    # and the 9 values are recycled to fill all 18 cells
    result <- array(c(vector1,vector2), dim = c(3,3,2))
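
Elements of an array are accessed with one index per dimension. The sketch below reuses the array from the example and picks out a few pieces.

```r
vector1 <- c(5, 9, 3)
vector2 <- c(11, 23, 45, 33, 66, 7)
result <- array(c(vector1, vector2), dim = c(3, 3, 2))

array_dims <- dim(result)       # 3 rows, 3 columns, 2 matrices
one_value <- result[1, 2, 1]    # row 1, column 2 of the first matrix
second_matrix <- result[, , 2]  # the whole second matrix
last_value <- result[3, 3, 2]   # row 3, column 3 of the second matrix

print(result)
```

Since only 9 values were supplied for 18 cells, the second matrix is a copy of the first.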


  • Factors: Factors are data objects used to categorize data and store it as levels. They can store both strings and integers. They are useful for columns that have a limited number of unique values, and they are widely used in data analysis for statistical modelling.
    Factors are created using the factor() function. Look through the code and try it out in your console to get a feel for it.

    # Create a vector as input
    data <- c("East","West","North","East","North","West","South")
    print(data)
    print(is.factor(data))

    # Apply the factor function
    factor_data <- factor(data)
    print(factor_data)
    print(is.factor(factor_data))

    The output of the code is as follows. Note that the levels are sorted alphabetically by default.
    [1] "East"  "West"  "North" "East"  "North" "West"  "South"
    [1] FALSE
    [1] East  West  North East  North West  South
    Levels: East North South West
    [1] TRUE
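
To see what a factor stores internally, the following sketch inspects the same data. It only uses base R functions.

```r
data <- c("East", "West", "North", "East", "North", "West", "South")
factor_data <- factor(data)

# The distinct categories, sorted alphabetically by default
level_names <- levels(factor_data)

# Each value is stored as an integer code pointing at a level
codes <- as.integer(factor_data)

# table() counts how often each level occurs
counts <- table(factor_data)

print(level_names)
print(codes)
print(counts)
```

The integer codes are why factors are compact and why modelling functions can treat them as categories.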

Watch the video to know more about factors.






  • Data Frames: A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following df is a data frame containing 3 vectors n, s, b.
    n <- c(2,3,5)
    s <- c("aa","bb","cc")
    b <- c(TRUE,FALSE,TRUE)
    df <- data.frame(n,s,b) # df is a data frame
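
Once the data frame is built, rows and columns can be pulled out much like in a matrix. This is a minimal sketch using the same df as above.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)

shape <- dim(df)          # number of rows and columns
col_names <- names(df)    # column names come from the vector names
n_column <- df$n          # one column, returned as a vector
second_row <- df[2, ]     # one row, returned as a data frame
n_where_b <- df[df$b, "n"]  # values of n in rows where b is TRUE

print(df)
str(df)                   # compact summary of each column's type
```

Using a logical column inside the row index, as in the last selection, is the basis of most filtering in R.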

Here is more about data frames.




We can also check for missing values using the is.na() function. Missing values are usually denoted by NA or NaN. Let us understand this with the help of an example.

    x <- c(1,2,3,4,NA) # create a vector

    is.na(x)           # check for missing values

    The output of the above operation is

    [1] FALSE FALSE FALSE FALSE  TRUE
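
In practice, is.na() is mostly used to count or drop missing values. Here is a short sketch along those lines; the variable names are for illustration only.

```r
x <- c(1, 2, 3, 4, NA)

n_missing <- sum(is.na(x))       # TRUE counts as 1, so this counts the NAs
missing_at <- which(is.na(x))    # positions of the missing values

mean_with_na <- mean(x)                 # any NA makes the result NA
mean_without_na <- mean(x, na.rm = TRUE)  # na.rm = TRUE skips missing values

x_complete <- x[!is.na(x)]       # keep only the non-missing values

print(n_missing)
print(mean_without_na)
```

Many summary functions such as mean() and sum() accept na.rm = TRUE for exactly this reason.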


We also have an is.nan() function that is used to detect NaN terms. 





Note:

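NaN (“Not a Number”) is the result of an undefined numeric operation such as 0/0.

NA (“Not Available”) is generally interpreted as a missing value and has various forms: NA_integer_, NA_real_, etc.


Therefore NaN and NA are not the same thing, and R needs both. Note that is.na() returns TRUE for both NA and NaN, whereas is.nan() returns TRUE only for NaN.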
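
The difference between the two is easiest to see side by side. The sketch below builds a small vector containing both and runs each check on it.

```r
values <- c(1, NA, 0/0)   # 0/0 evaluates to NaN

na_check  <- is.na(values)    # flags both NA and NaN
nan_check <- is.nan(values)   # flags only NaN

# NA comes in typed forms, which matters when building vectors
plain_na_class   <- class(NA)
integer_na_class <- class(NA_integer_)
real_na_class    <- class(NA_real_)

print(na_check)
print(nan_check)
```

So if you need to tell missing data apart from a failed calculation, check is.nan() first.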


In the next blog we will talk about some more data types like Names Attribute and Summary. We will also cover reading tabular data and data formats. Feel free to address your concerns in the comments section. 
