Tag: data science

R, RStudio & Getting Started with swirl

I’d seen “R” in job descriptions (usually as “experience with Python or R”), but I didn’t really know what it was.

In the first course I took through Coursera, they had us install R and RStudio.

In short, R is a statistical programming language.

The “About R” page talks about it being a “dialect of S”, an older programming language.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

Also from the “About R” page:

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes: an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

It’s free and open source and expandable with “packages”. It seems to be very flexible in what it can do.

The first lectures in the R Programming course talk more about the history of the language.

As I mentioned in my first post, I didn’t feel like the lectures really explained what was needed for the course projects. The lectures and associated slides are good for providing some background and an understanding of why you’d want to do certain things, but are not really a how-to, at least not a step-by-step.

Fortunately, the course offered extra credit for doing assignments in the swirl package.

I wouldn’t have had a clue where to start if it wasn’t for this package.

It’s essentially a self-paced tutorial on the language, starting from the very basics (assigning variables) and getting increasingly complex.
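To give a feel for what “the very basics” means, here is a small sketch of the kind of thing you type at the R prompt early on. This is my own illustration, not the actual text of any swirl lesson, and the variable names are made up.

```r
# Assign a value to a variable with the arrow operator
x <- 5 + 7

# Build a numeric vector with c() and do element-wise arithmetic on it
z <- c(1.1, 9, 3.14)
doubled <- z * 2

# A couple of basic summaries
n <- length(z)            # how many elements are in z
avg <- mean(c(2, 4, 6))   # arithmetic mean of a small vector

print(x)
print(doubled)
print(n)
print(avg)
```

Typing a variable’s name on its own at the prompt prints its value too, which is how most of the early exercises check your work.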

This was more valuable than the lectures, at least to me.

I had a few hiccups where I didn’t understand what was necessary to create a couple of the scripts in the early lessons, but was easily able to find assistance online – there really is a blog for everything!

Right now, I’m trying to decide if buying any books about R would be helpful or if I should be able to mostly rely on material available online.

If you want to get started, you can download R here. You’ll see references to CRAN, the Comprehensive R Archive Network, which is:

a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R.

R by itself looks like:

[Screenshot: r_alone]

RStudio is a more user-friendly interface for R.

[Screenshot: rstudio2]

I started doing the first couple of things in plain R, but moved to RStudio pretty quickly.

To use swirl to start learning R, you have to install the swirl package. There are step-by-step instructions here. (I’m using a Windows 7 machine, so the paths in the output below are Windows paths.)


> install.packages("swirl")
Installing package into ‘C:/Users/me/Documents/R/win-library/3.1’
(as ‘lib’ is unspecified)
trying URL ‘http://cran.rstudio.com/bin/windows/contrib/3.1/swirl_2.2.21.zip’
Content type ‘application/zip’ length 132575 bytes (129 KB)
opened URL
downloaded 129 KB

package ‘swirl’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\me\AppData\Local\Temp\Rtmp08B97X\downloaded_packages


After it’s installed, you have to load it into your R session with library().


> library("swirl")

| Hi! Type swirl() when you are ready to begin.


And then it’s just a matter of following the directions.


> swirl()

| Welcome to swirl!

| Please sign in. If you’ve been here before, use the same name as you did then. If you are new,
| call yourself something unique.

What shall I call you?



Learning: Data Science Specialization through Coursera

As I’m beginning to take my career into a new direction, I realize that I have a lot to learn.

I know math and basic statistics, but I need to know how to clean data, how to present it so businesses can use it to make decisions, and how to work with statistical programming languages.

I’m still trying to figure out if analysis or database administration is the better path for me, but to do that, I need more understanding of what is involved in each path.

In the last year, some of what I’ve really enjoyed doing was figuring out how to get information out of a database. That’s why I started studying SQL.

In my current position, I need to present information from varying sources (Google Analytics, Facebook Insights, other social media advertising and posting info, internal databases of customers, etc.) in ways the company can use to make important decisions.

I started researching obtaining a Master’s degree or university extension certificate in Data Science, but the cost is beyond my capability for now.

When I heard that veterans can get a verified certificate through Coursera, I researched what was available.

The Johns Hopkins Data Science Specialization sequence seemed to be similar to the university extension certifications I was researching.

Each course in the sequence is 4 weeks long. Each course has video lectures (most with PDF and/or HTML slides), a quiz each week, and programming projects.

[Image: Coursera rprog 2015]

So far, I’ve taken two courses in this sequence.

The Data Scientist’s Toolbox is an overview of the sequence, with an emphasis on getting the basic programs and accounts set up. It has you install R (an open source statistical programming language), RStudio (a more user-friendly interface for R) and Git (a version control program) as well as set up an account at GitHub.

GitHub is pretty interesting. It’s a place where programmers share and crowdsource their software projects and documentation. I was unaware of this tool before the class. I’m looking forward to using it more.

R Programming is a brief, very brief, introduction to some of R’s capabilities.

Included in this course is a study “package” called swirl. This was probably the most useful part of the course – and was counted as extra credit.

The lectures seemed to prepare a person for the quizzes, but were really inadequate for the programming assignments, especially if you’re a novice programmer. There were many discussions on the forums for the course about this.

I found that I did OK with swirl and a little research outside the course, but I downloaded all the course material and want to walk through the programming assignments again on my own time.

Getting and Cleaning Data starts this week.

I’ll be adding the verified certificates to my LinkedIn profile as I receive them.