Statistics and Analytics: Wrangling and Munging Data with SQL and R

"I didn't have time to write a short letter, so here's a long one" (Mark Twain). Such is the case with data in organizations - its natural state is sprawling, complex, often messy and overweight. Statistical analysis (including much data mining) needs trim, well-organized data.

Most big data are stored in relational database management systems in a set of related tables. Statistical analysis needs that data in a single table. Learn how to use SQL (structured query language) programming language to pull data from various tables in a database and assemble it in a format amenable to statistical analysis in R.

The purpose of this course is to teach you how to extract data from a relational database using SQL, and merge it into a single file in R, so that you can perform statistical operations. Join Kathryn Vasilaky in her online course "Wrangling and Munging with SQL and R". For more details please visit at http://www.statistics.com/data-munging/.

Who Should Take This Course:

This course is meant for beginners in data science who need to learn how to extract data from their relational databases. It can also be a good kickstarter course for individuals who have fluency in SPSS, SAS, or STATA. Such data scientists might not have had to work with relational databases directly before, or they may not be familiar with the syntax and logic of a more object oriented language like R.

Course Program:

Course outline: The course is structured as follows

Session 1: Thinking Ahead, How to Organize your Data?

Discussion and visuals of data models how data are stored in Postgres(SQL), i.e. entity relationship schemes, vs. R, flat files or data frames.

Unique identifiers, and primary vs. foreign keys

Defining your statistical objectives. Writing pseudo code to manipulate your data into its necessary form.
Questions we’ll ask: Will I need all the data (panel), or should I collapse it on a particular variable (cross section)?
Restoring a database in Postgres (this is repeated in Week Two?)

Session 2: Basic SQL Procedures and Functions

Using commands: Select, Where, Like, Order By
Using functions, Count, Avg, Sum
Using group-by and sortCopying tables to csv

Session 3: Bringing SQL into R

Joins (Inner, Outer, and multiple joins) in SQL
Using the ODBC driver to run SQL queries in R.
Reading in csv files
Joins vs merges?
Formatting dates
Subsetting data and merging it back in

Session 4: Working in R

The Reshape2 function in R to reshape data long or wide in R
Plyr function in R for grouping data

Performing a few basic statistical calculations in R
Data summaries
Linear regression
T-test
Basic Graphs with Ggplot

Dr. Kathryn Vasilaky has worked at the World Bank, the Gates Foundation, the Federal Reserve, TroopSwap (a tech start-up in Washington DC), and as a soloist dancer with the San Francisco Ballet. She has a wealth of experience extracting and preparing data for analysis and business decision-making: macroeconomic data, commodity data, data resulting from experiments, social network data, e-commerce data, and other web data.

This course takes place over the internet at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. The course typically requires 15 hours per week. Course participants will be given access to a private discussion board so that they will be able to ask questions and exchange comments with instructor, Dr. Iain Pardoe. The class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

For Indian participants statistics.com accepts registration for its courses at special prices in Indian Rupees through its partner, the Center for eLearning and Training (C-eLT), Pune.

For India Registration and pricing, please visit us at www.india.statistics.com.

Email: info@c-elt.com

Call: 020 66009116

Websites:

Statistics and Analytics

Monday 4 February 2013

Wrangling and Munging Data with SQL and R

No comments: