Most big data are stored in relational database management systems in a set of related tables. Statistical analysis needs that data in a single table. Learn how to use SQL (structured query language) programming language to pull data from various tables in a database and assemble it in a format amenable to statistical analysis in R.
The purpose of this course is to teach you how to extract data from a relational database using SQL, and merge it into a single file in R, so that you can perform statistical operations. Join Kathryn Vasilaky in her online course "Wrangling and Munging with SQL and R". For more details please visit at http://www.statistics.com/data-munging/.
- Discussion and visuals of data models how data are stored in Postgres(SQL), i.e. entity relationship schemes, vs. R, flat files or data frames.
- Unique identifiers, and primary vs. foreign keys
- Defining your statistical objectives. Writing pseudo code to manipulate your data into its necessary form.
- Questions we’ll ask: Will I need all the data (panel), or should I collapse it on a particular variable (cross section)?
- Restoring a database in Postgres (this is repeated in Week Two?)
Session 2: Basic SQL Procedures and Functions
- Using commands: Select, Where, Like, Order By
- Using functions, Count, Avg, Sum
- Using group-by and sortCopying tables to csv
Session 3: Bringing SQL into R
- Joins (Inner, Outer, and multiple joins) in SQL
- Using the ODBC driver to run SQL queries in R.
- Reading in csv files
- Joins vs merges?
- Formatting dates
- Subsetting data and merging it back in
Session 4: Working in R
Plyr function in R for grouping data
- Performing a few basic statistical calculations in R
- Data summaries
- Linear regression
- Basic Graphs with Ggplot
Dr. Kathryn Vasilaky has worked at the World Bank, the Gates Foundation, the Federal Reserve, TroopSwap (a tech start-up in Washington DC), and as a soloist dancer with the San Francisco Ballet. She has a wealth of experience extracting and preparing data for analysis and business decision-making: macroeconomic data, commodity data, data resulting from experiments, social network data, e-commerce data, and other web data.
This course takes place over the internet at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. The course typically requires 15 hours per week. Course participants will be given access to a private discussion board so that they will be able to ask questions and exchange comments with instructor, Dr. Iain Pardoe. The class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.