Data Analysis in Practice with R

General

Program Description

  • Teaching period: July 6 to July 17, 2020
  • Teaching hours: 45
  • Academic coordinator: Sergio Martínez Puertas/Manuel Sánchez Pérez
  • Knowledge area: Statistics and Operations Research / Marketing and Market Research

Introduction

Multivariate data analysis techniques are essential in academic and research activity. R is one of the main open-source programming languages for statistics and data science. Its power, versatility and continuous updates attract more users every year: its user base is reported to grow by about 40%, and it is widely used in academia and science. This course combines an explanation of the main statistical tools with the use of R as the software for carrying out all statistical procedures.

This course is for everyone, from college students interested in using R for a project (in business, economics, statistics, life sciences or communications) to beginners who simply want to improve their data analysis skills.

Objectives

  • Provide an overview of the main multivariate data analysis techniques
  • Learn what you need to know to get started with R
  • Perform exploratory analyses, visualizations, and inferences on a data set
  • Perform descriptive statistical analysis, confidence intervals, hypothesis testing and ANOVA tests with several data sets
  • Apply R to estimate dependence and interdependence multivariate techniques
  • Implement panel data models with R
  • Learn to validate and compare regression models using the base R package, and some other packages, if needed
  • Estimate a system of regression equations and a linear model with a limited dependent variable with R
  • Understand techniques based on computer science, such as neural networks and Bayesian classifiers, and implement them using R packages

Content

Modules

Module A: Introduction to R

  • We will introduce the basics to start working with R. This module includes R objects, arithmetic in R, simple data entry and description, data frames and saving and loading R objects.
  • Students will work through several examples to understand the fundamentals of R programming.
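
As a rough illustration of the topics listed for this module, the following minimal sketch uses only base R. The object names and the values in the data frame are made up for illustration and are not taken from the course materials.

```r
# R objects and arithmetic
x <- c(2, 4, 6, 8)        # a numeric vector
y <- x * 2 + 1            # arithmetic is applied element by element
mean_x <- mean(x)         # a simple description of the data

# Simple data entry as a data frame (illustrative values only)
students <- data.frame(
  name  = c("Ana", "Luis", "Marta"),
  score = c(7.5, 6.0, 9.0),
  stringsAsFactors = FALSE
)
summary(students$score)   # minimum, quartiles, mean, maximum

# Save an R object to disk and load it back
path <- tempfile(fileext = ".rds")
saveRDS(students, path)
restored <- readRDS(path)
```

Here `saveRDS()` and `readRDS()` store and restore a single object; `save()` and `load()` are the base R alternative for several objects at once.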

Module B: Descriptive Statistics, Confidence intervals, Hypothesis testing, and ANOVA

  • We will review the main descriptive statistics, confidence intervals, and usual hypothesis tests and introduce ANOVA for comparison of means.
  • Students will learn to perform descriptive statistical analysis with R by obtaining frequency tables, descriptive statistics, and plots.
  • The module also covers obtaining confidence intervals and performing hypothesis tests.
  • Students will perform analyses with several data sets.
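
The analyses named in this module could look roughly like the sketch below. It assumes the built-in `mtcars` data set as a stand-in for the course data sets, and the hypothesised mean of 20 is an arbitrary value chosen for illustration.

```r
data(mtcars)

# Frequency table and descriptive statistics
freq_cyl <- table(mtcars$cyl)   # number of cars per cylinder count
summary(mtcars$mpg)

# 95% confidence interval for the mean and a one-sample t test
# (H0: mean mpg = 20, an arbitrary illustrative value)
tt <- t.test(mtcars$mpg, mu = 20, conf.level = 0.95)
tt$conf.int

# One-way ANOVA: compare mean mpg across cylinder groups
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)

# A simple plot of the groups being compared
boxplot(mpg ~ factor(cyl), data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")
```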

Module C: Cluster and Factor Analysis

  • We introduce classification techniques and cluster analysis. Students learn how to partition a given data set into groups according to certain criteria.
  • Exploratory factor analysis identifies the latent trait structure among a set of variables, yielding a smaller number of variables that account for the initial data.

Module D: Linear Regression

  • We will explain the concept of regression in a simple and intuitive way, focusing on its different applications. We will show students some applications of this technique in different areas, for example marketing, business, environmental science or medicine. Basic concepts of multivariate regression will be explained, as well as inference in the model, variable selection and comparisons between models.
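
A minimal sketch of model fitting, inference and model comparison in base R follows. The `mtcars` data set and the choice of predictors are assumptions made for illustration only.

```r
data(mtcars)

# Two nested linear models for fuel consumption
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Inference in the model: coefficient estimates, standard errors, t tests
summary(fit_large)

# Comparison between models: F test for nested models and AIC
cmp <- anova(fit_small, fit_large)
aic <- AIC(fit_small, fit_large)
cmp
aic
```

The F test in `anova()` asks whether adding `hp` improves the fit beyond what `wt` already explains, while `AIC()` also applies to models that are not nested.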

Module E: Regression analysis: Advanced issues

  • Extensions of linear regression models
  • Simultaneous regression equations
  • Regression models with a limited dependent variable
  • Applications based on several data sets

Module F: Structural Equation Modelling

  • Introduction to structural equation models based on covariance structure modeling: applications, notation, key assumptions, and modeling process
  • Data preparation
  • Measurement model: Confirmatory factor analysis, assessment and output interpretation. Practical examples
  • Structural model: Procedure, assessment and output interpretation. Practical examples

Module G: Discrete Choice Models

  • Main features of discrete choice models (also known as qualitative response models)
  • Specification and use of models for the probabilities of events: probit and logit models
  • Estimation of different discrete choice models with R
  • Interpretation of discrete choice models: marginal effects
  • Random utility models
  • Applications of discrete choice models. Practical examples
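
As a sketch of how logit and probit models can be estimated and interpreted in base R, the example below models the binary variable `am` from the built-in `mtcars` data set. The data set, the single predictor and the hand-computed average marginal effects are illustrative assumptions, not the course's own material.

```r
data(mtcars)

# Binary outcome: am (0 = automatic, 1 = manual) explained by weight
logit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
probit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))
summary(logit)

# Average marginal effect of a continuous predictor:
# mean of the link density at the linear predictor, times the coefficient
xb_logit   <- predict(logit, type = "link")
ame_logit  <- mean(dlogis(xb_logit)) * coef(logit)["wt"]

xb_probit  <- predict(probit, type = "link")
ame_probit <- mean(dnorm(xb_probit)) * coef(probit)["wt"]

c(logit = unname(ame_logit), probit = unname(ame_probit))
```

The raw logit and probit coefficients are on different scales, so the marginal effects are usually the more comparable quantity between the two models.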

Module H: Panel Data Analysis

  • We explain the concept of panel data and the conditions to estimate a panel data model
  • Different models for panel data are examined, in particular static and dynamic models
  • Endogeneity treatment
  • Procedure to estimate panel data models with R. Recommendations in panel data analysis. Practical examples

Module I: Introduction to Neural Networks

  • Neural networks are a set of algorithms designed to recognize patterns and can be used for classification or predictive analysis (regression). We will study the basic elements of a network, different types of activation functions and learning methods
  • Using R packages such as ‘neuralnet’, students will learn how to train and plot a neural network and how to predict values using the network
  • Students will fit some neural networks, compare them and use the best option for prediction

Module J: Bayesian classifiers

  • We will introduce the concept of classification as the task of predicting the value of a target variable given some observed features. An example is the classification of a bank customer as a defaulter or non-defaulter based on the values of some observable customer features. We will show how the problem can be satisfactorily solved in an intuitive and interpretable way using so-called Bayesian classifiers.
  • Students will learn how to construct, use, validate and compare Bayesian classifiers using some relevant R packages, mainly 'bnlearn' and 'naivebayes', and will be informed about public repositories containing data sets from a wide variety of domains whose problems can be addressed using classifiers.
  • Students will have to construct, validate and compare two classifiers over various datasets.

Closing session

Methodology

Students will learn how to construct, use, validate and compare regression models using the base R package for different proposed problems. If needed, some other related software packages may be used. The approach is applied and practical. Each module consists of a brief theoretical background on the technique, training in the procedures with R, and exercises. All classes are taught on computers in a university computer room.

Professional Visits and Complementary Academic Activities

A professional visit is scheduled to the Science and Technology Park of Almería (http://pitalmeria.es/en/), where we will meet companies located in the park and learn about data analysis developments at marketing research companies and in other sectors, as well as research in the agri-food field.

Assessment

The evaluation procedure for passing the course is based on class attendance (30%) and the submission of a practical exercise in each module (70%).

Last updated Jan 2020

About the School

The University of Almería, Spain, has organized summer courses every July since 2013. They are designed by the most prestigious experts in the leading-edge fields of our University and are taught by Doctors and Full Professors of proven expertise and experience in their respective areas of knowledge. All courses have an eminently practical focus and include visits to industries and companies in the field.