Teaching R to students with little to no experience in programming or data analysis is a challenging task. Our talk at useR!2017 showed how different ingredients of our course Exploratory and Descriptive Data Analysis at UHasselt are used to facilitate the learning of R.

Firstly, the educational **environment** at UHasselt, based on *guided self-study* and the use of small group working sessions, allows each student to have an individual pace of learning and gives them frequent feedback.

Secondly, it is important that the **content** of the course is accessible and allows students to obtain quick results. For that reason, we decided to keep the amount of base R to a minimum and work with the different packages from the tidyverse. The consistent syntax and piping mechanism significantly lower the learning curve. Furthermore, starting with visualizing tidy and clean data gives *quick gains* to students, as they immediately observe the fruits of their labor and things become less abstract. Afterwards, content can gradually be made more complex, also including different preprocessing tasks.

Thirdly, students need to be **motivated**. This is especially important in programs where students, as in our situation, do not have an intrinsic motivation to learn data analysis, because it is not a major component of the curriculum. To increase motivation, we not only try to make the course material interesting and attractive for students, but also ask them to submit 6 assignments at regular intervals. In the end, a lot of practice is needed to become proficient in analyzing data.

A useful side effect of these assignments is that the data gathered can be used to get more detailed information on the progress of students and to identify students who are likely to fail the exam if no action is taken.

The figure below shows different clusters based on the score-pattern for the assignments on the left. On the right the success rate for the exam is shown for each cluster. It can be seen that some clusters have a higher success rate than the overall rate (52%), while others are much lower. However, differences are relatively small and assigning students to one of the clusters can only be done if information on all the assignments is gathered, which means that any intervention will come too late.

In order to have a better idea about *good* and *bad* students, predictor models are trained at different intervals, i.e. each time information about an assignment is obtained, the predictions are updated. Both the rpart predictor (package rpart) and the svm predictor (package e1071) were applied. The graph below shows how the accuracy of the predictors increases with additional information. It can be seen that the accuracy of the svm predictor is already higher than the baseline accuracy (majority classifier) after assignment 3.
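To make the baseline comparison concrete, here is a minimal sketch of how a majority-classifier baseline relates to a model's accuracy. It is written in Python with the standard library only, purely as an illustration; the analysis itself was done in R. The labels and predictions are made-up toy values, not the course data, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of students whose predicted outcome matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline_accuracy(actual):
    """Accuracy obtained by always predicting the most frequent outcome."""
    _, majority_count = Counter(actual).most_common(1)[0]
    return majority_count / len(actual)


# Toy exam outcomes for ten hypothetical students (NOT the course data).
actual = ["pass", "pass", "fail", "pass", "fail",
          "pass", "fail", "pass", "fail", "pass"]

# Toy predictions from some hypothetical model.
predicted = ["pass", "pass", "fail", "pass", "pass",
             "pass", "fail", "fail", "fail", "pass"]

baseline = majority_baseline_accuracy(actual)  # always predicting "pass"
model_acc = accuracy(actual, predicted)

print(f"baseline accuracy: {baseline:.2f}")
print(f"model accuracy:    {model_acc:.2f}")
```

A predictor is only informative once its accuracy exceeds this baseline, which is the comparison made in the graph.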

However, what matters even more in this case is the number of failing students we miss. Therefore, we should look at the false positive rate (FPR), where positive = pass: the share of students who actually fail but are predicted to pass.

It can be seen that the FPR of the svm is already at 40% after assignment 3 and at 30% at assignment 6. The combi predictor in this figure shows the FPR for a conservative combination of both predictors, i.e. if at least one of them predicts a fail, the conservative *combi* will predict fail. Of course, a low FPR comes at the cost of a high false negative rate (FNR), i.e. a lot of students without problems are labeled as *fail*. However, this is certainly a more *ethical* classification.
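The trade-off described above can be sketched in a few lines. The example is in Python with the standard library only, purely for illustration (the original predictors were built in R); the outcome vectors are toy values, not the course data, and the function names are hypothetical.

```python
def false_positive_rate(actual, predicted, positive="pass"):
    """Share of students who actually fail but are predicted to pass."""
    preds_for_negatives = [p for a, p in zip(actual, predicted) if a != positive]
    return sum(p == positive for p in preds_for_negatives) / len(preds_for_negatives)


def false_negative_rate(actual, predicted, positive="pass"):
    """Share of students who actually pass but are predicted to fail."""
    preds_for_positives = [p for a, p in zip(actual, predicted) if a == positive]
    return sum(p != positive for p in preds_for_positives) / len(preds_for_positives)


def conservative_combi(pred_a, pred_b):
    """Predict 'fail' as soon as at least one predictor says 'fail'."""
    return ["fail" if "fail" in (a, b) else "pass" for a, b in zip(pred_a, pred_b)]


# Toy data for eight hypothetical students (NOT the course data).
actual = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
tree   = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
svm    = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]

combi = conservative_combi(tree, svm)

for name, pred in [("tree", tree), ("svm", svm), ("combi", combi)]:
    fpr = false_positive_rate(actual, pred)
    fnr = false_negative_rate(actual, pred)
    print(f"{name:5s} FPR={fpr:.2f} FNR={fnr:.2f}")
```

In this toy data the combination halves the FPR of either predictor (0.25 instead of 0.50) while doubling the FNR (0.50 instead of 0.25): fewer struggling students are missed, at the price of flagging more students who would have passed anyway.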

Note that the goal is not to explicitly label the students with *pass* or *fail*. Rather, the goal is to identify struggling students in order to be able to help them accordingly.

Overall, a good environment that facilitates frequent feedback, well-constructed course materials, and assignments that increase students’ motivation and effort seem to be excellent ingredients to make learning R less challenging for students without much prior knowledge. Furthermore, frequently gathering data on their progress and using Learning Analytics can help to provide individual guidance to students in order to maximize the success rate.

Interesting data.

Would it be possible to share the number of students per cluster?

Are cluster 4 drop-outs? (In other words are all the 0 scores really 0 or non-submissions?)

Did you consider non-submissions in your models?

Hi Lod

The number of students in each cluster is 14, 11, 16, 9, 12 and 14, respectively. Non-submissions are indeed the zero scores (no student ever actually got zero on a submitted assignment). As such, cluster 4 are indeed drop-outs, making their success rate of zero more or less expected.

Hello, I like your work…

I will be teaching a Business Stat course to 2nd year undergraduates in the fall. In the past I have incorporated some R material but it was peripheral. Now I would like to redesign the course so that they learn the course material while working through R exercises that are doable and engaging. I am interested in tidyquant so they can pull data from the web but would rather avoid the tidyverse thing as I think it's preferable for them to learn R. That’s my thinking only I have no material to get me there, just the same dusty PowerPoints I have combed over through the years.

Any suggestions for materials? Am very much interested in exercise based learning as well as how you did your Learning Analytics.

Hi Margaret

Redesigning your course to be centered around R definitely requires a large investment of time, but the rewards are worth it. As most of our exercises are really focussed on tidyverse packages, there doesn’t seem to be a point in sharing them. However, for learning base R, maybe you can look at http://www.datacamp.com. They have a large repository of R courses. As an instructor, you can request a special group login. This will give your students access to all courses and you will be able to track their progress. Also, you can write your own exercises to use in your course. Maybe this is a good way to start.

Hi Gert, really interesting work! I am curious what clustering method you used to recognize the patterns of student scores?

Thanks!

Qixiang

Hi Qixiang

We just used regular k-means, using the scores on all individual exercises as input. Of course, more advanced clustering techniques/features could be used, which also take into account the temporal dimension.

Best,

Gert

Thanks Gert! That’s great to know!

Best,

Qixiang