RSS Feed from revolutionanalytics.com
- Because it's Friday: Infrastructure Collapses
On November 7, 1940, the Tacoma Narrows Bridge, opened just four months prior, suffered a catastrophic collapse in a windstorm. (The music in the video is annoying, so here's a version with an alternate soundtrack.)
The story behind the collapse is interesting: while it looks like a build-up of resonance is the culprit, it was actually the fluttering of the bridge deck that brought it down.
That's all from us for this week — we'll be back on Monday.
- Free guide to text mining with R
Julia Silge and David Robinson are both dab hands at using R to analyze text, from tracking the happiness (or otherwise) of Jane Austen characters, to identifying whether Trump's tweets came from him or a staffer. If you too would like to be able to make statistical sense of masses of (possibly messy) text data, check out their book Tidy Text Mining with R, available free online and soon to be published by O'Reilly.
The book builds on the tidytext package (to which Gabriela De Queiroz also contributed) and describes how to handle and analyze text data. The "tidy text" of the title refers to a standardized way of handling text data, as a simple table with one term per row (where a "term" may be a word, collection of words, or sentence, depending on the application). Julia gave several examples of tidy text in her recent talk at the RStudio conference:
Once you have text data in this "tidy" format, you can apply a vast range of statistical tools to it, by assigning data values to the terms. For example, you can use sentiment analysis tools to quantify terms by their emotional content, and analyze that. You can compare rates of term usage, such as between chapters or between authors, or simply create a word cloud of terms used. You could use topic modeling techniques to classify a collection of documents into like kinds.
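As a small sketch of what the tidy format looks like in practice (this assumes the tidytext and dplyr packages are installed; the sample sentences are invented), `unnest_tokens()` reshapes a table of text into the one-term-per-row form, ready for counting, comparison, or sentiment scoring:

```r
library(dplyr)
library(tidytext)

# A tiny, invented corpus: one row per line of text
text_df <- tibble(
  line = 1:2,
  text = c("Text mining with tidy tools is a joy",
           "One term per row makes analysis simple")
)

# unnest_tokens() reshapes the table to one token (here, one word) per row
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Drop common stop words, then count term frequencies
tidy_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

The same pipeline extends naturally to the other analyses mentioned above: for example, joining against a sentiment lexicon instead of the stop-word list quantifies the emotional content of each term.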
There is a wealth of data sources you can use to apply these techniques: documents, emails, text messages ... anything with human-readable text. The book includes examples of analyzing works of literature (check out the janeaustenr and gutenbergr packages), downloading Tweets and Usenet posts, and even shows how to use metadata (in this case, from NASA) as the subject of a text analysis. But it's just as likely you have data of your own to try tidy text mining with, so check out Tidy Text Mining with R to get started.
- Microsoft R Server in the News
Since the release of Microsoft R Server 9 last month, there's been quite a bit of news in the tech press about the capabilities it provides for using R in production environments.
Infoworld's article, Microsoft’s R tools bring data science to the masses, takes a look back at Microsoft's vision for R since acquiring Revolution Analytics two years ago, and notes that now "R is everywhere in Microsoft’s ecosystem". The article gives some background on open source R, and describes the benefits of using it within Microsoft R Open, Microsoft R Server and SQL Server 2016 R Services.
ZDNet's article, Microsoft's R Server 9: more predictive analytics, in more places, focuses on some of the major new features including the MicrosoftML package, the new Swagger API for R function deployment, and support for Spark 2.0. It also notes that the integration with SQL Server means that "predictive analytics capabilities are now available ... to an entire generation of application developers".
Computerworld's article, Microsoft pushes R, SQL Server integration, focuses on the operationalization capabilities for integrating R into production workflows, such as the new publishService function. It also mentions the various problem-specific solutions on GitHub, including the new Marketing Campaign Optimization template.
With SQL Server integration as a key component of the platform, you may also be interested in this blog post from the development team: SQL Server R Services – Why we built it.
- Diversity in the R Community
Following up on the useR! conference at Stanford last year, the Women in R Task Force took the opportunity to survey the 900-or-so participants about their backgrounds, experiences and interests. With 455 responses, the recently-published results provide an interesting snapshot of the R community (or at least the subset able to travel to the US who managed to register before the conference sold out). Among the findings (these are summaries; check the report for the detailed breakdowns):
- 33% of attendees identified as women
- 26% of attendees identified as other than White or Caucasian
- 5% of attendees identified as LGBTQ
The report also includes some interesting demographic analysis of the attendees, including the map of home-country distribution shown below. It offers recommendations for future conferences as well, one of which has already been implemented: the useR!2017 conference in Brussels will offer child care for the first time.
Relatedly, the Women in R Task Force has since expanded its remit to promoting other under-represented groups in the R community as well. To reflect the new focus, the task force is now called Forwards, and invites members of the R community to participate. If you have an interest in supporting diversity in race, gender, sexuality, class or disability, follow that link to get in touch.
Forwards: Mapping useRs
- Git Gud with Git and R
If you're doing any kind of in-depth programming in the R language (say, creating a report in Rmarkdown, or developing a package) you might want to consider using a version-control system. And if you collaborate with another person (or a team) on the work, it makes things infinitely easier when it comes to coordinating changes. Amongst other benefits, a version-control system:
- Saves you from the worry of making irrevocable changes. Instead of keeping multiple versions of files around (are filenames like Report.Rmd; Report2.Rmd; Report-final.Rmd; Report-final-final.Rmd familiar?) you just keep the latest version of the file, knowing that the older versions are accessible should you need them.
- Keeps a remote backup of your files. If you accidentally delete a critical file, you can retrieve it. If your hard drive crashes, it's easy to restore the project.
- Makes it easy to work with others. Multiple people can work on the same file at the same time, and it's (relatively) easy to keep changes in sync.
- Relatedly, it makes it easy to get a collaborator. Even if your project is currently a solo effort, you may want to get help in the future, and a version-control system makes it easy to add project members. If it's an open-source project, you might even get contributions from people you don't know!
There are many version-control systems out there, but a popular one is Git. You've probably interacted with projects (especially R packages) managed under Git on GitHub, an online hosting service for Git repositories. And while you can get a fair bit done with just your browser and GitHub, the real power comes from installing Git on your desktop. Using Git's command-line interface is a bear (here's a fake, but representative, example of the documentation), but fortunately RStudio and RTVS provide interfaces that make things much easier.
If you want to get started with Git and RStudio, Jenny Bryan has provided an excellent guide to setting up your system and using version control: Happy Git and GitHub for the useR. The guide is quite long and detailed, but fear not: the pace is brisk, and it provides everything you need to get going. During a two-hour workshop that Jenny presented at the RStudio conference, I was able to install Git for Windows, configure it with my GitHub credentials, connect it to RStudio, commit changes to an existing R package, and create and share my own repository. It's easier than you think! Just start with the link below, and work your way through the sections.
Jenny Bryan: Happy Git and GitHub for the useR
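For reference, here is a minimal command-line sketch of the workflow that the guide (and RStudio's Git pane) wraps. It assumes Git is installed; the file name and commit messages are invented for illustration:

```shell
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"

# Create a repository, and set an identity for this demo repo only
git init -q
git config user.name  "R User"
git config user.email "user@example.com"

# Commit the first version of a file
echo "# My analysis" > Report.Rmd
git add Report.Rmd
git commit -q -m "Initial draft of report"

# Revise the same file and record a new version: no Report-final-final.Rmd needed
echo "## Results" >> Report.Rmd
git commit -q -am "Add results section"

# Every prior version remains recoverable from the history
git log --oneline
```

The point of the demo is the last line: one file on disk, every earlier state retrievable, which is exactly the "no more Report-final-final.Rmd" benefit described above.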
- The fivethirtyeight R package
Andrew Flowers, data journalist and contributor to FiveThirtyEight.com, announced at last week's RStudio conference the availability of a new R package containing data and analyses from some of the site's data journalism features: the fivethirtyeight package. (Andrew's talk isn't yet online, but you can see him discuss several of these stories in his useR!2016 presentation.) While not an official product of the FiveThirtyEight editorial team, the package was developed by Albert Y. Kim, Chester Ismay and Jennifer Chunn under their guidance. Their motivation for producing the package was to provide a resource for teaching data science:
We are involved in statistics and data science education, in particular at the introductory undergraduate level. As such, we are always looking for data sets that balance being
- Rich enough to answer meaningful questions with,
- Real enough to ensure that there is context, and
- Realistic enough to convey to students that data as it exists “in the wild” often needs processing.
- Easily and quickly accessible to novices, so that we minimize the prerequisites to research.
The package includes data sets from dozens of data journalism stories, including stories about police killings in the USA, plane crashes, and even references to presidential candidates in hip-hop lyrics. There is also a complete worked analysis of the performance of movies satisfying the Bechdel Test, presented as an Rmarkdown vignette.
- Because it's Friday: Code Burn
- A Call For Web Developers To Deprecate Their CSS
- Responsive Web Considered Harmful
- and my personal favourite, The Hassle of Haskell
Like any good satire it can be hard to spot, but if you had any doubt check out the byline at the end of each post. Don't miss the comments, either.
The story behind these posts is the subject of an interesting talk delivered by Jenn Schiffer late last year. It's well worth watching, not least as an insight into the experience of women in the tech industry. And also because it's very, very funny.
That's all from the blog for this week. We'll be back on Monday. In the meantime, have a great weekend!
- Microsoft R Server tips from the Tiger Team
The Microsoft R Server Tiger Team helps customers around the world implement large-scale analytic solutions. Along the way, they discover useful tips and best practices, and share them on the Tiger Team blog. Here are a few recent tips from the Tiger Team on using Microsoft R Server:
- Gather metadata and explore numeric summaries of large data sets held in XDF files
- Filter XDF files with regular expression matching using the rxDataStep function in the RevoScaleR package
- Import DBase .dbf files into Microsoft R Server as an XDF file
- Optimize performance when using rxExec to parallelize R code across a server or cluster
- Perform various data wrangling tasks on XDF files, including aggregations, merges, and calculating column-level statistics
- Confine Microsoft R Server computations to a subset of a Hadoop cluster using node labels
- Quantify risk associated with loans, via in-database model scoring with SQL Server R Services
For more tips, including advice on operationalizing R scripts and using Microsoft R Server with data platforms such as Teradata and Cloudera, check out the Tiger Team blog at the link below.
- Education Analytics with R and Cortana Intelligence Suite
By Fang Zhou, Microsoft Data Scientist; Hong Ooi, Microsoft Senior Data Scientist; and Graham Williams, Microsoft Director of Data Science
Education has been a relatively late adopter of predictive analytics and machine learning as management tools. A keen desire to improve educational outcomes for society is now leading universities and governments to adopt student predictive analytics for better-informed and timely decision making.
Student predictive analytics often aims to solve two key problems:
- Predict student academic outcomes so as to better target support.
- Predict students at risk of dropping out so as to prevent attrition.
Education systems face enormous diversity across regions and countries. Two case studies demonstrate the novel and unique landscape for machine learning in the education world.
- A mixed effects regression model has been developed in conjunction with an Australian education department to measure the influence of student characteristics and to predict student test scores in the presence of variation across students and schools. The model was implemented using R and then integrated with Azure Machine Learning for deployment to production through Power BI.
- A predictive model for student drop-out has been developed in conjunction with an Indian state government, using two-class boosted decision trees. For deployment, an end-to-end pipeline was built using Azure services including Azure SQL Database, Azure ML and Azure Data Factory.
Microsoft data scientists assisted with the analysis in both cases. We present details below, with R code provided in a GitHub repository to replicate the modelling on artificial data.
Student Score Modeling

NAPLAN is a standardized testing system used by all schools in Australia to assess students’ basic skills: reading, writing, grammar, spelling and numeracy. A majority of students take the five tests in years 3, 5, 7, and 9. A goal of this use case is to identify talent based on NAPLAN test scores and to set individual targets across school cohorts.
The data were collected from 83,000 students across almost 140 schools in a major city. The data included information about yearly NAPLAN testing, student demographics, school records and school attributes.
We addressed the task as a regression problem, taking random effects for student and school into account. The lmer function from the R package lme4 was used to fit a mixed effects regression model to the NAPLAN test score data.
With this mixed-effects regression model we can measure the influence of the fixed effects in the presence of variation across students and schools, and fairly assess the quality of a student or a school while taking other factors into account. We observed that students and schools with very similar characteristics can perform quite differently in NAPLAN tests, and that poor or good NAPLAN scores can be characterized by combinations of the variables exposed in the data.
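A minimal sketch of this kind of model, not the team's actual code: it assumes the lme4 package is installed, and all column names and the simulated data are hypothetical stand-ins for the (confidential) NAPLAN data. With repeated test scores per student, a `(1 | student_id)` term would be added alongside the school term, as in the case study.

```r
library(lme4)

# Simulated stand-in for the NAPLAN data: students nested within schools
set.seed(42)
n_schools  <- 20
n_students <- 500
naplan <- data.frame(
  school_id   = factor(sample(n_schools, n_students, replace = TRUE)),
  ses_index   = rnorm(n_students),
  prior_score = rnorm(n_students, mean = 500, sd = 50)
)
school_effect <- rnorm(n_schools, sd = 15)
naplan$score  <- 450 + 0.5 * naplan$prior_score + 10 * naplan$ses_index +
  school_effect[naplan$school_id] + rnorm(n_students, sd = 25)

# Fixed effects for student characteristics; a random intercept per school
# captures the school-to-school variation discussed above
fit <- lmer(score ~ ses_index + prior_score + (1 | school_id), data = naplan)

summary(fit)          # fixed-effect estimates and variance components
ranef(fit)$school_id  # estimated school-level deviations, for fair comparisons
```

Comparing the school-intercept variance to the residual variance in `summary(fit)` is what lets you say how much of the score variation is attributable to schools rather than to individual students.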
The model was deployed into a cloud solution exposing the customized R model. Through this interface, education administrators can now easily gain insight into student performance. For example, trends can be detected in student scores over multiple years, key factors affecting academic achievement become apparent, the comparative quality of education across schools can be explored, and talent can be identified and shared.
Student Drop-Out Prediction

Many governments aim to reduce the number of school dropouts, thereby increasing the overall skill levels of their citizens and building human capital. This is certainly the case in Andhra Pradesh and other states in India.
To achieve this objective, complex data covering student performance, socio-economic conditions, school infrastructure, and teacher skills is combined with external sources from NGOs and government agencies working in education.
Microsoft's solution involved building and deploying binary classification models that predict the likelihood of a student dropping out, in addition to other educational outcomes at the school, district and state levels.
R-based models using the latest advances in boosted decision trees were implemented within Azure Machine Learning Studio, achieving an accuracy of 89%, a precision of 94%, a recall of 62%, an F1 score of 75% and an AUC of 89%. With such high accuracy in predicting student drop-out, governments can take proactive measures to develop effective, targeted strategies for reducing student attrition. Non-academic characteristics, often external to the school, were also identified as playing significant roles in drop-out rates, suggesting a focus for strategies that ensure a solid educational base for the future prosperity of the community.
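As a quick sanity check on those numbers (in base R; the relationship holds for any classifier), the reported F1 score is just the harmonic mean of precision and recall:

```r
# F1 is the harmonic mean of precision and recall; plugging in the
# reported precision and recall reproduces the reported F1 score
precision <- 0.94
recall    <- 0.62
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 2)  # 0.75, matching the reported F1 of 75%
```

The gap between the 94% precision and 62% recall is typical of drop-out prediction, where flagging a student incorrectly is cheap but missing an at-risk student is costly, so the threshold is a policy choice as much as a modeling one.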
Sample Solution Architecture
The student predictive analytics solution runs in the Azure cloud and leverages Cortana Intelligence Suite components. Before Azure can deliver insights into student performance or drop-out, the education data is acquired by data integration components in a prescribed format, then transformed, merged and cleansed. Azure SQL Database supports storage of both current academic-year data and historic data. An Azure Machine Learning model (with R customization) trained on historic data then predicts a student's NAPLAN test score, or whether a student in the current academic year is likely to drop out. An Azure Data Factory pipeline is deployed to automatically drive gathering the data, transforming it into a format suitable for the Azure ML model, and loading the processed data (with prediction results) back into the target Azure SQL Database for reporting in a Power BI dashboard.
The student predictive analytics solution we’ve shown here demonstrates how to extend the capabilities of R with the Cortana Intelligence Suite by integrating custom R code with Azure ML Studio, to solve problems in the education world. For a quick guide on how to use R in Azure ML Studio, see these instructions and this online tutorial. For additional examples of Cortana Intelligence-based solutions, see the Cortana Intelligence Gallery.
Data Science Design Pattern for Education Analytics
Based on our experience with these use cases, and with others we learn of or are involved in, we have developed and maintain the Data Science Design Pattern for Education Analytics, which includes implementations of both Student Score Modeling and Student Drop-Out Prediction. The pattern provides a starting point for a data scientist exploring a new dataset in the education world using R. The GitHub repository includes a sample dataset and R scripts to build the models described above, and data scientists working in the education domain can replicate this modelling approach using their own internal datasets.
By no means is this the endpoint of the data science journey. The pattern is under regular revision and improvement, and is provided as-is. To try it out, please download the provided R Markdown and Jupyter Notebook files. We welcome feedback, and you are welcome to comment and contribute at the GitHub repository linked below.
Fang Zhou (GitHub): Data Science Design Pattern for Educational Analytics
- In case you missed it: December 2016 roundup
In case you missed them, here are some articles from December of particular interest to R users.
Power BI now has a gallery of custom visualizations built with R.
Chicago's Department of Public Health uses R to prioritize health inspections at restaurants.
A beautiful map of Switzerland municipalities combined with a relief map of the mountains, created with R.
Using the Azure Interface Tool to parallelize the problem of optimizing an R model across the hyperparameter space.
Animating Voronoi tessellations in R to create a greeting card.
The new AzureSMR package lets you manage Azure virtual machines, clusters and storage from R.
Interactive decision trees in Microsoft R Server.
The ompr package provides numerical optimization with mixed integer programming.
Predicting flu deaths in China with R.
Using the circlize package and Microsoft R Server's Spark interface to visualize millions of taxi trips.
The State of Indiana uses R to forecast employment.
"One Page R" is a free, multi-chapter tutorial on data science topics using R.
The Deputy Chief Economist at Freddie Mac used R to animate the different rates of housing price increases around the US.
I gave a talk about the value of ecosystems to open source projects, using R as an example.
A summary of some recent projects funded by the R Consortium.
Microsoft R Server 9.0, featuring R 3.3.2 and support for Spark 2.0, is now available.
The dplyrXdf package has been updated with new features for managing XDF data sets in Microsoft R.
A stylometric analysis of the speeches of the Prime Minister of Pakistan.
Using R and the d3heatmap package to visualize the emotional journey of characters in "War and Peace".
General interest stories (not related to R) in the past month included: the horrors of 2016, a Machinima Christmas carol, freezing bubbles, a dark comic strip, and a virtual flight along the US-Mexico border.
As always, thanks for the comments and please send any suggestions to me at [email protected]. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.