RSS Feed from revolutionanalytics.com
- Because it's Friday: Et tu?
I spent 6 years learning to speak French as a student in Australia, so naturally I was excited to try out my language skills on my first visit to France. Inevitably, I could understand no-one, and no-one could understand me. (The Australian accent doesn't translate well, as it turns out.) But even after some months of practice, there were still two aspects of the language I dreaded every time I spoke: mixing up feminine and masculine articles (using le instead of la) or — much worse! — using the informal pronoun when I should have used the formal variant. In standard English, we refer to just about everyone as "you". (One exception is God, who gets the archaic 'thee'.) In French, whether you refer to someone as tu or vous is an indication of your relative status to them, and if you get it wrong, well ... quelle horreur!
That's all from us for this week. It's a long weekend for many of us here in the US, so we'll be back on Tuesday. Enjoy your (long) weekend!
- Catterplots: Plots with cats
As a devotee of Tufte, I'm generally against chartjunk. Graphical elements that obscure interpretation of the data occasionally have a useful role to play, but more often than not that role is to entertain at the expense of enlightenment, or worse, to actively mislead. So it's with mixed feelings that I refer you to catterplots, an R package by David Gibbs to create scatterplots using outlines of cats as the symbols.
But hey, it's the Friday before a long weekend, so here's a poorly annotated catterplot of households with dogs vs households with cats in the 48 continental US states (plus DC):
The data behind this chart I found at data.world, which looks to be an interesting new resource for data sets. (Registration is required to download data, however.) The code behind the chart is below. (If you want a variant featuring Nyan Cat, try
The catterplot package is available on GitHub at the link below.
GitHub (GibbsDavidl): Catterplots
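A minimal catterplot sketch, assuming the catplot() function and its cat and catcolor arguments as described in the repository README:

```r
# A minimal catterplot; install the package from GitHub first, e.g.
#   devtools::install_github("Gibbsdavidl/CatterPlots")
library(CatterPlots)

x <- runif(20)
y <- x + rnorm(20, sd = 0.1)

# cat picks the cat silhouette; catcolor is an RGBA colour vector
# (argument names taken from the package README; treat as an assumption)
catplot(xs = x, ys = y, cat = 3, catcolor = c(0, 0, 1, 1))
```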
- Six Articles on using R with SQL Server
Tomaž Kaštrun is a developer and data analyst working for the IT group at SPAR (the ubiquitous European chain of convenience stores) in Austria. He blogs regularly about using Microsoft R and SQL Server for data analysis, and recently published a roundup of his articles about R and SQL Server.
Follow the link below for an overview of the articles, which cover:
- using Microsoft R in enterprise environments,
- an introduction to SQL Server R Services,
- how to install R packages in SQL Server R Services,
- a tutorial on using SQL Server R Services to analyze sales data,
- using R to visualize data in Power BI and SQL Server Reporting Services,
- and using R to analyze data generated by SQL Server DBA activities.
TomazTSQL: R and SQL Server articles
- Performance improvements coming to R 3.4.0
R 3.3.3 (codename: "Another Canoe") is scheduled for release on March 6. This is the "wrap-up" release of the R 3.3 series, which means it will include minor bug fixes and improvements, but eschew major new features. Major changes are coming though, with the subsequent release of R 3.4.0. While the NEWS file announcing updates in 3.4.0 is still subject to change, it indicates several major changes aimed at improving the performance of R in various ways:
A "just-in-time" JIT compiler will be included. While the core R packages have been byte-compiled since 2011, and package authors also have the option of btye-compiling the R code they contain, it was tricky for ordinary R users to gain the benefits of byte-compilation for their own code. In 3.4.0, loops in your R scripts and functions you write will be byte-compiled as you use them ("just-in-time"), so you can get improved performance for your R code without taking any additional actions.
Linear algebra performance improvements. R uses a BLAS library for high-performance implementations of many linear algebra routines like matrix multiplication, and now R will use faster routines in some situations (e.g. for matrix-vector multiplications). It will also be slightly faster for each call, by reducing the time to check whether the data include missing values (which BLAS generally doesn't handle). This should improve the performance of all R distributions, including those like Microsoft R that are bundled with multi-threaded BLAS libraries.
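The kind of operation affected looks like this; both calls below are handed off to the BLAS, which is where the new fast path and the missing-value pre-check apply:

```r
set.seed(42)
A <- matrix(rnorm(1000 * 1000), nrow = 1000)
x <- rnorm(1000)

y1 <- A %*% x          # matrix-vector multiply, dispatched to the BLAS
y2 <- crossprod(A, x)  # t(A) %*% x as a single BLAS call

# anyNA() is the kind of missing-value check R performs before
# handing data to the BLAS (which generally can't handle NAs)
anyNA(A)  # FALSE
```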
Improvements for packages with compiled code. Many packages include code written in C or C++ (or even Fortran, still a powerful language for scientific computing) that is then called from R functions. R 3.4.0 will include a new system that allows package developers to choose to expose compiled functions to other packages or to keep them private. As a side benefit, this new "registration" system will speed up the process of calling compiled functions, particularly on Windows systems. The gain is on the order of microseconds per call, but when these functions are called thousands or millions of times the impact can be noticeable. The system also adds additional checks to make sure calls to compiled functions are structured correctly — a reliability check that has already detected potential bugs in dozens of packages on CRAN.
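On the R side, a package opts in to the registration system through its NAMESPACE file (mypkg below is a hypothetical package name); .registration = TRUE tells R to use the registration table built by the package's C code instead of looking symbols up by name:

```
# In the package's NAMESPACE file
useDynLib(mypkg, .registration = TRUE)
```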
Accumulating vectors in a loop is faster. It's still a bad idea to extend the length of a vector with each iteration of a loop (it's a better idea to pre-allocate a vector of the needed length first), but code that follows that practice should now run faster thanks to R occasionally grabbing a bit more memory than needed.
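The two patterns look like this; pre-allocation remains the recommended style:

```r
n <- 1e4

# Growing a vector inside the loop: still discouraged, but this is the
# pattern that R 3.4.0 makes faster by over-allocating memory
grown <- numeric(0)
for (i in 1:n) grown[i] <- i^2

# Pre-allocating a vector of the needed length first: the better practice
prealloc <- numeric(n)
for (i in 1:n) prealloc[i] <- i^2

identical(grown, prealloc)  # TRUE
```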
Performance improvements to other functions. Sorting vectors of numbers is faster (thanks to the use of the radix-sort algorithm by default). Tables with missing values compute more quickly. Long strings no longer cause slowness in some functions, and the sapply function is faster when applied to arrays with dimension names.
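The sorting and sapply cases can be sketched in base R (the method argument of sort() has accepted "radix" since R 3.3.0; 3.4.0 makes it the default for more inputs):

```r
# Radix sort can be requested explicitly and gives the same answer
x <- sample(1e5)
s1 <- sort(x, method = "radix")
s2 <- sort(x)
identical(s1, s2)  # TRUE

# sapply() over a matrix with dimension names: the case sped up in 3.4.0
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))
cs <- sapply(seq_len(ncol(m)), function(j) sum(m[, j]))  # column sums
```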
There are several other improvements not related to performance, as well:
- An updated version of the Tcl/Tk graphics system in R for Windows.
- More consistent handling of missing values when constructing tables.
- Accuracy improvements for extreme values in some statistical functions.
- Better detection and warning of likely programmer errors, like comparing a vector with a zero-length array.
The R Core Group has not yet announced a release date for 3.4.0, but according to the R Developers page it's likely to be available in mid-April. (That's not a guarantee, though: issues with the wrap-up release have delayed major updates in the past.) Whenever it arrives, R 3.4.0 looks to be a significant improvement for R users, especially those who care about performance.
- Galaxy classification with deep learning and SQL Server R Services
One of the major "wow!" moments in the keynote where SQL Server 2016 was first introduced was a demo that automated the process of classifying images of galaxies in a huge database of astronomical images.
The SQL Server Blog has since published a step-by-step tutorial on implementing the galaxy classifier in SQL Server (and the code is also available on GitHub). This updated version of the demo uses the new MicrosoftML package in Microsoft R Server 9, and specifically the rxNeuralNet function for deep neural networks. The tutorial recommends the Azure NC class of virtual machines, to take advantage of the function's GPU-accelerated capabilities, and provides details on using the SQL Server interfaces to train the neural network and run predictions (classifications) on the image database. For the details, follow the link below.
- A comparison of deep learning packages for R
Oksana Kutina and Stefan Feuerriegel from the University of Freiburg recently published an in-depth comparison of four R packages for deep learning. The packages reviewed were:
- MXNet: The R interface to the MXNet deep learning library. (The blog post refers to an older name for the package, MXNetR.)
- darch: An R package for deep architectures and restricted Boltzmann machines.
- deepnet: An R package implementing feed-forward neural networks, restricted Boltzmann machines, deep belief networks, and stacked autoencoders.
- h2o: The R interface to the H2O deep-learning framework.
The blog post goes into detail about the capabilities of the packages, and compares them in terms of flexibility, ease of use, parallelization frameworks supported (GPUs, clusters) and performance; follow the link below for details. I include the conclusion from the post here:
The current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might not be the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages.
H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping is more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities.
darch offers a limited but targeted functionality focusing on deep belief networks.
Information Systems Research R Blog: Deep Learning in R
- Because it's Friday: Remembering Hans Rosling
Some sad news to share this week: Hans Rosling, the renowned statistician and pioneer in data visualization best remembered for Gapminder, died on February 7. If you're not familiar with his work, do yourself a favour: watch his TED talk from 2006, and share his passion for telling stories with data.
Robert Kosara also has a lovely remembrance of his achievements and influence.
That's all from us for this week. We'll be back on Monday — see you then.
- Update on R Consortium Projects
On January 31, the R Consortium presented a webinar with updates on various projects that have been funded (thanks to the R Consortium member dues) and are underway. Each project was presented by the project leader, a member of the R community. You can watch the recording of the webinar here, but here's a brief summary of what was covered, grouped by infrastructure projects (R packages and support systems) and community projects (events and groups).
R-hub [Gabor Csardi]: As one of the first projects funded by the R Consortium, the R-hub project is nearing completion. This R package-building service makes things easier for package developers (and CRAN maintainers) by building and testing R packages on Windows, Linux and (soon!) Mac.
Distributed Computing Working Group [Michael Lawrence]: This group is developing a standardized API in R for distributed computing, and is in the process of implementing the draft interface in the ddR package, with the assistance of an R Consortium-funded intern (Clark Fitzgerald).
Simple Features for R [Edzer Pebesma]: This project is developing the "sf" package for R, providing a standardized interface to spatial data used with geographic information systems.
Mapedit [Tim Appelhans]: The goal of this project is to provide a tool for quick-and-easy editing of spatial data visualizations. An alpha version of the mapedit package is available now.
Improving DBI [Kirill Müller]: This group aims to provide a unified database interface for R. The interface is defined by the DBI package, and it's already being used by the RSQLite package.
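A minimal sketch of the unified interface DBI defines, using the RSQLite backend mentioned above (requires the DBI and RSQLite packages):

```r
library(DBI)

# Connect to an in-memory SQLite database via the common DBI interface;
# swapping the backend (e.g. to Postgres) changes only this line
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(con,
  "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
dbDisconnect(con)

res  # one row per distinct cylinder count
```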
R Documentation Task Force [Andrew Webb]: This group is working to design and implement the next-generation documentation system for R.
Native APIs for R [Lukas Stadler]: This working group is looking to modernize the low-level APIs provided within R's underlying implementation and contribute improvements to the R Core team.
RUGS [Joseph Rickert]: The R Consortium now has an active project to fund local R user groups, and has provided grants to 25 groups thus far.
RIOT Workshop [Lukas Stadler]: The Workshop on R Implementation, Optimization and Tooling — focused on core R engine development — was held last year at Stanford, and a follow-up is being planned for 2017.
R-Ladies [Gabriela de Queiroz]: This group has founded over 35 chapters of R-Ladies user groups, serving more than 4000 female R users.
Personally, I'm impressed with the contributions all of these groups have made for R users through such effective use of their R Consortium grants. (There's more to come, too: the R Consortium is accepting applications for the next round of grants through today.) If you agree that these are worthwhile projects, I hope you'll encourage your employer to become a member of the R Consortium. The more members (and membership dues), the more such projects can be funded.
- Job trends for R and Python
When we last looked at job trends from indeed.com, job listings for "R statistics" were on the rise but were still around half the volume of listings for "SAS statistics". Three-and-a-half years later, R has overtaken SAS in job listings for "statistics".
I added Python to the search this time; job listings for "Python statistics" have risen at a similar rate to those for R, but with a somewhat higher volume for R.
Since data scientist is a popular job role these days, let's do the same search for "data scientist":
For "data scientist" jobs, R and Python track very closely, with Python just edging out R in the past few months. This is most likely because R and Python (but unlike SAS) appear together in many data scientist job listings.
You can explore other job titles at indeed.com. (And thanks to reader SK for the suggestion to revisit these searches!)
- ModernDive: A free introduction to statistics and data science with R
If you're thinking about teaching a course on statistics and data science using R, Chester Ismay and Albert Kim have created an online, open-source textbook for just that purpose. ModernDive is a textbook that teaches students how to:
- use R to explore and visualize data;
- use randomization and simulation to build inferential ideas;
- effectively create stories using these ideas to convey information to a lay audience.
The book makes liberal use of R packages, and makes effective use of real-world data sets to communicate key concepts. Though still a work in progress, the book already covers the basics of data analysis (data wrangling, exploration, and visualization, including the elegant roadmap for selecting a chart type shown below) and statistical concepts including simulation, regression, and hypothesis testing. The book also aims to give students an understanding of the overarching data analysis process, including concepts like reproducibility and telling stories with data.
Incidentally, the book itself was written in R, using the bookdown package, which makes it easy to combine R code and output into a book format. Contributions are welcome, and the source code that generates ModernDive is available on GitHub.
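For a sense of how bookdown weaves code and output together, an R Markdown chapter source might look like this (a generic sketch; the heading and chunk are hypothetical, not taken from ModernDive):

````
## Exploring the data

The summary below is computed when the book is built, and its
output appears inline in the rendered page:

```{r speed-summary}
summary(cars$speed)
```
````

Calling bookdown::render_book() on the project's index.Rmd then knits all such chapters into a single HTML or PDF book.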
You can find the ModernDive book at its homepage, moderndive.com.