RcmdrPlugin.KMggplot2_0.2-4 is on CRAN


(This article was first published on R-bloggers – Triad sou., and kindly contributed to R-bloggers)

RcmdrPlugin.KMggplot2 v0.2-4 (an Rcmdr plug-in; a GUI for ‘ggplot2’) was released on Dec. 20, 2016.
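If you want to try it, a minimal sketch of installing and loading the plug-in from CRAN is below (in an interactive session, loading the plug-in should also load Rcmdr and start the R Commander GUI):

# Install from CRAN, then load; this pulls in Rcmdr as well
install.packages("RcmdrPlugin.KMggplot2")
library(RcmdrPlugin.KMggplot2)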

NEWS

Changes in version 0.2-4 (2016-12-20)
  • Added support for ggplot2 2.2.0.
  • Added ggplot2’s colour/fill scales (scale_hue, scale_grey).
  • New plot: Beeswarm plot in Box plot (Thanks to DailyFunky @EpiFunky).
  • New function: lastcom(), to list the last commands.
  • Fixed a bug: Histogram with missing values (Thanks to Dr. José G. Conde).
  • Fixed a bug: Boxplot with an inappropriate jitter position (Thanks to Dr. M. Felix Freshwater).

Beeswarm plot

[screenshot: beeswarm plot]

Example: with box plot

[screenshot: beeswarm plot combined with a box plot]

Example: with CI

[screenshot: beeswarm plot combined with confidence intervals]

ggplot2’s colour/fill scales

scale_*_hue

[screenshot: plot using scale_*_hue]

scale_*_grey

[screenshot: plot using scale_*_grey]


Visualizing “The Best”


(This article was first published on max humber, and kindly contributed to R-bloggers)

How do you measure “The Best”?

It’s not immediately clear. Because, “The Best” is incredibly vague and subjective. My “Best” is not the same as your “Best”. And our “Best”s can converge and diverge depending on what we are measuring and how we measure it.

I think “The Best” is often the wrong question. Usually when we’re looking for “The Best” (are you sick of me saying “The Best” yet?) we’re really just trying to find “The Better”.

Answers to questions like “Who is the best player in the NBA?” or “What is the best city in the world?” or “Which Pokemon is the best?” are fraught with caveats and asterisks and clarifications. And they have to be! What do you mean “Best”? Best Shooter? Best of All Time? Best Last Year? Best in terms of Quality of Living? Best on measures of Entertainment? Best Speeds? Best Attacks? Best Best Best Best Best! Aggh!

To answer these “Best” questions we have to narrow down the problem and convert them into “Better” questions. By slimming down the pool of possible options we can actually start to make some progress!

  • “Who is the better basketball player: Steph Curry, Kawhi Leonard, or Larry Bird?”
  • “Which city is better: Toronto, London, or San Francisco?”
  • “Which starter Pokemon (from Gen 1) is better for my team?”

These are questions, I think, we can actually answer! But only if we’re explicit about the measures we’re using in our calculations and the weights that we assign to them.

For instance, if we want to find the better basketball player we could come up with some formula that includes assists, shooting, usage, defense, and rebounds. Maybe my formula is y = 2 * Shooting + 1.2 * Usage - 3 * Defense + 1.1 * Assists^2 + 1.3 * Rebounds. But I think reducing five incredibly rich measurements down to one value is absurd. It’s lossy compression! And you might not agree with my formula. You could have a better one. You might really value rebounds. Or you might want to replace usage with steals! Blah Blah Blah Blah Blah.
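To make the lossy-compression point concrete, here is a rough sketch of that formula applied to two made-up stat lines (all the numbers are hypothetical):

# Hypothetical percentile stats for two very different players
player_a <- c(shooting = 90, usage = 80, defense = 60, assists = 85, rebounds = 70)
player_b <- c(shooting = 70, usage = 60, defense = 95, assists = 60, rebounds = 95)

# My arbitrary formula: five rich measurements collapsed into one number
score <- function(s) {
    2 * s[["shooting"]] + 1.2 * s[["usage"]] - 3 * s[["defense"]] +
        1.1 * s[["assists"]]^2 + 1.3 * s[["rebounds"]]
}

score(player_a)  # one number, most of the nuance gone
score(player_b)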

Worry not! We’ve finally reached the part where I present a method for finding “The Best” (or “The Better”, at least): a visualization for “The Best”. It is inspired by (more like entirely ripped off from) this FiveThirtyEight article.

All that is required is the tidyverse and forcats.
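Loading them up front keeps the rest of the snippets self-contained:

library(tidyverse)  # tribble(), gather(), mutate(), recode(), ggplot2, %>%
library(forcats)    # fct_reorder()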

Basketball

Step 1: Spin up the data

df <- tribble(
    ~player, ~assists, ~shooting, ~usage, ~defense, ~rebounds,
    "Larry Bird", 88, 89, 93, 92, 87,
    "Kawhi Leonard", 71, 94, 92, 93, 62,
    "Stephen Curry", 95, 92, 87, 43, 32,
    "Average", 72, 85, 32, 34, 30) %>% 
    gather(stat, percentile, -player) %>% 
    mutate(outof4 = percentile %/% 25 + 1) %>% 
    mutate(col = ifelse(player == "Average", "A", "B")) %>% 
    mutate(order = recode(stat, 
        assists = 5, shooting = 1, usage = 2, 
        defense = 3, rebounds = 4)) %>% 
    mutate(stat = recode(stat, 
        assists = "ASSISTnRATE", 
        shooting = "TRUEnSHOOTING", 
        usage = "USAGEnRATE", 
        defense = "DEFENSIVEnBPM", 
        rebounds = "REBOUNDnRATE")) %>% 
    mutate(stat = factor(stat))

Step 2: Graph!

df %>% 
    ggplot(aes(x = fct_reorder(stat, order), y = outof4)) + 
    geom_col(alpha = 0.5, aes(fill = col), 
        width = 1, show.legend = FALSE, color = "white") +
    geom_hline(yintercept = seq(0, 4, by = 1), 
        colour = "#949494", size = 0.5, lty = 3) +
    geom_vline(xintercept = seq(0.5, 5.5, 1), 
        colour = "#949494", size = 0.4, lty = 1) +
    facet_wrap(~player) +
    coord_polar() + 
    scale_fill_manual(values = c("#e4e4e4", "#00a9e0")) + 
    scale_y_continuous(
        limits = c(0, 4), 
        breaks = c(1, 2, 3, 4)) + 
    labs(x = "", y = "") + 
    theme(
        panel.background = element_rect(fill = "#FFFFFF"),
        plot.background = element_rect(fill = "#FFFFFF"),
        strip.background = element_rect(fill = "#FFFFFF"),
        strip.text = element_text(size = 10),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 7),
        panel.spacing = grid::unit(2, "lines"))

[plot: faceted polar chart, one panel per player]

I’m not sure what to call this new type of visualization, but I love it! I love it because you can immediately see that Larry Bird is “The Best” basketball player, at least when benchmarked against Steph and Kawhi. And I love it because it’s not just exotic fluff. The bubbles actually help with the interpretation of the data. The bubbles are simply better than columns, in this instance.

Just compare… These two graphs literally contain the same data:

df %>% 
    ggplot(aes(x = stat, y = percentile, fill = player)) + 
    geom_col(alpha = 0.5, color = "white")

[plot: stacked bar chart of percentile by stat]

df %>% 
    ggplot(aes(x = stat, y = percentile, fill = player)) + 
    geom_col(alpha = 0.5, color = "white",
        position = position_dodge())

[plot: dodged bar chart of percentile by stat]

Like, I can sort of tell that all the blue bars are really tall, but I can’t see much beyond that. It’s hard to jump back and forth between the stats and the players and tease out patterns.

Pokemon

Moving to the Pokemon question, we can grab data from http://pokemondb.net/:

df <- tribble(
    ~Pokemon, ~HP, ~Attack, ~Defense, ~SpAtt, ~SpDef, ~Speed,
    "Charizard", 78, 84, 78, 109, 85, 100,
    "Blastoise", 79, 83, 100, 85, 105, 78,
    "Venusaur", 80, 82, 83, 100, 100, 80) %>% 
    gather(stat, value, -Pokemon) %>% 
    mutate(outof6 = value %/% 20 + 1) %>% 
    mutate(order = recode(stat, 
        HP = 6, Attack = 1, SpAtt = 2, Defense = 3, SpDef = 4, Speed = 5))

Display it in the same way:

df %>% 
    ggplot(aes(x = fct_reorder(stat, order), 
        y = outof6, fill = Pokemon)) + 
    geom_col(alpha = 3/4, width = 1, 
        show.legend = FALSE, color = "white") +
    geom_hline(yintercept = seq(0, 6, by = 1), 
        colour = "#949494", size = 0.5, lty = 3) +
    geom_vline(xintercept = seq(0.5, 5.5, 1), 
        colour = "#949494", size = 0.4, lty = 1) +
    facet_wrap(~Pokemon) +
    coord_polar() + 
    scale_fill_manual(values = c("#06AED5","#ED933C", "#65B54F")) + 
    scale_y_continuous(
        limits = c(0, 6), 
        breaks = c(1, 2, 3, 4, 5, 6)) + 
    labs(x = "", y = "") + 
    theme(
        panel.background = element_rect(fill = "#FFFFFF"),
        plot.background = element_rect(fill = "#FFFFFF"),
        strip.background = element_rect(fill = "#FFFFFF"),
        strip.text = element_text(size = 10),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 7),
        panel.spacing = grid::unit(2, "lines"))

[plot: faceted polar chart, one panel per Pokemon]

And we can see that there really is no objective “Best” this time. It totally depends on what measures are important to us. Perhaps I really value Speed and Attack in my Pokemon; looking at the graph, I can see that I ought to grab Charizard. And you might decide to go with Blastoise because you like big, tanky, defensive Pokemon. It totally depends! And a single point value would be incredibly misleading here.

Cities

Just one last example to wrap it all up. Using the PWC Cities of Opportunity Index, I can grab the measures that are important to me, like Broadband Quality (need that fast internet!), Entertainment, Quality of Living, Ease of Starting a Business, and Cost of Living, to generate similar comparisons.

PWC actually has a visualization tool that spits out:

[image: output of the PWC comparison tool]

But I think my bubbles are better!

df <- tribble(
    ~city, ~Broadband, ~Entertainment, ~`QOL`, ~Startup, ~`COL`,
    "Toronto", 17, 16, 30, 30, 12, 
    "London", 19, 30, 16, 20, 1,
    "San Francisco", 20, 13, 18, 18, 6) %>%
    gather(stat, value, -city) %>% 
    mutate(score = value %/% 6 + 1) %>% 
    mutate(order = recode(
        stat, Broadband = 5, Entertainment = 1, 
        COL = 2, QOL = 3, Startup = 4))

df %>% 
    ggplot(aes(x = fct_reorder(stat, order), y = score, fill = city)) + 
    geom_col(alpha = 1, width = 1, show.legend = FALSE, color = "white") +
    geom_hline(yintercept = seq(0, 6, by = 1), 
        colour = "#949494", size = 0.5, lty = 3) +
    geom_vline(xintercept = seq(0.5, 5.5, 1), 
        colour = "#949494", size = 0.4, lty = 1) +
    facet_wrap(~city) +
    coord_polar() + 
    scale_fill_manual(values = c("#0C2238","#BF4E22", "#AF0023")) + 
    scale_y_continuous(
        limits = c(0, 6), 
        breaks = c(1, 2, 3, 4, 5, 6)) + 
    labs(x = "", y = "") + 
    theme(
        panel.background = element_rect(fill = "#FFFFFF"),
        plot.background = element_rect(fill = "#FFFFFF"),
        strip.background = element_rect(fill = "#FFFFFF"),
        strip.text = element_text(size = 10),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size = 7),
        panel.spacing = grid::unit(2, "lines"))

[plot: faceted polar chart, one panel per city]

There you have it. “The Best” Visualization. Or at least a visualization for “The Best”.


Implementation of a basic reproducible data analysis workflow


(This article was first published on Joris Muller’s blog – Posts about R, and kindly contributed to R-bloggers)

In a previous post, I described the principles of my basic reproducible data analysis workflow. Today, let’s be more practical and see how to implement it.

Note that this is a basic workflow. The goal is to find a good balance between a minimal reproducible analysis and the ease of deploying it on any platform.

This workflow will allow you to run a complete analysis based on multiple files (data files, R scripts, Rmd files…) just by launching a single R script.

Summary

This workflow processes raw files in order to produce reports (in HTML or PDF). There are three main components:

  1. Software:
    • R, of course.
    • The RStudio IDE, optionally. It can save you installation time because it comes with both Pandoc and rmarkdown built in.
    • If you are not using the RStudio IDE:
      • the rmarkdown R package, to convert R Markdown files to markdown, HTML and PDF. Just type install.packages("rmarkdown") in R.
      • a recent version of Pandoc to process the markdown files.
    • Git for version control.
  2. Files organisation (see below).
  3. One R script to rule them all (see below).

Files organisation

The most important thing is the organisation of the project. By project, I mean a folder containing every file necessary to run the analysis: raw data, intermediate data, scripts and other assets (pictures, XML…).

My organisation is:

name_of_the_project
|-- assets
|-- functions
    |-- import_html_helper.R
    |-- make_report.R
|-- plot
|-- produced_data
    |-- imported.rds
    |-- model_result.rds
|-- run_all.R     # *** Most important file ***
|-- raw_data
    |-- FinalDataV3.xlsx
    |-- SomeExternalData.csv
    |-- NomenclatureFromWeb.html
|-- reports
    |-- 01-import_data.html
    |-- 02-data_tidying.html
    |-- 03-descriptive.html
    |-- 04-model1.html
    |-- 99-sysinfo.html
|-- rmds
    |-- import_data.Rmd
    |-- data_tidying.Rmd
    |-- descriptive.Rmd
    |-- model1.Rmd
    |-- sysinfo.Rmd
|-- rscripts
    |-- complicated_model.R
|-- name_of_the_project.Rproj # not mandatory but useful

As you can observe, there are some principles:

  • Directory names are as explicit as possible to be understandable by anyone getting these files.
  • Reports don’t live with the scripts, because I don’t produce a report for every script or Rmd (there are child Rmds), and this way it’s obvious where people should look for the results (the reports).
  • Reports are sorted using 2-digit numbers. Data science is also the art of telling stories in the right order.
  • I use rmarkdown files to produce my reports, so that they contain both the results and my comments. These comments are fundamental: they are the data scientist’s interpretation of the results.
  • An rmarkdown file, sysinfo.Rmd, is used to produce a report keeping track of the names and versions of the R packages used (with sessionInfo()) and some extra information about the OS (Sys.info()). In an ideal workflow, these commands would be called at the end of each report; a minimal sketch of such a file is shown after this list.
  • Everything lives in a subfolder, except run_all.R (detailed below).
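For illustration, a sysinfo.Rmd along these lines could be as minimal as the following sketch (the title and chunk label are arbitrary):

---
title: "Session and system information"
---

```{r session_info}
sessionInfo()
Sys.info()
```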

One R script to run them all

This run_all.R script, as its name explicitly tells, runs everything necessary to produce all the reports. Here is an example:

source("functions/make_reports.R")

report("rmds/import_data.Rmd", n_file = "1")
report("rmds/data_tidying.Rmd", "2")
report("rmds/descriptive.Rmd", "3")
report("rmds/model1.Rmd", "4")

It’s straightforward: one line per rmarkdown file to process.

make_report.R is optional. It’s a helper function that renders the reports and sets their names.

# Clean up the environment
rm(list = ls())

# Load the libraries
library(knitr)
library(rmarkdown)

# Set the root dir because my rmds live in rmds/ subfolder
opts_knit$set(root.dir = '../.')

# By default, don't open the report at the end of processing
default_open_file <- FALSE

# Main function
report <- function(file, n_file = "", open_file = default_open_file,  
                   report_dir = "reports") {

  ### Set the name of the report file ###
  base_name <- sub(pattern = "\\.Rmd", replacement = "", x = basename(file))

  # Pad n_file so it always has 2 digits
  n_file <- ifelse(as.integer(n_file) < 10, paste0("0", n_file), n_file)

  file_name <- paste0(n_file, "-", base_name, ".html")
  
  ### Render ###
  render(
    input = file,
    output_format = html_document(
      toc = TRUE,
      toc_depth = 1,
      code_folding = "hide"
    ),
    output_file = file_name,
    output_dir = report_dir,
    envir = new.env()
    )

  ### Under macOS, open the report file  ###
  ### in firefox at the end of rendering ###
  if(open_file & Sys.info()[1] == "Darwin") {
    result_path <- file.path(report_dir, file_name)
    system(command = paste("firefox", result_path))
  }

}

Usage

The core of this workflow is the rmarkdown files. Most of the time, I write .Rmd files, mixing comments about what I’m going to do, code, results, and comments about these results.

If there is a heavy computation (e.g. models or simulations), I write a script in R and save the results in a .rds file. Often I use a remote machine for this kind of computation. Then I source the R script in an unevaluated Rmd chunk and load the results in an evaluated one.

```{r heavy_computation, eval=FALSE}
source("rscripts/complicated_model.R")
```

```{r load_results}
mod_results <- readRDS("produced_data/model_result.rds")
```
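For completeness, rscripts/complicated_model.R could look roughly like this; the model itself is a made-up placeholder, only the read-compute-save pattern matters:

# rscripts/complicated_model.R
# Long-running computation, possibly run on a remote machine

d <- readRDS("produced_data/imported.rds")

# Stand-in for the actual, expensive model
mod_results <- lm(outcome ~ exposure + centre, data = d)

# Save only the fitted object so the report chunk can load it cheaply
saveRDS(mod_results, "produced_data/model_result.rds")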

Because my .Rmd files don’t live in the root directory, I add a setup chunk to all of them. This way, I can process my .Rmd files directly or even use RStudio’s notebook feature.

```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = '../.')
```

When I have to rebuild all the reports (e.g. due to some raw data changes), I just run the run_all.R script.
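In practice this means either sourcing it from an R session opened at the project root, or running it non-interactively from a shell:

source("run_all.R")
# or, from a shell: Rscript run_all.R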

Benefits in real life

This basic workflow works for me because:

  • It’s pragmatic: it fulfils all the constraints and goals I set, without any bells or whistles.
  • It uses mainstream tools. Others can easily use it and these tools are not likely to be deprecated in the next decade.
  • It’s easy to implement and deploy.
  • It’s straightforward: people who just want to see the reports know where to look; those who want to reproduce the analysis just have to run the run_all.R file.

Limits

There are more layers needed for perfect data analysis reproducibility (as described in this article). The main weakness of my workflow is the lack of a backup of the exact versions of the software I use (R packages included). I have already been unable to rerun an old data analysis because of changes in the packages used (e.g. deprecated functions in dplyr or the disappearance of a package from CRAN). To improve this, I once tried to add packrat to my workflow, but that was a long time ago and the package wasn’t stable enough for my day-to-day work. Resolution for 2017: give packrat another try!
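For reference, the packrat workflow I would try again looks roughly like this (the project path is a placeholder):

install.packages("packrat")

packrat::init("~/projects/name_of_the_project")  # create a private, per-project library
packrat::snapshot()  # record the exact package versions in use
packrat::restore()   # reinstall those versions on another machine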

Other possible improvements: run the scripts in a Docker container or in a virtual machine. But this adds too much overhead to my daily work. Furthermore, it breaks the platform-agnostic and simplicity principles.

Conclusion

Currently, I’m happy with this workflow. It covers my daily data analysis reproducibility needs with little extra work.

In a later post, I will discuss possible extensions of this basic workflow.


Because it’s Friday: Infrastructure Collapses


On November 7, 1940, the Tacoma Narrows Bridge, opened just four months prior, suffered a catastrophic collapse in a windstorm. (The music in the video is annoying, so here’s a version with an alternate soundtrack.)

[embedded video: Tacoma Narrows Bridge collapse]

The story behind the collapse is interesting: while it looks like a build-up of resonance was the culprit, it was actually the fluttering of the bridge deck that brought it down.

That’s all from us for this week — we’ll be back on Monday.
