RSS Feed : R-Statistics.com

RSS Feed from http://www.r-statistics.com

  • Data Driven Cheatsheets

    Guest post by Jonathan Sidi

Cheatsheets are currently built and used exclusively as teaching tools. We want to try to change this and produce a cheatsheet that not only gives a roadmap to build a known product, but is also built as a function, so users can feed data into it and make the cheatsheet more personalized. The result keeps the versatility of a consistent format that people can share with each other, with the added value of conveying a message through data-driven visual changes.

    Example

    ggplot2 themes

The ggplot2 theme object is an amazing object: with it you can specify nearly any part of the plot that is not conditional on the data. What sets the theme object apart is that its structure is consistent while the values in it change. In addition, a theme is changed through a single function that also has a consistent call. The recurring challenge for users is to remember all the options that can be used in the theme call (there are approximately 220 unique options to calibrate at last count), or to bookmark the help page for theme() and remember how they deciphered it last time.
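
If you want a rough sense of the scale, you can count the top-level elements of the current theme directly (a minimal sketch; the exact count varies by ggplot2 version, and the ~220 figure above also counts the arguments nested inside each element):

    library(ggplot2)
    length(theme_get()) # number of top-level elements in the active theme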

This becomes a problem when you want to pass all the information in your theme to someone who does not know which values it sets, along with instructions that let them recreate it without opening any web pages.

In writing the ggedit package we tried to make it easy to edit your theme, so you don’t have to know too much about ggplot2 to make a large number of changes at once (for a quick clip see here). We also had to make it easy to track those changes for people who are not versed in R, and plot.theme() was the outcome. In short, think of the theme as many small images that are combined to create a single portrait.


    Converting a theme call to a single image

    theme(themeType.element=elementClass(elementClassArgument=value))
    • theme(): ggplot2 theme function
    • themeType: general part of the theme
      • line,rect, text, axis,legend, panel, strip, plot
• element: element of a themeType
  • e.g. themeType=axis and element=ticks is written as axis.ticks
• elementClass: each element is classified into an element_class(), which is a function that controls the characteristics of that class
  • e.g. element_line(size,linetype,lineend,colour,arrow)
• elementClassArgument: a characteristic of an element, given as an argument of its element_class function
• value: the value given to an elementClassArgument

There are also themeType.elements that are not classified into a specific class and are thus given values directly, as with several parts of the legend (e.g. legend.position).

    theme(themeType.element=value)
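
As a concrete instance of both templates (the values here are purely illustrative), one classified element and one unclassified element set in a single call:

    library(ggplot2)
    theme(axis.ticks = element_line(colour = 'grey50', linetype = 'solid'), # themeType.element = elementClass(arg = value)
          legend.position = 'bottom')                                       # unclassified element given its value directly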

To tie this all together, we can create a single template that can be replicated for any element in the theme object. To add more information to the output, we also show the class of the value given to an elementClassArgument (or a themeType.element) and an index that uniquely identifies the element.

    library(ggedit)

    Grouping templates

    plot(theme_get(),themePart = 'legend',fnt=17,plotFrame = F)

    We add some colour to distinguish which elements are set to NULL (grey) and which ones have values (red).

As an example, this is how to read the output:

    theme(legend.key=element_rect(fill='grey95',colour='white'),
          legend.text=element_text(size=rel(0.8)),
          legend.title=element_text(hjust=0),
          legend.box.spacing=unit(0.4,'cm'),
          legend.justification='center',
          legend.position='right',
          legend.spacing=unit(0.4,'cm'))

    Visualize the whole theme

    plot(theme_get(),plotFrame = F)

    Make this plot interactive by applying ggplotly from the plotly package.

    plot(theme_get(),as.plotly = T)


    Create the cheatsheet

Toggling the plotFrame argument in the plot call nests the plots into a generic cheatsheet layout. This layout does a better job of finding the best width for each box, supplies instructions on the border for how to read the output, and adds a caption at the bottom noting which theme was used.

    plot(theme_get(),plotFrame = T,fnt = 10)

    Compare themes

    Finally there is an option to compare themes. The same layout will be given but the color coding will change, where a blue color will indicate an update from the benchmark theme.

    library(ggthemes)
    plot(obj=theme_economist(),obj2 = theme_bw(),fnt = 10)

When you collaborate with many people and large changes are made to the theme, this gives you a single language everyone can understand for quick referencing and, hopefully, problem solving.


Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.

    To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.

Contact: For questions and comments, feel free to email me at: yonis@metrumrg.com or open an issue on GitHub.

    Read more »
  • ggedit 0.0.2: a GUI for advanced editing of ggplot2 objects

    Guest post by Jonathan Sidi, Metrum Research Group

Last week the updated version of ggedit was presented at RStudio::conf2017. First, a BIG thank you to the whole RStudio team for a great conference and for being so awesome answering the insane number of questions I had (sorry!). For a quick intro to the package, see the previous post.

    To install the package:

    devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")

    Highlights of the updated version.

• verbose script handling during updates in the gadget (see video below)
• verbose script output for updated layers and theme, to parse and evaluate in the console or editor
• colourpicker control for both single colours/fills and palettes
• output for scale objects, e.g. scale_*_gradient, scale_*_gradientn and scale_*_manual
• verbose script output for scales, e.g. scale_*_gradient, scale_*_gradientn and scale_*_manual, to parse and evaluate in the console or editor
• input plot objects can have the data in the base object, in the layer object, or both:
      • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))+geom_point()
      • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length))+geom_point(aes(colour=Species))
      • ggplot()+geom_point(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))
    • plot.theme(): S3 method for class ‘theme’
      • visualizing theme objects in single output
      • visual comparison of two themes objects in single output
      • will be expanded upon in upcoming post

    RStudio::conf2017 Presentation

    #devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")
    rm(list=ls())
    library(ggedit)
    #?ggedit
    
    p0=list(
      Scatter=iris%>%ggplot(aes(x =Sepal.Length,y=Sepal.Width))+
        geom_point(aes(colour=Species),size=6),
      
      ScatterFacet=iris%>%ggplot(aes(x =Sepal.Length,y=Sepal.Width))+
        geom_point(aes(colour=Species),size=6)+
          geom_line(linetype=2)+
        facet_wrap(~Species,scales='free')+
        labs(title='Some Title')
      )
    
    #a=ggedit(p.in = p0,verbose = T) #run ggedit
    dat_url <- paste0("https://raw.githubusercontent.com/metrumresearchgroup/ggedit/master/RstudioExampleObj.rda")
    load(url(dat_url)) #pre-run example
    
library(plyr) # ldply() below comes from plyr (attached here in case ggedit does not load it)
ldply(a,names)
    ##                     .id      V1           V2
    ## 1          UpdatedPlots Scatter ScatterFacet
    ## 2         UpdatedLayers Scatter ScatterFacet
    ## 3 UpdatedLayersElements Scatter ScatterFacet
    ## 4     UpdatedLayerCalls Scatter ScatterFacet
    ## 5         updatedScales Scatter ScatterFacet
    ## 6    UpdatedScalesCalls Scatter ScatterFacet
    ## 7         UpdatedThemes Scatter ScatterFacet
    ## 8     UpdatedThemeCalls Scatter ScatterFacet
    plot(a)

    comparePlots=c(p0,a$UpdatedPlots)
    names(comparePlots)[c(3:4)]=paste0(names(comparePlots)[c(3:4)],"Updated")

    Initial Comparison Plot

    plot(as.ggedit(comparePlots))

    Apply updated theme of first plot to second plot

    comparePlots$ScatterFacetNewTheme=p0$ScatterFacet+a$UpdatedThemes$Scatter
    
    plot(as.ggedit(comparePlots[c("ScatterFacet","ScatterFacetNewTheme")]),
          plot.layout = list(list(rows=1,cols=1),list(rows=2,cols=1))
         )

Using the Remove and Replace Function

Overlay two layers of the same geom

    (comparePlots$ScatterMistake=p0$Scatter+a$UpdatedLayers$ScatterFacet[[1]])

    Remove

    (comparePlots$ScatterNoLayer=p0$Scatter%>%
      rgg(oldGeom = 'point'))

    Replace Geom_Point layer on Scatter Plot

    (comparePlots$ScatterNewLayer=p0$Scatter%>%
      rgg(oldGeom = 'point',
          oldGeomIdx = 1,
          newLayer = a$UpdatedLayers$ScatterFacet[[1]]))

    Remove and Replace Geom_Point layer and add the new theme

    (comparePlots$ScatterNewLayerTheme=p0$Scatter%>%
      rgg(oldGeom = 'point',
          newLayer = a$UpdatedLayers$ScatterFacet[[1]])+
      a$UpdatedThemes$Scatter)

    Cloning Layers

    A geom_point layer

    (l=p0$Scatter$layers[[1]])
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity

    Clone the layer

    (l1=cloneLayer(l))
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity
    all.equal(l,l1)
    ## [1] TRUE

    Verbose copy of layer

    (l1.txt=cloneLayer(l,verbose = T))
    ## [1] "geom_point(mapping=aes(colour=Species),na.rm=FALSE,size=6,data=NULL,position=\"identity\",stat=\"identity\",show.legend=NA,inherit.aes=TRUE)"

    Parse the text

    (l2=eval(parse(text=l1.txt)))
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity
    all.equal(l,l2)
    ## [1] TRUE

    Back to our example

      #Original geom_point layer
      parse(text=cloneLayer(p0$ScatterFacet$layers[[1]],verbose = T))
    ## expression(geom_point(mapping = aes(colour = Species), na.rm = FALSE, 
    ##     size = 6, data = NULL, position = "identity", stat = "identity", 
    ##     show.legend = NA, inherit.aes = TRUE))
      #new Layer
      parse(text=a$UpdatedLayerCalls$ScatterFacet[[1]])
    ## expression(geom_point(mapping = aes(colour = Species), na.rm = FALSE, 
    ##     size = 3, shape = 22, fill = "#BD2020", alpha = 1, stroke = 0.5, 
    ##     data = NULL, position = "identity", stat = "identity", show.legend = NA, 
    ##     inherit.aes = TRUE))



Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.

    To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.

Contact: For questions and comments, feel free to email me at: yonis@metrumrg.com or open an issue on GitHub.

    Read more »
  • ggedit – interactive ggplot aesthetic and theme editor

    Guest post by Jonathan Sidi, Metrum Research Group

    ggplot2 has become the standard of plotting in R for many users. New users, however, may find the learning curve steep at first, and more experienced users may find it challenging to keep track of all the options (especially in the theme!).

    ggedit is a package that helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics just right, all while keeping everything portable for further research and collaboration.

ggedit is powered by a Shiny gadget where the user inputs a ggplot plot object or a list of ggplot objects. You can run ggedit directly from the console, or from the Addins menu within RStudio.

    Installation

    devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")

    Layers

    The gadget creates a popup window which is populated by the information found in each layer. You can edit the aesthetic values found in a layer and see the changes happen in real time.


You can edit the aesthetic layers while still preserving the original plot, because the changed layers are cloned from the original plot object and are independent of it. The edited layers are provided in the output as objects, so you can use them independently of the plot using regular ggplot2 grammar. This is a great advantage when collaborating: you can send a plot to team members to edit the layer aesthetics, and they can send you back just the new layers to implement.
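
For example (a minimal sketch, assuming out is the list returned by ggedit and Scatter is the name of the edited plot), an edited layer can be reused with ordinary ggplot2 grammar:

    library(ggplot2)
    # 'out' is a hypothetical ggedit result; UpdatedLayers holds the cloned, edited layers
    p <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length))
    p + out$UpdatedLayers$Scatter[[1]] # drop the edited layer into a fresh plot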

    Themes

ggedit also has a theme editor inside. You can edit any element in the theme and see the changes in real time, making the trial-and-error process quick and easy. Once you are satisfied with the edited theme you can apply it to other plots in the plot list with one click, or even make it the session theme independent of the gadget. As with layers, the new theme object is part of the output, making collaboration easy.
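
A sketch of both options, under the same assumptions as above (out is a hypothetical ggedit result, p any ggplot object):

    p + out$UpdatedThemes$Scatter        # apply the edited theme to another plot
    theme_set(out$UpdatedThemes$Scatter) # or make it the session-wide default theme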


    Outputs

The gadget returns a list containing four elements:

    • updatedPlots
      • List containing updated ggplot objects
    • updatedLayers
      • For each plot a list of updated layers (ggproto) objects
      • Portable object
    • updatedLayersElements
      • For each plot a list elements and their values in each layer
      • Can be used to update the new values in the original code
    • updatedThemes
      • For each plot a list of updated theme objects
      • Portable object
      • If the user doesn’t edit the theme updatedThemes will not be returned

    rgg

After you finish editing the plots, the natural progression is to use them in the rest of the script. For this, ggedit provides the function rgg (remove and replace ggplot). Using this function you can chain changes to the plot into the original code without needlessly duplicating script.


    With this function you can

    Specify which layer you want to remove from a plot:

    ggObj%>%rgg('line')

Provide an index for a specific layer, in instances where there is more than one layer of the same type in the plot

    ggObj%>%rgg('line',2)

    Remove a layer from ggObj and replace it with a new one from the ggedit output p.out

    ggObj%>%rgg('line',newLayer = p.out$UpdatedLayers)

    Remove a layer and replace it with a new one and the new theme

    ggObj%>%rgg('line',newLayer = p.out$UpdatedLayers)+p.out$UpdatedThemes

There is also a plotting function for ggedit objects that creates a grid view for you and finds the best grid size for the number of plots in the list. For more exotic layouts you can give specific positions and the rest will be done for you. If you didn’t use ggedit, you can still add the class to any ggplot and use the plotting function just the same.

    plot(as.ggedit(list(p0,p1,p2,p3)),list(list(rows=1,cols=1:3),
                                           list(rows=2,cols=2),
                                           list(rows=2,cols=1),
                                           list(rows=2,cols=3))
    )

    Addin Launch

To launch the Shiny gadget from the addin menu, highlight the code that creates the plot object (or the plot’s name) in the source pane of RStudio, then click on ggedit in the Addins dropdown menu.



Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.

    To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.

Contact: For questions and comments, feel free to email me at: yonis@metrumrg.com or open an issue on GitHub.

    Read more »
  • R 3.3.2 is released!

R 3.3.2 (codename “Sincere Pumpkin Patch”) was released yesterday. You can get the latest binaries from here (or the .tar.gz source code from here). The full list of bug fixes and new features is provided below.

    Upgrading to R 3.3.2 on Windows

    If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

    install.packages("installr") # install 
    setInternet2(TRUE) # only for R versions older than 3.3.0
    installr::updateR() # updating R.

Running “updateR()” will detect whether a new R version is available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

    I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

    CHANGES IN R 3.3.2

    NEW FEATURES

    • extSoftVersion() now reports the version (if any) of the readline library in use.
    • The version of LAPACK included in the sources has been updated to 3.6.1, a bug-fix release including a speedup for the non-symmetric case of eigen().
    • Use options(deparse.max.lines=) to limit the number of lines recorded in .Traceback and other deparsing activities.
    • format(<AsIs>) looks more regular, also for non-character atomic matrices.
    • abbreviate() gains an option named = TRUE.
    • The online documentation for package methods is extensively rewritten. The goals are to simplify documentation for basic use, to note old features not recommended and to correct out-of-date information.
    • Calls to setMethod() no longer print a message when creating a generic function in those cases where that is natural: S3 generics and primitives.

    INSTALLATION and INCLUDED SOFTWARE

• Versions of the readline library >= 6.3 had been changed so that terminal window resizes were not signalled to readline: code has been added using an explicit signal handler to work around that (when R is compiled against readline >= 6.3). (PR#16604)
    • configure works better with Oracle Developer Studio 12.5.

    UTILITIES

    • R CMD check reports more dubious flags in files ‘src/Makevars[.in]’, including -w and -g.
    • R CMD check has been set up to filter important warnings from recent versions of gfortran with -Wall -pedantic: this now reports non-portable GNU extensions such as out-of-order declarations.
    • R CMD config works better with paths containing spaces, even those of home directories (as reported by Ken Beath).

    DEPRECATED AND DEFUNCT

    • Use of the C/C++ macro NO_C_HEADERS is deprecated (no C headers are included by R headers from C++ as from R 3.3.0, so it should no longer be needed).

    BUG FIXES

    • The check for non-portable flags in R CMD check could be stymied by ‘src/Makevars’ files which contained targets.
    • (Windows only) When using certain desktop themes in Windows 7 or higher, Alt-Tab could cause Rterm to stop accepting input. (PR#14406; patch submitted by Jan Gleixner.)
    • pretty(d, ..) behaves better for date-time d (PR#16923).
    • When an S4 class name matches multiple classes in the S4 cache, perform a dynamic search in order to obey namespace imports. This should eliminate annoying messages about multiple hits in the class cache. Also, pass along the package from the ClassExtends object when looking up superclasses in the cache.
    • sample(NA_real_) now works.
    • Packages using non-ASCII encodings in their code did not install data properly on systems using different encodings.
    • merge(df1, df2) now also works for data frames with column names "na.last", "decreasing", or "method". (PR#17119)
    • contour() caused a segfault if the labels argument had length zero. (Reported by Bill Dunlap.)
    • unique(warnings()) works more correctly, thanks to a new duplicated.warnings() method.
    • findInterval(x, vec = numeric(), all.inside = TRUE) now returns 0s as documented. (Reported by Bill Dunlap.)
    • (Windows only) R CMD SHLIB failed when a symbol in the resulting library had the same name as a keyword in the ‘.def’ file. (PR#17130)
    • pmax() and pmin() now work with (more ?) classed objects, such as "Matrix" from the Matrix package, as documented for a long time.
    • axis(side, x = D) and hence Axis() and plot() now work correctly for "Date" and time objects D, even when “time goes backward”, e.g., with decreasing xlim. (Reported by William May.)
    • str(I(matrix(..))) now looks as always intended.
    • plot.ts(), the plot() method for time series, now respects cex, lwd and lty. (Reported by Greg Werbin.)
    • parallel::mccollect() now returns a named list (as documented) when called with wait = FALSE. (Reported by Michel Lang.)
    • If a package added a class to a class union in another package, loading the first package gave erroneous warnings about “undefined subclass”.
    • c()‘s argument use.names is documented now, as belonging to the (C internal) default method. In “parallel”, argument recursive is also moved from the generic to the default method, such that the formal argument list of base generic c() is just (...).
    • rbeta(4, NA) and similarly rgamma() and rnbinom() now return NaN‘s with a warning, as other r<dist>(), and as documented. (PR#17155)
    • Using options(checkPackageLicense = TRUE) no longer requires acceptance of the licence for non-default standard packages such as compiler. (Reported by Mikko Korpela.)
    • split(<very_long>, *) now works even when the split off parts are long. (PR#17139)
    • min() and max() now also work correctly when the argument list starts with character(0). (PR#17160)
    • Subsetting very large matrices (prod(dim(.)) >= 2^31) now works thanks to Michael Schubmehl’s PR#17158.
    • bartlett.test() used residual sums of squares instead of variances, when the argument was a list of lm objects. (Reported by Jens Ledet Jensen).
    • plot(<lm>, which = *) now correctly labels the contour lines for the standardized residuals for which = 6. It also takes the correct p in case of singularities (also for which = 5). (PR#17161)
    • xtabs(~ exclude) no longer fails from wrong scope, thanks to Suharto Anggono’s PR#17147.
    • Reference class calls to methods() did not re-analyse previously defined methods, meaning that calls to methods defined later would fail. (Reported by Charles Tilford).
    • findInterval(x, vec, left.open = TRUE) misbehaved in some cases. (Reported by Dmitriy Chernykh.)


    Read more »
  • Set Application Domain Name with Shiny Server

Guest post by Avner Kantor

I used the wonderful tutorial by Dean Attali to set up my machine in Google Cloud. After I finished configuring it successfully, I wanted to redirect my domain to the Shiny application URL. This is a short description of how you can do it.

The first step is changing the domain’s name servers to point to your hosting supplier. You can find here a guide on how to do it for several suppliers. I used GoDaddy and Google Cloud DNS:

    • ns-cloud-b1.googledomains.com
    • ns-cloud-b2.googledomains.com
    • ns-cloud-b3.googledomains.com
    • ns-cloud-b4.googledomains.com

The tricky part is setting up the Nginx virtual hosts. The DigitalOcean tutorial helped me again.

We will create virtual hosts in etc/nginx/sites-available. In this directory you can find the default file you created when you set up the environment (here is my file). Now you should add a config file named after your domain name. Let’s assume it is example.com. Here is the config file you should have:

    $ sudo nano /etc/nginx/sites-available/example.com
    server {
      listen 80;
      listen [::]:80;
      root /var/www/html;
    
      server_name example.com www.example.com;
    
      location / {
        proxy_pass http://127.0.0.1:3838/example/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
      }
    }

It’s important to check that the default_server option is enabled in only a single active file – in our case: default. You can check this with grep -R default_server /etc/nginx/sites-enabled/.

    In order to enable the server block we will create symbolic links from these files to the sites-enabled directory, which Nginx reads from during startup.

    $ sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/

    If no problems were found, restart Nginx to enable your changes:

    sudo systemctl restart nginx

    Now that you are all set up, you should test that your server blocks are functioning correctly. You can do that by visiting the domains in your web browser: http://example.com.

    Read more »
  • Presidential Election Predictions 2016 (an ASA competition)

Guest post by Jo Hardin, professor of mathematics, Pomona College.

    ASA’s Prediction Competition

    In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

    http://thisisstatistics.org/electionprediction2016/

    To get you started, I’ve written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!

    Getting the Data

    Thanks to the Internet, there is a lot of polling data which is publicly accessible. For the competition, you are welcome to get your data from anywhere. However, I’m going to take mine from 538. http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/ (Other good sources of data are http://www.realclearpolitics.com/epolls/latest_polls/ and http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton and http://www.gallup.com/products/170987/gallup-analytics.aspx)

    Note the date indicated above as to when this R Markdown file was written. That’s the day the data were scraped from 538. If you run the Markdown file on a different day, you are likely to get different results as the polls are constantly being updated.

    Because the original data were scraped as a JSON file, it gets pulled into R as a list of lists. The data wrangling used to convert it into a tidy format is available from the source code at https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd.

# packages assumed by this snippet: XML (htmlParse, xpathSApply) and RJSONIO (fromJSON);
# the exact libraries are listed in the source Rmd linked above
library(XML)
library(RJSONIO)

url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url, useInternalNodes = TRUE)

sc = xpathSApply(doc, "//script[contains(., 'race.model')]",
                function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))

jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)

data = fromJSON(jsobj)
allpolls <- data$polls

#unlisting the whole thing
indx <- sapply(allpolls, length)
pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, `length<-`, max(indx))))

    A Quick Visualization

    Before coming up with a prediction for the vote percentages for the 2016 US Presidential Race, it is worth trying to look at the data. The data are in a tidy form, so ggplot2 will be the right tool for visualizing the data.

# requires ggplot2 and lubridate (for ymd())
ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),
                        aes(y=adj_pct, x=endDate, color=choice)) +
  geom_line() + geom_point(aes(size=wtnow)) +
  labs(title = "Vote percentage by date and poll weight\n",
    y = "Percent Vote if Election Today", x = "Poll Date",
    color = "Candidate", size="538 Poll\nWeight")

A Quick Analysis

    Let’s try to think about the percentage of votes that each candidate will get based on the now cast polling percentages. We’d like to weight the votes based on what 538 thinks (hey, they’ve been doing this longer than I have!), the sample size, and the number of days since the poll closed.

    Using my weight, I’ll calculate a weighted average and a weighted SE for the predicted percent of votes. (The SE of the weighted variance is taken from Cochran (1977) and cited in Gatz and Smith (1995).) The weights can be used to calculate the average or the running average for the now cast polling percentages.
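
Note that weighted.var.se() used below is not part of base R; it is defined in the source Rmd linked above. A sketch of the Cochran (1977) estimator it implements (following the presentation in Gatz and Smith (1995)) looks like this:

    weighted.var.se <- function(x, w, na.rm = FALSE) {
      # variance of the weighted mean, following Cochran (1977)
      if (na.rm) { keep <- !is.na(x) & !is.na(w); x <- x[keep]; w <- w[keep] }
      n     <- length(w)
      xWbar <- weighted.mean(x, w)
      wbar  <- mean(w)
      n / ((n - 1) * sum(w)^2) *
        (sum((w * x - wbar * xWbar)^2) -
           2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
           xWbar^2 * sum((w - wbar)^2))
    }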

allpolldata2 <- allpolldata %>%
  filter(wtnow > 0) %>%
  filter(polltypeA == "now") %>%
  mutate(dayssince = as.numeric(today() - endDate)) %>%
  mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
  mutate(votewt = wt*pct) %>%
  group_by(choice) %>%
  arrange(choice, -dayssince) %>%
  mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
  mutate(cum.mean = cummean(pct))

    Plotting the Cumulative Mean / Weighted Mean

    In tidy format, the data are ready to plot. Note that the cumulative mean is much smoother than the cumulative weighted mean because the weights are much heavier toward the later polls.

ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
                        aes(y=cum.mean, x=endDate, color=choice)) +
  geom_line() + geom_point(aes(size=wt)) +
    labs(title = "Cumulative Mean Vote Percentage\n",
    y = "Cumulative Percent Vote if Election Today", x = "Poll Date",
    color = "Candidate", size="Calculated Weight")

ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
                        aes(y=cum.mean.wt, x=endDate, color=choice)) +
  geom_line() + geom_point(aes(size=wt)) +
  labs(title = "Cumulative Weighted Mean Vote Percentage\n",
    y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",
    color = "Candidate", size="Calculated Weight")

    Additionally, the weighted average and the SE of the average (given by Cochran (1977)) can be computed for each candidate. Using the formula, we have our prediction of the final percentage of the popular vote for each major candidate!

pollsummary <- allpolldata2 %>%
  select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
  group_by(choice) %>%
  summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),
           std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))

pollsummary

    ## # A tibble: 2 x 3
    ##    choice mean.vote  std.vote
    ##     <chr>     <dbl>     <dbl>
    ## 1 Clinton  43.64687 0.5073492
    ## 2   Trump  39.32071 1.1792667

    Other people’s advice

    Prediction is very difficult, especially about the future. – Niels Bohr

Along with good data sources, you should also be able to find information about prediction and modeling. I’ve provided a few resources to get you started.

• Andrew Gelman: http://andrewgelman.com/2016/08/17/29654/
• Sam Wang: http://election.princeton.edu/2016/08/21/sharpening-the-forecast/
• Fivethirtyeight: http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
• Christensen and Florence: http://www.amstat.org/misc/tasarticle.pdf and https://tofu.byu.edu/electionpollproject/

    Read more »
  • The reproducibility crisis in science and prospects for R

    Guest post by Gregorio Santori (<gregorio.santori@unige.it>)

The results that emerged from a recent Nature survey confirm that, for many researchers, we are living in a weak-reproducibility age (Baker M. Is there a reproducibility crisis? Nature 2016;533:453-454). Although the definition of reproducibility can vary widely between disciplines, the survey adopted the version for which “another scientist using the same methods gets similar results and can draw the same conclusions” (Reality check on reproducibility. Nature 2016;533:437). Already in 2009, Roger Peng had formulated a very attractive definition of reproducibility: “In many fields of study there are examples of scientific investigations that cannot be fully replicated because of a lack of time or resources. In such a situation there is a need for a minimum standard that can fill the void between full replication and nothing. One candidate for this minimum standard is «reproducible research», which requires that data sets and computer code be made available to others for verifying published results and conducting alternative analyses” (Peng R. Reproducible research and Biostatistics. Biostatistics. 2009;10:405-408). For many readers of R-bloggers, Peng’s formulation probably means in the first place a combination of R, LaTeX, Sweave, knitr, R Markdown, RStudio, and GitHub. From the broader perspective of scholarly journals, it mainly means web repositories for experimental protocols, raw data, and source code.

Although researchers and funders can contribute in many ways to reproducibility, scholarly journals seem to be in a position to give a decisive push towards more reproducible research. In the incipit of the “Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals“, developed by the International Committee of Medical Journal Editors (ICMJE), there is an explicit reference to reproducibility. Moreover, the same ICMJE Recommendations state that “the Methods section should aim to be sufficiently detailed such that others with access to the data would be able to reproduce the results“, while “[the Statistics section] describe[s] statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results“.

    In December 2010, Nature Publishing Group launched Protocol Exchange, “[…] an Open Repository for the deposition and sharing of protocols for scientific research“, where “protocols […] are presented subject to a Creative Commons Attribution-NonCommercial licence“.

    In December 2014, PLOS journals announced a new policy for data sharing, resulted in the Data Availability Statement for submitted manuscripts.

    In June 2014, at the American Association for the Advancement of Science headquarter, the US National Institute of Health held a joint workshop on the reproducibility, with the participation of the Nature Publishing Group, Science, and the editors representing over 30 basic/preclinical science journals. The workshop resulted in the release of the “Principles and Guidelines for Reporting Preclinical Research“, where rigorous statistical analysis and data/material sharing were emphasized.

In this scenario, I have recently suggested a global “statement for reproducibility” (Research papers: Journals should drive data reproducibility. Nature 2016;535:355). One of the strong points of this proposed statement is the ban of “point-and-click” statistical software: for papers with a “Statistical analysis” section, only original studies carried out using source-code-based statistical environments should be admitted to peer review. In any case, the current policies adopted by scholarly journals seem to be moving towards stringent criteria to ensure more reproducible research. In the near future, the space for “point-and-click” statistical software will progressively shrink, and a cross-platform, open-source language/environment such as R is destined to play a key role.

    Read more »
  • Using 2D Contour Plots within {ggplot2} to Visualize Relationships between Three Variables

    Guest post by John Bellettiere, Vincent Berardi, Santiago Estrada

    The Goal

    To visually explore relations between two related variables and an outcome using contour plots. We use the contour function in Base R to produce contour plots that are well-suited for initial investigations into three dimensional data. We then develop visualizations using ggplot2 to gain more control over the graphical output. We also describe several data transformations needed to accomplish this visual exploration.

    The Dataset

    The mtcars dataset provided with Base R contains results from Motor Trend road tests of 32 cars that took place between 1973 and 1974. We focus on the following three variables: wt (weight, 1000lbs), hp (gross horsepower), qsec (time required to travel a quarter mile). qsec is a measure of acceleration with shorter times representing faster acceleration. It is reasonable to believe that weight and horsepower are jointly related to acceleration, possibly in a nonlinear fashion.

    head(mtcars)

    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

    Preliminary Visualizations

    To start, we look at a simple scatter plot of weight by horsepower, with each data point colored according to quartiles of acceleration. We first create a new variable to represent quartiles of acceleration using the cut and quantile functions.

    mtcars$quart <- cut(mtcars$qsec, quantile(mtcars$qsec))

    From here, we use ggplot to visualize the data. We selected colors that were sequential and color blind friendly using ColorBrewer and manually added them to the scale_colour_manual() argument within the ggplot() call below. Labels were also manually added to improve interpretation.

ggplot(mtcars, aes(x = wt, y = hp, color = factor(quart))) +
       geom_point(shape = 16, size = 5) +
       theme(legend.position = c(0.80, 0.85),
            legend.background = element_rect(colour = "black"),
            panel.background = element_rect(fill = "black")) +
       labs(x = "Weight (1,000lbs)", y = "Horsepower") +
       scale_colour_manual(values = c("#fdcc8a", "#fc8d59", "#e34a33", "#b30000"),
                          name = "Quartiles of qsec",
                          labels = c("14.5-16.9s", "17.0-17.7s", "17.8-18.9s", "19.0-22.9s"))

This plot provides a first look at the interrelationships of the three variables of interest. To get a different representation of these relations, we use contour plots.

    Preparing the Data for Contour Plots in Base R

    The contour function requires three dimensional data as an input. We are interested in estimating acceleration for all possible combinations of weight and horsepower using the available data, thereby generating three dimensional data. To compute the estimates, a two-dimensional loess model is fit to the data using the following call:

    data.loess <- loess(qsec ~ wt * hp, data = mtcars)

    The model contained within the resulting loess object is then used to output the three-dimensional dataset needed for plotting. We do that by generating a sequence of values with uniform spacing over the range of wt and hp. An arbitrarily chosen distance of 0.3 between sequence elements was used to give a relatively fine resolution to the data. Using the predict function, the loess model object is used to estimate a qsec value for each combination of values in the two sequences. These estimates are stored in a matrix where each element of the wt sequence is represented by a row and each element of the hp sequence is represented by a column.

    # Create a sequence of incrementally increasing (by 0.3 units) values for both wt and hp
    xgrid <-  seq(min(mtcars$wt), max(mtcars$wt), 0.3)
    ygrid <-  seq(min(mtcars$hp), max(mtcars$hp), 0.3)
    # Generate a dataframe with every possible combination of wt and hp
    data.fit <-  expand.grid(wt = xgrid, hp = ygrid)
    # Feed the dataframe into the loess model and receive a matrix output with estimates of
    # acceleration for each combination of wt and hp
    mtrx3d <-  predict(data.loess, newdata = data.fit)
    # Abbreviated display of final matrix
    mtrx3d[1:4, 1:4]

    ##           hp
    ## wt         hp= 52.0 hp= 52.3 hp= 52.6 hp= 52.9
    ##   wt=1.513 19.04237 19.03263 19.02285 19.01302
    ##   wt=1.813 19.25566 19.24637 19.23703 19.22764
    ##   wt=2.113 19.55298 19.54418 19.53534 19.52645
    ##   wt=2.413 20.06436 20.05761 20.05077 20.04383

    We then visualize the resulting three dimensional data using the contour function.

contour(x = xgrid, y = ygrid, z = mtrx3d, xlab = "Weight (1,000lbs)", ylab = "Horsepower")

Preparing the Data for Contour Plots in ggplot2

To use ggplot, we manipulate the data into “long format” using the melt function from the reshape2 package. We add names for all of the resulting columns for clarity. An unfortunate side effect of the predict function used to populate the initial 3d dataset is that all of the row and column values of the resulting matrix are of type character, in the form “variable = value“. The character portion of these values needs to be removed first and the remaining values converted to numeric. This is done using str_locate (from the stringr package) to locate the “=” character, and then str_sub (also from stringr) to extract only the numerical portion of the string. Finally, as.numeric is used to convert the results to the appropriate class.

library(reshape2) # melt()
library(stringr)  # str_sub(), str_locate()
# Transform data to long form
mtrx.melt <- melt(mtrx3d, id.vars = c("wt", "hp"), measure.vars = "qsec")
names(mtrx.melt) <- c("wt", "hp", "qsec")
# Return data to numeric form
mtrx.melt$wt <- as.numeric(str_sub(mtrx.melt$wt, str_locate(mtrx.melt$wt, "=")[1,1] + 1))
mtrx.melt$hp <- as.numeric(str_sub(mtrx.melt$hp, str_locate(mtrx.melt$hp, "=")[1,1] + 1))

    head(mtrx.melt)

    ##      wt hp     qsec
    ## 1 1.513 52 19.04237
    ## 2 1.813 52 19.25566
    ## 3 2.113 52 19.55298
    ## 4 2.413 52 20.06436
    ## 5 2.713 52 20.65788
    ## 6 3.013 52 20.88378

Using ggplot2 to Create Contour Plots

    Basic Contour Plot

    With the data transformed into “long” form, we can make contour plots with ggplot2. With the most basic parameters in place, we see:

    plot1 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
             stat_contour()

    The resulting plot is not very descriptive and has no indication of the values of qsec.

    Contour plot with plot region colored using a continuous outcome variable (qsec).

    To aid in our plot’s descriptive value, we add color to the contour plot based on values of qsec.

plot2 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
         stat_contour(geom = "polygon", aes(fill = ..level..)) +
         geom_tile(aes(fill = qsec)) +
         stat_contour(bins = 15) +
         xlab("Weight (1,000lbs)") +
         ylab("Horsepower") +
         guides(fill = guide_colorbar(title = "¼ Mi. Time (s)"))

    Contour plot with plot region colored using discrete levels

    Another option could be to add colored regions between contour lines. In this case, we will split qsec into 10 equal segments using the cut function.

# Create ten segments to be colored in
mtrx.melt$equalSpace <- cut(mtrx.melt$qsec, 10)
# Sort the segments in ascending order
breaks <- levels(unique(mtrx.melt$equalSpace))
# Plot
plot3 <- ggplot() +
         geom_tile(data = mtrx.melt, aes(wt, hp, qsec, fill = equalSpace)) +
         geom_contour(color = "white", alpha = 0.5) +
         theme_bw() +
         xlab("Weight (1,000lbs)") +
         ylab("Horsepower") +
         scale_fill_manual(values = c("#35978f", "#80cdc1", "#c7eae5", "#f5f5f5",
                                     "#f6e8c3", "#dfc27d", "#bf812d", "#8c510a",
                                     "#543005", "#330000"),
                           name = "¼ Mi. Time (s)", breaks = breaks, labels = breaks)

    ## Warning in max(vapply(evaled, length, integer(1))): no non-missing
    ## arguments to max; returning -Inf

    Note: in the lower right hand corner of the graph above, there is a region where increasing weight is associated with decreasing ¼ mile times, which is not characteristic of the true relation between weight and acceleration. This is due to extrapolation that the predict function performed while creating predictions for qsec for combinations of weight and height that did not exist in the raw data. This cannot be avoided using the methods described above. A well-placed rectangle (geom_rect) or placing the legend over the offending area can conceal this region (see example below).
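
For instance (a sketch; the rectangle coordinates are illustrative, chosen roughly to cover the lower-right corner of this plot):

    # conceal the extrapolated region with a white rectangle drawn over the tiles
    plot3 + annotate("rect", xmin = 4.5, xmax = 5.5, ymin = 52, ymax = 120,
                     fill = "white", colour = NA)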

    Contour plot with contour lines colored using a continuous outcome variable (qsec)

Instead of coloring the whole plot, it may be more desirable to color just the contour lines of the plot. This can be achieved by mapping the contour level to the colour aesthetic in stat_contour, rather than filling regions via scale_fill_manual. We also chose to move the legend into the area of extrapolation.

plot4 <- ggplot()  +
         theme_bw() +
         xlab("Weight (1,000lbs)") +
         ylab("Horsepower") +
         stat_contour(data = mtrx.melt, aes(x = wt, y = hp, z = qsec, colour = ..level..),
                     breaks = round(quantile(mtrx.melt$qsec, seq(0, 1, 0.1)), 0), size = 1) +
         scale_color_continuous(name = "¼ Mi. Time (s)") +
         theme(legend.justification=c(1, 0), legend.position=c(1, 0))

    Contour plot with contour lines colored using a continuous outcome variable and overlaying scatterplot of weight and horsepower.

    We can also overlay the raw data from mtcars onto the previous plot.

plot5 <- plot4 +
         geom_point(data = mtcars, aes(x = wt, y = hp), shape = 1, size = 2.5, color = "red")

Contour plot with contour lines colored using a continuous outcome variable and labeled using direct.label()

With color-coded contour lines, as seen in the previous example, it may be difficult to differentiate the values of qsec that each line represents. Although we supplied a legend to the preceding plot, using direct.label from the directlabels package can clarify the values of qsec.

plot6 <- direct.label(plot5, "bottom.pieces")

    We hope that these examples were of help to you and that you are better able to visualize your data as a result.

    For questions, corrections, or suggestions for improvement, contact John at JBellettiere@ucsd.edu or using @JohnBellettiere via Twitter.

    Read more »
  • R 3.3.1 is released

R 3.3.1 (codename “Bug in Your Hair”) was released yesterday. You can get the latest binaries from here (or the .tar.gz source code from here). The full list of bug fixes is provided below (this release does not introduce new features).

    Upgrading to R 3.3.1 on Windows

    If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

    install.packages("installr") # install 
    setInternet2(TRUE) # only for R versions older than 3.3.0
    installr::updateR() # updating R.

Running “updateR()” will detect whether a new R version is available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

    I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

    CHANGES IN R 3.3.1

    BUG FIXES

    • R CMD INSTALL and hence install.packages() gave an internal error installing a package called description from a tarball on a case-insensitive file system.
• match(x, t) (and hence x %in% t) failed when x was of length one, and either character and x and t only differed in their Encoding, or when x and t were complex with NAs or NaNs. (PR#16885.)
    • unloadNamespace(ns) also works again when ns is a ‘namespace’, as from getNamespace().
    • rgamma(1,Inf) or rgamma(1, 0,0) no longer give NaN but the correct limit.
    • length(baseenv()) is correct now.
    • pretty(d, ..) for date-time d rarely failed when "halfmonth" time steps were tried (PR#16923) and on ‘inaccurate’ platforms such as 32-bit windows or a configuration with --disable-long-double; see comment #15 of PR#16761.
    • In text.default(x, y, labels), the rarely(?) used default for labels is now correct also for the case of a 2-column matrix x and missing y.
    • as.factor(c(a = 1L)) preserves names() again as in R < 3.1.0.
    • strtrim(""[0], 0[0]) now works.
• Use of Ctrl-C to terminate a reverse incremental search started by Ctrl-R in the readline-based Unix terminal interface is now supported for readline >= 6.3 (Ctrl-G always worked). (PR#16603)
    • diff(<difftime>) now keeps the "units" attribute, as subtraction already did, PR#16940.


    Read more »
  • heatmaply: interactive heat maps (with R)

    I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package.

    tl;dr

    By running the following 3 lines of code:

    install.packages("heatmaply")
    library(heatmaply)
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))

    You will get this output in your browser (or RStudio console):

You can see more examples in the online vignette on CRAN. For issue reports or feature requests, please visit the GitHub repo.

    Introduction

A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing-values patterns, and more.

Interactive heatmaps allow the inspection of specific values by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area.

This work is based on the ggplot2 and plotly.js engines. It produces heatmaps similar to those of d3heatmap, with the advantages of speed (plotly.js is able to handle larger matrices), the ability to zoom from the dendrogram (thanks to the dendextend R package), and the prospect of new features in the future (such as side bars).

    Why heatmaply

The heatmaply package is designed to have features and a user interface familiar from heatmap, gplots::heatmap.2, and other functions for static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way (see the short sketch after the list below). heatmaply includes the following features:

    • Shows the row/column/value under the mouse cursor (and includes a legend on the side)
    • Drag a rectangle over the heatmap image, or the dendrograms, in order to zoom in (the dendrogram coloring relies on integration with the dendextend package)
    • Works from the R console, in RStudio, with R Markdown, and with Shiny
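
For example (a minimal sketch; the argument values are illustrative), the familiar scaling and dendrogram options carry over directly:

    library(heatmaply)
    heatmaply(mtcars, scale = "column", Rowv = TRUE, Colv = TRUE) # scale columns, keep both dendrograms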

The package is similar to the d3heatmap package (developed by the brilliant Joe Cheng), but is based on the plotly R package. Performance-wise it can handle larger matrices. Furthermore, since it is based on ggplot2+plotly, it is expected to gain more features in the future (as it is more easily extendable, also by non-JavaScript experts). I chose to build heatmaply on top of plotly.js since it is a free, open-source JavaScript library that can translate ggplot2 figures into self-contained interactive JavaScript objects (which can be viewed in your browser or RStudio).

    The default color palette for the heatmap is based on the beautiful viridis package. Also, by using the dendextend package (see the open-access two-page bioinformatics paper), you can customize dendrograms before sending them to heatmaply (via Rowv and Colv).
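
A sketch of that workflow (the k = 3 branch coloring is just for illustration):

    library(heatmaply)
    library(dendextend)
    # build and customize a row dendrogram before passing it to heatmaply
    row_dend <- mtcars %>% dist %>% hclust %>% as.dendrogram %>%
      set("branches_k_color", k = 3) # color branches by 3 clusters
    heatmaply(mtcars, Rowv = row_dend)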

    You can see some more eye-candy in the online Vignette on CRAN, for example:


    For issue reports or feature requests, please visit the GitHub repo.

    Read more »
  • Copyright Use-R.com 2012 - 2016 ©