RSS Feed : R-Statistics.com

RSS Feed from http://www.r-statistics.com

  • ggedit 0.0.2: a GUI for advanced editing of ggplot2 objects

    Guest post by Jonathan Sidi, Metrum Research Group

    Last week the updated version of ggedit was presented at RStudio::conf2017. First, a BIG thank you to the whole RStudio team for a great conference and for being awesome enough to answer the insane number of questions I had (sorry!). For a quick intro to the package, see the previous post.

    To install the package:

    devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")

    Highlights of the updated version.

    • verbose script handling during updating in the gadget (see video below)
    • verbose script output for updated layers and theme to parse and evaluate in console or editor
    • colourpicker control for both single colours/fills and palettes
    • output for scale objects, e.g. scale_*_gradient, scale_*_gradientn and scale_*_manual
    • verbose script output for scales, e.g. scale_*_gradient, scale_*_gradientn and scale_*_manual, to parse and evaluate in console or editor
    • input plot objects can have the data in the layer object and in the base object.
      • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))+geom_point()
      • ggplot(data=iris,aes(x=Sepal.Width,y=Sepal.Length))+geom_point(aes(colour=Species))
      • ggplot()+geom_point(data=iris,aes(x=Sepal.Width,y=Sepal.Length,colour=Species))
    • plot.theme(): S3 method for class ‘theme’
      • visualizing theme objects in single output
      • visual comparison of two theme objects in single output
      • will be expanded upon in upcoming post

    RStudio::conf2017 Presentation

    #devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")
    rm(list=ls())
    library(ggedit)
    #?ggedit
    
    p0=list(
      Scatter=iris%>%ggplot(aes(x =Sepal.Length,y=Sepal.Width))+
        geom_point(aes(colour=Species),size=6),
      
      ScatterFacet=iris%>%ggplot(aes(x =Sepal.Length,y=Sepal.Width))+
        geom_point(aes(colour=Species),size=6)+
          geom_line(linetype=2)+
        facet_wrap(~Species,scales='free')+
        labs(title='Some Title')
      )
    
    #a=ggedit(p.in = p0,verbose = T) #run ggedit
    dat_url <- paste0("https://raw.githubusercontent.com/metrumresearchgroup/ggedit/master/RstudioExampleObj.rda")
    load(url(dat_url)) #pre-run example
    
    ldply(a,names)
    ##                     .id      V1           V2
    ## 1          UpdatedPlots Scatter ScatterFacet
    ## 2         UpdatedLayers Scatter ScatterFacet
    ## 3 UpdatedLayersElements Scatter ScatterFacet
    ## 4     UpdatedLayerCalls Scatter ScatterFacet
    ## 5         updatedScales Scatter ScatterFacet
    ## 6    UpdatedScalesCalls Scatter ScatterFacet
    ## 7         UpdatedThemes Scatter ScatterFacet
    ## 8     UpdatedThemeCalls Scatter ScatterFacet
    plot(a)

    comparePlots=c(p0,a$UpdatedPlots)
    names(comparePlots)[c(3:4)]=paste0(names(comparePlots)[c(3:4)],"Updated")

    Initial Comparison Plot

    plot(as.ggedit(comparePlots))

    Apply updated theme of first plot to second plot

    comparePlots$ScatterFacetNewTheme=p0$ScatterFacet+a$UpdatedThemes$Scatter
    
    plot(as.ggedit(comparePlots[c("ScatterFacet","ScatterFacetNewTheme")]),
          plot.layout = list(list(rows=1,cols=1),list(rows=2,cols=1))
         )

    Using Remove and Replace Function

    Overlay two layers of same geom

    (comparePlots$ScatterMistake=p0$Scatter+a$UpdatedLayers$ScatterFacet[[1]])

    Remove

    (comparePlots$ScatterNoLayer=p0$Scatter%>%
      rgg(oldGeom = 'point'))

    Replace Geom_Point layer on Scatter Plot

    (comparePlots$ScatterNewLayer=p0$Scatter%>%
      rgg(oldGeom = 'point',
          oldGeomIdx = 1,
          newLayer = a$UpdatedLayers$ScatterFacet[[1]]))

    Remove and Replace Geom_Point layer and add the new theme

    (comparePlots$ScatterNewLayerTheme=p0$Scatter%>%
      rgg(oldGeom = 'point',
          newLayer = a$UpdatedLayers$ScatterFacet[[1]])+
      a$UpdatedThemes$Scatter)

    Cloning Layers

    A geom_point layer

    (l=p0$Scatter$layers[[1]])
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity

    Clone the layer

    (l1=cloneLayer(l))
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity
    all.equal(l,l1)
    ## [1] TRUE

    Verbose copy of layer

    (l1.txt=cloneLayer(l,verbose = T))
    ## [1] "geom_point(mapping=aes(colour=Species),na.rm=FALSE,size=6,data=NULL,position=\"identity\",stat=\"identity\",show.legend=NA,inherit.aes=TRUE)"

    Parse the text

    (l2=eval(parse(text=l1.txt)))
    ## mapping: colour = Species 
    ## geom_point: na.rm = FALSE
    ## stat_identity: na.rm = FALSE
    ## position_identity
    all.equal(l,l2)
    ## [1] TRUE
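
    The round trip above rests on base R's parse() and eval(). As a minimal sketch that does not need ggedit (the example call and the object names are only illustrations), a layer call can be stored as text, parsed back into a call, and compared with the original. The evaluation step runs only if ggplot2 is installed:

    ```r
    # A layer call written out as text, similar in spirit to the verbose output.
    layer_txt <- 'geom_point(mapping = aes(colour = Species), size = 6)'

    # parse() returns an expression vector; its first element is the call itself.
    layer_call <- parse(text = layer_txt)[[1]]

    # Deparsing the call and parsing it again gives back an identical call.
    round_trip <- parse(text = paste(deparse(layer_call), collapse = ""))[[1]]
    same_call <- identical(layer_call, round_trip)

    # Evaluating the call builds a real layer object (only when ggplot2 is available).
    if (requireNamespace("ggplot2", quietly = TRUE)) {
      layer_obj <- eval(layer_call, envir = asNamespace("ggplot2"))
    }
    ```

    Because the text form is just a character string, it can be pasted into a script or sent to a collaborator and rebuilt with eval(parse(text = ...)) on the other side.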

    Back to our example

      #Original geom_point layer
      parse(text=cloneLayer(p0$ScatterFacet$layers[[1]],verbose = T))
    ## expression(geom_point(mapping = aes(colour = Species), na.rm = FALSE, 
    ##     size = 6, data = NULL, position = "identity", stat = "identity", 
    ##     show.legend = NA, inherit.aes = TRUE))
      #new Layer
      parse(text=a$UpdatedLayerCalls$ScatterFacet[[1]])
    ## expression(geom_point(mapping = aes(colour = Species), na.rm = FALSE, 
    ##     size = 3, shape = 22, fill = "#BD2020", alpha = 1, stroke = 0.5, 
    ##     data = NULL, position = "identity", stat = "identity", show.legend = NA, 
    ##     inherit.aes = TRUE))

    Visualize Themes

    pTheme=list()
    (pTheme$Base=plot(a$UpdatedThemes$Scatter))

    Visualize Part of Themes

    (pTheme$Select=plot(a$UpdatedThemes$Scatter,themePart = c('plot','legend'),fnt = 18))

    Visually Compare Theme

    (pTheme$Compare=plot(obj=a$UpdatedThemes$Scatter,obj2 = ggplot2:::theme_get()))


    Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings.

    To learn more about additional open-source software packages developed by Metrum Research Group please visit the Metrum website.

    Contact: For questions and comments, feel free to email me at: [email protected] or open an issue in github.

  • ggedit – interactive ggplot aesthetic and theme editor

    Guest post by Jonathan Sidi, Metrum Research Group

    ggplot2 has become the standard of plotting in R for many users. New users, however, may find the learning curve steep at first, and more experienced users may find it challenging to keep track of all the options (especially in the theme!).

    ggedit is a package that helps users bridge the gap between making a plot and getting all of those pesky plot aesthetics just right, all while keeping everything portable for further research and collaboration.

    ggedit is powered by a Shiny gadget where the user inputs a ggplot plot object or a list of ggplot objects. You can run ggedit directly from the console or from the Addin menu within RStudio.

    Installation

    devtools::install_github("metrumresearchgroup/ggedit",subdir="ggedit")

    Layers

    The gadget creates a popup window which is populated by the information found in each layer. You can edit the aesthetic values found in a layer and see the changes happen in real time.


    You can edit the aesthetic layers while still preserving the original plot, because the changed layers are cloned from the original plot object and are independent of it. The edited layers are provided in the output as objects, so you can use the layers independently of the plot using regular ggplot2 grammar. This is a great advantage when collaborating with other people: you can send a plot to team members so they can edit the layer aesthetics, and they can send you back just the new layers for you to implement.

    Themes

    ggedit also has a theme editor inside. You can edit any element in the theme and see the changes in real time, making the trial and error process quick and easy. Once you are satisfied with the edited theme you can apply it to other plots in the plot list with one click or even make it the session theme regardless of the gadget. As with layers, the new theme object is part of the output, making collaboration easy.


    Outputs

    The gadget returns a list containing 4 elements

    • updatedPlots
      • List containing updated ggplot objects
    • updatedLayers
      • For each plot a list of updated layers (ggproto) objects
      • Portable object
    • updatedLayersElements
      • For each plot a list of elements and their values in each layer
      • Can be used to update the new values in the original code
    • updatedThemes
      • For each plot a list of updated theme objects
      • Portable object
      • If the user doesn’t edit the theme updatedThemes will not be returned

    rgg

    After you finish editing the plots, the natural progression is to use them in the rest of the script. For this, ggedit provides the function rgg (remove and replace ggplot). Using this function you can chain changes to the plot into the original code without needlessly duplicating script.


    With this function you can

    Specify which layer you want to remove from a plot:

    ggObj%>%rgg('line')

    Provide an index to a specific layer, in instances where there is more than one layer of the same type in the plot:

    ggObj%>%rgg('line',2)

    Remove a layer from ggObj and replace it with a new one from the ggedit output p.out

    ggObj%>%rgg('line',newLayer = p.out$UpdatedLayers)

    Remove a layer and replace it with a new one and the new theme

    ggObj%>%rgg('line',newLayer = p.out$UpdatedLayers)+p.out$UpdatedThemes

    There is also a plotting function for ggedit objects that creates a grid view for you and finds the best grid size for the number of plots you have in the list. For more exotic layouts you can give specific positions and the rest will be done for you. If you didn’t use ggedit, you can still add the class to any ggplot and use the plotting function just the same.

    plot(as.ggedit(list(p0,p1,p2,p3)),list(list(rows=1,cols=1:3),
                                           list(rows=2,cols=2),
                                           list(rows=2,cols=1),
                                           list(rows=2,cols=3))
    )

    Addin Launch

    To launch the Shiny gadget from the addin menu, highlight the code that creates the plot object, or the plot name, in the source pane of RStudio, then click on the ggedit addin in the Addins dropdown menu.



  • R 3.3.2 is released!

    R 3.3.2 (codename “Sincere Pumpkin Patch”) was released yesterday. You can get the latest binaries from here (or the .tar.gz source code from here). The full list of bug fixes and new features is provided below.

    Upgrading to R 3.3.2 on Windows

    If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

    install.packages("installr") # install 
    setInternet2(TRUE) # only for R versions older than 3.3.0
    installr::updateR() # updating R.

    Running “updateR()” will detect whether a new R version is available and, if so, download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).
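
    The comment on setInternet2() in the snippet above can be turned into an explicit version check. This is a minimal sketch; the helper name needs_internet2 is hypothetical and is not part of installr:

    ```r
    # Hypothetical helper: TRUE when the given R version is older than 3.3.0,
    # i.e. when the setInternet2(TRUE) step applies.
    needs_internet2 <- function(version = getRversion()) {
      as.package_version(version) < "3.3.0"
    }

    old_r <- needs_internet2("3.2.5")  # an R version older than 3.3.0
    new_r <- needs_internet2("3.3.2")  # the release described in this post
    ```

    On Windows, the call could then be guarded as if (needs_internet2()) setInternet2(TRUE), so the same script can be reused across R versions.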

    I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

    CHANGES IN R 3.3.2

    NEW FEATURES

    • extSoftVersion() now reports the version (if any) of the readline library in use.
    • The version of LAPACK included in the sources has been updated to 3.6.1, a bug-fix release including a speedup for the non-symmetric case of eigen().
    • Use options(deparse.max.lines=) to limit the number of lines recorded in .Traceback and other deparsing activities.
    • format(<AsIs>) looks more regular, also for non-character atomic matrices.
    • abbreviate() gains an option named = TRUE.
    • The online documentation for package methods is extensively rewritten. The goals are to simplify documentation for basic use, to note old features not recommended and to correct out-of-date information.
    • Calls to setMethod() no longer print a message when creating a generic function in those cases where that is natural: S3 generics and primitives.

    INSTALLATION and INCLUDED SOFTWARE

    • Versions of the readline library >= 6.3 had been changed so that terminal window resizes were not signalled to readline: code has been added using an explicit signal handler to work around that (when R is compiled against readline >= 6.3). (PR#16604)
    • configure works better with Oracle Developer Studio 12.5.

    UTILITIES

    • R CMD check reports more dubious flags in files ‘src/Makevars[.in]’, including -w and -g.
    • R CMD check has been set up to filter important warnings from recent versions of gfortran with -Wall -pedantic: this now reports non-portable GNU extensions such as out-of-order declarations.
    • R CMD config works better with paths containing spaces, even those of home directories (as reported by Ken Beath).

    DEPRECATED AND DEFUNCT

    • Use of the C/C++ macro NO_C_HEADERS is deprecated (no C headers are included by R headers from C++ as from R 3.3.0, so it should no longer be needed).

    BUG FIXES

    • The check for non-portable flags in R CMD check could be stymied by ‘src/Makevars’ files which contained targets.
    • (Windows only) When using certain desktop themes in Windows 7 or higher, Alt-Tab could cause Rterm to stop accepting input. (PR#14406; patch submitted by Jan Gleixner.)
    • pretty(d, ..) behaves better for date-time d (PR#16923).
    • When an S4 class name matches multiple classes in the S4 cache, perform a dynamic search in order to obey namespace imports. This should eliminate annoying messages about multiple hits in the class cache. Also, pass along the package from the ClassExtends object when looking up superclasses in the cache.
    • sample(NA_real_) now works.
    • Packages using non-ASCII encodings in their code did not install data properly on systems using different encodings.
    • merge(df1, df2) now also works for data frames with column names "na.last", "decreasing", or "method". (PR#17119)
    • contour() caused a segfault if the labels argument had length zero. (Reported by Bill Dunlap.)
    • unique(warnings()) works more correctly, thanks to a new duplicated.warnings() method.
    • findInterval(x, vec = numeric(), all.inside = TRUE) now returns 0s as documented. (Reported by Bill Dunlap.)
    • (Windows only) R CMD SHLIB failed when a symbol in the resulting library had the same name as a keyword in the ‘.def’ file. (PR#17130)
    • pmax() and pmin() now work with (more ?) classed objects, such as "Matrix" from the Matrix package, as documented for a long time.
    • axis(side, x = D) and hence Axis() and plot() now work correctly for "Date" and time objects D, even when “time goes backward”, e.g., with decreasing xlim. (Reported by William May.)
    • str(I(matrix(..))) now looks as always intended.
    • plot.ts(), the plot() method for time series, now respects cex, lwd and lty. (Reported by Greg Werbin.)
    • parallel::mccollect() now returns a named list (as documented) when called with wait = FALSE. (Reported by Michel Lang.)
    • If a package added a class to a class union in another package, loading the first package gave erroneous warnings about “undefined subclass”.
    • c()’s argument use.names is documented now, as belonging to the (C internal) default method. In “parallel”, argument recursive is also moved from the generic to the default method, such that the formal argument list of base generic c() is just (...).
    • rbeta(4, NA) and similarly rgamma() and rnbinom() now return NaNs with a warning, as other r<dist>(), and as documented. (PR#17155)
    • Using options(checkPackageLicense = TRUE) no longer requires acceptance of the licence for non-default standard packages such as compiler. (Reported by Mikko Korpela.)
    • split(<very_long>, *) now works even when the split off parts are long. (PR#17139)
    • min() and max() now also work correctly when the argument list starts with character(0). (PR#17160)
    • Subsetting very large matrices (prod(dim(.)) >= 2^31) now works thanks to Michael Schubmehl’s PR#17158.
    • bartlett.test() used residual sums of squares instead of variances, when the argument was a list of lm objects. (Reported by Jens Ledet Jensen).
    • plot(<lm>, which = *) now correctly labels the contour lines for the standardized residuals for which = 6. It also takes the correct p in case of singularities (also for which = 5). (PR#17161)
    • xtabs(~ exclude) no longer fails from wrong scope, thanks to Suharto Anggono’s PR#17147.
    • Reference class calls to methods() did not re-analyse previously defined methods, meaning that calls to methods defined later would fail. (Reported by Charles Tilford).
    • findInterval(x, vec, left.open = TRUE) misbehaved in some cases. (Reported by Dmitriy Chernykh.)
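
    A few of the items above are easy to check from the console. The following sketch assumes R 3.3.2 or later and uses only base and stats functions; the object names are illustrative:

    ```r
    # min() with a leading character(0) argument (PR#17160).
    smallest <- min(character(0), "b", "a")

    # rgamma() with an NA shape returns NaNs, with a warning (PR#17155).
    gamma_draws <- suppressWarnings(rgamma(4, NA))

    # sample(NA_real_) no longer fails; the single NA is returned.
    na_draw <- sample(NA_real_)

    # abbreviate() gained the 'named' argument; FALSE drops the names.
    abbr <- abbreviate("international", minlength = 5, named = FALSE)
    ```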

  • Set Application Domain Name with Shiny Server

    Guest post by AVNER KANTOR

    I used the wonderful tutorial by Dean Attali to set up my machine on Google Cloud. After configuring it successfully, I wanted to redirect my domain to the Shiny application URL. This is a short description of how you can do it.

    The first step is to point your domain's name servers to your server provider. You can find here a guide on how to do it for several providers. I used GoDaddy and Google Cloud DNS:

    • ns-cloud-b1.googledomains.com
    • ns-cloud-b2.googledomains.com
    • ns-cloud-b3.googledomains.com
    • ns-cloud-b4.googledomains.com

    The tricky part is setting up the Nginx virtual hosts. The DigitalOcean tutorial helped me again.

    We will create virtual hosts in /etc/nginx/sites-available. In this directory you can find the default file you created when you set up the environment (here is my file). Now you should add a config file named after your domain name. Let’s assume it is example.com. Here is the config file you should have:

    $ sudo nano /etc/nginx/sites-available/example.com
    server {
      listen 80;
      listen [::]:80;
      root /var/www/html;
    
      server_name example.com www.example.com;
    
      location / {
        proxy_pass http://127.0.0.1:3838/example/;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
      }
    }

    It’s important to check that the default_server option is enabled in only a single active file (in our case, default). You can do this with grep -R default_server /etc/nginx/sites-enabled/.

    To enable the server block, we will create a symbolic link from this file to the sites-enabled directory, which Nginx reads from during startup.

    $ sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/

    Next, test the configuration for syntax errors with sudo nginx -t. If no problems were found, restart Nginx to enable your changes:

    sudo systemctl restart nginx

    Now that you are all set up, you should test that your server blocks are functioning correctly. You can do that by visiting the domains in your web browser: http://example.com.

  • Presidential Election Predictions 2016 (an ASA competition)

    Guest post by Jo Hardin, professor of mathematics, Pomona College.

    ASA’s Prediction Competition

    In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

    http://thisisstatistics.org/electionprediction2016/

    To get you started, I’ve written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!

    Getting the Data

    Thanks to the Internet, there is a lot of polling data which is publicly accessible. For the competition, you are welcome to get your data from anywhere. However, I’m going to take mine from 538. http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/ (Other good sources of data are http://www.realclearpolitics.com/epolls/latest_polls/ and http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton and http://www.gallup.com/products/170987/gallup-analytics.aspx)

    Note the date indicated above as to when this R Markdown file was written. That’s the day the data were scraped from 538. If you run the Markdown file on a different day, you are likely to get different results as the polls are constantly being updated.

    Because the original data were scraped as a JSON file, it gets pulled into R as a list of lists. The data wrangling used to convert it into a tidy format is available from the source code at https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd.

    library(XML) # htmlParse(), xpathSApply(), xmlValue(), xmlAttrs()
    # fromJSON() below comes from a JSON package (for example jsonlite; an assumption)

    url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
    doc <- htmlParse(url, useInternalNodes = TRUE)

    sc = xpathSApply(doc, "//script[contains(., 'race.model')]",
                    function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))

    jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)

    data = fromJSON(jsobj)
    allpolls <- data$polls

    #unlisting the whole thing
    indx <- sapply(allpolls, length)
    pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, `length<-`, max(indx))))
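
    The last line pads every poll to the same length before binding the rows. A small sketch of that padding idiom on made-up data (the names ragged, padded and toy_df are illustrative and not part of the original analysis):

    ```r
    # Toy stand-in for a list whose elements have differing lengths.
    ragged <- list(c(1, 2, 3), c(4, 5), 6)

    # Length of the longest element.
    n_max <- max(lengths(ragged))

    # `length<-` pads each shorter vector with NA up to n_max.
    padded <- lapply(ragged, `length<-`, n_max)

    # Bind the equal-length vectors row-wise and convert to a data frame.
    toy_df <- as.data.frame(do.call(rbind, padded))
    toy_df
    ```

    Each row of toy_df corresponds to one list element, with NA filling the positions that the shorter elements lacked.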

     

    A Quick Visualization

    Before coming up with a prediction for the vote percentages for the 2016 US Presidential Race, it is worth trying to look at the data. The data are in a tidy form, so ggplot2 will be the right tool for visualizing the data.

    ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),
                            aes(y=adj_pct, x=endDate, color=choice)) +
      geom_line() + geom_point(aes(size=wtnow)) +
      labs(title = "Vote percentage by date and poll weight\n",
        y = "Percent Vote if Election Today", x = "Poll Date",
        color = "Candidate", size="538 Poll\nWeight")

    A Quick Analysis

    Let’s try to think about the percentage of votes that each candidate will get based on the now cast polling percentages. We’d like to weight the votes based on what 538 thinks (hey, they’ve been doing this longer than I have!), the sample size, and the number of days since the poll closed.

    Using my weight, I’ll calculate a weighted average and a weighted SE for the predicted percent of votes. (The SE of the weighted mean is taken from Cochran (1977) and cited in Gatz and Smith (1995).) The weights can be used to calculate the average or the running average for the now cast polling percentages.

    allpolldata2 <- allpolldata %>%
      filter(wtnow > 0) %>%
      filter(polltypeA == "now") %>%
      mutate(dayssince = as.numeric(today() - endDate)) %>%
      mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
      mutate(votewt = wt*pct) %>%
      group_by(choice) %>%
      arrange(choice, -dayssince) %>%
      mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
      mutate(cum.mean = cummean(pct))
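
    To make the weighting concrete, here is a minimal base R sketch of the same arithmetic on two made-up polls for a single candidate. The numbers are invented for illustration and are not 538 data:

    ```r
    # Toy polls for one candidate, ordered from oldest to most recent.
    polls <- data.frame(
      pct        = c(40, 44),   # reported vote percentage
      wtnow      = c(1, 2),     # 538 poll weight
      sampleSize = c(100, 400), # number of respondents
      dayssince  = c(10, 5)     # days since the poll closed
    )

    # Same weight as in the pipeline: wtnow * sqrt(sampleSize) / dayssince.
    polls$wt <- polls$wtnow * sqrt(polls$sampleSize) / polls$dayssince

    # Running weighted mean and the final weighted estimate.
    cum_mean_wt <- cumsum(polls$wt * polls$pct) / cumsum(polls$wt)
    estimate    <- weighted.mean(polls$pct, polls$wt)
    ```

    Here the more recent, larger poll gets weight 8 against 1 for the older one, so the estimate sits close to 44. Note that a poll that closed today has dayssince = 0, which gives an infinite weight under this formula, so that case may be worth checking for.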

    Plotting the Cumulative Mean / Weighted Mean

    In tidy format, the data are ready to plot. Note that the cumulative mean is much smoother than the cumulative weighted mean because the weights are much heavier toward the later polls.

    ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
                            aes(y=cum.mean, x=endDate, color=choice)) +
      geom_line() + geom_point(aes(size=wt)) +
        labs(title = "Cumulative Mean Vote Percentage\n",
        y = "Cumulative Percent Vote if Election Today", x = "Poll Date",
        color = "Candidate", size="Calculated Weight")

    ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
                            aes(y=cum.mean.wt, x=endDate, color=choice)) +
      geom_line() + geom_point(aes(size=wt)) +
      labs(title = "Cumulative Weighted Mean Vote Percentage\n",
        y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",
        color = "Candidate", size="Calculated Weight")

    Additionally, the weighted average and the SE of the average (given by Cochran (1977)) can be computed for each candidate. Using the formula, we have our prediction of the final percentage of the popular vote for each major candidate!

    pollsummary <- allpolldata2 %>%
      select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
      group_by(choice) %>%
      summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),
               std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))

    pollsummary

    ## # A tibble: 2 x 3
    ##    choice mean.vote  std.vote
    ##     <chr>     <dbl>     <dbl>
    ## 1 Clinton  43.64687 0.5073492
    ## 2   Trump  39.32071 1.1792667
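
    The summary above calls weighted.var.se(), which is not defined in this post (the full source is in the repository linked in the Getting the Data section). As a minimal sketch of what such a function might look like, assuming the ratio-variance approximation attributed to Cochran (1977) in Gatz and Smith (1995): the name weighted_var_se and the toy numbers are illustrative, and the author's own implementation may differ.

    ```r
    # Approximate variance of a weighted mean (ratio-variance form).
    weighted_var_se <- function(x, w, na.rm = FALSE) {
      if (na.rm) {
        keep <- !is.na(x) & !is.na(w)
        x <- x[keep]
        w <- w[keep]
      }
      n    <- length(w)
      xbar <- weighted.mean(x, w)  # weighted mean of x
      wbar <- mean(w)              # average weight
      n / ((n - 1) * sum(w)^2) * (
        sum((w * x - wbar * xbar)^2) -
          2 * xbar * sum((w - wbar) * (w * x - wbar * xbar)) +
          xbar^2 * sum((w - wbar)^2)
      )
    }

    # With equal weights the result reduces to var(x) / n.
    x <- c(40, 44, 42, 46)
    se_equal <- sqrt(weighted_var_se(x, rep(1, 4)))
    ```

    The equal-weights case is a convenient sanity check, since there the weighted mean is the ordinary mean and its SE should be the familiar sd(x) / sqrt(n).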

    Other people’s advice

    Prediction is very difficult, especially about the future. – Niels Bohr

    Along with good data sources, you should also be able to find information about prediction and modeling. I’ve provided a few resources to get you started.

                 Andrew Gelman: http://andrewgelman.com/2016/08/17/29654/

                 Sam Wang: http://election.princeton.edu/2016/08/21/sharpening-the-forecast/

                 Fivethirtyeight: http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/

                 Christensen and Florence: http://www.amstat.org/misc/tasarticle.pdf and https://tofu.byu.edu/electionpollproject/

  • The reproducibility crisis in science and prospects for R

    Guest post by Gregorio Santori

    The results of a recent Nature survey confirm that, for many researchers, we are living in an age of weak reproducibility (Baker M. Is there a reproducibility crisis? Nature 2016;533:453-454). Although the definition of reproducibility can vary widely between disciplines, the survey adopted the version in which “another scientist using the same methods gets similar results and can draw the same conclusions” (Reality check on reproducibility. Nature 2016;533:437). As early as 2009, Roger Peng formulated a very attractive definition of reproducibility: “In many fields of study there are examples of scientific investigations that cannot be fully replicated because of a lack of time or resources. In such a situation there is a need for a minimum standard that can fill the void between full replication and nothing. One candidate for this minimum standard is «reproducible research», which requires that data sets and computer code be made available to others for verifying published results and conducting alternative analyses” (Peng R. Reproducible research and Biostatistics. Biostatistics. 2009;10:405-408). For many readers of R-bloggers, Peng’s formulation probably means, first of all, a combination of R, LaTeX, Sweave, knitr, R Markdown, RStudio, and GitHub. From the broader perspective of scholarly journals, it mainly means Web repositories for experimental protocols, raw data, and source code.

    Although researchers and funders can contribute in many ways to reproducibility, scholarly journals seem to be in a position to make a decisive contribution to more reproducible research. In the opening of the “Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly Work in Medical Journals“, developed by the International Committee of Medical Journal Editors (ICMJE), there is an explicit reference to reproducibility. Moreover, the same ICMJE Recommendations state that “the Methods section should aim to be sufficiently detailed such that others with access to the data would be able to reproduce the results“, while “[the Statistics section] describe[s] statistical methods with enough detail to enable a knowledgeable reader with access to the original data to judge its appropriateness for the study and to verify the reported results“.

    In December 2010, Nature Publishing Group launched Protocol Exchange, “[…] an Open Repository for the deposition and sharing of protocols for scientific research“, where “protocols […] are presented subject to a Creative Commons Attribution-NonCommercial licence“.

    In December 2014, PLOS journals announced a new policy for data sharing, which resulted in the Data Availability Statement for submitted manuscripts.

    In June 2014, at the American Association for the Advancement of Science headquarters, the US National Institutes of Health held a joint workshop on reproducibility, with the participation of the Nature Publishing Group, Science, and the editors representing over 30 basic/preclinical science journals. The workshop resulted in the release of the “Principles and Guidelines for Reporting Preclinical Research“, where rigorous statistical analysis and data/material sharing were emphasized.

    In this scenario, I have recently suggested a global “statement for reproducibility” (Research papers: Journals should drive data reproducibility. Nature 2016;535:355). One of the strong points of this proposed statement is the ban on “point-and-click” statistical software. For papers with a “Statistical analysis” section, only original studies carried out using source code-based statistical environments should be admitted to peer review. In any case, the current policies adopted by scholarly journals seem to be moving towards stringent criteria to ensure more reproducible research. In the near future, the space for “point-and-click” statistical software will progressively shrink, and a cross-platform, open-source language and environment such as R is destined to play a key role.

  • Using 2D Contour Plots within {ggplot2} to Visualize Relationships between Three Variables

    Guest post by John Bellettiere, Vincent Berardi, Santiago Estrada

    The Goal

    To visually explore relations between two related variables and an outcome using contour plots. We use the contour function in Base R to produce contour plots that are well-suited for initial investigations into three dimensional data. We then develop visualizations using ggplot2 to gain more control over the graphical output. We also describe several data transformations needed to accomplish this visual exploration.

    The Dataset

    The mtcars dataset provided with Base R contains results from Motor Trend road tests of 32 cars that took place between 1973 and 1974. We focus on the following three variables: wt (weight, 1000lbs), hp (gross horsepower), qsec (time required to travel a quarter mile). qsec is a measure of acceleration with shorter times representing faster acceleration. It is reasonable to believe that weight and horsepower are jointly related to acceleration, possibly in a nonlinear fashion.

    head(mtcars)

    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

    Preliminary Visualizations

    To start, we look at a simple scatter plot of weight by horsepower, with each data point colored according to quartiles of acceleration. We first create a new variable to represent quartiles of acceleration using the cut and quantile functions.

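    # include.lowest = TRUE keeps the minimum qsec value in the first quartile instead of coding it NA
    mtcars$quart <- cut(mtcars$qsec, quantile(mtcars$qsec), include.lowest = TRUE)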
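
    As a quick check on the binning step, the following sketch (not part of the original analysis; the names q_default and q_closed are illustrative) compares cut() with and without include.lowest = TRUE. By default cut() builds left-open intervals, so the car with the minimum qsec falls outside the first bin and is coded NA.

```r
# Quartile break points of quarter-mile time
brks <- quantile(mtcars$qsec)

# Default: intervals are (a, b], so the minimum value is not covered
q_default <- cut(mtcars$qsec, brks)

# include.lowest = TRUE closes the first interval on the left: [a, b]
q_closed <- cut(mtcars$qsec, brks, include.lowest = TRUE)

sum(is.na(q_default))  # cars left without a quartile under the default
sum(is.na(q_closed))   # cars left without a quartile with include.lowest
table(q_closed)        # cars per quartile
```

    Checking for NA values after cut() is a cheap safeguard, because ggplot2 would otherwise show the unassigned observation as a separate NA group.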

    From here, we use ggplot to visualize the data. We selected sequential, color-blind-friendly colors using ColorBrewer and passed them manually to scale_colour_manual() within the ggplot() call below. Labels were also added manually to improve interpretation.

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = hp, color = factor(quart))) +
           geom_point(shape = 16, size = 5) +
           theme(legend.position = c(0.80, 0.85),
                legend.background = element_rect(colour = "black"),
                panel.background = element_rect(fill = "black")) +
           labs(x = "Weight (1,000lbs)", y = "Horsepower") +
           scale_colour_manual(values = c("#fdcc8a", "#fc8d59", "#e34a33", "#b30000"),
                              name = "Quartiles of qsec",
                              labels = c("14.5-16.9s", "17.0-17.7s", "17.8-18.9s", "19.0-22.9s"))

    This plot provides a first look at the interrelationships of the three variables of interest. To get a different representation of these relations, we use contour plots.

    Preparing the Data for Contour Plots in Base R

    The contour function requires three dimensional data as an input. We are interested in estimating acceleration for all possible combinations of weight and horsepower using the available data, thereby generating three dimensional data. To compute the estimates, a two-dimensional loess model is fit to the data using the following call:

    data.loess <- loess(qsec ~ wt * hp, data = mtcars)

    The model contained within the resulting loess object is then used to output the three-dimensional dataset needed for plotting. We do that by generating a sequence of values with uniform spacing over the range of wt and hp. An arbitrarily chosen distance of 0.3 between sequence elements was used to give a relatively fine resolution to the data. Using the predict function, the loess model object is used to estimate a qsec value for each combination of values in the two sequences. These estimates are stored in a matrix where each element of the wt sequence is represented by a row and each element of the hp sequence is represented by a column.

    # Create a sequence of incrementally increasing (by 0.3 units) values for both wt and hp
    xgrid <-  seq(min(mtcars$wt), max(mtcars$wt), 0.3)
    ygrid <-  seq(min(mtcars$hp), max(mtcars$hp), 0.3)
    # Generate a dataframe with every possible combination of wt and hp
    data.fit <-  expand.grid(wt = xgrid, hp = ygrid)
    # Feed the dataframe into the loess model and receive a matrix output with estimates of
    # acceleration for each combination of wt and hp
    mtrx3d <-  predict(data.loess, newdata = data.fit)
    # Abbreviated display of final matrix
    mtrx3d[1:4, 1:4]

    ##           hp
    ## wt         hp= 52.0 hp= 52.3 hp= 52.6 hp= 52.9
    ##   wt=1.513 19.04237 19.03263 19.02285 19.01302
    ##   wt=1.813 19.25566 19.24637 19.23703 19.22764
    ##   wt=2.113 19.55298 19.54418 19.53534 19.52645
    ##   wt=2.413 20.06436 20.05761 20.05077 20.04383
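
    To make the shape of this output concrete, here is a small sketch that repeats the same steps on a coarser grid. The 20-point grids and the names fit, wt_grid, hp_grid and est are illustrative choices, not part of the original analysis.

```r
# Fit the same two-predictor loess surface as in the text
fit <- loess(qsec ~ wt * hp, data = mtcars)

# Coarser, illustrative grids: 20 points per axis instead of a fixed 0.3 step
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
hp_grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 20)

# expand.grid() varies wt fastest, so predictions come back as a wt-by-hp array
grid <- expand.grid(wt = wt_grid, hp = hp_grid)
est  <- predict(fit, newdata = grid)

dim(est)                  # rows follow wt_grid, columns follow hp_grid
range(est, na.rm = TRUE)  # estimated quarter-mile times over the grid
```

    A fixed number of points per axis keeps the matrix small regardless of the units of each variable, whereas a common step of 0.3 yields far more hp values than wt values.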

    We then visualize the resulting three dimensional data using the contour function.

    contour(x = xgrid, y = ygrid, z = mtrx3d, xlab = "Weight (1,000lbs)", ylab = "Horsepower")

    Preparing the Data for Contour Plots in ggplot2

    To use ggplot, we manipulate the data into “long format” using the melt function from the reshape2 package. We add names for all of the resulting columns for clarity. An unfortunate side effect of the predict function used to populate the initial 3d dataset is that all of the row and column labels of the resulting matrix are character strings of the form “variable = value”. The character portion of these labels first needs to be removed, and the remaining values then converted to numeric. This is done by using str_locate (from the stringr package) to locate the “=” character, then using str_sub (also from stringr) to extract only the numerical portion of the string. Finally, as.numeric is used to convert the results to the appropriate class.

    library(reshape2)  # melt()
    library(stringr)   # str_locate(), str_sub()
    # Transform data to long form
    mtrx.melt <- melt(mtrx3d, id.vars = c("wt", "hp"), measure.vars = "qsec")
    names(mtrx.melt) <- c("wt", "hp", "qsec")
    # Return data to numeric form
    mtrx.melt$wt <- as.numeric(str_sub(mtrx.melt$wt, str_locate(mtrx.melt$wt, "=")[1,1] + 1))
    mtrx.melt$hp <- as.numeric(str_sub(mtrx.melt$hp, str_locate(mtrx.melt$hp, "=")[1,1] + 1))

    head(mtrx.melt)

    ##      wt hp     qsec
    ## 1 1.513 52 19.04237
    ## 2 1.813 52 19.25566
    ## 3 2.113 52 19.55298
    ## 4 2.413 52 20.06436
    ## 5 2.713 52 20.65788
    ## 6 3.013 52 20.88378
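
    The label clean-up can also be sketched with base R alone. The example below is an alternative to the stringr calls, not the authors' method; the labels vector is made up to mimic the “variable=value” names shown earlier.

```r
# Labels in the "variable=value" form that predict() attaches to the matrix
labels <- c("wt=1.513", "wt=1.813", "hp= 52.0", "hp= 52.3")

# Drop everything up to and including "=", plus any padding spaces
values <- as.numeric(sub("^[^=]*=\\s*", "", labels))
values
```

    Because the pattern matches whatever precedes the “=” sign, it does not rely on the sign sitting at the same position in every label.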

    Using ggplot2 to Create Contour Plots

    Basic Contour Plot

    With the data transformed into “long” form, we can make contour plots with ggplot2. With the most basic parameters in place, we see:

    plot1 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
             stat_contour()

    The resulting plot is not very descriptive and has no indication of the values of qsec.

    Contour plot with plot region colored using a continuous outcome variable (qsec).

    To aid in our plot’s descriptive value, we add color to the contour plot based on values of qsec.

    plot2 <- ggplot(mtrx.melt, aes(x = wt, y = hp, z = qsec)) +
             stat_contour(geom = "polygon", aes(fill = ..level..)) +
             geom_tile(aes(fill = qsec)) +
             stat_contour(bins = 15) +
             xlab("Weight (1,000lbs)") +
             ylab("Horsepower") +
             guides(fill = guide_colorbar(title = "¼ Mi. Time (s)"))

    Contour plot with plot region colored using discrete levels

    Another option could be to add colored regions between contour lines. In this case, we will split qsec into 10 equal segments using the cut function.

    # Create ten segments to be colored in
    mtrx.melt$equalSpace <- cut(mtrx.melt$qsec, 10)
    # Sort the segments in ascending order
    breaks <- levels(unique(mtrx.melt$equalSpace))
    # Plot
    plot3 <- ggplot() +
             geom_tile(data = mtrx.melt, aes(x = wt, y = hp, fill = equalSpace)) +
             # geom_contour needs its own data and x/y/z mapping because ggplot() is empty
             geom_contour(data = mtrx.melt, aes(x = wt, y = hp, z = qsec),
                          color = "white", alpha = 0.5) +
             theme_bw() +
             xlab("Weight (1,000lbs)") +
             ylab("Horsepower") +
             scale_fill_manual(values = c("#35978f", "#80cdc1", "#c7eae5", "#f5f5f5",
                                         "#f6e8c3", "#dfc27d", "#bf812d", "#8c510a",
                                         "#543005", "#330000"),
                               name = "¼ Mi. Time (s)", breaks = breaks, labels = breaks)

    Note: in the lower right hand corner of the graph above, there is a region where increasing weight is associated with decreasing ¼ mile times, which is not characteristic of the true relation between weight and acceleration. This is due to the extrapolation that the predict function performed while creating predictions of qsec for combinations of weight and horsepower that did not exist in the raw data. This cannot be avoided using the methods described above. A well-placed rectangle (geom_rect) or placing the legend over the offending area can conceal this region (see example below).

    Contour plot with contour lines colored using a continuous outcome variable (qsec)

    Instead of coloring the whole plot, it may be more desirable to color just the contour lines. This can be achieved by mapping colour to the contour level inside stat_contour, rather than filling tiles and styling them with scale_fill_manual. We also chose to move the legend into the area of extrapolation.

    plot4 <- ggplot()  +
             theme_bw() +
             xlab("Weight (1,000lbs)") +
             ylab("Horsepower") +
             stat_contour(data = mtrx.melt, aes(x = wt, y = hp, z = qsec, colour = ..level..),
                         breaks = round(quantile(mtrx.melt$qsec, seq(0, 1, 0.1)), 0), size = 1) +
             scale_color_continuous(name = "¼ Mi. Time (s)") +
             theme(legend.justification=c(1, 0), legend.position=c(1, 0))
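
    One detail of the breaks argument is worth checking: rounding deciles to whole seconds can map neighbouring deciles to the same value. The sketch below uses the observed mtcars$qsec as a stand-in for mtrx.melt$qsec (an assumption made so the example runs on its own) and keeps only the distinct break values; brks and brks_unique are illustrative names.

```r
# Stand-in for mtrx.melt$qsec: the observed quarter-mile times
qsec <- mtcars$qsec

# Deciles rounded to whole seconds, as in the breaks argument above
brks <- round(quantile(qsec, seq(0, 1, 0.1)), 0)

# Rounding can collapse neighbouring deciles; keep each break value once
brks_unique <- unique(unname(brks))

length(brks)         # number of requested breaks
length(brks_unique)  # number of distinct contour levels
brks_unique
```

    Passing only the distinct values makes the number of contour levels that can actually be drawn explicit.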

    Contour plot with contour lines colored using a continuous outcome variable and overlaying scatterplot of weight and horsepower.

    We can also overlay the raw data from mtcars onto the previous plot.

    plot5 <- plot4 +
             geom_point(data = mtcars, aes(x = wt, y = hp), shape = 1, size = 2.5, color = "red")

    Contour plot with contour lines colored using a continuous outcome variable and labeled using direct.label()

    With color-coded contour lines, as seen in the previous example, it may be difficult to differentiate the values of qsec that each line represents. Although we supplied a legend to the preceding plot, using direct.label from the directlabels package can clarify the values of qsec.

    plot6 <- direct.label(plot5, "bottom.pieces")

    We hope that these examples were of help to you and that you are better able to visualize your data as a result.

    For questions, corrections, or suggestions for improvement, contact John at [email protected] or via Twitter at @JohnBellettiere.

  • R 3.3.1 is released

    R 3.3.1 (codename “Bug in Your Hair”) was released yesterday. You can get the latest binaries version from here (or the .tar.gz source code from here). The full list of bug fixes is provided below (this release does not introduce new features).

    Upgrading to R 3.3.1 on Windows

    If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

    install.packages("installr") # install 
    setInternet2(TRUE) # only for R versions older than 3.3.0
    installr::updateR() # updating R.

    Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

    I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

    CHANGES IN R 3.3.1

    BUG FIXES

    • R CMD INSTALL and hence install.packages() gave an internal error installing a package called description from a tarball on a case-insensitive file system.
    • match(x, t) (and hence x %in% t) failed when x was of length one, and either x was character and x and t only differed in their Encoding, or x and t were complex with NAs or NaNs. (PR#16885.)
    • unloadNamespace(ns) also works again when ns is a ‘namespace’, as from getNamespace().
    • rgamma(1,Inf) or rgamma(1, 0,0) no longer give NaN but the correct limit.
    • length(baseenv()) is correct now.
    • pretty(d, ..) for date-time d rarely failed when "halfmonth" time steps were tried (PR#16923) and on ‘inaccurate’ platforms such as 32-bit windows or a configuration with --disable-long-double; see comment #15 of PR#16761.
    • In text.default(x, y, labels), the rarely(?) used default for labels is now correct also for the case of a 2-column matrix x and missing y.
    • as.factor(c(a = 1L)) preserves names() again as in R < 3.1.0.
    • strtrim(""[0], 0[0]) now works.
    • Use of Ctrl-C to terminate a reverse incremental search started by Ctrl-R in the readline-based Unix terminal interface is now supported for readline >= 6.3 (Ctrl-G always worked). (PR#16603)
    • diff(<difftime>) now keeps the "units" attribute, as subtraction already did, PR#16940.
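
    A few of these fixes are easy to confirm interactively. The sketch below assumes it is run on R 3.3.1 or later; the variable names are illustrative.

```r
# as.factor() keeps names on integer input again
f <- as.factor(c(a = 1L))
names(f)

# diff() on a difftime keeps its "units" attribute
d <- as.difftime(c(1, 3, 6), units = "hours")
units(diff(d))

# Zero-length inputs to strtrim() are accepted
strtrim(""[0], 0[0])
```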

  • heatmaply: interactive heat maps (with R)

    I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package.

    tl;dr

    By running the following 3 lines of code:

    install.packages("heatmaply")
    library(heatmaply)
    heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))

    You will get this output in your browser (or RStudio console):

    You can see more examples in the online vignette on CRAN. For issue reports or feature requests, please visit the GitHub repo.

    Introduction

    A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing value patterns, and more.
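
    The row and column ordering mentioned above can be sketched with base R alone. This is a rough illustration of the general idea (hierarchical clustering followed by reordering), not heatmaply's implementation; the scaling step, the default complete linkage, and all variable names are assumptions made for the example.

```r
# Standardize columns so no single variable dominates the distances
x <- scale(as.matrix(mtcars))

# Cluster rows and columns hierarchically (Euclidean distance, complete linkage)
row_hc <- hclust(dist(x))
col_hc <- hclust(dist(t(x)))

# Reorder the matrix so similar rows and columns sit next to each other
x_ordered <- x[row_hc$order, col_hc$order]

# Cut the row dendrogram into 3 groups and count cars per group
row_groups <- cutree(row_hc, k = 3)
table(row_groups)
```

    The reordered matrix x_ordered is what a heatmap would then draw as colored cells, with the two hclust objects supplying the dendrograms.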

    Interactive heatmaps allow the inspection of specific values by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area.

    This work is based on ggplot2 and the plotly.js engine. It produces heatmaps similar to those of d3heatmap, with the advantages of speed (plotly.js can handle larger matrices), the ability to zoom from the dendrogram (thanks to the dendextend R package), and the possibility of new features in the future (such as sidebar bars).

    Why heatmaply

    The heatmaply package is designed to have features and a user interface familiar from heatmap, gplots::heatmap.2 and other functions for static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way. heatmaply includes the following features:

    • Shows the row/column/value under the mouse cursor (and includes a legend on the side)
    • Drag a rectangle over the heatmap image, or the dendrograms, in order to zoom in (the dendrogram coloring relies on integration with the dendextend package)
    • Works from the R console, in RStudio, with R Markdown, and with Shiny

    The package is similar to the d3heatmap package (developed by the brilliant Joe Cheng), but is based on the plotly R package. Performance-wise it can handle larger matrices. Furthermore, since it is based on ggplot2+plotly, it is expected to gain more features in the future (as it can be extended more easily, even by non-JavaScript experts). I chose to build heatmaply on top of plotly.js since it is a free, open source JavaScript library that can translate ggplot2 figures into self-contained interactive JavaScript objects (which can be viewed in your browser or RStudio).

    The default color palette for the heatmap is based on the beautiful viridis package. Also, by using the dendextend package (see the open-access two-page bioinformatics paper), you can customize dendrograms before sending them to heatmaply (via Rowv and Colv).

    You can see some more eye-candy in the online vignette on CRAN.

    For issue reports or feature requests, please visit the GitHub repo.

  • R 3.3.0 is released!

    R 3.3.0 (codename “Supposedly Educational”) was released today. You can get the latest binaries version from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

    Upgrading to R 3.3.0 on Windows

    If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

    install.packages("installr") # install 
    setInternet2(TRUE)
    installr::updateR() # updating R.

    Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

    I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

    CHANGES IN R 3.3.0

    SIGNIFICANT USER-VISIBLE CHANGES

    • nchar(x, *)’s argument keepNA, governing how the result for NAs in x is determined, gets a new default keepNA = NA which returns NA where x is NA, except for type = "width" which still returns 2, the formatting / printing width of NA.
    • All builds have support for https: URLs in the default methods for download.file(), url() and code making use of them. Unfortunately that cannot guarantee that any particular https: URL can be accessed. For example, server and client have to successfully negotiate a cryptographic protocol (TLS/SSL, …) and the server’s identity has to be verifiable via the available certificates. Different access methods may allow different protocols or use private certificate bundles: we encountered a https: CRAN mirror which could be accessed by one browser but not by another nor by download.file() on the same Linux machine.
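
    The new keepNA default is easiest to see on a small vector. The sketch below assumes R 3.3.0 or later; x is an illustrative name.

```r
x <- c("abc", NA)

nchar(x)                  # NA where x is NA (type = "chars" is the default)
nchar(x, type = "width")  # width still counts NA as 2 printed characters
nchar(x, keepNA = FALSE)  # the earlier behaviour: NA counted as 2
```

    Code that relied on nchar() never returning NA for character input may need an explicit keepNA = FALSE or an is.na() check.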

    NEW FEATURES

    • The print method for methods() gains a byclass argument.
    • New functions validEnc() and validUTF8() to give access to the validity checks for inputs used by grep() and friends.
    • Experimental new functionality for S3 method checking, notably isS3method(). Also, the names of the R ‘language elements’ are exported as character vector tools::langElts.
    • str(x) now displays "Time-Series" also for matrix (multivariate) time-series, i.e. when is.ts(x) is true.
    • (Windows only) The GUI menu item to install local packages now accepts ‘*.tar.gz’ files as well as ‘*.zip’ files (but defaults to the latter).
    • New programmeR’s utility function chkDots().
    • D() now signals an error when given invalid input, rather than silently returning NA. (Request of John Nash.)
    • formula objects are slightly more “first class”: e.g., formula() or new("formula", y ~ x) are now valid. Similarly, for "table", "ordered" and "summary.table". Packages defining S4 classes with the above S3/S4 classes as slots should be reinstalled.
    • New function strrep() for repeating the elements of a character vector.
    • rapply() preserves attributes on the list when how = "replace".
    • New S3 generic function sigma() with methods for extracting the estimated standard deviation aka “residual standard deviation” from a fitted model.
    • news() now displays R and package news files within the HTML help system if it is available. If no news file is found, a visible NULL is returned to the console.
    • as.raster(x) now also accepts raw arrays x assuming values in 0:255.
    • Subscripting of matrix/array objects of type "expression" is now supported.
    • type.convert("i") now returns a factor instead of a complex value with zero real part and missing imaginary part.
    • Graphics devices cairo_pdf() and cairo_ps() now allow non-default values of the cairographics ‘fallback resolution’ to be set. This now defaults to 300 on all platforms: that is the default documented by cairographics, but apparently was not used by all system installations.
    • file() gains an explicit method argument rather than implicitly using getOption("url.method", "default").
    • Thanks to a patch from Tomas Kalibera, x[x != 0] is now typically faster than x[which(x != 0)] (in the case where x has no NAs, the two are equivalent).
    • read.table() now always uses the names for a named colClasses argument (previously names were only used when colClasses was too short). (In part, wish of PR#16478.)
    • (Windows only) download.file() with default method = "auto" and a ftps:// URL chooses "libcurl" if that is available.
    • The out-of-the box Bioconductor mirror has been changed to one using https://: use chooseBioCmirror() to choose a http:// mirror if required.
    • The data frame and formula methods for aggregate() gain a drop argument.
    • available.packages() gains a repos argument.
    • The undocumented switching of methods for url() on https: and ftps: URLs is confined to method = "default" (and documented).
    • smoothScatter() gains a ret.selection argument.
    • qr() no longer has a ... argument to pass additional arguments to methods.
    • [ has a method for class "table".
    • It is now possible (again) to replayPlot() a display list snapshot that was created by recordPlot() in a different R session. It is still not a good idea to use snapshots as a persistent storage format for R plots, but it is now not completely silly to use a snapshot as a format for transferring an R plot between two R sessions.

      The underlying changes mean that packages providing graphics devices (e.g., Cairo, RSvgDevice, cairoDevice, tikzDevice) will need to be reinstalled.

      Code for restoring snapshots was contributed by Jeroen Ooms and JJ Allaire.

      Some testing code is available at https://github.com/pmur002/R-display-list.

    • tools::undoc(dir = D) and codoc(dir = D) now also work when D is a directory whose normalizePath()ed version does not end in the package name, e.g. from a symlink.
    • abbreviate() has more support for multi-byte character sets – it no longer removes bytes within characters and knows about Latin vowels with accents. It is still only really suitable for (most) European languages, and still warns on non-ASCII input. abbreviate(use.classes = FALSE) is now implemented, and that is more suitable for non-European languages.
    • match(x, table) is faster (sometimes by an order of magnitude) when x is of length one and incomparables is unchanged, thanks to Peter Haverty (PR#16491).
    • More consistent, partly not back-compatible behavior of NA and NaN coercion to complex numbers, operations less often resulting in complex NA (NA_complex_).
    • lengths() considers methods for length and [[ on x, so it should work automatically on any objects for which appropriate methods on those generics are defined.
    • The logic for selecting the default screen device on OS X has been simplified: it is now quartz() if that is available even if environment variable DISPLAY has been set by the user. The choice can easily be overridden via environment variable R_INTERACTIVE_DEVICE.
    • On Unix-like platforms which support the getline C library function, system(*,intern = TRUE) no longer truncates (output) lines longer than 8192 characters, thanks to Karl Millar. (PR#16544)
    • rank() gains a ties.method = "last" option, for convenience (and symmetry).
    • regmatches(invert = NA) can now be used to extract both non-matched and matched substrings.
    • data.frame() gains argument fix.empty.names; as.data.frame.list() gets new cut.names, col.names and fix.empty.names.
    • plot(x ~ x, *) now warns that it is the same as plot(x ~ 1, *).
    • recordPlot() has new arguments load and attach to allow package names to be stored as part of a recorded plot. replayPlot() has new argument reloadPkgs to load/attach any package names that were stored as part of a recorded plot.
    • S4 dispatch works within calls to .Internal(). This means explicit S4 generics are no longer needed for unlist() and as.vector().
    • Only font family names starting with “Hershey” (and not “Her” as before) are given special treatment by the graphics engine.
    • S4 values are automatically coerced to vector (via as.vector) when subassigned into atomic vectors.
    • findInterval() gets a left.open option.
    • The version of LAPACK included in the sources has been updated to 3.6.0, including those ‘deprecated’ routines which were previously included. Ca 40 double-complex routines have been added at the request of a package maintainer. As before, the details of what is included are in ‘src/modules/lapack/README’ and this now gives information on earlier additions.
    • tapply() has been made considerably more efficient without changing functionality, thanks to proposals from Peter Haverty and Suharto Anggono. (PR#16640)
    • match.arg(arg) (the one-argument case) is faster; so is sort.int(). (PR#16640)
    • The format method for object_size objects now also accepts “binary” units such as "KiB" and e.g., "Tb". (Partly from PR#16649.)
    • Profiling now records calls of the form foo::bar and some similar cases directly rather than as calls to <Anonymous>. Contributed by Winston Chang.
    • New string utilities startsWith(x, prefix) and endsWith(x, suffix). Also provide speedups for some grepl("^...",*) uses (related to proposals in PR#16490).
    • Reference class finalizers run at exit, as well as on garbage collection.
    • Avoid parallel dependency on stats for port choice and random number seeds. (PR#16668)
    • The radix sort algorithm and implementation from data.table (forder) replaces the previous radix (counting) sort and adds a new method for order(). Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ?sort).
    • The order() function gains a method argument for choosing between "shell" and "radix".
    • New function grouping() returns a permutation that stably rearranges data so that identical values are adjacent. The return value includes extra partitioning information on the groups. The implementation came included with the new radix sort.
    • rhyper(nn, m, n, k) no longer returns NA when one of the three parameters exceeds the maximal integer.
    • switch() now warns when no alternatives are provided.
    • parallel::detectCores() now has default logical = TRUE on all platforms – as this was the default on Windows, this change only affects Sparc Solaris. Option logical = FALSE is now supported on Linux and recent versions of OS X (for the latter, thanks to a suggestion of Kyaw Sint).
    • hist() for "Date" or "POSIXt" objects would sometimes give misleading labels on the breaks, as they were set to the day before the start of the period being displayed. The display format has been changed, and the shift of the start day has been made conditional on right = TRUE (the default). (PR#16679)
    • R now uses a new version of the logo (donated to the R Foundation by RStudio). It is defined in ‘.svg’ format, so will resize without unnecessary degradation when displayed on HTML pages—there is also a vector PDF version. Thanks to Dirk Eddelbuettel for producing the corresponding X11 icon.
    • New function .traceback() returns the stack trace which traceback() prints.
    • lengths() dispatches internally.
    • dotchart() gains a pt.cex argument to control the size of points separately from the size of plot labels. Thanks to Michael Friendly and Milan Bouchet-Valat for ideas and patches.
    • as.roman(ch) now correctly deals with more diverse character vectors ch; also arithmetic with the resulting roman numbers works in more cases. (PR#16779)
    • prcomp() gains a new option rank. allowing to directly aim for less than min(n,p) PC’s. The summary() and its print() method have been amended, notably for this case.
    • gzcon() gains a new option text, which marks the connection as text-oriented (so e.g. pushBack() works). It is still always opened in binary mode.
    • The import() namespace directive now accepts an argument except which names symbols to exclude from the imports. The except expression should evaluate to a character vector (after substituting symbols for strings). See Writing R Extensions.
    • New convenience function Rcmd() in package tools for invoking R CMD tools from within R.
    • New functions makevars_user() and makevars_site() in package tools to determine the location of the user and site specific ‘Makevars’ files for customizing package compilation.
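
    Several of the additions listed above can be tried directly at the console. The sketch below assumes R 3.3.0 or later and uses small made-up inputs; it is a sampler of a few items, not a complete tour.

```r
# New string helpers
strrep("ab", 3)
startsWith("foobar", "foo")
endsWith("foobar", "bar")

# rank() with ties broken in favour of later positions
rank(c(2, 1, 2), ties.method = "last")

# findInterval() with intervals open on the left: (v[i], v[i + 1]]
findInterval(c(1, 2), c(1, 2, 3), left.open = TRUE)

# order() with an explicit sorting method
order(c(3, 1, 2), method = "radix")

# sigma() extracts the residual standard deviation from a fitted model
fit <- lm(qsec ~ wt + hp, data = mtcars)
sigma(fit)
```

    For a linear model, sigma(fit) returns the same value that summary(fit)$sigma reports.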

    UTILITIES

    • R CMD check has a new option --ignore-vignettes for use with non-Sweave vignettes whose VignetteBuilder package is not available.
    • R CMD check now by default checks code usage (via codetools) with only the base package attached. Functions from default packages other than base which are used in the package code but not imported are reported as undefined globals, with a suggested addition to the NAMESPACE file.
    • R CMD check --as-cran now also checks DOIs in package ‘CITATION’ and Rd files.
    • R CMD Rdconv and R CMD Rd2pdf each have a new option --RdMacros=pkglist which allows Rd macros to be specified before processing.

    DEPRECATED AND DEFUNCT

    • The previously included versions of zlib, bzip2, xz and PCRE have been removed, so suitable external (usually system) versions are required (see the ‘R Installation and Administration’ manual).
    • The unexported and undocumented Windows-only devices cairo_bmp(), cairo_png() and cairo_tiff() have been removed. (These devices should be used as e.g. bmp(type = "cairo").)
    • (Windows only) Function setInternet2() has no effect and will be removed in due course. The choice between methods "internal" and "wininet" is now made by the method arguments of url() and download.file() and their defaults can be set via options. The out-of-the-box default remains "wininet" (as it has been since R 3.2.2).
    • [<- with an S4 value into a list currently embeds the S4 object into its own list such that the end result is roughly equivalent to using [[<-. That behavior is deprecated. In the future, the S4 value will be coerced to a list with as.list().
    • Package tools’ functions package.dependencies(), pkgDepends(), etc are deprecated now, mostly in favor of package_dependencies() which is both more flexible and efficient.

    INSTALLATION and INCLUDED SOFTWARE

    • Support for very old versions of valgrind (e.g., 3.3.0) has been removed.
    • The included libtool script (generated by configure) has been updated to version 2.4.6 (from 2.2.6a).
    • libcurl version 7.28.0 or later with support for the https protocol is required for installation (except on Windows).
    • BSD networking is now required (except on Windows) and so capabilities("http/ftp") is always true.
    • configure uses pkg-config for PNG, TIFF and JPEG where this is available. This should work better with multiple installs and with those using static libraries.
    • The minimum supported version of OS X is 10.6 (‘Snow Leopard’): even that has been unsupported by Apple since 2012.
    • The configure default on OS X is --disable-R-framework: enable this if you intend to install under ‘/Library/Frameworks’ and use with R.app.
    • The minimum preferred version of PCRE has since R 3.0.0 been 8.32 (released in Nov 2012). Versions 8.10 to 8.31 are now deprecated (with warnings from configure), but will still be accepted until R 3.4.0.
    • configure looks for C functions __cospi, __sinpi and __tanpi and uses these if cospi etc are not found. (OS X is the main instance.)
    • (Windows) R is now built using gcc 4.9.3. This build will require recompilation of at least those packages that include C++ code, and possibly others. A build of R-devel using the older toolchain will be temporarily available for comparison purposes. During the transition, the environment variable R_COMPILED_BY has been defined to indicate which toolchain was used to compile R (and hence, which should be used to compile code in packages). The COMPILED_BY variable described below will be a permanent replacement for this.
    • (Windows) A make and R CMD config variable named COMPILED_BY has been added. This indicates which toolchain was used to compile R (and hence, which should be used to compile code in packages).
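
    Several of the requirements above (PCRE, libcurl, networking support) can be inspected from a running session. The following is a rough sketch using extSoftVersion(), libcurlVersion() and capabilities(); the variable names are illustrative, and the values reported depend entirely on how the local R was built.

    ```r
    # Versions of external software this R build was linked against.
    soft <- extSoftVersion()
    cat("PCRE:   ", soft[["PCRE"]], "\n")

    # libcurl version in use (NA if R was built without libcurl).
    cat("libcurl:", libcurlVersion(), "\n")

    # Internal http/ftp support, reported as a logical capability.
    http_ftp <- capabilities("http/ftp")
    cat("http/ftp capability:", http_ftp, "\n")
    ```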

    PACKAGE INSTALLATION

    • The make macro AWK which used to be made available to files such as ‘src/Makefile’ is no longer set.

    C-LEVEL FACILITIES

    • The API call logspace_sum introduced in R 3.2.0 is now remapped as an entry point to Rf_logspace_sum, and its first argument has gained a const qualifier. (PR#16470) Code using it will need to be reinstalled.

      Similarly, entry point log1pexp also defined in ‘Rmath.h’ is remapped there to Rf_log1pexp.

    • R_GE_version has been increased to 11.
    • New API call R_orderVector1, a faster one-argument version of R_orderVector.
    • When R headers such as ‘R.h’ and ‘Rmath.h’ are called from C++ code in packages they include the C++ versions of system headers such as ‘<cmath>’ rather than the legacy headers such as ‘<math.h>’. (Headers ‘Rinternals.h’ and ‘Rinterface.h’ already did, and inclusion of system headers can still be circumvented by defining NO_C_HEADERS, including as from this version for those two headers.) The manual has long said that R headers should not be included within an extern "C" block, and almost all the packages affected by this change were doing so.
    • Including header ‘S.h’ from C++ code would fail on some platforms, and so gives a compilation error on all.
    • The deprecated header ‘Rdefines.h’ is now compatible with defining R_NO_REMAP.
    • The connections API now includes a function R_GetConnection() which allows packages implementing connections to convert R connection objects to Rconnection handles used in the API. Code which previously used the low-level R-internal getConnection() entry point should switch to the official API.

    BUG FIXES

    • C-level asChar(x) is fixed for when x is not a vector, and it returns "TRUE"/"FALSE" instead of "T"/"F" for logical vectors.
    • The first arguments of .colSums() etc (with an initial dot) are now named x rather than X (matching colSums()): thus error messages are corrected.
    • A coef() method for class "maov" has been added to allow vcov() to work with multivariate results. (PR#16380)
    • method = "libcurl" connections signal errors rather than retrieving HTTP error pages (where the ISP reports the error).
    • xpdrows.data.frame() was not checking for unique row names; in particular, this affected assignment to non-existing rows via numerical indexing. (PR#16570)
    • tail.matrix() did not work for zero-row matrices, and could produce row “labels” such as "[1e+05,]".
    • Data frames with a column named "stringsAsFactors" now format and print correctly. (PR#16580)
    • cor() is now guaranteed to return a value with absolute value less than or equal to 1. (PR#16638)
    • Array subsetting now keeps names(dim(.)).
    • Blocking socket connection selection recovers more gracefully on signal interrupts.
    • Construction of row.names in the data.frame method of rbind() works better in borderline integer cases, but may change the names assigned. (PR#16666)
    • (X11 only) getGraphicsEvent() miscoded buttons and missed mouse motion events. (PR#16700)
    • methods(round) now also lists round.POSIXt.
    • tar() now works with the default files = NULL. (PR#16716)
    • Jumps to outer contexts, for example in error recovery, now make intermediate jumps to contexts where on.exit() actions are established instead of trying to run all on.exit() actions before jumping to the final target. This unwinds the stack gradually, releases resources held on the stack, and significantly reduces the chance of a segfault when running out of C stack space. Error handlers established using withCallingHandlers() and options("error") specifications are ignored when handling a C stack overflow error as attempting one of these would trigger a cascade of C stack overflow errors. (These changes resolve PR#16753.)
    • The spacing could be wrong when printing a complex array. (Report and patch by Lukas Stadler.)
    • pretty(d, n, min.n, *) for date-time objects d works again in border cases with large min.n, returns a labels attribute also for small-range dates and in such cases its returned length is closer to the desired n. (PR#16761) Additionally, it finally does cover the range of d, as it always claimed.
    • tsp(x) <- NULL did not handle correctly objects inheriting from both "ts" and "mts". (PR#16769)
    • install.packages() could give false errors when options("pkgType") was "binary". (Reported by Jose Claudio Faria.)
    • A bug fix in R 3.0.2 fixed problems with locator() in X11, but introduced problems in Windows. Now both should be fixed. (PR#15700)
    • download.file() with method = "wininet" incorrectly warned of download file length difference when reported length was unknown. (PR#16805)
    • diag(NULL, 1) crashed because of missed type checking. (PR#16853)
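
    A few of the fixes above are easy to check interactively. The sketch below is not taken from the release notes; the data and variable names are made up for illustration, and it only exercises behaviour that R 3.3.0 and later are described as guaranteeing.

    ```r
    # cor() stays within [-1, 1], even for perfectly collinear inputs.
    x <- c(1, 2, 3, 4, 5) / 3
    r <- cor(x, 2 * x + 1)
    stopifnot(abs(r) <= 1)

    # tail() on a zero-row matrix returns a zero-row matrix.
    m <- matrix(numeric(0), nrow = 0, ncol = 3)
    tl <- tail(m)
    print(dim(tl))

    # A column named "stringsAsFactors" prints like any other column.
    df <- data.frame(id = 1:2)
    df$stringsAsFactors <- c("yes", "no")
    printed <- capture.output(print(df))
    cat(printed, sep = "\n")
    ```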

     

  • Copyright Use-R.com 2012 - 2016 ©