RSS Feed : R-Bloggers.com

RSS Feed from R-bloggers.com

  • Is my time series additive or multiplicative?

Time series data is an important area of analysis, especially if you do a lot of web analytics. To be able to analyse time series effectively, it helps to understand the interaction between general seasonality in activity and the underlying trend.

    The interactions between trend and seasonality are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.

    Additive or multiplicative?

It’s important to understand the difference between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:
• trend: how things are changing overall
• seasonality: how things change within a given period, e.g. a year, month, week or day
• error/residual/irregular: activity not explained by the trend or the seasonal value

    How these three components interact determines the difference between a multiplicative and an additive time series.

    In a multiplicative time series, the components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated. This is common when you’re looking at web traffic.

In an additive time series, the components are added together to make the time series. If you have an increasing trend, you still see roughly the same size peaks and troughs. This is often seen in indexed time series where the absolute value is growing but changes stay relative.
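As a minimal sketch with made-up numbers (not the nzbop data used later), the two structures can be illustrated by combining the same trend and seasonal pattern in the two different ways:

    # Hypothetical illustration: identical trend and seasonal components,
    # combined additively and then multiplicatively
    trend    <- seq(100, 400, length.out = 48)          # steadily increasing trend
    seasonal <- rep(c(0.10, 0, -0.10, 0), times = 12)   # simple quarterly pattern
    noise    <- rnorm(48, sd = 2)

    additive       <- trend + 100 * seasonal + noise    # peaks and troughs keep the same size
    multiplicative <- trend * (1 + seasonal) + noise    # peaks and troughs grow with the trend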

    You can have a time series that is somewhere in between the two, a statistician’s “it depends”, but I’m interested in attaining a quick classification so I won’t be handling this complication here.

    There’s a package for that

    When I first started doing time series analysis, the only way to visualise how a time series splits into different components was to use base R. About the time I was feeling the pain, someone released a ggplot2 time series extension! I’ll be using ggseas where I can.

    We’ll use the nzbop data set from ggseas to, first of all, examine a single time series and then process all the time series in the dataset to determine if they’re multiplicative or additive.

# nzdata: the nzbop data, presumably converted to a data.table (see the full script in the gist)
sample_ts <- nzdata[Account == "Current account" & Category == "Services; Exports total",
                    .(TimePeriod, Value)]
    
TimePeriod | Value
1971-06-30 | 55
1971-09-30 | 56
1971-12-31 | 60
1972-03-31 | 65
1972-06-30 | 65
1972-09-30 | 63

    I’ll be using other packages (like data.table) and will only show relevant code snippets as I go along. You can get the whole script in a GIST.

    Decomposing the data

    To be able to determine if the time series is additive or multiplicative, the time series has to be split into its components.

Existing functions to decompose the time series include decompose(), which lets you specify whether the series is multiplicative or additive, and stl(), which only handles additive series unless you transform the data. I could use stl() with a multiplicative series if I transform the time series by taking the log. For either function, I need to know whether the series is additive or multiplicative first.
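As a rough sketch of those two routes (on a simulated quarterly series, not the nzbop data):

    # Simulated quarterly series used only to show the two function calls
    y <- ts(cumsum(rnorm(48, mean = 2)) + 50, frequency = 4)

    decompose(y, type = "additive")         # assumes an additive interaction
    decompose(y, type = "multiplicative")   # assumes a multiplicative interaction
    stl(y, s.window = "periodic")           # stl() assumes an additive series
    stl(log(y), s.window = "periodic")      # log-transform first for a multiplicative series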

    The trend

    The first component to extract is the trend. There are a number of ways you can do this, and some of the simplest ways involve calculating a moving average or median.

# right-aligned rolling mean over a window of 8 quarters (two years)
sample_ts[, trend := zoo::rollmean(Value, 8, fill = NA, align = "right")]
    
TimePeriod | Value | trend
2014-03-31 | 5212 | 4108.625
2014-06-30 | 3774 | 4121.750
2014-09-30 | 3698 | 4145.500
2014-12-31 | 4752 | 4236.375
2015-03-31 | 6154 | 4376.500
2015-06-30 | 4543 | 4478.875

    A moving median is less sensitive to outliers than a moving mean. It doesn’t work well though if you have a time series that includes periods of inactivity. Lots of 0s can result in very weird trends.
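If outliers are a concern, the rolling mean above can be swapped for a rolling median; a sketch keeping the same window and alignment:

    sample_ts[, trend := zoo::rollapply(Value, 8, median, fill = NA, align = "right")]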

    The seasonality

Seasonality is the cyclical pattern that occurs in our time series once the trend has been removed.

Of course, the way to de-trend the data needs to be additive or multiplicative depending on what type your time series is. Since we don’t know the type of time series at this point, we’ll do both.

    sample_ts[,`:=`( detrended_a = Value - trend,  detrended_m = Value / trend )]
    
TimePeriod | Value | trend | detrended_a | detrended_m
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172

    To work out the seasonality we need to work out what the typical de-trended values are over a cycle. Here I will calculate the mean value for the observations in Q1, Q2, Q3, and Q4.

    sample_ts[,`:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
                 seasonal_m = mean(detrended_m, na.rm = TRUE)), 
              by=.(quarter(TimePeriod)) ]
    
TimePeriod | Value | trend | detrended_a | detrended_m | seasonal_a | seasonal_m
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509 | 574.1919 | 1.2924422
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305 | -111.2878 | 1.0036648
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516 | -219.8363 | 0.9488803
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137 | 136.7827 | 1.1202999
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465 | 574.1919 | 1.2924422
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172 | -111.2878 | 1.0036648

    My actual needs aren’t over long economic periods so I’m not using a better seasonality system for this blog post. There are some much better mechanisms than this.

    The remainder

    Now that we have our two components, we can calculate the residual in both situations and see which has the better fit.

    sample_ts[,`:=`( residual_a = detrended_a - seasonal_a, 
                     residual_m = detrended_m / seasonal_m )]
    
TimePeriod | Value | trend | detrended_a | detrended_m | seasonal_a | seasonal_m | residual_a | residual_m
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509 | 574.1919 | 1.2924422 | 529.1831 | 0.9815146
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305 | -111.2878 | 1.0036648 | -236.4622 | 0.9122871
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516 | -219.8363 | 0.9488803 | -227.6637 | 0.9401098
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137 | 136.7827 | 1.1202999 | 378.8423 | 1.0012620
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465 | 574.1919 | 1.2924422 | 1203.3081 | 1.0879763
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172 | -111.2878 | 1.0036648 | 175.4128 | 1.0106135

    Visualising decomposition

    I’ve done the number crunching, but you could also perform a visual decomposition. ggseas gives us a function ggsdc() which we can use.

ggsdc(sample_ts, aes(x = TimePeriod, y = Value), method = "decompose",
      frequency = 4, s.window = 8, type = "additive") +
  geom_line() +
  ggtitle("Additive") +
  theme_minimal()
    

    Additive time series decomposition
    Multiplicative time series decomposition

    The different decompositions produce differently distributed residuals. We need to assess these to identify which decomposition is a better fit.

    Assessing fit

    After decomposing our data, we need to compare the residuals. As we’re just trying to classify the time series, we don’t need to do anything particularly sophisticated – a big part of this exercise is to produce a quick function that could be used to perform an initial classification in a batch processing environment so simpler is better.

We’re going to check how much correlation between data points is still encoded within the residuals. This is the autocorrelation function (ACF), and R has a built-in function, acf(), for calculating it. As some of the correlations could be negative, we will select the type with the smallest sum of squares of the correlation values.

    ssacf<- function(x) sum(acf(x, na.action = na.omit)$acf^2)
    compare_ssacf<-function(add,mult) ifelse(ssacf(add)< ssacf(mult), "Additive", "Multiplicative") 
    sample_ts[,.(ts_type = compare_ssacf(residual_a, residual_m ))]
    
    ts_type
    Multiplicative

    Putting it all together

This isn’t a fully generalized function (as it doesn’t have configurable lags, medians, seasonality etc.), but if I had to run this exercise over multiple time series from this dataset, my overall function and usage would look like this:

    ssacf<- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf^2)
    compare_ssacf<-function(add,mult) ifelse(ssacf(add)< ssacf(mult), 
                                             "Additive", "Multiplicative") 
    additive_or_multiplicative <- function(dt){
      m<-copy(dt)
      m[,trend := zoo::rollmean(Value, 8, fill="extend", align = "right")]
      m[,`:=`( detrended_a = Value - trend,  detrended_m = Value / trend )]
      m[Value==0,detrended_m:= 0]
      m[,`:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
                      seasonal_m = mean(detrended_m, na.rm = TRUE)), 
                by=.(quarter(TimePeriod)) ]
      m[is.infinite(seasonal_m),seasonal_m:= 1]
      m[,`:=`( residual_a = detrended_a - seasonal_a, 
                       residual_m = detrended_m / seasonal_m)]
      compare_ssacf(m$residual_a, m$residual_m )
    }
    
    # Applying it to all time series in table
    sample_ts<-nzdata[ , .(Type=additive_or_multiplicative(.SD)),
                        .(Account, Category)]
    
Account | Category | Type
Current account | Inflow total | Additive
Current account | Balance | Multiplicative
Current account | Goods; Exports (fob) total | Additive
Current account | Services; Exports total | Multiplicative
Current account | Primary income; Inflow total | Multiplicative
Current account | Secondary income; Inflow total | Multiplicative
Current account | Goods balance | Multiplicative
Current account | Services balance | Multiplicative
Current account | Primary income balance | Additive
Current account | Outflow total | Additive
Current account | Goods; Imports (fob) total | Additive
Current account | Services; Imports total | Additive
Current account | Primary income; Outflow total | Additive
Current account | Secondary income; Outflow total | Multiplicative
Capital account | Inflow total | Multiplicative
Capital account | Outflow total | Multiplicative
NA | Net errors and omissions | Multiplicative
Financial account | Foreign inv. in NZ total | Multiplicative
Financial account | Balance | Additive
Financial account | Foreign inv. in NZ; Direct inv. liabilities | Additive
Financial account | Foreign inv. in NZ; Portfolio inv. liabilities | Multiplicative
Financial account | Foreign inv. in NZ; Other inv. liabilities | Multiplicative
Financial account | NZ inv. abroad total | Multiplicative
Financial account | NZ inv. abroad; Direct inv. assets | Multiplicative
Financial account | NZ inv. abroad; Portfolio inv. assets | Additive
Financial account | NZ inv. abroad; Financial derivative assets | Multiplicative
Financial account | NZ inv. abroad; Other inv. assets | Additive
Financial account | NZ inv. abroad; Reserve assets | Multiplicative

    Conclusion

    This is a very simple way of quickly assessing whether multiple time series are additive or multiplicative. It gives an effective starting point for conditionally processing batches of time series. Get the GIST of the code used throughout this blog to work through it yourself. If you’ve got an easier way of classifying time series, let me know in the comments!

    The post Is my time series additive or multiplicative? appeared first on Locke Data.

  • future: Reproducible RNGs, future_lapply() and more

    (This article was first published on jottR, and kindly contributed to R-bloggers)

    future 1.3.0 is available on CRAN. With futures, it is easy to write R code once, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via future.BatchJobs and soon also future.batchtools), or in the cloud (e.g. via googleComputeEngineR).

Futures make it easy to harness any resources at hand.

    Thanks to great feedback from the community, this new version provides:

    • A convenient lapply() function
• Added future_lapply() that works like lapply() and gives identical results, with the difference that futures are used internally. Depending on the user’s choice of plan(), these calculations may be processed sequentially, in parallel, or distributed across multiple machines.
• Perfectly reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and the choice of load balancing. Argument future.seed = TRUE (default) will use a random initial seed, which may also be specified as future.seed = <integer>. L’Ecuyer-CMRG RNG streams are used internally.
      • Load balancing can be controlled by argument future.scheduling, which is a scalar adjusting how many futures each worker should process.
    • Clarifies distinction between developer and end user
      • The end user controls what future strategy to use by default, e.g. plan(multiprocess) or plan(cluster, workers = c("machine1", "machine2", "remote.server.org")).
      • The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. f <- future(..., lazy = TRUE). Because of this, plan(lazy) is now deprecated.
    • Is even more friendly to multi-tenant compute environments
      • availableCores() returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (parallel::detectCores()). If option mc.cores or environment variable MC_CORES is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. This way developers don’t have to adjust their code to match a certain compute environment; the default works everywhere.
• With the new version, it is possible to override the fallback value used when nothing else is specified, so that it is not the number of cores on the machine but instead the value of option future.availableCores.fallback or environment variable R_FUTURE_AVAILABLE_FALLBACK. For instance, by setting R_FUTURE_AVAILABLE_FALLBACK=1 system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores. This lowers the risk of overloading the CPU by mistake.
• Analogously to how availableCores() returns the number of cores, the new function availableWorkers() returns the host names available to the R process. The default is rep("localhost", times = availableCores()), but when using HPC schedulers it may be the host names of other compute nodes allocated to the job (see the sketch after this list).
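A small sketch of the two functions; the plan() call is just one hypothetical way of using the result:

    library("future")
    availableCores()    # respects mc.cores, MC_CORES and HPC scheduler allotments
    availableWorkers()  # defaults to rep("localhost", times = availableCores())
    plan(multisession, workers = availableCores())   # hypothetical explicit use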

    For full details on updates, please see the NEWS file. The future package installs out-of-the-box on all operating systems.

    A quick example

The bootstrap example from help("clusterApply", package = "parallel"), adapted to make use of futures:

    library("future")
    library("boot")

run <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 5000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

    # base::lapply()
    system.time(boot <- lapply(1:100, FUN = run))
    ### user system elapsed
    ### 133.637 0.000 133.744

    # Sequentially on the local machine
    plan(sequential)
    system.time(boot0 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
    ### user system elapsed
    ### 134.916 0.003 135.039

    # In parallel on the local machine (with 8 cores)
    plan(multisession)
    system.time(boot1 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
    ### user system elapsed
    ### 0.960 0.041 29.527
    stopifnot(all.equal(boot1, boot0))

    What’s next?

The future.BatchJobs package, which builds on top of BatchJobs, provides future strategies for various HPC schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using plan(batchjobs_torque) instead of plan(multiprocess), your futures will be resolved on a compute cluster instead of in parallel on your local machine. That’s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend everyone to use their new batchtools package instead. Just like BatchJobs, it is a very well written package, but at the same time it is more robust against cluster problems and it also supports more types of HPC schedulers. Because of this, I’ve been working on future.batchtools, which I hope to be able to release soon.
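As a sketch (assuming future.BatchJobs is installed and a TORQUE/PBS scheduler is available), the bootstrap example above could be dispatched to the cluster with nothing more than a change of plan:

    library("future.BatchJobs")
    plan(batchjobs_torque)   # instead of plan(multiprocess)
    boot2 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF)  # resolved on the cluster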

    Finally, I’m really keen on looking into how futures can be used with Shaun Jackman’s lambdar, which is a proof-of-concept that allows you to execute R code on Amazon’s “serverless” AWS Lambda framework. My hope is that, in a not too far future (pun not intended*), we’ll be able to resolve our futures on AWS Lambda using plan(aws_lambda).

    Happy futuring!

    (*) Alright, I admit, it was intended.

  • SatRday and visual inference of vine copulas

    (This article was first published on Pareto's Playground, and kindly contributed to R-bloggers)

    SatRday

    From the 16th to the 18th of February, satRday was held in the City of Cape Town in South Africa. The programme kicked off with two days of workshops and then the conference on Saturday. The workshops were divided up into three large sections:

    R and Git

Integrating version control into your workflow through git and RStudio has never been this easy. If you are not already following the principle of Pull, Commit and Push in your workflow, I recommend you visit the website http://happygitwithr.com/ which will help you correct your current workflow indiscretions!

If, however, you are an avid user of GitHub and regularly push your commits with correct messages, here is a gem of a website that made my day – starlogs

    Shiny

The Shiny workshop was an amazing exhibition of what can be done in an afternoon with the power of Shiny and RStudio! We got to build an interactive dashboard investigating disease data from South Africa using the new flexdashboard.

    Training Models

Once again I realised how powerful the caret package can be for training and testing a wide range of models in a modular way without too much hassle or fuss. I am really keen to start playing with the h2o package after a brief overview of its ensemble capabilities was shown.

    The Conference

The Saturday saw a multitude of speakers take the stage to discuss some of the most interesting topics and applications of R that I have seen in a very long time. The full programme can be viewed on the website. I also had the opportunity to talk about how we can combine Shiny and the visNetwork html widget as a way to conduct analysis of high-dimensional Vine Copulas.

R, as open source software with a large following, provides quick and easy access to complex models and methods that are used and applied widely within academia, but more importantly, in practice. This increase in complexity, both theoretical and mathematical, has resulted in an obligation on practitioners to break down walls of jargon and mathematical hyperbole into easily digestible actions and insight.

    My attempt at addressing this philosophical idea was to create a small package called VineVizR. The package can be viewed and downloaded here. I used this package and its function to build a shiny App where one can explore the interlinkages between 23 of South Africa’s largest companies by market cap using Vine Copulas.

Vine copulas allow us to model complex linkages among assets, but currently the visual illustration of vine copulae from the VineCopula package offers a bland plotting output that is hard to interpret. Combining the visNetwork html widget with the VineCopula RVM output offers an interactive and visually appealing way to interpret your output.

The app can be found on my ShinyApp page. I hope you enjoy this small app and the metadata that has been integrated into it. If you feel that you still want to play a bit more, go download the package and experiment with the VizVineR function, where you can create your own groups! But be aware, I don’t see myself as a developer – so when you peek under the hood and see any improvements that can be made, let me know.

    View of the dashboard

  • Building Shiny App exercises part 7

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Connect widgets & plots

In the seventh part of our journey we are ready to connect more of the widgets we created before with our k-means plot in order to fully control its output. Of course we will also properly reform the plot itself in order to make it a real k-means plot.
Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    First of all let’s move the widgets we are going to use from the sidebarPanel into the mainPanel and specifically under our plot.

    Exercise 1

Remove the textInput from your server.R file. Then place the checkboxGroupInput and the selectInput in the same row with the sliderInput. Name them “Variable X” and “Variable Y” respectively. HINT: Use fluidRow and column.

    Create a reactive expression

    Reactive expressions are expressions that can read reactive values and call other reactive expressions. Whenever a reactive value changes, any reactive expressions that depended on it are marked as “invalidated” and will automatically re-execute if necessary. If a reactive expression is marked as invalidated, any other reactive expressions that recently called it are also marked as invalidated. In this way, invalidations ripple through the expressions that depend on each other.
    The reactive expression is activated like this: example <- reactive({ })

    Exercise 2

    Place a reactive expression in server.R, at any spot except inside output$All and name it “Data”. HINT: Use reactive

    Connect your dataset’s variables with your widgets.

    Now let’s connect your selectInput with the variables of your dataset as in the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
                 selectInput('ycol', 'Y Variable', names(iris))),
    mainPanel(h1("Main"))
  )
))

#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$ycol)]
  })
})

    Exercise 3

Put the variables of the iris dataset as inputs in your selectInput as “Variable Y”. HINT: Use names.

    Exercise 4

    Do the same for checkboxGroupInput and “Variable X”. HINT: Use names.

Select the fourth variable as default, like the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
                 checkboxGroupInput("xcol", "Variable X", names(iris),
                                    selected = names(iris)[[4]]),
                 selectInput("ycol", "Y Variable", names(iris),
                             selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
})

    Exercise 5

Make the second variable the default choice for both widgets. HINT: Use selected.

Now follow the example below to create a new reactive function and place the automated k-means calculation inside it.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
                 checkboxGroupInput("xcol", "Variable X", names(iris),
                                    selected = names(iris)[[4]]),
                 selectInput("ycol", "Y Variable", names(iris),
                             selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
  example2 <- reactive({
    kmeans(example())
  })
})

    Exercise 6

    Create the reactive function Clusters and put in there the function kmeans which will be applied on the function Data. HINT: Use reactive.

    Connect your plot with the widgets.

    It is time to connect your plot with the widgets.

    Exercise 7

    Put Data inside renderPlot as first argument replacing the data that you have chosen to be plotted until now. Moreover delete xlab and ylab.

Improve your k-means visualization.

You can automatically change the colours of your clusters by copying and pasting this part of code as the first argument of renderPlot, before the plot function:

    palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
    "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

    We will choose to have up to nine clusters so we choose nine colours.

    Exercise 8

    Set min of your sliderInput to 1, max to 9 and value to 4 and use the palette function to give colours.

    This is how you can give different colors to your clusters. To activate these colors put this part of code into your plot function.

    col = Clusters()$cluster,

    Exercise 9

    Activate the palette function.

To make your clusters easier to spot, you can fully colour their points by adding this into the plot function:
    pch = 20, cex = 3

    Exercise 10

    Fully color the points of your plot.

  • Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

    (This article was first published on Easy Guides, and kindly contributed to R-bloggers)

factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses, including:

1. Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e. quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

    2. Correspondence Analysis (CA), which is an extension of the principal component analysis suited to analyse a large contingency table formed by two qualitative variables (or categorical data).

    3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

    4. Multiple Factor Analysis (MFA) dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

    5. Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.

6. Factor Analysis of Mixed Data (FAMD), a particular case of MFA, dedicated to analyzing a data set containing both quantitative and qualitative variables.

    There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the results are presented differently depending on the package used. To help in the interpretation and in the visualization of multivariate analyses – such as cluster analysis and dimensionality reduction analysis – we developed an easy-to-use R package named factoextra.

    • The R package factoextra has flexible and easy-to-use methods to extract quickly, in a human readable standard data format, the analysis results from the different packages mentioned above.

    • It produces a ggplot2-based elegant data visualization with less typing.

    • It contains also many functions facilitating clustering analysis and visualization.

    We’ll use i) the FactoMineR package (Sebastien Le, et al., 2008) to compute PCA, (M)CA, FAMD, MFA and HCPC; ii) and the factoextra package for extracting and visualizing the results.

    FactoMineR is a great and my favorite package for computing principal component methods in R. It’s very easy to use and very well documented. The official website is available at: http://factominer.free.fr/. Thanks to François Husson for his impressive work.

The figure below shows the methods whose outputs can be visualized using the factoextra package. The official online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra.


Why use factoextra?

    1. The factoextra R package can handle the results of PCA, CA, MCA, MFA, FAMD and HMFA from several packages, for extracting and visualizing the most important information contained in your data.

2. After PCA, CA, MCA, MFA, FAMD and HMFA, the most important row/column elements can be highlighted using:
    • their cos2 values corresponding to their quality of representation on the factor map
    • their contributions to the definition of the principal dimensions.

    If you want to do this, the factoextra package provides a convenient solution.

3. PCA and (M)CA are sometimes used for prediction problems: one can predict the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals using the information provided by the previously performed PCA or (M)CA. This can be done easily using the FactoMineR package.

If you want to make predictions with PCA/MCA and to visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you. It’s quick: write less and do more…

4. Several functions from different packages – FactoMineR, ade4, ExPosition, stats – are available in R for performing PCA, CA or MCA. However, the components of the output vary from package to package.

No matter which package you decide to use, factoextra can give you a human-understandable output (a short sketch follows below).
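A minimal sketch of both points, using the decathlon2 data that also appears below (the choice of supplementary rows is purely for illustration):

    library("FactoMineR")
    library("factoextra")
    data("decathlon2")

    # Supplementary individuals: rows 24 to 27 are projected onto axes computed from
    # the active rows, and fviz_pca_ind() places them on the same factor map
    res.pca <- PCA(decathlon2[, 1:10], ind.sup = 24:27, graph = FALSE)
    fviz_pca_ind(res.pca, repel = TRUE)

    # The same extractor works regardless of which package fitted the PCA
    get_eig(prcomp(decathlon2[1:23, 1:10], scale. = TRUE))  # stats::prcomp
    get_eig(res.pca)                                        # FactoMineR::PCA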

    Installing FactoMineR

The FactoMineR package can be installed and loaded as follows:

    # Install
    install.packages("FactoMineR")
    
    # Load
    library("FactoMineR")

    Installing and loading factoextra

• factoextra can be installed from CRAN as follows:
    install.packages("factoextra")
    • Or, install the latest version from Github
    if(!require(devtools)) install.packages("devtools")
    devtools::install_github("kassambara/factoextra")
• Load factoextra as follows:
    library("factoextra")

    Main functions in the factoextra package

    See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.

    Visualizing dimension reduction analysis outputs

Functions | Description
fviz_eig (or fviz_eigenvalue) | Extract and visualize the eigenvalues/variances of dimensions.
fviz_pca | Graph of individuals/variables from the output of Principal Component Analysis (PCA).
fviz_ca | Graph of column/row variables from the output of Correspondence Analysis (CA).
fviz_mca | Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA).
fviz_mfa | Graph of individuals/variables from the output of Multiple Factor Analysis (MFA).
fviz_famd | Graph of individuals/variables from the output of Factor Analysis of Mixed Data (FAMD).
fviz_hmfa | Graph of individuals/variables from the output of Hierarchical Multiple Factor Analysis (HMFA).
fviz_ellipses | Draw confidence ellipses around the categories.
fviz_cos2 | Visualize the quality of representation of the row/column variables from the results of the PCA, CA, MCA functions.
fviz_contrib | Visualize the contributions of row/column elements from the results of the PCA, CA, MCA functions.

    Extracting data from dimension reduction analysis outputs

Functions | Description
get_eigenvalue | Extract and visualize the eigenvalues/variances of dimensions.
get_pca | Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs.
get_ca | Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis outputs.
get_mca | Extract results from Multiple Correspondence Analysis outputs.
get_mfa | Extract results from Multiple Factor Analysis outputs.
get_famd | Extract results from Factor Analysis of Mixed Data outputs.
get_hmfa | Extract results from Hierarchical Multiple Factor Analysis outputs.
facto_summarize | Subset and summarize the output of factor analyses.

    Clustering analysis and visualization

Functions | Description
dist (fviz_dist, get_dist) | Enhanced distance matrix computation and visualization.
get_clust_tendency | Assessing clustering tendency.
fviz_nbclust (fviz_gap_stat) | Determining and visualizing the optimal number of clusters.
fviz_dend | Enhanced visualization of dendrograms.
fviz_cluster | Visualize clustering results.
fviz_mclust | Visualize model-based clustering results.
fviz_silhouette | Visualize silhouette information from clustering.
hcut | Computes hierarchical clustering and cuts the tree.
hkmeans (hkmeans_tree, print.hkmeans) | Hierarchical k-means clustering.
eclust | Visual enhancement of clustering analysis.

    Dimension reduction and factoextra

    As depicted in the figure below, the type of analysis to be performed depends on the data set formats and structures.

    dimension reduction and factoextra

    In this section we start by illustrating classical methods – such as PCA, CA and MCA – for analyzing a data set containing continuous variables, contingency table and qualitative variables, respectively.

We continue by discussing advanced methods – such as FAMD, MFA and HMFA – for analyzing a data set containing a mix of variables (qualitative & quantitative), organized or not into groups.

Finally, we show how to perform hierarchical clustering on principal components (HCPC), which is useful for clustering a data set containing only qualitative variables or a mix of qualitative and quantitative variables.
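A quick sketch of that HCPC workflow on the USArrests data (also used further below); the exact arguments are just one reasonable choice:

    library("FactoMineR")
    library("factoextra")

    # PCA first, then hierarchical clustering on the retained principal components
    res.pca  <- PCA(USArrests, ncp = 3, graph = FALSE)
    res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)  # -1 lets HCPC suggest the cut
    fviz_dend(res.hcpc, rect = TRUE, cex = 0.5)              # factoextra can plot HCPC results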

    Principal component analysis

    • Data: decathlon2 [in factoextra package]
    • PCA function: FactoMineR::PCA()
    • Visualization factoextra::fviz_pca()

    Read more about computing and interpreting principal component analysis at: Principal Component Analysis (PCA).

    1. Loading data
    library("factoextra")
    data("decathlon2")
    df <- decathlon2[1:23, 1:10]
2. Principal component analysis
    library("FactoMineR")
    res.pca <- PCA(df,  graph = FALSE)
3. Extract and visualize eigenvalues/variances:
    # Extract eigenvalues/variances
    get_eig(res.pca)
    ##        eigenvalue variance.percent cumulative.variance.percent
    ## Dim.1   4.1242133        41.242133                    41.24213
    ## Dim.2   1.8385309        18.385309                    59.62744
    ## Dim.3   1.2391403        12.391403                    72.01885
    ## Dim.4   0.8194402         8.194402                    80.21325
    ## Dim.5   0.7015528         7.015528                    87.22878
    ## Dim.6   0.4228828         4.228828                    91.45760
    ## Dim.7   0.3025817         3.025817                    94.48342
    ## Dim.8   0.2744700         2.744700                    97.22812
    ## Dim.9   0.1552169         1.552169                    98.78029
    ## Dim.10  0.1219710         1.219710                   100.00000
    # Visualize eigenvalues/variances
    fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

4. Extract and visualize results for variables:

    # Extract the results for variables
    var <- get_pca_var(res.pca)
    var
    ## Principal Component Analysis Results for variables
    ##  ===================================================
    ##   Name       Description
    ## 1 "$coord"   "Coordinates for the variables"
    ## 2 "$cor"     "Correlations between variables and dimensions"
    ## 3 "$cos2"    "Cos2 for the variables"
    ## 4 "$contrib" "contributions of the variables"
    # Coordinates of variables
    head(var$coord)
    ##                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
    ## X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
    ## Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
    ## Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
    ## High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
    ## X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
    ## X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200
    # Contribution of variables
    head(var$contrib)
    ##                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
    ## X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
    ## Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
    ## Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
    ## High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
    ## X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
    ## X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
    # Graph of variables: default plot
    fviz_pca_var(res.pca, col.var = "black")

    It’s possible to control variable colors using their contributions (“contrib”) to the principal axes:

    # Control variable colors using their contributions
    fviz_pca_var(res.pca, col.var="contrib",
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 repel = TRUE # Avoid text overlapping
                 )

5. Variable contributions to the principal axes:
    # Contributions of variables to PC1
    fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
    
    # Contributions of variables to PC2
    fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

6. Extract and visualize results for individuals:
    # Extract the results for individuals
    ind <- get_pca_ind(res.pca)
    ind
    ## Principal Component Analysis Results for individuals
    ##  ===================================================
    ##   Name       Description
    ## 1 "$coord"   "Coordinates for the individuals"
    ## 2 "$cos2"    "Cos2 for the individuals"
    ## 3 "$contrib" "contributions of the individuals"
    # Coordinates of individuals
    head(ind$coord)
    ##                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
    ## SEBRLE     0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
    ## CLAY       0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
    ## BERNARD   -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
    ## YURKOV    -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
    ## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
    ## McMULLEN   0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286
    # Graph of individuals
    # 1. Use repel = TRUE to avoid overplotting
    # 2. Control automatically the color of individuals using the cos2
        # cos2 = the quality of the individuals on the factor map
        # Use points only
    # 3. Use gradient color
    fviz_pca_ind(res.pca, col.ind = "cos2",
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 repel = TRUE # Avoid text overlapping (slow if many points)
                 )

    # Biplot of individuals and variables
    fviz_pca_biplot(res.pca, repel = TRUE)

7. Color individuals by groups:
    # Compute PCA on the iris data set
    # The variable Species (index = 5) is removed
    # before PCA analysis
    iris.pca <- PCA(iris[,-5], graph = FALSE)
    
    # Visualize
    # Use habillage to specify groups for coloring
    fviz_pca_ind(iris.pca,
                 label = "none", # hide individual labels
                 habillage = iris$Species, # color by groups
                 palette = c("#00AFBB", "#E7B800", "#FC4E07"),
                 addEllipses = TRUE # Concentration ellipses
                 )

    Correspondence analysis

    • Data: housetasks [in factoextra]
    • CA function FactoMineR::CA()
    • Visualize with factoextra::fviz_ca()

    Read more about computing and interpreting correspondence analysis at: Correspondence Analysis (CA).

    • Compute CA:
     # Loading data
    data("housetasks")
    
     # Computing CA
    library("FactoMineR")
    res.ca <- CA(housetasks, graph = FALSE)
    • Extract results for row/column variables:
    # Result for row variables
    get_ca_row(res.ca)
    
    # Result for column variables
    get_ca_col(res.ca)
    • Biplot of rows and columns
    fviz_ca_biplot(res.ca, repel = TRUE)

    To visualize only row points or column points, type this:

    # Graph of row points
    fviz_ca_row(res.ca, repel = TRUE)
    
    # Graph of column points
    fviz_ca_col(res.ca)
    
    # Visualize row contributions on axes 1
    fviz_contrib(res.ca, choice ="row", axes = 1)
    
    # Visualize column contributions on axes 1
    fviz_contrib(res.ca, choice ="col", axes = 1)

    Multiple correspondence analysis

    • Data: poison [in factoextra]
    • MCA function FactoMineR::MCA()
    • Visualization factoextra::fviz_mca()

    Read more about computing and interpreting multiple correspondence analysis at: Multiple Correspondence Analysis (MCA).

    1. Computing MCA:
    library(FactoMineR)
    data(poison)
    res.mca <- MCA(poison, quanti.sup = 1:2,
                  quali.sup = 3:4, graph=FALSE)
2. Extract results for variables and individuals:
    # Extract the results for variable categories
    get_mca_var(res.mca)
    
    # Extract the results for individuals
    get_mca_ind(res.mca)
3. Contribution of variables and individuals to the principal axes:
    # Visualize variable categorie contributions on axes 1
    fviz_contrib(res.mca, choice ="var", axes = 1)
    
    # Visualize individual contributions on axes 1
    # select the top 20
    fviz_contrib(res.mca, choice ="ind", axes = 1, top = 20)
4. Graph of individuals:
    # Color by groups
    # Add concentration ellipses
    # Use repel = TRUE to avoid overplotting
    grp <- as.factor(poison[, "Vomiting"])
    fviz_mca_ind(res.mca,  habillage = grp,
                 addEllipses = TRUE, repel = TRUE)

5. Graph of variable categories:
    fviz_mca_var(res.mca, repel = TRUE)

6. Biplot of individuals and variables:
    fviz_mca_biplot(res.mca, repel = TRUE)

    Advanced methods

The factoextra R package also has functions that support the visualization of advanced methods such as MFA, FAMD and HMFA.

    Cluster analysis and factoextra

    To learn more about cluster analysis, you can refer to the book available at: Practical Guide to Cluster Analysis in R


    The main parts of the book include:

    • distance measures,
    • partitioning clustering,
    • hierarchical clustering,
    • cluster validation methods, as well as,
    • advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.

    Partitioning clustering


    # 1. Loading and preparing data
    data("USArrests")
    df <- scale(USArrests)
    
    # 2. Compute k-means
    set.seed(123)
    km.res <- kmeans(scale(USArrests), 4, nstart = 25)
    
    # 3. Visualize
    library("factoextra")
    fviz_cluster(km.res, data = df,
                 palette = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"),
                 ggtheme = theme_minimal(),
                 main = "Partitioning Clustering Plot"
                 )

    Hierarchical clustering


    library("factoextra")
    # Compute hierarchical clustering and cut into 4 clusters
    res <- hcut(USArrests, k = 4, stand = TRUE)
    
    # Visualize
    fviz_dend(res, rect = TRUE, cex = 0.5,
              k_colors = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"))

    Determine the optimal number of clusters

    # Optimal number of clusters for k-means
    library("factoextra")
    my_data <- scale(USArrests)
    fviz_nbclust(my_data, kmeans, method = "gap_stat")

Acknowledgements

    I would like to thank Fabian Mundt for his active contributions to factoextra.

We sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.

    References

    • H. Wickham (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
    • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.5.
    • Sebastien Le, Julie Josse, Francois Husson (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1), 1-18. 10.18637/jss.v025.i01
    • Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. DOI: 10.1093/bioinformatics/btv428

    Infos

    This analysis has been performed using R software (ver. 3.3.2) and factoextra (ver. 1.0.4.999)

  • padr::pad does now do group padding

    (This article was first published on That’s so Random, and kindly contributed to R-bloggers)

    A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you have missed this, see the introduction blog or vignette("padr") for a general introduction. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups.

    In the Examples of padr in v0.1.0 I showed that padding over multiple groups could be done by using padr in conjunction with dplyr and tidyr.

    library(dplyr)
    library(padr)
    # padding a data.frame on group level
    day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
    x_df_grp <- data.frame(grp  = rep(LETTERS[1:3], each = 4),
                           y    = runif(12, 10, 20) %>% round(0),
                           date = sample(day_var, 12, TRUE)) %>%
     arrange(grp, date)
    
    x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>%
      tidyr::fill(grp) 
    

    I quite quickly realized this is an unsatisfactory solution for two reasons:

    1) It is a hassle. It is the goal of padr to make datetime preparation as swift and pain free as possible. Having to manually fill your grouping variable(s) after padding is not exactly in concordance with that goal.
    2) It does not work when one or both of the start_val and end_val arguments are specified. The start and/or the end of the time series of a group are then no longer bounded by an original observation, and thus don’t have a value of the grouping variable(s). Forward filling with tidyr::fill will incorrectly fill the grouping variable(s) as a result.

    It was therefore necessary to expand pad, so the grouping variable(s) do not contain missing values anymore after padding. The group argument takes a character vector with the column name(s) of the variables to group by. Padding will be done on each of the groups formed by the unique combination of the grouping variables. This is of course just the distinct of the variable, if there is only one grouping variable. The result of the date padding will be exactly the same as before this addition (meaning the datetime variable does not change). However, the returned data frame will no longer have missing values for the grouping variables on the padded rows.

    The new version of the section in the Examples of padr is:

    day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
    x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each =4),
                           grp2 = letters[1:2],
                           y    = runif(12, 10, 20) %>% round(0),
                           date = sample(day_var, 12, TRUE)) %>%
     arrange(grp1, grp2, date)
    
    # pad by one grouping var
    x_df_grp %>% pad(group = 'grp1')
    
    # pad by two groups vars
    x_df_grp %>% pad(group = c('grp1', 'grp2'))
    

    Bug fixes

    Besides the additional argument there were two bug fixes in this version:

    • pad does no longer break when datetime variable contains one value only. Returns x and a warning if start_val and end_val are NULL, and will do proper padding when one or both are specified.

• A meaningful error is now thrown in the fill_ functions when you forget to specify at least one column to apply the function to.

    v0.2.1

    Right before posting this blog, Doug Friedman found out that in v0.2.0 the by argument no longer functioned. This bug was fixed in the patch release v0.2.1.

I hope you (still) enjoy working with padr; let me know when you find a bug or have ideas for improvement.

  • Predicting food preferences with sparklyr (machine learning)

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    This week I want to show how to run machine learning applications on a Spark cluster. I am using the sparklyr package, which provides a handy interface to access Apache Spark functionalities via R.

    The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines.

    Apache Spark

    Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup.
    Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training.
Obviously, Spark’s power comes into play when dispatching work to external clusters, but for demonstration purposes, I am running the demo on a local Spark instance.

    MLlib

    Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing.
    With sparklyr you can easily access MLlib. You can work with a couple of different machine learning algorithms and with functions for manipulating features and Spark dataframes. Additionally, you can also perform SQL queries. sparklyr also implements dplyr, making it especially convenient for handling data.

    If you don’t have Spark installed locally, run:

    library(sparklyr)
    spark_install(version = "2.0.0")
    

    Now we can connect to a local Spark instance:

    library(sparklyr)
    sc <- spark_connect(master = "local", version = "2.0.0")
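
As a hypothetical sketch (not part of the analysis below) of how the dplyr and MLlib integrations mentioned above are typically used once a connection sc exists:

    library(dplyr)

    # Copy a small local data set into Spark; dots in column names become underscores
    iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

    # dplyr verbs are translated to Spark SQL and executed inside Spark
    iris_tbl %>%
      group_by(Species) %>%
      summarise(mean_petal_length = mean(Petal_Length))

    # An MLlib model fitted through sparklyr's ml_* interface
    fit <- ml_linear_regression(iris_tbl, response = "Petal_Length",
                                features = c("Petal_Width", "Sepal_Length"))
    summary(fit)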
    

    Preparations

Before I start with the analysis, I am preparing my custom ggplot2 theme and loading the packages tidyr (for gathering data for plotting), dplyr (for data manipulation) and ggrepel (for non-overlapping text labels in plots).

    library(tidyr)
    library(ggplot2)
    library(ggrepel)
    library(dplyr)
    
    my_theme <- function(base_size = 12, base_family = "sans"){
      theme_minimal(base_size = base_size, base_family = base_family) +
      theme(
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        panel.grid.major = element_line(color = "grey"),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "aliceblue"),
        strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
        strip.text = element_text(face = "bold", size = 12, color = "black"),
        legend.position = "right",
        legend.justification = "top", 
        panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
      )
    }
    

    The Data

    Of course, the power of Spark lies in speeding up operations on large datasets. But because that’s not very handy for demonstration, I am here working with a small dataset: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup.

    This dataset is part of the fivethirtyeight package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:

    • 5: I love this country’s traditional cuisine. I think it’s one of the best in the world.
    • 4: I like this country’s traditional cuisine. I think it’s considerably above average.
• 3: I’m OK with this country’s traditional cuisine. I think it’s about average.
    • 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
    • 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
    • N/A: I’m unfamiliar with this country’s traditional cuisine.

    Because I think that whether someone doesn’t know a country’s cuisine is in itself information, I recoded NAs to 0.

    library(fivethirtyeight)
    
    food_world_cup[food_world_cup == "N/A"] <- NA
    food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0
    food_world_cup$gender <- as.factor(food_world_cup$gender)
    food_world_cup$location <- as.factor(food_world_cup$location)
    

    The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location.

    Before I do any machine learning, however, I want to get to know the data.
    First, I am calculating the percentages for each preference category and plotting them in a pie chart that is facetted by country.

    # calculating percentages per category and country
    percentages <- food_world_cup %>%
      select(algeria:vietnam) %>%
      gather(x, y) %>%
      group_by(x, y) %>%
      summarise(n = n()) %>%
      mutate(Percent = round(n / sum(n) * 100, digits = 2))
    
    # rename countries & plot
    percentages %>%
      mutate(x_2 = gsub("_", " ", x)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = "", y = Percent, fill = y)) + 
        geom_bar(width = 1, stat = "identity") + 
        theme_minimal() +
        coord_polar("y", start = 0) +
        facet_wrap(~ x_2, ncol = 8) +
        scale_fill_brewer(palette = "Set3") +
        labs(fill = "")
    

    As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.).

    Imputing missing values

    In the demographic information, there are a few missing values. These, I want to impute:

    library(mice)
    
    dataset_impute <- mice(food_world_cup[, -c(1, 2)],  print = FALSE)
    food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE], mice::complete(dataset_impute, 1))
    

    Transforming preference values

    Strictly speaking, the preference data is categorical. But because using the preferences as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transforming them by dividing by the mean of the non-zero values for each country.

    food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric)
    
    countries <- paste(colnames(food_world_cup)[-c(1:7)])
    
    for (response in countries) {
      food_world_cup[paste(response, "trans", sep = "_")] <- food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response])
    }
    

    As we can see by plotting the distribution of transformed values, they are far from being normal distributions. Moreover, we still see big differences between well known and not well known countries – this means that the data is unbalanced.

    food_world_cup %>%
      gather(x, y, algeria_trans:vietnam_trans) %>%
      mutate(x_2 = gsub("_trans", "", x)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(y)) +
        geom_density(fill = "navy", alpha = 0.7) +
        my_theme() + 
        facet_wrap(~ x_2, ncol = 8) +
        labs(x = "transformed preference")
    

    Most liked cuisines and gender biases

    Then, I wanted to know which countries were the most liked and whether there is a gender bias in the preference for some countries’ cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers.

    food_world_cup_gather <- food_world_cup %>%
      collect %>%
      gather(country, value, algeria:vietnam)
                                     
    food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
    food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)
    
    food_world_cup_gather <- food_world_cup_gather %>%
      mutate(x_2 = gsub("_", " ", country)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2))
    
    order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)
    
    food_world_cup_gather %>%
      mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
      ggplot(aes(x = x_2, y = value, fill = gender)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette = "Set1") +
        my_theme() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
        labs(fill = "Gender",
             x = "",
             y = "sum of preferences")
    

    Cuisines from Italy, the United States and Mexico were the most liked, while those from Ghana, Cameroon and the Ivory Coast were the least liked (the fact that they were also less well known plays into it, of course). No country shows an obvious gender bias, though.

    To dig a bit deeper into gender differences, I am calculating the differences between mean preference of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values.

    Because I want to pass the country list to dplyr’s functions, I need to use standard evaluation (SE). By default, dplyr uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks. This makes it possible to easily convert between R and SQL code. But each dplyr function also has a version for use with SE: these have an underscore (“_”) appended to the function name, here e.g. mutate_each_().
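
    To make the difference concrete, here is a tiny sketch with a made-up data frame (dplyr 0.5-style SE verbs; the column names are just illustrative):

    library(dplyr)
    
    df <- data.frame(algeria = c(1, 3, 5), china = c(2, 4, 0))
    
    # NSE: column names are written bare
    df %>% summarise(mean_algeria = mean(algeria))
    
    # SE: column names are passed as strings, so they can come from a character vector
    cols <- c("algeria", "china")
    df %>% summarise_each_(funs(mean), cols)
    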

    food_world_cup %>%
      collect %>%
      mutate_each_(funs(as.numeric), countries) %>%
      group_by(gender) %>%
      summarise_each_(funs(mean), countries) %>%
      summarise_each_(funs(diff), countries) %>%
      gather(x, y) %>%
      mutate(x_2 = gsub("_", " ", x)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = x_2, y = y)) +
        geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
        my_theme() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
        labs(x = "",
             y = "difference\nbetween gender")
    

    All gender differences are very close to zero, so it seems that men and women generally have similar food preferences.

    Spark

    Now that I’m somewhat familiar with the data, I copy it to the Spark instance:

    food_world_cup <- copy_to(sc, food_world_cup)
    

    Principal Component Analysis (PCA)

    One of Spark’s machine learning functions is ml_pca() for PCA. I am using it to get an idea of where the countries fall on a two-dimensional plane spanned by the first two principal components.

    pca <- food_world_cup %>%
      mutate_each_(funs(as.numeric), countries) %>%
      ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)]))
    
    library(tibble)
    as.data.frame(pca$components) %>%
      rownames_to_column(var = "labels") %>%
      mutate(x_2 = gsub("_trans", "", labels)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) + 
        geom_point(size = 2, alpha = 0.6) +
        geom_text_repel() +
        labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"),
             y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
        my_theme() + 
        guides(fill = FALSE, color = FALSE)
    

    The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.

    Preparing the data for machine learning

    First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the ft_string_indexer() function. For other feature transformation functions, check out sparklyr’s website.

    food_world_cup <- tbl(sc, "food_world_cup") %>%
      ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
      ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
      ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
      ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
      ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
      ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
      ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")
    

    Now I can divide the data into training and test sets using the sdf_partition() function.

    partitions <- food_world_cup %>%
      sdf_partition(training = 0.75, test = 0.25, seed = 753)
    

    Modeling

    Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with sparklyr’s website to find out about other machine learning algorithms you can use.

    For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

    Then, I am filtering out all data rows where the response variable was 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias to the prediction. Here, I needed to use standard evaluation again, this time with lazyeval’s interp() function and a formula.

    I am using the ml_random_forest() function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to the models that used all features.

    The sdf_predict() function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the ml_classification_eval() function.

    Finally, I am combining the output tables from all countries’ models so that I can compare them.
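
    Before the full loop, here is what the interp() call described above evaluates to in isolation (with a hypothetical response name):

    library(lazyeval)
    
    response <- "algeria"
    interp(~ var > 0, var = as.name(response))
    # ~algeria > 0
    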

    library(lazyeval)
    
    for (response in countries) {
    
      features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
      features <- features[grep("_trans|_idx", features)]
    
      fit <- partitions$training %>%
        filter_(interp(~ var > 0, var = as.name(response))) %>%
        ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
      
      feature_imp <- ml_tree_feature_importance(sc, fit)
      
      features <- as.character(feature_imp[1:10, 2])
      
      fit <- partitions$training %>%
        filter_(interp(~ var > 0, var = as.name(response))) %>%
        ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")
      
      partitions$test <- partitions$test %>%
        filter_(interp(~ var > 0, var = as.name(response)))
      
      pred <- sdf_predict(fit, partitions$test) %>%
        collect
      
      pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
      pred_2$response <- response
      
      pred_sc <- select(pred, -rawPrediction, -probability)
      pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)
      
      feature_imp$response <- response
      
      f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
      wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
      wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")
      
      ml_eval <- data.frame(response = response,
                            f1 = f1,
                            weightedPrecision = wP,
                            weightedRecall = wR)
      
      if (response == "algeria") {
        feature_imp_df <- feature_imp
        ml_eval_df <- ml_eval
        pred_df <- pred_2
      } else {
        feature_imp_df <- rbind(feature_imp_df, feature_imp)
        ml_eval_df <- rbind(ml_eval_df, ml_eval)
        pred_df <- rbind(pred_df, pred_2)
      }
    }
    

    Model evaluation

    Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

    • Precision

    In classification models, precision gives the proportion of predicted positives that are true positives (i.e. how many of the samples that were classified as “5” actually were a “5”). A high precision means that most of the samples the model assigned to a class really belong to that class.

    • Recall

    Recall, or sensitivity, gives the proportion of all true “5” samples that were correctly classified as “5”. A high recall means that our model correctly detected most members of each class.

    • F1 score

    The F1 score is the harmonic mean of precision and recall.

    Let’s say we had 10 “5”s in the data. The model classified 8 samples as “5”s, 6 of which were correct. In this example, we have 2 false positives and 4 false negatives, which gives a precision of 6/8 and a recall of 6/10.
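
    The same numbers in plain R, just to make the arithmetic explicit:

    tp <- 6   # true positives: samples correctly classified as "5"
    fp <- 2   # false positives: samples classified as "5" that weren't
    fn <- 4   # false negatives: true "5"s that the model missed
    
    precision <- tp / (tp + fp)                                 # 6/8  = 0.75
    recall    <- tp / (tp + fn)                                 # 6/10 = 0.6
    f1        <- 2 * precision * recall / (precision + recall)  # ~0.67
    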

    results <- ml_eval_df %>%
      mutate(x_2 = gsub("_", " ", response)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2))
      
    order <- results$x_2[order(results$f1, decreasing = TRUE)]
    
    gather(results, x, y, f1:weightedRecall) %>%
      mutate(x_2 = factor(x_2, levels = order)) %>%
      ggplot(aes(x = x_2, y = y, fill = x)) +
        geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
        scale_fill_brewer(palette = "Set1") +
        my_theme() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
        labs(fill = "", color = "", x = "", y = "value")
    

    The plot above shows that the model predicting the preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest accuracy.

    To see which features had been used in the more and the less successful models, I am plotting the values of the 10 final features for classifying the Spanish, Greek and Italian food preference categories 1 – 5 in the original dataset.

    feats <- feature_imp_df %>%
      filter(response == "spain") %>%
      slice(1:10)
    
    as.data.frame(food_world_cup) %>%
      select_(.dots = c("spain", as.character(feats$feature))) %>%
      gather(x, y, -spain) %>%
      filter(spain > 0) %>%
      group_by(spain, x) %>%
      summarise(mean = mean(y)) %>%
      mutate(x_2 = gsub("_trans", "", x)) %>%
      mutate(x_2 = gsub("_idx", "", x_2)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = x_2, y = spain, fill = mean)) +
        geom_tile(width = 0.9, height = 0.9) +
        scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
        my_theme() +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
        labs(x = "", y = "Spain")
    

    feats <- feature_imp_df %>%
      filter(response == "greece") %>%
      slice(1:10)
    
    as.data.frame(food_world_cup) %>%
      select_(.dots = c("greece", as.character(feats$feature))) %>%
      gather(x, y, -greece) %>%
      filter(greece > 0) %>%
      group_by(greece, x) %>%
      summarise(mean = mean(y)) %>%
      mutate(x_2 = gsub("_trans", "", x)) %>%
      mutate(x_2 = gsub("_idx", "", x_2)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = x_2, y = greece, fill = mean)) +
        geom_tile(width = 0.9, height = 0.9) +
        scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
        my_theme() +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
        labs(x = "", y = "Greece")
    

    feats <- feature_imp_df %>%
      filter(response == "italy") %>%
      slice(1:10)
    
    as.data.frame(food_world_cup) %>%
      select_(.dots = c("italy", as.character(feats$feature))) %>%
      gather(x, y, -italy) %>%
      filter(italy > 0) %>%
      group_by(italy, x) %>%
      summarise(mean = mean(y)) %>%
      mutate(x_2 = gsub("_trans", "", x)) %>%
      mutate(x_2 = gsub("_idx", "", x_2)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = x_2, y = italy, fill = mean)) +
        geom_tile(width = 0.9, height = 0.9) +
        scale_fill_gradient2(low = "white", high = "red",  name = "mean") +
        my_theme() +
        theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
        labs(x = "", y = "Italy")
    

    Conclusions

    This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries’ cuisines were largely unknown, while others’ were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy. But this was mainly because the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on the raw preference values showed the same effect, only now countries with a majority of 5s (e.g. Italy) appeared overly “accurate”. Transforming the data and removing the 0s led to a more evenly distributed accuracy between well known and less well known countries. But by removing them, the number of samples used for modeling varied strongly between countries, which introduced another type of bias.

    It was not possible to avoid bias entirely with this dataset, but I wanted to show it as an example nonetheless. Usually, machine learning examples showcase datasets where the models work very well, leaving the reader in awe of the power of machine learning. And while there certainly are powerful and impressive prediction models, real-life data is not always that simple. As this dataset shows, it’s not always straightforward to build a good classification model, even though you’d intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes…

    The model could potentially be improved by engineering features, by modeling the preferences as factors and/or by trying different algorithms, but I’ll leave that for another time.

    Next week, I’ll be looking into how to use the h2o package with Spark using rsparkling.


    I have covered other machine learning topics in previous posts on this blog.


    ## R version 3.3.2 (2016-10-31)
    ## Platform: x86_64-apple-darwin13.4.0 (64-bit)
    ## Running under: macOS Sierra 10.12.1
    ## 
    ## locale:
    ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    ## 
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base     
    ## 
    ## other attached packages:
    ## [1] tibble_1.2            fivethirtyeight_0.1.0 dplyr_0.5.0          
    ## [4] ggrepel_0.6.5         ggplot2_2.2.1         tidyr_0.6.1          
    ## [7] sparklyr_0.5.1       
    ## 
    ## loaded via a namespace (and not attached):
    ##  [1] Rcpp_0.12.9        knitr_1.15.1       magrittr_1.5      
    ##  [4] rappdirs_0.3.1     munsell_0.4.3      colorspace_1.3-2  
    ##  [7] R6_2.2.0           plyr_1.8.4         stringr_1.1.0     
    ## [10] httr_1.2.1         tools_3.3.2        parallel_3.3.2    
    ## [13] grid_3.3.2         gtable_0.2.0       config_0.2        
    ## [16] DBI_0.5-1          withr_1.0.2        htmltools_0.3.5   
    ## [19] lazyeval_0.2.0     yaml_2.1.14        assertthat_0.1    
    ## [22] rprojroot_1.2      digest_0.6.12      RColorBrewer_1.1-2
    ## [25] base64enc_0.1-3    evaluate_0.10      rmarkdown_1.3     
    ## [28] labeling_0.3       stringi_1.1.2      scales_0.4.1      
    ## [31] backports_1.0.5    jsonlite_1.2
    

    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

  • Bar bar plots but not Babar plots

    (This article was first published on Maëlle, and kindly contributed to R-bloggers)

    You might have heard of the “bar bar plots” movement, whose goal is to prevent people from using (let’s use ggplot2 language, shall we?) geom_bar when they could have used e.g. geom_boxplot or geom_histogram, because the bar plot hides the variability of the distributions you say you’re comparing, even if you add some sort of error bar. I whole-heartedly agree with this movement, but in this post I’d like to present some ok bar plots that represent counts of individuals in different categories. You might know them as geom_bar(blabla, stat = "identity") or geom_col. They’re still bar plots and they’re great, in particular when you make puns out of them, which is what I did with Miles McBain.
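
    For clarity, these two calls draw the same thing when the bar heights are pre-computed counts (made-up data, just to show the equivalence):

    library(ggplot2)
    
    counts <- data.frame(group = c("a", "b", "c"), n = c(3, 5, 2))
    
    # identical bar charts of pre-computed counts
    ggplot(counts, aes(group, n)) + geom_col()
    ggplot(counts, aes(group, n)) + geom_bar(stat = "identity")
    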

    The reason I mentioned bar plots to Miles was my wanting to recycle some old code of mine (again) for a blog post. It was code from my “let’s test all the kitsch R plots have to offer” period, when I had started using Thomas Leeper’s colourlovers package for adding patterns on bars.

    colourlovers is a great package for getting trendy colour palettes, although I’ll admit I’m a viridis person. But colourlovers also has patterns, about which the package documentation states: “Patterns are images created on COLOURlovers using a specified palette of colors. They serve as examples of how to use the palette in practice.” At the time I read those sentences I got really excited, because I was convinced one could do more with patterns, like filling bar plots with them. In this post, I’ll prove you can, with my non-generalizable workflow.

    library("dplyr")
    library("tidyr")
    library("httr")
    library("grid")
    library("ggplot2")
    library("colourlovers")
    

    General comments on using patterns

    A pattern is a square image that I convert to a raster. This raster is a matrix with 200 rows and 200 columns. In this matrix $x_{i,j}$ represents the hex code of the colour of a pixel. Below I show how to get a pattern figure and how to plot it on its own.

    # get pattern
    catsBlue <- clpattern(id = "3363032")
    # get the URL of the picture corresponding to the pattern
    imageURL <- catsBlue$imageUrl
    # get the picture
    picture <- content(GET(as.character(imageURL)))
    # convert it to a raster
    img <- as.raster(picture)
    # plot it!
    plot(img)
    

    plot of chunk unnamed-chunk-2

    This pattern can also be seen as a puzzle piece: one can copy the pattern several times and put each copy side by side and/or pile them up and get something that looks homogeneous with no problem at the border of each piece.

    In ggplot2, I’ll add the patterns with geom_point. The main issues will be re-sizing the puzzle pieces (how big should a single puzzle piece be for the graph to look nice?), pasting the pieces together (how many puzzle pieces do I need?), and cropping the puzzle (the puzzle should not be bigger than the bar it covers, for instance).
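
    To illustrate the “pasting pieces together” step in isolation, here is a toy sketch with a made-up 2x2 “pattern” of hex colours (the real patterns are 200x200 rasters):

    # a hypothetical 2x2 tile of hex colours
    tile <- matrix(c("#FF0000", "#00FF00", "#0000FF", "#FFFFFF"), nrow = 2)
    
    # stack 3 copies of the tile vertically, column by column
    stacked <- apply(tile, MARGIN = 2, function(col) rep(col, 3))
    dim(stacked)  # 6 rows, 2 columns
    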

    Writing code for decorating a bar plot

    Here is the original bar plot I aim to make fancier.

    # Normal plot
    c <- ggplot(mtcars, aes(factor(cyl)))
    c <- c + geom_bar(width = .5,
                      aes(fill = factor(cyl)))+
      theme(text = element_text(size=20))
    c
    

    plot of chunk unnamed-chunk-3

    Now, since my puzzle pieces are squares, I want the plot to have an x/y ratio such that the distance between two major grid lines on the x axis is the same as the distance between two major grid lines on the y axis.

    plotInfo <- ggplot_build(c)
    extentX <- diff(plotInfo$layout$panel_ranges[[1]]$x.major_source)[1]
    extentY <- diff(plotInfo$layout$panel_ranges[[1]]$y.major_source)[1]
    c <- c + coord_fixed(ratio = extentX/extentY)
    c
    

    plot of chunk unnamed-chunk-4

    I’ll admit once again that, since writing and publishing the code, I’ve slept many times and have thus forgotten how I came up with the solution. Today I had to change the code for getting extentY and extentX a bit, because ggplot_build() returns a slightly different output since the newest ggplot2 version was released.

    I shall now get three patterns from colourlovers. I went on the website of colourlovers itself to find three patterns using the same template with different colours so I know their IDs.

    # get the patterns
    get_patterns <- function(id_blue, id_red, id_green){
      blue <- clpattern(id = id_blue)
      red <- clpattern(id = id_red)
      green <- clpattern(id = id_green)
      list(red, green, blue)
    }
    
    patterns <- get_patterns(id_blue = 4348319,
                             id_red = 4376078,
                             id_green = 4360659)
    

    I shall first get one colour from each pattern and re-do the figure with these colours. I do this to have a legend later. I don’t want to try and reproduce parts of the patterns to get them in the legend.

    library("purrr")
    get_colors <- function(pattern){
      pattern$colors[3]$hex
    }
    addhash <- function(color){
      paste0("#", color)
    }
    
    colors <- map(patterns, get_colors) %>%
      map(addhash) %>%
      unlist()
    
    c2 <- c + scale_fill_manual(values = colors)
    c2
    

    plot of chunk unnamed-chunk-6

    Now let’s add the patterns to the graph! It’s quite inelegant since I use a loop for adding things successively to the graph.

    add_patterns <- function(plot, patterns, colors){
      for (i in 1:length(levels(factor(mtcars$cyl)))){
        imageURL <- patterns[[i]]$imageUrl
        # get pattern
        picture <- content(GET(as.character(imageURL)))
        # picture is a 4-dimensional array
        img <- as.raster(picture)
        
        # we have to repeat the data.frame/pattern
        # I repeat it so that one extentY = one square
        xmax <- 1
        ymax <- sum(mtcars$cyl == levels(factor(mtcars$cyl))[i])
        
        size <- ceiling(ymax * 2 / extentY)
        
        img2 <- apply(img, MARGIN = 2, function(x) rep(x, size))
        
        # from matrix to data.frame
        img2 <- tbl_df(as.data.frame(as.matrix(img2)))
        
        # I change the column names so that they correspond to x coordinates
        names(img2) <- seq(i - 0.25, to = i + 0.25, length = ncol(img2))
        
        # I transform img2 so that I have one line = 1 x, 1 y and 1 colour
        dataPlot <- img2 %>% 
          mutate(y = seq(from = size / 2 * extentY, to = 0, length = nrow(img2))) %>% 
          gather(x, value, 1:ncol(img2)) %>%
          # filter out the part of the pattern that doesn't fit in the original bar
          filter(y <= ymax) %>%
          mutate(x = as.numeric(x))
        
        plot <- plot + geom_point(data = dataPlot, aes(x, y), col = dataPlot$value)
      }
      plot
    }
    
    add_patterns(plot = c2, patterns = patterns, colors = colors) +
      ggtitle("Babar bar plot")
    

    plot of chunk unnamed-chunk-7

    We’ve just made a Babar bar plot! Thanks Miles for this pun. Now let’s apply the same principles again, and by principles I mean applying the colourlovers code I’ve just presented and using puns developed with Miles. A recipe for success!

    Other punny plots

    Barber plot

    patterns <- get_patterns(id_blue = 3490878,
                             id_red = 3277163,
                             id_green = 4038034)
    colors <- map(patterns, get_colors) %>%
      map(addhash) %>%
      unlist()
    c2 <- c + scale_fill_manual(values = colors)
    
    add_patterns(plot = c2, patterns = patterns, colors = colors) +
      ggtitle("Barber bar plot")
    

    plot of chunk unnamed-chunk-8

    This is one plot decoration you could use to illustrate something about Occam’s razor.

    Bark plot

    patterns <- get_patterns(id_blue = 1816764,
                             id_red = 1825775,
                             id_green = 1815897)
    
    colors <- map(patterns, get_colors) %>%
      map(addhash) %>%
      unlist()
    c2 <- c + scale_fill_manual(values = colors)
    
    add_patterns(plot = c2, patterns = patterns, colors = colors) +
      ggtitle("Bark bar plot")
    

    plot of chunk unnamed-chunk-9

    I’m less convinced by this one, maybe it’d have looked better with barking dogs?

    BART plot

    patterns <- get_patterns(id_blue = 1902711,
                             id_red = 1916712,
                             id_green = 1930435)
    colors <- map(patterns, get_colors) %>%
      map(addhash) %>%
      unlist()
    c2 <- c + scale_fill_manual(values = colors)
    
    add_patterns(plot = c2, patterns = patterns, colors = colors) +
      ggtitle("BART bar plot")
    

    plot of chunk unnamed-chunk-10

    This one is the loudest bar plot you’ll ever make since it was inspired by San Francisco BART. I might have come up with this idea after not finding a Bart Simpson pattern.

    I told Miles I’d stop once I’d only think of a barf plot which I don’t want to draw ever. As I had said the first time I made a bar plot with patterns, I’m not sure how useful it is, but you never know when you’re gonna need a fancy plot to get someone’s attention, right?

    To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

