RSS Feed from R-bloggers.com

- Is my time series additive or multiplicative?
Time series data is an important area of analysis, especially if you do a lot of web analytics. To analyse time series effectively, it helps to understand how the general seasonality in activity and the underlying trend interact.

The interactions between trend and seasonality are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.

## Additive or multiplicative?

It’s important to understand the difference between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:

- **trend** – how things are overall changing

- **seasonality** – how things change within a given period, e.g. a year, month, week or day

- **error** / **residual** / **irregular** – activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

In a multiplicative time series, the components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated. This is common when you’re looking at web traffic.

In an additive time series, the components are added together to make the time series. If you have an increasing trend, you still see roughly the same size peaks and troughs. This is often seen in indexed time series where the absolute value is growing but changes stay relative.

You can have a time series that is somewhere in between the two, a statistician’s “it depends”, but I’m interested in attaining a quick classification so I won’t be handling this complication here.
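To make the distinction concrete, here is a small simulated illustration (my own sketch, not from the original post): the same trend and quarterly pattern combined additively and multiplicatively.

    # simulate one series of each type
    set.seed(42)
    trend    <- seq(100, 200, length.out = 48)        # steadily increasing level
    seasonal <- rep(c(10, -5, -10, 5), times = 12)    # fixed quarterly pattern
    noise    <- rnorm(48, sd = 2)

    additive       <- trend + seasonal + noise                           # peaks stay the same size
    multiplicative <- trend * (1 + seasonal / 100) * (1 + noise / 100)   # peaks grow with the trend

    plot(ts(additive, frequency = 4))
    plot(ts(multiplicative, frequency = 4))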

## There’s a package for that

When I first started doing time series analysis, the only way to visualise how a time series splits into different components was to use base R. About the time I was feeling the pain, someone released a ggplot2 time series extension! I’ll be using ggseas where I can.

We’ll use the `nzbop` data set from ggseas to, first of all, examine a single time series and then process all the time series in the dataset to determine if they’re multiplicative or additive.

`sample_ts <- nzdata[Account == "Current account" & Category == "Services; Exports total", .(TimePeriod, Value)]`

    TimePeriod Value
    1971-06-30    55
    1971-09-30    56
    1971-12-31    60
    1972-03-31    65
    1972-06-30    65
    1972-09-30    63

I’ll be using other packages (like data.table) and will only show relevant code snippets as I go along. You can get the whole script in a GIST.

## Decomposing the data

To be able to determine if the time series is additive or multiplicative, the time series has to be split into its components.

Existing functions to decompose a time series include `decompose()`, which lets you specify whether the series is multiplicative or additive, and `stl()`, which only handles additive series unless you transform the data. I could use `stl()` with a multiplicative series if I transform the time series by taking the log. For either function, though, I need to know whether the series is additive or multiplicative first.
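As a quick illustration of those two routes (using the built-in AirPassengers series rather than the nzbop data, so treat it as a generic sketch rather than part of this analysis):

    # AirPassengers is a classic multiplicative series
    plot(decompose(AirPassengers, type = "multiplicative"))

    # stl() assumes an additive structure, so log-transform a multiplicative series first
    plot(stl(log(AirPassengers), s.window = "periodic"))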

### The trend

The first component to extract is the trend. There are a number of ways you can do this, and some of the simplest ways involve calculating a moving average or median.

`sample_ts[,trend := zoo::rollmean(Value, 8, fill=NA, align = "right")]`

    TimePeriod Value    trend
    2014-03-31  5212 4108.625
    2014-06-30  3774 4121.750
    2014-09-30  3698 4145.500
    2014-12-31  4752 4236.375
    2015-03-31  6154 4376.500
    2015-06-30  4543 4478.875

A moving median is less sensitive to outliers than a moving mean. It doesn’t work well though if you have a time series that includes periods of inactivity. Lots of 0s can result in very weird trends.
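If you wanted the moving-median variant instead, a minimal sketch (my addition, assuming the same `sample_ts` as above) could look like this; note that `zoo::rollmedian()` requires an odd window width, hence 7 instead of 8:

    sample_ts[, trend_med := zoo::rollmedian(Value, 7, fill = NA, align = "right")]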

### The seasonality

The seasonality is the cyclical pattern that occurs in our time series once the trend has been removed.

Of course, the way we de-trend the data needs to be additive or multiplicative depending on the type of time series. Since we don’t know the type at this point, we’ll do both.

`sample_ts[,`:=`( detrended_a = Value - trend, detrended_m = Value / trend )]`

    TimePeriod Value    trend detrended_a detrended_m
    2014-03-31  5212 4108.625    1103.375   1.2685509
    2014-06-30  3774 4121.750    -347.750   0.9156305
    2014-09-30  3698 4145.500    -447.500   0.8920516
    2014-12-31  4752 4236.375     515.625   1.1217137
    2015-03-31  6154 4376.500    1777.500   1.4061465
    2015-06-30  4543 4478.875      64.125   1.0143172

To work out the seasonality we need to work out what the typical de-trended values are over a cycle. Here I will calculate the mean value for the observations in Q1, Q2, Q3, and Q4.

`sample_ts[,`:=`(seasonal_a = mean(detrended_a, na.rm = TRUE), seasonal_m = mean(detrended_m, na.rm = TRUE)), by=.(quarter(TimePeriod)) ]`

    TimePeriod Value    trend detrended_a detrended_m seasonal_a seasonal_m
    2014-03-31  5212 4108.625    1103.375   1.2685509   574.1919  1.2924422
    2014-06-30  3774 4121.750    -347.750   0.9156305  -111.2878  1.0036648
    2014-09-30  3698 4145.500    -447.500   0.8920516  -219.8363  0.9488803
    2014-12-31  4752 4236.375     515.625   1.1217137   136.7827  1.1202999
    2015-03-31  6154 4376.500    1777.500   1.4061465   574.1919  1.2924422
    2015-06-30  4543 4478.875      64.125   1.0143172  -111.2878  1.0036648

My actual needs aren’t over long economic periods so I’m not using a better seasonality system for this blog post. There are some much better mechanisms than this.

### The remainder

Now that we have our two components, we can calculate the residual in both situations and see which has the better fit.

`sample_ts[,`:=`( residual_a = detrended_a - seasonal_a, residual_m = detrended_m / seasonal_m )]`

    TimePeriod Value    trend detrended_a detrended_m seasonal_a seasonal_m residual_a residual_m
    2014-03-31  5212 4108.625    1103.375   1.2685509   574.1919  1.2924422   529.1831  0.9815146
    2014-06-30  3774 4121.750    -347.750   0.9156305  -111.2878  1.0036648  -236.4622  0.9122871
    2014-09-30  3698 4145.500    -447.500   0.8920516  -219.8363  0.9488803  -227.6637  0.9401098
    2014-12-31  4752 4236.375     515.625   1.1217137   136.7827  1.1202999   378.8423  1.0012620
    2015-03-31  6154 4376.500    1777.500   1.4061465   574.1919  1.2924422  1203.3081  1.0879763
    2015-06-30  4543 4478.875      64.125   1.0143172  -111.2878  1.0036648   175.4128  1.0106135

## Visualising decomposition

I’ve done the number crunching, but you could also perform a visual decomposition. ggseas gives us the `ggsdc()` function, which we can use.

    ggsdc(sample_ts, aes(x = TimePeriod, y = Value),
          method = "decompose", frequency = 4, s.window = 8, type = "additive") +
      geom_line() +
      ggtitle("Additive") +
      theme_minimal()

The different decompositions produce differently distributed residuals. We need to assess these to identify which decomposition is a better fit.

## Assessing fit

After decomposing our data, we need to compare the residuals. As we’re just trying to classify the time series, we don’t need to do anything particularly sophisticated – a big part of this exercise is to produce a quick function that could be used to perform an initial classification in a batch processing environment so simpler is better.

We’re going to check how much correlation between data points is still encoded within the residuals. This is measured by the autocorrelation function (ACF), which we can calculate with `acf()`. As some of the correlations could be negative, we will select the type with the smallest sum of squares of the correlation values.

    ssacf <- function(x) sum(acf(x, na.action = na.omit)$acf^2)
    compare_ssacf <- function(add, mult) ifelse(ssacf(add) < ssacf(mult), "Additive", "Multiplicative")
    sample_ts[, .(ts_type = compare_ssacf(residual_a, residual_m))]

    ts_type
    Multiplicative

## Putting it all together

This isn’t a fully generalized function (it doesn’t have configurable lags, medians, seasonality etc.), but if I had to run this exercise over multiple time series from this dataset, my overall function and usage would look like:

    ssacf <- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf^2)
    compare_ssacf <- function(add, mult) ifelse(ssacf(add) < ssacf(mult), "Additive", "Multiplicative")

    additive_or_multiplicative <- function(dt){
      m <- copy(dt)
      m[, trend := zoo::rollmean(Value, 8, fill = "extend", align = "right")]
      m[, `:=`(detrended_a = Value - trend, detrended_m = Value / trend)]
      m[Value == 0, detrended_m := 0]
      m[, `:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
               seasonal_m = mean(detrended_m, na.rm = TRUE)),
        by = .(quarter(TimePeriod))]
      m[is.infinite(seasonal_m), seasonal_m := 1]
      m[, `:=`(residual_a = detrended_a - seasonal_a,
               residual_m = detrended_m / seasonal_m)]
      compare_ssacf(m$residual_a, m$residual_m)
    }

    # Applying it to all time series in the table
    sample_ts <- nzdata[, .(Type = additive_or_multiplicative(.SD)), .(Account, Category)]

| Account | Category | Type |
|---|---|---|
| Current account | Inflow total | Additive |
| Current account | Balance | Multiplicative |
| Current account | Goods; Exports (fob) total | Additive |
| Current account | Services; Exports total | Multiplicative |
| Current account | Primary income; Inflow total | Multiplicative |
| Current account | Secondary income; Inflow total | Multiplicative |
| Current account | Goods balance | Multiplicative |
| Current account | Services balance | Multiplicative |
| Current account | Primary income balance | Additive |
| Current account | Outflow total | Additive |
| Current account | Goods; Imports (fob) total | Additive |
| Current account | Services; Imports total | Additive |
| Current account | Primary income; Outflow total | Additive |
| Current account | Secondary income; Outflow total | Multiplicative |
| Capital account | Inflow total | Multiplicative |
| Capital account | Outflow total | Multiplicative |
| NA | Net errors and omissions | Multiplicative |
| Financial account | Foreign inv. in NZ total | Multiplicative |
| Financial account | Balance | Additive |
| Financial account | Foreign inv. in NZ; Direct inv. liabilities | Additive |
| Financial account | Foreign inv. in NZ; Portfolio inv. liabilities | Multiplicative |
| Financial account | Foreign inv. in NZ; Other inv. liabilities | Multiplicative |
| Financial account | NZ inv. abroad total | Multiplicative |
| Financial account | NZ inv. abroad; Direct inv. assets | Multiplicative |
| Financial account | NZ inv. abroad; Portfolio inv. assets | Additive |
| Financial account | NZ inv. abroad; Financial derivative assets | Multiplicative |
| Financial account | NZ inv. abroad; Other inv. assets | Additive |
| Financial account | NZ inv. abroad; Reserve assets | Multiplicative |

## Conclusion

This is a very simple way of quickly assessing whether multiple time series are additive or multiplicative. It gives an effective starting point for conditionally processing batches of time series. Get the GIST of the code used throughout this blog to work through it yourself. If you’ve got an easier way of classifying time series, let me know in the comments!

The post Is my time series additive or multiplicative? appeared first on Locke Data.

- future: Reproducible RNGs, future_lapply() and more

(This article was first published on **jottR**, and kindly contributed to R-bloggers)

future 1.3.0 is available on CRAN. With futures, it is easy to **write R code once**, which the user can choose to evaluate in parallel using whatever resources s/he has available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via future.BatchJobs and soon also future.batchtools), or in the cloud (e.g. via googleComputeEngineR).

Futures make it easy to harness any resources at hand. Thanks to great feedback from the community, this new version provides:

**A convenient lapply() function**

- Added `future_lapply()`, which works like `lapply()` and gives identical results, with the difference that futures are used internally. Depending on the user’s choice of `plan()`, these calculations may be processed sequentially, in parallel, or distributed on multiple machines (a short sketch follows this list).

- Perfect reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and choice of load balancing. Argument `future.seed = TRUE` (default) will use a random initial seed, which may also be specified as `future.seed = <integer>`. L’Ecuyer-CMRG RNG streams are used internally.

- Load balancing can be controlled by argument `future.scheduling`, which is a scalar adjusting how many futures each worker should process.
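A minimal sketch of how these options fit together (assuming future 1.3.0 as announced here; in later versions `future_lapply()` moved to the future.apply package):

    library(future)
    plan(multisession)  # the end user picks the backend; plan(sequential) works just as well

    y <- future_lapply(1:8, FUN = function(i) mean(rnorm(1e5)),
                       future.seed = 42L,      # reproducible RNG regardless of backend
                       future.scheduling = 2)  # two chunks of elements per worker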

**Clarifies distinction between developer and end user**

- The end user controls what future strategy to use by default, e.g. `plan(multiprocess)` or `plan(cluster, workers = c("machine1", "machine2", "remote.server.org"))`.

- The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. `f <- future(..., lazy = TRUE)`. Because of this, `plan(lazy)` is now deprecated.

**Is even more friendly to multi-tenant compute environments**

- `availableCores()` returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (`parallel::detectCores()`). If option `mc.cores` or environment variable `MC_CORES` is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. **This way developers don’t have to adjust their code to match a certain compute environment; the default works everywhere**.

- With the new version, it is possible to override the fallback value used when nothing else is specified, so that it is not the number of cores on the machine but the value of option `future.availableCores.fallback` or environment variable `R_FUTURE_AVAILABLE_FALLBACK`. For instance, by using `R_FUTURE_AVAILABLE_FALLBACK=1` system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores. This lowers the risk of overloading the CPU by mistake.

- Analogously to how `availableCores()` returns the number of cores, the new function `availableWorkers()` returns the host names available to the R process. The default is `rep("localhost", times = availableCores())`, but when using HPC schedulers it may be the host names of other compute nodes allocated to the job. A two-line example follows this list.
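For example (a small sketch; the output depends on your machine and scheduler):

    library(future)
    availableCores()    # respects mc.cores, MC_CORES and HPC scheduler allocations
    availableWorkers()  # defaults to rep("localhost", times = availableCores())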

For full details on updates, please see the NEWS file.

**The future package installs out-of-the-box on all operating systems**.

## A quick example

The bootstrap example of `help("clusterApply", package = "parallel")` adapted to make use of futures.

    library("future")
    library("boot")

    run <- function(...) {
      cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
      cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
      boot(cd4, corr, R = 5000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
    }

    # base::lapply()
    system.time(boot <- lapply(1:100, FUN = run))
    ### user  system elapsed
    ### 133.637  0.000 133.744

    # Sequentially on the local machine
    plan(sequential)
    system.time(boot0 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
    ### user  system elapsed
    ### 134.916  0.003 135.039

    # In parallel on the local machine (with 8 cores)
    plan(multisession)
    system.time(boot1 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
    ### user  system elapsed
    ### 0.960   0.041  29.527

    stopifnot(all.equal(boot1, boot0))

## What’s next?

The future.BatchJobs package, which builds on top of BatchJobs, provides future strategies for various HPC schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using `plan(batchjobs_torque)` instead of `plan(multiprocess)`, your futures will be resolved distributed on a compute cluster instead of in parallel on your local machine. That’s it! However, since last year, the BatchJobs package has been decommissioned and the authors recommend everyone to use their new batchtools package instead. Just like BatchJobs, it is a very well written package, but at the same time it is more robust against cluster problems and it also supports more types of HPC schedulers. Because of this, I’ve been working on future.batchtools, which I hope to be able to release soon.

Finally, I’m really keen on looking into how futures can be used with Shaun Jackman’s lambdar, which is a proof-of-concept that allows you to execute R code on Amazon’s “serverless” AWS Lambda framework. My hope is that, in a not too far future (pun not intended*), we’ll be able to resolve our futures on AWS Lambda using `plan(aws_lambda)`.

Happy futuring!

(*) Alright, I admit, it was intended.

## Links

- future package:
- CRAN page: https://cran.r-project.org/package=future
- GitHub page: https://github.com/HenrikBengtsson/future

- future.BatchJobs package:
- future.batchtools package:
- CRAN page: N/A
- GitHub page: https://github.com/HenrikBengtsson/future.batchtools

- doFuture package (a foreach adaptor):
- CRAN page: https://cran.r-project.org/package=doFuture
- GitHub page: https://github.com/HenrikBengtsson/doFuture

## See also

- A Future for R: Slides from useR 2016, 2016-07-02
- Remote Processing Using Futures, 2016-10-21

- SatRday and visual inference of vine copulas

(This article was first published on
**Pareto's Playground**, and kindly contributed to R-bloggers)

# SatRday

From the 16th to the 18th of February, satRday was held in the City of Cape Town in South Africa. The programme kicked off with two days of workshops and then the conference on Saturday. The workshops were divided up into three large sections:

- R and Git (Jennifer Bryan)
- Shiny, flexdashboard and Shinyapps.io (Julia Silge)
- Building and validating logistic regression models (Steph Locke)

**R and Git**

Integrating version control through git and RStudio has never been this easy. If you are not already following the principle of `Pull`, `Commit` and `Push` in your workflow, I recommend you visit the website http://happygitwithr.com/, which will help you to correct your current workflow indiscretions!

If you are however an avid user of GitHub and regularly push your commits with correct messages, here is a gem of a website that made my day – starlogs

**Shiny**

The Shiny workshop was an amazing exhibition of what can be done in an afternoon with the power of Shiny and RStudio! We got to build an interactive dashboard investigating disease data from South Africa using the new `flexdashboard`.

**Training Models**

Once again I realised how powerful the `caret` package can be in training and testing a wide range of models in a modular way without too much hassle or fuss. I am really keen to start playing with the `h2o` package after a brief overview was shown of its ensemble capabilities.

## The Conference

The Saturday saw a multitude of speakers take the stage to discuss some of the most interesting topics and applications of R that I have seen in a very long time. The full programme can be viewed on the website. I also had the opportunity to talk about how we can combine `Shiny` and the `visNetwork` html widget as a way to conduct analysis of high-dimensional vine copulas.

R, as open source software with a large following, provides quick and easy access to complex models and methods that are used and applied widely within academia, but more importantly, in practice. This increase in complexity, both theoretical and mathematical, has resulted in an obligation for practitioners to break down walls of jargon and mathematical hyperbole into easily digestible actions and insight.

My attempt at addressing this philosophical idea was to create a small package called `VineVizR`. The package can be viewed and downloaded here. I used this package and its functions to build a Shiny app where one can explore the interlinkages between 23 of South Africa’s largest companies by market cap using vine copulas.

Vine copulas allow us to model complex linkages among assets, but currently the visual illustration of vine copulae from the VineCopula package offers a bland plotting output that is hard to interpret. Combining the visNetwork html widget with the VineCopula RVM output offers an interactive and visually appealing way to understand and interpret your output.

The app can be found on my ShinyApp page. I hope you enjoy this small app and the metadata that has been integrated into it. If you feel that you still want to play a bit more, go download the package and play with the `VizVineR` function, where you can create your own groups!! But be aware, I don’t see myself as a developer – so when you peek under the hood and see any improvements that can be made, let me know.

### View of the dashboard

- Building Shiny App exercises part 7

(This article was first published on
**R-exercises**, and kindly contributed to R-bloggers)

**Connect widgets & plots**

In the seventh part of our journey we are ready to connect more of the widgets we created before with our k-means plot, in order to totally control its output. Of course we will also reform the plot itself properly in order to make it a real k-means plot.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

First of all, let’s move the widgets we are going to use from the `sidebarPanel` into the `mainPanel`, and specifically under our `plot`.

**Exercise 1**

Remove the `textInput` from your ui.R file. Then place the `checkboxGroupInput` and the `selectInput` in the same row with the `sliderInput`. Name them “Variable X” and “Variable Y” respectively. HINT: Use `fluidRow` and `column`.

**Create a reactive expression**

Reactive expressions are expressions that can read reactive values and call other reactive expressions. Whenever a reactive value changes, any reactive expressions that depended on it are marked as “invalidated” and will automatically re-execute if necessary. If a reactive expression is marked as invalidated, any other reactive expressions that recently called it are also marked as invalidated. In this way, invalidations ripple through the expressions that depend on each other.

A reactive expression is created like this: `example <- reactive({ })`
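For readers new to reactivity, here is a tiny self-contained toy app (my own illustration, not the exercise app) showing a reactive expression re-executing when an input changes:

    library(shiny)

    ui <- fluidPage(
      numericInput("n", "Rows to show", value = 5, min = 1, max = nrow(iris)),
      tableOutput("tbl")
    )

    server <- function(input, output) {
      # filtered() is re-executed automatically whenever input$n changes
      filtered <- reactive({
        head(iris, input$n)
      })
      output$tbl <- renderTable(filtered())
    }

    shinyApp(ui, server)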

**Exercise 2**

Place a reactive expression in server.R, at any spot except inside `output$All`, and name it “Data”. HINT: Use `reactive`.

**Connect your dataset’s variables with your widgets**

Now let’s connect your `selectInput` with the variables of your dataset as in the example below.

    #ui.R
    library(shiny)

    shinyUI(fluidPage(
      titlePanel("Shiny App"),
      sidebarLayout(
        sidebarPanel(h2("Menu"),
          selectInput('ycol', 'Y Variable', names(iris)) ),
        mainPanel(h1("Main")
        )
      )
    ))

    #server.R
    shinyServer(function(input, output) {
      example <- reactive({
        iris[, c(input$ycol)]
      })
    })

**Exercise 3**

Put the variables of the `iris` dataset as inputs in your `selectInput` as “Variable Y”. HINT: Use `names`.

**Exercise 4**

Do the same for `checkboxGroupInput` and “Variable X”. HINT: Use `names`.

Select the fourth variable as default, like the example below.

    #ui.R
    library(shiny)

    shinyUI(fluidPage(
      titlePanel("Shiny App"),
      sidebarLayout(
        sidebarPanel(h2("Menu"),
          checkboxGroupInput("xcol", "Variable X", names(iris),
                             selected = names(iris)[[4]]),
          selectInput("ycol", "Y Variable", names(iris),
                      selected = names(iris)[[4]])
        ),
        mainPanel(h1("Main")
        )
      )
    ))

    #server.R
    shinyServer(function(input, output) {
      example <- reactive({
        iris[, c(input$xcol, input$ycol)]
      })
    })

**Exercise 5**

Make the second variable the default choice for both widgets. HINT: Use `selected`.

Now follow the example below to create a new function and place in it the automated function for the k-means calculation.

    #ui.R
    library(shiny)

    shinyUI(fluidPage(
      titlePanel("Shiny App"),
      sidebarLayout(
        sidebarPanel(h2("Menu"),
          checkboxGroupInput("xcol", "Variable X", names(iris),
                             selected = names(iris)[[4]]),
          selectInput("ycol", "Y Variable", names(iris),
                      selected = names(iris)[[4]])
        ),
        mainPanel(h1("Main")
        )
      )
    ))

    #server.R
    shinyServer(function(input, output) {
      example <- reactive({
        iris[, c(input$xcol, input$ycol)]
      })
      example2 <- reactive({
        kmeans(example())
      })
    })

**Exercise 6**

Create the reactive function `Clusters` and put in it the function `kmeans`, which will be applied on the function `Data`. HINT: Use `reactive`.

**Connect your plot with the widgets**

It is time to connect your plot with the widgets.

**Exercise 7**

Put `Data` inside `renderPlot` as the first argument, replacing the data that you have chosen to be plotted until now. Moreover, delete `xlab` and `ylab`.

**Improve your k-means visualization**

You can change the colours of your clusters automatically by copying and pasting this part of code as the first argument of `renderPlot`, before the `plot` function:

    palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
              "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

We will choose to have up to nine clusters, so we choose nine colours.

**Exercise 8**

Set `min` of your `sliderInput` to 1, `max` to 9 and `value` to 4, and use the `palette` function to give colours.

This is how you can give different colors to your clusters. To activate these colors, put this part of code into your `plot` function: `col = Clusters()$cluster,`

**Exercise 9**

Activate the `palette` function.

To make your clusters easy to spot, you can fully color them by adding this into the `plot` function: `pch = 20, cex = 3`

**Exercise 10**

Fully color the `points` of your `plot`.
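For orientation, here is one possible shape of the finished server.R once the exercises above are combined. This is only a sketch, not the official solution: the widget ids `xcol`, `ycol` and `clusters` are assumptions based on the examples in this post.

    shinyServer(function(input, output) {

      Data <- reactive({
        iris[, c(input$xcol, input$ycol)]   # variables picked by the two widgets
      })

      Clusters <- reactive({
        kmeans(Data(), input$clusters)      # cluster count taken from the sliderInput
      })

      output$All <- renderPlot({
        palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
                  "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))
        plot(Data(), col = Clusters()$cluster, pch = 20, cex = 3)
      })
    })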

- Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

(This article was first published on
**Easy Guides**, and kindly contributed to R-bloggers)

**factoextra** is an R package that makes it easy to *extract* and *visualize* the output of exploratory **multivariate data analyses**, including:

Principal Component Analysis (PCA), which is used to summarize the information contained in a continuous (i.e., quantitative) multivariate data set by reducing the dimensionality of the data without losing important information.

Correspondence Analysis (CA), which is an extension of principal component analysis suited to analysing a large contingency table formed by two *qualitative variables* (or categorical data).

Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

Multiple Factor Analysis (MFA) dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.

Factor Analysis of Mixed Data (FAMD), a particular case of the MFA, dedicated to analyze a data set containing both quantitative and qualitative variables.

There are a number of R packages implementing principal component methods. These packages include *FactoMineR*, *ade4*, *stats*, *ca*, *MASS* and *ExPosition*.

However, the result is presented differently depending on the package used. To help with the interpretation and visualization of multivariate analyses – such as cluster analysis and dimensionality reduction – we developed an easy-to-use R package named factoextra.

The R package **factoextra** has flexible and easy-to-use methods to extract quickly, in a human-readable standard data format, the analysis results from the different packages mentioned above.

It produces **ggplot2**-based **elegant data visualizations** with less typing.

It also contains many functions facilitating clustering analysis and visualization.

We’ll use i) the FactoMineR package (Sebastien Le, et al., 2008) to compute PCA, (M)CA, FAMD, MFA and HCPC; ii) and the factoextra package for extracting and visualizing the results.

FactoMineR is a great and my favorite package for computing principal component methods in R. It’s very easy to use and very well documented. The official website is available at: http://factominer.free.fr/. Thanks to François Husson for his impressive work.

The figure below shows the methods whose outputs can be visualized using the factoextra package. The official online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra.

## Why using factoextra?

The *factoextra* R package can handle the results of PCA, CA, MCA, MFA, FAMD and HMFA from several packages, for extracting and visualizing the most important information contained in your data.

*After PCA, CA, MCA, MFA, FAMD and HMFA, the most important row/column elements* can be highlighted using:

- their cos2 values corresponding to their quality of representation on the factor map
- their contributions to the definition of the principal dimensions.

If you want to do this, the factoextra package provides a convenient solution.

*PCA and (M)CA are used sometimes for prediction problems*: one can predict the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals using the information provided by the previously performed PCA or (M)CA. This can be done easily using the FactoMineR package.

If you want to make predictions with PCA/MCA and to visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you. It’s quick: write less and do more…
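As a sketch of that workflow (the supplementary row/column indices below follow the usual decathlon2 example and are assumptions, not output taken from this article):

    library(FactoMineR)
    library(factoextra)
    data("decathlon2")

    # individuals 24:27 and variables 11:13 are treated as supplementary
    res.pca <- PCA(decathlon2, ind.sup = 24:27,
                   quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

    fviz_pca_ind(res.pca, repel = TRUE)  # supplementary individuals are drawn alongside the active ones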

*Several functions from different packages – FactoMineR, ade4, ExPosition, stats – are available in R for performing PCA, CA or MCA*. However, the components of the output vary from package to package.

No matter the package you decided to use, factoextra can give you a human understandable output.
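For example, a small sketch (my addition) of the same extractor applied to results from two different packages:

    library(factoextra)

    res_stats <- prcomp(USArrests, scale = TRUE)                               # stats
    res_fmr   <- FactoMineR::PCA(USArrests, scale.unit = TRUE, graph = FALSE)  # FactoMineR

    get_eigenvalue(res_stats)
    get_eigenvalue(res_fmr)    # same standardised table either way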

## Installing FactoMineR

The FactoMineR package can be installed and loaded as follows:

    # Install
    install.packages("FactoMineR")
    # Load
    library("FactoMineR")

## Installing and loading factoextra

- factoextra can be installed from CRAN as follows:

`install.packages("factoextra")`

- Or, install the latest version from GitHub:

`if(!require(devtools)) install.packages("devtools")`
`devtools::install_github("kassambara/factoextra")`

- Load factoextra as follows:

`library("factoextra")`

## Main functions in the factoextra package

See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.

### Visualizing dimension reduction analysis outputs

| Function | Description |
|---|---|
| *fviz_eig* (or *fviz_eigenvalue*) | Extract and visualize the eigenvalues/variances of dimensions. |
| *fviz_pca* | Graph of individuals/variables from the output of *Principal Component Analysis* (PCA). |
| *fviz_ca* | Graph of column/row variables from the output of *Correspondence Analysis* (CA). |
| *fviz_mca* | Graph of individuals/variables from the output of *Multiple Correspondence Analysis* (MCA). |
| *fviz_mfa* | Graph of individuals/variables from the output of *Multiple Factor Analysis* (MFA). |
| *fviz_famd* | Graph of individuals/variables from the output of *Factor Analysis of Mixed Data* (FAMD). |
| *fviz_hmfa* | Graph of individuals/variables from the output of *Hierarchical Multiple Factor Analysis* (HMFA). |
| *fviz_ellipses* | Draw confidence ellipses around the categories. |
| *fviz_cos2* | Visualize the quality of representation of the row/column variable from the results of PCA, CA, MCA functions. |
| *fviz_contrib* | Visualize the contributions of row/column elements from the results of PCA, CA, MCA functions. |

### Extracting data from dimension reduction analysis outputs

| Function | Description |
|---|---|
| *get_eigenvalue* | Extract and visualize the eigenvalues/variances of dimensions. |
| *get_pca* | Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from *Principal Component Analysis* (PCA) outputs. |
| *get_ca* | Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from *Correspondence Analysis* outputs. |
| *get_mca* | Extract results from *Multiple Correspondence Analysis* outputs. |
| *get_mfa* | Extract results from *Multiple Factor Analysis* outputs. |
| *get_famd* | Extract results from *Factor Analysis of Mixed Data* outputs. |
| *get_hmfa* | Extract results from *Hierarchical Multiple Factor Analysis* outputs. |
| *facto_summarize* | Subset and summarize the output of factor analyses. |

### Clustering analysis and visualization

| Function | Description |
|---|---|
| *dist* (fviz_dist, get_dist) | Enhanced Distance Matrix Computation and Visualization. |
| *get_clust_tendency* | Assessing Clustering Tendency. |
| *fviz_nbclust* (fviz_gap_stat) | Determining and Visualizing the Optimal Number of Clusters. |
| *fviz_dend* | Enhanced Visualization of Dendrogram |
| *fviz_cluster* | Visualize Clustering Results |
| *fviz_mclust* | Visualize Model-based Clustering Results |
| *fviz_silhouette* | Visualize Silhouette Information from Clustering. |
| *hcut* | Computes Hierarchical Clustering and Cut the Tree |
| *hkmeans* (hkmeans_tree, print.hkmeans) | Hierarchical k-means clustering. |
| *eclust* | Visual enhancement of clustering analysis |

## Dimension reduction and factoextra

As depicted in the figure below, the type of analysis to be performed depends on the data set formats and structures.

In this section we start by illustrating classical methods – such as PCA, CA and MCA – for analyzing a data set containing continuous variables, contingency table and qualitative variables, respectively.

We continue by discussing advanced methods – such as FAMD, MFA and HMFA – for analyzing a data set containing a mix of variables (qualitatives & quantitatives) organized or not into groups.

Finally, we show how to perform hierarchical clustering on principal components (HCPC), which is useful for clustering a data set containing only qualitative variables or a mixed data set of qualitative and quantitative variables.
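HCPC itself is not demonstrated in the examples below, so here is a hedged sketch of what it might look like (assuming the usual FactoMineR/factoextra interfaces):

    library(FactoMineR)
    library(factoextra)

    res.pca  <- PCA(USArrests, scale.unit = TRUE, ncp = 4, graph = FALSE)
    res.hcpc <- HCPC(res.pca, nb.clust = 4, graph = FALSE)   # cluster on the principal components

    fviz_cluster(res.hcpc, repel = TRUE)   # individuals coloured by cluster
    fviz_dend(res.hcpc)                    # the corresponding dendrogram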

### Principal component analysis

- Data: *decathlon2* [in the *factoextra* package]
- PCA function: *FactoMineR::PCA*()
- Visualization: *factoextra::fviz_pca*()

Read more about computing and interpreting principal component analysis at: **Principal Component Analysis** (PCA).

**Loading data**

    library("factoextra")
    data("decathlon2")
    df <- decathlon2[1:23, 1:10]

**Principal component analysis**

    library("FactoMineR")
    res.pca <- PCA(df, graph = FALSE)

**Extract and visualize eigenvalues/variances**:

    # Extract eigenvalues/variances
    get_eig(res.pca)

    ## eigenvalue variance.percent cumulative.variance.percent
    ## Dim.1 4.1242133 41.242133 41.24213
    ## Dim.2 1.8385309 18.385309 59.62744
    ## Dim.3 1.2391403 12.391403 72.01885
    ## Dim.4 0.8194402 8.194402 80.21325
    ## Dim.5 0.7015528 7.015528 87.22878
    ## Dim.6 0.4228828 4.228828 91.45760
    ## Dim.7 0.3025817 3.025817 94.48342
    ## Dim.8 0.2744700 2.744700 97.22812
    ## Dim.9 0.1552169 1.552169 98.78029
    ## Dim.10 0.1219710 1.219710 100.00000

    # Visualize eigenvalues/variances
    fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

**Extract and visualize results for variables**:

    # Extract the results for variables
    var <- get_pca_var(res.pca)
    var

    ## Principal Component Analysis Results for variables
    ## ===================================================
    ## Name Description
    ## 1 "$coord" "Coordinates for the variables"
    ## 2 "$cor" "Correlations between variables and dimensions"
    ## 3 "$cos2" "Cos2 for the variables"
    ## 4 "$contrib" "contributions of the variables"

    # Coordinates of variables
    head(var$coord)

    ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
    ## X100m -0.8506257 -0.17939806 0.3015564 0.03357320 -0.1944440
    ## Long.jump 0.7941806 0.28085695 -0.1905465 -0.11538956 0.2331567
    ## Shot.put 0.7339127 0.08540412 0.5175978 0.12846837 -0.2488129
    ## High.jump 0.6100840 -0.46521415 0.3300852 0.14455012 0.4027002
    ## X400m -0.7016034 0.29017826 0.2835329 0.43082552 0.1039085
    ## X110m.hurdle -0.7641252 -0.02474081 0.4488873 -0.01689589 0.2242200

    # Contribution of variables
    head(var$contrib)

    ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
    ## X100m 17.544293 1.7505098 7.338659 0.13755240 5.389252
    ## Long.jump 15.293168 4.2904162 2.930094 1.62485936 7.748815
    ## Shot.put 13.060137 0.3967224 21.620432 2.01407269 8.824401
    ## High.jump 9.024811 11.7715838 8.792888 2.54987951 23.115504
    ## X400m 11.935544 4.5799296 6.487636 22.65090599 1.539012
    ## X110m.hurdle 14.157544 0.0332933 16.261261 0.03483735 7.166193

    # Graph of variables: default plot
    fviz_pca_var(res.pca, col.var = "black")

It’s possible to control variable colors using their contributions (“contrib”) to the principal axes:

    # Control variable colors using their contributions
    fviz_pca_var(res.pca, col.var = "contrib",
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 repel = TRUE  # Avoid text overlapping
                 )

**Variable contributions to the principal axes**:

    # Contributions of variables to PC1
    fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
    # Contributions of variables to PC2
    fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

**Extract and visualize results for individuals**:

    # Extract the results for individuals
    ind <- get_pca_ind(res.pca)
    ind

    ## Principal Component Analysis Results for individuals
    ## ===================================================
    ## Name Description
    ## 1 "$coord" "Coordinates for the individuals"
    ## 2 "$cos2" "Cos2 for the individuals"
    ## 3 "$contrib" "contributions of the individuals"

    # Coordinates of individuals
    head(ind$coord)

    ## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
    ## SEBRLE 0.1955047 1.5890567 0.6424912 0.08389652 1.16829387
    ## CLAY 0.8078795 2.4748137 -1.3873827 1.29838232 -0.82498206
    ## BERNARD -1.3591340 1.6480950 0.2005584 -1.96409420 0.08419345
    ## YURKOV -0.8889532 -0.4426067 2.5295843 0.71290837 0.40782264
    ## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
    ## McMULLEN 0.1212195 -1.0139102 -0.8625170 1.34164291 1.62151286

    # Graph of individuals
    # 1. Use repel = TRUE to avoid overplotting
    # 2. Control automatically the color of individuals using the cos2
    #    (cos2 = the quality of the individuals on the factor map); use points only
    # 3. Use gradient color
    fviz_pca_ind(res.pca, col.ind = "cos2",
                 gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                 repel = TRUE  # Avoid text overlapping (slow if many points)
                 )

    # Biplot of individuals and variables
    fviz_pca_biplot(res.pca, repel = TRUE)

**Color individuals by groups**:

    # Compute PCA on the iris data set
    # The variable Species (index = 5) is removed before PCA analysis
    iris.pca <- PCA(iris[, -5], graph = FALSE)

    # Visualize: use habillage to specify groups for coloring
    fviz_pca_ind(iris.pca,
                 label = "none",            # hide individual labels
                 habillage = iris$Species,  # color by groups
                 palette = c("#00AFBB", "#E7B800", "#FC4E07"),
                 addEllipses = TRUE         # Concentration ellipses
                 )

### Correspondence analysis

- Data: *housetasks* [in factoextra]
- CA function: *FactoMineR::CA*()
- Visualize with: *factoextra::fviz_ca*()

Read more about computing and interpreting correspondence analysis at: **Correspondence Analysis** (CA).

**Compute CA**:

    # Loading data
    data("housetasks")

    # Computing CA
    library("FactoMineR")
    res.ca <- CA(housetasks, graph = FALSE)

**Extract results for row/column variables**:

    # Result for row variables
    get_ca_row(res.ca)
    # Result for column variables
    get_ca_col(res.ca)

**Biplot of rows and columns**

`fviz_ca_biplot(res.ca, repel = TRUE)`

To visualize only row points or column points, type this:

    # Graph of row points
    fviz_ca_row(res.ca, repel = TRUE)
    # Graph of column points
    fviz_ca_col(res.ca)
    # Visualize row contributions on axes 1
    fviz_contrib(res.ca, choice = "row", axes = 1)
    # Visualize column contributions on axes 1
    fviz_contrib(res.ca, choice = "col", axes = 1)

### Multiple correspondence analysis

- Data: **poison** [in factoextra]
- MCA function: **FactoMineR::MCA**()
- Visualization: **factoextra::fviz_mca**()

Read more about computing and interpreting multiple correspondence analysis at: **Multiple Correspondence Analysis** (MCA).

**Computing MCA**:

    library(FactoMineR)
    data(poison)
    res.mca <- MCA(poison, quanti.sup = 1:2,
                   quali.sup = 3:4, graph = FALSE)

**Extract results for variables and individuals**:

    # Extract the results for variable categories
    get_mca_var(res.mca)
    # Extract the results for individuals
    get_mca_ind(res.mca)

**Contribution of variables and individuals to the principal axes**:

    # Visualize variable category contributions on axes 1
    fviz_contrib(res.mca, choice = "var", axes = 1)
    # Visualize individual contributions on axes 1 (select the top 20)
    fviz_contrib(res.mca, choice = "ind", axes = 1, top = 20)

**Graph of individuals**

    # Color by groups, add concentration ellipses,
    # and use repel = TRUE to avoid overplotting
    grp <- as.factor(poison[, "Vomiting"])
    fviz_mca_ind(res.mca, habillage = grp,
                 addEllipses = TRUE, repel = TRUE)

**Graph of variable categories**:

`fviz_mca_var(res.mca, repel = TRUE)`

**Biplot of individuals and variables**:

`fviz_mca_biplot(res.mca, repel = TRUE)`

### Advanced methods

The factoextra R package also has functions that support the visualization of advanced methods, such as:

- Factor Analysis of Mixed Data (FAMD): FAMD Examples
- Multiple Factor Analysis (MFA): MFA Examples
- Hierarchical Multiple Factor Analysis (HMFA): HMFA Examples
- Hierarchical Clustering on Principal Components (HCPC)

## Cluster analysis and factoextra

To learn more about cluster analysis, you can refer to the book available at: Practical Guide to Cluster Analysis in R

The main parts of the book include:

*distance measures*, *partitioning clustering*, *hierarchical clustering*, *cluster validation methods*, as well as *advanced clustering methods* such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.

### Partitioning clustering

    # 1. Loading and preparing data
    data("USArrests")
    df <- scale(USArrests)

    # 2. Compute k-means
    set.seed(123)
    km.res <- kmeans(scale(USArrests), 4, nstart = 25)

    # 3. Visualize
    library("factoextra")
    fviz_cluster(km.res, data = df,
                 palette = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"),
                 ggtheme = theme_minimal(),
                 main = "Partitioning Clustering Plot"
                 )

### Hierarchical clustering

`library("factoextra") # Compute hierarchical clustering and cut into 4 clusters res <- hcut(USArrests, k = 4, stand = TRUE) # Visualize fviz_dend(res, rect = TRUE, cex = 0.5, k_colors = c("#00AFBB","#2E9FDF", "#E7B800", "#FC4E07"))`

### Determine the optimal number of clusters

    # Optimal number of clusters for k-means
    library("factoextra")
    my_data <- scale(USArrests)
    fviz_nbclust(my_data, kmeans, method = "gap_stat")

## Acknowledgment

I would like to thank Fabian Mundt for his active contributions to factoextra.

We sincerely thank all developers for their efforts behind the packages that **factoextra** depends on, namely ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.

## References

- H. Wickham (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
- Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.5.
- Sebastien Le, Julie Josse, Francois Husson (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1), 1-18. 10.18637/jss.v025.i01
- Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. DOI: 10.1093/bioinformatics/btv428

## Infos

This analysis has been performed using **R software** (ver. 3.3.2) and **factoextra** (ver. 1.0.4.999).

- padr::pad does now do group padding

(This article was first published on
**That's so Random**, and kindly contributed to R-bloggers)

A few weeks ago `padr` was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you have missed this, see the introduction blog or `vignette("padr")` for a general introduction. In v0.2.0 the `pad` function is extended with a `group` argument, which makes your life a lot easier when you want to do padding within groups.

In the Examples of `padr` in v0.1.0 I showed that padding over multiple groups could be done by using `padr` in conjunction with `dplyr` and `tidyr`.

    library(dplyr)
    library(padr)

    # padding a data.frame on group level
    day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
    x_df_grp <- data.frame(grp  = rep(LETTERS[1:3], each = 4),
                           y    = runif(12, 10, 20) %>% round(0),
                           date = sample(day_var, 12, TRUE)) %>%
      arrange(grp, date)

    x_df_grp %>% group_by(grp) %>% do(pad(.)) %>% ungroup %>% tidyr::fill(grp)

I quite quickly realized this is an unsatisfactory solution for two reasons:

1) It is a hassle. It is the goal of `padr` to make datetime preparation as swift and pain-free as possible. Having to manually fill your grouping variable(s) after padding is not exactly in concordance with that goal.

2) It does not work when one or both of the `start_val` and `end_val` arguments are specified. The start and/or the end of the time series of a group are then no longer bounded by an original observation, and thus don't have a value for the grouping variable(s). Forward filling with `tidyr::fill` will incorrectly fill the grouping variable(s) as a result.

It was therefore necessary to expand `pad`, so the grouping variable(s) no longer contain missing values after padding. The `group` argument takes a character vector with the column name(s) of the variables to group by. Padding will be done on each of the groups formed by the unique combination of the grouping variables (this is of course just the distinct values of the variable if there is only one grouping variable). The result of the date padding will be exactly the same as before this addition (meaning the datetime variable does not change). However, the returned data frame will no longer have missing values for the grouping variables on the padded rows.

The new version of the section in the Examples of `padr` is:

    day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
    x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                           grp2 = letters[1:2],
                           y    = runif(12, 10, 20) %>% round(0),
                           date = sample(day_var, 12, TRUE)) %>%
      arrange(grp1, grp2, date)

    # pad by one grouping var
    x_df_grp %>% pad(group = 'grp1')

    # pad by two grouping vars
    x_df_grp %>% pad(group = c('grp1', 'grp2'))

### Bug fixes

Besides the additional argument, there were two bug fixes in this version:

- `pad` no longer breaks when the datetime variable contains only one value. It returns `x` and a warning if `start_val` and `end_val` are `NULL`, and will do proper padding when one or both are specified.

- In the `fill_` functions a meaningful error is now thrown when you forget to specify at least one column to apply the function to.

### v0.2.1

Right before posting this blog, Doug Friedman found out that in v0.2.0 the `by` argument no longer functioned. This bug was fixed in the patch release v0.2.1.

I hope you (still) enjoy working with `padr`. Let me know when you find a bug or have ideas for improvement.

- Predicting food preferences with sparklyr (machine learning)

(This article was first published on
**Shirin's playgRound**, and kindly contributed to R-bloggers)

This week I want to show how to run machine learning applications on a Spark cluster. I am using the **sparklyr** package, which provides a handy interface to access Apache Spark functionalities via R.

The question I want to address with machine learning is whether the preference for a country's cuisine can be predicted based on preferences of other countries' cuisines.

### Apache Spark

Apache Spark™ can be used to perform large-scale data analysis workflows by taking advantage of its parallel cluster-computing setup.

Because machine learning processes are iterative, running models on parallel clusters can vastly increase the speed of training.

Obviously, Spark's power comes to pass when dispatching it to external clusters, but for demonstration purposes, I am running the demo on a local Spark instance.

#### MLlib

Spark’s distributed machine learning library MLlib sits on top of the Spark core framework. It implements many popular machine learning algorithms, plus many helper functions for data preprocessing.

With **sparklyr** you can easily access MLlib. You can work with a couple of different machine learning algorithms and with functions for manipulating features and Spark dataframes. Additionally, you can also perform SQL queries. **sparklyr** also implements **dplyr**, making it especially convenient for handling data.

If you don't have Spark installed locally, run:

    library(sparklyr)
    spark_install(version = "2.0.0")

Now we can connect to a local Spark instance:

    library(sparklyr)
    sc <- spark_connect(master = "local", version = "2.0.0")

### Preparations

Before I start with the analysis, I am preparing my custom ggplot2 theme and loading the packages **tidyr** (for gathering data for plotting), **dplyr** (for data manipulation) and **ggrepel** (for non-overlapping text labels in plots).

    library(tidyr)
    library(ggplot2)
    library(ggrepel)
    library(dplyr)

    my_theme <- function(base_size = 12, base_family = "sans"){
      theme_minimal(base_size = base_size, base_family = base_family) +
        theme(
          axis.text = element_text(size = 12),
          axis.title = element_text(size = 14),
          panel.grid.major = element_line(color = "grey"),
          panel.grid.minor = element_blank(),
          panel.background = element_rect(fill = "aliceblue"),
          strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
          strip.text = element_text(face = "bold", size = 12, color = "black"),
          legend.position = "right",
          legend.justification = "top",
          panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
        )
    }

### The Data

Of course, the power of Spark lies in speeding up operations on large datasets. But because that’s not very handy for demonstration, I am here working with a small dataset: the raw data behind The FiveThirtyEight International Food Association’s 2014 World Cup.

This dataset is part of the **fivethirtyeight** package and provides scores for how each person rated their preference of the dishes from several countries. The following categories could be chosen:
- 4: I like this country’s traditional cuisine. I think it’s considerably above average.
- 3: I’m OK with this county’s traditional cuisine. I think it’s about average.
- 2: I dislike this country’s traditional cuisine. I think it’s considerably below average.
- 1: I hate this country’s traditional cuisine. I think it’s one of the worst in the world.
- N/A: I’m unfamiliar with this country’s traditional cuisine.

Because I think that whether someone doesn’t know a country’s cuisine is in itself information, I recoded NAs to 0.

    library(fivethirtyeight)

    food_world_cup[food_world_cup == "N/A"] <- NA
    food_world_cup[, 9:48][is.na(food_world_cup[, 9:48])] <- 0
    food_world_cup$gender <- as.factor(food_world_cup$gender)
    food_world_cup$location <- as.factor(food_world_cup$location)

The question I want to address with machine learning is whether the preference for a country’s cuisine can be predicted based on preferences of other countries’ cuisines, general knowledge and interest in different cuisines, age, gender, income, education level and/ or location.

Before I do any machine learning, however, I want to get to know the data.

First, I am calculating the percentages for each preference category and plot them with a pie chart that is facetted by country.

    # calculating percentages per category and country
    percentages <- food_world_cup %>%
      select(algeria:vietnam) %>%
      gather(x, y) %>%
      group_by(x, y) %>%
      summarise(n = n()) %>%
      mutate(Percent = round(n / sum(n) * 100, digits = 2))

    # rename countries & plot
    percentages %>%
      mutate(x_2 = gsub("_", " ", x)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(x = "", y = Percent, fill = y)) +
        geom_bar(width = 1, stat = "identity") +
        theme_minimal() +
        coord_polar("y", start = 0) +
        facet_wrap(~ x_2, ncol = 8) +
        scale_fill_brewer(palette = "Set3") +
        labs(fill = "")

As we can already see from the distributions, there are big differences between countries. Some countries’ cuisines are very well known and liked (e.g. Italy, Mexico and the United States), while others are only known to a handful of people (e.g. Algeria, Croatia, Ghana, etc.).

#### Imputing missing values

In the demographic information, there are a few missing values. These, I want to impute:

    library(mice)

    dataset_impute <- mice(food_world_cup[, -c(1, 2)], print = FALSE)
    food_world_cup <- cbind(food_world_cup[, 2, drop = FALSE],
                            mice::complete(dataset_impute, 1))

#### Transforming preference values

Strictly speaking, the preference data is categorical. But because using the values as factor levels would make the models much more complex and calculation-intensive, I am converting the factors to numbers and transforming them by dividing by the mean of the non-zero values for each country.

    food_world_cup[8:47] <- lapply(food_world_cup[8:47], as.numeric)

    countries <- paste(colnames(food_world_cup)[-c(1:7)])

    for (response in countries) {
      food_world_cup[paste(response, "trans", sep = "_")] <-
        food_world_cup[response] / mean(food_world_cup[food_world_cup[response] > 0, response])
    }

As we can see by plotting the distribution of transformed values, they are far from being normal distributions. Moreover, we still see big differences between well known and not well known countries – this means that the data is unbalanced.

    food_world_cup %>%
      gather(x, y, algeria_trans:vietnam_trans) %>%
      mutate(x_2 = gsub("_trans", "", x)) %>%
      mutate(x_2 = gsub("_", " ", x_2)) %>%
      mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
      mutate(x_2 = gsub("And", "and", x_2)) %>%
      ggplot(aes(y)) +
        geom_density(fill = "navy", alpha = 0.7) +
        my_theme() +
        facet_wrap(~ x_2, ncol = 8) +
        labs(x = "transformed preference")

#### Most liked cuisines and gender biases

Then, I wanted to know which countries were the most liked and whether we see a gender bias in preference for some country’s cuisines. The first plot shows the sum of preference values for each country, separated by male and female answers.

```r
food_world_cup_gather <- food_world_cup %>%
  collect %>%
  gather(country, value, algeria:vietnam)

food_world_cup_gather$value <- as.numeric(food_world_cup_gather$value)
food_world_cup_gather$country <- as.factor(food_world_cup_gather$country)
```

```r
food_world_cup_gather <- food_world_cup_gather %>%
  mutate(x_2 = gsub("_", " ", country)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- aggregate(food_world_cup_gather$value, by = list(food_world_cup_gather$x_2), FUN = sum)

food_world_cup_gather %>%
  mutate(x_2 = factor(x_2, levels = order$Group.1[order(order$x, decreasing = TRUE)])) %>%
  ggplot(aes(x = x_2, y = value, fill = gender)) +
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "Gender", x = "", y = "sum of preferences")
```

Cuisines from Italy, the United States and Mexico were the most liked; those from Ghana, Cameroon and the Ivory Coast were the least liked (the fact that they are also less well known plays into it, of course). No country shows an obvious gender bias, though.

To dig a bit deeper into gender differences, I am calculating the differences between mean preference of males and females for each country. Here, I am using the country list that I defined earlier when transforming the values.

Because I want to paste the country list to **dplyr**’s functions, I need to use standard evaluation (SE). By default, **dplyr** uses non-standard evaluation (NSE), i.e. variable names are given without quotation marks. This makes it possible to easily convert between R and SQL code. But each **dplyr** function also has a version for use with SE: their names have an underscore appended, here e.g. *mutate_each_()*.

```r
food_world_cup %>%
  collect %>%
  mutate_each_(funs(as.numeric), countries) %>%
  group_by(gender) %>%
  summarise_each_(funs(mean), countries) %>%
  summarise_each_(funs(diff), countries) %>%
  gather(x, y) %>%
  mutate(x_2 = gsub("_", " ", x)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = y)) +
    geom_bar(stat = "identity", fill = "navy", alpha = 0.7) +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(x = "", y = "difference\nbetween gender")
```
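As a minimal toy contrast of the two flavours (made-up data, not the survey; the underscore-suffixed verbs are the SE versions available in the dplyr 0.5.x used here):

```r
library(dplyr)

df <- data.frame(algeria = c(1, 2, 3), australia = c(2, 2, 4))
cols <- c("algeria", "australia")   # a character vector, like `countries` above

# NSE: column names are typed directly, without quotation marks
df %>% summarise_each(funs(mean), algeria, australia)

# SE: the underscore-suffixed verb takes the names as strings,
# so a pre-computed vector of column names can be passed in
df %>% summarise_each_(funs(mean), cols)
```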

All gender differences are very close to zero, so it seems that men and women generally have similar food preferences.

### Spark

Now that I’m somewhat familiar with the data, I copy it to the Spark instance:

`food_world_cup <- copy_to(sc, food_world_cup)`

#### Principal Component Analysis (PCA)

One of Spark’s machine learning functions is for PCA: *ml_pca()*. I am using it to get an idea about where the countries fall on a 2-dimensional plane of the first two principal components.

```r
pca <- food_world_cup %>%
  mutate_each_(funs(as.numeric), countries) %>%
  ml_pca(features = paste(colnames(food_world_cup)[-c(1:47)]))

library(tibble)

as.data.frame(pca$components) %>%
  rownames_to_column(var = "labels") %>%
  mutate(x_2 = gsub("_trans", "", labels)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = PC1, y = PC2, color = x_2, label = x_2)) +
    geom_point(size = 2, alpha = 0.6) +
    geom_text_repel() +
    labs(x = paste0("PC1: ", round(pca$explained.variance[1], digits = 2) * 100, "% variance"),
         y = paste0("PC2: ", round(pca$explained.variance[2], digits = 2) * 100, "% variance")) +
    my_theme() +
    guides(fill = FALSE, color = FALSE)
```

The least well known countries cluster on the top right, while the most liked are on the bottom right. PC2 seems to be mainly driven by how many 0s there are in the data.

#### Preparing the data for machine learning

First, I need to convert the factor strings of the non-country features into indexes. To do so, I am using the *ft_string_indexer()* function. For other feature transformation functions, check out **sparklyr**’s website.

```r
food_world_cup <- tbl(sc, "food_world_cup") %>%
  ft_string_indexer(input_col = "interest", output_col = "interest_idx") %>%
  ft_string_indexer(input_col = "gender", output_col = "gender_idx") %>%
  ft_string_indexer(input_col = "age", output_col = "age_idx") %>%
  ft_string_indexer(input_col = "household_income", output_col = "household_income_idx") %>%
  ft_string_indexer(input_col = "education", output_col = "education_idx") %>%
  ft_string_indexer(input_col = "location", output_col = "location_idx") %>%
  ft_string_indexer(input_col = "knowledge", output_col = "knowledge_idx")
```

Now I can divide the data into training and test sets using the *sdf_partition()* function.

```r
partitions <- food_world_cup %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 753)
```

#### Modeling

Now, I run the Random Forest algorithm to predict each country’s preference based on all other countries’ preferences and the demographic information. Check back with **sparklyr**’s website to find out about other machine learning algorithms you can use.

For each country (as response variable), I am defining the features as all other countries’ transformed values and the indexed factor variables.

Then, I am filtering out all data rows where the response variable is 0. When I left the 0s in the data, the countries with many 0s introduced a very strong bias into the prediction. Here, I needed to use standard evaluation again, this time with **lazyeval**’s *interp()* function and a formula.

I am using the *ml_random_forest()* function to run classification models. Initially, I run a model on all features, then extract the 10 features with the highest importance and re-run the model on this subset of features. This improved the prediction accuracy compared to the all-feature models.

The *sdf_predict()* function is used to predict the classes of the test set. To obtain the quality metrics F1, weighted precision and weighted recall, I need to copy the prediction table to the Spark instance and run the *ml_classification_eval()* function.

Finally, I am combining the output tables from all countries’ models to compare them.

```r
library(lazyeval)

for (response in countries) {

  # features: all other countries' transformed values and the indexed factors
  features <- colnames(partitions$training)[-grep(response, colnames(partitions$training))]
  features <- features[grep("_trans|_idx", features)]

  # first model on all features
  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")

  # keep the 10 most important features and re-run the model
  feature_imp <- ml_tree_feature_importance(sc, fit)
  features <- as.character(feature_imp[1:10, 2])

  fit <- partitions$training %>%
    filter_(interp(~ var > 0, var = as.name(response))) %>%
    ml_random_forest(intercept = FALSE, response = response, features = features, type = "classification")

  partitions$test <- partitions$test %>%
    filter_(interp(~ var > 0, var = as.name(response)))

  pred <- sdf_predict(fit, partitions$test) %>%
    collect

  pred_2 <- as.data.frame(table(pred[[response]], pred$prediction))
  pred_2$response <- response

  pred_sc <- select(pred, -rawPrediction, -probability)
  pred_sc <- copy_to(sc, pred_sc, overwrite = TRUE)

  feature_imp$response <- response

  # quality metrics on the test-set predictions
  f1 <- ml_classification_eval(pred_sc, response, "prediction", metric = "f1")
  wP <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedPrecision")
  wR <- ml_classification_eval(pred_sc, response, "prediction", metric = "weightedRecall")

  ml_eval <- data.frame(response = response, f1 = f1, weightedPrecision = wP, weightedRecall = wR)

  # combine the output tables over all countries
  if (response == "algeria") {
    feature_imp_df <- feature_imp
    ml_eval_df <- ml_eval
    pred_df <- pred_2
  } else {
    feature_imp_df <- rbind(feature_imp_df, feature_imp)
    ml_eval_df <- rbind(ml_eval_df, ml_eval)
    pred_df <- rbind(pred_df, pred_2)
  }
}
```

#### Model evaluation

Now, I can compare the prediction accuracies for each country’s model by plotting the F1, weighted precision and weighted recall values.

*Precision*

In classification models, precision gives the proportion of samples classified as a given class that actually belong to it, i.e. the “true positives” (how many of the samples that were classified as “5” were actually a “5”). If we have a high precision, this means that most of the samples our model assigned to a class really belong to that class.

*Recall*

Recall, or sensitivity, gives the proportion of all true “5” samples in the data that were correctly classified as “5”. If we have a high recall, this means that our model correctly detected most members of the true classes.

*F1 score*

The F1 score gives the harmonic mean of precision and recall.

Let’s say we had 10 “5s” in the training data. The prediction table classified 8 samples as “5s”, 6 of which were correctly classified. In this example, we have 2 false positives and 4 false negatives. This means we would have a precision of 6/8 and a recall of 6/10.
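To make the arithmetic explicit, here is that toy example in plain R (a side calculation, not part of the Spark pipeline):

```r
tp <- 6   # samples correctly classified as "5"
fp <- 2   # samples classified as "5" that were not a "5"
fn <- 4   # true "5"s the model missed

precision <- tp / (tp + fp)   # 6/8  = 0.75
recall    <- tp / (tp + fn)   # 6/10 = 0.60

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)   # ~0.67

c(precision = precision, recall = recall, f1 = f1)
```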

```r
results <- ml_eval_df %>%
  mutate(x_2 = gsub("_", " ", response)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2))

order <- results$x_2[order(results$f1, decreasing = TRUE)]

gather(results, x, y, f1:weightedRecall) %>%
  mutate(x_2 = factor(x_2, levels = order)) %>%
  ggplot(aes(x = x_2, y = y, fill = x)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.9) +
    scale_fill_brewer(palette = "Set1") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(fill = "", color = "", x = "", y = "value")
```

The plot above shows that the model predicting the preference for Spanish dishes had the highest accuracy, while the models for Italy, India and Ireland had the lowest.

To see which features were used in the more and less successful models, I am plotting the values of the 10 final features for the Spanish, Greek and Italian food preference categories 1 – 5 in the original dataset.

```r
# top 10 features of the Spain model (analogous to the Greece and Italy blocks below)
feats <- feature_imp_df %>%
  filter(response == "spain") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("spain", as.character(feats$feature))) %>%
  gather(x, y, -spain) %>%
  filter(spain > 0) %>%
  group_by(spain, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = spain, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Spain")
```

```r
feats <- feature_imp_df %>%
  filter(response == "greece") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("greece", as.character(feats$feature))) %>%
  gather(x, y, -greece) %>%
  filter(greece > 0) %>%
  group_by(greece, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = greece, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Greece")
```

```r
feats <- feature_imp_df %>%
  filter(response == "italy") %>%
  slice(1:10)

as.data.frame(food_world_cup) %>%
  select_(.dots = c("italy", as.character(feats$feature))) %>%
  gather(x, y, -italy) %>%
  filter(italy > 0) %>%
  group_by(italy, x) %>%
  summarise(mean = mean(y)) %>%
  mutate(x_2 = gsub("_trans", "", x)) %>%
  mutate(x_2 = gsub("_idx", "", x_2)) %>%
  mutate(x_2 = gsub("_", " ", x_2)) %>%
  mutate(x_2 = gsub("(^|[[:space:]])([[:alpha:]])", "\\1\\U\\2", x_2, perl = TRUE)) %>%
  mutate(x_2 = gsub("And", "and", x_2)) %>%
  ggplot(aes(x = x_2, y = italy, fill = mean)) +
    geom_tile(width = 0.9, height = 0.9) +
    scale_fill_gradient2(low = "white", high = "red", name = "mean") +
    my_theme() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "", y = "Italy")
```

### Conclusions

This dataset was a bit difficult because the preferences were highly unbalanced, i.e. the distributions of preferences were very different between countries. Some countries’ cuisines were largely unknown, while others’ were well known and generally popular. If I kept the unknown class (0) in the prediction models, countries with many 0s jumped up in prediction accuracy. But this was mainly due to the fact that the chance of correctly classifying a sample as 0 was disproportionately high (leading to high precision). Removing the 0s while working on raw preference values showed the same effect as before, but now countries with a majority of 5s (e.g. Italy) were overly “accurate”. Transforming the data and removing the 0s led to a more evenly distributed accuracy between well-known and less well-known countries. But by removing them, the number of samples used for modeling varied strongly between countries, introducing another type of bias.

It was not possible to avoid bias entirely with this dataset, but I wanted to show it as an example nonetheless. Usually, machine learning examples show datasets where the models worked very well, leaving the reader in awe of the powers of machine learning. And while there are certainly powerful and impressive prediction models, real-life data is not always as simple. As this dataset shows, it’s not always straightforward to build a good classification model, even though you’d intuitively assume that it should be easy to predict food preferences based on which other cuisines someone likes…

The models could potentially be improved by feature engineering, a different handling of the factors and/or different algorithms, but I’ll leave this for another time.

Next week, I’ll be looking into how to use the **h2o** package with Spark using **rsparkling**.

Other machine learning topics I have covered include:

- a basic machine learning workflow: Can we predict flu deaths with Machine Learning and R?
- extreme gradient boosting: Extreme Gradient Boosting and Preprocessing in Machine Learning – Addendum to predicting flu outcome with R
- feature selection: Feature Selection in Machine Learning (Breast Cancer Datasets)

```
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tibble_1.2 fivethirtyeight_0.1.0 dplyr_0.5.0
## [4] ggrepel_0.6.5 ggplot2_2.2.1 tidyr_0.6.1
## [7] sparklyr_0.5.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 knitr_1.15.1 magrittr_1.5
## [4] rappdirs_0.3.1 munsell_0.4.3 colorspace_1.3-2
## [7] R6_2.2.0 plyr_1.8.4 stringr_1.1.0
## [10] httr_1.2.1 tools_3.3.2 parallel_3.3.2
## [13] grid_3.3.2 gtable_0.2.0 config_0.2
## [16] DBI_0.5-1 withr_1.0.2 htmltools_0.3.5
## [19] lazyeval_0.2.0 yaml_2.1.14 assertthat_0.1
## [22] rprojroot_1.2 digest_0.6.12 RColorBrewer_1.1-2
## [25] base64enc_0.1-3 evaluate_0.10 rmarkdown_1.3
## [28] labeling_0.3 stringi_1.1.2 scales_0.4.1
## [31] backports_1.0.5 jsonlite_1.2
```


- Bar bar plots but not Babar plots

(This article was first published on **Maëlle**, and kindly contributed to R-bloggers)

You might have heard of the “bar bar plots” movement whose goal is to prevent using (let’s use `ggplot2` language, shall we) `geom_bar` when you could have used e.g. `geom_boxplot` or `geom_histogram`, because the bar plot hides the variability of the distributions you say you’re comparing, even if you add some sort of error bar. I whole-heartedly agree with this movement, but in this post I’d like to present some OK bar plots that represent counts of individuals in different categories. You might know them as `geom_bar(blabla, stat = "identity")` or `geom_col`. They’re still bar plots and they’re great, in particular when you make puns out of them, which is what I did with Miles McBain.
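As a quick toy illustration of what the movement is about (my own sketch with `mtcars`, which reappears below anyway; `fun.y` is the argument name in the ggplot2 version used here): a bar of group means hides the spread that a boxplot shows.

```r
library(ggplot2)

# a bar of group means: the kind of bar plot the movement warns against
ggplot(mtcars, aes(factor(cyl), mpg)) +
  stat_summary(fun.y = mean, geom = "bar")

# the same comparison with the distributions visible
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot()
```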

The reason I mentioned bar plots to Miles was my wanting to recycle some old code of mine (again) for a blog post. It was code from my “let’s test all the kitsch R plots have to offer” period, when I had started using Thomas Leeper’s `colourlovers` package for adding patterns on bars. `colourlovers` is a great package for getting trendy colour palettes, although I’ll admit I’m a `viridis` person. But `colourlovers` also has patterns, about which the package documentation states: “Patterns are images created on COLOURlovers using a specified palette of colors. They serve as examples of how to use the palette in practice.” At the time I read those sentences I got really excited because I was convinced one could do *more* with patterns, like filling bar plots with them. In this post, I’ll prove you can, with my non-generalizable workflow.

```r
library("dplyr")
library("tidyr")
library("httr")
library("grid")
library("ggplot2")
library("colourlovers")
```

# General comments on using patterns

A pattern is a square image that I convert to a raster. This raster is a matrix with 200 rows and 200 columns. In this matrix $x_{i,j}$ represents the hex code of the colour of a pixel. Below I show how to get a pattern figure and how to plot it on its own.

```r
# get pattern
catsBlue <- clpattern(id = "3363032")

# get the URL of the picture corresponding to the pattern
imageURL <- catsBlue$imageUrl

# get the picture
picture <- content(GET(as.character(imageURL)))

# convert it to a raster
img <- as.raster(picture)

# plot it!
plot(img)
```

This pattern can also be seen as a puzzle piece: one can copy the pattern several times and put each copy side by side and/or pile them up and get something that looks homogeneous with no problem at the border of each piece.
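As a minimal sketch of that puzzle idea (not part of the final function; it reuses the `img` raster created above), one can stack copies of the pattern and plot the result:

```r
# stack three copies of the 200 x 200 pattern vertically: the result is 600 x 200
# and the seams between the copies are invisible
img_tall <- apply(img, MARGIN = 2, function(x) rep(x, 3))
plot(as.raster(img_tall))
```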

In `ggplot2`, I’ll add the patterns with `geom_point`. The main issues will be to re-size the puzzle pieces (how big should a single puzzle piece be for the graph to look nice?), to paste pieces together (how many puzzle pieces?), and to crop the puzzle (the puzzle should not be bigger than the bar it covers, for instance).

# Writing code for decorating a bar plot

Here is the original bar plot I aim to make fancier.

```r
# Normal plot
c <- ggplot(mtcars, aes(factor(cyl)))
c <- c + geom_bar(width = .5, aes(fill = factor(cyl))) +
  theme(text = element_text(size = 20))
c
```

Now, since my puzzle pieces are squares, I want the plot to have an x/y ratio such that the distance between two major grid lines on the x axis is the same as the distance between two major grid lines on the y axis.

```r
plotInfo <- ggplot_build(c)
extentX <- diff(plotInfo$layout$panel_ranges[[1]]$x.major_source)[1]
extentY <- diff(plotInfo$layout$panel_ranges[[1]]$y.major_source)[1]
c <- c + coord_fixed(ratio = extentX/extentY)
c
```

I’ll admit once again that since writing the code and publishing it I’ve slept many times and have thus forgotten how I came up with the solution. Today I had to change the code for getting `extentY` and `extentX` a bit, because `ggplot_build()` gives a slightly different output since `ggplot2`’s newest version was released.

I shall now get three patterns from colourlovers. I went on the colourlovers website itself to find three patterns using the same template with different colours, so I know their IDs.

```r
# get the patterns
get_patterns <- function(id_blue, id_red, id_green){
  blue <- clpattern(id = id_blue)
  red <- clpattern(id = id_red)
  green <- clpattern(id = id_green)
  list(red, green, blue)
}

patterns <- get_patterns(id_blue = 4348319, id_red = 4376078, id_green = 4360659)
```

I shall first get one colour from each pattern and re-do the figure with these colours. I do this to have a legend later. I don’t want to try and reproduce parts of the patterns to get them in the legend.

```r
library("purrr")

get_colors <- function(pattern){
  pattern$colors[3]$hex
}

addhash <- function(color){
  paste0("#", color)
}

colors <- map(patterns, get_colors) %>%
  map(addhash) %>%
  unlist()

c2 <- c + scale_fill_manual(values = colors)
c2
```

Now let’s add the patterns to the graph! It’s quite inelegant since I use a loop for adding things successively to the graph.

```r
add_patterns <- function(plot, patterns, colors){
  for (i in 1:length(levels(factor(mtcars$cyl)))){
    imageURL <- patterns[[i]]$imageUrl
    # get pattern
    picture <- content(GET(as.character(imageURL)))
    # picture is a 4 dimensional array
    img <- as.raster(picture)
    # we have to repeat the data.frame/pattern
    # I repeat it so that one extentY = one square
    xmax <- 1
    ymax <- sum(mtcars$cyl == levels(factor(mtcars$cyl))[i])
    size <- ceiling(ymax*2/extentY)
    img2 <- apply(img, MARGIN = 2, function(x) rep(x, size))
    # from matrix to data.frame
    img2 <- tbl_df(as.data.frame(as.matrix(img2)))
    # I change the column names so that they correspond to x coordinates
    names(img2) <- seq(i - 0.25, to = i + 0.25, length = ncol(img2))
    # I transform img2 so that I have one line = 1 x, 1 y and 1 colour
    dataPlot <- img2 %>%
      mutate(y = seq(from = size/2*extentY, to = 0, length = nrow(img2))) %>%
      gather(x, value, 1:ncol(img2)) %>%
      # filter part of the pattern that doesn't fit in the original bar
      filter(y <= ymax) %>%
      mutate(x = as.numeric(x))
    plot <- plot + geom_point(data = dataPlot, aes(x, y), col = dataPlot$value)
  }
  plot
}

add_patterns(plot = c2, patterns = patterns, colors = colors) +
  ggtitle("Babar bar plot")
```

We’ve just made a Babar bar plot! Thanks Miles for this pun. Now let’s apply the same principles again, and by principles I mean applying the `colourlovers` code I’ve just presented, and using puns developed with Miles. A recipe for success!

# Other punny plots

## Barber plot

```r
patterns <- get_patterns(id_blue = 3490878, id_red = 3277163, id_green = 4038034)

colors <- map(patterns, get_colors) %>%
  map(addhash) %>%
  unlist()

c2 <- c + scale_fill_manual(values = colors)

add_patterns(plot = c2, patterns = patterns, colors = colors) +
  ggtitle("Barber bar plot")
```

This is one plot decoration you could use to illustrate something about Occam’s razor.

## Bark plot

```r
patterns <- get_patterns(id_blue = 1816764, id_red = 1825775, id_green = 1815897)

colors <- map(patterns, get_colors) %>%
  map(addhash) %>%
  unlist()

c2 <- c + scale_fill_manual(values = colors)

add_patterns(plot = c2, patterns = patterns, colors = colors) +
  ggtitle("Bark bar plot")
```

I’m less convinced by this one, maybe it’d have looked better with barking dogs?

## BART plot

```r
patterns <- get_patterns(id_blue = 1902711, id_red = 1916712, id_green = 1930435)

colors <- map(patterns, get_colors) %>%
  map(addhash) %>%
  unlist()

c2 <- c + scale_fill_manual(values = colors)

add_patterns(plot = c2, patterns = patterns, colors = colors) +
  ggtitle("BART bar plot")
```

This one is the loudest bar plot you’ll ever make since it was inspired by San Francisco BART. I might have come up with this idea after not finding a Bart Simpson pattern.

I told Miles I’d stop once I’d only think of a barf plot which I don’t want to draw ever. As I had said the first time I made a bar plot with patterns, I’m not sure how useful it is, but you never know when you’re gonna need a fancy plot to get someone’s attention, right?
