future: Reproducible RNGs, future_lapply() and more


future 1.3.0 is available on CRAN. With futures, it is easy to write R code once, which the user can choose to evaluate in parallel using whatever resources they have available, e.g. a local machine, a set of local machines, a set of remote machines, a high-end compute cluster (via future.BatchJobs and soon also future.batchtools), or in the cloud (e.g. via googleComputeEngineR).

Futures makes it easy to harness any resources at hand.

Thanks to great feedback from the community, this new version provides:

  • A convenient lapply() function
    • Added future_lapply() that works like lapply() and gives identical results, with the difference that futures are used internally. Depending on the user's choice of plan(), these calculations may be processed sequentially, in parallel, or distributed across multiple machines.
    • Perfectly reproducible random number generation (RNG) is guaranteed given the same initial seed, regardless of the type of futures used and the choice of load balancing. Argument future.seed = TRUE (default) will use a random initial seed, which may also be specified as future.seed = <integer>. L'Ecuyer-CMRG RNG streams are used internally; see the sketch after this list.
    • Load balancing can be controlled by argument future.scheduling, which is a scalar adjusting how many futures each worker should process.
  • Clarifies distinction between developer and end user
    • The end user controls what future strategy to use by default, e.g. plan(multiprocess) or plan(cluster, workers = c("machine1", "machine2", "remote.server.org")).
    • The developer controls whether futures should be resolved eagerly (default) or lazily, e.g. f <- future(..., lazy = TRUE). Because of this, plan(lazy) is now deprecated.
  • Is even more friendly to multi-tenant compute environments
    • availableCores() returns the number of cores available to the current R process. On a regular machine, this typically corresponds to the number of cores on the machine (parallel::detectCores()). If option mc.cores or environment variable MC_CORES is set, then that will be returned. However, on compute clusters using schedulers such as SGE, Slurm, and TORQUE / PBS, the function detects the number of cores allotted to the job by the scheduler and returns that instead. This way developers don’t have to adjust their code to match a certain compute environment; the default works everywhere.
    • With the new version, the fallback value used when nothing else is specified no longer has to be the number of cores on the machine; it can be overridden via option future.availableCores.fallback or environment variable R_FUTURE_AVAILABLECORES_FALLBACK. For instance, by setting R_FUTURE_AVAILABLECORES_FALLBACK=1 system-wide in HPC environments, any user running outside of the scheduler will automatically use single-core processing unless explicitly requesting more cores. This lowers the risk of overloading the CPU by mistake.
    • Analogously to how availableCores() returns the number of cores, the new function availableWorkers() returns the host names available to the R process. The default is rep("localhost", times = availableCores()), but when using HPC schedulers it may be the host names of other compute nodes allocated to the job. See the short example below.
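
To illustrate the reproducible-RNG guarantee, here is a minimal sketch (the toy function and the seed value are illustrative, not from the package documentation):

library("future")  # in future 1.3.0, future_lapply() ships with the future package

draw <- function(i) rnorm(1)  # toy task that uses the RNG

plan(multisession)
y1 <- future_lapply(1:10, FUN = draw, future.seed = 42L)

plan(sequential)
y2 <- future_lapply(1:10, FUN = draw, future.seed = 42L)

stopifnot(identical(y1, y2))  # identical results regardless of plan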

For full details on updates, please see the NEWS file. The future package installs out-of-the-box on all operating systems.
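
As a small illustration of availableCores() and availableWorkers(), here is a hedged sketch (the values shown are what one might see on a regular 8-core machine outside any scheduler):

library("future")

availableCores()    # e.g. 8; respects mc.cores, HPC scheduler variables,
                    # and the new future.availableCores.fallback option
availableWorkers()  # e.g. rep("localhost", 8); under an HPC scheduler it may
                    # instead list the compute nodes allocated to the job

plan(multisession, workers = availableCores())  # works in any environment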

A quick example

Here is the bootstrap example from help("clusterApply", package = "parallel"), adapted to make use of futures.

library("future")
library("boot")

run <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 5000, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

# base::lapply()
system.time(boot <- lapply(1:100, FUN = run))
### user system elapsed
### 133.637 0.000 133.744

# Sequentially on the local machine
plan(sequential)
system.time(boot0 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
### user system elapsed
### 134.916 0.003 135.039

# In parallel on the local machine (with 8 cores)
plan(multisession)
system.time(boot1 <- future_lapply(1:100, FUN = run, future.seed = 0xBEEF))
### user system elapsed
### 0.960 0.041 29.527
stopifnot(all.equal(boot1, boot0))

What’s next?

The future.BatchJobs package, which builds on top of BatchJobs, provides future strategies for various HPC schedulers, e.g. SGE, Slurm, and TORQUE / PBS. For example, by using plan(batchjobs_torque) instead of plan(multiprocess), your futures will be resolved on a compute cluster instead of in parallel on your local machine. That's it! However, since last year the BatchJobs package has been decommissioned and its authors recommend that everyone use their new batchtools package instead. Just like BatchJobs, it is a very well written package, but it is also more robust against cluster problems and supports more types of HPC schedulers. Because of this, I've been working on future.batchtools, which I hope to be able to release soon.

Finally, I’m really keen on looking into how futures can be used with Shaun Jackman’s lambdar, which is a proof-of-concept that allows you to execute R code on Amazon’s “serverless” AWS Lambda framework. My hope is that, in a not too far future (pun not intended*), we’ll be able to resolve our futures on AWS Lambda using plan(aws_lambda).

Happy futuring!

(*) Alright, I admit, it was intended.


SatRday and visual inference of vine copulas


SatRday

From the 16th to the 18th of February, satRday was held in Cape Town, South Africa. The programme kicked off with two days of workshops, followed by the conference on Saturday. The workshops were divided into three broad sections:

R and Git

Integrating version control into your workflow through git and RStudio has never been easier. If you are not already following the principle of Pull, Commit and Push in your workflow, I recommend you visit the website http://happygitwithr.com/ which will help you correct your current workflow indiscretions!

If, however, you are an avid user of GitHub and regularly push your commits with correct messages, here is a gem of a website that made my day: starlogs

Shiny

The Shiny workshop was an amazing exhibition of what can be done in an afternoon with the power of Shiny and RStudio! We got to build an interactive dashboard investigating disease data from South Africa using the new flexdashboard.

Training Models

Once again I realised how powerful the caret package can be in training and testing a wide range of models in a modular way without too much hassle or fuss. I am really keen to start playing with the h2o package after a brief overview of its ensemble capabilities.

The Conference

The Saturday saw a multitude of speakers take the stage to discuss some of the most interesting topics and applications of R that I have seen in a very long time. The full programme can be viewed on the website. I also had the opportunity to talk about how we can combine Shiny and the visNetwork HTML widget as a way to conduct analysis of high-dimensional Vine Copulas.

R, as open-source software with a large following, provides quick and easy access to complex models and methods that are used and applied widely in academia but, more importantly, in practice. This increase in theoretical and mathematical complexity has created an obligation for practitioners to break down walls of jargon and mathematical hyperbole into easily digestible actions and insights.

My attempt at addressing this philosophical idea was to create a small package called VineVizR. The package can be viewed and downloaded here. I used this package and its function to build a shiny App where one can explore the interlinkages between 23 of South Africa’s largest companies by market cap using Vine Copulas.

Vine copulas allow us to model complex linkages among assets, but currently the visual illustration of vine copulae from the VineCopula package offers a bland plotting output that is hard to interpret. Combining the visNetwork HTML widget with the VineCopula RVM output offers an interactive and visually appealing way to understand and interpret your output.

The app can be found on my ShinyApp page. I hope you enjoy this small app and the metadata that has been integrated into it. If you feel that you still want to play a bit more, go download the package and experiment with the VizVineR function, where you can create your own groups! But be aware, I don't see myself as a developer, so when you peek under the hood and see any improvements that can be made, let me know.

View of the dashboard


Building Shiny App exercises part 7


Connect widgets & plots

In the seventh part of our journey we are ready to connect more of the widgets we created before with our k-means plot, in order to fully control its output. Of course we will also properly reform the plot itself to make it a real k-means plot.
Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

First of all, let's move the widgets we are going to use from the sidebarPanel into the mainPanel, specifically under our plot.

Exercise 1

Remove the textInput from your ui.R file. Then place the checkboxGroupInput and the selectInput in the same row with the sliderInput. Name them “Variable X” and “Variable Y” respectively. HINT: Use fluidRow and column; see the sketch below.
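
A minimal sketch of what this layout could look like (the input ids, labels and column widths are illustrative assumptions, not the official solution):

#ui.R (fragment)
mainPanel(
  plotOutput("All"),
  fluidRow(
    column(4, sliderInput("clusters", "Clusters", min = 1, max = 9, value = 4)),
    column(4, checkboxGroupInput("xcol", "Variable X", choices = names(iris))),
    column(4, selectInput("ycol", "Variable Y", choices = names(iris)))
  )
)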

Create a reactive expression

Reactive expressions are expressions that can read reactive values and call other reactive expressions. Whenever a reactive value changes, any reactive expressions that depended on it are marked as “invalidated” and will automatically re-execute if necessary. If a reactive expression is marked as invalidated, any other reactive expressions that recently called it are also marked as invalidated. In this way, invalidations ripple through the expressions that depend on each other.
A reactive expression is created like this: example <- reactive({ })
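
As a toy sketch of this invalidation ripple (the ids here are made up for illustration): changing input$n invalidates sq(), which invalidates out(), so the text output re-renders.

#server.R (fragment)
sq  <- reactive({ input$n^2 })      # re-executes when input$n changes
out <- reactive({ sq() + 1 })       # invalidated whenever sq() is invalidated
output$txt <- renderText({ out() }) # re-renders at the end of the chain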

Exercise 2

Place a reactive expression in server.R, at any spot except inside output$All and name it “Data”. HINT: Use reactive

Connect your dataset’s variables with your widgets.

Now let’s connect your selectInput with the variables of your dataset as in the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
      selectInput("ycol", "Y Variable", names(iris))
    ),
    mainPanel(h1("Main"))
  )
))
#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$ycol)]
  })
})

Exercise 3

Put the variables of the iris dataset as inputs in your selectInput as “Variable Y”. HINT: Use names.

Exercise 4

Do the same for checkboxGroupInput and “Variable X”. HINT: Use names.

Select the fourth variable as the default, like the example below.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris),
                         selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris),
                  selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))
#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
})

Exercise 5

Make the second variable the default choice for both widgets. HINT: Use selected.

Now follow the example below to create a new reactive expression that holds the k-means calculation.

#ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris),
                         selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris),
                  selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))
#server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
  example2 <- reactive({
    # kmeans() also needs the number of centers; 3 is a placeholder here
    # (in the finished app it would come from your sliderInput)
    kmeans(example(), centers = 3)
  })
})

Exercise 6

Create the reactive expression Clusters and put in it the kmeans function, applied to the reactive expression Data. HINT: Use reactive.

Connect your plot with the widgets.

It is time to connect your plot with the widgets.

Exercise 7

Put Data inside renderPlot as the first argument, replacing the data you had chosen to plot until now. Moreover, delete xlab and ylab.

Improve your k-means visualization.

You can automatically change the colours of your clusters by copying and pasting this part of code as the first argument of renderPlot, before the plot function:

palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

We will allow up to nine clusters, so we pick nine colours.

Exercise 8

Set the min of your sliderInput to 1, its max to 9 and its value to 4, and use the palette function to provide the colours.

This is how you can give different colors to your clusters. To activate these colors, put this part of code into your plot function.

col = Clusters()$cluster,

Exercise 9

Activate the palette function.

To make your clusters easier to spot, you can fully colour their points by adding this into your plot function:
pch = 20, cex = 3

Exercise 10

Fully color the points of your plot.
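
Putting the pieces from Exercises 7-10 together, one possible shape of the final renderPlot is sketched below (Data and Clusters are the reactive expressions from the earlier exercises; the rest is illustrative):

#server.R (fragment)
output$All <- renderPlot({
  palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
            "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))
  plot(Data(),
       col = Clusters()$cluster,  # one palette colour per cluster
       pch = 20, cex = 3)         # solid, enlarged points
})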


Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization


factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses, including:

  1. Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

  2. Correspondence Analysis (CA), which is an extension of the principal component analysis suited to analyse a large contingency table formed by two qualitative variables (or categorical data).

  3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

  4. Multiple Factor Analysis (MFA) dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

  5. Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.

  6. Factor Analysis of Mixed Data (FAMD), a particular case of MFA, dedicated to analyzing a data set containing both quantitative and qualitative variables.

There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.

However, the result is presented differently depending on the package used. To help in the interpretation and visualization of multivariate analyses – such as cluster analysis and dimensionality reduction – we developed an easy-to-use R package named factoextra.

  • The R package factoextra provides flexible, easy-to-use methods to quickly extract, in a human-readable standard data format, the analysis results from the different packages mentioned above.

  • It produces a ggplot2-based elegant data visualization with less typing.

  • It also contains many functions facilitating clustering analysis and visualization.

We’ll use i) the FactoMineR package (Sebastien Le, et al., 2008) to compute PCA, (M)CA, FAMD, MFA and HCPC; ii) and the factoextra package for extracting and visualizing the results.

FactoMineR is a great package, and my favorite, for computing principal component methods in R. It's very easy to use and very well documented. The official website is available at: http://factominer.free.fr/. Thanks to François Husson for his impressive work.

The figure below shows the methods whose outputs can be visualized using the factoextra package. The official online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra.

(Figure: principal component methods and clustering results that factoextra can visualize)

Why use factoextra?

  1. The factoextra R package can handle the results of PCA, CA, MCA, MFA, FAMD and HMFA from several packages, for extracting and visualizing the most important information contained in your data.

  2. After PCA, CA, MCA, MFA, FAMD and HMFA, the most important row/column elements can be highlighted using:
  • their cos2 values corresponding to their quality of representation on the factor map
  • their contributions to the definition of the principal dimensions.

If you want to do this, the factoextra package provides a convenient solution.

  3. PCA and (M)CA are sometimes used for prediction problems: one can predict the coordinates of new supplementary variables (quantitative and qualitative) and supplementary individuals using the information provided by a previously performed PCA or (M)CA. This can be done easily with the FactoMineR package.

If you want to make predictions with PCA/MCA and visualize the position of the supplementary variables/individuals on the factor map using ggplot2, then factoextra can help you: it's quick, you write less and do more. A minimal sketch follows.
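
For instance, a hedged sketch of supplementary-individual prediction with FactoMineR (treating decathlon2 rows 24:27 as the supplementary individuals is an illustrative assumption):

library("FactoMineR")
data("decathlon2")
# PCA on the active individuals/variables
res.pca <- PCA(decathlon2[1:23, 1:10], graph = FALSE)
# Predict the coordinates of new (supplementary) individuals
ind.sup <- predict(res.pca, newdata = decathlon2[24:27, 1:10])
head(ind.sup$coord)  # their positions on the factor map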

  4. Several functions from different packages – FactoMineR, ade4, ExPosition, stats – are available in R for performing PCA, CA or MCA. However, the components of the output vary from package to package.

No matter which package you decide to use, factoextra can give you a human-understandable output, as sketched below.
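
As a small sketch of this, get_eig() returns the same standardized eigenvalue table whether the PCA comes from stats or from FactoMineR (iris is used here purely for illustration):

library("factoextra")
library("FactoMineR")

res1 <- prcomp(iris[, -5], scale. = TRUE)  # PCA with stats
res2 <- PCA(iris[, -5], graph = FALSE)     # PCA with FactoMineR

get_eig(res1)  # same human-readable format
get_eig(res2)  # for both outputs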

Installing FactoMineR

The FactoMineR package can be installed and loaded as follows:

# Install
install.packages("FactoMineR")

# Load
library("FactoMineR")

Installing and loading factoextra

  • factoextra can be installed from CRAN as follows:
install.packages("factoextra")
  • Or, install the latest version from GitHub:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
  • Load factoextra as follows:
library("factoextra")

Main functions in the factoextra package

See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.

Visualizing dimension reduction analysis outputs

fviz_eig (or fviz_eigenvalue): Extract and visualize the eigenvalues/variances of dimensions.
fviz_pca: Graph of individuals/variables from the output of Principal Component Analysis (PCA).
fviz_ca: Graph of column/row variables from the output of Correspondence Analysis (CA).
fviz_mca: Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA).
fviz_mfa: Graph of individuals/variables from the output of Multiple Factor Analysis (MFA).
fviz_famd: Graph of individuals/variables from the output of Factor Analysis of Mixed Data (FAMD).
fviz_hmfa: Graph of individuals/variables from the output of Hierarchical Multiple Factor Analysis (HMFA).
fviz_ellipses: Draw confidence ellipses around the categories.
fviz_cos2: Visualize the quality of representation of the row/column variables from the results of PCA, CA, MCA functions.
fviz_contrib: Visualize the contributions of row/column elements from the results of PCA, CA, MCA functions.

Extracting data from dimension reduction analysis outputs

get_eigenvalue: Extract and visualize the eigenvalues/variances of dimensions.
get_pca: Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs.
get_ca: Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis outputs.
get_mca: Extract results from Multiple Correspondence Analysis outputs.
get_mfa: Extract results from Multiple Factor Analysis outputs.
get_famd: Extract results from Factor Analysis of Mixed Data outputs.
get_hmfa: Extract results from Hierarchical Multiple Factor Analysis outputs.
facto_summarize: Subset and summarize the output of factor analyses.

Clustering analysis and visualization

dist (fviz_dist, get_dist): Enhanced distance matrix computation and visualization.
get_clust_tendency: Assessing clustering tendency.
fviz_nbclust (fviz_gap_stat): Determining and visualizing the optimal number of clusters.
fviz_dend: Enhanced visualization of dendrograms.
fviz_cluster: Visualize clustering results.
fviz_mclust: Visualize model-based clustering results.
fviz_silhouette: Visualize silhouette information from clustering.
hcut: Compute hierarchical clustering and cut the tree.
hkmeans (hkmeans_tree, print.hkmeans): Hierarchical k-means clustering.
eclust: Visual enhancement of clustering analysis.

Dimension reduction and factoextra

As depicted in the figure below, the type of analysis to be performed depends on the data set formats and structures.

(Figure: choosing the analysis method according to the data set format and structure)

In this section, we start by illustrating classical methods – such as PCA, CA and MCA – for analyzing a data set containing continuous variables, a contingency table, and qualitative variables, respectively.

We continue by discussing advanced methods – such as FAMD, MFA and HMFA – for analyzing a data set containing a mix of variables (qualitative & quantitative), organized or not into groups.

Finally, we show how to perform hierarchical clustering on principal components (HCPC), which is useful for clustering a data set containing only qualitative variables, or mixed data of qualitative and quantitative variables.
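
As a quick, hedged sketch of HCPC (the USArrests data and the parameter values are illustrative choices, not from this article):

library("FactoMineR")
library("factoextra")

# PCA first, keeping a few components, then hierarchical clustering on them
res.pca <- PCA(USArrests, ncp = 3, graph = FALSE)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)  # -1 = automatic cut

fviz_cluster(res.hcpc)  # visualize the clusters on the factor map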

Principal component analysis

  • Data: decathlon2 [in factoextra package]
  • PCA function: FactoMineR::PCA()
  • Visualization: factoextra::fviz_pca()

Read more about computing and interpreting principal component analysis at: Principal Component Analysis (PCA).

  1. Loading data:
library("factoextra")
data("decathlon2")
df <- decathlon2[1:23, 1:10]  # the active individuals and variables
  2. Principal component analysis:
library("FactoMineR")
res.pca <- PCA(df, graph = FALSE)
  3. Extract and visualize eigenvalues/variances:
# Extract eigenvalues/variances
get_eig(res.pca)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.1242133        41.242133                    41.24213
## Dim.2   1.8385309        18.385309                    59.62744
## Dim.3   1.2391403        12.391403                    72.01885
## Dim.4   0.8194402         8.194402                    80.21325
## Dim.5   0.7015528         7.015528                    87.22878
## Dim.6   0.4228828         4.228828                    91.45760
## Dim.7   0.3025817         3.025817                    94.48342
## Dim.8   0.2744700         2.744700                    97.22812
## Dim.9   0.1552169         1.552169                    98.78029
## Dim.10  0.1219710         1.219710                   100.00000
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))

  4. Extract and visualize results for variables:

# Extract the results for variables
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the variables"
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
# Coordinates of variables
head(var$coord)
##                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
## X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
## Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
## Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
## High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
## X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
## X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200
# Contribution of variables
head(var$contrib)
##                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
## X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
## Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
## Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
## High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
## X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
## X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
# Graph of variables: default plot
fviz_pca_var(res.pca, col.var = "black")

It’s possible to control variable colors using their contributions (“contrib”) to the principal axes:

# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

  5. Variable contributions to the principal axes:
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

  6. Extract and visualize results for individuals:
# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description
## 1 "$coord"   "Coordinates for the individuals"
## 2 "$cos2"    "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
##                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
## SEBRLE     0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
## CLAY       0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
## BERNARD   -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
## YURKOV    -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
## McMULLEN   0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286
# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
    # cos2 = the quality of the individuals on the factor map
    # Use points only
# 3. Use gradient color
fviz_pca_ind(res.pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

# Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE)

  7. Color individuals by groups:
# Compute PCA on the iris data set
# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[, -5], graph = FALSE)
# Color individuals by group, with concentration ellipses
# (the plotting options here are illustrative)
fviz_pca_ind(iris.pca, label = "none", habillage = iris$Species,
             addEllipses = TRUE)

Correspondence analysis

  • Data: housetasks [in factoextra]
  • CA function: FactoMineR::CA()
  • Visualize with factoextra::fviz_ca()

Read more about computing and interpreting correspondence analysis at: Correspondence Analysis (CA).

  • Compute CA:
# Loading data
data("housetasks")

# Computing CA
library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)
  • Extract results for row/column variables:
# Result for row variables
get_ca_row(res.ca)

# Result for column variables
get_ca_col(res.ca)
  • Biplot of rows and columns
fviz_ca_biplot(res.ca, repel = TRUE)

To visualize only row points or column points, type this:

# Graph of row points
fviz_ca_row(res.ca, repel = TRUE)

# Graph of column points
fviz_ca_col(res.ca)

# Visualize row contributions on axes 1
fviz_contrib(res.ca, choice ="row", axes = 1)

# Visualize column contributions on axes 1
fviz_contrib(res.ca, choice ="col", axes = 1)

Multiple correspondence analysis

  • Data: poison [in factoextra]
  • MCA function: FactoMineR::MCA()
  • Visualization: factoextra::fviz_mca()

Read more about computing and interpreting multiple correspondence analysis at: Multiple Correspondence Analysis (MCA).

  1. Computing MCA:
library(FactoMineR)
data(poison)
# Columns 1-2 (quantitative) and 3-4 (qualitative) are treated as supplementary
res.mca <- MCA(poison, quanti.sup = 1:2, quali.sup = 3:4, graph = FALSE)
  2. Extract results for variables and individuals:
# Extract the results for variable categories
get_mca_var(res.mca)

# Extract the results for individuals
get_mca_ind(res.mca)
  3. Contribution of variables and individuals to the principal axes:
# Visualize variable category contributions on axes 1
fviz_contrib(res.mca, choice ="var", axes = 1)

# Visualize individual contributions on axes 1
# select the top 20
fviz_contrib(res.mca, choice ="ind", axes = 1, top = 20)
  4. Graph of individuals:
# Color by groups
# Add concentration ellipses
# Use repel = TRUE to avoid overplotting
# (the grouping variable and options below are illustrative)
grp <- as.factor(poison[, "Vomiting"])
fviz_mca_ind(res.mca, label = "none", habillage = grp,
             addEllipses = TRUE, repel = TRUE)

  5. Graph of variable categories:
fviz_mca_var(res.mca, repel = TRUE)

  6. Biplot of individuals and variables:
fviz_mca_biplot(res.mca, repel = TRUE)

Advanced methods

The factoextra R package also has functions that support the visualization of advanced methods such as Multiple Factor Analysis (MFA), Hierarchical Multiple Factor Analysis (HMFA) and Factor Analysis of Mixed Data (FAMD).

Cluster analysis and factoextra

To learn more about cluster analysis, you can refer to the book available at: Practical Guide to Cluster Analysis in R

clustering book cover

The main parts of the book include:

  • distance measures,
  • partitioning clustering,
  • hierarchical clustering,
  • cluster validation methods, as well as,
  • advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.

Partitioning clustering


# 1. Loading and preparing data
data("USArrests")
df <- scale(USArrests)
# 2. Compute and visualize k-means (k = 4 is an illustrative choice)
km.res <- kmeans(df, 4, nstart = 25)
fviz_cluster(km.res, data = df)

Hierarchical clustering


library("factoextra")
# Compute hierarchical clustering and cut into 4 clusters
res 

Determine the optimal number of clusters

# Optimal number of clusters for k-means
library("factoextra")
my_data <- scale(USArrests)
fviz_nbclust(my_data, kmeans, method = "gap_stat")

Acknowledgment

I would like to thank Fabian Mundt for his active contributions to factoextra.

We sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and more.

References

  • H. Wickham (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
  • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2016). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.5.
  • Sebastien Le, Julie Josse, Francois Husson (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1), 1-18. doi:10.18637/jss.v025.i01
  • Tal Galili (2015). dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics. doi:10.1093/bioinformatics/btv428

Infos

This analysis has been performed using R software (ver. 3.3.2) and factoextra (ver. 1.0.4.999)


