Optimization matchup: R’s glpkAPI vs Julia’s JuMP

tl;dr: although I use R every day and love it, doing mathematical programming in Julia is much simpler and more flexible than anything I know of that is currently available in R.

Recently I have learned that Iain Dunning, Joey Huchette, and Miles Lubin have received the 2016 INFORMS Computing Society Prize for the development of JuMP, a Julia-based domain-specific modeling language for mathematical optimization. Congratulations!

Together with Wit Jakuczun, we have decided to test-drive Julia's JuMP against R's glpkAPI package.

The problem we decided to solve is a standard MIP model for finding clusters in data using the k-medoids method (we have used this specification of the model, without relaxation). The work breakdown was that Wit wrote the solution in R and I developed the Julia code.
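
For reference, the model that both implementations below solve can be summarised as follows (a sketch reconstructed from the code, where x[i,j] equals 1 when point i is assigned to medoid j, y[j] equals 1 when point j is selected as a medoid, and rho[i,j] is the Euclidean distance between points i and j):

\begin{aligned}
\min\ & \sum_{i=1}^{N} \sum_{j=1}^{N} \rho_{ij}\, x_{ij} \\
\text{s.t.}\ & \sum_{j=1}^{N} x_{ij} = 1, \quad i = 1, \dots, N \\
& \sum_{j=1}^{N} y_j = K \\
& x_{ij} \le y_j, \quad i, j = 1, \dots, N \\
& x_{ij},\ y_j \in \{0, 1\}
\end{aligned}

(The GMPL model used on the R side additionally carries an aggregated constraint, sum_i x[i,j] <= N * y[j], which is implied by the disaggregated constraint above.)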

Data preparation

The data set used for the analysis can be generated using this Julia code:

function gendata(N, K, scaling)
    # N noisy points in 2D; point i is shifted towards one of K equally spaced
    # cluster centres on a circle of radius scaling / (2 sin(π/K)),
    # so that adjacent centres are scaling apart
    x = randn(N, 2)
    for i in 1:N
        α = 2π * i / K
        x[i, :] += scaling * [cos(α); sin(α)] / (2 * sin(π/K))
    end
    return x
end

srand(1)
d = gendata(200, 4, 5)
writecsv("test.txt", d)

or this equivalent R code:

gen_data <- function(N, K, scaling) {
    alpha <- 2 * pi * (1:N) / K
    sin_pi_K <- 2 * sin(pi / K)

    X <- as.data.frame(matrix(data=rnorm(n=2*N), nrow=N, ncol=2) +
                       scaling*matrix(data = c(cos(alpha), sin(alpha)),
                                      nrow = N, ncol = 2) / sin_pi_K)
    return(data.frame(x = X$V1, y = X$V2))
}

set.seed(1)
write.table(x = gen_data(200, 4, 5), file = "test.txt",
            col.names = TRUE, row.names = FALSE,
            sep = ",", dec = ".")

(The Julia code below assumes that the data set was generated using Julia, and the R code assumes it was generated in R; the only difference is that the file written by R has a header.)

Comparing the above snippets (although they are not the main objective of the exercise) highlights two things about Julia: 1) you can use Unicode identifiers such as α and π in your code, and 2) writing an explicit loop is perfectly fine (and fast). In this case both points are of course a matter of taste, but I find that they make the code a bit more readable.

The end result is 200 points placed in the 2D plane, forming four clusters.

Now we move to the actual challenge: write code that runs the k-medoids model for the number of clusters ranging from 2 to 6 and compare the results.

Here is the solution in Julia:

using JuMP
using GLPKMathProgInterface
using Distances

function find_clusters(K, ds)
    N = size(ds, 1)
    rho = pairwise(Euclidean(), ds')       # N×N matrix of pairwise distances

    m = Model(solver=GLPKSolverMIP())
    @variable(m, y[1:N], Bin)              # y[j] == 1 if point j is a medoid
    @variable(m, x[1:N,1:N], Bin)          # x[i,j] == 1 if point i is assigned to medoid j

    @objective(m, Min, sum{x[i,j]*rho[i,j], i=1:N, j=1:N})

    @constraint(m, sum(y) == K)            # exactly K medoids
    for i in 1:N
        @constraint(m, sum(x[i,:]) == 1)   # each point is assigned to exactly one medoid
    end
    for i in 1:N, j in 1:N
        @constraint(m, x[i,j] <= y[j])     # points may only be assigned to selected medoids
    end

    solve(m)
    return getvalue(x), getvalue(y), getobjectivevalue(m)
end

function exec()
    ds = readcsv("test.txt")
    N = size(ds, 1)
    for K in 6:-1:2
        println("--- K=$K ---")
        @time x, y, obj = find_clusters(K, ds)
        println("Objective: $obj")
        println("Centers:")
        for i in 1:N
            y[i] > 0.0 && println("\t$i:\t", ds[i,:])
        end
    end
end

exec()

and here is the R code:

library(glpkAPI)

find_clusters <- function(K, ds) {
    N <- nrow(ds)
    
    rho <- cbind(expand.grid(i = 1:N, j = 1:N),
        dist = as.numeric(as.matrix(dist(ds, upper = TRUE, diag = TRUE))))
    write.table(x = rho, file = "dist.csv",
                sep = ",", dec = ".",
                col.names = TRUE, row.names = FALSE)

    # write the GMPL data file with the model parameters
    cat("data;\n", file = "kmedoids.dat", append = FALSE)
    cat(sprintf("param pN := %s;\n", N), file = "kmedoids.dat", append = TRUE)
    cat(sprintf("param pK := %s;\n", K), file = "kmedoids.dat", append = TRUE)
    cat("end;", file = "kmedoids.dat", append = TRUE)

    # read the GMPL model and data, build the MIP, and solve it silently
    wk <- mplAllocWkspGLPK()
    termOutGLPK(GLP_OFF)
    model <- mplReadModelGLPK(wk, "kmedoids.mod", TRUE)
    data <- mplReadDataGLPK(wk, "kmedoids.dat")
    mplGenerateGLPK(wk)
    lp <- initProbGLPK()
    mplBuildProbGLPK(wk, lp)
    scaleProbGLPK(lp, GLP_SF_AUTO)
    setDefaultMIPParmGLPK()
    setSimplexParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    setMIPParmGLPK(parm = c(MSG_LEV), val = c(GLP_MSG_OFF))
    solveSimplexGLPK(lp)
    solveMIPGLPK(lp)
    
    obj <- mipObjValGLPK(lp)
    all_values <- mipColsValGLPK(lp)
    all_names <- rep(NA_character_, length(all_values))
    for (i in seq_along(all_values)) {
        all_names[i] <- getColNameGLPK(lp, i)
    }
    sol <- data.frame(var = all_names, val = all_values)

    return(list(
      obj = obj,
      y = sol[grep(pattern = "vSegmentCenter", x = all_names), ],
      x = sol[grep(pattern = "vPointToSegment", x = all_names), ]))
}


ds <- read.table(file = "test.txt", sep = ",",
                 dec = ".", header = TRUE)
N <- nrow(ds)

for (K in seq(6, 2)) {
  print(sprintf("--- K = %s ---", K))
  start_time <- Sys.time()
  sol <- find_clusters(K, ds)
  calc_duration <- Sys.time() - start_time
  print(sprintf("    %s sec", calc_duration))
  print(sprintf("Objective: %s", sol$obj))
  print("Centers:")
  print(ds[sol$y$val > 0, ])
}

which requires the auxiliary kmedoids.mod file (the pairwise distances are passed to it through dist.csv, written by the R code above and read by the table statement in the model):

param pN;
param pK;
set pPoints := 1..pN;
set pSegments := 1..pK;

param pDist{i in pPoints, j in pPoints} >= 0;

table T_dist IN "CSV" "dist.csv":
[i,j], pDist ~ dist;

var vPointToSegment{i in pPoints, j in pPoints}, binary;
var vSegmentCenter{i in pPoints}, binary;

minimize total_cost: sum{i in pPoints, j in pPoints} (vPointToSegment[i,j] * pDist[i,j]);

subject to

cPointToOnlyOneSegment{i in pPoints}:
sum{j in pPoints} vPointToSegment[i,j] = 1;

cSegmentsCnt:
sum{i in pPoints} vSegmentCenter[i] = pK;

cPoinOnlyToActiveSegment_1{i in pPoints, j in pPoints}:
vPointToSegment[i, j] <= vSegmentCenter[j];

cPoinOnlyToActiveSegment_2{j in pPoints}:
sum{i in pPoints} vPointToSegment[i, j] <= card(pPoints)*vSegmentCenter[j];

end;

I do not want to go through the code in detail (please judge it yourself), but in my opinion two things are immediately apparent:

  1. the ability to specify the model directly within Julia beats glpkAPI hands down: with glpkAPI you have to use an external file to specify the model and external files to pass the data to it;
  2. in Julia, if we want to switch the solver from GLPK to any other (e.g. CBC), the only line we would have to change is m = Model(solver=GLPKSolverMIP()). In R you have bindings to other solvers, but they are much more low-level and, most importantly, switching would require rewriting most of the code.
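
For example, switching the Julia model above to CBC would look roughly like this (a sketch, assuming the Cbc.jl package of the same JuMP/MathProgBase generation as the code above is installed; everything else in find_clusters stays unchanged):

using JuMP
using Cbc                        # assumption: Cbc.jl, which wraps the open-source CBC MIP solver

# only the solver back-end changes; variables, objective and constraints stay as before
m = Model(solver=CbcSolver())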
