RSS Feed from R-bloggers.com
- weakly informative reparameterisations for location-scale mixtures
We have been working towards a revision of our reparameterisation paper for quite a while now and took advantage of Kate Lee visiting Paris this fortnight to make a final round: we have now arXived (and submitted) the new version. The major change from the earlier version is the extension of the approach to a large class of models that includes infinitely divisible distributions, compound Gaussian, Poisson, and exponential distributions, and completely monotonic densities. The concept remains identical: change the parameterisation of a mixture from a component-wise decomposition to a construct made of the first moment(s) of the distribution and of component-wise objects constrained by the moment equation(s). There is of course a bijection between both parameterisations, but the constraints appearing in the latter produce compact parameter spaces for which (different) uniform priors can be proposed. While the resulting posteriors are no longer conjugate, even conditional on the latent variables, standard Metropolis algorithms can be implemented to produce Monte Carlo approximations of these posteriors.
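To sketch the idea in the Gaussian case (my notation, which may not match the paper's precisely): for a mixture \(\sum_i p_i\, \mathcal{N}(\mu_i, \sigma_i^2)\), the reparameterisation trades the component means and variances for the global first two moments,

```latex
\mu = \sum_i p_i \mu_i, \qquad
\sigma^2 = \sum_i p_i \left( \sigma_i^2 + \mu_i^2 \right) - \mu^2,
```

together with standardised component terms \(\mu_i = \mu + \sigma\gamma_i\) and \(\sigma_i = \sigma\eta_i\). The moment equations then force \(\sum_i p_i \gamma_i = 0\) and \(\sum_i p_i (\gamma_i^2 + \eta_i^2) = 1\), so \((\gamma, \eta)\) lives on a compact set over which a uniform prior is well defined.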
Filed under: Books, pictures, R, Statistics, University life Tagged: compound Gaussian distribution, compound Poisson distribution, MCMC, Metropolis-Hastings algorithm, mixtures of distributions, Monte Carlo Statistical Methods, reparameterisation
- Diversity in the R Community
In the follow-up to the useR! conference in Stanford last year, the Women in R Task Force took the opportunity to survey the 900-or-so participants about their backgrounds, experiences and interests. With 455 responses, the recently-published results provide an interesting snapshot of the R community (or at least that subset able to travel to the US and who were able to register before the conference sold out). Among the findings (these are summaries; check the report for the detailed breakdowns):
- 33% of attendees identified as women
- 26% of attendees identified as other than White or Caucasian
- 5% of attendees identified as LGBTQ
The report also includes some interesting demographic analysis of the attendees, including the map of home-country distribution shown below, and offers recommendations for future conferences, one of which has already been implemented: the useR!2017 conference in Brussels will offer child care for the first time.
Relatedly, the Women in R Task Force has since expanded its remit to promoting other under-represented groups in the R community as well. To reflect the new focus, the task force is now called Forwards, and invites members of the R community to participate. If you have an interest in supporting diversity in race, gender, sexuality, class or disability, follow that link to get in touch.
Forwards: Mapping useRs
- The Genetic Map Comparator: a user-friendly application to display and compare genetic maps
The biological perspective
A genetic map provides the positions of genetic markers along chromosomes. Geneticists often have to visualize these maps and calculate their basic statistics (length, number of markers, gap sizes, …). When several segregating populations are studied, multiple maps that share some markers have to be dealt with. These maps are compared to highlight their overall relative strengths and weaknesses (e.g. via marker distributions or map lengths), or local marker inconsistencies.
The Genetic Map Comparator is an effective, user-friendly tool for these tasks. You can upload your genetic maps and explore them using the various sheets accessible via links at the top of the window. The app provides example datasets, which makes it easy to grasp its capabilities in less than 2 minutes, so why not have a look?
The Dataviz perspective
This app highlights a few features that make shiny powerful as an exploratory dataviz tool:
- The insertion of interactive charts (mainly using plotly) is straightforward
- CSS can be applied, which gives a nice look to the app
- It is easy to share the tool: in our case we installed it on our private shiny server, but the app is also usable from the github repository!
- It is possible to load your files into the app, and export results really easily
If you want to see how to use these features, the code is freely available on github.
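For a sense of how compactly shiny supports those features, here is a minimal sketch (illustrative only, not the app's actual source; the input names, column names, and file layout are all assumptions):

```r
library(shiny)
library(plotly)

ui <- fluidPage(
  # CSS can be applied directly to restyle the app
  tags$style(HTML("body { background-color: #f7f7f7; }")),
  fileInput("map_file", "Upload a genetic map (CSV)"),
  plotlyOutput("map_plot"),
  downloadButton("export", "Export results")
)

server <- function(input, output) {
  map_data <- reactive({
    req(input$map_file)
    read.csv(input$map_file$datapath)
  })
  # Interactive chart via plotly; assumes 'position' and 'chromosome' columns
  output$map_plot <- renderPlotly({
    plot_ly(map_data(), x = ~position, y = ~chromosome,
            type = "scatter", mode = "markers")
  })
  # Results can be exported just as easily
  output$export <- downloadHandler(
    filename = "map_summary.csv",
    content = function(file) write.csv(map_data(), file, row.names = FALSE)
  )
}

app <- shinyApp(ui, server)  # launch with runApp(app)
```

Each feature from the list above maps to a few lines: `tags$style()` for CSS, `plotlyOutput()`/`renderPlotly()` for interactivity, and `fileInput()`/`downloadHandler()` for import and export.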
Here is a screenshot of the main tabs of the app:
- Two new online R courses on time series (via DataCamp)
DataCamp recently launched two new online R courses on time series analysis.
What You’ll Learn:
- Chapter One: Exploratory Time Series Data Analysis (FREE)
Learn how to organize and visualize time series data in R.
- Chapter Two: Predicting the Future
Conduct trend spotting, learn the white noise model, the random walk model, and the definition of stationary processes.
- Chapter Three: Correlation Analysis and the Autocorrelation Function
Review the correlation coefficient, then practice estimating and visualizing autocorrelations for time series data.
- Chapter Four: Autoregression
Discover the autoregressive model and several of its basic properties.
- Chapter Five: A Simple Moving Average
Learn about the simple moving average model, then compare the performance of several models.
What You’ll Learn:
- Chapter One: Time Series Data and Models
Investigate time series data and learn the basics of ARMA models, which can explain the behavior of such data.
- Chapter Two: Fitting ARMA Models
Discover the wonderful world of ARMA models and learn how to fit these models to time series data.
- Chapter Three: ARIMA Models
Learn about integrated ARMA (ARIMA) models for nonstationary time series.
- Chapter Four: Seasonal ARIMA
Learn how to fit and forecast seasonal time series data using seasonal ARIMA models.
- Reproducible Finance with R: Sector Correlations
by Jonathan Regenstein
Welcome to the first installment of reproducible finance for 2017. It’s a new year, a new President takes office soon, and we could be entering a new political-economic environment. What better time to think about a popular topic over the last few years: equity correlations. Elevated correlations are important for several reasons – life is hard for active managers and diversification gains are vanishing – but I personally enjoy thinking about them more from an inference or data exploration perspective. Are changing correlations telling us something about the world? Are sectors diverging? How much can be attributed to the Central Bank regime at hand? So many questions, so many hypotheses to be explored. Let’s get started.
Today, we will build a Notebook and start exploring the historical rolling correlations between sector ETFs and the S&P 500. That is, we want to explore how equity returns in different sectors have been correlated with the returns of the broader index. Perhaps they are all moving in lockstep, perhaps they have been diverging. Either way, this Notebook will be the first step toward a flexdashboard that lets us do more interactive exploration – choosing different sector ETFs and rolling windows.
We are going to accomplish a few things today. We will load up the sector ETF tickers, then build a function to download their price history and calculate weekly returns. We will save this to one xts object. Next, we will build a function to calculate the rolling correlations between a chosen sector ETF and the S&P 500. Finally, dygraphs will make its usual appearance to help visualize the rolling correlation time series.
As usual, we will be living in the R Markdown world and, by way of disclaimer, the data import and return calculation functions here should be familiar from previous posts. That is by design, and hopefully it won’t be too boring for devotees of this series (I know you’re out there somewhere!). More importantly, I hope the usefulness of reproducible, reusable code is emerging. Some of the code chunks in previous posts might have seemed trivially simple, containing just a simple function and little else. But, the simplicity of those code chunks made it very easy to return to those previous scripts, understand the functions, and use them in this post.
Let’s load up a few packages.
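The original post loads its packages in a code chunk that isn't reproduced here; judging from the functions used below (getSymbols(), rollapply(), dygraph()), the setup is presumably along these lines:

```r
library(quantmod)  # getSymbols() and periodReturn()
library(zoo)       # rollapply() for rolling-window calculations
library(dygraphs)  # interactive time series charts
library(dplyr)     # general data wrangling
```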
Now, we need the tickers and sectors for the sector ETFs. They are copied below and available here. I deleted the XLRE real estate ETF because it’s only been around since 2015, and I want to look back several years in this Notebook.
We’ve got our dataframe of tickers and sectors. Let’s build a function to download price history and then convert those price histories to weekly returns. We’ll use a combination of getSymbols() and periodReturn() to accomplish that. If you want to change this script to use daily returns, change the argument below to period = ‘daily’, but be prepared to import quite a bit more data.
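A sketch of what such a function might look like (function and object names here are my own, not necessarily the post's; downloading requires a network connection):

```r
# Download a ticker's price history and convert it to weekly log returns;
# switch period to "daily" for daily returns (expect far more data)
get_weekly_returns <- function(ticker) {
  prices <- getSymbols(ticker, src = "yahoo", auto.assign = FALSE)
  periodReturn(prices, period = "weekly", type = "log")
}

# Merge one column of weekly returns per ticker into a single xts object
# (assumes a data frame `etf_tickers` with a `ticker` column)
etf_returns <- do.call(merge, lapply(etf_tickers$ticker, get_weekly_returns))
colnames(etf_returns) <- etf_tickers$ticker
```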
A pattern seems to be emerging in these Notebooks: grab tickers, get price history, convert to returns and save new xts object. In an ideal world, that pattern of data import and conversion would be so familiar as to be commonplace.
That said, enough with the commonplace stuff – let’s get on to something a little more dangerous: rolling correlations among ETF returns. Correlations are important because high correlations make it hard to find diversification opportunities and they make it hard to deliver alpha – though I suppose it’s always hard to deliver alpha. Fortunately, we don’t have to worry about generating alpha today so let’s get to our function.
Calculating rolling correlations in R is pretty straightforward. We use the rollapply() function, along with the cor() function, pass in our data and a time window, and it’s off to the races. We’ll create our own function below to handle these jobs and return an xts object.
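Sketched with zoo objects and illustrative names (the post's actual function may differ in detail), the pattern looks like this:

```r
library(zoo)

# Rolling correlation between a sector's returns and the index's returns,
# packaged with both input series into one object
rolling_sector_cor <- function(sector_returns, index_returns, window) {
  merged <- merge.zoo(sector_returns, index_returns)
  roll_cor <- rollapply(merged, width = window,
                        FUN = function(z) cor(z[, 1], z[, 2]),
                        by.column = FALSE, align = "right")
  result <- merge.zoo(merged, roll_cor)
  colnames(result) <- c("sector", "index", "rolling_cor")
  result
}
```

With a 20-week window, the first 19 rows of the rolling_cor column are NA and each subsequent row is the correlation over the trailing window.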
Notice that this function does something that seems unnecessary: it creates a new xts object that holds the sector returns, SPY returns and the rolling correlation. We don’t have much use for that separate object, and could probably have just added columns to our original xts object. Indeed, if this were our final product we might spend more time eliminating its presence. I choose not to do that here for two reasons. First, this Notebook is built to underlie a flexdashboard that could go into production. I want to get the logic right here, then focus more on efficiency in the final app.
Second, and relatedly, we are prioritizing clarity of workflow in this Notebook. It should be crystal clear how we are moving from an xts object of ETF returns to creating a new xts object of two returns plus one correlation. The goal is for any collaborators, including my future self, to open this Notebook and see the workflow. If that collaborator finds this step to be unnecessary and has a more clever solution – that’s fantastic because it means this document is intelligible enough to serve as the basis for more sophisticated work.
Let’s go ahead and use this function. We will pass in a time series of Information Technology ETF returns and a window of size 20 for the rolling correlation.
Alright, the function seems to have succeeded in building that new xts object and storing the rolling correlation. Now we will use dygraphs to visualize this rolling correlation over time and see if anything jumps out as interesting or puzzling.
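The dygraphs call itself is a one-liner plus options; here is a self-contained illustration with synthetic data standing in for the real rolling-correlation series (names are illustrative):

```r
library(xts)
library(dygraphs)

# Synthetic stand-in for the rolling-correlation series
dates <- seq(as.Date("2010-01-08"), by = "week", length.out = 100)
roll_cor <- xts(runif(100, 0.5, 1), order.by = dates)

d <- dygraph(roll_cor, main = "Rolling Correlation: Tech ETF vs. S&P 500") %>%
  dyAxis("y", valueRange = c(-1, 1.01)) %>%
  dyRangeSelector()
```

Fixing the y-axis to the full [-1, 1] range keeps small wiggles in a high correlation from looking dramatic.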
The correlation between the Tech ETF and the S&P 500 ETF seems quite high. It dipped a bit in the middle of 2009 and again towards the end of 2013. It would be interesting to see if this was true of the other sector ETFs as well. In other words, were these periods of generally declining correlations, or was it limited to the technology/S&P 500 relationship?
The best way to do some exploratory analysis on that is, no surprise, to build a shiny app that allows users to choose their own sectors and rolling windows. We’ll do that next time – see you in a few days!
- GitHub Growth Appears Scale Free
In 2013, a blogger claimed that the growth of GitHub (GH) users follows a certain type of diffusion model called Bass diffusion. Growth here refers to the number of unique user IDs as a function of time, not the number project repositories, which can have a high degree of multiplicity.
In a response, I tweeted a plot that suggested GH growth might be following a power law, aka scale free growth. The tell-tale sign is the asymptotic linearity of the growth data on double-log axes, which the original blog post did not discuss. The periods on the x-axis correspond to years, with the first period representing calendar year 2008 and the fifth period being the year 2012.
Scale free networks can arise from preferential attachment to super-nodes that have a higher vertex degree and therefore more connections to other nodes, i.e., a kind of rich-get-richer effect. The same applies to GH growth viewed as a particular kind of social network: the interaction between software developers using GH can be thought of as involving super-nodes corresponding to influential users, who prompt prospective GH users to open a new account and contribute to their projects.
On this basis, I predicted GH would reach 4 million users during October 2013 and 5 million users during March 2014 (yellow points in the Linear axes plot below). In fact, GH reached those values slightly earlier than predicted by the power law model, and slightly later than the dates predicted by the diffusion model.
Since 2013, new data has been reported, so I extended my previous analysis in R. Details of the respective models are contained in the R script at the end of this post. In the Linear axes plot, the diffusion model and the power law model essentially form an envelope around the newer data: diffusive on the upper side (red curve) and power law on the lower side (blue curve). In this sense, it could be argued that the jury is still out on which model offers the more reliable predictions.
However, there is an aspect of the diffusion model that was overlooked in 2013. It predicts that GH growth will eventually plateau at 20 million users in 2020 (the 12th period, not shown). The beginnings of this leveling off are apparent in the 10th period (i.e., 2017). By contrast, the power law model predicts that GH will reach 23.65 million users by the end of the same period (yellow point). Whereas the diffusion and power law models respectively represent the upper and lower edges of an envelope surrounding the more recent data in periods 6–9, their predictions will start to diverge in the 10th period.
“GitHub is not the only player in the market. Other companies like GitLab are doing a good job but GitHub has a huge head start and the advantage of the network effect around public repositories. Although GitHub’s network effect is weaker compared to the likes of Facebook/Twitter or Lyft/Uber, they are the default choice right now.” —GitHub is Doing Much Better Than Bloomberg Thinks
Although there will inevitably be an equilibrium bound on the number of active GH users, it seems unlikely to be as small as 20 million, given the combination of GH’s first-mover advantage and its current popularity. Presumably the private investors in GH also hope it will be a large number. This year will tell.
# Data source ... https://classic.scraperwiki.com/scrapers/github_users_each_year/

# LINEAR axes plot
plot(df.gh3$index, df.gh3$users, xlab="Period (years)",
     ylab="Users (million)", col="gray",
     ylim=c(0, 3e7), xaxt="n", yaxt="n")
axis(side=1, tck=1, at=c(0, seq(12, 120, 12)), labels=0:10)
axis(side=2, tck=1, at=c(0, 10e6, 20e6, 30e6), labels=c(0, 10, 20, 30))

# Simple exp model
curve(coef(gh.exp)[1] * exp(coef(gh.exp)[2] * (x/13)),
      from=1, to=108, add=TRUE, col="red2", lty="dotdash")

# Super-exp model
curve(49100 * (x/13) * exp(0.54 * (x/13)),
      from=1, to=120, add=TRUE, col="red", lty="dashed")

# Bass diffusion model
curve(21e6 * (1 - exp(-(0.003 + 0.83) * (x/13))) /
        (1 + (0.83 / 0.003) * exp(-(0.003 + 0.83) * (x/13))),
      from=1, to=120, add=TRUE, col="red")

# Power law model
curve(10^coef(gh.fit)[1] * (x/13)^coef(gh.fit)[2],
      from=1, to=120, add=TRUE, col="blue")

title(main="Linear axes: GitHub Growth 2008-2017")
legend("topleft",  # legend position assumed; the original call was truncated
       legend=c("Original data", "New data", "Predictions", "Exponential",
                "Super exp", "Bass diffusion", "Scale free"),
       col=c("gray", "black", "yellow", "red", "red", "red", "blue"),
       pt.bg=c(NA, NA, "yellow", NA, NA, NA, NA),
       pch=c(1, 1, 21, NA, NA, NA, NA),
       lty=c(NA, NA, NA, "dotdash", "dashed", "solid", "solid"))
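The script references fitted objects gh.exp and gh.fit without showing how they were produced. Under the models the post describes, the power-law fit was presumably a linear regression on double-log axes, roughly like this (synthetic data stands in for the scraped user counts):

```r
# Synthetic stand-in for df.gh3: index in months, users following a power law
df.gh3 <- data.frame(index = 1:108)
df.gh3$users <- 3e6 * (df.gh3$index / 13)^1.8

# Power law ("scale free") model: linear on double-log axes, so that
# 10^coef(gh.fit)[1] * t^coef(gh.fit)[2] recovers the plotted curve
gh.fit <- lm(log10(users) ~ log10(index / 13), data = df.gh3)

# The exponential model gh.exp would be fit analogously with nonlinear
# least squares, e.g.:
#   gh.exp <- nls(users ~ a * exp(b * index / 13), data = df.gh3,
#                 start = list(a = 1e6, b = 0.5))
```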
- Workout Wednesday Redux (2017 Week 3)
I had started a “52 Vis” initiative back in 2016 to encourage folks to get practice making visualizations, since that’s the only way to get better at virtually anything. Life got crazy, 52 Vis fell by the wayside and now there are more visible alternatives such as Makeover Monday and Workout Wednesday. They’re geared towards the “T” crowd (I’m not giving a closed source and locked-in-data product any more marketing than two links) but that doesn’t mean R, Python or other open-tool/open-data communities can’t join in for the ride and learning experience.
This week’s workout is a challenge to reproduce or improve upon a chart by Matt Stiles. You should go to both (give them the clicks and eyeballs they both deserve since they did great work). They both chose a line chart, but the whole point of these exercises is to try out new things to help you learn how to communicate better. I chose to use geom_segment() to make mini-column charts since that:
- eliminates the giant rose-coloured rectangles that end up everywhere
- helps show the differences a bit better (IMO), and
- also helps highlight some of the states that have had more difficulties than others
Click/tap to “embiggen”. I kept the same dimensions that Andy did but unlike Matt’s creation this is a plain ol’ PNG as I didn’t want to deal with web fonts (I’m on a Museo Sans Condensed kick at the moment but don’t have it in my TypeKit config yet). I went with official annual unemployment numbers as they may be calculated/adjusted differently (I didn’t check, but I knew that data source existed, so I used it).
One reason I’m doing this is a quote on the Workout Wednesday post:
This will be a very tedious exercise. To provide some context, this took me 2-3 hours to create. Don’t get discouraged and don’t feel like you have to do it all in one sitting. Basically, try to make yours look identical to mine.
This took me 10 minutes to create in R:
#' ---
#' output:
#'   html_document:
#'     keep_md: true
#' ---
#+ message=FALSE
library(ggplot2)
library(hrbrmisc)
library(readxl)
library(tidyverse)

# Use official BLS annual unemployment data vs manually calculating the average
# Source: https://data.bls.gov/timeseries/LNU04000000?years_option=all_years&periods_option=specific_periods&periods=Annual+Data
read_excel("~/Data/annual.xlsx", skip=10) %>%
  mutate(Year=as.character(as.integer(Year)),
         Annual=Annual/100) -> annual_rate

# The data source Andy Kriebel curated for you/us: https://1drv.ms/x/s!AhZVJtXF2-tD1UVEK7gYn2vN5Hxn # ty Andy!
read_excel("~/Data/staadata.xlsx") %>%
  left_join(annual_rate) %>%
  filter(State != "District of Columbia") %>%
  mutate(
    year = as.Date(sprintf("%s-01-01", Year)),
    pct = (Unemployed / `Civilian Labor Force Population`),
    us_diff = -(Annual-pct),
    col = ifelse(us_diff<0, "Better than U.S. National Average",
                 "Worse than U.S. National Average")
  ) -> df

credits <- "Notes: Excludes the District of Columbia. 2016 figure represents October rate.\nData: U.S. Bureau of Labor Statistics <https://www.bls.gov/lau/staadata.txt>\nCredit: Matt Stiles/The Daily Viz <thedailyviz.com>"

#+ state_of_us, fig.height=21.5, fig.width=8.75, fig.retina=2
ggplot(df, aes(year, us_diff, group=State)) +
  geom_segment(aes(xend=year, yend=0, color=col), size=0.5) +
  scale_x_date(expand=c(0,0), date_labels="'%y") +
  scale_y_continuous(expand=c(0,0), label=scales::percent, limit=c(-0.09, 0.09)) +
  scale_color_manual(name=NULL, expand=c(0,0),
                     values=c(`Better than U.S. National Average`="#4575b4",
                              `Worse than U.S. National Average`="#d73027")) +
  facet_wrap(~State, ncol=5, scales="free_x") +
  labs(x=NULL, y=NULL,
       title="The State of U.S. Jobs: 1976-2016",
       subtitle="Percentage points below or above the national unemployment rate, by state. Negative values represent unemployment rates\nthat were lower — or better, from a jobs perspective — than the national rate.",
       caption=credits) +
  theme_hrbrmstr_msc(grid="Y", strip_text_size=9) +
  theme(panel.background=element_rect(color="#00000000", fill="#f0f0f055")) +
  theme(panel.spacing=unit(0.5, "lines")) +
  theme(plot.subtitle=element_text(family="MuseoSansCond-300")) +
  theme(legend.position="top")
Change the paths that begin with ~/Data to wherever you stored the files.
The “weird”-looking comments enable me to spin the script and are pretty much just the inverse markup of knitr R Markdown documents. As the comments say, you should really thank Andy for curating the BLS data for you/us.
If I really didn’t pine over aesthetics it would have taken me 5 minutes (most of that was waiting for re-rendering). Formatting the blog post took much longer. Plus, I can update the data source and re-run this in the future without clicking anything. This re-emphasizes a caution I tell my students: beware of dragon droppings (“drag-and-drop data science/visualization tools”).
Hopefully you already follow or will start following Workout Wednesday and Makeover Monday and will dedicate some time to honing your skills with those visualization katas.
- Elements of a successful #openscience #rstats workshop
What makes an open science workshop effective or successful*?
Over the last 15 years, I have had the good fortune to participate in workshops as a student and sometimes as an instructor. They have consistently offered beneficial discovery experiences, and at times some of the processes highlighted have been transformative. Last year, I participated in Software Carpentry at UCSB and Software Carpentry at YorkU, and in the past I attended (in part) workshops such as Open Science for Synthesis. Several of us are now deciding what to attend as students in 2017. I have been wondering about the potential efficacy of the workshop model and why workshops seem so relatively effective. I propose that the answer is expectations. Here is a set of brief lists of observations from workshops that lead me to this conclusion.
*Note: I define a workshop as effective or successful when it provides me with something practical that I did not have before the workshop. Practical outcomes can include tools, ideas, workflows, insights, or novel viewpoints from discussion. Anything that helps me do better open science. Efficacy for me is relative to learning by myself (i.e. through reading, watching webinars, or struggling with code or data), asking for help from others, taking an online course (that I always give up on), or attending a scientific conference.
Delivery elements of an open science training workshop
- Q & A sessions
- Hands-on exercises
- Webinars or group viewing of recorded vignettes
Summary of expectations from this list: a workshop will offer me content in more than one way, unlike a more traditional course offering. I can ask questions right there on the spot about content and get an answer.
Content elements of an open science training workshop
- Data and code
- Slide decks
- Advanced discussion
- Experts that can address basic and advanced queries
- A curated list of additional resources
- Opinions from the experts on the ‘best’ way to do something
- A list of problems or questions that need to be addressed or solved, both routinely and in specific contexts, when doing science
- A toolkit in some form associated with the specific focus of the workshop.
Summary of expectations from this list: the best, most useful content is curated. It is contemporary, and it would be a challenge for me to find this out on my own.
Pedagogical elements of an open science training workshop
- Organized to reflect authentic challenges
- Uses problem-based learning
- Content is very contemporary
- Very light on lecture and heavy on practical application
- Reasonably small groups
- Will include team science and networks to learn and solve problems
- Short duration, high intensity
- Will use an open science tool for discussion and collective note taking
- Will be organized by major concepts such as data & meta-data, workflows, code, data repositories OR will be organized around a central problem or theme, and we will work together through the steps to solve a problem
- There will be a specific, quantifiable outcome for the participants (i.e. we will learn how to do or use a specific set of tools for future work).
Summary of expectations from this list: the training and learning experience will emulate a scientific working group that has convened to solve a problem. Here the problem is how we can all get better at doing a certain set of scientific activities, rather than, for instance, whether a group can aggregate and summarize a global alpine dataset. These collaborative problem-solving models need not be exclusive.
Higher-order expectations that summarize all these open science workshop elements
- Experts, curated content, and contemporary tools.
- Everyone is focussed exclusively on the workshop, i.e. we all try to put our lives on hold to teach and learn together rapidly for a short time.
- Experiences are authentic and focus on problem solving.
- I will have to work trying things, but the slope of the learning curve/climb will be mediated by the workshop process.
- There will be some, but not too much, lecturing to give me the big picture highlights of why I need to know/use a specific concept or tool.