Tessera
Open Source Environment for Deep Analysis of Large Complex Data

The Power of R with Big Data

With Tessera you can apply the thousands of methods of statistics, machine learning, and visualization implemented in the powerful R language to large data sets - without being an expert in distributed computing.

More detail on Tessera components

Get Started in Minutes

Tessera is a powerful computational environment for data large and small. Whether you install it on a single workstation or in the Amazon cloud, we've made it easy for you to get started.

Quickstart guide

Once you are up and running, see our Resources for using Tessera

Built by Data Scientists

The development of Tessera began with, and continues to be driven by, our team's work on deep statistical analysis of large complex data. Expertise on our team covers all areas of data science, from cluster hardware design for big data to theoretical statistics.

About us

Tessera Components


The Tessera computational environment is powered by a statistical approach, Divide and Recombine. At the front end, the analyst programs in R. At the back end is a distributed parallel computational environment such as Hadoop. In between are three Tessera packages: datadr, Trelliscope, and RHIPE. These packages enable the data scientist to communicate with the back end using simple R commands.


Divide and Recombine (D&R)

Tessera is powered by Divide and Recombine. In D&R, we seek meaningful ways to divide the data into subsets, apply statistical methods to each subset independently, and recombine the results of those computations in a statistically valid way. Sometimes the results are approximations of the exact result we would have obtained had we been able to process the data as a whole, but the potential slight loss in accuracy is a small penalty to pay for the simple, fast computation. In D&R, the data are parallelized, not the statistical methods. Thus we have access to the existing vast library of methods available in R.
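The D&R pattern can be illustrated in miniature with base R alone, using the built-in iris data and no Tessera packages (a conceptual sketch, not how you would do this at scale):

```r
# Divide and Recombine in miniature:
# divide the data by species, apply a model fit to each subset
# independently, then recombine the per-subset coefficients
subsets <- split(iris, iris$Species)                 # divide
fits <- lapply(subsets, function(d)                  # apply independently
  coef(lm(Petal.Length ~ Sepal.Length, data = d)))
do.call(rbind, fits)                                 # recombine
```

With datadr, the same three steps are expressed at a higher level and can run on a distributed back end.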


datadr

The datadr R package provides a high-level interface to D&R operations, making specification of divisions, analytic methods, and recombinations easy. It represents large datasets as native R objects and allows D&R operations to be run against them. In addition to division and recombination methods, datadr also provides several indispensable tools for reading and manipulating data as well as a collection of standard division-independent statistical methods. The interface is designed to be back end agnostic, so that as new distributed computing technology comes along, datadr will be able to harness it.
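As a minimal sketch of the datadr workflow on a small in-memory dataset (assuming the datadr package is installed; the function names here are from the datadr interface):

```r
library(datadr)

# divide the built-in iris data into one subset per species
bySpecies <- divide(iris, by = "Species")

# apply a summary transformation to each subset independently
meanPetal <- addTransform(bySpecies, function(x) mean(x$Petal.Length))

# recombine the per-subset results into a single data frame
recombine(meanPetal, combRbind)
```

The same divide / transform / recombine calls apply unchanged when the data live on a distributed back end.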


Trelliscope

Trelliscope is a visualization tool based on Trellis Display that enables scalable, detailed visualization of data. Trellis Display works by dividing the data into subsets; a visualization method is applied to each subset and shown on one panel of a multi-panel trellis display. This framework has proven to be a powerful mechanism for all data, large and small. Trelliscope extends Trellis Display to big data. With big data, there are often too many subsets to look at every panel in a display. Trelliscope provides an interactive viewer that allows the analyst to sample, sort, and filter the panels of a display on various quantities of interest.


RHIPE

RHIPE is the R and Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs entirely from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R operations directly through RHIPE MapReduce jobs, since MapReduce is sufficient for D&R, although in this case you are programming at a lower level than with datadr.
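As a rough sketch of what a hand-written RHIPE MapReduce job looks like (this assumes a running Hadoop cluster with RHIPE installed, and the HDFS paths are hypothetical):

```r
library(Rhipe)
rhinit()

# map: emit a (state, 1) pair for each input record
map <- expression({
  lapply(map.values, function(v) rhcollect(v$state, 1))
})

# reduce: sum the counts for each state key
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

# run the job against data already on HDFS
result <- rhwatch(map = map, reduce = reduce,
  input = "/tmp/housing", output = "/tmp/stateCounts")
```

datadr generates jobs like this for you; writing them by hand is only necessary when you need lower-level control.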


QuickStart


Install

Whether you want to try Tessera on your workstation or you are ready to dig into some big data, we have straightforward options for you to get your Tessera environment up and running.

Workstation Installation

After you have installed R, you can get going on a workstation simply by installing a few R packages. This setup restricts you to data sizes commensurate with your workstation's hardware, but it is sufficient to get a feel for Tessera and to analyze small or even moderate-sized data sets.

Once you are comfortable with the Tessera tools, you can plug into large-scale backends like Hadoop with virtually no changes to your code.
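For example, switching from a workstation back end to Hadoop is largely a matter of changing the connection passed to datadr (localDiskConn() and hdfsConn() are datadr connection functions; the paths here are hypothetical):

```r
library(datadr)
library(housingData)  # provides the housing data used in the example below

# workstation: persist the divided data to local disk
byCounty <- divide(housing, by = c("county", "state"),
  output = localDiskConn("tmp/byCounty", autoYes = TRUE))

# Hadoop: the same divide() call, pointed at HDFS instead
# (requires RHIPE and a running Hadoop cluster)
# byCounty <- divide(housing, by = c("county", "state"),
#   output = hdfsConn("/tmp/byCounty", autoYes = TRUE))
```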

To install, simply launch R, install the devtools package from CRAN, and type the following:

install.packages("devtools") # if not installed
library(devtools)
install_github("tesseradata/datadr")
install_github("tesseradata/trelliscope")

Vagrant Virtual Machine

To get a feel for running in a large-scale Tessera environment, we have provided a Vagrant setup that, with a few simple commands, allows you to provision a virtual machine running the entire Tessera stack, including:

  • R 3.1
  • Hadoop (CDH5 pseudo-distributed mode)
  • Rhipe 0.74
  • datadr
  • Trelliscope
  • RStudio Server
  • Shiny Server

The Vagrant script and instructions are available on GitHub.

Amazon Elastic MapReduce (EMR)

Setting up and installing all of the Tessera components on a bona fide cluster requires more commitment in terms of hardware, installation, configuration, and administration.

We have provided an easy way to get going with Tessera in a large-scale environment through a simple set of scripts that provision the Tessera environment on Amazon's Elastic MapReduce clusters. This allows you to spin up virtual clusters on demand. An Amazon account is required.

This environment comes with RStudio Server running on the master node, so all you need is a web browser to access RStudio, a fantastic R IDE that will be backed by your own Hadoop cluster.

The EMR scripts and instructions are available on GitHub.

Try It

Here is a simple example to get a feel for Tessera usage. Commentary about the example is available in the datadr tutorial here. For more compelling examples of Tessera in action, as well as in-depth tutorials, check out the Resources section.

# install package with housing data
devtools::install_github("hafen/housingData")
library(housingData)
library(datadr); library(trelliscope)

# look at housing data
head(housing)

# divide by county and state
byCounty <- divide(housing, 
   by = c("county", "state"), update = TRUE)

# look at summaries
summary(byCounty)

# look at overall distribution of median list price
priceQ <- drQuantile(byCounty, var = "medListPriceSqft")
xyplot(q ~ fval, data = priceQ, 
   scales = list(y = list(log = 10)))

# slope of fitted line of list price for each county
lmCoef <- function(x)
   coef(lm(medListPriceSqft ~ time, data = x))[2]
# apply lmCoef to each subset
byCountySlope <- addTransform(byCounty, lmCoef)

# look at a subset of transformed data
byCountySlope[[1]]

# recombine all slopes into a single data frame
countySlopes <- recombine(byCountySlope, combRbind)
plot(sort(countySlopes$val))

# make a time series trelliscope display
vdbConn("housingjunk/vdb", autoYes = TRUE)

# make and test panel function
timePanel <- function(x)
   xyplot(medListPriceSqft + medSoldPriceSqft ~ time,
      data = x, auto.key = TRUE, ylab = "Price / Sq. Ft.")
timePanel(byCounty[[1]][[2]])

# make and test cognostics function
priceCog <- function(x) { list(
   slope = cog(lmCoef(x), desc = "list price slope"),
   meanList = cogMean(x$medListPriceSqft),
   listRange = cogRange(x$medListPriceSqft),
   nObs = cog(length(which(!is.na(x$medListPriceSqft))), 
      desc = "number of non-NA list prices")
)}
priceCog(byCounty[[1]][[2]])

# add display with this panel and cog function to vdb
makeDisplay(byCounty,
   name = "list_sold_vs_time",
   desc = "List and sold price over time",
   panelFn = timePanel, cogFn = priceCog,
   width = 400, height = 400,
   lims = list(x = "same"))

# view the display
view()
              

You can view this and some related Trelliscope displays here.

Resources


Tutorials

The best way to get started digging deeper into Tessera is to follow the tutorials for our software components in the following order:

In addition to these tutorials, we provide analysis narratives that assume knowledge of the above tutorials and go deeper into illustrating the use of the tools in more realistic data analysis situations:

Publications

The following publications provide more detail about research relevant to Tessera, as well as illustrate the principles of D&R in various applications:

Code

Tessera is open source and the source code for its components is hosted on GitHub under the tesseradata organization.

Contributors are welcome! Feel free to fork any of our component repositories and introduce yourself on the developer's mailing list.

User Groups

If you have questions or issues, or just want to keep up with what is new, please do not hesitate to join the users mailing list. Simply send an email to tessera-users+subscribe@googlegroups.com.

You can also join our developer's mailing list by sending an email to tessera-dev+subscribe@googlegroups.com.

About Us


The Tessera project team is now 25 strong and growing. Our team consists of faculty, students, and technical staff in the Purdue University Department of Statistics, along with statistical data scientists and computer scientists at Pacific Northwest National Laboratory and Mozilla Corporation.

Tessera team members collectively cover all of the intellectual areas of data science, from cluster hardware design for big data to theoretical statistics. This breadth has been necessary for Tessera to succeed in the deep analysis of large complex data. But none of this is more important than our team's experience in deep analyses of many big datasets: that experience got us started, has driven our past work, and continues to drive us through our current analyses.
