
Tuesday, May 17, 2011

Edward Tufte

Interview at The Washington Monthly

Edward Tufte occupies a revered and solitary place in the world of graphic design. Over the last three decades, he has become a kind of oracle in the growing field of data visualization—the practice of taking the sprawling, messy universe of information that makes up the quantitative backbone of everyday life and turning it into an understandable story.

In the public realm, data has never been more ubiquitous—or more valuable to those who know how to use it. “If you display information the right way, anybody can be an analyst,” Tufte once told me. “Anybody can be an investigator.”


“Tufte treats data like good writing,” he said. “You have a certain thought—how clearly and beautifully are you conveying it?”

Good design, then, is not about somehow making dull numbers magically exhilarating; it is about picking the right numbers in the first place.

Tuesday, December 28, 2010

Proofiness (or data != information)

Blog post by Seth Godin

As the number of apparently significant digits in the data available to us goes up (traffic was up .1% yesterday!) we continually seek causation, even if we're looking in the wrong places. As the amount of data we get continues to increase, we need people who can help us turn that data into information.

Proofiness is a tricky thing. Data is not information, and confusing numbers with truth can help you make some bad decisions.
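
To make Godin's point concrete, here's a quick R sketch (mine, not from his post) of how significant-looking patterns surface in pure noise once there are enough series to sift through:

  set.seed(7)
  target <- rnorm(100)                          # 100 days of pure noise
  noise  <- matrix(rnorm(100 * 200), 100, 200)  # 200 unrelated noise series
  cors   <- cor(noise, target)                  # correlate each with the target
  max(abs(cors))   # often > 0.3, despite zero real relationship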

Friday, December 3, 2010

~1,300 pitches in a 4D visualization

New York Times visualization

The first 1:40 of the video introduces some background and concepts.
The next minute is a killer 4D visualization: X, Y, Z, and time.

That's a lot of data crunched down to support some great storytelling.

Wednesday, October 27, 2010

Lessons for data visualization from Steven Johnson's "The Ghost Map"

Blog post at PeteSearch

Snow wasn't the first person to draw these kinds of maps, he wasn't the first to draw them to track disease, and in fact he wasn't even the first person to map this particular outbreak! The Sewer Commission produced a very detailed map showing the death locations. The power of Snow's version came from his decision to leave out a lot of details (sewer locations, old grave sites, etc.) that cluttered up the Commission's version. Their map was so muddled that it didn't tell a story, but Snow's was stripped down to show exactly what he needed to bolster his theory that the epidemic spread from the water pump.

As Johnson puts it in his book, "the map was a triumph of marketing as much as empirical science."

Turning data into money

Blog post at PeteSearch

Here's my hierarchy showing the stages from raw data to cold, hard cash:
  1. Data
  2. Charts
  3. Reports
  4. Recommendations
You're offering them direct ways to meet their business goals, which is incredibly valuable. This is the Nirvana of data startups: you've turned into an essential business tool that your customers know is helping them make money, so they're willing to pay a lot.

My rephrasing of Pete's post: Showing people information is usually not enough. You often have to recommend what to DO with that information.

Tuesday, August 3, 2010

Data can make ANYTHING interesting

Even fashion, which anyone who knows even a little about me would say I have extremely little interest in.

Article at the Wall Street Journal

Online retailers, in particular, see every click we make. They know which brands we've peeked at, how long we pondered, and what we actually purchased. They know the time of day and the days of the week that we shop. They know—and record—our color choices, sizes and tastes so that they can recommend clothes that are in tune with our yearnings.

Some of the data confirm regional stereotypes. Southerners bought more white, green, and pink than other regions' residents, for instance, according to data from private-sale site Hautelook.com, which caters to young, urban professional women. Now I know, too, why I feel like such a loner wearing brown in Los Angeles, where black, white and gray are preferred.

Thursday, June 24, 2010

Sergey Brin's Search for a Parkinson's Cure

Article at Wired magazine

Brin’s tolerance for “noisy data” is especially telling, since medical science tends to consider it poisonous. Biomedical researchers often limit their experiments to narrow questions that can be rigorously measured. But the emphasis on purity can mean fewer patients to study, which results in small data sets. That limits the research’s “power”—a statistical term that generally means the probability that a finding is actually true. And by design it means the data almost never turn up insights beyond what the study set out to examine.
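
For a feel for how sample size drives power in the standard sense (the probability of detecting an effect that really exists), base R's power.t.test makes the point in two lines; this is my illustration, not the article's:

  # Two-sample t-test, medium effect (delta = 0.5 sd), 5% significance:
  power.t.test(n = 20,  delta = 0.5, sd = 1)$power   # ~0.33 with 20 per group
  power.t.test(n = 100, delta = 0.5, sd = 1)$power   # ~0.94 with 100 per group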

Increasingly, though, scientists—especially those with a background in computing and information theory—are starting to wonder if that model could be inverted. Why not start with tons of data, a deluge of information, and then wade in, searching for patterns and correlations?

This is what Jim Gray, the late Microsoft researcher and computer scientist, called the fourth paradigm of science, the inevitable evolution away from hypothesis and toward patterns. Gray predicted that an “exaflood” of data would overwhelm scientists in all disciplines, unless they reconceived their notion of the scientific process and applied massive computing tools to engage with the data. “The world of science has changed,” Gray said in a 2007 speech—from now on, the data would come first.

Wednesday, June 2, 2010

What is data science?

Article at O'Reilly Radar

A well-written, dense article covering the rise of data science.

I'm not going to try to summarize the article with excerpts, but I have picked out the one portion that best describes what I do.

A picture may or may not be worth a thousand words, but a picture is certainly worth a thousand numbers. The problem with most data analysis algorithms is that they generate a set of numbers. To understand what the numbers mean, the stories they are really telling, you need to generate a graph.
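
Anscombe's quartet is the classic demonstration, and base R can reproduce it in a few lines (the anscombe data frame ships with R; this sketch is mine, not the article's):

  # Four data sets with nearly identical summary statistics and regression
  # fits that look completely different once plotted.
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    plot(x, y, pch = 19, main = paste("Set", i))
    abline(lm(y ~ x), col = "red")   # essentially the same line all four times
  }
  par(op)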

Sunday, April 25, 2010

Rise of the Data Scientist

This post at FlowingData, which is a response to the "R is an Epic Fail" post at another blog, led me to an older post at FlowingData (which cites Hal Varian's comment regarding the forthcoming sexiness of statisticians).

As someone who finds the label "Data Scientist" appealing, I firmly believe in the following:

Similarly, those who can build visualization and analysis tools are the ones who will provide the next big thing.

So don't get too upset, R programmers, or all data scientists for that matter. While the software was bashed, you're getting a thumbs up. R is not the next big thing. You are. Besides, we all know that data is the new sexy, and in the end it's not about the tools that you use, but what you do with the tools.

My take-home message: Tools don't matter. Results do.

Monday, April 12, 2010

Doing the unprecedented is overrated

Blog post by Stephen Few

...doing the unprecedented is highly overrated.

Most of what we can do to make the world a better place involves, not doing the unprecedented, but doing what matters and what works [emphasis mine], whether unprecedented or not. This might not be as exciting as the unprecedented, but it’s desperately needed. I believe that too many opportunities are wasted because we glorify the unprecedented for its own sake.

In the field of data visualization, failures are more common today than successes, not due to complexity, but to the fact that few people have been trained in the simple principles and practices of graph design. As a result, they rely on software tools to do the work for them and most of those tools lead them astray, encouraging them to produce silly, useless displays...

Here’s an example of one of the earliest quantitative graphs, hand drawn by William Playfair in 1786. In his time, Playfair did the unprecedented by inventing or greatly improving many of the quantitative graphs that we use today.

1786! I think it's pretty clear that software isn't the issue. It's taking the time to learn what the right things are, and being disciplined enough to use them every time you're presenting information. Proper dataviz shouldn't be saved for "when you have time" - it should be an inherent part of the data analysis process.

Tuesday, March 30, 2010

IBM's Smarter Planet Commercials

My favorite (the commercial is embedded as a video in the original post):

I like nothing better than a great analogy:
If you were to stand at a road, and the cars are whipping by, and all you can do is take a snapshot of the way the road looked five minutes ago...
How would you know when to cross the road?


More commercials from a post at FlowingData

Tuesday, February 23, 2010

Compressed Sensing

Article at Wired.com

Compressed sensing works something like this: You’ve got a picture — of a kidney, of the president, doesn’t matter. The picture is made of 1 million pixels. In traditional imaging, that’s a million measurements you have to make. In compressed sensing, you measure only a small fraction — say, 100,000 pixels randomly selected from various parts of the image. From that starting point there is a gigantic, effectively infinite number of ways the remaining 900,000 pixels could be filled in.

The key to finding the single correct representation is a notion called sparsity, a mathematical way of describing an image’s complexity, or lack thereof. A picture made up of a few simple, understandable elements — like solid blocks of color or wiggly lines — is sparse; a screenful of random, chaotic dots is not. It turns out that out of all the bazillion possible reconstructions, the simplest, or sparsest, image is almost always the right one or very close to it.

This question really highlighted the utility of compressed sensing:

Digital cameras, he explains, gather huge amounts of information and then compress the images. But compression, at least if CS is available, is a gigantic waste. If your camera is going to record a vast amount of data only to throw away 90 percent of it when you compress, why not just save battery power and memory and record 90 percent less data in the first place?
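
The sparsest-solution search can be sketched in a few lines of R, with lasso (L1) regression standing in for the article's real reconstruction algorithms; this assumes the glmnet package and is an illustration, not the method the researchers use:

  library(glmnet)
  set.seed(1)
  n <- 1000; m <- 100                     # 1000-sample signal, 100 measurements
  x <- numeric(n)
  x[sample(n, 10)] <- rnorm(10, sd = 5)   # a sparse signal: just 10 spikes
  A <- matrix(rnorm(m * n), m, n)         # random measurement matrix
  y <- as.vector(A %*% x)                 # the 100 measurements we actually keep
  fit  <- glmnet(A, y, intercept = FALSE) # lasso prefers sparse solutions
  xhat <- as.numeric(coef(fit, s = min(fit$lambda)))[-1]
  plot(x, type = "h", main = "Truth (black) vs. recovery (red)")
  points(xhat, col = "red", pch = 4)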

Monday, January 25, 2010

Good Graphs

Post at the Win-Vector blog

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.

Summary of advice:
  • Make important differences large enough to perceive
  • Make important shape changes large enough to perceive: Banking to 45 degrees.
  • Make sure all the data is equally well resolved.
  • If you want to analyze the difference between two processes, then graph the difference, not the processes (or graph both); see the sketch after this list.
  • If you are interested in rate of change, then graph rate of change.
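
A minimal R sketch of the difference-plot advice (my example, not Win-Vector's): two series look interchangeable when overplotted, while their difference makes a slow drift unmistakable.

  set.seed(42)
  t <- 1:200
  a <- 50 + cumsum(rnorm(200))
  b <- a + 0.05 * t + rnorm(200, sd = 0.5)          # b drifts slowly away from a
  op <- par(mfrow = c(1, 2))
  matplot(t, cbind(a, b), type = "l", lty = 1,
          ylab = "value", main = "Both processes")  # drift is easy to miss
  plot(t, b - a, type = "l",
       ylab = "b - a", main = "Their difference")   # drift is obvious
  par(op)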

Friday, November 13, 2009

R Choropleth Challenge (color-shaded maps)

Link to the original FlowingData post

There are about a million ways to make a choropleth map. You know, the maps that color regions by some metric. The problem is that a lot of solutions require expensive software or have a high learning curve...or both. What if you just want a simple map without all the GIS stuff?


Oh my, that was fast! Less than 24 hours after the Choropleth Map Challenge was laid down, no fewer than 5 hackers responded with complete solutions for plotting the US unemployment data on a color-coded map, each in less than 20 lines of R code.
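
The solutions shared a common shape; here is a rough sketch in that spirit (mine, not one of the five submissions), using only the maps package, which bundles county unemployment figures in its unemp dataset, comfortably under 20 lines:

  library(maps)
  data(unemp); data(county.fips)             # both ship with the maps package
  # bucket the unemployment rates and pick a gray shade per bucket
  buckets <- as.numeric(cut(unemp$unemp, c(0, 2, 4, 6, 8, 10, 100)))
  shades  <- gray(seq(0.9, 0.2, length.out = 6))
  # line the map polygons up with county FIPS codes
  poly.fips <- county.fips$fips[match(map("county", plot = FALSE)$names,
                                      county.fips$polyname)]
  map("county", fill = TRUE, resolution = 0, lty = 0,
      col = shades[buckets[match(poly.fips, unemp$fips)]])
  map("state", add = TRUE, lwd = 0.4)        # state outlines on top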

Wednesday, May 27, 2009

3 Skills of Sexy Data Geeks

Blog post at Dataspora

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was developed as recently as 2004).

Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” Among data geek circles, this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.

Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
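
In R, the everyday version of that slicing looks something like the following sketch, which assumes a hypothetical messy.txt with mixed delimiters and a three-column layout:

  raw  <- readLines("messy.txt")                  # hypothetical input file
  rows <- strsplit(raw, "[,;\t]+")                # normalize mixed delimiters
  # assumes each line yields the same three fields
  clean <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(clean) <- c("id", "date", "value")        # assumed column layout
  clean$value  <- as.numeric(trimws(clean$value)) # coerce, trimming whitespace
  clean <- clean[!is.na(clean$value), ]           # drop rows that won't parse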

And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like a 96-node Postgres cluster, R's snow and Rmpi packages, Hadoop and MapReduce, and Amazon EC2 to boot.

Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one has. Most of us have had exposure to the basic chart-making widgets of Excel. But a little knowledge is a dangerous thing: these software tools are often insufficient when faced with the visualization of large, multivariate data sets.

Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one.
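
The canonical exploratory one-liner is base R's scatter plot matrix, here on the built-in iris data (my example, not the post's):

  pairs(iris[1:4], col = as.integer(iris$Species), pch = 19)  # defaults, audience of one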

The second kind of data visualization is intended to communicate to a wider audience, and its goal is to visually advocate for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, these visual narratives, is again a separate skill, with separate tools.

The ability to visualize and communicate data is critical: even with good data and rigorous statistical techniques, poorly visualized results will fail to convince, whether they support an academic discovery or a business proposal.

Friday, April 10, 2009

The Power of a Sketch

Dan Roam describes the napkin sketch that inspired supply-side economics (sketch image embedded in the original post).

Wikipedia article on the Laffer curve. One key sentence: "Many economists have questioned the utility of the Laffer Curve in public discourse."

Wednesday, April 8, 2009

Sprint Commercial - The Now Network

As much as I normally dislike commercials and feel that their impact should be minimized (by not watching cable or broadcast TV, using some level of ad-blocking, and using an RSS reader instead of browsing the Web), I'll still go out of my way to highlight interesting and well-made ads such as this one.

Monday, March 23, 2009

Animated Infographics

Amazingly dense in terms of information imparted per second.

Little Red Riding Hood


Slagsmålsklubben - Sponsored by destiny from Tomas Nilsson on Vimeo.

inspired by Royksopp - Remind Me
(Sorry - unable to embed, per poster's request)

Friday, February 6, 2009

DataViz Resources

R-Specific
Quick-R: clear, simple description of R
R Resources: at Cerebral Mastication
One R Tip A Day
StatsRUs: an R Cookbook
Revolutions: news about R, statistics, etc. from REvolution Computing
R for Psychology: many R snippets, code samples
SimpleR: a short course in R

General
Flowing Data
WallStats blog
Stephen Few's blog
Ben Fry
Information Aesthetics

Graphic Design
Squidoo Lens