Wednesday, May 27, 2009

3 Skills of Sexy Data Geeks

Blog post at Dataspora

Skill #1: Statistics (Studying). Statistics is perhaps the most important skill and the hardest to learn. It’s a deep and rigorous discipline, and one that is actively progressing (the widely used method of Least Angle Regression was only recently developed in 2004).

Skill #2: Data Munging (Suffering). The second critical skill mentioned above is “data munging.” In data geek circles, this refers to the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real-world data is messy. At best it’s inconsistently delimited or packed into an unnecessarily complex XML schema. At worst, it’s a series of scraped HTML pages or a thoroughly undocumented fixed-width format.
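To make the "inconsistently delimited" case concrete, here is a minimal Python sketch. The sample records, the three-field layout, and the `parse_line` helper are all made up for illustration; real munging usually needs many more special cases than this.

```python
import re

# Hypothetical raw records: the same three fields, delimited four different ways.
RAW_LINES = [
    "alice, 34, NYC",
    "bob;29;  SF ",
    "carol\t41\tBoston",
    "",                      # blank line that should be skipped
    "dave | n/a | Austin",   # age is missing
]

def parse_line(line):
    """Split on comma, semicolon, tab, or pipe; return a dict, or None for blank lines."""
    if not line.strip():
        return None
    fields = [f.strip() for f in re.split(r"[,;\t|]", line)]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    name, age, city = fields
    # Keep unparseable ages as None instead of guessing a value.
    return {"name": name, "age": int(age) if age.isdigit() else None, "city": city}

records = [r for r in (parse_line(line) for line in RAW_LINES) if r is not None]
```

Raising on an unexpected field count, rather than silently padding or truncating, is a deliberate choice here: it surfaces the next surprise in the data instead of hiding it.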

Related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores, using a combination of SQL, scripting languages (especially Python and its SciPy and NumPy libraries), and even several oldie-but-goodie Unix utilities (cut, join).
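As a rough illustration of slicing well-structured data with SQL from a scripting language, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `sales` table, its columns, and the `total_by_region` helper are invented for the example and do not come from any real data store.

```python
import sqlite3

# In-memory database with a small, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "widget", 120.0),
        ("east", "gadget", 80.0),
        ("west", "widget", 200.0),
        ("west", "gadget", 50.0),
        ("west", "widget", 30.0),
    ],
)

def total_by_region(conn, product):
    """Slice by product, then aggregate the amounts per region."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE product = ? GROUP BY region ORDER BY region",
        (product,),
    ).fetchall()
    return dict(rows)

widget_totals = total_by_region(conn, "widget")
```

The filtering and aggregation stay in SQL, and Python only handles the result, which is typically the easier split to maintain.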

And when data sets grow too large to manage on a single desktop, the samurai of data geeks are capable of parallelizing storage and computation with tools like 96 nodes of Postgres, snow and Rmpi, Hadoop and MapReduce, and on Amazon EC2 to boot.
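The map-and-reduce pattern behind tools like Hadoop can be sketched on a single machine. The example below is only a toy: it uses a thread pool from Python's standard library to stand in for cluster nodes, and the text chunks and function names are invented for illustration. It shows the shape of the computation, not how Hadoop itself is programmed.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input, already split into chunks (one per "node").
CHUNKS = [
    "real world data is messy",
    "data munging is painful",
    "messy data needs munging",
]

def map_chunk(text):
    """Map step: count the words within a single chunk."""
    return Counter(text.split())

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right

def word_count(chunks, workers=3):
    # Each chunk is mapped independently, so the work can be spread across workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())

totals = word_count(CHUNKS)
```

Because the map step never looks at another chunk and the reduce step is associative, the same logic could in principle be spread across many machines.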

Skill #3: Visualization (Storytelling). This third and last skill that Professor Varian refers to is the easiest to believe one has. Most of us have had exposure to basic chart-making widgets of Excel. But a little knowledge is a dangerous thing: these software tools are often insufficient when faced with the visualization of large, multivariate data sets.

Here it’s worth making a distinction between two breeds of data visualizations, which differ in their audience and their goals. The first are exploratory data visualizations (as named by John Tukey), intended to facilitate a data analyst’s understanding of the data. These may consist of scatter plot matrices and histograms, where labels and colors are minimally set by default. Their goal is to help develop a hypothesis about the data, and their audience typically numbers one.
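An exploratory graphic can be as crude as a text histogram printed to the terminal. The sketch below is one such throwaway tool in plain Python; the sample values, the bin width, and the `text_histogram` helper are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical measurements an analyst wants a quick look at.
VALUES = [1.2, 1.9, 2.1, 2.4, 2.5, 2.8, 3.1, 3.3, 3.9, 4.6, 7.8]

def text_histogram(values, bin_width=1.0):
    """Return one line per bin: the bin's lower edge, then one '#' per value."""
    bins = Counter(int(v // bin_width) for v in values)
    lines = []
    for b in range(min(bins), max(bins) + 1):
        lower_edge = b * bin_width
        # Empty bins are still shown, so gaps in the data stay visible.
        lines.append(f"{lower_edge:4.1f} | {'#' * bins[b]}")
    return lines

for line in text_histogram(VALUES):
    print(line)
```

There are no titles, colors, or axis labels, which is acceptable for an audience of one but not for the second kind of visualization.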

The second kind of data visualization is intended to communicate to a wider audience, with the goal of visually advocating for a hypothesis. While most data geeks are facile with exploratory graphics, the ability to create this second kind of visualization, the visual narrative, is again a separate skill, with separate tools.

The ability to visualize and communicate data is critical: even with good data and rigorous statistical techniques, the results of an analysis will not convince if they are poorly visualized, whether the result is an academic discovery or a business proposal.
