Have you got what it takes to be a Big Data scientist?

Before you can answer that...                                                        

The Job Description:

Given a huge volume of unstructured information (or Big Data), a data scientist looks at it from many angles, determines what it means, then recommends ways to apply the data.

Science, in Action

Starting work in 2006 at the business networking site LinkedIn, Jonathan Goldman, a Ph.D. in physics from Stanford, began forming theories, testing hunches, and finding patterns to predict whose networks a given LinkedIn profile would land in.

Reid Hoffman, LinkedIn’s cofounder and CEO at the time, gave Goldman authority to publish small modules in the form of ads on the site’s most popular pages. Using one such module, Goldman began to test what would happen if you presented users with the names of people they hadn’t yet connected with, but were likely to know.

Goldman's module was a custom ad displaying the three best-fit matches for each user, based on the background entered in their LinkedIn profile. Click-through on those ads was the highest ever seen. They soon became a standard feature.

The Skill-set

Formal training for data scientists is similar to that for traditional business or data analysts: computer science and applications, modeling, statistics, analytics and math. In addition, the data scientist needs good business sense – and the ability to communicate findings to both business and IT leaders in a way that inspires action.

A data scientist must explore and examine data from multiple disparate sources with an inquisitive eye, sifting to uncover previously hidden insights. These in turn can provide a competitive advantage, or address a pressing business problem.

Armed with data and analytical results, the scientist will then communicate informed conclusions and recommendations to management.

All this must be done with scientific rigor. A data scientist must have a fine-grained command of correlations, time series, trends, forecasts, segmentations and other patterns in order to understand the key factors at work.

Real-world experiments may also be required: making slight changes to analytical models to observe the effects, or monitoring performance metrics to determine which aspect of a model contributes most to the desired outcome.

Tools & Talent

Statistical modeling, predictive analysis and other tools provide data scientists with the pattern-finding instruments they require. Data warehousing, visualization, integration and governance tools are needed for exploring deep data.

The key thing is looking for hidden patterns. This may be accomplished through user-friendly advanced visualization tools, self-service search-driven business intelligence (BI) tools, interactive data exploration tools, and other approaches that don't necessarily demand a deep mastery of statistical analysis.

As with any discipline, data scientists also rely on people in adjacent roles to help them do their jobs effectively: subject-matter experts, business analysts, and Big Data solution providers like IBM.

In turn, the institutions that employ data scientists may be public or private sector, nonprofit or commercial. Professional associations, journals or other forums may help data scientists communicate and collaborate with their peers.

A Roadmap for Success

For data science, the Cross-Industry Standard Process for Data Mining (CRISP-DM) has been in existence since the mid-1990s. Best practices for the field may be summarized as follows:

1. Choose the analytic problem carefully.

Engage only in projects that have some clear alignment with key business imperatives, like gaining a competitive advantage or enhancing customer loyalty. 

2. Set the right subject population.

If you’re modeling customer influence patterns, for example, consider both customers who are very influential and those who may be less influential but are more susceptible to being influenced. Over- or under-representing either group in your population will skew your data model.

3. Consider your data sources.

Don't just use internal sources; external data, such as social media activity, may contain the key behavioral variables you need to build into your model.

4. Select the right data samples.

You'll have to decide whether your sample should be simple (so as not to introduce bias), or more complex, in order to reach the required level of precision without introducing skew.
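To make the trade-off concrete, here is a minimal Python sketch contrasting a simple random sample with a proportionally stratified one. The function names, the `segment` field and the customer records are all invented for illustration; a real project would likely lean on its own data tooling.

```python
import random
from collections import defaultdict

def simple_random_sample(records, n, seed=0):
    """Draw n records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)

def stratified_sample(records, key, n, seed=0):
    """Draw roughly n records, giving each stratum a share proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        # Proportional allocation, with at least one record per stratum.
        k = max(1, round(n * len(group) / len(records)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# Hypothetical population: 10 "premium" customers and 90 "standard" ones.
customers = [
    {"id": i, "segment": "premium" if i < 10 else "standard"}
    for i in range(100)
]

simple = simple_random_sample(customers, 20)
stratified = stratified_sample(customers, lambda c: c["segment"], 20)
premium_in_stratified = sum(c["segment"] == "premium" for c in stratified)
```

The simple sample may by chance contain few or no premium customers, while the stratified one keeps the 10/90 split of the population, at the cost of having to know the segments in advance.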

5. Use up-to-date data and models.

There's little confidence to be had from inconsistent or outdated data. You need to incorporate strong data and model governance into every aspect of your data-science operations.

6. Identify the best predictors.

This is the heart of data science. You'll need to explore variable-selection approaches like decision trees, clustering, association rule learning and outlier analysis.

7. Use the right modeling approach and algorithms.

With continuous variables, you’ll likely be doing regression modeling of some kind. If you’re modeling discrete or categorical variables, you might explore algorithms like neural networks, genetic algorithms and support vector machines.
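For the continuous case, the simplest form of regression modeling is a straight-line fit by ordinary least squares. The sketch below uses plain Python and made-up numbers (the `visits` and `spend` figures are assumptions, not real data); it is meant only to show the idea, not a production approach.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: monthly store visits vs. monthly spend per customer.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(visits, spend)
predicted_spend_at_6_visits = slope * 6 + intercept
```

With several predictors, or with a categorical target, you would normally reach for a statistics or machine-learning library rather than hand-rolled arithmetic.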

8. Re-validate your models at set intervals.

You may need to score your models with fresh data every month, week, day, or even hour. Choosing the correct frequency is essential if your models are to remain accurate predictors over time.

9. Check your models for fitness.

Consider validating model fitness using multiple metrics and approaches. These could include model quality scores (K-S, Gini, ROC, etc.), goodness-of-fit charts, lift charts and comparative model evaluation.
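As a rough illustration of the quality scores named above, the following Python sketch computes the area under the ROC curve, the Gini coefficient derived from it (2 × AUC − 1), and the K-S statistic for a binary model. The labels and scores are invented, and the function names are this example's own; the pairwise approach is fine for small samples but slow on large ones.

```python
def split_scores(labels, scores):
    """Separate model scores into those for positives (1) and negatives (0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return pos, neg

def roc_auc(labels, scores):
    """AUC: chance that a random positive outscores a random negative (ties count half)."""
    pos, neg = split_scores(labels, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    """Gini coefficient as commonly used in scoring: 2 * AUC - 1."""
    return 2 * roc_auc(labels, scores) - 1

def ks_statistic(labels, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = split_scores(labels, scores)
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(s <= threshold for s in pos) / len(pos)
        cdf_neg = sum(s <= threshold for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical outcomes (1 = responded) and model scores for six customers.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc_value = roc_auc(labels, scores)
gini_value = gini(labels, scores)
ks_value = ks_statistic(labels, scores)
```

Comparing these numbers on training data and on fresh data is one simple way to spot a model that no longer fits.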

10. Select appropriate visualizations.

Visualizations should guide data exploration, model development and results presentation. Choosing the right ones should help reveal significant statistical patterns.

So, does any of that sound like you?