What is something data scientists know that others don’t?

I was asked this question on Quora, and it seems like an interesting one, so here’s my best shot at answering it.

First, a little personal background. I have worked in the field of data science (it used to be called data analysis and statistical analysis) for about 35 years, for governments, universities, non-profits and businesses. That includes some consulting work, as well.

I got into the profession more or less by accident. My undergraduate degree was in geophysics, but work dried up in that field shortly after I graduated, so I upgraded my math/stats/computing education and moved into those areas instead. The transition from mainframe computers to personal computers and servers in the 1980s and 1990s opened up a lot of opportunities.

Before I knew it, my career had shifted from geophysics to data analysis, programming, and statistics. Fortunately, a degree with a focus on physics and math prepared me nicely for that sort of work. That remains true, as many people coming out of university with STEM degrees transition to data science once they need a job in the “real world”. In fact, I have a close relative with a PhD in astrophysics who is now happily doing data science for a major university, though not within the astrophysics department.

So, that’s something that data scientists know that most people don’t know – a lot of people doing “data science” moved into the field after studying different, but usually related disciplines.

What are some other things that a data scientist knows that others don’t?

First, I will dispense with the obvious things, such as the fact that a data scientist will naturally know many technical/academic matters that most people don’t know, as that is the very essence of any profession or specialization. Note that data scientist is a rather elastic term, so any given practitioner won’t necessarily be familiar with all of the areas noted below (that’s another thing that data scientists know, that others don’t).

Higher mathematics (calculus, linear algebra, optimization, etc.).
Statistical theory and methods (probability, multivariable methods such as regression, clustering, ANOVA, etc.).
Computer coding in any number of database, statistical, or general purpose languages (SQL, R, Python, SAS, SPSS, etc.).
Data science algorithms and their practical implementations (artificial neural nets, decision trees, random forests, sentiment analysis, topic modelling, etc.).
Effective visualization techniques and processes (proper graphing skills, expertise with visualization tools such as Tableau, etc.).
The ability to interpret a business or research need, liaise with subject matter experts, gather the necessary data, apply the appropriate analytical methods (at the high end that may mean new algorithm development), interpret the results correctly, and communicate those results in an understandable way to clients.
This, of course, includes good writing and presentation skills.

As with any profession, there is a wide range of niches that require different skill sets and abilities. Data science is a process that runs from such (apparently) mundane tasks as extracting and cleaning data, to mid-level “what-if” reporting, to higher-end inferential analysis and predictive modelling, to the really high-end PhD-level work such as researching breakthrough artificial intelligence algorithms.

Many people only consider the latter end of the process to be “data science”, but I see that as a mistake. A lot of jobs are at the earlier stages of the pipeline, so people considering a career in data science (in the widest sense of the term) should be aware of that. That’s true of any field – in medicine, for example, there are a lot more general practitioners than there are brain surgeons.

In addition to that, even the higher end modelling jobs require a lot of data munging or wrangling, before the fun work can begin. A modeller will generally have to get his or her hands dirty with data prep, though some of that can be handed off to less specially trained people.

So, that’s another thing that data scientists know that others don’t – there is a wide range of activities involved, which means that there is greater scope for involvement by different sorts of people than is generally recognized.

Here are a few other things that a data scientist knows that others usually don’t. Note that these are the opinions of a mid-level practicing data scientist – someone on the cutting edge of current research may not agree:

Data science techniques can be very useful for predictive purposes, but improving a model gets more and more difficult as the level of predictive accuracy desired or needed increases. More data and faster processing help, but the gains don’t necessarily scale linearly, so it can be tough to improve performance past a certain point. There’s a lot of hype, so that can be hard for people to accept (especially business people who want to leverage data science techniques to make a lot of money).
This may hold back the “AI revolution”, especially the notion of “the singularity”. Human level intelligence is probably still a long way off, though out-performing humans at specific tasks is often possible.
Artificial intelligence is not intelligent in the way that most people think of the term. A multilayer perceptron model may be trained to be very good at recognizing cats, just as a four-year-old child can be very good at the same task. But it is hard to tell just what it is that makes the model say “cat” when it sees one, even if you understand input layers, hidden layers, output layers, backpropagation, convolutions and all the rest of the jargon, as well as the detailed programming knowledge needed to implement these concepts. A four-year-old can not only recognize a cat, but she can explain her reasons for doing so (it has four legs, it has fur, it has a button nose, it has whiskers, it has padded feet, it is cute and cuddly, etc.).
AI tends to move in fits and starts, and some think we may be approaching another AI desert (more commonly called an “AI winter”). When momentum slows (as with the fitful progress on fully self-driving cars, for example), private investment money dries up and corporate research projects can wither for lack of funding. The sense that this line of research is a sure-fire way to get tenure and research money can dry up too, so university-level interest can also wane.
It is often said of AI that easy things are hard and hard things are easy. So, observing and understanding a general environment is incredibly hard for computers, while it is fairly easy for a human child. On the other hand, winning at the highest levels of games such as Go or Chess is hard for a human (even a highly intelligent adult), but relatively easy for current-day AI systems.
Some data science techniques are better at explaining than predicting, while others are better at predicting than explaining. For example, regression techniques will give good information about which variables are most important in explaining a relationship (e.g. beta coefficients, confidence intervals), while machine learning techniques won’t do so, or at least not very well. Conversely, you can throw a lot more data and a lot more input variables at a machine learning technique, so that can lead to superior predictive power, but you won’t really know why the predictive model works so well (e.g. neural net weights are hard for humans to understand).
Machine learning models don’t care about our feelings or our political attitudes. Supervised learning models will make predictions based on the data that they are fed. If those results run counter to our assumptions about the world, the models don’t care. And, if we “fix” the models by carefully selecting data to get a picture of the world that we prefer, we won’t actually get models that are useful for prediction.
Finally, all models are wrong, but some are useful. You will probably hear that a lot, if you get into the field.
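
The contrast between explaining and predicting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (the coefficients 3.0 and 0.5 are made up for the example), `fit_ols` and `knn_predict` are hypothetical helper names, and a k-nearest-neighbour average stands in for the machine learning side because it is short, not because it behaves like a neural net.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y depends strongly on x1,
# weakly on x2, plus Gaussian noise.
n = 200
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def fit_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via centred normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def knn_predict(x1, x2, y, q1, q2, k=5):
    """Predict y at (q1, q2) as the mean of the k nearest training points."""
    nearest = sorted(
        zip(x1, x2, y),
        key=lambda row: (row[0] - q1) ** 2 + (row[1] - q2) ** 2,
    )[:k]
    return sum(row[2] for row in nearest) / k


# The regression returns coefficients a person can read and compare.
b0, b1, b2 = fit_ols(x1, x2, y)
print(f"regression: y = {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")

# The k-NN model returns a prediction, but no statement of which input matters.
knn_at_centre = knn_predict(x1, x2, y, 5.0, 5.0)
print(f"k-NN prediction at (5, 5): {knn_at_centre:.2f}")
```

The regression should recover coefficients close to the 3.0 and 0.5 used to generate the data, which tells you directly that x1 matters far more than x2. The k-NN model only hands back a number for each query point; to learn which input drives it, you would have to probe it from the outside.
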
No doubt, there is a lot more that can be said, but here is something that non-data scientists know that data scientists don’t always know – don’t make a data science blog too long!

--------------------------------------------------------------------------------------------------------

So, now that you have read a bit about data science and AI, you can kick back a bit, and read some science fiction by a data scientist instead.

How about a short story about an empire of interstellar interlopers? It features one possible scenario to explain why we haven’t met ET yet (as far as we know, anyway). Only 99 cents on Amazon.

The Zoo Hypothesis or The News of the World: A Science Fiction Story

Summary

In the field known as Astrobiology, there is a research program called SETI, The Search for Extraterrestrial Intelligence. At the heart of SETI, there is a mystery known as The Great Silence, or The Fermi Paradox, named after the famous physicist Enrico Fermi. Essentially, he asked “If they exist, where are they?”.

Some quite cogent arguments maintain that if extraterrestrial intelligences existed, they should have visited the Earth by now. This story, a bit tongue in cheek, gives a fictional account of one explanation for The Great Silence, known as The Zoo Hypothesis. Are we a protected species, in a Cosmic Zoo? If so, how did this come about? Read on, for one possible solution to The Fermi Paradox.

The short story is about 6300 words, or about half an hour at typical reading speeds.

Amazon U.S.: https://www.amazon.com/dp/B076RR1PGD

Amazon U.K.: https://www.amazon.co.uk/dp/B076RR1PGD

Amazon Canada: https://www.amazon.ca/dp/B076RR1PGD

Alternatively, consider another short invasion story, this one set in the Arctic. Also 99 cents.

The Magnetic Anomaly

Summary

An attractive woman in a blue suit handed a dossier to an older man in a blue uniform.

“Give me a quick recap”, he said.

“A geophysical crew went into the Canadian north. There were some regrettable accidents among a few ex-military who had become geophysical contractors after their service in the forces. A young man and young woman went temporarily mad from the stress of seeing that. They imagined things, terrible things. But both are known to have vivid imaginations; we have childhood records to verify that. It was all very sad. That’s the official story.”

He raised an eyebrow. “And unofficially?”

“Unofficially,” she responded, “I think we just woke something up that had been asleep for a very long time.”

U.S.: http://www.amazon.com/gp/product/B0176H22B4
U.K.: http://www.amazon.co.uk/gp/product/B0176H22B4
Canada: http://www.amazon.ca/gp/product/B0176H22B4
Australia: http://www.amazon.com.au/gp/product/B0176H22B4
Germany: http://www.amazon.de/gp/product/B0176H22B4
Japan: http://www.amazon.co.jp/gp/product/B0176H22B4

Dodecahedron Books

Sunday, 30 December 2018
