What is something data scientists know that others don’t?
I was asked this question on Quora, and it seems like an interesting one, so here’s my best shot at answering it.
First, a little personal background. I have worked in the field of data science (it used to be called data analysis and statistical analysis) for about 35 years, for governments, universities, non-profits and businesses. That includes some consulting work, as well.
I got into the profession more or less by accident – my undergraduate degree was in geophysics, but work dried up in that field shortly after I graduated, so I upgraded my math/stats/computing education and moved into those areas instead, as the transition from mainframe computers to personal computers and servers in the 1980s and 1990s opened up a lot of opportunities.
Before I knew it, my career had shifted from geophysics to data analysis, programming, and statistics. Fortunately, a degree with a focus on physics and math prepared me nicely for that sort of work. That remains true, as many people coming out of university with STEM degrees transition to data science once they need a job in the “real world”. In fact, I have a close relative with a PhD in astrophysics who is now happily doing data science for a major university, though not within the astrophysics department.
So, that’s something that data scientists know that most people don’t know – a lot of people doing “data science” moved into the field after studying different, but usually related, disciplines.
What are some other things that a data scientist knows that others don’t?
First, I will dispense with the obvious things, such as the fact that a data scientist will naturally know many technical/academic matters that most people don’t know, as that is the very essence of any profession or specialization. Note that data scientist is a rather elastic term, so any given practitioner won’t necessarily be familiar with all of the areas noted below (that’s another thing that data scientists know that others don’t).
- Higher mathematics (calculus, linear algebra, optimization, etc.).
- Statistical theory and methods (probability, multivariable methods such as regression, clustering, ANOVA, etc.).
- Computer coding in any number of database, statistical, or general purpose languages (SQL, R, Python, SAS, SPSS, etc.).
- Data science algorithms and their practical implementations (artificial neural nets, decision trees, random forests, sentiment analysis, topic modelling, etc.).
- Effective visualization techniques and processes (proper graphing skills, expertise with visualization tools such as Tableau, etc.).
- The ability to interpret a business or research need, liaise with subject matter experts, gather the necessary data, apply the appropriate analytical methods (at the high end that may mean developing new algorithms), interpret the results correctly, and communicate those results in an understandable way to clients.
- This, of course, includes good writing and presentation skills.
As with any profession, there is a wide range of niches that require different skill sets and abilities. Data science is a process that goes from such (apparently) mundane tasks as extracting and cleaning data, to mid-level “what-if” reporting, to higher-end inferential analysis and predictive modelling, to the really high-end PhD-level work such as researching breakthrough Artificial Intelligence algorithms.
Many people consider only the later steps of the process to be “data science”, but I see that as a mistake. A lot of jobs are at the earlier stages of the pipeline, so people considering the career of data science (in the widest sense of the word) should be aware of that. That’s true of any field – in medicine, for example, there are a lot more general practitioners than there are brain surgeons.
In addition to that, even the higher-end modelling jobs require a lot of data munging or wrangling before the fun work can begin. A modeller will generally have to get his or her hands dirty with data prep, though some of that can be handed off to less specially trained people.
So, that’s another thing that data scientists know that others don’t – there is a wide range of activities involved, which means that there is greater scope for involvement by different sorts of people than is generally recognized.
Here are a few other things that a data scientist knows that others usually don’t. Note that these are the opinions of a mid-level practicing data scientist – someone on the cutting edge of current research may not agree:
- Data science techniques can be very useful for predictive purposes, but improving a model gets more and more difficult as the level of predictive accuracy desired or needed increases. More data and faster processing help, but the gains don’t necessarily scale linearly, so it can be tough to improve performance past a certain point. There’s a lot of hype, so that can be hard for people to accept (especially business people who want to leverage data science techniques to make a lot of money).
- This may hold back the “AI revolution”, especially the notion of “the singularity”. Human level intelligence is probably still a long way off, though out-performing humans at specific tasks is often possible.
- Artificial intelligence is not intelligent in the way that most people think of the term. A multilayer perceptron model may be trained to be very good at recognizing cats, just as a four-year-old child can be very good at the same task. But it is hard to tell just what it is that makes the model say “cat” when it sees one, even if you understand input layers, hidden layers, output layers, back propagation, convolutions and all the rest of the jargon, as well as the detailed programming knowledge needed to implement these concepts. A four-year-old can not only recognize a cat, but she can explain her reasons for doing so (it has four legs, it has fur, it has a button nose, it has whiskers, it has padded feet, it is cute and cuddly, etc.).
- AI tends to move in fits and starts, and some think we may be approaching another AI desert. When momentum slows (as with the fitful progress on producing fully self-driving cars, for example), private investment money dries up and corporate research projects can wither for lack of money. And the sense that this line of research is a sure-fire way to get tenure and research money can dry up too, so university-level interest can also wane.
- It is often said of AI, that easy things are hard and hard things are easy. So, observing and understanding a general environment is incredibly hard for computers, while it is fairly easy for a human child. On the other hand, winning at the highest levels of games such as Go or Chess is hard for a human (even a highly intelligent adult), but relatively easy for current-day AI systems.
- Some data science techniques are better at explaining than predicting, while others are better at predicting than explaining. For example, regression techniques will give good information about which variables are most important in explaining a relationship (e.g. beta coefficients, confidence intervals), while machine learning techniques won’t do so, or at least not very well. Conversely, you can throw a lot more data and a lot more input variables at a machine learning technique, so that can lead to superior predictive power, but you won’t really know why the predictive model works so well (e.g. neural net weights are hard for humans to understand).
- Machine learning models don’t care about our feelings or our political attitudes. Supervised learning models will make predictions based on the data that they are fed. If those results run counter to our assumptions about the world, the models don’t care. And, if we “fix” the models by carefully selecting data to get a picture of the world that we prefer, we won’t actually get models that are useful for prediction.
- Finally, all models are wrong, but some are useful. You will probably hear that a lot, if you get into the field.
- No doubt, there is a lot more that can be said, but here is something that non-data scientists know that data scientists don’t always know – don’t make a data science blog too long!
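For anyone curious about what “better at explaining” looks like in practice, here is a minimal Python sketch of a one-variable regression. The data (ad_spend, sales) and the function name fit_simple_regression are made up purely for illustration, and a real analysis would normally use a statistics package rather than hand-rolled arithmetic. The point is only that the output is a slope and a standard error that a person can read directly, which is not something a pile of neural net weights offers.

```python
import math


def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y, around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the standard error of the slope
    # (n - 2 degrees of freedom: two parameters were estimated)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope_se


# Made-up illustrative data, not real measurements
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

slope, intercept, slope_se = fit_simple_regression(ad_spend, sales)
print(f"slope = {slope:.3f}, standard error = {slope_se:.3f}")
print(f"intercept = {intercept:.3f}")
```

The slope reads as “each extra unit of ad_spend goes with roughly this much more sales”, and the standard error gives a rough sense of how far to trust that number.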
--------------------------------------------------------------------------------------------------------
So, now that you have read a bit about data science and AI, you can kick back a bit, and read some science fiction by a data scientist instead.
How about a short story about an empire of interstellar interlopers? It features one possible scenario to explain why we haven’t met ET yet (as far as we know, anyway). Only 99 cents on Amazon.
The Zoo Hypothesis or The News of the World: A Science Fiction Story
Summary
In the field known as Astrobiology, there is a research program called SETI, The Search for Extraterrestrial Intelligence. At the heart of SETI, there is a mystery known as The Great Silence, or The Fermi Paradox, named after the famous physicist Enrico Fermi. Essentially, he asked “If they exist, where are they?”. Some quite cogent arguments maintain that if there was extraterrestrial intelligence, they should have visited the Earth by now. This story, a bit tongue in cheek, gives a fictional account of one explanation for The Great Silence, known as The Zoo Hypothesis.
Are we a protected species, in a Cosmic Zoo? If so, how did this come about? Read on, for one possible solution to The Fermi Paradox.
The short story is about 6300 words, or about half an hour at typical reading speeds.
Amazon U.S.: https://www.amazon.com/dp/B076RR1PGD
Amazon U.K.: https://www.amazon.co.uk/dp/B076RR1PGD
Amazon Canada: https://www.amazon.ca/dp/B076RR1PGD
Alternatively, consider another short invasion story, this one set in the Arctic. Also 99 cents.
The Magnetic Anomaly
Summary
An attractive woman in a blue suit handed a dossier to an older man in a blue uniform.
“Give me a quick recap”, he said.
“A geophysical crew went into the Canadian north. There were some regrettable accidents among a few ex-military who had become geophysical contractors after their service in the forces. A young man and young woman went temporarily mad from the stress of seeing that. They imagined things, terrible things. But both are known to have vivid imaginations; we have childhood records to verify that. It was all very sad. That’s the official story.”
He raised an eyebrow. “And unofficially?”
“Unofficially,” she responded, “I think we just woke something up that had been asleep for a very long time.”
U.S.: http://www.amazon.com/gp/product/B0176H22B4
U.K.: http://www.amazon.co.uk/gp/product/B0176H22B4
Canada: http://www.amazon.ca/gp/product/B0176H22B4
Australia: http://www.amazon.com.au/gp/product/B0176H22B4
Germany: http://www.amazon.de/gp/product/B0176H22B4
Japan: http://www.amazon.co.jp/gp/product/B0176H22B4