Before we get into the fun part of working with data, let’s break down how data science involves more than just statistics, why it’s becoming more important, and what the data science process looks like.
Data Science vs. Statistics
In short, data science is the practice of extracting knowledge from data. But how is that different from statistics? Data science encompasses more than statistics. Statistics is useful for exploratory data analysis, for checking that insights are statistically significant, and for creating predictive models. Becoming a data scientist also calls for some skills beyond statistics.
Why Data Science is Becoming More Important
This article on Forbes about big data helps sum up the importance of data scientists and why they are becoming more sought after. Here are a couple of its projections for the future:
By then (2020), our accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes.
Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data.
And the one that is most relevant to data scientists:
At the moment less than 0.5% of all data is ever analyzed and used.
Imagine all the knowledge that can be gathered from looking at even 1% of that data! Also, imagine how much of that data can be used as training data for machine learning.
A good portion of that data can also be used for genuinely positive outcomes. Can the data help us figure out who is more prone to certain diseases? Will doctors be able to make better diagnoses? Is safer travel possible?
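To make those numbers a bit more concrete, here is a small sketch of the arithmetic behind them. It assumes decimal (SI) units, where a zettabyte is 10^21 bytes and a gigabyte is 10^9 bytes. Applying the 0.5% and 1% shares to the projected 44 zettabytes is only an illustration, since the article's 0.5% figure describes data "at the moment". The function names are made up for this example.

```python
# Assumption: SI units, so 1 zettabyte = 10**21 bytes and 1 gigabyte = 10**9 bytes.
GIGABYTES_PER_ZETTABYTE = 10**21 // 10**9  # 10**12, i.e. one trillion

def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * GIGABYTES_PER_ZETTABYTE

def share_in_gigabytes(zettabytes, numerator, denominator):
    """Return the given fraction (numerator/denominator) of the data, in gigabytes."""
    return zettabytes_to_gigabytes(zettabytes) * numerator // denominator

projected_gb = zettabytes_to_gigabytes(44)        # the "44 trillion gigabytes" in the quote
half_percent_gb = share_in_gigabytes(44, 5, 1000) # 0.5% of the projected total
one_percent_gb = share_in_gigabytes(44, 1, 100)   # 1% of the projected total

print(f"Projected total: {projected_gb:,} GB")
print(f"0.5% of that:    {half_percent_gb:,} GB")
print(f"1% of that:      {one_percent_gb:,} GB")
```

Integer arithmetic is used on purpose so the results are exact rather than rounded floating-point values.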
Data Science Process
So what does a data scientist actually do? Data scientists seem to have a bit of a magical quality to them. They are perceived to get a data set, apply some magic to it, and instantly produce insights that will transform the business and raise profits. As much as it may seem that way, a lot more work goes into the process.
To get a better idea of this process, here’s a diagram of the Cross-Industry Standard Process for Data Mining (CRISP-DM).
Let’s look at each of these items in more detail to better define them.
One interesting thing this diagram shows is that it’s best to have an understanding of the business. Without that, it would be much harder to ask the right questions and extract the most information from the data. Also, some steps may need to be revisited and iterated on. For example, if you’ve moved from data preparation to modeling but new data comes in, you would have to go back, prepare the new data, and merge it with the data you already had to get more accurate results.
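As a rough illustration of going back to data preparation when new data arrives, here is a minimal sketch that merges a new batch of records into an existing one. Everything in it is hypothetical: the `id` and `sales` fields, the sample rows, and the rule that a newer row replaces an older row with the same `id`. In practice you would likely reach for a data library rather than plain dictionaries, but the idea is the same.

```python
def merge_batches(old_rows, new_rows, key="id"):
    """Combine two batches of records.

    Assumption for this sketch: when both batches contain a row with the
    same key, the row from the newer batch wins.
    """
    merged = {row[key]: row for row in old_rows}
    for row in new_rows:
        merged[row[key]] = row
    # Sort by key so the result is in a predictable order.
    return sorted(merged.values(), key=lambda row: row[key])

# Hypothetical data: the batch we already prepared, and a batch that came in later.
old_rows = [{"id": 1, "sales": 100}, {"id": 2, "sales": 150}]
new_rows = [{"id": 2, "sales": 175}, {"id": 3, "sales": 90}]

combined = merge_batches(old_rows, new_rows)
for row in combined:
    print(row)
```

After a merge like this, the modeling step would be run again on the combined data rather than on the old batch alone.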
Once you have a model and are evaluating it, as the arrow indicates, it’s helpful to go back and make sure that the results of the model are in line with what the business needs. Does it help the business take action? Can it give answers to the business questions we had at the beginning? Are there any new questions that were raised?
For a more in-depth look at this process with a practical example, this post from Springboard is a really good one.
Now that we’ve looked at what data science is and how its process works, I think it’s time to look at a data set and see if we can answer any questions with it.
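One way to picture the back-and-forth described above is as a set of phases with allowed moves between them. The sketch below assumes the six phases and arrows of the published CRISP-DM reference model (business understanding, data understanding, data preparation, modeling, evaluation, deployment). The `Phase` enum and `can_move` helper are invented for this example and are not part of any standard library or tool.

```python
from enum import Enum

class Phase(Enum):
    BUSINESS_UNDERSTANDING = "business understanding"
    DATA_UNDERSTANDING = "data understanding"
    DATA_PREPARATION = "data preparation"
    MODELING = "modeling"
    EVALUATION = "evaluation"
    DEPLOYMENT = "deployment"

# Assumed arrows, following the usual CRISP-DM diagram: some steps can go
# backward, and evaluation can loop back to business understanding.
NEXT_PHASES = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}

def can_move(current, target):
    """Return True if the diagram has an arrow from `current` to `target`."""
    return target in NEXT_PHASES[current]

# New data arrived during modeling, so we step back to data preparation.
print(can_move(Phase.MODELING, Phase.DATA_PREPARATION))
# The model does not answer the business question, so we revisit the business.
print(can_move(Phase.EVALUATION, Phase.BUSINESS_UNDERSTANDING))
# Skipping modeling and evaluation is not a move the diagram allows.
print(can_move(Phase.DATA_PREPARATION, Phase.DEPLOYMENT))
```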