What Is Data Science?
You may be wondering what data science is, or perhaps you’ve heard of it but aren’t sure you understand it. Simply put, it’s gathering data and then analyzing that data. It draws on machine learning, which uses processes and algorithms to learn from data collected from various sources. It’s in the same realm as data mining. The two are closely related: data mining, the search for patterns in large data sets, is usually treated as one part of the wider data science process rather than a separate step that follows it.
It’s a broad term because it encompasses so many things: math, statistics, visualization, data engineering, the scientific method, and more. That breadth makes it complicated and hard to explain in full depth.
Companies are constantly in need of data scientists who can do most or all of the above. Their goal for data science is usually to understand what customers need and want. Once the data is gathered, the next step is to work out what it means, which can be quite difficult. Once you have the answers, you use them to decide what to adjust, what to keep the same, and what to drop.
Those who work in this field are called data scientists (what a shocker). They have to be knowledgeable in computer science, software development and programming, statistics, math, business domains, and more, although most of them are experts in only one of these areas.
Data scientist jobs are on the rise because there is a growing need for data analysis. More often than not, the volume of data coming in is too much for one person, or even one computer, to handle. When the information is about users, it gets even more complicated because of security and privacy requirements. The skills and tools needed to work with this data are wide-ranging.
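To make the "too much for one computer" point concrete, here is a minimal sketch in plain Python of the idea that distributed tools build on: split the data into partitions, summarize each partition separately, then merge the partial results. Everything in it (the sample page views and the function names) is made up for illustration. It runs on a single machine and is not how any particular tool is implemented.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def count_partition(partition):
    """Count page views per user within one partition (the 'map' step)."""
    return Counter(user for user, _page in partition)


def merge_counts(left, right):
    """Combine two partial counts into one (the 'reduce' step)."""
    return left + right


# Hypothetical sample data: (user, page) pairs.
page_views = [
    ("alice", "/home"), ("bob", "/pricing"), ("alice", "/pricing"),
    ("carol", "/home"), ("bob", "/home"), ("alice", "/checkout"),
]

partitions = split_into_partitions(page_views, num_partitions=3)
partial_counts = [count_partition(p) for p in partitions]
total_counts = reduce(merge_counts, partial_counts, Counter())
print(dict(total_counts))
```

In a real cluster each partition would sit on a different machine and the counting would happen in parallel. The merge step works the same way no matter how many partitions there are.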
However, there are also many tools out there that you as a business owner can use to help you mine that data. We’re going to focus on Apache Spark since it has a bit of an edge over the others.
Apache Spark
This part gets a bit more technical, so don’t worry if you don’t follow all of it.
Spark is a favorite among data scientists because it covers both operational and investigative analytics. Operational analytics is the engineering side: the software and languages used to build and run the analysis. Investigative analytics is the exploratory side: the answers you get back and how you interpret them.
People, and even tools, are generally geared toward one or the other, so doing both takes quite a bit of manpower. Shouldn’t everyone be able to understand all of these languages and see the full picture? That would be great in a perfect world, but it’s not the reality. People and tools that can do both are invaluable, and Apache Spark is one of them.
As mentioned, Spark is a crossover platform that handles both types of analytics, which makes it a very important tool. If you are a Java programmer, there will still be a bit of a learning curve, but because Spark is written in Scala, which runs on the JVM and interoperates with Java, the transition is easier.
Java programmers are not the only ones who can use it: Python and Scala programmers can too, as can users of Apache Crunch. Spark borrows ideas from many of these tools, so most users will find something familiar. Its APIs also support data integration, ETL (extract, transform, load), and more, so it goes beyond basic analysis.
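If ETL is a new term, the sketch below shows the three steps in plain Python using only the standard library. It is deliberately not Spark code, and the sample CSV, the function names, and the in-memory "table" are all assumptions made for illustration. A Spark job follows the same extract, transform, load shape with its own APIs and across many machines.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file or a database.
RAW_CSV = """customer,amount
alice, 20.50
bob,not_a_number
alice,4.50
carol,10
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy the values and drop rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip rows that cannot be parsed as a number
        cleaned.append({"customer": row["customer"].strip(), "amount": amount})
    return cleaned


def load(rows):
    """Load: total the amounts per customer into a dict standing in for a table."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


customer_totals = load(transform(extract(RAW_CSV)))
print(customer_totals)
```

Keeping the three steps as separate functions makes each one easy to check on its own. That matters more as the data gets messier.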
It has a machine-learning library (MLlib) and integrates with YARN. At the time of writing, however, the library is fairly basic and doesn’t offer a whole lot yet. It should grow and improve over time, which is something to look forward to.
Stack Exchange and Hadoop
Spark can work through a data set such as the public Stack Exchange data dump in a fairly simple way. Spark also integrates with Hadoop, although setting the two up together can be quite complex. CDH, Cloudera’s Hadoop distribution, makes this a lot easier because it ships with Spark already integrated.
Anyone Can Use It
As mentioned, most people can use Spark; it doesn’t matter if you only know one side. It’s fairly straightforward to figure out, and you will see things you are used to seeing. If your company needs a tool to help you or your technical team, try it out. Data science is complicated enough, so it’s good to have a well-rounded tool that looks at all of the data points and not just one.