Simply put, data science is a multidisciplinary field that uses the scientific method and computational approaches to understand and extract insights from any kind of data. These approaches include algorithms, statistical methods, machine learning, and mathematical frameworks, and they are organized around a few basic steps:
- Cleaning the data
- Processing the data
- Analyzing the data
- Applying mathematical models to the data
- Generating predictions for the future (insights)
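As a rough illustration, the steps above can be sketched in a few lines of Python using only the standard library. The records, the variable names, and the choice of a simple least-squares line as the "mathematical model" are all assumptions made for this example; a real project would pick its own data and methods.

```python
from statistics import mean

# Hypothetical raw records: (years_of_experience, salary_in_thousands).
# None marks a missing value.
raw = [(1, 30.0), (2, 35.0), (None, 40.0), (3, 40.0), (4, None), (5, 50.0)]

# 1. Cleaning: drop records with missing values.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# 2. Processing: split the records into a feature column and a target column.
xs = [float(x) for x, _ in clean]
ys = [y for _, y in clean]

# 3. Analyzing: basic summary statistics.
x_mean, y_mean = mean(xs), mean(ys)

# 4. Modelling: ordinary least squares for y = slope * x + intercept.
slope = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    / sum((x - x_mean) ** 2 for x in xs)
)
intercept = y_mean - slope * x_mean


# 5. Predicting: apply the fitted model to an unseen value.
def predict(x):
    return slope * x + intercept


print(f"Predicted salary at 6 years: {predict(6):.1f}k")
```

The toy data was chosen to lie exactly on a line, so the fit is perfect here; real data rarely behaves that well.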
We are living in the “big data” era, in which terabytes of information about people, products, and companies are stored on computers and accessible over the internet (see Illustration 1). Extracting and analyzing key information from such large amounts of data is crucial for a modern company to survive in highly competitive markets that favour innovation, price, and quality of services and products. Understanding the available data ultimately helps companies with the:
- Development of new innovative products/services.
- Optimization of operational costs while maintaining product quality.
- Selection of target customers for each product/service.
- Prediction of sales and key business variables (forecasting).
This will depend largely on your data and on the objectives or questions you want to address. In general, whether the data is quantitative (e.g. age, salary, house prices) or categorical (e.g. profession, name, nationality), we can extract patterns, trends, and relationships between variables and use them to make predictions that address the initial question or goal.
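To make this concrete, here is a small sketch that looks at one categorical variable (profession) and two quantitative ones (age and salary). The records are invented for illustration, and the `pearson` helper is just one simple way to measure a linear relationship between two quantitative variables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (age, profession, salary in thousands).
records = [
    (25, "engineer", 40.0),
    (35, "engineer", 50.0),
    (45, "engineer", 60.0),
    (30, "teacher", 30.0),
    (50, "teacher", 40.0),
]

# Categorical variable: compare the average salary per profession.
by_profession = defaultdict(list)
for _, profession, salary in records:
    by_profession[profession].append(salary)
mean_salary = {name: mean(values) for name, values in by_profession.items()}


# Quantitative variables: Pearson correlation between two numeric columns.
def pearson(xs, ys):
    x_mean, y_mean = mean(xs), mean(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sum((x - x_mean) ** 2 for x in xs)
    y_spread = sum((y - y_mean) ** 2 for y in ys)
    return covariance / (x_spread * y_spread) ** 0.5


age_salary_r = pearson([r[0] for r in records], [r[2] for r in records])

print(mean_salary)
print(f"Correlation between age and salary: {age_salary_r:.2f}")
```

In this toy data set the profession separates salaries more clearly than age does, which is the kind of pattern that would guide the choice of variables for a predictor.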
The ultimate goal of a data science project is to build a predictor and apply it to data to generate insights. Ideally, the resulting solution includes:
- Systematic and automated analysis of data.
- Smart and ultra-fast analysis of large amounts of data.
- Dynamic and attractive visualization of results, predictions, and reports.
- Incorporation into user-friendly software or application tools for CEOs, CTOs, VPs of Engineering, Tech Leads or other customers.
Python and R are the most popular programming languages in the data science community. The reason is that most machine learning algorithms and analytical methods have already been implemented in Python and R libraries and are ready to use. In my experience, Python is more robust and well suited to handling large amounts of data, whereas R is ideal when we just want to perform a straightforward statistical analysis.
Both ecosystems also offer advanced data visualization libraries, such as Matplotlib in Python (see Illustration 2) and ggplot2 in R. Moreover, both languages allow extensive customization of visualizations, which lets us address most problems without being limited by a fixed setup.
Timelines for data science projects depend largely on the complexity of the problem, including the characteristics of the data and the availability of adequate methods for processing and analysis. It is therefore critical for the project timeline to determine whether we need to develop an original framework or can use an existing one. That said, most data science projects require 6 to 18 months on average to reach at least a prototype. This includes developing the analysis pipeline, training and validating the models, selecting a model for testing, and generating predictions.
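The stages listed above (training, validation, model selection, testing, and prediction) can be sketched on a toy scale as follows. Everything here is an assumption made for illustration: the synthetic data, the 18/6/6 split, and the two candidate models (a mean baseline and a least-squares line). Real pipelines typically rely on dedicated libraries and far larger data sets.

```python
import random
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    y_mean = mean(ys)
    return lambda x: y_mean


def fit_linear(xs, ys):
    """Candidate: ordinary least-squares line."""
    x_mean, y_mean = mean(xs), mean(ys)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    return lambda x: slope * (x - x_mean) + y_mean


def mse(model, rows):
    """Mean squared error of a model on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)


# Synthetic data (assumption): a noisy linear trend.
rng = random.Random(0)
rows = [(float(x), 3.0 * x + 2.0 + rng.uniform(-1.0, 1.0)) for x in range(30)]
rng.shuffle(rows)

# Split into training, validation, and test sets.
train, valid, test = rows[:18], rows[18:24], rows[24:]
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

# Training: fit every candidate on the training set only.
candidates = {"mean_baseline": fit_mean, "linear": fit_linear}
fitted = {name: fit(train_x, train_y) for name, fit in candidates.items()}

# Validation and model selection: keep the lowest validation error.
valid_error = {name: mse(model, valid) for name, model in fitted.items()}
best_name = min(valid_error, key=valid_error.get)

# Testing: estimate the selected model's error on data it has never seen.
test_error = mse(fitted[best_name], test)

# Prediction: apply the selected model to a new input.
prediction = fitted[best_name](31.0)

print(best_name, round(test_error, 3), round(prediction, 1))
```

Keeping the test set out of both training and model selection is what makes the final error estimate trustworthy.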
YES! Nothing lasts forever, and even robust predictors tend to lose performance over time. This is mainly because models are an abstraction of the real world, and the real world changes continuously. It is therefore strongly recommended to keep monitoring model performance and to update and improve the analysis pipelines so that the models keep pace with those changes.
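One simple way to monitor a model is to track its recent prediction error and raise a flag when that error drifts well above the level measured at deployment. The sketch below is a minimal illustration of that idea; the `PerformanceMonitor` class, its thresholds, and the toy model are hypothetical and not taken from any particular library.

```python
from collections import deque
from statistics import mean


class PerformanceMonitor:
    """Tracks recent absolute errors and flags performance decay."""

    def __init__(self, baseline_error, tolerance=1.5, window=5):
        self.baseline_error = baseline_error  # error measured at deployment
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.errors = deque(maxlen=window)    # only the most recent errors

    def record(self, predicted, actual):
        self.errors.append(abs(predicted - actual))

    def needs_retraining(self):
        # Wait until the window is full before judging the model.
        if len(self.errors) < self.errors.maxlen:
            return False
        return mean(self.errors) > self.tolerance * self.baseline_error


def model(x):
    """Toy predictor fitted when the world behaved like y = 2x."""
    return 2.0 * x


monitor = PerformanceMonitor(baseline_error=1.0, tolerance=1.5, window=5)

# Period 1: reality still matches the model closely.
for x in range(5):
    monitor.record(model(x), 2.0 * x + 0.5)
flag_before_drift = monitor.needs_retraining()

# Period 2: the real world shifts, and the model is now off by 5.
for x in range(5, 10):
    monitor.record(model(x), 2.0 * x + 5.0)
flag_after_drift = monitor.needs_retraining()

print(flag_before_drift, flag_after_drift)
```

When the flag is raised, the usual response is to retrain or recalibrate the model on recent data and then reset the baseline.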
First, define your objective, question, and problem. Then you need to set up a data science team. This is not a one-person job; it is a team effort. Ideally, the model development stage alone would take 2 or 3 people. Keep in mind that models also need to be implemented in software solutions, so software engineers and other specialists may be needed as well, depending on the project objectives. Outsourcing to specialized companies that already have in-house teams of specialists, such as Growin, is another option that can both save time on team building and boost the chances of success. In this scenario, the focus would be project in, predictor/solution out.
There is no particular checklist of background or tool expertise that makes a data scientist successful. It is more of a mindset and a data-driven approach, which requires strong analytical, logical, and quantitative skills. Nowadays, programming skills are essential for data science tasks. Above all, a data scientist should be first and foremost a scientist and apply scientific thinking.
In my opinion, a good data scientist should have an out-of-the-box, holistic approach, together with a strong capacity to solve complex and multidimensional problems. Thus, diverse backgrounds are more than welcome. Remember: “old school” data scientists and some of those who invented machine learning algorithms were not even mathematicians or “hackers”; some were just physicists or biologists, like myself.
Have you had any working experience with data science so far? Are you thinking about hiring a data scientist for your team? Are you curious about what we’ve done in this area? Feel free to contact us or leave your comment below.