With companies requiring more than 50,000 Data Science professionals, and a growing number of professionals and students enrolling in Data Science certificate courses, it is important to prepare well for the last mile: the Data Science job interview. In this article we bring you a list of the top Data Science interview questions and answers to help you land your dream job in Data Science.
Top Data Science Interview Questions
What is Data Science?
Data Science is a study which deals with identification, representation and extraction of meaningful information from data sources to be used for business purposes.
With an enormous amount of data generated every minute, extracting useful insights is a must for businesses that want to stand out from the crowd. Data engineers set up databases and data storage to facilitate data mining, data munging and related processes. Every organization chases profits, but the companies that formulate efficient strategies based on fresh, useful insights always win the game in the long run.
Difference between Supervised Learning and Unsupervised Learning?
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| 1. Input data is labeled. | 1. Input data is unlabeled. |
| 2. Uses a training dataset. | 2. Uses the input dataset. |
| 3. Used for prediction. | 3. Used for analysis. |
| 4. Enables classification and regression. | 4. Enables clustering, density estimation and dimension reduction. |
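The contrast can be made concrete with a minimal sketch (assuming scikit-learn and NumPy are installed; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated groups of points in 2D.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)          # labels exist -> supervised setting

# Supervised: learns a mapping from X to the known labels y.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[5, 5]]))               # class prediction for a new point

# Unsupervised: sees only X and discovers structure (here, 2 clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                      # cluster assignments, no labels used
```

The supervised model predicts labels it was trained on; the clustering model never sees `y` at all.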
Difference between Data Science and Data Analysis?
Data Scientists and Data Analysts differ in that a Data Scientist starts by asking the right questions, whereas a Data Analyst starts by mining the data. A Data Scientist also needs substantive domain expertise and non-technical skills that a Data Analyst's role does not demand to the same degree.
| Criteria | Data Scientist | Data Analyst |
| --- | --- | --- |
| Fundamental goal | Asking the right business questions and finding solutions | Analyzing and mining business data |
| Typical tasks | Data cleansing, preparation and analysis to gain insights | Data querying and aggregation to find patterns |
| Substantive expertise | Needed | Not necessary |
| Non-technical skills | Needed | Not needed |
Data Science is a multidisciplinary field, and having a Data Science career means you need to master domains such as data inference, algorithms, statistics, deductive reasoning, computer programming and substantive expertise, among other skills. Data Science applications span multiple industry domains.

The job of a Data Scientist is to dig into data at a granular level to understand complex behaviors and trends, drawing on analytical creativity and techniques such as time series analysis, segmentation analysis, inferential modeling and quantitative reasoning.
Python or R – Which one would you prefer for text analytics?
We would prefer Python, for the following reasons:
- Python has the Pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
- R's strengths lie more in statistical modeling and machine learning than in text processing, whereas Python's text processing ecosystem is broader.
- For large volumes of text, Python is generally the faster option.
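As a small illustration of the Pandas point, vectorized string methods make a word-frequency table a few lines of code (the review data here is hypothetical; assumes pandas is installed):

```python
import pandas as pd

reviews = pd.Series([
    "great product, great price",
    "poor quality",
    "great support team",
])

# Lowercase, split into words, then explode into one word per row.
words = reviews.str.lower().str.split().explode()
print(words.value_counts().head(3))   # frequency table, no explicit loops
```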
How does data cleaning play a vital role in analysis?
Data cleaning can help in analysis because:
- Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
- Data Cleaning helps to increase the accuracy of the model in machine learning.
- It is a cumbersome process: as the number of data sources grows, the time taken to clean the data increases with both the number of sources and the volume of data they generate.
- Cleaning alone can take up to 80% of the time spent on a project, making it a critical part of the analysis task.
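A minimal cleaning pass might look like the following sketch (the messy frame is hypothetical; assumes pandas is installed):

```python
import pandas as pd

raw = pd.DataFrame({
    "customer": ["alice", "Alice ", "bob", "bob", None],
    "spend":    ["100", "100", "250", "250", "90"],
})

clean = (
    raw.dropna(subset=["customer"])            # drop rows missing a key field
       .assign(customer=lambda d: d["customer"].str.strip().str.lower(),
               spend=lambda d: d["spend"].astype(float))  # normalize and fix types
       .drop_duplicates()                      # remove exact duplicates
       .reset_index(drop=True)
)
print(clean)
```

After normalizing case and whitespace, "alice" and "Alice " collapse into one customer and the duplicate rows disappear.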
What are the steps in making a decision tree?
- Take the entire data set as input.
- Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
- Apply the split to the input data (divide step).
- Re-apply steps 2 and 3 to each of the divided subsets.
- Stop when you meet some stopping criteria.
- Clean up the tree if you went too far when splitting; this step is called pruning.
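The steps above are what scikit-learn's `DecisionTreeClassifier` automates; a minimal sketch (assuming scikit-learn is installed), where the `max_depth` cap plays the role of a stopping criterion:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth=2 stops splitting early, a simple form of pre-pruning.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.score(X, y))        # training accuracy of the shallow tree
```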
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
Difference between univariate, bivariate and multivariate analysis?
These analyses are descriptive statistical techniques differentiated by the number of variables involved at a given point in time. Univariate analysis involves only one variable. For example, a pie chart of sales by territory involves a single variable, so the analysis is univariate.

Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing sales volume against spending is bivariate analysis.
Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.
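Numerically, the three levels of analysis correspond to familiar pandas operations; a small sketch with hypothetical sales data (assumes pandas is installed):

```python
import pandas as pd

df = pd.DataFrame({
    "sales":    [10, 20, 30, 40],
    "spending": [1, 2, 3, 4],
    "visits":   [5, 9, 14, 20],
})

print(df["sales"].describe())                # univariate: one variable at a time
print(df["sales"].corr(df["spending"]))      # bivariate: relationship of one pair
print(df.corr())                             # multivariate: all pairwise relationships
```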
What is Cluster Sampling?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
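The key point is that whole clusters are sampled, not individual elements; a toy sketch with hypothetical clusters (standard library only):

```python
import random

clusters = {
    "north": [1, 2, 3],
    "south": [4, 5],
    "east":  [6, 7, 8],
    "west":  [9, 10],
}

random.seed(0)
chosen = random.sample(list(clusters), k=2)   # sample clusters, not elements
sample = [x for c in chosen for x in clusters[c]]  # keep every element of each chosen cluster
print(chosen, sample)
```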
What is Systematic Sampling?
Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame at a fixed interval. The list is traversed in a circular manner, so once you reach the end you continue from the top. Systematic sampling is a classic example of an equal-probability selection method.
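A minimal sketch of the procedure, with a random start and interval k = N / n (standard library only; the modulo handles the circular wrap-around):

```python
import random

population = list(range(1, 21))   # ordered frame of N = 20 units
n = 5                             # desired sample size
k = len(population) // n          # sampling interval: every 4th unit

random.seed(1)
start = random.randrange(len(population))   # random starting point
sample = [population[(start + i * k) % len(population)] for i in range(n)]
print(sample)
```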
What are Eigenvectors and Eigenvalues?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
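The defining property, Av = λv, is easy to verify numerically; a small sketch with NumPy on a toy matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

vals, vecs = np.linalg.eig(A)
print(vals)        # eigenvalues: the stretch factors
print(vecs)        # columns are the corresponding eigenvectors

# Check the defining property A v = lambda v for the first pair.
v = vecs[:, 0]
print(np.allclose(A @ v, vals[0] * v))
```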
What is Collaborative Filtering?
Collaborative filtering is the process, used by most recommender systems, of finding patterns and making predictions by combining the perspectives of many users, numerous data sources and several agents.
What are the drawbacks of the linear model?
Some drawbacks of the linear model are:
- It assumes a linear relationship between the predictors and the response.
- Without extensions, it can't be used for count outcomes or binary outcomes.
- It is prone to overfitting problems that it can't solve on its own.
Examples where a false positive is more important than a false negative?
Let us first understand what false positives and false negatives are. False positives are cases where you wrongly classify a non-event as an event, a.k.a. a Type I error. False negatives are cases where you wrongly classify an event as a non-event, a.k.a. a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. A patient comes to the hospital and is tested positive for cancer based on the lab prediction, but he actually doesn't have cancer. This is a case of a false positive. It is extremely dangerous to start chemotherapy on this patient: in the absence of cancerous cells, chemotherapy will damage his normal healthy cells and might lead to severe illness.
Example 2: Let's say an e-commerce company decides to give a $1,000 gift voucher to customers it expects to purchase at least $10,000 worth of items, expecting to make at least a 20% profit on those sales. If the model produces false positives, vouchers go to customers who are wrongly flagged as likely big spenders and will never make the qualifying purchases, so each false positive costs the company directly.
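The trade-off in these examples comes down to counting errors of each type; a minimal sketch with hypothetical labels (1 = event, 0 = non-event; standard library only):

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]

# False positive: predicted an event that did not happen (Type I error).
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
# False negative: missed an event that did happen (Type II error).
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
print(fp, fn)   # 2 false positives, 1 false negative
```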
Examples where a false negative is more important than a false positive?
Example 1: Assume there is an airport ‘A’ which has received high-security threats and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan passengers being predicted as risk positives by their predictive model. What will happen if a true threat customer is being flagged as non-threat by airport model?
Example 2: What if a jury or judge decides to let a criminal go free?

Example 3: What if your predictive model led you to reject a very good marriage prospect, and you meet that person a few years later and realize it was a false negative?
Examples where false positives and false negatives are equally important?
In the banking industry giving loans is the primary source of making money but at the same time if your repayment rate is not good you will not make any profit, rather you will risk huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
What are confounding variables?
These are extraneous variables in a statistical model that correlate directly or inversely with both the dependent and the independent variable. The estimate fails to account for the confounding factor.
What is the Law of Large Numbers?
It is a theorem that describes the result of performing the same experiment a large number of times, and it forms the basis of frequency-style thinking. It says that the sample mean, sample variance and sample standard deviation converge to the quantities they are trying to estimate as the sample size grows.
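A quick simulation makes the theorem tangible: the running mean of fair-die rolls approaches the true mean of 3.5 as the sample grows (assumes NumPy is installed):

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)   # 100,000 fair-die rolls

print(rolls[:10].mean())      # small sample: can be far from 3.5
print(rolls.mean())           # large sample: close to 3.5
```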
What is star schema?
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
Difference between a Validation Set and a Test Set?
Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms: the training set is used to fit the parameters (i.e., weights), the validation set is used to tune the model and guard against overfitting, and the test set is used to assess performance, i.e., the model's predictive power and generalization.
What is Cross Validation?
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a portion of the data to test the model during the training phase (i.e., a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
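A common concrete form is k-fold cross-validation, where each fold serves once as the held-out validation set; a minimal sketch with scikit-learn (assuming it is installed):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5: split the data into 5 folds, train on 4, validate on the 5th, rotate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # the cross-validated estimate of generalization
```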
What is Selection Bias?
Selection Bias is the bias introduced by the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect. It is the distortion of a statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
What are the different types of Selection Bias?
Different types of Selection Bias are as follows:
- Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
- Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
- Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
- Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.
What is the goal of A/B Testing?
It is a statistical hypothesis test for a randomized experiment with two variants, A and B.

The goal of A/B testing is to identify whether a change to a web page increases an outcome of interest.
An example for this could be identifying the click-through rate for a banner ad.
What are the differences between overfitting and underfitting?
In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data, so as to be able to make reliable predictions on unseen data.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model too would have poor predictive performance.
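Both failure modes can be seen by fitting polynomials of different degrees to noisy quadratic data: degree 1 underfits, while a very high degree chases the noise; a sketch assuming NumPy and scikit-learn are installed:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1, 30)   # quadratic signal plus noise

scores = {}
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = model.fit(X, y).score(X, y)
    print(degree, round(scores[degree], 3))   # training R^2 only ever rises
```

The training score keeps improving with degree, which is exactly why training performance alone cannot diagnose overfitting: the degree-15 model fits the noise, not the trend.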
How do you work towards a random forest?
The underlying principle of this technique is that several weak learners combine to produce a strong learner. The steps involved are:
- Build several decision trees on bootstrapped training samples of data
- On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors.
- Rule of thumb: at each split, take m ≈ √p.
- Predictions: by majority vote (for classification) or averaging (for regression).
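This recipe is what scikit-learn's `RandomForestClassifier` implements: bootstrapped trees plus a random subset of features at each split; a minimal sketch (assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each on a bootstrap sample
    max_features="sqrt",   # m ~ sqrt(p) predictors considered per split
    random_state=0,
).fit(X, y)

print(forest.score(X, y))   # prediction is the majority vote over the trees
```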