Data Science Tutorial | Data Science Course | Data Science for Beginners | Intellipaat


data science is one of the hottest jobs
of the 21st century with an average salary of 123 thousand dollars per year
let’s understand why according to LinkedIn the data science job profile is
among the top 10 jobs in the United States according to McKinsey the United
States alone faces a job shortage of 1.5 million data scientists similarly in
India according to economic times the job postings for Data science profile
have grown over 400 times in the past 1 year so keeping this in mind we have
come up with an end to end session where we’ll learn all the major concepts of data
science so
today we’ll start off by understanding what exactly is data science and then
we’ll learn how to do data manipulation with the dplyr package and data
visualization with the ggplot2 package following which we’ll start with the
first machine learning algorithm which is linear regression and then we’ll
learn how to do binary classification with the logistic regression algorithm
going ahead we’ll learn about decision tree and random forests which are
basically tree based classifiers so once we’re done with all of these supervised
learning algorithms we’ll learn how to do unsupervised learning with the help
of k-means clustering algorithm after that we’ll learn about user based
collaborative filtering and item based collaborative filtering which are
basically recommendation techniques following which we’ll learn how to find
frequent patterns with the help of Association rule mining and finally
we’ll go through some of the most frequently asked questions in the data
science interviews so what in the world is data science have you ever come
across huge amounts of data only to discard it because you thought it wasn’t
useful to you well my friend you couldn’t be more wrong data is always
useful to you we just need to dig through it to find meaningful insights
so data science is using some tools and techniques which help you to manipulate
and wrangle the data so that you can find something new and meaningful now
what if I gave you a really long spreadsheet containing all the sales figures for
the past three years it would be difficult for you to comprehend the data
wouldn’t it so instead of the spreadsheet what if I gave you some
charts and graphs related to annual sales you would obviously prefer the
graphs over the spreadsheet right this again is data science my friend
visualizing the data helps you get a better perspective and understand it
easily now every industry has its own problems and the top management needs to
take responsible and right decisions to stay ahead in the race so how does one
get good at decision making well that’s simple by going through the historical
data and understanding what worked and what did not and this is where data
science comes in data science brings along with it a bag of techniques which
make decision-making simple well we are always curious about the
future aren’t we now what if I told you we can predict what happens in the
future by analyzing the current data sounds like magic doesn’t it
well I’m no magician but I can definitely use data science concepts to
perform predictive analysis so in simple terms data science is a set of tools and
techniques with which we can make the data talk to us now let’s go ahead and
look at some data science use cases in the telecom industry customer attrition is a
major concern every day a new player comes into the market with new and attractive
prices in such a scenario how does a company retain its users so
this is where data science plays a pivotal role so the data scientists or
the data analysts go through the data to understand customer behavior they do a
thorough analysis of data usage patterns social media activity and voice call or
SMS patterns the data scientists also analyze customer demographics so that
proper segregation can be done in terms of age gender or geographic location
all of this analysis helps in providing the right offers to the right customers
and this in turn helps to retain the customers now let’s say you are at home
and eating a pizza and you suddenly get a message on your phone
stating that $10,000 was spent on your credit card to buy a diamond necklace and
asking you to verify whether the transaction was actually done by you and you
immediately call up your bank and tell them that this transaction wasn’t done
by you and ask them to hold the purchase now how did the bank know that this was a fraudulent
transaction well data science again folks the bank keeps a check on your
purchase pattern and whenever there’s a deviation in that pattern it flags it off
as an anomaly and immediately notifies you all of this courtesy of data
science let’s look at some languages to
implement some data science concepts first in the list is R R is one of the most widely used
languages for data science tasks R provides more than 10,000 packages for different
purposes such as data visualization data manipulation machine learning
statistical analysis and so on next in line is Python well Python and R
are actually close in competition Python also provides packages for deep learning such
as Keras and TensorFlow which help in creating deep neural networks quite
easily we can also use good old Java to perform data science tasks the
biggest reason why people use Java is speed and scalability with big data so
it’s finally time to implement all of the data science concepts with R so let’s go
ahead and do that and we’ll be working with the Diamonds
dataset to implement the data science concepts so let’s head to R studio so
this is R studio guys this is what R studio looks like so we’ll start off with
data manipulation and since we want to work with the diamonds dataset we would
need to load the ggplot2 package so to load a package in R all you have to
use is the library function so I’ll say library of ggplot2 and the package has been loaded
now to do a bit of data manipulation we would also require the dplyr package so
I’ll say library of dplyr to load the dplyr package as well so
we have successfully loaded the ggplot2 and dplyr packages now let’s
have a look at the diamonds dataset so View of diamonds alright this is our data
set let’s understand this dataset first so we see that this is a dataset which
comprises close to 54,000 diamonds price tells us the price of the diamond in US
dollars carat is the weight of the diamond cut is the quality of the cut
color is the diamond color clarity is a measurement of how clear the diamond
is and x y and z are the length width and depth of the diamond in millimeters
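The setup described so far can be sketched as follows, assuming the ggplot2 and dplyr packages are already installed. `View()` only opens a viewer tab inside RStudio, so `dim()`, `names()` and `head()` are added here as console-friendly alternatives; they are not part of the original walkthrough.

```r
# Load the packages used in this walkthrough
library(ggplot2)  # also ships the diamonds dataset
library(dplyr)    # data manipulation functions such as filter() and select()

# Inspect the dataset from the console
dim(diamonds)     # number of rows and columns
names(diamonds)   # column names, including price, carat, cut, color, clarity, x, y, z
head(diamonds)    # first few rows

# In RStudio, View(diamonds) shows the same data in a spreadsheet-style tab
```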
right so now it’s finally time to manipulate this data and get some
insights from it now from this dataset I would want to filter out only those
diamonds where the cut is ideal so over here we saw that the cut or the quality
of the cut can be fair good very good premium or ideal now I would want only
those diamonds which have the ideal cut so this is how I can do it now I’m using the filter function from
the dplyr package so what I’m doing over here is from the diamonds dataset I
am filtering out all those diamonds where the cut is ideal and I’m storing
it in an object and naming it ideal cut now let me have a look at this so I’ll say
View of ideal cut right so there are 21,551 diamonds
which have the ideal cut right now out of all of these diamonds I would want to
select those diamonds whose price is greater than 15,000 US dollars so from this entire dataset I want to
filter out those diamonds whose price is greater than 15,000 and I’m storing it
in the high price object well let me have a look at this
so I’ll say View of high price and this is the new dataset where I see that
there are 531 diamonds whose price is greater than 15,000 US dollars now
similarly I would also want to select those diamonds whose price is between
10,000 and 15,000 and those less than 10,000 so I’ll do the same process again
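A minimal sketch of this filtering with dplyr's `filter()`. The object names (`ideal_cut`, `high_price`, `medium_price`, `low_price`) are assumptions based on how the names are spoken in the session, not confirmed spellings.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Keep only the diamonds with an ideal cut
ideal_cut <- filter(diamonds, cut == "Ideal")

# Split the ideal-cut diamonds into three price tiers (US dollars).
# The comparisons are strict, so a price of exactly 10000 or 15000
# falls into none of the tiers.
high_price   <- filter(ideal_cut, price > 15000)
medium_price <- filter(ideal_cut, price > 10000, price < 15000)
low_price    <- filter(ideal_cut, price < 10000)

# Number of diamonds in each subset
nrow(ideal_cut)
nrow(high_price)
nrow(medium_price)
nrow(low_price)
```

Passing several conditions to `filter()` separated by commas combines them with a logical AND, which is what the medium tier relies on.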
so from the ideal cut dataset I am filtering out those diamonds whose price
is greater than 10,000 and less than 15,000 and I’m storing the result in the
medium price object let me have a look at this so I’ll say View of medium price right so this is the list of all those
diamonds whose price is in the range of 10,000 to 15,000 now similarly I will
select those diamonds whose price is less than 10,000 US dollars View of low price so there are 19,781 diamonds which
have a price less than 10,000 US dollars so this was a bit of data
manipulation where we just filtered out all those diamonds
whose cut was ideal and then we segregated them on the basis of their
price now what if I wanted to select only some specific columns from this
entire dataset now I would want just these three columns from this data set
so what do I do I use the select function from the dplyr
package so here what I am doing is from the diamonds dataset I am
selecting only the x y and z columns and then storing it in diamond dimensions
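A short sketch of the column selection, assuming the object is called `diamond_dimensions` (the exact spelling used in the session is not shown).

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Keep only the length (x), width (y) and depth (z) columns
diamond_dimensions <- select(diamonds, x, y, z)

head(diamond_dimensions)
```

`select()` works on columns while `filter()` works on rows, so the number of diamonds is unchanged here.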
so View of diamond dimensions and I see that from the entire diamonds dataset I
have selected only the x y and z columns right now what I’ll do is from the
newly created dataset I’ll filter out those entries where the length is
greater than five and the width is greater than 50 so what I see is there is just one entry
or there is just one diamond whose length is greater than eight and whose
width is greater than 58 right now time to head on to the next data
manipulation technique now what we are doing is we are
filtering out the diamonds on the basis of their color so here we see that if
the color is represented as D then it is the best so from the diamonds dataset
I am filtering out all those diamonds whose color is equal to D and I’ll
store it in diamond best color let me have a glance at this
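The color filter follows the same `filter()` pattern as before. The sketch below shows this step together with the two steps the walkthrough turns to next: a clarity filter on the result and a mean price computed with `summarise()`. Object names are assumptions, and `"IF"` (internally flawless) is taken to be the best clarity grade in this dataset.

```r
library(ggplot2)  # diamonds dataset
library(dplyr)

# Diamonds with the best color grade (D)
diamond_best_color <- filter(diamonds, color == "D")

# Of those, keep the diamonds with the best clarity grade (IF)
diamond_best_clarity <- filter(diamond_best_color, clarity == "IF")

# Average price of the diamonds that are best on both counts
summarise(diamond_best_clarity, mean_price = mean(price))
```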
so I’ll say View of diamond best color right so there are 6,775
diamonds which have the best diamond color now out of all of
these diamonds I want those diamonds whose clarity is IF that is the
best clarity so from the diamond best color dataset I’ll filter out those
diamonds whose clarity is equal to IF and I’ll store it in diamond best
clarity now let me have a glance at this so I’ll say View of
diamond best clarity and this is the new dataset where the color is the best and
also the clarity is the best now from this new dataset I would want to find
the average price of the diamonds so I’ll use the summarize function so now what I’m
doing is from this new dataset I am finding out the mean of the price so the
mean price of the diamonds from this newly created dataset is 8307 so that was data manipulation now it’s
time to head on to our next task where we’ll do a bit of visualization with the
help of the ggplot2 package at the start of the practical we had already loaded
the ggplot2 package and hence we do not need to load it again so over here what
I’m doing is building a univariate distribution with the help of a bar plot
so ggplot is the function where I am giving the dataset over here the data
set is diamonds and I would want to plot a bar plot with respect to the cut of
the diamonds this is the bar plot with respect to cut
so this function basically takes two arguments over here first is the dataset
next is the aesthetics onto which we are going to map the columns of the
diamonds dataset and in the plotting function we are
directly going to give the color which is light salmon now what if I wanted to give separate
colors to each bar how would I do that that is actually quite simple so what
we’ll do is instead of assigning the color directly I’ll assign
the cut column to the fill attribute inside the aesthetic
mapping so now this gives me a different color for each bar of this
graph what basically we can conclude from this graph is that most of the
diamonds are of the ideal cut and the fewest diamonds are of the fair cut
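The two bar plots can be sketched as below. It is an assumption that the plotting function is `geom_bar()` and that the color is the built-in R color `"lightsalmon"`; the variable names `p_single` and `p_by_cut` are made up for illustration.

```r
library(ggplot2)

# Univariate bar plot: one bar per cut, all bars in a single fixed color
p_single <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "lightsalmon")

# Mapping cut to fill inside aes() gives each bar its own color
p_by_cut <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

print(p_single)
print(p_by_cut)

# The counts behind the bars
table(diamonds$cut)
```

A fixed color goes outside `aes()`, while a color that depends on a column goes inside it. That is the only difference between the two plots.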
now we’ll do a bit of bivariate analysis with the help of a scatter plot so again we
are taking the diamonds dataset and we are mapping the price column onto the
y-axis and we are mapping the carat column onto the x-axis and since we
want a scatter plot we will use the geom_point function in addition to ggplot so this is the scatter plot so we see
that as the carat size increases the price of the diamond also increases now
let me make this graph more aesthetically pleasing so what I’ll do
is I’ll also add a color to this plot and to add a color I’ll just use the
col attribute inside the geom_point function and assign the color palegreen4
to it so we have a result over here we have
successfully added the color to this plot now let’s do another bivariate
distribution where we’ll try to find out the distribution of price with respect
to the length of the diamond so over here the dataset used is diamonds and we are mapping
the price column onto the y-axis and we are mapping the length of the diamond
that is x onto the x-axis again the geometry used is geom_point because we want a
scatter plot now we are assigning the color mediumorchid4 to it this is a similar plot to the previous
one so again what we can conclude is as the length of the diamond increases the
price of the diamond would also increase so linear regression is a predictive
modeling technique which is used whenever there is a linear relationship
between the independent and dependent variables and it is used in estimating
exactly how much of Y will change when X changes by certain amount like over here
we have a flower sepal length mapped onto the x axis and petal length mapped
onto the y axis and we are trying to understand how does petal length change
with respect to sepal length with the help of linear regression so let’s have
a better understanding of linear regression with this example for you so
let’s say there’s a telecom network called Neo and the delivery manager
of the company wants to find out if there’s a relationship between the
monthly charges of the customer and the tenure of the customer so he collects
all of the customer data and implements the linear regression algorithm by
taking monthly charges as the dependent variable and tenure as the independent
variable and after implementing the algorithm what we understand is that there is
a linear relationship between the monthly charges and the tenure of the
customer so as the tenure of the customer increases his monthly charges
would also increase now the best fit line helps the delivery manager to find
out interesting insights from the data with this he can predict the values of Y
for every new value of X so let’s say the tenure of the customer is 45 months
then with the help of the best fit line he can predict that his monthly charges
would be somewhere around $64 similarly if the customer’s tenure is 69 months
then his monthly charges would be around 110 dollars so this is how linear
regression works now that you’ve understood what exactly is linear
regression let’s go ahead and understand how can we find the best fit line so
this time we are trying to fit a straight line between the age of an employee and
his salary so the line could either be this this or this so how do we know
which of these is the best there could be infinite possibilities
right so this is where you need to have a look at the residual values so this
red line which you see over here this denotes the residual value which is
nothing but the difference between the actual value and the predicted value
now to find out the best fit line we have something known as residual sum of
squares so in residual sum of squares we take the square of all the residuals
and then we sum them up and this gives us the value of residual sum of squares
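To make the residual sum of squares concrete, here is a small sketch on made-up numbers. The ages, salaries and the two candidate lines are hypothetical and chosen only for illustration; `lm()` is base R's least-squares fit.

```r
# Hypothetical toy data: employee age (years) and salary (thousand dollars)
age    <- c(22, 25, 30, 35, 40, 45, 50)
salary <- c(30, 38, 45, 58, 62, 75, 80)

# Residual sum of squares for a candidate line: salary = intercept + slope * age
rss <- function(intercept, slope) {
  predicted <- intercept + slope * age
  residuals <- salary - predicted   # actual minus predicted
  sum(residuals^2)
}

# Two arbitrary candidate lines
rss(intercept = 0, slope = 1.5)
rss(intercept = -5, slope = 1.8)

# The least-squares line found by lm() has the lowest RSS of any straight line
fit <- lm(salary ~ age)
rss_best <- rss(coef(fit)[["(Intercept)"]], coef(fit)[["age"]])
rss_best
```

Comparing the three numbers shows why the fitted line counts as the best fit: no other choice of intercept and slope gives a smaller sum of squared residuals on this data.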
and whichever line has the lowest value of residual sum of squares it would be
considered as the best fit line so now we’ll learn how the coefficient of x
influences the relationship between independent variable and the dependent
variable so if it is simple linear regression and the value of the coefficient of x
is greater than 0 then the relationship between independent and dependent
variables would be positive that is as the value of x increases the value of Y
would also increase and if the coefficient of x is lower than 0 then
the relationship between independent and dependent variables would be negative
that is as the value of x increases the value of Y would decrease right so in
multiple linear regression we have more than one independent variable and we try
to determine how all of these independent variables together affect
the dependent variable like over here we have a mapping between Y x1 x2 and x3
where Y is the dependent variable and x1 x2 and x3 are the independent variables
so let’s take this example to have a better understanding of multiple linear
regression so over here we are trying to understand what factors affect the
salary of an employee here salary is the dependent variable and gender age and
department are the independent variables so the linear regression model helps us
to determine the salary of an employee when specific values are given to age
gender and department so let’s go to R studio and implement multiple linear
regression right we have R studio right in front of
us and this is our customer churn dataset all right so now
we already know that before building a linear regression model we need to
divide our dataset into training and testing sets and to do that we would
require the caTools package so I’ll type library of caTools so we have
loaded the caTools package so now to build a linear regression model I will
take the tenure column as the dependent variable and I’ll try to
understand how the other columns affect the tenure of a customer right so I’ll
split this dataset with respect to the tenure column so I will use the
sample.split function now let me select this column customer_churn$tenure
right and the split ratio which I’ll be giving is 0.65 and I will store
this in an object called split_model okay so basically 65% of the
observations would get true values and the remaining 35% of the observations would get false values
so now that we’ve stored this result in split_model I will divide the dataset using the
subset function right now this takes in the dataset as the first parameter now
from this dataset wherever the value of split_model is
equal to true I will store all of those observations in the training set right
similarly from the entire customer_churn dataset wherever the value of split_model
is false I’ll select all of those observations and I will store those
observations in a dataset called test and thus we have our training and testing
sets ready now let me just have a look at the number of rows in train and
test so nrow of train is 4573 and nrow of test is 2470 right
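A sketch of the split, assuming the caTools package is installed. The customer churn file itself is not available here, so a small simulated data frame stands in for it; its column names and values are hypothetical, as are the object names `customer_churn` and `split_model`.

```r
library(caTools)

set.seed(42)  # make the random split reproducible

# Hypothetical stand-in for the customer churn data
n <- 1000
customer_churn <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = round(runif(n, 20, 120), 2)
)

# Logical vector: TRUE for roughly 65% of the rows, FALSE for the rest
split_model <- sample.split(customer_churn$tenure, SplitRatio = 0.65)

# Rows flagged TRUE go to the training set, the others to the test set
train <- subset(customer_churn, split_model == TRUE)
test  <- subset(customer_churn, split_model == FALSE)

nrow(train)
nrow(test)
```

`sample.split()` tries to keep the ratio within each distinct value of the column it is given, so the training share lands close to 65% rather than exactly on it. That matches the 4573 and 2470 rows reported in the session.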
so we have our training and testing sets ready now it’s finally time to build a
linear regression model so I’ll use the lm function so the dependent variable is
tenure and the independent variables are monthly charges then we have gender following which we have internet service
then the final independent variable is contract
so we are trying to understand how tenure varies with respect to monthly
charges gender internet service and contract and the data would be train
that is on top of the train set we are building this linear regression model
and I will store this in an object called mod1 right so we have built
our multiple linear regression model now it’s time to predict the values on top
of this model and to do that I will use the predict function right predict so
the first parameter which it would take is the model which we just built so
mod1 next is the testing set on which we want to check the accuracy of the model okay
so I have given both of the parameters and I will store the result in let’s
say result1 now I will bind the actual values and
the predicted values into a common dataset
I’ll use the cbind function now let me take the original values so the original
values would be the tenure column from the test set and the predicted values
would be the values stored in the result1 object so I will name this column
actual and I will name this column predicted
right and I will store this in a new object with the name final_data1 and let’s see this View of final_data1 right
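The model-building and prediction steps can be sketched end to end as below. The data is simulated because the real churn file is not shown, so every column name and value here is a hypothetical stand-in; only `lm()`, `predict()`, `cbind()` and caTools' `sample.split()` are taken from the walkthrough.

```r
library(caTools)

set.seed(42)

# Hypothetical stand-in for the customer churn data; column names are assumed
n <- 1000
customer_churn <- data.frame(
  tenure          = sample(1:72, n, replace = TRUE),
  MonthlyCharges  = round(runif(n, 20, 120), 2),
  gender          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  InternetService = factor(sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE)),
  Contract        = factor(sample(c("Month-to-month", "One year", "Two year"), n, replace = TRUE))
)

split_model <- sample.split(customer_churn$tenure, SplitRatio = 0.65)
train <- subset(customer_churn, split_model == TRUE)
test  <- subset(customer_churn, split_model == FALSE)

# Multiple linear regression: tenure explained by four predictors
mod1 <- lm(tenure ~ MonthlyCharges + gender + InternetService + Contract,
           data = train)

# Predict tenure for the held-out rows
result1 <- predict(mod1, newdata = test)

# Put actual and predicted values side by side (cbind returns a matrix here)
final_data1 <- cbind(actual = test$tenure, predicted = result1)
head(final_data1)
```

Because the simulated columns are unrelated to tenure, the predictions here carry no real signal. The sketch only shows the mechanics of fitting on `train` and predicting on `test`.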
so these are the actual values of the tenure and these are the predicted
values of the tenure which we were able to predict with the help of the predict
function right so thus we have predicted the values with the help of the linear
regression model now we’ll go ahead and find out the accuracy again to find out
the accuracy first we need to find out all of the residuals then we need to
find out the root mean square error so let’s go ahead and find out the error in
this right but before that since this is actually a matrix I need to
convert this into a data frame so I’ll use the as.data.frame function and I
will convert this into a data frame I will store the result back to the same
object final_data1 right so now this is actually a data frame right so now I
will subtract the predicted values from the actual values to get the error in
prediction so final_data1$actual minus
final_data1$predicted and I will store this in an object called error1 okay so
let me have a glance at this View of error1
so this is the error in prediction guys now I’d also bind this error back to the
same dataset again cbind of final_data1 I
will bind it with the error column and I will store it back to the final_data1
object View of final_data1 now let’s see what
do we get so these are the actual values of the
tenure these are the predicted values of the tenure and this is the error in
prediction all right so now we can find the root mean square error so let’s do
that final_data1$error1 so first I’ll take the square of all of the
errors right now after this I need to take the
mean so let me type in mean over here following which I need to find out the
square root so I’ll type sqrt so it’s root mean
square error okay so the RMSE value which we get for the model which we
built is 16 so I will store this in rmse1 right so we built a model and
we’ve performed an accuracy check and we found out that the RMSE value with
respect to this model is 16 now what we’ll do is we’ll build another multiple
regression model with different set of independent variables and we’ll try to
determine which of the models is more accurate okay so this time the
independent variables which I will be taking would be partner phone service
total charges and payment method okay so again let me use the LM function and
again tenure would be the dependent variable and the independent variables
are partner then we have phone service following which we have total charges
then the final column would be payment method right so we are trying to understand how
tenure varies with respect to partner phone service total charges and
payment method and I will store this in mod2 okay right before that we would
also have to give the dataset so data would be train
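The accuracy check applied to the first model (error as actual minus predicted, then root mean square error) can be sketched on a handful of made-up numbers. The values and the object names `final_data`, `rmse` and `errors_with_gap` are hypothetical. The last lines show how `mean()` behaves when one of the errors is missing, which is standard base R behavior.

```r
# Hypothetical actual and predicted tenures (months) for five customers
final_data <- data.frame(
  actual    = c(12, 24, 36, 48, 60),
  predicted = c(15, 20, 30, 50, 66)
)

# Error in prediction: actual minus predicted
final_data$error <- final_data$actual - final_data$predicted

# Root mean square error: square the errors, take the mean, then the square root
rmse <- sqrt(mean(final_data$error^2))
rmse

# mean() returns NA if any value is missing; na.rm = TRUE drops missing values first
errors_with_gap <- c(final_data$error, NA)
sqrt(mean(errors_with_gap^2))                # NA
sqrt(mean(errors_with_gap^2, na.rm = TRUE))  # same value as rmse
```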
so we have built the second model now we’ll also do the prediction so predict
I’ll give in the first parameter which is the model which we just built then
this prediction needs to be on the test set and I will store all of the values
in result2 okay again I’ll bind the actual
values from the test set which is test$tenure that is the tenure column
from the test set and the predicted values which are stored in result2 again I
will name this column actual and the second column I’ll name
predicted right
I will store this in an object called final_data2 let me have a glance at
this View of final_data2 so guys these are the actual values of the
customers tenure and these are the predicted values of the customers tenure
okay so again we will go ahead and find out the error then we’ll find RMSE
value okay so before that we need to convert this to a data frame so
as.data.frame of final_data2 and I will store it back to final_data2 and that’s done so it’s finally time to get
our errors final_data2$actual minus final_data2$predicted
and I will store this in error2 now I’ll bind this error back to the same data
set so cbind of final_data2 and I will bind the error back to the same
dataset okay and I will store it back to the
same dataset which will be final_data2 so guys these are the actual values of the
customer’s tenure these are the predicted values of the customer’s tenure and this
column gives us the error in prediction again we will go ahead and find out the
RMSE value so final_data2$error2
then I will find the square of this after which we’ll get the mean then after finding the mean we’d have to
take the square root sqrt and I will store this in rmse2
just so that I can create some suspense right so first I will show you
guys the value of rmse1 which is 16 now there is something interesting over
here now for rmse2 we get NA now we’ve got this NA because there are some NA
values already present in the total charges column okay so I’m reiterating
guys since there are already some NA values present in the total charges
column that is why the RMSE value we’ve got is NA so to remove those we
need to use na.rm equals TRUE so let’s do that I’ll type na.rm
equals TRUE so this time the RMSE value which we get is 12.80
okay so rmse1 is 16 and rmse2 is 12.8 so this
basically means that the second model which we built is much much better than
the first model again because it’s a simple explanation over here RMSE two
is lower than RMSE one and that is why the second model is much better than
the first model so if we ever want to select four columns to determine the
tenure of the customer then it’ll be better to select partner phone service
total charges and payment method okay so this was a multiple linear regression
guys let’s take the scenario where we have three employees so the first
employee is Sam whose age is 20 and who earns $50,000 next is Bob who is 25
years old and earns $75,000 and the third employee is Matt who is 50 years old and
earns $100,000 now I’ll introduce a new employee to you whose age is 28 and ask
you what his salary is what would you do you would look at the general trend
between the age and salary and understand that as the age of the
employee increases his salary also increases well this is nothing but
regression we are trying to understand how a person’s age affects his
salary based on the historical data so over here salary is the dependent
variable and age is the independent variable that is
you’re trying to ascertain the salary of the employee with respect to the age
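The same reasoning can be written as a tiny regression using the three employees from the example. The data frame and variable names are made up for illustration, and with only three points the estimate is a rough illustration of the idea rather than a reliable prediction.

```r
# The three employees from the example
employees <- data.frame(
  name   = c("Sam", "Bob", "Matt"),
  age    = c(20, 25, 50),
  salary = c(50000, 75000, 100000)
)

# Salary is the dependent variable, age is the independent variable
fit <- lm(salary ~ age, data = employees)
coef(fit)  # a positive age coefficient means salary rises with age

# Estimated salary for the new 28-year-old employee
new_salary <- predict(fit, newdata = data.frame(age = 28))
new_salary
```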
let’s look at the second scenario here we have two students Rachel and Ross they
appear for an exam and Rachel manages to pass the exam while Ross fails now
what if another student let’s say Monica takes the same test would she be able
to clear the exam well you’d again look at the data
provided to you and see that Rachel being a girl was able to pass the exam
while Ross being a guy failed to clear it and on the basis of this data
you’d say there is a good probability for Monica to clear the exam as well so
this again is regression here you are finding out if the student has cleared the exam
based on the gender and hence result is the dependent variable over here and
gender is the independent variable so in simple terms regression helps
you to understand the extent of relationship between two variables so
now that we’ve understood what exactly is regression it’s time to understand
logistic regression so logistic regression is a regression technique where the
dependent variable is categorical that is we determine the probability of the
observation belonging to a particular category so let’s look at an example to
understand this better over here we’re trying to determine the probability of
rain based on independent variables such as temperature and humidity or in
other words we are choosing a category namely yes or no for the question will
it rain so this is logistic regression for you right now we’ll head
on and understand the difference between logistic regression and linear
regression in linear regression we fit a straight line that is for a given value
of x there definitely exists a y value which falls on the line or in other
terms we can say that there is a linear relationship between the Y variable and
X variable so you can take the example of an employee’s salary and age so let’s
say the x axis denotes the employee’s age and y axis denotes the employee’s salary
so as the employee’s age increases the employee’s salary would also
increase linearly so this is linear regression but what happens in logistic
regression is the dependent variable is categorical
and in the binary case the dependent variable can only have two values that is 0 or 1
now let’s understand this better so let’s say over here the x axis denotes
the number of runs scored by Virat Kohli and the y axis denotes whether Team
India has won the match or not let’s say this line over here denotes 50 runs so
what we can see from this graph is if Virat Kohli scores more than 50 runs in
a match then there is a greater probability for Team India to win the
match and if Virat Kohli scores less than 50 runs in the match then there is
a good probability that India might lose the match let’s actually take these two
points over here let’s say this is around 50 runs or 49 runs let’s say if Virat Kohli
scores 49 runs in the match then the probability of India
winning the match would also be around 48 percent or 50 percent now let’s take
this so let’s say this is around 60 odd runs so let’s say if Virat Kohli
scores 60 runs then the probability of India winning the match would be around
70 percent or 75 percent right so in linear regression we have a straight fit
line and in logistic regression we have an S curve so this S curve gives us the
probability of the result being true or false so it’s finally time to implement
the concept of logistic regression with R so let’s go ahead and do that we’ll
be working with the mtcars dataset to implement logistic regression so
let’s head to R studio right so this is what R studio looks like so let’s have a
glance at the mtcars dataset first so I’ll say View of mtcars right
this is the dataset now let’s understand this properly
so this is our dataset which has 32 observations or in other words we have
32 cars and these are the variables so mpg is the miles per gallon
cyl is the number of cylinders in the car disp is displacement then we have
hp which is horsepower drat which is the rear axle ratio wt which is the weight of the car qsec which is the quarter mile time
vs being whether the engine is v-shaped or straight am is for the transmission of
the car gear is the number of forward gears and carb is the number of carburetors so in
this example we would try to determine whether the car has a v-shaped engine or
a straight engine based on other variables so let’s go ahead and do that
so in the first case I would want to determine whether the car has V shape
engine or straight engine with respect to the mileage of the car so let’s go
ahead and do that to build the logistic regression
model in R we’ll be using the glm function so over here the glm function
takes in three parameters first is the formula so over here in the formula we
have stated vs tilde mpg that is vs is our dependent variable and mpg is our
independent variable so whatever variable is given on the left side of
the tilde symbol would be our dependent variable and whatever variables are
given on the right side of the tilde symbol would be the independent variables so we
are trying to determine the type of engine of the car with respect to the
miles per gallon of the car right so this was our first parameter the next is
obviously the data set which we are going to give so the data set is
mtcars and the third parameter is the family which decides the type of regression
technique so the family used over here is binomial that is the dependent variable
has just two values either 0 or 1 and we’ll store this result in model 1 so we have
successfully built the model now let’s analyze this model and to analyze this
model I will say summary of model 1 so this was the model which we built and we
have some values over here so let’s understand these values properly so I’ll
start with these three values over here so we have null deviance residual
deviance and AIC so what is null deviance so in simple terms null
deviance tells us how badly the model fits when the only term in the model is
the intercept or in other words we are trying to determine whether the car
has a v-shaped engine or a straight engine without any
independent variables so when we are not including any independent variables then
the deviance is 43 now it needs to be kept in mind that the null deviance or
any sort of deviance should be as low as possible
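The model built above can be sketched in a few lines. The name model_1 follows the narration, and the three arguments are the formula, the data and the family. Treat it as a minimal sketch rather than the exact code shown on screen.

```r
# vs (dependent) ~ mpg (independent); family = binomial gives logistic regression
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)

# coefficient table plus null deviance, residual deviance and AIC
summary(model_1)

# null deviance: the fit of a model that has only the intercept
null_dev <- model_1$null.deviance
null_dev   # roughly 43.9
```

The null deviance printed here is the 43 mentioned in the narration.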
now when we include this one variable in the formula then we get the residual
deviance so when we include mpg into the formula
then we see that the deviance has decreased from 43 to 25 and we also see
that the degrees of freedom have decreased from 31 to 30 the degrees of freedom
for the null deviance is nothing but the number of observations minus one so the
number of observations in the mtcars dataset was 32 thus the number of degrees of
freedom is 31 now when we include one variable over here the degrees of
freedom reduces by 1 and we get it to be 30 so basically the point that I’m
trying to make over here is initially the null deviance that is when we are not
including any variable then the null deviance is 43 when we include the
variable mpg into this formula then the deviance decreases and we get
the residual deviance to be 25 so this is the null deviance and the residual deviance
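The two deviances and their degrees of freedom can be pulled out of the fitted object directly instead of being read off the summary. This is a small sketch, and the table name fit_table is only an illustrative choice.

```r
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)

# one row for the intercept-only fit, one row for the fit that includes mpg
fit_table <- data.frame(
  model    = c("intercept only", "vs ~ mpg"),
  deviance = c(model_1$null.deviance, deviance(model_1)),
  df       = c(model_1$df.null, df.residual(model_1))
)
fit_table
```

Adding mpg costs one degree of freedom (31 to 30) and lowers the deviance, which is the comparison described above.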
then next we have AIC so AIC basically stands for Akaike information
criterion now this is helpful when we are comparing two or three models with respect to
each other so in simple terms basically the lower
the value of AIC the better the model right so these were the three
values now let’s actually understand these values over here so this over here
which you see so mpg then the estimate is zero point four three zero four that
is with every one unit increase in the miles per gallon value we see
zero point four three units of increase in the logit value of vs that is the log
odds because it’s logistic regression now this is the most
important value over here which is the p value so with the help of p value we try
to determine whether this variable is significant or not so these codes which
you see over here so if we just have one dot over here then that would mean that
the p value is below 0.1 or the variable is 90% significant if we have one star then
that would mean that it is 95% significant if we have two stars then it
would mean that the variable is 99% significant and if you have three stars
then it would mean that the variable is 99.9% significant right and since we
see that mpg has two stars over here then it is significant at a confidence
level of 99 percent right so we have built the model over here now it’s time
to check the accuracy of the model so what we’ll do is we’ll predict this on
some other data set so let’s go ahead and predict these values so now to predict
the values on some other data set we’ll be using the predict function this again
takes three parameters so the first parameter is the model which we built next it takes in the
independent variables so over here since the independent variable which we had was
mpg so we’ll take this mpg and create a data frame out of it it will just
comprise one row where we are trying to determine whether the car will
have a V type engine or a straight engine with respect to the mpg when it is 20 and
we’ll say type is equal to response because we need the probability of it
right so since vs is 1 for a straight engine the probability of the engine being
straight is 0.44 or 44% when the mpg
value is 20 right now what I’ll do is I’ll
use set of values ranging from 20 to 30 that is the miles per gallon value which
starts from 20 and then will go like 21 22 23 to 30 now let me determine what
would be the probability of the engine being straight with respect to these
values right so what we see over here is as the mpg value increases from 20 to 30
we also see that the probability of the engine being straight increases so the
probability is increasing from 44% to 98% and this is the maximum
over here so when the miles per gallon value is 30 then there is 98 percent
probability that the engine is straight now instead of miles per gallon let me
use another column to determine whether the engine is of V type or straight so now
I’ll be taking the hp column so this time the formula will be vs
tilde symbol hp that is the dependent variable is vs and the
independent variable is hp or in other terms I am trying to determine whether
the engine type is V or straight with respect to the horsepower of the car again the
data used is mtcars and family is binomial because I want logistic
regression for this so I’ll store this in model two now let me compare the
summary of model one and model two and let me look at them properly so summary of model
one and similarly summary of model two right
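The two single-variable models being compared here can be fitted side by side as in the sketch below. The names model_1 and model_2 follow the narration, while aic_table and dev_drop are illustrative names that are not in the original.

```r
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)
model_2 <- glm(vs ~ hp,  data = mtcars, family = binomial)

# AIC for both models in one table; lower is better
aic_table <- AIC(model_1, model_2)
aic_table

# drop from the shared null deviance to each residual deviance
dev_drop <- c(
  mpg = model_1$null.deviance - deviance(model_1),
  hp  = model_2$null.deviance - deviance(model_2)
)
dev_drop

# coefficient tables with estimates, standard errors and p values
coef(summary(model_1))
coef(summary(model_2))
```

Both models share the same null deviance because it depends only on vs, not on the predictor that is added afterwards.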
so let me first start off by comparing the AIC values of these two so AIC value
of model one is 29 and AIC value of model two is 20 that is the second model is
much better than model one because the AIC value reduces
by almost nine units and similarly if we look at the null deviance over here so
null deviance is same that is when we are not including any variable in the
formula but when we go ahead and include the variable so over here in model
one we included mpg and in model two we included hp so when we included horsepower
there is a greater reduction in residual deviance or in other terms we can see that
hp explains the engine type better than miles per gallon right so again let us go ahead
and predict some values with respect to horsepower right now I’d want to determine whether
the engine is V type or straight when the horsepower is 150 units right so again
three parameters first is the model which we built which is model two next is
the data frame or the set of input values and we are just giving one
horsepower value over here type is response because we need the
probability right so the probability of the engine
being straight is just 12% now this is quite low so what we’ll do is we’ll give
a set of values now and determine how this horsepower affects the type of
engine so this time I am giving three values for horsepower the first value is
150 second value is 100 third value is 50
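The predict calls described in this part of the session can be sketched as follows. The variable names p_mpg20, p_mpg_range and p_hp are illustrative. Because the mtcars help page codes vs as 1 for a straight engine, type = "response" returns the probability of a straight engine.

```r
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)
model_2 <- glm(vs ~ hp,  data = mtcars, family = binomial)

# single prediction: probability that vs is 1 when mpg is 20
p_mpg20 <- predict(model_1, data.frame(mpg = 20), type = "response")
p_mpg20   # about 0.44

# same model over mpg values 20, 21, ..., 30
p_mpg_range <- predict(model_1, data.frame(mpg = 20:30), type = "response")
round(p_mpg_range, 2)

# second model at three horsepower values
p_hp <- predict(model_2, data.frame(hp = c(150, 100, 50)), type = "response")
round(p_hp, 2)
```

The column name in the new data frame has to match the predictor name used in the formula, otherwise predict cannot find the variable.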
so what we see is as horsepower decreases there is a greater probability
for the engine to be straight so over here if horsepower is 150 then most probably
this type of car has a V type engine and if the horsepower is 50 units then most
probably this type of car has a straight engine right so this is how we
determined whether the engine is V type or straight
with respect to the horsepower now what I’m going to do is I’m going to
include both the horsepower and mpg into the same formula and see what happens so
this is my third model over here where the formula over here is this so again vs
is the dependent variable and hp and mpg are the independent variables that is
I want to determine how these two variables combined would affect the type
of engine used in the car again data is mtcars and the
family is binomial because I’m building a logistic regression model and I
store this in model three right now let me take out the summary of this
model as well so I’ll say summary of model three and let me have a look at this
so this is summary of model three after that is summary of model two and next
we have summary of model one so let me start
off by checking the AIC values
over here so for model one the AIC value is 29
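A sketch of the third model and the three-way AIC comparison follows. The exact formula shown on screen is not captured in the transcript, so `vs ~ hp + mpg` is an assumption based on the description of hp and mpg as the two independent variables.

```r
model_1 <- glm(vs ~ mpg,      data = mtcars, family = binomial)
model_2 <- glm(vs ~ hp,       data = mtcars, family = binomial)
# assumed formula: both predictors joined with +
model_3 <- glm(vs ~ hp + mpg, data = mtcars, family = binomial)

summary(model_3)

# AIC of all three models in one table; lower is better
aic_all <- AIC(model_1, model_2, model_3)
aic_all

# residual degrees of freedom: 32 observations minus 3 estimated coefficients
df.residual(model_3)
```

AIC adds a penalty for every extra coefficient, so the two-variable model only comes out ahead if its deviance drops by more than that penalty.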

100 Comments

  1. GOOD EVENING SIR
    I HAVE DOUBT SHOULD I USE R OR PYTHON FOR MACHINE LEARNING PLEASE REPLY SIR
    THANK YOU FOR MAKING AWSOME VIDEOS

  2. To get the job in this domain, do we need to have any experience,??
    Freshers have opportunities??

  3. This is a very detailed and informative video. I would like to request for the datasets or the links to the datasets in order to help continuing learners like myself to fasten on their skills. Thanks a lot in advance.

  4. Hi Iam from Electrical background and interested to learn data science. I knew Python very well and what should I learn next in order to get placed

  5. Hlw sir.im a student of Bsc chemistry hons.plz sir kindly tell me can i learn data science course?? Am i eligible for this? Plz sir tell me

  6. I have a few doubts:
    I currently work as a 3D artist I have knowledge on few programming languages (basic level) ,I also work with unity 3D(a lil knowledge on game programming as well),
    I want to expand my knowledge on programming and A.I to a deep level however my math qualification is limited to 10th standard only,
    Will it be possible to learn data science for me?
    Regards

  7. 👋 Guys everyday we upload high quality in depth tutorial on your requested topic/technology so kindly SUBSCRIBE to our channel👉( http://bit.ly/Intellipaat ) & also share with your connections on social media to help them grow in their career.🙂

  8. Data Science might have been useful when the humanity was static and the culture was cohesive . Increasingly humanities are advancing modifying the culture along with them. Don’t the prediction for such a dynamic environment require quantum sociology to be value?

  9. Hey Bharani! I came across this recommendation on YouTube and was surprised to see you on the thumbnail….9 hrs video on data science, great effort buddy!

  10. can you make a detailed video on Digital Marketing as well as different freelancing skills which if learnt via ur video can help one to freelance and earn while at home?? pls , hope to get ur reply asap pls… i want to learn various different freelancing skills so dt i can earn sitting at home sir 🙏🙏🙏🙏🙏🙏

  11. Excellent work.. thanks for uploading. Just subscribed.. please add some videos for data science for beginners with some technology.. thanks a lot in advance

  12. I have got so many things to study but my brain is not keeping up ! 8 hours and i start feeling fatigue! Great video! Will watch again and again!

  13. Hello friends, I completed BCA in 2014, after that I tried for civil services but finally didn't clear. Now I want to come back in IT. Kindly share ur valuable suggestions. Thanks…

  14. In the description of mtcars straight engine is taken as 1 and V-shaped engine is taken as 0. At time 44:30, after checking the accuracy of model_1, we found that the probability of being a V-shaped engine is 0.4440349 (as per the narrator verse). So how do you conclude that 0.440349 is the probability of finding V-shaped, not the straight engine?

  15. I cannot understand the set.seed concept. On the basis of which, a seed value is set? Is it any random number? Please reply.

  16. Hey i am a commerce student will it be any how beneficial for me to learn this ? i am eager to learn some extra things apart from basic academics

  17. Hello! I am a 2018 pass out from Electrical Engineering… I have not joined any IT companys till now as I was preparing for government exams…
    Now, I want to change my career path and want to learn Data Science related topics as it highly interested me …. So, is it advisable for someone like me with 0 experience to move ahead with this ? If yes , then what are the things I need to learn and how should I shape my career??

  18. I completed Bsc electronics and I don't know what to do next . Should I go for Msc electronics(Speciality in Artificial intelligence ) or MCA . However I'm not interested in those 2. I really wanted to go for civil engineering but my mom didn't allowed ne

  19. Dear sir please Tell me……if any one is unknown about all the languages and just have little Knowledge of Linux.
    So can I understand this Video lecture?

  20. Is there any platform that we can practice these tables by our own ? And if possible please teach us Complete R Language too.

  21. Woah! It took me 2 years of Masters to complete a Data Science degree and you guys are promising a course in 10 hours.

    The content is however good.

  22. sir I am 2018 BTech passed out and worked few months in non-technical process and I want to change my career into data science, is it the right choice for me

  23. Hey.. I am taking up Bioinformatics Masters program in Ireland.. The course would commence from Sept and I wanted to get a good hands on on programming, stats n algorithms.. Since my past academics involved all of biology and zero maths n programming I have very little knowledge of the challenges set in front of me.. I want to get a good hold on data science before my college begins.. How big a milestone would you think it would be me to get data science skills?

  24. Hy u r videos is assum but I'm BBA student should I changed my carrer in data science science is one of favourite subject.
    If any data scientist course in delhi either u teach me or give me address.
    I WANT TO change my carrier in scientific field plz sir help me.
    As a personal request to u

  25. Very nice video 👍👍👍👍
    You are great👏👏👏👏👏
    I like your channel 🙏🙏🙏
    Fantastic 🎁 🎁
    Very nice video i like your channel 👍👍👍👍👍👍👍👍👍👍👍👍
    I support you.
    I joined you 🙏🙏🙏🙏🙏🙏🙏
    I need support you 👑👑👑
