data science is one of the hottest jobs

of the 21st century with an average salary of 123 thousand dollars per year

let’s understand why according to LinkedIn the data science job profile is

among the top 10 jobs in the United States according to McKinsey the United

States alone faces a job shortage of 1.5 million data scientists similarly in

India according to the Economic Times the job postings for data science profiles

have grown over 400 times in the past 1 year so keeping this in mind we have

come up with an end to end session where we’ll learn all the major concepts of data

science so

today we’ll start off by understanding what exactly is data science and then

we’ll learn how to do data manipulation with the dplyr package and data

visualization with the ggplot2 package following which we’ll start with the

first machine learning algorithm which is linear regression and then we’ll

learn how to do binary classification with the logistic regression algorithm

going ahead we’ll learn about decision tree and random forests which are

basically tree based classifiers so once we’re done with all of these supervised

learning algorithms we’ll learn how to do unsupervised learning with the help

of the k-means clustering algorithm after that we’ll learn about user based

collaborative filtering and item based collaborative filtering which are

basically recommendation techniques following which we’ll learn how to find

frequent patterns with the help of Association rule mining and finally

we’ll go through some of the most frequently asked questions in the data

science interviews so what in the world is data science have you ever come

across huge amounts of data only to discard it because you thought it wasn’t

useful to you well my friend you couldn’t be more wrong data is always

useful to you we just need to dig through it to find meaningful insights

so data science is about using tools and techniques which help you to manipulate

and wrangle the data so that you can find something new and meaningful now

what if I gave you a really long spreadsheet containing all the sales figures for

the past three years it would be difficult for you to comprehend the data

wouldn’t it so instead of the spreadsheet what if I gave you some

charts and graphs related to annual sales you would obviously prefer the

graphs to the spreadsheet right this again is data science my friend

visualizing the data helps you get a better perspective and understand it

easily now every industry has its own problems and the top management needs to

take responsible and right decisions to stay ahead in the race so how does one

get good at decision making well that’s simple by going through the historical

data and understanding what worked and what did not and this is where data

science comes in data science brings along with it a bag of techniques which

make decision-making simple well we are always curious about the

future aren’t we now what if I told you we can predict what happens in the

future by analyzing the current data sounds like magic doesn’t it

well I’m no magician but I can definitely use data science concepts to

perform predictive analysis so in simple terms data science is a set of tools and

techniques with which we can make the data talk to us now let’s go ahead and

look at some data science use cases in the telecom industry customer attrition is a

major concern every day a new player comes into the market with new and attractive

prices in such a scenario how does a company retain its users so

this is where data science plays a pivotal role so the data scientists or

the data analysts go through the data to understand customer behavior they do a

thorough analysis of data usage patterns social media activity and voice call or

SMS patterns the data scientists also analyze customer demographics so that

proper segregation can be done in terms of age gender or geographic location

all of this analysis helps in providing the right offers to the right customers

and this in turn helps to retain the customers now let’s say you are at home

and eating a pizza and you suddenly get a message on your phone

stating that $10,000 was spent on your credit card to buy a diamond necklace and

asking you to verify that the transaction was actually done by you and you

immediately call up your bank and tell them that this transaction wasn’t done

by you and to hold the purchase now how did the bank know that it was a fraudulent

transaction well data science again folks the bank keeps a check on your

purchase pattern and whenever there’s a deviance in that pattern it flags it off

as an anomaly and immediately notifies you all of this courtesy of data

science let’s look at some languages to

implement some data science concepts first in the list is R R is one of the most widely used

languages for data science tasks R provides more than 10,000 packages for different

purposes such as data visualization data manipulation machine learning

statistical analysis and so on next in line is Python well Python and R

are actually close in competition Python also provides packages for deep learning such as

Keras and TensorFlow which help in creating deep neural networks quite

easily we can also use good old Java to perform data science tasks the

biggest reason why people use Java is speed and scalability with big data so

it’s finally time to implement all of these data science concepts with R so let’s go

ahead and do that and we’ll be working with the Diamonds

dataset to implement the data science concepts so let’s head to RStudio so

this is RStudio guys this is what RStudio looks like so we’ll start off with

data manipulation and since we want to work with the diamonds dataset we would

need to load the ggplot2 package so to load a package in R all you have to

use is the library function so I’ll say library of ggplot2 and the package has been loaded

now to do a bit of data manipulation we would also require the dplyr package so

I’ll say library of dplyr to load the dplyr package as well so

we have successfully loaded the ggplot2 and the dplyr packages now let’s

have a look at the diamonds dataset so View of diamonds alright this is our data

set let’s understand this data set first so you see that this is a dataset which

comprises around 54,000 diamonds price tells us the price of the diamond in US

dollars carat is the weight of the diamond cut is the quality of the cut

color is the diamond color clarity is a measurement of how clear the diamond

is and x y and z are the length width and depth of the diamond in millimeters
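
A minimal sketch of the setup described so far, assuming the ggplot2 and dplyr packages are already installed. The diamonds dataset ships with ggplot2. The calls to dim, str and head are used here in place of the interactive View so that the snippet also runs outside RStudio.

```r
# Load the two packages used in this walkthrough
library(ggplot2)  # provides the diamonds dataset and the plotting functions
library(dplyr)    # provides filter(), select() and summarise()

# Inspect the dataset (View(diamonds) opens the same data in the RStudio viewer)
dim(diamonds)         # number of rows and columns
str(diamonds)         # column names and types
head(diamonds)        # first few rows
levels(diamonds$cut)  # the cut quality levels, from Fair up to Ideal
```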

right so now it’s finally time to manipulate this data and get some

insights from it now from this dataset I would want to filter out only those

diamonds where the cut is ideal so over here we saw that the cut or the quality

of the cut can be fair good very good premium or ideal now I would want only

those diamonds which have the ideal cut so this is how I can do it now I’m using the filter function from

the dplyr package so what I’m doing over here is from the diamonds dataset I

am filtering out all those diamonds where the cut is ideal and I’m storing

it in an object and naming it ideal cut now let me have a look at this so I’ll say

View of ideal cut right so there are 21,551 diamonds

which have the ideal cut right now out of all of these diamonds I would want to

select those diamonds whose price is greater than 15,000 US dollars so from this dataset I want to

filter out those diamonds whose price is greater than 15,000 and I’m storing it

in the high price object well let me have a look at this
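
The two filtering steps described so far can be sketched as follows. The object names ideal_cut and high_price are assumed spellings of the names spoken in the session.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)    # for filter()

# Keep only the diamonds whose cut is "Ideal"
ideal_cut <- filter(diamonds, cut == "Ideal")
nrow(ideal_cut)

# From the ideal-cut diamonds, keep those priced above 15,000 US dollars
high_price <- filter(ideal_cut, price > 15000)
nrow(high_price)
```

Other price bands follow the same pattern, for example filter(ideal_cut, price > 10000 & price < 15000).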

so I’ll say View of high price and this is the new dataset where I see that

there are 531 diamonds whose price is greater than 15,000 US dollars now

similarly I would also want to select those diamonds whose price is between

10,000 and 15,000 and those priced less than 10,000 so I’ll follow the same process

so from the ideal cut dataset I am filtering out those diamonds whose price

is greater than 10,000 and less than 15,000 and I’m storing the result in

the medium price object let me have a look at this so I’ll say View of medium price right so this is the list of all those

diamonds whose price is in the range of 10,000 to 15,000 now similarly I will

select those diamonds whose price is less than 10,000 US dollars and view the result so there are 19,781 diamonds which

have a price less than ten thousand US dollars so this was a bit of data

manipulation where we just filtered out all those diamonds

whose cut was ideal and then we segregated them on the basis of their

price now what if I wanted to select only some specific columns from this

entire dataset now I would want just these three columns from this data set

so what do I do I use the select function from the dplyr

package so here what I am doing is from the diamonds data set I am

selecting only the x y and z columns and then storing it in diamond dimensions
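
A minimal sketch of this column selection, with diamond_dimensions as an assumed spelling of the object name used in the session.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)    # for select()

# Keep only the three dimension columns: x (length), y (width), z (depth)
diamond_dimensions <- select(diamonds, x, y, z)
head(diamond_dimensions)
```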

so View of diamond dimensions and I see that from the entire diamonds dataset I

have selected only the x y and z columns right now what I’ll do is from the

newly-created dataset I’ll filter out those entries where the length is

greater than eight and the width is greater than 58 so what I see is there is just one entry

or there is just one diamond whose length is greater than eight and whose

width is greater than 58 right now time to head on to the next data

manipulation technique now what we are doing is we are

filtering out the diamonds on the basis of their color so here we see that if

the color is represented as D then it is the best so from the diamonds data set

I am filtering out all those diamonds whose color is equivalent to D and I’ll

store it in diamond best color let me have a glance at this
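
The colour filter described here, together with the clarity filter and the average price step that the walkthrough goes on to, can be sketched as below. The object names are assumed spellings of the spoken names, and mean_price is a column name chosen for this sketch.

```r
library(ggplot2)  # for the diamonds dataset
library(dplyr)    # for filter() and summarise()

# Diamonds with the best colour grade, D
diamond_best_color <- filter(diamonds, color == "D")
nrow(diamond_best_color)

# Of those, diamonds with the best clarity grade, IF
diamond_best_clarity <- filter(diamond_best_color, clarity == "IF")

# Average price of the diamonds that pass both filters
avg_price <- summarise(diamond_best_clarity, mean_price = mean(price))
avg_price
```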

so I’ll say View of diamond best color right so there are six thousand seven

hundred and seventy five diamonds which have the best diamond color now out of all of

these diamonds I want those diamonds whose clarity is IF that is the

best clarity so from the diamond best color dataset I’ll filter out those

diamonds whose clarity is equal to IF and I’ll store it in diamond best

clarity now let me have a glance at this so I’ll say View of

diamond best clarity and this is the new dataset where the color is the best and

also the clarity is the best now from this new dataset I would want to find

the average price of the diamonds so I’ll use the summarise function so now what I’m

doing is from this new dataset I am finding out the mean of the price so the

mean price of the diamonds from this newly created data set is 8307 so this was data manipulation now it’s

time to head on to our next task where we’ll do a bit of visualization with the

help of the ggplot2 package at the start of the practical we had already loaded

the ggplot2 package and hence we do not need to load it again so over here what

I’m doing is building a univariate distribution with the help of a bar plot

so ggplot is the function where I am giving the dataset over here the data

set is diamonds and I would want to plot a bar plot with respect to the cut of

the diamonds this is the bar plot with respect to cut

so this function basically takes two arguments over here the first is the dataset

next is the aesthetics onto which we are going to map the columns of the

diamonds dataset and in the plotting function we are

directly going to give the color which is lightsalmon now what if I wanted to give separate

colors to each bar how would I do that that is actually quite simple so what

we’ll do is instead of assigning the color directly I’ll assign

the column cut to the fill attribute inside the aesthetic

attribute so now this gives me a different color for each bar of this

graph what basically we can conclude from this graph is that most of the

diamonds are of the ideal cut and the least number of diamonds are of the fair cut
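
The two bar plots described above can be sketched like this. The variable names p_single and p_by_cut are made up for this sketch.

```r
library(ggplot2)

# One fixed colour for every bar: fill is set outside aes()
p_single <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "lightsalmon")

# One colour per cut level: fill is mapped to the cut column inside aes()
p_by_cut <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

print(p_single)
print(p_by_cut)

# The counts behind the bars
cut_counts <- table(diamonds$cut)
cut_counts
```

Setting fill outside aes gives every bar the same colour, while mapping fill to a column inside aes lets ggplot2 pick one colour per level and add a legend.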

now we’ll do a bit of bivariate analysis with the help of a scatter plot so again we

are taking the Diamonds dataset and we are mapping the price column onto the

y-axis and we are mapping the carat column onto the x-axis and since we

want a scatter plot we will use the geom point function in addition to ggplot so this is the scatter plot so we see

that as the carat size increases the price of the diamond also increases now

let me make this graph more aesthetically pleasing so what I’ll do

is I’ll also add a color to this plot and to add a color I’ll just use the

col attribute inside the geom point function and assign the color palegreen4

to it so we have the result over here we have

successfully added the color to this plot now let’s do another bivariate

distribution where we’ll try to find out the distribution of price with respect

to the length of the diamond so over here the dataset used is diamonds and we are mapping

the price column onto the y axis and we are mapping the length of the diamond

onto the x-axis again the geometry used is geom point because we want a

scatter plot now we are assigning the color mediumorchid4 to it this is a similar plot to the previous

one so again what we can conclude is as the length of the diamond increases the

price of the diamond would also increase so linear regression is a predictive

modeling technique which is used whenever there is a linear relationship

between the independent and dependent variables and it is used in estimating

exactly how much Y will change when X changes by a certain amount like over here

we have a flower sepal length mapped onto the x axis and petal length mapped

onto the y axis and we are trying to understand how does petal length change

with respect to sepal length with the help of linear regression so let’s have

a better understanding of linear regression with this example so

let’s say there’s a telecom network called Neo and the delivery manager

of the company wants to find out if there’s a relationship between the

monthly charges of the customer and the tenure of the customer so he collects

all of the customer data and implements the linear regression algorithm by

taking monthly charges as the dependent variable and tenure as the independent

variable and after implementing the algorithm what we understand is that there is

a linear relationship between the monthly charges and the tenure of the

customer so as the tenure of the customer increases his monthly charges

would also increase now the best fit line helps the delivery manager to find

out interesting insights from the data with this he can predict the values of Y

for every new value of X so let’s say the tenure of the customer is 45 months

then with the help of the best fit line he can predict that his monthly charges

would be somewhere around $64 similarly if the customer’s tenure is 69 months

then his monthly charges would be around 110 dollars so this is how linear

regression works now that you’ve understood what exactly is linear

regression let’s go ahead and understand how can we find the best fit line so

this time we are trying to fit a linear line between the age of an employee and

his salary so the line could either be this this or this so how do we know

which of these is the best there could be infinite possibilities

right so this is where you need to have a look at the residual values so this

red line which you see over here this denotes the residual value which is

nothing but the difference between the actual values and the predicted values

now to find out the best fit line we have something known as residual sum of

squares so in residual sum of squares we take the square of all the residuals

and then we sum them up and this gives us the value of residual sum of squares

and whichever line has the lowest value of residual sum of squares it would be

considered as the best fit line so now we’ll learn how the coefficient of x

influences the relationship between independent variable and the dependent

variable so if it is simple linear regression and the value of the coefficient of x

is greater than 0 then the relationship between independent and dependent

variables would be positive that is as the value of x increases the value of Y

would also increase and if the coefficient of x is lower than 0 then

the relationship between independent and dependent variables would be negative

that is as the value of x increases the value of Y would decrease right so in

multiple linear regression we have more than one independent variable and we try

to determine how all of these independent variables together affect

the dependent variable like over here we have a mapping between Y x1 x2 and x3

where Y is the dependent variable and x1 x2 and x3 are the independent variables
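
The ideas of the best fit line, the residual sum of squares, the sign of the coefficient and multiple independent variables can be tried out in a few lines. This is a sketch on R's built-in iris data, which matches the sepal and petal example mentioned earlier. The helper rss and the alternative line with intercept -5 and slope 1.5 are made up purely for comparison.

```r
# Simple linear regression on the built-in iris data:
# petal length (dependent) explained by sepal length (independent)
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
b <- unname(coef(fit))  # b[1] is the intercept, b[2] the coefficient of x
b

# Residual sum of squares for any candidate line y = a + slope * x
rss <- function(a, slope) {
  predicted <- a + slope * iris$Sepal.Length
  sum((iris$Petal.Length - predicted)^2)
}

rss_best  <- rss(b[1], b[2])  # the line chosen by lm()
rss_other <- rss(-5, 1.5)     # an arbitrary alternative line
c(best = rss_best, other = rss_other)

# Multiple linear regression: several independent variables at once
fit_multi <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
                data = iris)
coef(fit_multi)
```

The fitted line always has the lowest residual sum of squares of any straight line, and a positive coefficient of x means the dependent variable rises as x rises.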

so let’s take this example to have a better understanding of multiple linear

regression so over here we are trying to understand what factors affect the

salary of an employee here salary is the dependent variable and gender age and

department are the independent variables so the linear regression model helps us

to determine the salary of an employee when specific values are given to age

gender and department so let’s go to RStudio and implement multiple linear

regression right we have RStudio right in front of

us and this is our customer churn data set all right so now since

we already know that before building a linear regression model we need to

divide our dataset into training and testing sets and to do that we would

require the caTools package so I’ll type library of caTools so we have

loaded the caTools package so now to build a linear regression model I will

take the tenure column as the dependent variable and I’ll try to

understand how the other columns affect the tenure of a customer right so I’ll

split this dataset with respect to the tenure column so I will use the sample.split

function now let me select this column customer churn dollar tenure

right and the split ratio which I’ll be giving is 0.65 and I will store

this in an object called split model okay so basically 65% of the

observations would get true values and the remaining 35% of the observations would get false values

so now that we’ve stored this result in split model I will divide the data set using the

subset function right now this takes in the first parameter as the data set now

from this data set wherever the value of split model is

equal to true I will store all of those observations in the training set right

similarly from the entire customer churn data set wherever the value of split

model is false I’ll select all of those observations and I will store those

observations in a dataset called test and thus we have our training and testing

sets ready now let me just have a look at the number of rows in training and

testing so nrow of train is 4573 and nrow of test is 2470 right
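
A sketch of this split, assuming the caTools package is installed. The customer churn file used in the session is not available here, so the data frame below is a made-up stand-in with assumed column names, and the row counts will differ from the ones quoted above.

```r
library(caTools)

# Hypothetical stand-in for the customer churn data used in the session
set.seed(42)
n <- 1000
customer_churn <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = round(runif(n, 20, 120), 2),
  gender         = sample(c("Female", "Male"), n, replace = TRUE)
)

# Logical vector: TRUE for roughly 65% of the rows, FALSE for the rest
split_model <- sample.split(customer_churn$tenure, SplitRatio = 0.65)

# Rows marked TRUE go to the training set, rows marked FALSE to the test set
train <- subset(customer_churn, split_model == TRUE)
test  <- subset(customer_churn, split_model == FALSE)

nrow(train)
nrow(test)
```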

so we have our training and testing sets ready now it’s finally time to build a

linear regression model so I’ll use the lm function so the dependent variable is

tenure and the independent variables are monthly charges then we have gender following which we have internet service

then the final independent variable is contract

so we are trying to understand how tenure varies with respect to monthly

charges gender internet service and contract and the data would be train

that is on top of the train set we are building this linear regression model

and I will store this in an object called mod 1 right so we have built

our multiple linear regression model now it’s time to predict the values on top

of this model and to do that I will use the predict function right predict so

the first parameter which it would take is the model which we just built so mod

1 next is the testing set on which we want to check the accuracy of the model okay

so I have given both of the parameters and I will store the result in let’s

say result 1 now I will bind the actual values and

the predicted values into a common data set

I’ll use the cbind function now let me take the original values so the original

values would be the tenure column from the test set and the predicted values

would be the values stored in the result 1 object so I will name this column as

actual and I will name this column as predicted

right and I will store this in a new object with the name final data 1 and let me view this View of final data 1 right
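
The model building, prediction and binding steps above can be sketched end to end. As before the data frame is a made-up stand-in for the customer churn data, and the column names MonthlyCharges, InternetService and Contract are assumptions about how the spoken names are spelled in the real file.

```r
library(caTools)

# Hypothetical stand-in for the customer churn data (column names assumed)
set.seed(1)
n <- 500
customer_churn <- data.frame(
  MonthlyCharges  = round(runif(n, 20, 120), 2),
  gender          = sample(c("Female", "Male"), n, replace = TRUE),
  InternetService = sample(c("DSL", "Fiber optic", "No"), n, replace = TRUE),
  Contract        = sample(c("Month-to-month", "One year", "Two year"), n,
                           replace = TRUE)
)
customer_churn$tenure <- pmax(
  1, round(0.3 * customer_churn$MonthlyCharges + rnorm(n, 15, 10))
)

split_model <- sample.split(customer_churn$tenure, SplitRatio = 0.65)
train <- subset(customer_churn, split_model == TRUE)
test  <- subset(customer_churn, split_model == FALSE)

# tenure is the dependent variable, the other four are independent variables
mod1 <- lm(tenure ~ MonthlyCharges + gender + InternetService + Contract,
           data = train)

# Predict tenure for the held-out test rows
result1 <- predict(mod1, newdata = test)

# Actual and predicted values side by side (cbind returns a matrix here)
final_data1 <- cbind(Actual = test$tenure, Predicted = result1)
head(final_data1)
```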

so these are the actual values of the tenure and these are the predicted

values of the tenure which we were able to predict with the help of the predict

function right so thus we have predicted the values with the help of the linear

regression model now we’ll go ahead and find out the accuracy again to find out

the accuracy first we need to find out all of the residuals then we need to

find out the root mean square error so let’s go ahead and find out the error in

this right but before that since this is actually a matrix I need to

convert this into a data frame so I’ll use the as.data.frame function and I

will convert this into a data frame I will store the result back to the same

object final data 1 right so now this is actually a data frame right so now I

will subtract the predicted values from the actual values to get the error in

prediction so final data 1 dollar actual minus final data 1

dollar predicted and I will store this in an object called error 1 okay so

let me have a glance at this View of error 1
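
Why the conversion is needed, and how the error column is built, can be seen on a tiny example. The numbers below are made up for illustration and are not values from the session.

```r
# Toy actual and predicted values (made up for illustration)
actual    <- c(12, 30, 45, 60)
predicted <- c(15, 28, 50, 54)

final_data <- cbind(Actual = actual, Predicted = predicted)
is.matrix(final_data)  # cbind of plain vectors gives a matrix

# The $ operator works on data frames, not on matrices, hence the conversion
final_data <- as.data.frame(final_data)

# Error in prediction = actual value minus predicted value
error <- final_data$Actual - final_data$Predicted

# Bind the error back to the same data frame as a third column
final_data <- cbind(final_data, error)
final_data
```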

so this is the error in prediction guys now I’d also bind this error back to the

same data set again cbind of final data 1 I

will bind it with the error column and I will store it back to the final data 1

object View of final data 1 now let’s see what

do we get so these are the actual values of the

tenure these are the predicted values of the tenure and this is the error in

prediction all right so now we can find the root mean square error so let’s do

that final data 1 dollar error 1 so first I’ll take the square of all of the

errors right now after this I need to take the

mean so let me type in mean over here following which I need to find out the

square root so I’ll type sqrt so it’s root mean

square error okay so the RMSE value which we get for the model which we

built is 16 so I will store this in RMSE 1 right so we built a model and

we’ve performed an accuracy check and we found out that RMSE value with

respect to this model is 16 now what we’ll do is we’ll build another multiple

regression model with a different set of independent variables and we’ll try to

determine which of the models is more accurate okay so this time the

independent variables which I will be taking would be partner phone service

total charges and payment method okay so again let me use the lm function and

again tenure would be the dependent variable and the independent variables

are partner then we have phone service following which we have total charges

then the final column would be payment method right so we are trying to understand how

does tenure vary with respect to partner phone service total charges and

payment method and I will store this in mod 2 okay right before that we would

also have to give the data set so data would be train

so we have built the second model now we’ll also do the prediction so predict

I’ll give in the first parameter which is the model which we just built then

this prediction needs to be on the test set and I will store all of the values

in result 2 okay this time I’ll bind the actual

values from the test set which is test dollar tenure that is the tenure column

from test set and the predicted values which are stored in result two again I

will name this column as actual the second column I’ll name it as

predicted right

I will store this in an object called as final data 2 let me have a glance at

this view of final data 2 so guys these are the actual values of the

customers tenure and these are the predicted values of the customers tenure

okay so again we will go ahead and find out the error then we’ll find RMSE

value okay so before that we need to convert this to a data frame so

as.data.frame of final data 2 and I will store it back to final data 2 all done so it’s finally time to get

our errors final data 2 dollar actual minus final data 2 dollar predicted

and I will store this in error 2 now I’ll bind this error back to the same data

set so cbind of final data 2 and I will bind the error back to the same

data set okay and I will store it back to the

same data set which will be final data 2 so guys these are the actual values of the

customers tenure these are the predicted values of the customers tenure and this

column gives us the error in prediction again we will go ahead and find out the

RMSE value so final data 2 dollar

error 2 then I will find the square of this after which we’ll get the mean then after finding the mean we’d have to

take the square root sqrt and I’ll store this in RMSE 2

but just so that I can create the suspense right so first I will show you

guys the value of RMSE 1 which is 16 now there is something interesting over

here now RMSE 2 we get NA now we’ve got this NA because there are some NA

values already present in the total charges column okay so I’m reiterating

guys so since there are already some NA values present in the total charges

column that is why the RMSE value we’ve got is NA so to remove that we

need to use na.rm equals TRUE so let’s do that I’ll type na.rm

equals TRUE so this time the RMSE value which we get is twelve point eight zero
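
The effect of a single missing value on the RMSE, and the na.rm fix, can be reproduced on a toy example. The numbers below are made up for illustration and are not values from the session.

```r
# Toy values (made up); one prediction is missing, as happens when a
# predictor such as total charges contains NA for that row
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, NA, 44)

error <- actual - predicted

# Without na.rm the single NA propagates through mean() and sqrt()
rmse_with_na <- sqrt(mean(error^2))

# na.rm = TRUE drops the missing error before averaging
rmse <- sqrt(mean(error^2, na.rm = TRUE))

rmse_with_na
rmse
```

Dropping the NA rows means the second model is scored on slightly fewer test rows than the first, which is worth keeping in mind when the two RMSE values are compared.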

okay so RMSE 1 is sixteen and RMSE 2 is twelve point eight so this

basically means that the second model which we built is much much better than

the first model again because it’s a simple explanation over here RMSE two

is lower than RMSE one and that is why the second model is much better than

the first model so if we ever want to select four columns to determine the

tenure of the customer then it’ll be better to select partner phone service

total charges and payment method okay so this was a multiple linear regression

guys let’s take the scenario where we have three employees so the first

employee is Sam whose age is 20 and who earns $50,000 next is Bob who is 25

years old and earns $75,000 and the third employee is Matt who is 50 years old and

earns $100,000 now I’ll introduce a new employee to you whose age is 28 and ask

you what his salary is what would you do you would look at the general trend

between the age and salary and understand that as the age of the

employee increases his salary also increases well this is nothing but

regression we are trying to understand how does a person’s age affect his

salary based on the historical data so over here salary is the dependent

variable and age is the independent variable that is

you’re trying to ascertain the salary of the employee with respect to the age
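
The three-employee scenario is small enough to fit directly. This is a sketch that fits a straight line through the three points from the example and reads off an estimate for the new 28 year old. The variable names are made up for this sketch.

```r
# The three employees from the example above
employees <- data.frame(
  name   = c("Sam", "Bob", "Matt"),
  age    = c(20, 25, 50),
  salary = c(50000, 75000, 100000)
)

# salary is the dependent variable, age is the independent variable
fit <- lm(salary ~ age, data = employees)
coef(fit)

# Estimated salary for the new employee whose age is 28
new_salary <- predict(fit, newdata = data.frame(age = 28))
new_salary
```

With only three data points this is just an illustration of the general trend, not a reliable estimate.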

let’s look at the second scenario here we have two students Rachel and Ross they

appear for an exam and Rachel manages to pass the exam while Ross fails now

what if another student let’s say Monica takes the same test would she be able to

clear the exam well you’ll again look at the data

provided to you and see that Rachel being a girl was able to pass the exam

while Ross being a guy failed to clear it and on the basis of this data

you’d see there is a good probability for Monica to clear the exam as well so

this again is regression you are finding out if the student has cleared the exam

based on the gender and hence result is the dependent variable over here and

gender is the independent variable so in simple terms regression helps

you to understand the extent of relationship between two variables so

now that we’ve understood what exactly is regression it’s time to understand

logistic regression so logistic regression is a regression technique where the

dependent variable is categorical that is we determine the probability of the

observation belonging to a particular category so let’s look at an example to

understand this better over here we’re trying to determine the probability of

rain based on the independent variables temperature and humidity or in

other words we are choosing a category namely yes or no for the question will

it rain so this is logistic regression for you right now we’ll head

on and understand the difference between logistic regression and linear

regression in linear regression we fit a straight line that is for a given value

of x there definitely exists a Y value which falls on the line or in other

terms we can say that there is a linear relationship between the Y variable and

X variable so you can take the example of an employee’s salary and age so let’s

say the x axis denotes the employee’s age and y axis denotes the employee’s salary

so as the employee’s age increases the employee’s salary would also

increase linearly so this is linear regression but what happens in logistic

regression here the dependent variable is categorical

and hence the dependent variable can only have two values that is either 0 or 1

now let’s understand this better so let’s say over here the x axis denotes

the number of runs scored by Virat Kohli and the y axis denotes whether Team

India has won the match or not let’s say this line over here denotes 50 runs so

what we can see from this graph is if Virat Kohli scores more than 50 runs in

a match then there is a greater probability for Team India to win the

match and if Virat Kohli scores less than 50 runs in the match then there is

a good probability that India might lose the match let’s actually take these two

points over here let’s say this is around 50 runs or 49 runs let’s say if Virat Kohli

scores 49 runs in the match then the probability of India

winning the match would be around 48 percent or 50 percent now let’s take

this so let’s say this is around 60 odd runs so let’s say if Virat Kohli

scores 60 runs then the probability of India winning the match would be around

70 percent or 75 percent right so in linear regression we have a straight fit

line and in logistic regression we have an S curve so this S curve gives us the

probability of the result being true or false so it’s finally time to implement

the concept of logistic regression with R so let’s go ahead and do that we’ll

be working with the mtcars dataset to implement logistic regression so

let’s head to RStudio right so this is what RStudio looks like so let’s have a

glance at the mtcars dataset first so I’ll say View of mtcars right

this is the dataset now let’s understand this properly

so this is our data set which has 32 observations or in other words we have

32 cars and these are the variables so mpg is the miles per gallon

cyl is the number of cylinders in the car disp is the displacement then we have

hp which is horsepower drat which is the rear axle ratio wt the weight of the car qsec the quarter mile time

vs being whether the engine is v-shaped or straight am for the transmission of

the car gear the number of forward gears and carb the number of carburetors so in

this example we would try to determine whether the car has a v-shaped engine or

a straight engine based on other variables so let’s go ahead and do that

so in the first case I would want to determine whether the car has a v-shaped

engine or straight engine with respect to the mileage of the car so let’s go

ahead and do that to build the logistic regression

model in R we’ll be using the glm function so over here the glm function

takes in three parameters first is the formula so over here in the formula we

have stated vs tilde mpg that is vs is our dependent variable and mpg is our

independent variable so whatever variable is given on the left side of

the tilde symbol would be our dependent variable and whatever variables are

given on the right side of the tilde symbol would be the independent variables so we

are trying to determine the type of engine of the car with respect to the

miles per gallon of the car right so this was our first parameter the next is

obviously the data set which we are going to give so the data set is mtcars

and the third parameter is the family that is the type of regression technique so the type

of regression technique used over here is binomial that is we are going to have

just two values either 0 or 1 and we’ll store this result in model 1 so we have

successfully built the model now let’s analyze this model and to analyze this

model I will say summary of model 1 so this was the model which we built and we

have some values over here so let’s understand this values properly so I’ll

start with these three values over here so we have null deviance residual

deviance and AIC so what is null deviance so in simple terms null

deviance gives up the accuracy of the model when the independent variable is

just the intercept or in other words we are trying to determine whether the car

has v-shaped engine or s shape engine without any variables or without any

independent variables so when we are not including any independent variables then

the deviance is 43 now it needs to be kept in mind that the null deviance or

any sort of deviance should be as less as possible
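The model-building step above can be sketched as follows. The object name `model_1` is an assumption standing in for the spoken "model 1"; `glm`, `summary` and the `null.deviance` and `df.null` components are standard base R.

```r
# Fit vs (a 0/1 outcome) on mpg with a binomial family,
# which is logistic regression.
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Printed summary: coefficients, null and residual deviance, AIC.
summary(model_1)

# The null deviance (intercept-only model) and its degrees of freedom
# can also be read directly off the fitted object.
model_1$null.deviance
model_1$df.null
```

Reading the components off the object is handy when several models need to be compared programmatically instead of by eye.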

Now when we include one variable in the formula, we get the residual deviance. When we include mpg in the formula, we see that the deviance has decreased from 43 to 25, and the degrees of freedom have decreased from 31 to 30. For the null model, the degrees of freedom is nothing but the number of observations minus one; the number of observations in the mtcars dataset is 32, thus the degrees of freedom is 31. When we include a variable, the degrees of freedom reduce by 1 and we get 30. So the point I'm trying to make is that initially, when we are not including any variable, the null deviance is 43, and when we include the variable mpg in the formula, the deviance decreases and we get a residual deviance of 25. So these are the null deviance and the residual deviance.

Next we have AIC, which stands for Akaike information criterion. This is helpful when we are comparing two or three models with each other. In simple terms, the lower the value of AIC, the better the model. So these were the three values.

Now let's understand the coefficient values over here. For mpg the estimate is 0.4304; that is, with every one-unit change in the miles per gallon value we see 0.43 units of change in the logit value of vs, because this is logistic regression.

Now this is the most important value over here, which is the p-value. With the help of the p-value we try to determine whether the variable is significant or not. Look at the significance codes: if we have just one dot, the variable is significant at the 90% level; one star means 95%; two stars mean 99%; and three stars mean 99.9%. Since mpg has two stars, it is significant at a confidence level of 99 percent.

So we have built the model; now it's time to check it. What we'll do is predict on some other data, so let's go ahead and predict these values. To predict we'll be using the predict function, which again takes three parameters. The first parameter is the model which we built. The next takes in the independent variables: since the independent variable we had was mpg, we'll take mpg and create a data frame out of it, which will comprise just one row. We are trying to determine whether the car will have a V-type engine or a straight engine when the mpg is 20. And we'll say type = "response", because we need the probability. Since vs is coded 1 for a straight engine and 0 for a V-shaped engine, the value we get is the probability of the engine being straight, and it is 0.44, or 44%, when the mpg value is 20.

Now I'll use a set of values ranging from 20 to 30, that is, miles per gallon values which start from 20 and go 21, 22, 23, up to 30, and determine the probability for each of these values. What we see is that as the mpg value increases from 20 to 30, the probability of the engine being straight also increases, from 44% to 98%, and this is the maximum over here: when the miles per gallon value is 30, there is a 98 percent probability that the engine is straight.

Now instead of miles per gallon, let me use another column to determine whether the engine is V type or straight. This time I'll be taking the hp column, so the formula will be vs ~ hp; that is, the dependent variable is vs and the independent variable is hp. In other terms, I am trying to determine whether the engine type is V or straight with respect to the horsepower of the car. Again the data used is mtcars and the family is binomial, because I want logistic regression. I'll store this in model 2. Now let me compare the summaries of model 1 and model 2: summary of model 1, and similarly summary of model 2.
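The reading of the first model's summary and the prediction step described above can be sketched like this. The names `model_1`, `p20`, `p20_manual` and `p_range` are assumptions made for illustration. The manual calculation shows what "logit value" means: the fitted line gives log-odds, and the inverse logit (`plogis`) turns them into a probability of vs = 1, which the mtcars documentation defines as a straight engine.

```r
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)

# Residual deviance and its degrees of freedom after adding mpg,
# followed by the AIC of the fitted model.
deviance(model_1)
df.residual(model_1)
AIC(model_1)

# Probability of vs = 1 at mpg = 20. type = "response" returns a
# probability instead of the logit (log-odds).
p20 <- predict(model_1, newdata = data.frame(mpg = 20), type = "response")

# The same number by hand: inverse logit of intercept + slope * 20.
b <- coef(model_1)
p20_manual <- plogis(b["(Intercept)"] + b["mpg"] * 20)

# Probabilities for mpg = 20, 21, ..., 30.
p_range <- predict(model_1, newdata = data.frame(mpg = 20:30),
                   type = "response")
round(p_range, 2)
```

Leaving out `type = "response"` would return values on the logit scale, which is a common source of confusion when the numbers fall outside the 0 to 1 range.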

Let me first start off by comparing the AIC values of these two. The AIC value of model 1 is 29 and the AIC value of model 2 is 20; that is, the second model is noticeably better than model 1, because the AIC value reduces by almost nine units. Similarly, if we look at the null deviance over here, it is the same for both, because that is when we are not including any variable in the formula. But when we go ahead and include a variable, mpg in model 1 and hp in model 2, we see that including hp gives a greater reduction in residual deviance; in other terms, hp is a better predictor of vs than miles per gallon.

So again let us go ahead and predict some values with respect to horsepower. I want to determine whether the engine is V type or straight when the horsepower is 150 units. Again there are three parameters: first is the model which we built, which is model 2; next is the data frame of input values, where we are giving just one horsepower value; and type is "response" because we need the probability. So the probability of vs being 1, that is, of the engine being straight, is just around 12%. Now this is quite low, so what we'll do is give a set of values and determine how horsepower affects the type of engine. This time I am giving three values for horsepower: the first value is 150, the second value is 100 and the third value is 50.
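A hedged sketch of the comparison and the horsepower predictions described above. The names `model_1`, `model_2` and `p_hp` are assumptions for illustration; `AIC`, `deviance` and `predict` are standard base R. As before, the predicted value is the probability of vs = 1, which the mtcars documentation defines as a straight engine.

```r
model_1 <- glm(vs ~ mpg, data = mtcars, family = binomial)
model_2 <- glm(vs ~ hp,  data = mtcars, family = binomial)

# Side-by-side AIC; the lower value indicates the better model.
AIC(model_1, model_2)

# Both models share the same null deviance but differ in residual deviance.
c(null = model_1$null.deviance,
  mpg  = deviance(model_1),
  hp   = deviance(model_2))

# Predicted probability of vs = 1 at hp = 150, 100 and 50.
p_hp <- predict(model_2, newdata = data.frame(hp = c(150, 100, 50)),
                type = "response")
round(p_hp, 2)
```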

What we see is that as horsepower decreases, the probability of vs being 1, that is, of the engine being straight, increases. If the horsepower is 150, then most probably this type of car has a V-shaped engine, and if the horsepower is 50 units, then most probably this type of car has a straight engine. So this is how we determine whether the engine is V type or straight with respect to the horsepower.

Now I'm going to include both hp and mpg in the same formula and see what happens. This is my third model, where vs is again the dependent variable and hp and mpg are the independent variables; that is, I want to determine how these two variables combined affect the type of engine used in the car. Again the data is mtcars and the family is binomial, because I'm building a logistic regression model, and I store this in model 3.

Let me take out the summary of this model as well, so I'll say summary of model 3, and let me have a look at all three: summary of model 3, summary of model 2 and summary of model 1. Let me start off by checking the AIC values over here. For model 1 the AIC value is 29.
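A minimal sketch of the third model and the three-way AIC comparison. The exact formula is not spoken in the session, so the term order `hp + mpg` and the names `model_3` and `aic_table` are assumptions; the order of terms does not change the fit.

```r
model_1 <- glm(vs ~ mpg,      data = mtcars, family = binomial)
model_2 <- glm(vs ~ hp,       data = mtcars, family = binomial)
model_3 <- glm(vs ~ hp + mpg, data = mtcars, family = binomial)

# Coefficients, deviances and AIC for the two-predictor model.
summary(model_3)

# AIC for all three models in one table; lower is better.
aic_table <- AIC(model_1, model_2, model_3)
aic_table
```

Because AIC penalizes each extra parameter, adding a second predictor lowers it only when the improvement in fit outweighs that penalty.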

GOOD EVENING SIR

I HAVE DOUBT SHOULD I USE R OR PYTHON FOR MACHINE LEARNING PLEASE REPLY SIR

THANK YOU FOR MAKING AWSOME VIDEOS

I have doubt of what is the purpose for split ratio

what is the purpose of k means clustering in the dataset

I have doubt of how to find item based collaborative in cosine similarity

To get the job in this domain, do we need to have any experience,??

Freshers have opportunities??

Subscribe this channel

100/100

Sir plz make a video on Full course of-

1-Java

2.net

3-c++

4-python

5- c language etc

can u give us the datasets which are used in this video, so we can practice by seeing the video.

This is a very detailed and informative video. I would like to request for the datasets or the links to the datasets in order to help continuing learners like myself to fasten on their skills. Thanks a lot in advance.

Great job guys👍. Pl make a vedio for sap fico.

many many thanks

you people are great

hats off to you

Hi Iam from Electrical background and interested to learn data science. I knew Python very well and what should I learn next in order to get placed

Hlw sir.im a student of Bsc chemistry hons.plz sir kindly tell me can i learn data science course?? Am i eligible for this? Plz sir tell me

Hey please provide subtitles for this video

I want to learn data science but my working experience is from non it programming background.

Is this for machine learning engineering also?

OK , so after going through what certificate can I take too verify my skill

Great work keep it up

I have a few doubts:

I currently work as a 3D artist I have knowledge on few programming languages (basic level) ,I also work with unity 3D(a lil knowledge on game programming as well),

I want to expand my knowledge on programming and A.I to a deep level however my math qualification is limited to 10th standard only,

Will it be possible to learn data science for me?

Regards


which RStudio you are using?

Data Science might have been useful when the humanity was static and the culture was cohesive . Increasingly humanities are advancing modifying the culture along with them. Don’t the prediction for such a dynamic environment require quantum sociology to be value?

Thanks for the video but can you please make it same with python and upload it.

Hey Bharani! I came across this recommendation on YouTube and was surprised to see you on the thumbnail….9 hrs video on data science, great effort buddy!

Play this video on 2x speed so that you can be a Data Scientist in 5 hours.

can you make a detailed video on Digital Marketing as well as different freelancing skills which if learnt via ur video can help one to freelance and earn while at home?? pls , hope to get ur reply asap pls… i want to learn various different freelancing skills so dt i can earn sitting at home sir 🙏🙏🙏🙏🙏🙏

Excellent work.. thanks for uploading. Just subscribed.. please add some videos for data science for beginners with some technology.. thanks a lot in advance

Wow 9+ hours video with 0 ads.

Respect!!!

I have got so many things to study but my brain is not keeping up ! 8 hours and i start feeling fatigue! Great video! Will watch again and again!

😂😂😂😂😂😂

Which language is best for data science?

thank you

#5:09

Hello, I could not understand the 65-35 ratio while you form the object split_model (21:40)? Please describe.

Add this video as parts.

So it can be easy to watch.

difficult to continue watching entire video

Make 1 hr parts

Great

M a android developer…

Can i become a data scientist. Guide me pls if posiible

You gotta be Indian to even attempt this

Hello friends, I completed BCA in 2014, after that I tried for civil services but finally didn't clear. Now I want to come back in IT. Kindly share ur valuable suggestions. Thanks…

In the description of mtcars straight engine is taken as 1 and V-shaped engine is taken as 0. At time 44:30, after checking the accuracy of model_1, we found that the probability of being a V-shaped engine is 0.4440349 (as per the narrator verse). So how do you conclude that 0.440349 is the probability of finding V-shaped, not the straight engine?

Hi ,, Could you please send me the resume format related to data science

in option trading strategy, data analysis and backtesting. how to useful. plz reply me

Wow ppl take like 12 months to become data scientists…but now we do it in 9 hours…wasted my life😁😁😁😁

Is python and data science for python are different?

I cannot understand the set.seed concept. On the basis of which, a seed value is set? Is it any random number? Please reply.

Will u provide any certificate course???

Thanks.

Can you provide a certification and projects on this it could really be helpful….

Sir, please make it in Hindi

Hey i am a commerce student will it be any how beneficial for me to learn this ? i am eager to learn some extra things apart from basic academics

Sir iam in class12 with science stream maths can I learn data science

Could you make this in Hindi, sir?

Is this enough to go for an interview after msc maths

Hello! I am a 2018 pass out from Electrical Engineering… I have not joined any IT companys till now as I was preparing for government exams…

Now, I want to change my career path and want to learn Data Science related topics as it highly interested me …. So, is it advisable for someone like me with 0 experience to move ahead with this ? If yes , then what are the things I need to learn and how should I shape my career??

I am a student from commerce can i am able to do this course

I completed Bsc electronics and I don't know what to do next . Should I go for Msc electronics(Speciality in Artificial intelligence ) or MCA . However I'm not interested in those 2. I really wanted to go for civil engineering but my mom didn't allowed ne

Sir I have graduationin economics.can I choose data. Scientist as my career.plz. Reply

I am from ETL testing background.. Is this course enough to switch into data scientist role….

Sir, can u plzz tell us how to learn MS excel used for Data Science

Guys, is this right for absolute beginners?

I'm joining Manipal prolearn for data science. Kindly enlighten someone please

Sir,, am I mechanical engineer can I learn data science??

Who else think the thumbnail gut looks like spiderman tobey maguire..??

Mechanical engineer with approx 0 knowledge of any computer language how should I start

Brilliant explanation

Useful

Can you please make a video like this on full stack web development?

Nice video for students

How can we get a certificate? As a fresher it is important to get certified.

Dear sir please Tell me……if any one is unknown about all the languages and just have little Knowledge of Linux.

So can I understand this Video lecture?

What are the Pre knowledge requirements for start this video??

Non CSE background

No subtitle

How to join complete training

How to know which library to use at what situation , how to remember best libraries

bro it is useful but how can we do practice this data tables and graphs practically?

Is there any platform that we can practice these tables by our own ? And if possible please teach us Complete R Language too.

Kindly speak in hindi plzz

Then we share your videos

awesome video

In Hindi, sir

Woah! It took me 2 years of Masters to complete a Data Science degree and you guys are promising a course in 10 hours.

The content is however good.

How will I understand the course? I need help, please help my son


You provide certificate this tutorial

nice but .. wish it was in python. 🙁

sir I am 2018 BTech passed out and worked few months in non-technical process and I want to change my career into data science, is it the right choice for me

Hi Guy. Thanks for your efforts. Will this course cover most of the Data Science.

No top job in java only.l ,but after 5 years demands of data scientist more .

Hey.. I am taking up Bioinformatics Masters program in Ireland.. The course would commence from Sept and I wanted to get a good hands on on programming, stats n algorithms.. Since my past academics involved all of biology and zero maths n programming I have very little knowledge of the challenges set in front of me.. I want to get a good hold on data science before my college begins.. How big a milestone would you think it would be me to get data science skills?

Hy u r videos is assum but I'm BBA student should I changed my carrer in data science science is one of favourite subject.

If any data scientist course in delhi either u teach me or give me address.

I WANT TO change my carrier in scientific field plz sir help me.

As a personal request to u

G…

Y u r making tutorial data science? , if u say 123,000 dollars salary, y u havent got a job?

Good morning Sir

Too, long video

Next video

Become a data scientist in 1 hour……

Plz Hindi

For whom it is useful,can mechanical engineering student learn this

Very nice video 👍👍👍👍

You are great👏👏👏👏👏

I like your channel 🙏🙏🙏

Fantastic 🎁 🎁

Very nice video i like your channel 👍👍👍👍👍👍👍👍👍👍👍👍

I support you.

I joined you 🙏🙏🙏🙏🙏🙏🙏

I need support you 👑👑👑

Nthng is visible vry disappointing video

Thank you 😊👍

Sir Can Commerce Student do the Data Science course or not ?

Sir I am confused that I need to learn data science or python please reply me sir

0:04 When he says hottest , it feels like he is high on weed

Sir, it would have been very helpful if you had made this video in a Hindi version