
# Machine Learning & Python & Data Science - 140 Hours HD Video

#### Development & IT

13/7/20, 8:19 am

6 week Course

Price:

\$0

##### EXPIRED

What you'll learn

• Hypothesis Space and Inductive Bias

• Evaluation and Cross-Validation

• Linear Regression

• Learning Decision Tree

• Python Exercise on Decision Tree and Linear Regression

• And much, much more!

Description

Hello everyone, and welcome to this course on an introduction to machine learning.

In this course we will have a quick introduction to machine learning. It will not be very deep in a mathematical sense, but it will have some amount of mathematical rigor. We will cover different paradigms of machine learning, with special emphasis on classification and regression tasks, and I will also introduce you to various other machine learning paradigms. In this introductory set of lectures I will give a very quick overview of the different kinds of machine learning paradigms, and therefore I call these lectures "Machine Learning: A Brief Introduction", with emphasis on brief.

The rest of the course will be a more elongated introduction to machine learning.

So what is machine learning? I will start off with a canonical definition put out by Tom Mitchell in 1997. A machine or an agent (I deliberately leave the beginning undefined, because you could also apply this to non-machines like biological agents) is said to learn from experience with respect to some class of tasks and a performance measure P if the learner's performance at tasks in the class, as measured by P, improves with experience.

So the first thing we get from this is that we have to define learning with respect to a specific class of tasks. It could be answering exams in a particular subject, or it could be diagnosing patients with a specific illness.

So we have to be very careful about defining the set of tasks on which we are going to define this learning. The second thing we need is a performance measure P. In the absence of a performance measure you would start to make vague statements like "oh, I think something is happening, there seems to be a change, there is some learning going on" and stuff like that.

So if you want to be clearer about measuring whether learning is happening or not, you first need to define some kind of performance criterion.

So for example, if you talk about answering questions in an exam, your performance criterion could very well be the number of marks that you get. If you talk about diagnosing illness, your performance measure could be the number of patients who did not have an adverse reaction to the drugs you gave them. There could be a variety of ways of defining performance measures, depending on what you are looking for. The third important component here is experience.

So with experience the performance has to improve. What do we mean by experience here? In the case of writing exams, it could be writing more exams: the more exams you write, the better you get at test taking. In the case of diagnosing illnesses, it could be patients: the more patients you look at, the better you become at diagnosing illness.

So these are the three components: you need a class of tasks, you need a performance measure, and you need some well-defined experience. This kind of learning, where you learn to improve your performance with experience, is known as inductive learning.

The basis of inductive learning goes back several centuries. People have been debating inductive learning for hundreds of years, and only more recently have we started to have more quantified mechanisms of learning. But one thing I always point out to people is that you should take this definition with a pinch of salt. Take the example of a slipper.

The slipper is supposed to give protection to your foot, so you could think of the task as fitting your foot comfortably. A performance measure for the slipper would be whether it fits the foot comfortably or not, or, as people say, whether it is biting or chafing your feet.

With experience, as the slipper knows more and more about your foot, as you keep wearing it for longer periods of time, it becomes better at the task of fitting your foot, as measured by whether it is chafing or biting your foot or not.

So would you say that the slipper has learned to fit your foot well? By this definition, yes. So we have to take this with a pinch of salt: not every system that conforms to this definition of learning can be said to learn in the usual sense. (Refer Slide Time: 06:11) Going on, there are different machine learning paradigms that we will talk about. The first one is supervised learning, where you learn an input-to-output map. You are given some kind of input, say a description of a patient who comes to the clinic, and the output you have to produce is whether the patient has a certain disease or not. Or the input could be some kind of question, and the output would be the answer to the question; it could be a true-or-false question, where I give you a description of the question and you have to give me true or false as the output.

In supervised learning what you essentially do is learn a mapping from this input to the required output. If the output you are looking for happens to be categorical, like whether the patient has a disease or does not have a disease, or whether the answer is true or false, then the supervised learning problem is called a classification problem. If the output happens to be a continuous value, like how long this product will last before it fails, or what the expected rainfall tomorrow is, then it is called a regression problem. We will look at classification and regression in more detail as we go on.

The second class of problems is known as unsupervised learning problems, where the goal is not really to produce an output in response to an input; rather, given a set of data, we have to discover patterns in the data. There is no real desired output that we are looking for; we are more interested in finding patterns in the data. Clustering is one unsupervised learning task, where you are interested in finding cohesive groups among the input patterns.

For example, I might be looking at customers who come to my shop, and I want to figure out if there are categories of customers: maybe college students could be one category, IT professionals could be another category, and so on. When I am looking for these kinds of groupings in my data, I would call that a clustering task.

The other popular unsupervised learning paradigm is known as association rule mining, or frequent pattern mining, where you are interested in finding frequent co-occurrences of items in the data that is given to you: whenever A comes to my shop, B also comes to my shop. With those kinds of co-occurrences I can say that if I see A, then it is very likely that B is also in my shop somewhere. So I can learn these kinds of associations from the data. Again, we will look at this in more detail later.
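The co-occurrence idea can be made concrete with a small counting sketch. The transactions below are made up for illustration, and the two functions use the usual textbook definitions of support and confidence, which the lecture names later without defining them; treat this as an assumption about what those measures will mean, not as the lecturer's own formulation.

```python
# Toy shop visits; each set lists who (or what) was seen together.
# The data is hypothetical and only serves the illustration.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"A", "B"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent` (the rule antecedent -> consequent)."""
    both = count_containing(antecedent | consequent, transactions)
    return both / count_containing(antecedent, transactions)


rule_support = support({"A", "B"}, transactions)
rule_confidence = confidence({"A"}, {"B"}, transactions)
print(rule_support, rule_confidence)
```

Here A and B appear together in 3 of the 5 visits, and B is present in 3 of the 4 visits that contain A, which is the sense in which "if I see A, B is very likely there too".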

There are many different variants of supervised and unsupervised learning, but these are the main ones that we will look at. The third form of learning is called reinforcement learning. It is neither supervised nor unsupervised in nature, and typically these are problems where you are learning to control the behavior of a system. I will give you more intuition into reinforcement learning in one of the later modules.

(Refer Slide Time: 09:33) So, like I said earlier, for every task you need to have some kind of performance measure. If you are looking at classification, the performance measure is typically going to be the classification error.
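As a rough illustration of classification error, the sketch below counts the fraction of wrong predictions and splits the mistakes into the two kinds that matter in a diagnosis setting. The labels (1 for "has the disease", 0 for "does not") and the example data are assumptions made for this sketch.

```python
def classification_error(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true one."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def false_positives_and_negatives(y_true, y_pred, positive=1):
    """Count healthy cases predicted as diseased (false positives) and
    diseased cases that were missed (false negatives)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return fp, fn


# Hypothetical ground truth and predictions for eight patients.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

error = classification_error(y_true, y_pred)
fp, fn = false_positives_and_negatives(y_true, y_pred)
print(error, fp, fn)
```

Three of the eight predictions are wrong: two healthy patients flagged as diseased and one diseased patient missed.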

We will talk about many, many different performance measures over the duration of this course, but the typical performance measure you would want to use is the classification error: how many of the items, or how many of the patients, did I get incorrect? How many of those who did not have the disease did I predict had the disease, and how many of those who had the disease did I miss? That is the measure we would want to use, but we will see later that often it is not possible to actually learn directly with respect to this measure, so we use other forms. Likewise for regression we have the prediction error: suppose I say it is going to rain 23 millimeters and then it ends up raining 49 centimeters, that is a huge prediction error. For clustering it becomes a little trickier to define performance measures. We do not know what a good clustering algorithm is, because we do not know how to measure the quality of clusters.

So people come up with all kinds of measures. One of the more popular ones is the scatter or spread of the cluster, which essentially tells you how spread out the points that belong to a single group are. If you remember, we are supposed to find cohesive groups, so if the group is not that cohesive, if its points are not all together, then you would say the clustering is of poorer quality. You may also have other ways of measuring things. Like I was telling you, if you know that some people are college students, then you can figure out what fraction of your cluster are college students.
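Both ideas can be sketched in a few lines. Here scatter is taken to be the mean squared distance of a cluster's points from their centroid, which is one common choice among several, and the "fraction of the cluster that are college students" is computed as the share of the most common known label in a single cluster. The points and labels are invented for the example.

```python
from collections import Counter


def scatter(points):
    """Mean squared distance of the points from their centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
    ) / n


def label_fraction(cluster_labels):
    """Share of one cluster taken up by its most common external label."""
    most_common_count = Counter(cluster_labels).most_common(1)[0][1]
    return most_common_count / len(cluster_labels)


# Two hypothetical clusters: one cohesive, one spread out.
tight = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
loose = [(0.0, 0.0), (0.0, 6.0), (6.0, 0.0), (6.0, 6.0)]

tight_scatter = scatter(tight)
loose_scatter = scatter(loose)
student_share = label_fraction(["student", "student", "student", "it"])
print(tight_scatter, loose_scatter, student_share)
```

The spread-out cluster has a much larger scatter than the tight one, and three of the four customers in the labeled cluster are students.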

So you can do these kinds of external evaluations, and one measure that people popularly use there is known as purity. In association rule mining we use a variety of measures, such as support and confidence. It takes a little bit of work to explain support and confidence, so I will defer that until I talk about association rules in detail. In reinforcement learning tasks, if you remember, I told you it is learning to control, so you are going to have a cost for controlling the system. The measure here is cost, and you would like to minimize the cost that you accrue while controlling the system. So these are the basic machine learning tasks. (Refer Slide Time: 12:11) There are several challenges when you are trying to build a machine learning solution, and I have listed a few of these on this slide. The first one is that you have to think about how good the model you have learned is. I talked about a few measures on the previous slide, but often those are not sufficient; there are other practical considerations that come into play, and we will look at some of these somewhere towards the middle of the course.

The bulk of the time will be spent on answering the second question, which is how do I choose a model. Given some kind of data, which is the experience that we are talking about, how would I choose a model that somehow learns what I want it to do and improves itself with experience? And how do I actually find the parameters of the model that give me the right answer? This is what we will spend much of our time on in this course.

Then there is a whole bunch of other things that you really have to answer to be able to build useful machine learning, data analytics or data mining solutions. Questions like: do I have enough data? Do I have enough experience to say that my model is good? Is the data of sufficient quality? There could be errors in the data. Suppose I have medical data and an age is recorded as 225. What does that mean? It could be 225 days, in which case it is a reasonable number; it could be 22.5 years, again a reasonable number; or 22.5 months, which is reasonable.

But if it is 225 years, it is not a reasonable number, so there is something wrong in the data. How do you handle these things? Or noise in images, or missing values? I will talk briefly about handling missing values later in the course. But as I mentioned in the beginning, this is a machine learning course: it is primarily concerned with the algorithms of machine learning and the math and the intuition behind them, and not necessarily with the questions of building practical systems based on them.

So I will be talking about many of these issues during the course, but I want to reiterate that they will not be the focus. The next challenge I have listed here is how confident I can be of the results. On that we will certainly talk a little bit, because the whole premise of reporting machine learning results depends on how confident you can be of the results. And the last question is: am I describing the data correctly?

That is very, very domain dependent, and it is a question that you can answer only with your experience as a machine learning or data science professional, or with time. But there are typical questions that you would like to ask, and they are there on the slide. In the next module we will look at the different learning paradigms in slightly more detail.

If you remember, in supervised learning we talked about experience, where you have some kind of description of the data. In this case let us assume that I have a customer database, and I am describing it by two attributes, age and income.

For each customer that comes to my shop, I know the age of the customer and the income level of the customer. (Refer Slide Time: 00:48) My goal is to predict whether the customer will buy a computer or not. I have this kind of labeled data given to me for building a classifier. Remember, we talked about classification, where the output is a discrete value; in this case it is yes or no: yes, the person will buy a computer; no, the person will not buy a computer.

The way I describe the input is through a set of attributes; in this case we are looking at age and income as the attributes that describe the customer. So now the goal is to come up with a function, a mapping that will take the age and income as input and give you an output that says whether the person will buy the computer or not. There are many different ways in which you can create this function. Given that we are looking at a geometric interpretation of the data, I am looking at the data as points in space.

(Refer Slide Time: 01:57) One of the most natural ways of thinking about defining this function is by drawing lines or curves on the input space.

So here is one possible example. Here I have drawn a line, and the points to the left of the line are predominantly red. Everything to the left of the line will be classified as "will not buy a computer", and everything to the right of the line, where the data points are predominantly blue, will be classified as "will buy a computer". What would the function look like? Remember that the x-axis is income and the y-axis is age.

So in this case it basically says that if the income of the person is less than some value X, then the person will not buy a computer; if the income is greater than X, the person will buy a computer. That is the kind of simple function that we will define. Notice that this way we completely ignore one of the variables, which is the age; we are just going by income. So is this a good rule? More or less; we get most of the points correct, except a few.
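The income-only rule can be written as a one-line function. The threshold of 50,000 and the five customers below are hypothetical values chosen only to show a rule that gets most, but not all, points right.

```python
INCOME_THRESHOLD = 50_000  # hypothetical value for the lecture's X


def will_buy(income, age, threshold=INCOME_THRESHOLD):
    """Predict a purchase from income alone; age is deliberately ignored."""
    return income > threshold


# (income, age, actually_bought): made-up customers.
customers = [
    (30_000, 25, False),
    (80_000, 45, True),
    (45_000, 60, False),
    (120_000, 35, True),
    (55_000, 70, False),  # older customer just above the threshold
]

correct = sum(
    1 for income, age, bought in customers if will_buy(income, age) == bought
)
print(f"{correct} of {len(customers)} customers classified correctly")
```

The one mistake is an older customer with an income just above the threshold, which is exactly the kind of point that a rule taking age into account could fix.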

So it looks like we can survive with this rule. It is not too bad, but you can do slightly better. (Refer Slide Time: 03:29) All right, so now those two red points that were on the wrong side of the line earlier seem to be on the right side. Everything to the left of this line will not buy a computer, and everything to the right will buy a computer. If you think about what has happened here, we have improved our performance measure, but at the cost of something. What is the cost here? Earlier we were only paying attention to the income, but now we have to pay attention to the age as well. The older you are, the higher the income threshold at which you will buy a computer.

The younger you are (younger means lower on the y-axis), the lower the income threshold at which you will buy a computer. Is that clear? The older you are, the more the income threshold is shifted to the right here, so you need to have a higher income before you buy a computer; the younger you are, the lower your income threshold, so you do not mind buying a computer even if your income is slightly lower.

So now we have to start paying attention to the age, but the advantage is that you get much better performance. Can you do better than this? Yes. (Refer Slide Time: 04:54) Now almost everything is correct except that one pesky red point. So what has happened here? We get much better performance, but at the cost of having a more complex classifier. Think about it in geometric terms. First you had a line that was parallel to the y-axis, so I just needed to define an intercept on the x-axis: if x is less than some value it is one class, and if it is greater than that value it is the other class.

The second function was a slanting line, so I needed to define both the intercept and the slope. And now it is a quadratic, so I have to define three parameters: something like $ax^2 + bx + c$, where I have to define a, b and c in order to fix the quadratic, and I am getting better performance. So can you do better than this? (Refer Slide Time: 05:57)
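The growth in complexity can be seen by writing the three decision rules side by side and counting their parameters. This is only a sketch: income is measured in thousands, the boundary is expressed as an age limit that depends on income, and every parameter value is invented for illustration.

```python
def axis_parallel(income, age, c):
    """One parameter: a vertical line at income = c. Age is ignored."""
    return income > c


def slanting_line(income, age, slope, intercept):
    """Two parameters: buy if the point lies below the line
    age = slope * income + intercept, so older customers need more income."""
    return age < slope * income + intercept


def quadratic(income, age, a, b, c):
    """Three parameters: buy if the point lies below the curve
    age = a * income**2 + b * income + c."""
    return age < a * income ** 2 + b * income + c


# Hypothetical customer: income 50 (thousand), age 35.
income, age = 50, 35
by_line_1 = axis_parallel(income, age, c=40)
by_line_2 = slanting_line(income, age, slope=1.0, intercept=-10)
by_curve = quadratic(income, age, a=0.01, b=0.0, c=5)
print(by_line_1, by_line_2, by_curve)
```

Each step adds one more number that has to be chosen from the data, which is the cost being paid for the better fit.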

Okay, that somehow does not seem right. It seems to be too complex a function just to be getting this one point right. I am not even sure how many parameters you need for drawing that, because PowerPoint uses some kind of spline interpolation to draw this curve; I am pretty sure that it is a lot more parameters than it is worth. Another thing to note here is that the particular red point that you see is actually surrounded by a sea of blue. So it is quite likely that there was some glitch there: either the person actually bought a computer and we have not recorded him as having bought one, or there was some extraneous reason, say the person came into the shop sure that he was going to buy a computer but then got a phone call about some emergency and left without buying one. There could be a variety of reasons why that noise occurred, and this will probably be the more appropriate classifier. So these are the kinds of issues I would like to think about: what is the complexity of the classifier that I would like to have, versus the accuracy of the classifier, so how good

So that gives me a search bias. Putting these two things together, we are able to generalize from a few training points to the entire space of inputs. I will make this more formal as we go on, in the next set of modules. (Refer Slide Time: 10:01) Here is one way of looking at the whole process. I am going to give you a set of data which we will call the training set. The training set will consist of an input, which we will call X, and an output, which we will call Y. So I am going to have a set of inputs X1, X2, X3, X4, and likewise I will have Y1, Y2, Y3, Y4, and this data is fed into a training algorithm.

In our case the data is going to look like this. Remember, our X's are the input variables; in this case each should have the income and the age, so x1 is like (30,000, 25), x2 is like (80,000, 45), and so on. The Y's are the labels; they correspond to the colors in the previous picture. So y1 is "does not buy a computer", y2 is "buys a computer", and so on; y1 is essentially red and y2 is blue.

If I am going to use something numeric, which is what we will be doing later on, I really cannot use these values as they are. First of all, the Y's are not numeric, and the X's vary too much: the first coordinate in the X is like 30,000 and 80,000, while the second coordinate is like 25 and 45, which is a lot smaller in magnitude. This will lead to some kind of numerical instability. So what we will typically end up doing is normalizing these so that they fall in approximately the same range. You can see that I have tried to normalize these X values to between 0 and 1.
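A minimal sketch of this normalization, assuming the lecture's choice of 2 lakhs (200,000) as the maximum income and 100 as the maximum age, and encoding "does not buy" as -1 and "buys" as +1. The function names are made up for the example.

```python
MAX_INCOME = 200_000  # 2 lakhs, the maximum income assumed in the lecture
MAX_AGE = 100         # maximum age assumed in the lecture


def normalize(income, age):
    """Scale both attributes into the range 0 to 1."""
    return (income / MAX_INCOME, age / MAX_AGE)


def encode_label(buys):
    """Map the yes/no label to +1 / -1."""
    return 1 if buys else -1


# The two example customers from the lecture: x1 does not buy, x2 buys.
raw = [((30_000, 25), False), ((80_000, 45), True)]
training_set = [(normalize(*x), encode_label(y)) for x, y in raw]
print(training_set)
```

After scaling, both coordinates are of comparable size, so neither attribute dominates purely because of its units.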

So I have chosen an income level of, say, 2 lakhs as the maximum and an age of 100, and you can see the normalized values. Likewise for buys and does not buy, I have taken "does not buy" as -1 and "buys a computer" as +1. These are arbitrary choices for now, but later on you will see that there are specific reasons for choosing the encoding this way. Then the training algorithm chugs over this data and produces a classifier. Now, I do not know whether this classifier is good or bad. We had a straight line in the first case, an axis-parallel line, and we did not know whether it was good or bad; we need some mechanism by which to evaluate it.

How we typically do the evaluation is that we have what is called a test set or a validation set. This is another set of x and y pairs, like we had in the training set. In the test set, again, we know what the labels are; it is just that we are not showing them to the training algorithm. We know the labels because we need the correct labels to evaluate whether the training algorithm is doing well or badly. The process by which this evaluation happens is called validation. At the end of the validation, if you are happy with the quality of the classifier you can keep it. If you are not happy, you go back to the training algorithm and say, hey, I am not happy with what you produced, give me something different. So we either have to iterate over the algorithm again, going over the data again and trying to refine the parameter estimates, or we could even think of changing some parameter values and then redoing the training all over again. This is the general process, and we will see that many of the different algorithms that we look at in the course of these lectures actually follow this kind of process.

Okay, so what happens inside that green box? (Refer Slide Time: 13:48) Inside the training algorithm there will be this learning agent, which will take an input and produce an output Y hat, which it thinks is the correct output. It will compare this against the actual target Y that was given for that input in the training set, figure out what the error is, and use the error to change the agent, so that it can produce the right output next time around. This is essentially an iterative process: you see the input, produce an output Y hat, take the target Y, compare it to the Y hat, figure out the error, and use the error to change the agent again. This is by and large the way most learning algorithms operate; most classification algorithms, and even regression algorithms, operate this way, and we will see how each of them works as we go on.

There are many, many applications. (Refer Slide Time: 14:46) They are too numerous to list; here are a few examples. You could look at, say, fraud detection, so we have
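The predict, compare and update loop, followed by a check on held-out data, can be sketched with a perceptron-style update. The perceptron is a stand-in chosen here because it is short; the lecture does not name a specific algorithm at this point. The normalized (income, age) points and their +1/-1 labels are made up.

```python
def predict(weights, bias, x):
    """Produce y hat: +1 if the weighted sum is positive, otherwise -1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


def error_rate(weights, bias, data):
    """Fraction of (x, y) pairs that the current agent gets wrong."""
    wrong = sum(1 for x, y in data if predict(weights, bias, x) != y)
    return wrong / len(data)


def train(data, max_epochs=100, learning_rate=1.0):
    """Repeatedly compare y hat with the target y and nudge the
    weights whenever they disagree (perceptron-style update)."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(weights, bias, x) != y:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # nothing left to correct on the training set
            break
    return weights, bias


# Hypothetical normalized (income, age) points with +1 / -1 labels.
training_set = [((0.1, 0.3), -1), ((0.2, 0.5), -1), ((0.7, 0.4), 1), ((0.9, 0.6), 1)]
validation_set = [((0.15, 0.4), -1), ((0.8, 0.5), 1)]

weights, bias = train(training_set)
train_error = error_rate(weights, bias, training_set)
validation_error = error_rate(weights, bias, validation_set)
print(train_error, validation_error)
```

The validation labels are never shown to `train`; they are used only afterwards to judge the classifier, which mirrors the role of the test or validation set described above.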

predict is no longer a discrete value; it is not like "will buy a computer" versus "will not buy a computer". It is more of a continuous value. Here is an example where you have recorded the temperature at different times of day. The input to the system is going to be the time of day, and the output from the system is going to be the temperature that was measured at that time. Your experience, or your training data, is going to take this form: the blue points would be your inputs and the red points would be the outputs that you are expected to predict.

Note here that the outputs are continuous or real valued. You could think of this toy example as the points to the left being day and the points to the right being night. Just as in the previous case of classification, we could try to do the simplest possible fit, which in this case would be to draw a straight line that is as close as possible to these points. You do see that, like in the classification case, when we choose a simple solution there are certain points at which we are making large errors, so we could try to fix that.

We could try to do something more fancy, but you can see that while the daytime temperatures are more or less fine, with the night-time temperatures we seem to be doing something really off, because we are going off too much on the right-hand side. However, you could do something more complex, just like in the classification case where we wanted to get that one point right: we could try to fit all the temperatures that were given to us with a sufficiently complex curve. Again, as we discussed earlier, this is probably not the right answer, and in this case you are, surprisingly, probably better off fitting the straight line.

These kinds of solutions, where we are trying to fit the noise in the data, trying to make the solution predict the noise in the training data correctly, are known as overfit solutions, and one of the things that we look to avoid in machine learning is overfitting to the training data. (Refer Slide Time: 19:21) We will talk about this again later in the course. What we typically would like to do is what is called linear regression; some of you might have come across this in different circumstances. The typical aim in linear regression is to take the error that your line is making. Let us say I take an example point somewhere here.

This is the actual training data that is given to you, and this is the prediction that your line is making at this point, so this quantity is essentially the prediction error that the line is making. What you do is try to find the line that has the least prediction error: you take the squares of the errors that your prediction is making, and you try to minimize the sum of the squares of the errors. Why do we take the squares? (Refer Slide Time: 20:31) Because errors could be either positive or negative, and we want to make sure that we are minimizing them regardless of the sign of the error. With sufficient data, linear regression is simple enough that you could solve it directly using matrix inversions, as we will see later. But with many dimensions the challenge is to avoid overfitting, like we talked about earlier, and there are many ways of avoiding this. I will again talk about this in detail when we look at linear regression.
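For a single input, minimizing the sum of squared errors has a well-known closed-form solution, which is what the matrix approach reduces to in this case. The sketch below uses that closed form on made-up temperature readings; the data and function names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum over all points of (target - prediction) squared."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical readings: time step versus temperature.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
temps = [30.0, 28.5, 27.5, 25.0, 24.0]

slope, intercept = fit_line(times, temps)
best_error = sum_squared_error(times, temps, slope, intercept)
# Any other line, such as this slightly tilted one, has a larger error.
other_error = sum_squared_error(times, temps, slope + 0.1, intercept)
print(slope, intercept, best_error, other_error)
```

Squaring makes an error of +1 and an error of -1 count equally, so positive and negative errors cannot cancel each other out.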

One point that I want to make is that linear regression is not as simple as it sounds. Here is an example. I have two input variables, $x_1$ and $x_2$. If I try to fit a straight line with $x_1$ and $x_2$, I will probably end up with something like $a_1 x_1 + a_2 x_2$, which looks like a plane in two dimensions. But suppose I take these two dimensions and transform the input: instead of just $x_1$ and $x_2$, I say my input is going to be $x_1^2$, $x_2^2$, $x_1 x_2$, and then $x_1$ and $x_2$ as in the beginning. So instead of a two-dimensional input I am going to look at a five-dimensional input.

Now I am going to fit a line, or a linear plane, in this five-dimensional input space. That will be $a_1 x_1^2 + a_2 x_2^2 + a_3 x_1 x_2 + a_4 x_1 + a_5 x_2$. That is no longer the equation of a line in two dimensions; it is the equation of a second-order polynomial in two dimensions. But I can still think of this as doing linear regression, because I am only fitting a function that is linear in the transformed input variables. (Refer Slide Time: 22:38) By choosing an appropriate transformation of the inputs I can fit any higher-order function, so I could solve very complex problems using linear regression. It is not really as weak a method as you would think at first glance. Again, we will look at this in slightly more detail in the later lectures.

Regression, or prediction, can be applied in a variety of places. One popular place is time series prediction: you could think about predicting rainfall in a certain region, or how much you are going to spend on your telephone calls. You could even think of doing classification using this.
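The five-dimensional transformation can be tried out directly. The sketch below expands each $(x_1, x_2)$ into the five features, then solves the ordinary least-squares normal equations with a small hand-written Gaussian elimination. The target coefficients and the grid of inputs are invented, and the data is noise-free so that the fit should recover them.

```python
def features(x1, x2):
    """Map a 2-D input to the 5-D input: x1^2, x2^2, x1*x2, x1, x2."""
    return [x1 * x1, x2 * x2, x1 * x2, x1, x2]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_linear(rows, targets):
    """Least squares via the normal equations (A^T A) a = A^T y."""
    k = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(AtA, Aty)


# Hypothetical second-order function to recover: a1..a5.
true_coeffs = [2.0, -1.0, 0.5, 3.0, -4.0]
grid = (-1.0, 0.0, 1.0, 2.0)
rows = [features(x1, x2) for x1 in grid for x2 in grid]
targets = [sum(a * f for a, f in zip(true_coeffs, row)) for row in rows]

coeffs = fit_linear(rows, targets)
print([round(c, 6) for c in coeffs])
```

The fitting code never knows that the features are squares and products; it only sees five input columns, which is why this still counts as linear regression.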

If you remember our encoding of plus 1 and minus 1 for the class labels, you could think of plus 1 and minus 1 as the outputs, and then fit a regression line or regression curve to that. If the output is greater than 0 you would say the class is plus 1, and if the output is less than 0 you would say the class is minus 1. So you could use the regression ideas to solve the classification problem.

You could also do data reduction. I really do not want to give you all the millions of data points that I have in my data set, so what I would do is fit a curve to the data and then give you just the coefficients of the curve. More often than not, that is sufficient for us to get a sense of the data.

That brings us to the next application I have listed there, which is trend analysis. Quite often I am not interested in the actual values of the data but more in the trends. For example, I have a solution whose running time I am trying to measure, and I am not really interested in the actual running time, because 37 seconds versus 38 seconds is not going to tell me much. But I would really like to know whether the running time scales linearly or exponentially with the size of the input. Those kinds of analysis can again be done using regression.

The last one here is again risk factor analysis, like we had in classification: you can look at which factors contribute most to the output. That brings us to the end of this module on supervised learning.
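The running-time example from the trend analysis discussion can be sketched as follows. This is only an illustration under stated assumptions: the measurements are synthetic, the helper names `fit_line` and `describe_trend` are made up here, and comparing R^2 values of a fit on the raw times against a fit on the logarithm of the times is a rough heuristic, not a formal test. The idea is that linear scaling gives a straight line in (size, time), while exponential scaling gives a straight line in (size, log time).

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def describe_trend(sizes, times):
    """Return 'linear' or 'exponential', whichever straight-line fit explains more."""
    _, _, r2_linear = fit_line(sizes, times)
    # Exponential growth t = c * b^n becomes a line after taking log(t).
    _, _, r2_exponential = fit_line(sizes, [math.log(t) for t in times])
    return "exponential" if r2_exponential > r2_linear else "linear"


# Synthetic running times (in seconds) for input sizes 1..12.
sizes = list(range(1, 13))
linear_times = [3.0 * n + 2.0 for n in sizes]
exponential_times = [0.01 * 1.5 ** n for n in sizes]

print(describe_trend(sizes, linear_times))
print(describe_trend(sizes, exponential_times))
```

Note that the fitted slope and intercept are also a small instance of the data reduction idea: two coefficients stand in for the whole list of measurements.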

Hello and welcome to this module on introduction to unsupervised learning. In supervised learning we looked at how you would handle training data that had labels on it. (Refer Slide Time: 00:26) So this particular case is a classification data set, where red denotes one class and blue denotes the other class. (Refer Slide Time: 00:35)

In unsupervised learning you basically have a lot of data given to you, but the data points do not have any labels attached to them. We look first at the problem of clustering, where your goal is to find groups of coherent or cohesive data points in the input space. Here is an example of possible clusters. (Refer Slide Time: 00:57) This set of data points could form a cluster, that set of data points could form a cluster, and again those and those, so there are four clusters that we have identified in this setup.

One thing to note here is that even in something like clustering I need to have some form of bias. In this case the bias is in the shape of the cluster: I am assuming that the clusters are all ellipsoids, and that is why I have been drawing curves of a specific shape to represent the clusters.

Also note that not all data points need to fall into clusters; there are a couple of points there that do not fall into any of the clusters. This is primarily an artifact of my assuming that the clusters are ellipsoids, but still there are other points, like the one in the center, that are actually far away from all the other points in the data set, and these can be considered what are known as outliers. So when you do clustering there are two things: one, you are interested in finding cohesive groups of points, and two, you are also interested in finding data points that do not conform to the patterns in the input, and these are known as outliers. (Refer Slide Time: 02:23)

There are many different ways in which you can accomplish clustering, and we will look at a few in the course. The applications are numerous, so here are a few representative ones. One is to look at customer data and try to discover the classes of customers. Earlier, in the supervised learning case, we looked at whether a customer will buy a computer or will not buy a computer. As opposed to that, we could just take all the customer data that you have and try to group it into the different kinds of customers who come to your shop, and then you could do some kind of targeted promotions for the different classes of customers.

This need not necessarily come with labels. I am not going to tell you that this customer is class 1 and that customer is class 2; you are just going to find out which of the customers are more similar to each other.

The second application, which is illustrated here, is that I could do clustering on image pixels so that you could discover different regions in the image, and then you could do some segmentation based on those regions. For example, here I have a picture of a beach scene, and you are able to figure out the clouds, the sand, the sea and the tree from the image, which allows you to make more sense of the image.
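As a rough illustration of grouping unlabeled customers and spotting an outlier, here is a minimal sketch using k-means. The lecture has not named any particular clustering algorithm at this point, so k-means is my choice for the example. The customer numbers, the starting centers and the `outlier_distance` threshold are all hypothetical. K-means carries its own bias: it favours compact, roughly round clusters, and its centers are pulled towards outliers.

```python
import math


def kmeans(points, centers, iterations=20):
    """Plain k-means (Lloyd's algorithm), starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def assign(points, centers, outlier_distance):
    """Label each point with its nearest center, or None if it is too far from all."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centers]
        best = min(range(len(centers)), key=distances.__getitem__)
        labels.append(best if distances[best] <= outlier_distance else None)
    return labels


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 10), (2, 12), (1.5, 11), (8, 50), (9, 52), (8.5, 49), (30, 5)]

centers = kmeans(customers, centers=[(1, 10), (8, 50)])
labels = assign(customers, centers, outlier_distance=10.0)
print(labels)
```

No labels are supplied anywhere: the groups come only from which customers are close to each other, and a label of `None` marks a customer that does not conform to either group.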

Or you could do clustering on word usages and discover synonyms. You could also do clustering on documents, depending on which documents are similar to each other: if I give you a collection of, say, 100,000 documents, I might be able to figure out the different topics that are discussed in this collection. So there are many ways in which you can use clustering. The next topic is association rule mining. (Refer Slide Time: 04:17)

I should give you an aside about the usage of the word mining here. Many of you might have heard of the term data mining, and more often than not the purported data mining tasks are essentially machine learning problems: classification, regression and so on. The first problem that was introduced as a mining problem, and not as a learning problem, was that of mining frequent patterns and associations, and that is one of the reasons I call this association rule mining as opposed to association rule learning, just to keep the historic connection intact. In association rule mining we are interested in finding frequent patterns that occur in the input data, and then we look at conditional dependencies among these patterns.
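The two ingredients just mentioned, frequent patterns and conditional dependencies, are usually quantified with two standard measures called support and confidence. The lecture has not introduced these terms at this point, so treat the following as a minimal sketch under that assumption; the transactions and the function names are made up for illustration.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Fraction of transactions containing the pattern: how frequent it is."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: the strength of the rule antecedent -> consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)


# Hypothetical data: each set records what was observed together on one occasion.
transactions = [
    {"A", "B"},
    {"A", "B"},
    {"A", "B", "C"},
    {"C"},
    {"A"},
]

print(support(transactions, {"A", "B"}))
print(confidence(transactions, {"A"}, {"B"}))
```

A pattern with high support is a frequent pattern, and a rule with high confidence expresses the conditional dependency "if A occurs, then B tends to occur".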

For example, if A and B occur together often, then I could say something like: if A happens, then B will happen. Let us suppose that you have customers coming to your shop, and whenever customer A visits your shop, customer B also tags along with him, so the next time you find customary