Random forests
A single decision tree can learn quite complex functions. However, in many ways it will be prone to overfitting—learning rules that work only for the training set. One way we can adjust for this is to limit the number of rules it learns. For instance, we could limit the depth of the tree to just three layers. Such a tree will learn the best rules for splitting the dataset at a global level, but won't learn highly specific rules that separate the dataset into highly accurate groups. This trade-off results in trees that may generalize well, but with slightly poorer overall performance.
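As a quick illustration of limiting tree depth, the following sketch (assuming scikit-learn and a generic feature matrix X and label vector y, not any particular dataset from this book) caps a decision tree at three levels:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A small synthetic dataset stands in for real training data here.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Limiting max_depth to 3 keeps the tree to broad, global splitting rules
# and prevents it from memorizing highly specific patterns.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_tree.fit(X, y)
print(shallow_tree.get_depth())  # will be at most 3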
To compensate for this, we could create many decision trees and then ask each to predict the class value. We could take a majority vote and use that answer as our overall prediction. There are two problems with the aforementioned procedure.
The first problem is that building decision trees is largely deterministic—using the same input will result in the same output each time. We only have one training dataset, which means our input will be the same if we try to build multiple trees. We can address this by choosing a random subsample of our dataset, effectively creating new training sets. The second problem is that the features used for the first few decision nodes in our tree will be quite good. Even if we choose random subsamples of our training data, it is still quite possible that the decision trees built will be largely the same.
To compensate for this, we also choose a random subset of the features to perform our data splits on. Then, we have randomly built trees using randomly chosen samples and randomly chosen features. This is a Random Forest and, perhaps unintuitively, this algorithm is very effective on many datasets.
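A minimal sketch of this idea, assuming scikit-learn and a generic X and y: n_estimators controls how many randomized trees are built and max_features controls the random subset of features considered at each split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Each tree is trained on a bootstrap sample of the rows and considers
# only a random subset of the features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())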
An introduction to online learning
Online learning is the incremental updating of a model as new data arrives. Algorithms that support online learning can be trained on one or a few samples at a time, and updated as new samples arrive. In contrast, algorithms that are not online require access to all of the data at once. The standard k-means algorithm is like this, as are most of the algorithms we have seen so far in this module. Online versions of algorithms have a means to partially update their model with only a few samples. Neural networks are a standard example of an algorithm that works in an online fashion. As a new sample is given to the neural network, the weights in the network are updated according to a learning rate, which is often a very small value such as 0.01.
This means that any single instance only makes a small change to the model. Neural networks can also be trained in batch mode, where a group of samples is given at once and the training is done in one step. Algorithms are faster in batch mode, but use more memory. In this same vein, we can slightly update the k-means centroids after a single sample or a small batch of samples. To do this, we apply a learning rate to the centroid movement in the updating step of the k-means algorithm.
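A rough sketch of that update step, assuming NumPy and a single incoming sample x (the function name and learning rate are placeholders, not code from this book):

import numpy as np

def online_kmeans_update(centroids, x, learning_rate=0.01):
    """Nudge the nearest centroid towards a single new sample x."""
    # Find the centroid closest to the new sample.
    distances = np.linalg.norm(centroids - x, axis=1)
    nearest = np.argmin(distances)
    # Move that centroid a small step towards the sample, scaled by the
    # learning rate, rather than recomputing the full cluster mean.
    centroids[nearest] += learning_rate * (x - centroids[nearest])
    return centroids

# Hypothetical usage with three 2-D centroids and one incoming sample:
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
centroids = online_kmeans_update(centroids, np.array([4.0, 4.5]))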
Assuming that samples are randomly chosen from the population, the centroids should tend to move towards the positions they would have in the standard, offline k-means algorithm. Online learning is related to streaming-based learning; however, there are some important differences.
Grouping news articles
The aim of this chapter is to discover trends in news articles by clustering, or grouping, them together. To do that, we will use the k-means algorithm, a classic machine-learning algorithm originally developed in 1957. Clustering is an unsupervised learning technique and we use clustering algorithms for exploring data.
Our dataset contains approximately 500 stories, and it would be quite arduous to examine each of those stories individually. Even if we used summary statistics, that is still a lot of data. Using clustering allows us to group similar stories together, and we can explore the themes in each cluster independently.
We use clustering techniques when we don't have a clear set of target classes for our data. In that sense, clustering algorithms have little direction in their learning. They learn according to some function, regardless of the underlying meaning of the data. For this reason, it is critical to choose good features.
In supervised learning, if you choose poor features, the learning algorithm can choose not to use those features. For instance, support vector machines will give little weight to features that aren't useful in classification. However, with clustering, all features are used in the final result—even if those features don't provide us with the answer we were looking for. When performing cluster analysis on real-world data, it is always a good idea to have a sense of what sorts of features will work for your scenario.
In this chapter, we will use the bag-of-words model. We are looking for topic-based groups, so we will use topic-based features to model the documents. We know those features work because of the work others have done in supervised versions of our problem.
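As a rough sketch of what topic-based clustering can look like (assuming scikit-learn and a small made-up list of documents, not this chapter's dataset), we can vectorize the documents with a bag-of-words model and then cluster them with k-means:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

documents = [
    "stocks fell as markets reacted to interest rates",
    "the central bank raised interest rates again",
    "the home team won the championship final",
    "a late goal decided the championship match",
]

# Bag-of-words features: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Group the documents into two topic-based clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(labels)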
In contrast, if we were to perform an authorship-based clustering, we would use features such as those used in Chapter 9, Authorship Attribution.
Creating your own transformer
As the complexity and type of dataset changes, you might find that you can't find an existing feature extraction transformer that fits your needs. We will see an example of this in Chapter 7, Discovering Accounts to Follow Using Graph Mining, where we create new features from graphs. A transformer takes data of one form as input and returns data of another form as output. Transformers can be trained using some training dataset, and these trained parameters can be used to convert testing data.
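A minimal sketch of a custom transformer, assuming scikit-learn's usual fit/transform convention (the class name and the mean-based behaviour here are placeholders, not this book's transformer):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanDiscretizer(BaseEstimator, TransformerMixin):
    """Toy transformer: each feature becomes 1 if above its training mean, else 0."""

    def fit(self, X, y=None):
        # Learn parameters from the training data only.
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the learned parameters to new (for example, testing) data.
        return (np.asarray(X) > self.means_).astype(int)

# Hypothetical usage:
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 5.0]])
transformer = MeanDiscretizer().fit(X_train)
print(transformer.transform(X_test))  # [[1 0]]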
Note: the curse of dimensionality
It is important to mention that KNN is very susceptible to overfitting due to the curse of dimensionality. The curse of dimensionality describes the phenomenon where the feature space becomes increasingly sparse for an increasing number of dimensions of a fixed-size training dataset. Intuitively, we can think of even the closest neighbors being too far away in a high-dimensional space to give a good estimate. We have discussed the concept of regularization in the section about logistic regression as one way to avoid overfitting. This will be discussed in more detail in the next chapter.
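As a rough numerical illustration (a sketch, not from the original text), we can sample random points in increasingly many dimensions and watch the distance to the nearest neighbour grow:

import numpy as np

rng = np.random.default_rng(0)

for n_dims in (2, 10, 100, 1000):
    # 500 random points in the unit hypercube of the given dimensionality.
    points = rng.random((500, n_dims))
    # Distance from the first point to its nearest neighbour.
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(n_dims, round(dists.min(), 3))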
Kernels
When the data cannot be separated linearly, the trick is to embed it into a higher-dimensional space. What this means, with a lot of hand-waving about the details, is to add pseudo-features until the data is linearly separable. The trick is that we often compute the inner product of the samples when finding the best line to separate the dataset. Given a function that uses the dot product, we effectively manufacture new features without having to actually define those new features.
This is handy because we don't know what those features would have been anyway. We can now compute what that dot product is and then just use it. The linear kernel is the most straightforward and is simply the dot product of the two sample feature vectors, together with a weight and a bias value. There is also a polynomial kernel, which raises the dot product to a given degree. Others include the Gaussian (rbf) and sigmoidal kernels. In our previous code sample, we tested the linear and rbf kernels.
The end result of all this derivation is that these kernels effectively define a distance between two samples, which is used when classifying new samples with an SVM. In theory, any distance could be used, although it may not share the characteristics that enable easy optimization of the SVM training. In scikit-learn's implementation of SVMs, we can set the kernel parameter to change which kernel function is used in computations, as we saw in the previous code sample.
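As a brief sketch of switching kernels (assuming scikit-learn and a generic X and y rather than the book's dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Compare a linear kernel against the Gaussian (rbf) kernel.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, round(scores.mean(), 3))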
The user can select either Yes or No depending on whether they think the choice made by the machine-learning algorithm matches the choice they would have made. This is not scientifically accurate (it's ripe for observation bias), but it's good enough for playing around. Using my eyes, it succeeded about 84 percent of the time, which is better than my grade 12 average. Not bad for our first ever machine learning experience, eh? You might be wondering, "what does this have to do with object-oriented programming? There isn't even one class in this code!" In some ways, you'd be right; neither coroutines nor generators are commonly considered object-oriented. However, the functions that create them return objects; in fact, you could think of those functions as constructors. The constructed object has appropriate send() and __next__() methods. Basically, the coroutine/generator syntax is a syntax shortcut for a particular kind of object that would be quite verbose to create without it.
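For instance (a small sketch, not taken from the case study itself), even a trivial coroutine already behaves like an object with those methods:

def running_total():
    """Coroutine that keeps a running total of the values sent to it."""
    total = 0
    while True:
        value = yield total
        total += value

# Calling the function constructs a generator object with __next__() and send().
totaller = running_total()
next(totaller)            # prime the coroutine; runs up to the first yield
print(totaller.send(5))   # 5
print(totaller.send(3))   # 8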
This case study has been an exercise in bottom-up design. We created various low-level objects that did specific tasks and hooked them all together at the end.
I find this to be a common practice when developing with coroutines. The alternative, top-down design, sometimes results in more monolithic pieces of code instead of unique individual pieces. In general, we want to find a happy medium between methods that are too large and methods that are so small that it's hard to see how they fit together.
This is true, of course, regardless of whether the iterator protocol is being used as we did here.
Let's assume that we fit our model on a training dataset and use the same data to estimate how well it performs in practice; that estimate would tell us little about generalization. To find an acceptable bias-variance trade-off, we need to evaluate our model carefully. In this section, you will learn about an extremely handy tool, the Pipeline class in scikit-learn. It allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.
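A minimal sketch of a Pipeline, assuming scikit-learn and chaining a standardization step with a logistic regression classifier (these particular steps are placeholders, not the book's example):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits the scaler and the classifier as one estimator,
# then applies the same transformation steps when predicting on new data.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))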
As a simple algorithm: if the height is more than x, the person is tall; otherwise, they are short. Our training algorithm will then look at the data and decide on a good value for x. For the preceding dataset, a reasonable value would be 170 cm. Anyone taller than 170 cm is considered tall by the algorithm.
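A rough sketch of such a training step, assuming a small made-up list of heights (in cm) and labels, simply picks the candidate threshold that classifies the training data best:

# Hypothetical training data: heights in cm and whether each person is "tall".
heights = [155, 160, 168, 172, 178, 185]
is_tall = [False, False, False, True, True, True]

def train_threshold(heights, labels):
    """Return the threshold x that gives the most correct predictions."""
    best_x, best_correct = None, -1
    for x in heights:
        correct = sum((h > x) == label for h, label in zip(heights, labels))
        if correct > best_correct:
            best_x, best_correct = x, correct
    return best_x

print(train_threshold(heights, is_tall))  # 168 with this toy data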
In the preceding dataset, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is an important problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset.
Ultimately, this step often requires some expert domain knowledge or at least some trial and error.
It is called lazy not because of its apparent simplicity, but because it doesn't learn a discriminative function from the training data but memorizes the training dataset instead.
This objective function is often a cost function that we want to minimize. In the case of Adaline, we can define the cost function to learn the weights as the Sum of Squared Errors between the calculated outcomes and the true class labels.
Printing c.most_common(5) gives the list of the top five most frequently occurring words. Ties are not handled well, as only five are given and a very large number of words all share a tie for fifth place.
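A short sketch of that counting step, assuming Python's standard-library Counter and a toy document in place of the real corpus:

from collections import Counter

document = "the cat sat on the mat and the dog sat on the rug"
c = Counter(document.split())

# The five most frequent words; ties beyond fifth place are cut off arbitrarily.
print(c.most_common(5))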
The first model is to use the raw frequencies, as shown in the preceding example. This does have a drawback when documents vary in size from a few words to many words, as the overall values will be very different. The second model is to use the normalized frequency, where each document's sum equals 1. This is a much better solution, as the length of the document doesn't matter as much. The third type is to simply use binary features—a value is 1 if the word occurs at all and 0 if it doesn't. We will use binary representation in this chapter.
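The three schemes can be sketched roughly as follows, assuming NumPy and a small matrix of raw word counts:

import numpy as np

# Raw word counts for two documents over a four-word vocabulary.
raw = np.array([[3, 0, 1, 0],
                [10, 2, 4, 4]], dtype=float)

# Normalized frequencies: each document's row sums to 1.
normalized = raw / raw.sum(axis=1, keepdims=True)

# Binary features: 1 if the word occurs at all, 0 otherwise.
binary = (raw > 0).astype(int)

print(normalized)
print(binary)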
Another popular method for performing normalization is called term frequency-inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then divided by the number of documents in the corpus in which the term appears. We will use tf-idf in Chapter 10, Clustering News Articles. There are a number of libraries for working with text data in Python.
We will use a major one, called the Natural Language Toolkit (NLTK). The scikit-learn library also has the CountVectorizer class that performs a similar action, and it is recommended you take a look at it. However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.
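For example (a small sketch; it assumes NLTK is installed and its punkt tokenizer data has been downloaded):

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

sentence = "Don't cluster these articles yet!"

# NLTK's tokenizer keeps punctuation and splits contractions.
print(word_tokenize(sentence))  # e.g. ['Do', "n't", 'cluster', 'these', 'articles', 'yet', '!']

# scikit-learn's CountVectorizer uses a simpler regular-expression tokenizer.
vectorizer = CountVectorizer()
vectorizer.fit([sentence])
print(vectorizer.get_feature_names_out())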
Our media player can play this object just as easily as one that extends AudioFile. Polymorphism is one of the most important reasons to use inheritance in many object-oriented contexts. Because any objects that supply the correct interface can be used interchangeably in Python, it reduces the need for polymorphic common superclasses. Inheritance can still be useful for sharing code but, if all that is being shared is the public interface, duck typing is all that is required. Of course, just because an object satisfies a particular interface does not mean it will simply work in all situations.
It has to fulfill that interface in a way that makes sense in the overall system. Just because an object provides a play() method does not mean it will automatically work with a media player. For example, our chess AI object from Chapter 1, Object-oriented Design, may have a play() method that moves a chess piece. Even though it satisfies the interface, this class would likely break in spectacular ways if we tried to plug it into a media player! Another useful feature of duck typing is that the duck-typed object only needs to provide those methods and attributes that are actually being accessed. More succinctly, duck typing doesn't need to provide the entire interface of an object that is available; it only needs to fulfill the interface that is actually accessed.
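A tiny sketch of this idea (the class and function names are hypothetical, not from the case study): the player below only ever calls play(), so any object providing that single method works.

class AudioFile:
    def play(self):
        print("playing audio")

class TextToSpeech:
    # Not related to AudioFile at all, but it quacks like one.
    def play(self):
        print("reading text aloud")

def media_player(item):
    # Duck typing: we only rely on the one method we actually access.
    item.play()

media_player(AudioFile())
media_player(TextToSpeech())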
Other hyperparameters of the random forest classifier (see the chapter Compressing Data via Dimensionality Reduction) are the size n of the bootstrap sample and the number of features d that is randomly chosen for each split (step 2.1), respectively. Via the sample size n of the bootstrap sample, we control the bias-variance trade-off of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance. For the number of features d at each split, we want to choose a value that is smaller than the total number of features in the training set.
A reasonable default that is used in scikit-learn and other implementations is d = √m, where m is the number of features in the training set.
Character n-grams
We saw how function words can be used as features to predict the author of a document. An n-gram is a sequence of n objects, where n is a value (typically a small number for text). Word n-grams have been used in many studies, usually relating to the topic of the documents. However, character n-grams have proven to be of high quality for authorship attribution. Character n-grams are found in text documents by representing the document as a sequence of characters.
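As a brief sketch, character n-grams can be extracted by sliding a window of n characters over the document (shown here in plain Python; scikit-learn's CountVectorizer can do the same with analyzer='char'):

def char_ngrams(document, n=3):
    """Return the list of character n-grams in a document."""
    return [document[i:i + n] for i in range(len(document) - n + 1)]

print(char_ngrams("the cat", n=3))
# ['the', 'he ', 'e c', ' ca', 'cat']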