Some Machine Learning with Python

By Rafael Novello

Hey folks, how has it been?

Today I want to show a little program I created to test and practice Machine Learning techniques for categorizing texts. It’s just a prototype that uses newspaper articles, but you can download it to test and study!

Important: My goal is not to deeply address the different techniques, approaches and algorithms available, because my knowledge still does not allow me to do so. Instead, I want to show you how you can apply this knowledge and perhaps encourage you to also study the subject!

As you can see on my GitHub, the program has only 152 lines of code. Even so, I will concentrate on the parts related to data processing, training and prediction, which are the focus of today’s article.

The program can be divided into the following steps:

  • Download and extract the content of the articles from a CSV with links and categories;
  • Clean the content, keeping only what is relevant for training the model;
  • Create the Bag of Words and train the model that will do the categorization;
  • Categorize new links with the trained model.

With these steps in mind, let’s get to it!

1. Download and extract the content (Goose, what a find!)

If you have ever written any kind of web scraper, you know how tedious and laborious it can be. For my purposes, I needed to extract only the article content from the HTML pages, without polluting the result with tags or peripheral text. After many attempts using requests, lxml and regex, I found goose-extractor on PyPI.

Gotta love Goose

A quick look at the documentation shows that this project is perfect for our case, solving the problem in a few lines:

from goose import Goose
goose = Goose()
article = goose.extract(link)
text = article.cleaned_text

Just pass the page link and Goose does the rest! It also provides access to other page attributes, such as the article title, meta descriptions, etc. It is worth checking out!

Because this is an I/O-bound task with more than 700 links to download, I’m using the ThreadPoolExecutor class from the futures backport. It is very simple to use and you can learn more in the project documentation!
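The download step can be sketched like this, using Python 3’s built-in concurrent.futures (the same API as the futures backport). Here fetch_text is a hypothetical stand-in for the real Goose-based extraction function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_text(link):
    # stand-in for the real Goose-based extraction of one article
    return "text of %s" % link

links = ["http://example.com/a", "http://example.com/b"]

# download many pages concurrently; ideal for I/O-bound work
with ThreadPoolExecutor(max_workers=10) as executor:
    texts = list(executor.map(fetch_text, links))
```

executor.map keeps the results in the same order as the input links, which makes it easy to pair each text back with its category later.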

2. Clean the content (NLTK)

With the article texts in hand, we need to remove characters and words that could confuse the categorization algorithm, leaving only the words that contribute to the “comprehension” of the text. In the first stage, we remove nearly everything that is not text:

import unicodedata


def remove_nonlatin(string):
    new_chars = []
    for char in string:
        if char == '\n':
            new_chars.append(' ')
            continue
        if unicodedata.name(char).startswith(('LATIN', 'SPACE')):
            new_chars.append(char)
    return ''.join(new_chars)


text = remove_nonlatin(to_unicode(text))

After that, we need to remove the so-called stop words: in short, words that are repeated a lot in any text and can undermine the analysis made by the algorithm. This is done with the help of the NLTK project.

NLTK can be installed via pip, but after installation you still need to download the stopwords package through NLTK’s downloader, for example by running nltk.download('stopwords') in a Python shell.

With everything installed, it is easy to remove the stop words from our text:

from nltk.corpus import stopwords
stops = set(stopwords.words("portuguese"))
words = ' '.join([w for w in words if w not in stops])

Finally, we create a pandas DataFrame to gather the links, categories and processed texts:

from pandas import DataFrame
lines = []
for link, categ in links:  # (link, category) pairs read from the CSV
    words = pre_processor(link)
    lines.append((link, categ, words))
df = DataFrame(lines)
df.columns = ['link', 'categoria', 'texto']

3. Bag of words

In short, bag of words is a textual representation model that ignores the order and grammar of the words, but preserves their multiplicity. An example can help us:

The text: “Rafael is very fond of watching movies. Ana is fond of movies too” is transformed into a list of words:

"Rafael", "is", "very", "fond", "of", "watching", "movies", "Ana", "too"

After that, the model generates a list with the frequency that each word appears in the text: [1, 2, 1, 2, 2, 1, 2, 1, 1].
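This counting step can be sketched with nothing but the standard library, using the example sentence above (the punctuation is stripped before splitting):

```python
from collections import Counter

text = "Rafael is very fond of watching movies. Ana is fond of movies too"
tokens = text.replace(".", "").split()

# frequency of each word, ignoring order and grammar
counts = Counter(tokens)
```

Counter gives exactly the word frequencies from the example: "is", "fond", "of" and "movies" appear twice; all other words appear once.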

This process of translating words into numbers is necessary because the algorithms that we use to classify texts only accept/“understand” numbers. Luckily, scikit-learn, one of the main machine learning libraries in Python, already has a class to help with this process.
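That class is CountVectorizer; a small sketch of it on the example sentences (the document list and variable names are mine; note that it lowercases tokens by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Rafael is very fond of watching movies",
    "Ana is fond of movies too",
]
vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(docs)  # sparse matrix of word counts

# one row per document, one column per distinct word
print(bag.shape)
```

Each row of the resulting matrix is the frequency list for one document, and vectorizer.vocabulary_ maps each word to its column.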

Below, I show the training function in full; I believe it is easier to explain it this way:

from sklearn.externals import joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

def train(df, fit_file):
    print "\nTraining..."
    df = df.dropna()
    train_size = 0.8
    vectorizer = CountVectorizer()
    logreg = LogisticRegression()
    pipe = Pipeline([('vect', vectorizer), ('logreg', logreg)])
    X_train, X_test, Y_train, Y_test = train_test_split(
        df.texto, df.categoria, train_size=train_size
    )
    pipe.fit(X_train, Y_train)
    accuracy = pipe.score(X_test, Y_test)
    msg = "\nAccuracy with {:.0%} of training data: {:.1%}\n".format(train_size, accuracy)
    print msg
    pipe.fit(df.texto, df.categoria)
    joblib.dump(pipe, fit_file)

First, we create the CountVectorizer instance that will build the bag-of-words model from the texts we downloaded; at this point nothing has been computed yet.

Next comes the LogisticRegression instance that will effectively analyze our data and make predictions. As the name suggests, it uses a technique called logistic regression for text classification. I tested a few different classification techniques, but this one showed the best results, reaching 84% accuracy! You can check the test I did at this link here!

We then gather the two previous pieces in a Pipeline. This was necessary to make it easy to preserve and store the trained model. Let’s talk about that in a moment!

The train_test_split function from scikit-learn splits our mass of data: 80% for training and 20% for testing the model’s accuracy.

After that, we train our model, evaluate its accuracy and print this information to the terminal.

Finally, the model is retrained, now with 100% of the data, and another scikit-learn tool, joblib, saves the trained model to disk, so we do not need to redo this whole process every time the program is used.

4. Ready to use!

Finally! The program is ready to categorize new texts! Let’s see the predict function!

def predict(url, fit_file):
    pipe = joblib.load(fit_file)
    words = pre_processor(url)
    resp = pipe.predict([words])
    print "\nCategory: %s \n" % resp[0]
    resp = zip(pipe.classes_, pipe.predict_proba([words])[0])
    resp.sort(key=lambda tup: tup[1], reverse=True)
    for cat, prob in resp:
        print "Category {:16s} with {:.1%} probab.".format(cat, prob)

The function receives as parameters a URL (the one we want to categorize) and the path to the file saved on disk with our trained model.

First, joblib.load restores the pipeline we saved to disk, with both the CountVectorizer and the LogisticRegression.

Then pre_processor downloads and processes the text of the URL provided: basically, steps 1 and 2 of this article.

pipe.predict builds the text’s bag of words and predicts which category best represents this text.

The final loop shows all the categories known by our model and the probability of each one for the text being categorized.

Let’s see how it works!

Now that we’ve talked about the main parts of the program, let’s see it in action! To test it on your computer, download the program file and the list of links, then install the dependencies, all via pip install:

  • futures
  • goose-extractor
  • pandas
  • nltk
  • scikit-learn

The program takes three parameters. FILE is the path to the CSV file with the article links, which you can download here. The file format is very simple and you can assemble your own with other links and categories.

TRAIN is the file name of the trained model. If the file already exists, the program uses the existing model; otherwise, it will download and process the data and train a new one.

First, we download the data, process it and train our model. This is the program output showing the number of words in each subject:

After some time downloading and processing the data, the program shows us that it has saved the bag_words.csv file with the processed data, along with the accuracy of the model after training. We reached 79%!

Now let’s test whether it can guess the category of this news piece on technology (in Portuguese):

Conclusion (finally!)

I’d love to stay here and show you several examples of how the program guesses the category of various texts, but you must be very tired – I am! LOL

Well, I hope I have achieved my goal and at least motivated you to study the subject too. I left several links along the text and in the references below to help you better understand what we’re talking about.

Please feel free to comment on what you think down here, and if you know how I can improve the accuracy of the models used, that would be a great help!

