Machine Learning: Classify Gender By Name using Django and NLTK

Posted on: 19 November 2016
By: admin

Machine learning is one of the hottest topics in IT right now. The rapid growth of data and the increase in computing power are probably the main factors behind this. Beyond the trend, I personally find machine learning very interesting. So I went looking for information about it, which led me to Python and its NLTK (Natural Language Toolkit) library. The last time I used Python was about two years ago, when I was learning Django, the web development framework for Python. That gave me the idea to create a simple Django web application that can classify gender by name.

Installing Apache, Django, and NLTK

I'm developing this application on a Virtual Private Server (VPS). Of the operating systems offered during the VPS setup process, I chose Debian Linux. Python was already installed, so I just needed to install Apache, Django, and NLTK.

There are many articles on the internet about installing Apache and Django on Debian. One of them is here. You can follow the instructions until you get your Django application running.
Instructions for installing NLTK are available on its official website. You can read them at the following link.
After installing NLTK, you need to install the sample data too, so you don't have to prepare your own data to test your machine learning application. Read about that here.
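
For the sample data, a small helper like the one below should be enough. This is only a sketch: ensure_names_corpus is a name I made up for illustration, and it assumes NLTK is already installed and the server can reach the internet. It downloads the names corpus, which holds the male.txt and female.txt files used later, and returns how many names each file contains.

```python
def ensure_names_corpus():
    """Download the NLTK 'names' corpus and return (male_count, female_count)."""
    # Hypothetical helper: NLTK is imported inside the function so that
    # merely defining it does not require NLTK to be importable.
    import nltk

    # Fetch the corpus; NLTK reports it as up to date if it is already there.
    nltk.download("names")

    from nltk.corpus import names
    return len(names.words("male.txt")), len(names.words("female.txt"))
```

Calling ensure_names_corpus() once from a Python shell on the server is enough. If it returns two non-zero counts, the data is ready for the view later in this post.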

Additional Configuration

NLTK uses the NumPy library, which relies on C extension modules for Python. So you need to add the following line to your /etc/apache2/sites-enabled/000-default file. Add it after the line that contains WSGIScriptAlias.

    WSGIApplicationGroup %{GLOBAL}

Without that line, I had a problem where my script never stopped running when I executed it from the web browser.

A Glimpse of Supervised Learning

Before you start reading the code, I want to share a little about Supervised Learning. It is a machine learning category where the expected output is defined in advance: the model learns from examples that are already tagged with the correct answer. The most common example is spam detection, which classifies a text as "Spam" or "Not Spam (Ham)". "Spam" and "Ham" are the outputs that we defined from the start. We usually call them labels.
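
To make the idea of labels concrete, here is a tiny pure-Python sketch. Everything in it is made up for illustration: the four training messages, the train and classify functions, and the scoring rule, which simply picks the label whose training words overlap most with the new text. A real spam filter is far more careful, but the shape is the same: labelled examples go in, a model comes out, and the model predicts a label for new input.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training data: every text comes with a known label.
training_data = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch with the team on monday", "ham"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(text.split())
    return word_counts


def classify(word_counts, text):
    """Pick the label whose training words overlap most with the text."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(training_data)
print(classify(model, "free prize inside"))       # spam
print(classify(model, "team meeting on monday"))  # ham
```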

Another machine learning category is Unsupervised Learning, where we don't have labels for the data. I'll borrow the most common explanation of it, which is the recommendation system on e-commerce websites. Their "smart program" reads the profile of each user: what they view, like, and buy. Then the program groups together people who look similar to each other. This is called Clustering. The "smart program" then recommends to you the items bought by other people who are similar to you.
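
Here is a very rough sketch of that recommendation idea. Note that it is a simplification: instead of running a real clustering algorithm, it just finds the single most similar user by Jaccard similarity (shared items divided by all items) and suggests what that user bought. The users, the purchases, and the function names are all invented for the example. The point is that there are no labels anywhere in the data.

```python
# Hypothetical purchase histories; notice that nothing here is labelled.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    return len(a & b) / len(a | b)


def recommend(user, purchases):
    """Suggest items bought by the most similar other user."""
    others = [name for name in purchases if name != user]
    nearest = max(others, key=lambda name: jaccard(purchases[user], purchases[name]))
    # Only recommend what the user does not already own.
    return sorted(purchases[nearest] - purchases[user])


print(recommend("alice", purchases))  # ['monitor']
```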

If we classify gender based on a name, which category does it fall into? Yes, it's Supervised Learning, because we have already defined the expected output, which is "Male" or "Female".

The Code

Here comes the code. I assume you already know the basics of Django, so I won't explain it in too much detail.

views.py

In the following code, I use Naive Bayes classification to classify gender by name. The sample names used here are taken from the data provided by NLTK. You can replace them with your own data when needed.

The two most interesting parts of this machine learning method are the training step, where the function nltk.NaiveBayesClassifier.train is called, and the features, which can be seen in the gender_features function. The features we define determine the accuracy of the output, and they are used in the training step mentioned earlier. If we choose good features, the result will be good too. But please be careful: don't use too many features, because the model may then overfit the training data and perform worse on names it has not seen.
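
To make the training step less of a black box, here is a rough pure-Python sketch of the idea behind Naive Bayes. It is not NLTK's implementation, just a simplified stand-in: the function names, the add-one smoothing, and the six toy training rows are my own assumptions. Training only counts how often each feature value appears under each label. Classifying adds up the log-probability of the label and of each feature value given that label, and picks the label with the highest total.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_features):
    """Count labels, (feature, value) pairs per label, and values seen per feature."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    seen_values = defaultdict(set)       # feature -> set of values seen in training
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][(name, value)] += 1
            seen_values[name].add(value)
    return label_counts, value_counts, seen_values


def classify(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(label)
        for name, value in features.items():
            # Add-one smoothing; the extra +1 leaves room for unseen values.
            vocab = len(seen_values[name]) + 1
            seen = value_counts[label][(name, value)]
            score += math.log((seen + 1) / (count + vocab))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set using a single feature.
toy_train_set = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "n"}, "male"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "e"}, "male"),
]

toy_model = train_naive_bayes(toy_train_set)
print(classify(toy_model, {"last_letter": "a"}))  # female
print(classify(toy_model, {"last_letter": "n"}))  # male
```

The training data has the same shape that NLTK expects: a list of (feature dictionary, label) pairs.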

In this tutorial, I use the first and last characters, and also the first three and last three characters, as the features. I personally think they are good enough.

import random

import nltk
from django.shortcuts import render
from nltk.corpus import names


def gender_features(name):
    """Build the feature dictionary used to train and query the classifier."""
    features = {}

    # First and last character of the name
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()

    # First three and last three characters of the name
    features["first_three_letter"] = name[:3].lower()
    features["last_three_letter"] = name[-3:].lower()

    return features


def index(request):
    message = ''
    if request.method == 'POST':
        input_name = request.POST.get("name", "")

        if input_name != '':
            # Prepare the label for each name
            labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                             [(name, 'female') for name in names.words('female.txt')])

            # Shuffle so both sets contain a mix of male and female names
            random.shuffle(labeled_names)

            # Generate the training set (first 3000 names) and the test set (the rest)
            feature_set = [(gender_features(n), gender) for (n, gender) in labeled_names]
            train_set = feature_set[:3000]
            test_set = feature_set[3000:]

            # Note: the classifier is trained again on every request
            classifier = nltk.NaiveBayesClassifier.train(train_set)

            gender = classifier.classify(gender_features(input_name))
            accuracy = round(nltk.classify.accuracy(classifier, test_set) * 100, 2)
            message = input_name + " is probably " + gender + ". (accuracy : " + str(accuracy) + "%)"
        else:
            message = "Name cannot be empty!"

    context = {'message': message}
    return render(request, 'NameClassification/form.html', context)
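
To see what the classifier actually receives, we can call gender_features on its own. The function below is copied from views.py so that the snippet runs without Django or NLTK. The sample names are just ones I picked.

```python
def gender_features(name):
    features = {}

    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()

    features["first_three_letter"] = name[:3].lower()
    features["last_three_letter"] = name[-3:].lower()

    return features


print(gender_features("Andrea"))
# {'first_letter': 'a', 'last_letter': 'a',
#  'first_three_letter': 'and', 'last_three_letter': 'rea'}

# Names shorter than three characters still work: slicing just returns less.
print(gender_features("Al"))
# {'first_letter': 'a', 'last_letter': 'l',
#  'first_three_letter': 'al', 'last_three_letter': 'al'}
```

An empty string would raise an IndexError at name[0], which is why the view checks that the name is not empty before calling the function.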

templates/NameClassification/form.html

There is nothing special in this template file, except that we use {% csrf_token %} to protect the form against CSRF (cross-site request forgery).

<!DOCTYPE html>
<html>
<head>
    <title>Guess Gender By Name</title>
</head>
<body>
    <form method="post">
        {% csrf_token %}
        <label for="name">Name: </label>
        <input id="name" type="text" name="name" required />
        <input type="submit" value="Submit" />
    </form>
    <p><b>{{ message }}</b></p>
</body>
</html>
