Understanding Feature Engineering (Part 3)—Traditional Methods for Text Data

<h3 id="62e4">Introduction</h3>
We covered various feature engineering strategies for structured data in the first two parts of this series. Check out <a href="https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b">Part-I: Continuous, numeric data</a> and <a href="https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63">Part-II: Discrete, categorical data</a> for a refresher. In this article, we will look at how to work with text data, which is one of the most abundant sources of unstructured data. Text data usually consists of documents, which can represent words, sentences or even paragraphs of free flowing text. The inherently unstructured (no neatly formatted data columns!) and noisy nature of textual data makes it harder for machine learning methods to work directly on raw text. Hence, in this article, we will follow a hands-on approach to explore some of the most popular and effective strategies for extracting meaningful features from text data. These features can then be used to build machine learning or deep learning models.
<h3 id="6e2f">Motivation</h3>
Feature engineering is often called the secret sauce for creating better performing machine learning models. Just one excellent feature could be your ticket to winning a <a href="https://www.kaggle.com/">Kaggle</a> challenge! Feature engineering matters even more for unstructured, textual data, because we need to convert free flowing text into numeric representations that machine learning algorithms can understand. Even with the advent of automated feature engineering capabilities, you still need to understand the core concepts behind different feature engineering strategies before applying them as black boxes. Always remember, “If you are given a box of tools to repair a house, you should know when to use a power drill and when to use a hammer!”
<h3 id="1540">Understanding Text Data</h3>
I’m sure all of you have a fair idea of what textual data comprises in this scenario. Do remember that text can also appear as structured data attributes, but those usually fall under the umbrella of structured, categorical data.
<img alt="" class="blockcode" src="https://201907.oss-cn-shanghai.aliyuncs.com/cs/5606289-41759c41ed0c369de6005c7163db3e39.png">
In this scenario, we are talking about free flowing text in the form of words, phrases, sentences and entire documents. There is some syntactic structure: words make phrases, phrases make sentences, and sentences in turn make paragraphs. However, text documents have no inherent tabular structure, because the words used vary widely across documents and each sentence has a variable length, in contrast to the fixed number of data dimensions in structured datasets. This article itself is a perfect example of text data!
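To make the point about variable length concrete, here is a minimal sketch. The two sentences are hypothetical examples made up for illustration, and the whitespace split is only a crude stand-in for real tokenization.

```python
# Hypothetical sentences, invented for illustration only.
documents = [
    "The sky is blue.",
    "Feature engineering turns raw text into numbers that models can use.",
]

# Naive whitespace tokenization: each document yields a different number of tokens.
token_counts = [len(doc.split()) for doc in documents]

# The set of distinct (lowercased, punctuation-stripped) words across all documents.
vocabulary = sorted(
    {word.lower().strip(".,!?") for doc in documents for word in doc.split()}
)

print(token_counts)
print(len(vocabulary))
```

The token counts differ from one document to the next and the vocabulary grows as documents are added, so raw text does not map directly onto a fixed set of columns.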
<h3 id="c453">Feature Engineering Strategies</h3>
Let’s look at some popular and effective strategies for handling text data and extracting meaningful features from it for use in downstream machine learning systems. All the code used in this article is also available in <a href="https://github.com/dipanjanS/practical-machine-learning-with-python/tree/master/bonus%20content/feature%20engineering%20text%20data">my GitHub repository</a> for future reference. We’ll start by loading up some basic dependencies and settings.
<pre id="f238">import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt</pre>
<pre id="9e5c">pd.options.display.max_colwidth = 200
%matplotlib inline</pre>
Let’s now take a sample corpus of documents on which we will run most of our analyses in this article. A <a href="https://en.wikipedia.org/wiki/Text_corpus">corpus</a> is typically a collection of text documents usually belonging to one or more subjects.
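As a rough sketch of how such a corpus might be held in code, the snippet below puts a few documents and their categories into a pandas DataFrame. The sentences, the category labels and the column names are all hypothetical, invented for illustration; they are not the article's actual corpus.

```python
import pandas as pd

# Hypothetical documents and category labels, invented for illustration only.
corpus = [
    "The river runs through the old town.",
    "Fresh bread and cheese make a simple lunch.",
    "The old town has a busy market square.",
    "Cats sleep for most of the day.",
]
labels = ["places", "food", "places", "animals"]

# One row per document; the column names are an assumption.
corpus_df = pd.DataFrame({"Document": corpus, "Category": labels})
print(corpus_df)
```

Keeping the documents next to their labels in one frame makes it easy to inspect the corpus before any cleaning or feature extraction is applied.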
 
<img alt="" class="blockcode" src="https://201907.oss-cn-shanghai.aliyuncs.com/cs/5606289-40c0845f71521398438f68285a980011.png">
Our sample text corpus
You can see that we have taken a few sample text documents belonging to different categories for our toy corpus. Before we talk about feature engineering, as always, we need to do some data pre-processing or wrangling to remove unnecessary characters, symbols and tokens.
<h4>Text pre-processing</h4>
There can be multiple ways of cleaning and pre-processing textual data. In the following points, we highlight some of the most important ones which are used heavily in Natural Language Processing (NLP) pipelines.
<ul><li id="b639">Removing tags: Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing text. The BeautifulSoup library does an excellent job of providing the necessary functions for this.</li><li id="7627">Removing accented characters: In any text corpus, especially if you are dealing with the English language, you will often come across accented characters/letters. Hence
