Solega Co. Done For Your E-Commerce solutions.

7 Readability Features for Your Next Machine Learning Model

By Solega Team
March 19, 2026
in Artificial Intelligence
Reading Time: 17 mins read


In this article, you will learn how to extract seven useful readability and text-complexity features from raw text using the Textstat Python library.

Topics we will cover include:

  • How Textstat can quantify readability and text complexity for downstream machine learning tasks.
  • How to compute seven commonly used readability metrics in Python.
  • How to interpret these metrics when using them as features for classification or regression models.

Let’s not waste any more time.


Introduction

Unlike fully structured tabular data, text data typically has to be prepared for machine learning models through tasks like tokenization, embedding, or sentiment analysis. While those are undoubtedly useful features, the structural complexity of text — or its readability, for that matter — can also constitute an incredibly informative feature for predictive tasks such as classification or regression.

Textstat, as its name suggests, is a lightweight and intuitive Python library that can help you obtain statistics from raw text. Through readability scores, it provides input features for models that can help distinguish between a casual social media post, a children’s fairy tale, or a philosophy manuscript, to name a few.

This article introduces seven insightful examples of text analysis that can be easily conducted using the Textstat library.

Before we get started, make sure you have Textstat installed:

pip install textstat

While the analyses described here can be scaled up to a large text corpus, we will illustrate them with a toy dataset consisting of three labeled texts. Bear in mind, however, that training a downstream machine learning model will require a sufficiently large dataset.

import pandas as pd
import textstat

# Create a toy dataset with three markedly different texts
data = {
    'Category': ['Simple', 'Standard', 'Complex'],
    'Text': [
        "The cat sat on the mat. It was a sunny day. The dog played outside.",
        "Machine learning algorithms build a model based on sample data, known as training data, to make predictions.",
        "The thermodynamic properties of the system dictate the spontaneous progression of the chemical reaction, contingent upon the activation energy threshold."
    ]
}

df = pd.DataFrame(data)
print("Environment set up and dataset ready!")

1. Applying the Flesch Reading Ease Formula

The first text analysis metric we will explore is the Flesch Reading Ease formula, one of the earliest and most widely used metrics for quantifying text readability. It evaluates a text based on the average sentence length and the average number of syllables per word. While it is conceptually meant to take values in the 0–100 range — with 0 meaning unreadable and 100 meaning very easy to read — its formula is not strictly bounded, as shown in the examples below:

df['Flesch_Ease'] = df['Text'].apply(textstat.flesch_reading_ease)

print("Flesch Reading Ease Scores:")
print(df[['Category', 'Flesch_Ease']])

Output:

Flesch Reading Ease Scores:

   Category  Flesch_Ease

0    Simple   105.880000

1  Standard    45.262353

2   Complex    -8.045000

This is what the actual formula looks like:

$$ 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right) $$
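As a sanity check, plugging hand-tallied counts for the "Simple" toy text into this formula reproduces textstat's score above. The counts below (15 words, 3 sentences, 17 syllables) were made manually:

```python
# Flesch Reading Ease computed directly from the raw counts in the formula.
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# Hand counts for the "Simple" text: 15 words, 3 sentences, 17 syllables
score = flesch_reading_ease(15, 3, 17)
print(round(score, 2))  # 105.88, matching textstat's output above
```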

Unbounded formulas like Flesch Reading Ease can hinder the proper training of a machine learning model, which is something to take into consideration during later feature engineering tasks.
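One common remedy is to rescale the feature before training. A minimal sketch, min-max scaling the three Flesch scores computed above into [0, 1] (in a real pipeline you would fit the scaler on training data only, e.g. with scikit-learn's MinMaxScaler):

```python
# Min-max scale unbounded Flesch scores into [0, 1] before modeling
scores = [105.88, 45.262353, -8.045]  # the three scores from the output above
lo, hi = min(scores), max(scores)
scaled = [(s - lo) / (hi - lo) for s in scores]
print([round(s, 3) for s in scaled])  # [1.0, 0.468, 0.0]
```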

2. Computing Flesch-Kincaid Grade Levels

Unlike the Reading Ease score, the Flesch-Kincaid Grade Level maps text complexity onto a scale similar to US school grade levels. In this case, higher values indicate greater complexity. Be warned, though: this metric behaves similarly to the Flesch Reading Ease score, in that extremely simple or complex texts can yield scores below zero or arbitrarily high values, respectively.

df['Flesch_Grade'] = df['Text'].apply(textstat.flesch_kincaid_grade)

print("Flesch-Kincaid Grade Levels:")
print(df[['Category', 'Flesch_Grade']])

Output:

Flesch-Kincaid Grade Levels:

   Category  Flesch_Grade

0    Simple     -0.266667

1  Standard     11.169412

2   Complex     19.350000
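The underlying formula uses the same two ratios as Reading Ease, just with different weights. Using the same hand counts for the "Simple" text (15 words, 3 sentences, 17 syllables), it reproduces the output above:

```python
# Flesch-Kincaid Grade Level from the same raw counts
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

print(round(flesch_kincaid_grade(15, 3, 17), 6))  # -0.266667
```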

3. Computing the SMOG Index

Another measure rooted in assessing text complexity is the SMOG Index, which estimates the years of formal education required to comprehend a text. It is driven by the number of polysyllabic words, that is, words with three or more syllables. This formula is somewhat more bounded than others, as it has a strict mathematical floor slightly above 3; the simplest of our three example texts sits exactly at that minimum.

df['SMOG_Index'] = df['Text'].apply(textstat.smog_index)

print("SMOG Index Scores:")
print(df[['Category', 'SMOG_Index']])

Output:

SMOG Index Scores:

   Category  SMOG_Index

0    Simple    3.129100

1  Standard   11.208143

2   Complex   20.267339
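The floor comes straight out of the classic SMOG formula: with zero polysyllabic words, the square-root term vanishes and only the constant 3.1291 remains, which is exactly the "Simple" score above. A minimal sketch (note that SMOG is formally defined for samples of 30+ sentences, so short texts like ours stretch its assumptions):

```python
import math

# Classic SMOG formula: grade from polysyllabic-word density
def smog_index(polysyllable_count, sentence_count):
    return 1.0430 * math.sqrt(polysyllable_count * (30 / sentence_count)) + 3.1291

# "Simple" text: no polysyllabic words across its 3 sentences
print(round(smog_index(0, 3), 4))  # 3.1291 -- the mathematical floor
```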

4. Calculating the Gunning Fog Index

Like the SMOG Index, the Gunning Fog Index also has a strict floor, in this case equal to zero. The reason is straightforward: it quantifies the percentage of complex words along with average sentence length. It is a popular metric for analyzing business texts and ensuring that technical or domain-specific content is accessible to a wider audience.

df['Gunning_Fog'] = df['Text'].apply(textstat.gunning_fog)

print("Gunning Fog Index:")
print(df[['Category', 'Gunning_Fog']])

Output:

Gunning Fog Index:

   Category  Gunning_Fog

0    Simple     2.000000

1  Standard    11.505882

2   Complex    26.000000
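The formula behind the score is simply 0.4 times the sum of average sentence length and the percentage of complex (three-or-more-syllable) words. With the hand counts for the "Simple" text (15 words, 3 sentences, no complex words), it reproduces the 2.0 above:

```python
# Gunning Fog: average sentence length plus percent complex words, scaled by 0.4
def gunning_fog(total_words, total_sentences, complex_words):
    return 0.4 * ((total_words / total_sentences)
                  + 100 * (complex_words / total_words))

print(gunning_fog(15, 3, 0))  # 2.0
```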

5. Calculating the Automated Readability Index

The previously seen formulas take into consideration the number of syllables in words. By contrast, the Automated Readability Index (ARI) computes grade levels based on the number of characters per word. This makes it computationally faster and, therefore, a better alternative when handling huge text datasets or analyzing streaming data in real time. It is unbounded, so feature scaling is often recommended after calculating it.

# Calculate Automated Readability Index
df['ARI'] = df['Text'].apply(textstat.automated_readability_index)

print("Automated Readability Index:")
print(df[['Category', 'ARI']])

Output:

Automated Readability Index:

   Category        ARI

0    Simple  -2.288000

1  Standard  12.559412

2   Complex  20.127000

6. Calculating the Dale-Chall Readability Score

Similarly to the Gunning Fog Index, Dale-Chall readability scores have a strict floor of zero, as the metric also relies on ratios and percentages. The distinctive feature of this metric is its vocabulary-driven approach, as it works by cross-referencing the entire text against a prebuilt lookup list that contains thousands of words familiar to fourth-grade students. Any word not included in that list is labeled as complex. If you want to analyze text intended for children or broad audiences, this metric might be a good reference point.

df['Dale_Chall'] = df['Text'].apply(textstat.dale_chall_readability_score)

print("Dale-Chall Scores:")
print(df[['Category', 'Dale_Chall']])

Output:

Dale-Chall Scores:

   Category  Dale_Chall

0    Simple    4.937167

1  Standard   12.839112

2   Complex   14.102500
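To see the vocabulary-driven idea in isolation, here is a toy sketch of the lookup step: any word missing from a familiar-word list counts as "difficult", and the metric is driven by the percentage of such words. The tiny FAMILIAR set below is hypothetical; the real Dale-Chall list contains roughly 3,000 words familiar to fourth graders:

```python
# Hypothetical mini familiar-word list standing in for the ~3,000-word Dale-Chall list
FAMILIAR = {"the", "cat", "sat", "on", "mat", "it", "was", "a", "sunny",
            "day", "dog", "played", "outside"}

def pct_difficult(text):
    """Percentage of words not found on the familiar-word list."""
    words = text.lower().replace(".", "").split()
    difficult = [w for w in words if w not in FAMILIAR]
    return 100 * len(difficult) / len(words)

print(pct_difficult("The cat sat on the mat. It was a sunny day."))  # 0.0
print(round(pct_difficult("The thermodynamic cat sat on the mat."), 2))  # 14.29
```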

7. Using Text Standard as a Consensus Metric

What happens if you are unsure which specific formula to use? Textstat provides an interpretable consensus metric that brings several of them together. Through the text_standard() function, multiple readability approaches are applied to the text, returning a consensus grade level. As with most grade-level metrics, the higher the value, the lower the readability. This is an excellent option for a quick, balanced summary feature to incorporate into downstream modeling tasks.

df['Consensus_Grade'] = df['Text'].apply(lambda x: textstat.text_standard(x, float_output=True))

print("Consensus Grade Levels:")
print(df[['Category', 'Consensus_Grade']])

Output:

Consensus Grade Levels:

   Category  Consensus_Grade

0    Simple              2.0

1  Standard             11.0

2   Complex             18.0
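A rough sketch of the consensus idea: round each grade-level feature and take the most common value. Feeding in the four grade estimates computed for the "Standard" text in the sections above lands on the same 11.0 that text_standard() reported, though note textstat's actual implementation combines more formulas with its own rounding rules, so this only illustrates the principle:

```python
from statistics import mode

# Grade estimates for the "Standard" text from the earlier outputs:
# Flesch-Kincaid, SMOG, Gunning Fog, ARI
grades = [11.169412, 11.208143, 11.505882, 12.559412]

# Consensus as the most common rounded grade level
consensus = mode(round(g) for g in grades)
print(consensus)  # 11
```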

Wrapping Up

We explored seven metrics for analyzing the readability or complexity of texts using the Python library Textstat. While most of these approaches behave somewhat similarly, understanding their nuanced characteristics and distinctive behaviors is key to choosing the right one for your analysis or for subsequent machine learning modeling use cases.




© 2024 Solega, LLC. All Rights Reserved | Solega.co
