Lab 13: Text Processing

Objectives

Resources

Introduction

In this assignment you will have the computer read in a novel stored as a text file, perform some frequency analysis on the text, and store your results as text files.

Natural Language Processing

Natural Language Processing (NLP) is one of the forms of AI most widely used by the average person. Web searches, auto-correct, auto-complete, spell-check, virtual assistants (Siri, Alexa, etc.), machine translation (Google Translate), and of course ChatGPT all rely on various NLP techniques ranging from very simple to very complex. All of these models start with a lot of text that is stored in a way that a computer can use. The first steps in building these models are reading the text into memory and cleaning it up so that we can perform statistical analysis on the data.

Normalizing the Data

Normalizing data is the act of converting all the data you are working with into the same format. Depending on the type of data, there are many things you can do to normalize it and ensure you have homogeneous data that can be processed in the same way. In the case of text data, we often remove any punctuation, such as periods, quotation marks, apostrophes, and so on. We will also often convert all of the text to the same case, usually lowercase. This is done because in most programming languages, upper- and lowercase letters are not considered equal, as the ASCII values stored in memory are not the same.
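For example, a minimal sketch of this kind of normalization might look like the following. The method name and the exact cleaning rules here are assumptions for illustration; the lab's javadocs define the required behavior.

// Illustrative sketch only: strip punctuation and lowercase a single word.
public class NormalizeExample {
    public static String normalize(String word) {
        // keep only letters and digits, then convert to lowercase
        return word.replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"Right!\"")); // prints: right
    }
}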

Word Position

Often we will want to know where a word is in relation to the other words around it in order to identify and generate phrases and infer meaning. For example, in the following sentences, the word "right" has a different meaning depending on the words that surround it:

Turn right now
Turn to the right now

To keep track of a given word's position, we will record its location in the source text by simply numbering the words as they are read: the first word in the document has a position of 0, the next word has a position of 1, and so on. Since our documents will be relatively small, this will be sufficient.
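As an illustration (the class and variable names below are hypothetical, not part of the lab's required API), word positions fall out naturally from the order in which a Scanner reads the words:

import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Illustrative sketch: numbering words as they are read.
public class PositionExample {
    public static void main(String[] args) {
        Scanner input = new Scanner("Turn to the right now");
        List<String> words = new ArrayList<>(); // index in this list is the word's position
        while (input.hasNext()) {
            words.add(input.next().toLowerCase());
        }
        // "turn" has position 0, "to" has position 1, ..., "now" has position 4
        System.out.println(words.indexOf("right")); // prints 3
    }
}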

N-grams

Another common technique used in NLP is grouping words together into larger chunks to process. This is because the same words often appear next to each other in language, and by keeping track of these relationships we can gain more information about the language. These chunks are called N-grams, where N is the number of words in the chunk. The simplest of these is the Bigram: two consecutive words whose position is the position of the first word in the Bigram. For example, for the Bigram "i want", the position would be whatever the position of the word "i" is.
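A small sketch of building Bigrams from an already-normalized word list might look like this (the class name and sample words are illustrative assumptions):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: pairing each word with the word that follows it.
public class BigramExample {
    public static void main(String[] args) {
        List<String> words = List.of("i", "want", "to", "turn", "right");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i < words.size() - 1; i++) {
            // the Bigram at index i pairs word i with word i + 1,
            // so its position is the position of its first word
            bigrams.add(words.get(i) + " " + words.get(i + 1));
        }
        System.out.println(bigrams); // [i want, want to, to turn, turn right]
    }
}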

Frequency Analysis

One of the most important tasks in NLP is finding the frequency of words and N-grams in a given document. The frequency of a word is the number of times the word appears in the document. Additionally, knowing how many unique words the document contains is also useful, so a Vocabulary, which contains one of each word that exists in the document, is tracked separately. This will result in three separate collections of data from a single text document:

the list of all words in the document, in order of position
the Vocabulary, with the frequency of each word
the Bigrams, with the frequency of each Bigram
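One common way to count frequencies in Java is with a map from each word (or Bigram) to its count. The sketch below is illustrative only; the lab's UML may call for a different structure.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: counting word frequencies with a map.
// A TreeMap keeps the vocabulary sorted by word.
public class FrequencyExample {
    public static void main(String[] args) {
        List<String> words = List.of("turn", "right", "now", "turn", "right");
        Map<String, Integer> vocabulary = new TreeMap<>();
        for (String word : words) {
            // getOrDefault returns 0 the first time a word is seen
            vocabulary.put(word, vocabulary.getOrDefault(word, 0) + 1);
        }
        System.out.println(vocabulary); // {now=1, right=2, turn=2}
    }
}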

Implementation

For this lab, you will read in a text document using a java.util.Scanner, normalize the data that you read in, and generate the three collections listed above for that document. You will then save your bigram and vocabulary lists with their frequencies as separate text files. You will also properly handle any exceptions that arise.
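As a rough sketch of the overall flow only (the actual class names, method signatures, and output file names are specified in the UML diagram and javadocs), reading a file with a Scanner, normalizing each word, and handling a missing file might look like this:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;

// Illustrative sketch: read, normalize, write, and handle exceptions.
public class FileIoExample {
    public static void main(String[] args) {
        // the input and output file names here are assumptions for illustration
        try (Scanner input = new Scanner(new File("data/test.txt"));
             PrintWriter out = new PrintWriter("test_vocabulary.txt")) {
            while (input.hasNext()) {
                String word = input.next().replaceAll("[^a-zA-Z0-9]", "").toLowerCase();
                if (!word.isEmpty()) {
                    out.println(word); // the real lab writes words with their frequencies
                }
            }
        } catch (FileNotFoundException e) {
            System.err.println("Could not open file: " + e.getMessage());
        }
    }
}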

Use the following UML diagram and the associated javadocs for more detailed instructions.

Additional information

Sample Output

The output from executing the program on the three files stored in the data folder (test.txt, connecticutYankee.txt, and warAndPeace.txt) can be found below. Each run of the program produces two text files: one containing the vocabulary of the file and one containing all the bigrams.

Acknowledgment

This laboratory assignment was developed by Prof. Sean Jones.

See your professor's instructions for details on submission guidelines and due dates.