A text structure is defined by syntactic and semantic rules too complex for traditional algorithms to grasp. Classifying a text is useful to analyze each of its constituent parts. Today we’ll explore this problem, and discuss a few solutions and ways to overcome common issues.
Texts: Structure and Standardization
Every text has a syntactic and semantic structure. For example, short stories usually have an introduction, a problem and a resolution. Some structures are more standardized than others, such as letters of recommendation, business and legal documents, manuals, books.
Laws are examples of highly structured texts, with a hierarchy that goes from titles to chapters and sections. If a law was in plain text, it would be easy to identify where each article begins and ends thanks to indicators like “ARTICLE 1”, “ARTICLE 2”, and so on.
The less standardization and the more semantic dependence, the less viable to use common expressions to chunk texts, that is, to divide it in sections and classify them into categories.
In our case, we had hundreds of documents in different formats (txt, pdf, html) with no metadata associated to their structure, and the aim was to make them available for the user in a readable way, so that they could explore them and find the information they looked for. Text chunking also allows to create a table of contents, which makes a text even more discoverable. In addition, sections can be classified, indexed or processed separately, and associated to the sections that precede or follow them.
The first step of the analysis is to classify the text:
- Text have a finite number of sections.
- Sections depend on syntax, semantics and textual context.
- Every text has the same type of sections.
- A sort of “dialogue” can be established between different types of sections.
- There are no keywords telling which type each section belongs to; and if there were, they’re different based on each text.
- A line of text cannot belong to two different sections.
To develop and model our problem we used a Named-Entity Recognition (NER) system. Although the problem wasn’t exactly the same, it was quite similar, and this allowed us to reach a solution.
NER tagging is a problem of natural languages processing. The system tries to identify the text’s known entities, such as names of people, places or companies, addresses, dates, custom categories (names of projects or forms), and compound categories. Usually, NER entities comprise a few words, while the sections we needed to identify were made up of several sentences.
In general, NER tagging is considered to be a problem of sequence classification. Given a list of words, the model has to be able to classify each one of them, interpret if they belong to an entity or not, and produce a list of tags associated to them.
One tagging method is IOB tagging, which stands for Inside–Outside–Beginning. These tags indicate whether a given word marks the beginning of an entity, if it’s inside one or if it doesn’t belong to any.
Consider this sentence, for example: Are there seats available to watch Frozen II on Thursday 21st in any of the movie theaters of Palermo between 6 and 11 p.m.?
The entities to extract are date (D), time (T), movie (M) and neighborhood (N), and the system will use these tags: O for outside, BX to indicate the beginning of an entity, and IX to indicate that a word is inside an entity. Punctuation and question marks are considered to be words. In this way, the sentence would be tagged as follows:
Are(O) there(O) seats(O) available(O) to(O) watch(O) Frozen(BM) II(IM) on(O) Thursday(BD) 21st(ID) in(O) any(O) of(O) the(O) movie(O) theaters(O) of(O) Palermo(BN) between(O) 6(BT) and(O) 11(BT) p.m.(IT)?(O)
Once the tagging task is complete, you can gather together words belonging to the same entity. This is what we get: Date: “Thursday 21st”, Neighborhood: “Palermo”, Movie: “Frozen II”, Time: “6”, Time: “11 p.m.”.
There are other types of tagging systems. Some are simpler and only take into consideration whether the word is inside or outside an entity, while others are more complex and add tags to indicate where entities end or which was the preceding tag. We tried different types of tagging systems and IOB was the most suitable for our solution.
In order to tag or classify inputs, first it’s necessary to define the form of the inputs and a representation for each word.
One method is to match each word to an identifier, another is to represent each word with a real number vector (much smaller than the vocabulary). These are called word embeddings, and in order to create them you need a large number of texts (from a similar knowledge domain). Its aim is to gather the representation of words that appear in similar contexts (distributional hypothesis), capturing its semantics in different dimensions.
The Simple Model
A simple model would take the embedding of each word as input and try to predict a tag for it. Ideally, this should produce good results, but it has limitations. One throwback is that it doesn’t take into consideration each word’s context, nor does it consider the structure of the sequence it should produce. Moreover, each word would remain linked to one tag only regardless of its position, and the model would always produce the same prediction.
The Model with Context
If we want the model to consider the context, we can use a Long Short-Term Memory (LSTM). This allows to consider a sequence of variable length as input and produce a tag for each input. As it processes word after word, it saves the relations between words in an internal memory. You can add another LSTM to read the sentence backwards and save word relations of the sentence in both ways.
This model incorporates word context information into the predictions but ignores the structure of the tags. For instance, the fact that an Inside tag must always come after a Beginning tag or after another Inside tag, but never after an Outside tag. Also, it doesn’t model the possibility to migrate between categories.
The Model with Structure
Conditional Random Fields (CRF) seek to parameterize relations between words and tags, and the migration between different tags. This model focuses on the structure of what we’re trying to predict, but doesn’t use word context information to create such predictions.
The Solution We Chose
We decided to adopt a combination of models with context and structure, that is, a biLSTM-CRF model. We used a first layer of the bi-LSTM model which, instead of trying to predict tags, created a word representation based on context. Then, CRF used that as input to predict tags.
As we explained at the beginning, we had long sections and the restriction that one line of text couldn’t belong to two different sections. So, after classifying each word, a heuristics method was implemented to determine the category each line belonged to. To improve that, a LSTM was used for each line of text, but instead of producing a representation for each word, it tried to represent the entire line of text. Each line representation was then feed to the biLSTM-CRF model, which produced in return a tag for each line. Such modification had a huge impact on metrics improvement.
We’d like to finish by highlighting the importance of exploring problems that are similar to ours to guide us to possible solutions. In our case, it was key to incorporate the higher possible number of aspects of the problem and integrate the problem’s particular structure into the model.