Relationship extraction: An Intro
Fields such as event detection and question answering need machines to identify the relations between entities in a single sentence, and once those relations are available, many downstream tasks become much easier to solve. This is a major reason why several approaches to relationship extraction have been proposed by different groups across the community. Entities such as person, place and organisation are the most basic form of information present in a sentence, and they are often linked through well-defined relations. The goal of any approach is to have a machine identify these relationships automatically.
There are many ways, approaches and algorithms for achieving this. One such approach has been published by H. Elsahar et al. at the Laboratoire Hubert Curien, Université de Lyon, Saint-Étienne.
The reference approach: Unsupervised Open Relationship Extraction
As someone working in the field who has spent a full month validating this approach, I feel a responsibility to share with my professional community the tricks I worked out, the challenges I faced and the logistics I figured out while implementing the work of H. Elsahar et al. on an entirely new dataset.
First, I looked for a publicly available dataset suitable for relationship extraction, since the dataset used in the paper (NYT-FB) is not freely available. I found a candidate dataset, kbp37, in a GitHub repository; it is suitable for applying the unsupervised relationship extraction approach described in the paper. What made me settle on this dataset was the presence of labeled relationships between entities. You may wonder why an unsupervised approach needs a labeled dataset. The answer is evaluation: the labels are not used for learning, only to measure how well the approach performs.
Preprocessing: Making data feed-able for the machine learning task
The next step, following the reference paper, was to preprocess the dataset. Since my dataset was different from the one in the paper, I had to perform a few extra preprocessing steps beyond those the paper mentions. For each sentence, I removed stop words, punctuation and any non-parsable characters using the NLTK library.
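A minimal sketch of this cleaning step is shown here. It assumes a tiny hand-written stop-word set in place of NLTK's English stop-word list (which requires a corpus download), and it treats "non-parsable" as "not ASCII"; the function name `clean_sentence` is hypothetical.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "was", "in", "of", "and", "to", "at"}

def clean_sentence(sentence: str) -> list[str]:
    """Drop non-ASCII characters, punctuation and stop words; return tokens."""
    # Discard characters that cannot be encoded as ASCII.
    ascii_only = sentence.encode("ascii", errors="ignore").decode("ascii")
    # Replace punctuation with spaces so "Warsaw," becomes "Warsaw".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = ascii_only.translate(table).split()
    # Stop-word matching is case-insensitive; the original casing is kept.
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

print(clean_sentence("Marie Curie was born in Warsaw, Poland."))
```

In a real pipeline the stop-word set would come from the library rather than a literal, but the filtering logic stays the same.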
The next part of preprocessing was to get the named entities, along with their types, from each sentence. The two entities involved in a relation can be called the subject and the object, and the relation can be written as a subject-relation-object triplet. Most relationship extraction approaches only use entities of type person, location or organization. Extracting named entities from a sentence is known as named entity recognition (NER) and can be performed easily with existing libraries. The paper used two libraries for the NER task: Stanford CoreNLP and DBpedia Spotlight. Both were used to find named entities of type person, location or organization in a sentence. DBpedia Spotlight also provides finer-grained tags within these three categories (for example, if a person is an artist, it will also give an artist tag or a musician tag), which helps in getting more relevant features for the relation between entities. After NER, sentences containing more than two entities were removed, as the approach only handles binary relations.
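The filtering rule can be sketched as follows. This assumes the NER output is already available as (mention, type) pairs; no tagger is called, and the type labels are placeholders. Keeping only sentences with exactly two entities and treating the first mention as the subject are my assumptions for illustration.

```python
ALLOWED_TYPES = {"PERSON", "LOCATION", "ORGANIZATION"}

def keep_binary(sentences):
    """Keep sentences that contain exactly two entities of an allowed type.

    `sentences` is a list of (text, entities) pairs, where `entities` is a
    list of (mention, type) tuples as an NER tagger might return them.
    """
    kept = []
    for text, entities in sentences:
        typed = [(m, t) for m, t in entities if t in ALLOWED_TYPES]
        if len(typed) == 2:
            # Assumption: first mention is the subject, second the object.
            subject, obj = typed
            kept.append((text, subject, obj))
    return kept

sample = [
    ("Marie Curie was born in Warsaw",
     [("Marie Curie", "PERSON"), ("Warsaw", "LOCATION")]),
    ("Alice met Bob in Paris",
     [("Alice", "PERSON"), ("Bob", "PERSON"), ("Paris", "LOCATION")]),
]
for text, subject, obj in keep_binary(sample):
    print(text, subject, obj)
```

The second sentence is dropped because it mentions three entities.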
The last preprocessing step was to find the lexical dependency path between the two entities in a sentence. The dependency path is needed later, in the feature extraction step, to compute the re-weighted word embeddings. The dependency parse was obtained using the Stanford parser library.
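To make the idea of a dependency path concrete, here is a small sketch that finds the shortest path between two tokens with a breadth-first search. The edge list is hand-written for illustration and is not actual parser output; the function name `dependency_path` is hypothetical.

```python
from collections import deque

def dependency_path(edges, source, target):
    """Return the tokens on the shortest path between two tokens.

    `edges` is a list of (head, dependent) pairs from a dependency parse,
    treated as undirected for the purpose of path finding.
    """
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return []  # the two tokens are not connected

# Hand-written parse of "Curie was born in Warsaw" (illustrative only).
edges = [("born", "Curie"), ("born", "was"), ("born", "Warsaw"), ("Warsaw", "in")]
print(dependency_path(edges, "Curie", "Warsaw"))  # ['Curie', 'born', 'Warsaw']
```

With real parser output it is safer to key the graph on token positions rather than word strings, since the same word can occur more than once in a sentence.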
The next stage of the study was to extract features from each sentence that can uniquely represent the relation between its entities. Anyone working in NLP has almost certainly heard of word embeddings or word2vec. Put simply, a word embedding is a learned representation of text in which words with similar meanings have similar representations. Many pre-trained word embeddings are publicly available, each accomplishing almost the same goal. The paper uses GloVe word embeddings, where each word is represented by a 100-dimensional vector.
The USP of the reference paper is its novel way of building a sentence feature from re-weighted word embeddings. The idea is that not all words in a sentence contribute equally to representing the relationship between two entities: the words that occur on the lexical dependency path between the two entities contribute more than the other words. The sentence embedding is still calculated by adding the word vectors of all words in the sentence, but each word vector is weighted first.
The re-weighting can be explained simply: the vector of every word on the lexical dependency path is multiplied by a weight Ci together with a normalisation factor that depends on the number of words in the sentence and the number of words in the dependency path. All the other words of the sentence, which are not on the dependency path, are just multiplied by the weight Co.
The paper reported the values of Ci to be 1.85 and Co to be 0.02, which were found through experiments. The paper also mentions other features, which were concatenated with the feature vector calculated above. However, their coverage in the paper is minimal, so here I will try to describe how those features were calculated and concatenated into a single feature vector for the sentence.
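The re-weighting described above can be sketched roughly as follows. The defaults `c_in=1.85` and `c_out=0.02` are the reported Ci and Co values. The parameter `path_norm` is a hypothetical placeholder for the normalisation factor, whose exact form is not reproduced here, so it defaults to 1.0 (no normalisation). The toy 2-dimensional vectors stand in for 100-dimensional GloVe vectors.

```python
def sentence_vector(tokens, path_tokens, embeddings,
                    c_in=1.85, c_out=0.02, path_norm=1.0):
    """Sum word vectors, weighting dependency-path words more heavily.

    `path_norm` is a placeholder for the normalisation factor applied to
    path words; its true form is an open assumption in this sketch.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    on_path = set(path_tokens)
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:  # skip out-of-vocabulary words
            continue
        weight = c_in * path_norm if tok in on_path else c_out
        total = [acc + weight * x for acc, x in zip(total, vec)]
    return total

# Toy embeddings (illustrative values only).
toy = {"Curie": [1.0, 0.0], "was": [2.0, 2.0],
       "born": [0.0, 1.0], "Warsaw": [1.0, 1.0]}
vec = sentence_vector(["Curie", "was", "born", "Warsaw"], ["born"], toy)
print(vec)
```

Here the single path word "born" dominates the second dimension of the result, while the three off-path words together add only a small amount to each dimension.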
The entity types obtained in the preprocessing step are used to build the other features of the sentence. The first set uses the entity types obtained from the Stanford CoreNLP library. From these entity types, three separate sparse feature vectors were calculated. The method starts by separating the entities into subject and object. For each sentence, the entity type of the subject was converted into a one-hot encoding, and the entity type of the object was likewise converted into a one-hot encoding vector. For the third feature, the combination of subject and object entity types was one-hot encoded. The reason for using the combination is that some relations occur mostly between two particular types of entities (for example, the place_of_birth relation occurs when the entities involved are of type Person-Location).
The entity types generated by DBpedia Spotlight were converted using the same approach as the Stanford CoreNLP tags. The only difference is that the combined subject-object feature was not used, because of the large number of combinations involved in this case.
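The three one-hot features can be sketched like this, in plain Python. The helper names are hypothetical, the vocabularies are built from the data at hand, and the three vectors are concatenated into one row per sentence purely for brevity.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == item else 0 for item in vocabulary]

def entity_type_features(pairs):
    """Build subject, object and combined one-hot vectors per sentence.

    `pairs` is a list of (subject_type, object_type) tuples.
    """
    subj_vocab = sorted({s for s, _ in pairs})
    obj_vocab = sorted({o for _, o in pairs})
    combo_vocab = sorted(set(pairs))
    return [
        one_hot(s, subj_vocab) + one_hot(o, obj_vocab) + one_hot((s, o), combo_vocab)
        for s, o in pairs
    ]

pairs = [("PERSON", "LOCATION"),
         ("PERSON", "ORGANIZATION"),
         ("ORGANIZATION", "LOCATION")]
rows = entity_type_features(pairs)
for row in rows:
    print(row)
```

Each row has exactly three non-zero entries, one per feature group, which is why these vectors become very sparse once the number of entity types grows.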
The feature vectors obtained from the entity-based features were sparse. To address this sparsity, the paper applied feature reduction using PCA. The reduced features were then concatenated with the proposed re-weighted sentence vector.
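A minimal sketch of the reduce-then-concatenate step, assuming NumPy and a plain SVD-based PCA (the PCA implementation actually used is not specified, and the number of components here is arbitrary). The random `dense` matrix is a stand-in for the re-weighted sentence vectors.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project the rows of `features` onto their top principal components."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Four sentences with 4-dimensional sparse one-hot features (toy data).
sparse = np.array([[0, 1, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1]])
# Stand-in for the re-weighted sentence vectors (random, illustrative only).
dense = np.random.default_rng(0).normal(size=(4, 3))

reduced = pca_reduce(sparse, n_components=2)
combined = np.hstack([dense, reduced])
print(combined.shape)  # (4, 5)
```

The number of components is the kind of parameter that would be tuned on the validation set.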
Finally, after obtaining a single feature vector for each sentence in the dataset, clustering was applied to group the sentences so that each cluster represents a particular relation. The paper reports better results with Hierarchical Agglomerative Clustering (HAC) using Ward's linkage criterion than with k-means clustering.
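As an illustration of HAC with Ward's linkage, here is a short sketch using SciPy's hierarchical clustering routines. The toy 2-dimensional vectors and the choice of two clusters are assumptions for the example; how the number of clusters is chosen in practice is not covered here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_sentences(vectors, n_clusters):
    """Cluster sentence vectors with Ward-linkage agglomerative clustering."""
    X = np.asarray(vectors, dtype=float)
    # Build the full merge tree using Ward's minimum-variance criterion.
    tree = linkage(X, method="ward")
    # Cut the tree so that at most `n_clusters` flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Two well-separated groups of toy sentence vectors.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = cluster_sentences(vectors, n_clusters=2)
print(labels)
```

Sentences that receive the same label are interpreted as expressing the same relation.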
For evaluation, we used an approach similar to the one reported in the paper. The dataset was divided into an 80%-20% test-validation split. The validation set was used for tuning the parameters of the PCA algorithm. Since the values of Ci and Co were given in the paper, we chose the same values in our study. The paper reported an F1 score of 0.416 on the NYT-FB dataset, which the authors claim outperforms the state-of-the-art relation discovery algorithms. Our model, however, only reached an F1 score of 0.343 on the kbp37 dataset used in our study. The difference in performance is large, likely because we used an entirely different dataset.
Although the approach produces strong results on the dataset used in the paper, we can conclude that relationship extraction performance depends on the characteristics of the dataset to which the approach is applied. So, as someone who has applied the approach, I would rate it as an excellent one, with the caveat that a suitable dataset has to be identified for it to work well.
I would appreciate any comments on this, along with specialist input that could help improve the applications of this approach.
We at DataToBiz connect business to data, and implementing academic papers is part of that. By exploring academic papers, we are able to understand the academic viewpoint in addition to that of professionals from industry.