Knowledge Graph and Linked Data
Supporting all of the above is a pan-Thomson Reuters knowledge graph, a dynamic repository of linked information objects. The graph encompasses internal authorities, repositories, metadata, curated data, external data, and derived data – all melded together and semantically interlinked. It follows linked-data principles on a linearly scalable platform. Besides serving “classical” knowledge-graph and linked-data queries, the knowledge graph also provides a shared whiteboard for distributed information interoperability and “information-in-motion” scenarios. It is the fabric at the junction of reactive, near-real-time smart processing of information and big-data analytics.
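At its core, a linked-data store is a collection of subject–predicate–object triples that can be queried by pattern. The sketch below illustrates that idea in plain Python; the entity IDs and predicates are invented for illustration and are not actual Thomson Reuters identifiers.

```python
# Minimal sketch of a linked-data triple store with wildcard queries.
# All IDs and predicates below are hypothetical examples.

class TripleStore:
    """A toy subject-predicate-object store supporting pattern queries."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            (s, p, o)
            for (s, p, o) in self.triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]

store = TripleStore()
store.add("ex:Microsoft", "ex:headquarteredIn", "ex:Redmond")
store.add("ex:Microsoft", "ex:founder", "ex:Bill_Gates")
store.add("ex:Microsoft", "ex:makes", "ex:Windows")

# All facts about one entity:
facts = store.query(subject="ex:Microsoft")
```

In a production graph the same pattern-query idea scales out via RDF stores and SPARQL endpoints; the toy version just makes the data model concrete.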
The easiest way to merge two separate data sets is to use a primary key, a unique ID that allows mapping a record in one data set to the equivalent record in another data set. When such an ID is not in place, matching is reliant on potentially ambiguous data such as a first name or a company name. Thomson Reuters has developed a solution called Concord, a tool for resolving the ambiguities in the data when merging disparate data sets without the benefit of a unique ID.
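When no shared key exists, record linkage typically falls back on similarity scoring over name fields. The sketch below is not Concord itself (whose internals are not described here) but illustrates the underlying idea using Python's standard `difflib`; the normalization rules and threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip common corporate suffixes before comparison."""
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " corp.", " corp", " ltd.", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

def match_records(left, right, threshold=0.85):
    """Greedily pair records from two data sets by name similarity."""
    matches = []
    for a in left:
        best, best_score = None, 0.0
        for b in right:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None and best_score >= threshold:
            matches.append((a, best, round(best_score, 2)))
    return matches

pairs = match_records(
    ["International Business Machines Corp.", "Thomson Reuters Corp"],
    ["International Business Machines", "Thomson Reuters", "Reuters Group"],
)
```

A real resolver would also weigh corroborating fields (address, industry, identifiers) and handle many-to-one mappings, which is where the ambiguity Concord addresses actually lives.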
Events and Facts Identification
After extracting entities from a document, the next task is to understand and extract the facts the text associates with those entities, such as the age of a person, and the relations between entities, such as deals between companies. In some cases we also need to extract attributes of the extracted relations, such as the amount of revenue a company declared, or the name of a product one company supplies to another. Extracting relations is more challenging than extracting entities, as it requires deeper automatic understanding of the underlying text. We combine text-processing tools with machine-learning algorithms, and employ crowdsourcing techniques that give us very large labeled training sets, yielding state-of-the-art classifiers for relations such as mergers and acquisitions and supply chains.
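The shape of a relation-extraction output can be made concrete with a toy example. The pattern-based extractor below is a simplification, not the trained classifiers described above, but it produces the same kind of structured fact – two entities, a relation, and an optional attribute; the patterns and sentence are invented.

```python
import re

# Toy pattern-based extractor for acquisition relations.
# Output shape: {buyer, relation, target, amount} per matched sentence.
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][\w&.]*(?:\s[A-Z][\w&.]*)*)\s+"
    r"(?:acquires|acquired|buys|bought)\s+"
    r"(?P<target>[A-Z][\w&.]*(?:\s[A-Z][\w&.]*)*)"
    r"(?:\s+for\s+(?P<amount>\$[\d.]+\s*(?:billion|million)))?"
)

def extract_acquisitions(text):
    facts = []
    for m in ACQUISITION.finditer(text):
        facts.append({
            "buyer": m.group("buyer"),
            "relation": "acquired",
            "target": m.group("target"),
            "amount": m.group("amount"),  # None when no amount is stated
        })
    return facts

facts = extract_acquisitions(
    "Microsoft acquired LinkedIn for $26.2 billion in 2016."
)
```

Trained classifiers replace the brittle pattern with learned features, but the downstream consumers see the same structured tuples.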
Equipped with the ability to extract metadata at the document level, we move on to extracting knowledge from huge corpora. We build a graph in which every metadata object, whether an entity or a topic, is a node, and every two entities that co-occur in at least one document are connected by an edge. We then assign weights to these edges, and view the entities and topics most strongly related to a given entity as its metadata fingerprint. For example, since Microsoft is frequently mentioned in news stories about operating systems, its fingerprint will include “Windows” and “Operating Systems”, along with “Bill Gates”, “Redmond”, and “Mobile Phones”. The graph can be computed in the financial, scientific, and legal domains, enabling uses such as identifying company peers, locating a reviewer for a scientific paper, or finding an expert witness for a legal case.
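The co-occurrence fingerprint can be sketched in a few lines: entities tagged in the same document are linked, edge weights count shared documents, and an entity's fingerprint is its most strongly connected neighbors. The documents below are invented examples; production weighting is more sophisticated than raw counts.

```python
from collections import Counter
from itertools import combinations

# Each document is represented by the set of entities/topics tagged in it.
documents = [
    {"Microsoft", "Windows", "Operating Systems"},
    {"Microsoft", "Bill Gates", "Redmond"},
    {"Microsoft", "Windows", "Mobile Phones"},
    {"Apple", "Mobile Phones"},
]

# Edge weight = number of documents in which the pair co-occurs.
edges = Counter()
for doc in documents:
    for a, b in combinations(sorted(doc), 2):
        edges[(a, b)] += 1

def fingerprint(entity, top_n=3):
    """Return the entities most strongly co-occurring with `entity`."""
    neighbors = Counter()
    for (a, b), w in edges.items():
        if a == entity:
            neighbors[b] += w
        elif b == entity:
            neighbors[a] += w
    return [name for name, _ in neighbors.most_common(top_n)]

fp = fingerprint("Microsoft")
```

With these toy documents, “Windows” tops Microsoft's fingerprint because it co-occurs in two documents rather than one.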
Connecting unstructured data with the entities it mentions is key to building a full picture of entities such as companies, people, and pharmaceutical drugs. Traditional NLP techniques based on machine learning provide a reasonable baseline for this task. However, understanding the nature of the covered entities, and exploiting the Thomson Reuters structured information that describes them, allow us to achieve higher extraction quality and deliver our solutions faster. Beyond extracting entities, we have a set of algorithms for identifying the key entities in a document and filtering out entities that are essentially irrelevant to the text. Our toolbox includes a proprietary rules-based language and a variety of machine-learning algorithms.
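One simple way to separate key entities from incidental ones is to score each entity by how often and how early it is mentioned. The weighting scheme below is invented for illustration; the production salience algorithms are proprietary.

```python
# Toy salience filter: keep entities mentioned often or near the top
# of the document. Weights and threshold are illustrative only.

def salient_entities(mentions, doc_length, min_score=1.0):
    """`mentions` maps entity -> list of character offsets in the document."""
    scores = {}
    for entity, offsets in mentions.items():
        frequency = len(offsets)
        # Mentions in the first fifth of the document count for more.
        position_bonus = 1.0 if min(offsets) < doc_length * 0.2 else 0.0
        scores[entity] = frequency * 0.5 + position_bonus
    return {e for e, s in scores.items() if s >= min_score}

keep = salient_entities(
    {"Microsoft": [10, 250, 600], "Acme Corp": [580]},
    doc_length=700,
)
```

Here a repeatedly mentioned entity introduced in the lead survives the filter, while an entity mentioned once in passing near the end does not.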
Tagging a document with a list of the topics it discusses is essential to ensuring that professional customers get reliable, clean feeds of exactly the data they need. Our text-categorization engines can train a classification solution on top of any corpus of data with any topic taxonomy describing it. Beyond the standard machine-learning techniques that build a baseline solution, we have built capabilities for correcting and curating the training data, handling topics with few training examples, and preventing embarrassing classification mistakes. We have also developed an algorithm that associates the most relevant Wikipedia categories with input text, allowing our customers to benefit from the dynamics of the Wikipedia taxonomy.
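The baseline for such a categorizer can be as simple as a bag-of-words profile per topic. The nearest-profile classifier below is a stand-in for the statistical engines described above; the topics and training sentences are invented.

```python
from collections import Counter

# Minimal bag-of-words topic classifier: one word-frequency profile per
# topic, classification by weighted word overlap. Illustrative only.

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """Build one word-frequency profile per topic label."""
    profiles = {}
    for topic, text in labeled_docs:
        profiles.setdefault(topic, Counter()).update(tokenize(text))
    return profiles

def classify(profiles, text):
    """Assign the topic whose profile shares the most weighted words."""
    words = Counter(tokenize(text))
    def overlap(profile):
        return sum(count * profile[w] for w, count in words.items())
    return max(profiles, key=lambda t: overlap(profiles[t]))

profiles = train([
    ("mergers", "company acquires rival merger deal shareholders"),
    ("technology", "software operating system cloud platform release"),
])
topic = classify(profiles, "the merger deal gives shareholders a premium")
```

Real engines add TF-IDF weighting, per-topic thresholds, and curation loops on top of this skeleton, which is where the few-example and mistake-prevention work happens.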
TMS employs a combination of machine-learning and rule-based techniques. When manually labeled training data is available, we use supervised machine-learning techniques such as logistic regression or SVMs. In other cases we use unsupervised techniques, and, in combination with an advanced rule-based extraction mechanism, we devise semi-supervised techniques that bootstrap from the rules and use the topology of the data to complete the NLP task. Some of our algorithms work at the document level, while others extract information and enable question answering by looking across a large corpus of documents. The breadth of our technologies allows us to tackle a wide range of use cases that reflect the business diversity and innovation of Thomson Reuters.
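The bootstrap-from-rules idea can be sketched as a two-pass procedure: a high-precision rule seeds labels, and a simple learner generalizes from the seed set to unlabeled examples. The rule, threshold, and sentences below are invented examples, not the production mechanism.

```python
import re

# Semi-supervised bootstrapping sketch: seed labels come from a rule,
# then vocabulary overlap with the seed set propagates them.

STOPWORDS = {"the", "a", "of", "by"}

def words(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS

def rule_label(sentence):
    """High-precision seed rule: explicit acquisition wording."""
    if "acquired" in sentence or "acquisition of" in sentence:
        return "M&A"
    return None

def bootstrap(sentences, threshold=3):
    # Pass 1: label what the rule can.
    labeled = {s: rule_label(s) for s in sentences}
    # Vocabulary learned from the rule-labeled seed set.
    seed_vocab = set()
    for s, label in labeled.items():
        if label == "M&A":
            seed_vocab |= words(s)
    # Pass 2: propagate to sentences the rule could not label.
    for s, label in labeled.items():
        if label is None:
            hit = len(words(s) & seed_vocab) >= threshold
            labeled[s] = "M&A" if hit else "other"
    return labeled

labels = bootstrap([
    "Contoso announced the acquisition of Fabrikam.",
    "Contoso announced a new phone.",
    "Fabrikam shares rose after Contoso announced the deal.",
])
```

The third sentence never matches the rule, yet it inherits the label through shared vocabulary with the seed sentence – the essence of using the topology of the data to go beyond what the rules cover.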