Getting Started

Accern's NoCodeNLP platform allows users to create workflows that extract and generate structured data from textual data. Below we present the terminology used to designate each of the components of the output data.

Document

We designate any individual textual information, such as news articles, tweets, financial documents, blog posts, etc., as a document.

Document Cluster (group)

A document cluster is a group of contextually similar documents. A new document cluster is formed when our document clustering algorithm cannot find a related document cluster for a document. A document cluster can contain multiple documents. However, a document can be part of only one document cluster.

Signal

We define signals as parts of a text that provide similar information about an entity, an event, or both. A document can contain multiple signals.

Metadata

We define any information extracted directly from a document as its metadata. Metadata includes information like hyperlinks, publish time, etc.

Analytics

Analytics generated by Accern's analytics engines from the metadata are classified as derived analytics. Below we present the list of analytics present in the output data.


Output Analytics

signal_id

Definition: Uniquely identifiable ID that is randomly generated.

Process: Each signal is a theme extracted from the document. While creating a signal, a random unique identifier function is used to create the ID that is used as a string value.

  • Data Type: String (Unique ID)

  • Value Range: N/A

  • Analytics Type: Derived
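As an illustration of a randomly generated string ID, here is a minimal sketch. It assumes a version 4 UUID; the source only says that a random unique identifier function is used, so the actual function and ID format may differ. The name new_signal_id is hypothetical.

```python
import uuid


def new_signal_id() -> str:
    """Return a randomly generated unique ID as a string.

    Assumption: a UUID4 stands in for the unspecified random ID function.
    """
    return str(uuid.uuid4())


signal_id = new_signal_id()
print(signal_id)
```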


doc_id

Definition: Unique ID assigned to the document.

Process: For every document that is processed (news, blog, etc.), we have a unique ID that helps us identify specific articles. One document can have multiple signals associated with it.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


doc_source

Definition: The source of the document.

Process: We look up a parent source of information. If the domain information is not available, the doc_source is extracted from the unique URL of the article. For the local database, the doc_source information is either set to 'custom' or any other category value provided to us.

  • Data Type: String (Categorical)

  • Value Range: N/A

  • Analytics Type: Metadata
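The lookup-then-fallback behavior described above can be sketched as follows. This is an illustration only: the parent_sources mapping, the function name, and the choice to strip a leading "www." are assumptions, not Accern's actual implementation.

```python
from urllib.parse import urlparse


def resolve_doc_source(doc_url: str, parent_sources: dict) -> str:
    """Return the parent source if one is known, else the domain taken from the URL."""
    host = urlparse(doc_url).hostname or ""
    # Assumption: a leading "www." is not part of the source name.
    if host.startswith("www."):
        host = host[4:]
    # Fall back to the bare domain when no parent source is on record.
    return parent_sources.get(host, host)


# Hypothetical parent-source table, for illustration only.
known = {"feeds.example.com": "Example News"}
print(resolve_doc_source("https://feeds.example.com/rss/1", known))
print(resolve_doc_source("https://www.example.org/post/42", known))
```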


doc_type

Definition: Classifies document based on where the article/document is published and its mode of access.

Process: To determine the doc_type, we first look at the mode through which we found the document. For instance, if a document is found in one of the RSS feeds, the doc_type is set to 'Feed'. If it was accessed through a premium news feed API, the value is set to 'Premium News'; if it was accessed via the SEC's EDGAR database, the value is set to 'SEC Filing'; and so on. Next, we use the doc_source information to classify whether the document is news or blog. Accern actively maintains a mapping from doc_source to doc_type (for example, news or blog). When a new source is encountered for which there is no historical data or news/blog mapping, the default value is set to 'blog'. The Accern team then reviews these sources, and the mapping files are updated periodically.

  • Data Type: String (Categorical)

  • Value Range: [news, blog, dowjones, ache, custom]

  • Analytics Type: Metadata
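The source-mapping step with its 'blog' default can be sketched as below. The mapping contents, the review set, and the function name are hypothetical; only the default value and the periodic review come from the description above.

```python
def doc_type_for_source(doc_source: str, source_to_type: dict, needs_review: set) -> str:
    """Look up doc_type for a source; default to 'blog' and flag unmapped sources."""
    doc_type = source_to_type.get(doc_source)
    if doc_type is None:
        # Unmapped source: remember it so the mapping can be reviewed later.
        needs_review.add(doc_source)
        return "blog"
    return doc_type


# Hypothetical mapping, for illustration only.
mapping = {"example-news.com": "news"}
pending = set()
print(doc_type_for_source("example-news.com", mapping, pending))
print(doc_type_for_source("brand-new-site.example", mapping, pending))
print(pending)
```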


doc_content

Definition: Content of the original document. This field is only available in particular scenarios.

Process: The original text of the document as scraped/extracted from the source.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


harvested_at

Definition: The time at which the original document was crawled.

Process: This timestamp is recorded when the document is crawled.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


published_at

Definition: The time when the document was published.

Process: If the published date is extractable from the document, it’s used; otherwise, it’s the timestamp of when the document was crawled or provided to Accern.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata
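The fallback rule for published_at can be sketched as follows. The function and parameter names are hypothetical, and timestamps are assumed to be strings, matching the String data type listed above.

```python
from typing import Optional


def resolve_published_at(extracted_published_at: Optional[str], received_at: str) -> str:
    """Prefer the date extracted from the document; otherwise use the crawl/receipt time.

    A missing date may arrive as None or as an empty string; both fall back.
    """
    return extracted_published_at or received_at


print(resolve_published_at("2021-03-01T09:30:00Z", "2021-03-01T10:00:00Z"))
print(resolve_published_at(None, "2021-03-01T10:00:00Z"))
```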


provider_id

Definition: ID of the data provider.

Process: N/A

  • Data Type: Int

  • Value Range: Positive integers

  • Analytics Type: Metadata


doc_title

Definition: The title of the original document.

Process: N/A

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


doc_url

Definition: Online URL for the original document.

Process: The original URL of the document as extracted from the source.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


provider_doc_id

Definition: Document ID given by a third-party data provider.

Process: ID of the document that was assigned by the original source.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Metadata


doc_sentiment

Definition: The average sentiment of all entities and events in a document.

Process: The simple average of the signal_sentiment values of all signals in the document. Each signal_sentiment is itself the overall sentiment of its signal.

  • Data Type: Double

  • Value Range: [-100, 100]

  • Analytics Type: Derived
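A minimal sketch of this averaging, assuming the signal_sentiment values of one document are available as a list of numbers. The function name is hypothetical, and the handling of an empty list is an assumption.

```python
from statistics import fmean


def doc_sentiment(signal_sentiments: list) -> float:
    """Simple (unweighted) average of a document's signal_sentiment values."""
    if not signal_sentiments:
        # Assumption: a document with no signals has no defined sentiment.
        raise ValueError("document has no signals")
    return fmean(signal_sentiments)


print(doc_sentiment([50.0, -10.0, 20.0]))  # 20.0
```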


doc_cluster_id

Definition: Unique identifier of the cluster to which a given article belongs. By tracking a doc_cluster_id, a user can trace how a story evolved across different articles.

Process: We group similar documents into clusters based on events and entities discussed in the document.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Derived
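To trace a cluster over time, a consumer of the output might group rows by doc_cluster_id and order the documents by published_at. The sketch below assumes each output row is a dict, that rows may repeat a document (one row per signal), and that published_at strings share one sortable format such as ISO 8601; none of this is confirmed by the source.

```python
from collections import defaultdict


def docs_by_cluster(rows: list) -> dict:
    """Map each doc_cluster_id to its distinct doc_ids, ordered by published_at."""
    seen = defaultdict(dict)
    for row in rows:
        # Several signals can share a doc_id; keep one entry per document.
        seen[row["doc_cluster_id"]][row["doc_id"]] = row["published_at"]
    return {cid: sorted(docs, key=docs.get) for cid, docs in seen.items()}


rows = [
    {"doc_cluster_id": "c1", "doc_id": "d2", "published_at": "2021-01-02T00:00:00Z"},
    {"doc_cluster_id": "c1", "doc_id": "d1", "published_at": "2021-01-01T00:00:00Z"},
    {"doc_cluster_id": "c2", "doc_id": "d3", "published_at": "2021-01-03T00:00:00Z"},
]
print(docs_by_cluster(rows))
```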


signal_tag

Definition: It is a theme identifier.

Process: It is formed by the combination of the entity_accern_id and event_accern_id.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Derived


signal_relevance

Definition: Overall relevance of the signal.

Process: It is calculated as the average of entity and event relevance. The entity and event mentioned are the ones present in the signal itself.

  • Data Type: Double

  • Value Range: [0, 100]

  • Analytics Type: Derived


signal_sentiment

Definition: Overall sentiment of the signal.

Process: It is calculated as the average of entity and event sentiment. The entity and event mentioned are the ones present in the signal itself.

  • Data Type: Double

  • Value Range: [-100, 100]

  • Analytics Type: Derived
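Both signal_relevance and signal_sentiment are described as the average of the entity-level and event-level values. A minimal sketch, with a hypothetical function name:

```python
def signal_score(entity_score: float, event_score: float) -> float:
    """Average of the entity-level and event-level score of one signal."""
    return (entity_score + event_score) / 2.0


# signal_relevance from entity_relevance and event_relevance
print(signal_score(80.0, 60.0))   # 70.0
# signal_sentiment from entity_sentiment and event_sentiment
print(signal_score(-40.0, 10.0))  # -15.0
```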


primary_signal

Definition: A boolean indicator for whether a signal is one of the most relevant signals of the document or not.

Process: Max signal relevance is calculated for each document. The signal(s) with the signal relevance equal to max signal relevance are classified as primary signals.

  • Data Type: Boolean

  • Value Range: [True, False]

  • Analytics Type: Derived
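The max-relevance rule can be sketched as follows, assuming each signal is a dict carrying doc_id and signal_relevance. The function name is hypothetical. Ties are all marked as primary, as the description above implies.

```python
from collections import defaultdict


def mark_primary_signals(signals: list) -> list:
    """Return copies of the signal rows with a boolean primary_signal field."""
    max_by_doc = defaultdict(lambda: float("-inf"))
    for s in signals:
        max_by_doc[s["doc_id"]] = max(max_by_doc[s["doc_id"]], s["signal_relevance"])
    # A signal is primary when its relevance equals its document's maximum.
    return [
        {**s, "primary_signal": s["signal_relevance"] == max_by_doc[s["doc_id"]]}
        for s in signals
    ]


signals = [
    {"doc_id": "d1", "signal_id": "a", "signal_relevance": 90.0},
    {"doc_id": "d1", "signal_id": "b", "signal_relevance": 90.0},
    {"doc_id": "d1", "signal_id": "c", "signal_relevance": 40.0},
    {"doc_id": "d2", "signal_id": "d", "signal_relevance": 10.0},
]
for row in mark_primary_signals(signals):
    print(row["signal_id"], row["primary_signal"])
```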


entity_accern_id

Definition: Accern ID of the entity.

Process: Entities are extracted by Accern's proprietary entity extraction models. Once an entity is extracted from a theme, we retrieve its Accern ID available in our databases and update the entity_accern_id value.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Derived


entity_relevance

Definition: Scores an entity based on the emphasis it receives in the document.

Process: To determine the relevance of an entity, we consider two factors: a) the number of times the entity is mentioned in the article, and b) the positions within the text where the entity is mentioned. We then combine these two factors into a single relevance score. Entities that are mentioned frequently and appear earlier in the document receive higher relevance scores than entities that are mentioned fewer times and mostly in the later sections of the document. Note that a document may have multiple highly relevant entities. Conversely, we reject any document that does not contain any relevant entity.

  • Data Type: Double

  • Value Range: [0, 100]

  • Analytics Type: Derived
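The actual scoring formula is not published. Purely to illustrate how mention frequency and mention position could be combined into one score in [0, 100], here is a toy sketch. The equal weighting, the normalization by max_mentions, the use of the mean mention position, and all names are assumptions and do not describe Accern's model.

```python
def toy_relevance(mention_positions: list, doc_length: int, max_mentions: int) -> float:
    """Toy relevance in [0, 100] from mention count and mention positions.

    mention_positions: word offsets (0-based) where the item is mentioned.
    doc_length: number of words in the document.
    max_mentions: mention count of the most-mentioned item in the document.
    """
    if not mention_positions or doc_length <= 0 or max_mentions <= 0:
        return 0.0
    # Frequency factor: share of the highest mention count, capped at 1.
    frequency = min(len(mention_positions) / max_mentions, 1.0)
    # Position factor: 1.0 at the start of the document, near 0.0 at the end.
    mean_position = sum(mention_positions) / len(mention_positions)
    earliness = 1.0 - mean_position / doc_length
    # Assumption: both factors carry equal weight.
    return 100.0 * (0.5 * frequency + 0.5 * earliness)


early_and_frequent = toy_relevance([0, 10, 20], doc_length=100, max_mentions=3)
late_and_rare = toy_relevance([90], doc_length=100, max_mentions=3)
print(early_and_frequent > late_and_rare)  # True
```

The same shape of calculation would apply to event_relevance, which is described in the same terms.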


entity_sentiment

Definition: This is the sentiment value calculated for each entity based on the text surrounding it.

Process: We calculate the sentiment using Accern’s proprietary sentiment analysis models.

  • Data Type: Double

  • Value Range: [-100, 100]

  • Analytics Type: Derived


entity_source

Definition: Identifies which knowledge graph was used for this entity (primary or custom).

Process: Entities in our databases are classified as ‘primary’ and the ones added by the client are classified as ‘custom.’

  • Data Type: String

  • Value Range: [“primary”, “custom”]

  • Analytics Type: Derived


entity_ticker

Definition: It is the traded ticker symbol of the extracted entity.

Process: Entities are extracted by Accern's proprietary entity extraction models. Once an entity is extracted from a theme, we retrieve its traded symbol available in our databases and update the entity_ticker value. Our ticker symbol database gets updated every night with information such as initial public offerings (IPOs), symbol changes, sector/industry updates, entity name updates etc.

  • Data Type: String (Categorical)

  • Value Range: Symbols for global equities, commodities, forex, and cryptocurrencies

  • Analytics Type: Derived


entity_exch_code

Definition: It is the stock exchange code where the entity is traded.

Process: We refer to the 'Entities' database and update the entity_exch_code field.

  • Data Type: String (Categorical)

  • Value Range: All Global Exchanges

  • Analytics Type: Metadata


entity_name

Definition: It is the name of the company as it is registered on the stock exchange.

Process: Accern has access to the list of all companies traded on each stock exchange. We actively maintain this 'Entities' database with important corporate events that may affect an entity's name, sector, stock ticker, etc.

  • Data Type: String (Categorical)

  • Value Range: All Global Equities, Commodities, Cryptocurrencies, and Forex

  • Analytics Type: Metadata


entity_type

Definition: It is the type of entity, such as public equity, commodity, cryptocurrency, etc.

Process: We refer to the 'Entities' database and update the entity_type field.

  • Data Type: String (Categorical)

  • Value Range: [US_EQUITY, INTERNATIONAL_EQUITY, FOREX, COMMODITY, CRYPTOCURRENCY]

  • Analytics Type: Metadata


entity_indices

Definition: A list of popular indices where the entity is a constituent.

Process: We refer to the 'Entities' database and update the entity_indices field.

  • Data Type: Array of Strings

  • Value Range: [US_EQUITY, INTERNATIONAL_EQUITY, FOREX, COMMODITY, CRYPTOCURRENCY]

  • Analytics Type: Metadata


entity_figi

Definition: FIGI code of the entity (asset class).

Process: We refer to the 'Entities' database and update the entity_figi field.

  • Data Type: String

  • Value Range: Please see openfigi.com

  • Analytics Type: Metadata


entity_country

Definition: It is the parent country of the entity.

Process: We refer to the 'Entities' database and update the entity_country field.

  • Data Type: String

  • Value Range: Global

  • Analytics Type: Metadata


entity_share_class

Definition: Share class FIGI code for an entity (asset class).

Process: We refer to the 'Entities' database and update the entity_share_class field.

  • Data Type: String

  • Value Range: Please see openfigi.com

  • Analytics Type: Metadata


entity_region

Definition: It is the region where the entity is traded.

Process: We refer to the 'Entities' database and update the entity_region field.

  • Data Type: String (Categorical)

  • Value Range: All major regions

  • Analytics Type: Metadata


entity_sector

Definition: It is the sector the entity belongs to.

Process: We refer to the 'Entities' database and update the entity_sector field.

  • Data Type: String (Categorical)

  • Value Range: All major sectors

  • Analytics Type: Metadata


entity_hits

Definition: Hit word(s) of the entity.

Process: A list of words is generated for the entity hits by the Accern proprietary API. A distinct list of hits is then extracted for the entity_hits field.

  • Data Type: Array of Strings

  • Value Range: All Global Equities, Commodities, Cryptocurrencies, and Forex

  • Analytics Type: Derived
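Reducing a list of hit words to a distinct list can be sketched as below. Keeping the first-seen order and treating hits as case-sensitive are assumptions; the source does not specify either. The function name is hypothetical.

```python
def distinct_hits(hits: list) -> list:
    """Remove duplicate hit words while keeping first-seen order."""
    # dict preserves insertion order, so this de-duplicates without reordering.
    return list(dict.fromkeys(hits))


print(distinct_hits(["Apple", "AAPL", "Apple", "Apple Inc."]))
# ['Apple', 'AAPL', 'Apple Inc.']
```

The same reduction would apply to event_hits.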


entity_text

Definition: Text surrounding the tagged entity.

Process: Accern’s proprietary API recognizes the relevant text surrounding the tagged entity in order to update the entity_text field.

  • Data Type: Array of Strings

  • Value Range: ~[1, 17] words

  • Analytics Type: Derived


entity_attributes

Definition: Additional information associated with the entity.

Process: We refer to the 'Entities' database and update the entity_attributes field.

  • Data Type: A map of string (key) and type Any (value)

  • Value Range: N/A

  • Analytics Type: Metadata


event_accern_id

Definition: Accern ID of the event.

Process: Events are extracted by Accern's proprietary event extraction models. Once an event is extracted from a theme, we retrieve its Accern ID available in our databases and update the event_accern_id value.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Derived


event_relevance

Definition: Scores an event based on the emphasis with which it is mentioned in a document.

Process: To determine the relevance of an event, we consider two factors: a) the number of times the event is mentioned in the article, and b) the positions within the text where the event is mentioned. We then combine these two factors into a single relevance score. Events that are mentioned frequently and appear earlier in the document receive higher relevance scores than events that are mentioned fewer times and mostly in the later sections of the document. Note that there may be multiple highly relevant events in a document. Conversely, we reject any document that does not contain any relevant event.

  • Data Type: Double

  • Value Range: [0, 100]

  • Analytics Type: Derived


event_sentiment

Definition: This is the sentiment value calculated for each event based on the text surrounding it.

Process: We calculate the sentiment using Accern’s proprietary sentiment analysis models.

  • Data Type: Double

  • Value Range: [-100, 100]

  • Analytics Type: Derived


event_group

Definition: Event groups are the broader financial events category that contains multiple related events.

Process: Accern has developed a financial event tree that contains 25+ financial event groups, 250+ financial events, and over a million financial phrases. A financial event can be part of only one event_group, whereas each event_group can contain multiple financial events. Once a financial event is extracted by the event extraction model, we search for its parent group in our database and update the event_group field.

  • Data Type: String (Categorical)

  • Value Range: 25+ Unique Financial Event Groups

  • Analytics Type: Derived
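The one-group-per-event relationship can be modeled as a mapping from event name to group name. The sketch below uses placeholder names; the real event tree, its names, and its storage are not shown in this article.

```python
from collections import defaultdict
from typing import Optional

# Placeholder event tree: each event maps to exactly one group.
EVENT_TO_GROUP = {
    "Hypothetical Event A": "Hypothetical Group 1",
    "Hypothetical Event B": "Hypothetical Group 1",
    "Hypothetical Event C": "Hypothetical Group 2",
}


def event_group_for(event_name: str, event_to_group: dict) -> Optional[str]:
    """Return the parent group of an event, or None if the event is unknown."""
    return event_to_group.get(event_name)


def events_by_group(event_to_group: dict) -> dict:
    """Invert the mapping: each group can contain multiple events."""
    groups = defaultdict(list)
    for event, group in event_to_group.items():
        groups[group].append(event)
    return dict(groups)


print(event_group_for("Hypothetical Event B", EVENT_TO_GROUP))
print(events_by_group(EVENT_TO_GROUP))
```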


event_name

Definition: Financial events extracted from the stories.

Process: We actively maintain an 'Events' database that contains important corporate events. Each signal contains a unique financial event for a specific company.

  • Data Type: String

  • Value Range: N/A

  • Analytics Type: Derived


event_hits

Definition: Text (words/phrases) as the event was found in the document.

Process: A list of words is generated for the event hits by the Accern proprietary API. A distinct list of hits is then extracted for the event_hits field.

  • Data Type: Array of Strings

  • Value Range: All events from Accern’s “Events” database

  • Analytics Type: Derived


event_text

Definition: Text surrounding the tagged event.

Process: Accern’s proprietary API recognizes the relevant text surrounding the tagged event in order to update the event_text field.

  • Data Type: Array of Strings

  • Value Range: ~[1, 17] words

  • Analytics Type: Derived


event_attributes

Definition: Additional information associated with the event.

Process: We refer to the 'Events' database and update the event_attributes field.

  • Data Type: A map of string (key) and type Any (value)

  • Value Range: N/A

  • Analytics Type: Metadata


