Getting Started
Accern's NoCodeNLP platform allows users to create workflows that extract and generate structured data from textual data. Below we present the terminology used to designate each of the components of the output data.
Document
We designate any individual piece of textual information, such as a news article, tweet, financial document, or blog post, as a document.
Document Cluster (group)
A document cluster is a group of contextually similar documents. A new document cluster is formed when our document clustering algorithm cannot find a related document cluster for a document. A document cluster can contain multiple documents. However, a document can be part of only one document cluster.
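The assignment rule above can be sketched as follows. This is a minimal illustration, not Accern's clustering algorithm: the `assign_cluster` function, the `is_related` callback, and the `cluster-N` ID format are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def assign_cluster(
    doc: str,
    clusters: Dict[str, List[str]],
    is_related: Callable[[str, List[str]], bool],
) -> str:
    """Put doc into exactly one cluster and return that cluster's ID."""
    # Join the first existing cluster that the (hypothetical) relatedness
    # check accepts; a document belongs to only one cluster.
    for cluster_id, members in clusters.items():
        if is_related(doc, members):
            members.append(doc)
            return cluster_id
    # No related cluster was found, so a new cluster is formed.
    new_id = f"cluster-{len(clusters) + 1}"  # hypothetical ID format
    clusters[new_id] = [doc]
    return new_id
```

Any relatedness test can be plugged in through `is_related`; the sketch only shows the "join an existing cluster, else create a new one" control flow.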
Signal
We define signals as parts of a text that provide similar information about an entity, an event, or both. A document can contain multiple signals.
Metadata
We define any information extracted directly from a document as its metadata. Metadata includes information like hyperlinks, publish time, etc.
Analytics
Analytics that are generated by Accern's analytics engines from the metadata are classified as derived analytics. Below we present the list of analytics present in the output data.
Output Analytics
signal_id
Definition: A randomly generated unique ID for the signal.
Process: Each signal is a theme extracted from the document. When a signal is created, a random unique identifier function generates the ID, which is stored as a string value.
Data Type: String (Unique ID)
Value Range: N/A
Analytics Type: Derived
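As an illustration of generating a random unique string ID, the sketch below uses Python's standard `uuid` module. The use of version 4 UUIDs is an assumption made for this example; the actual identifier function and ID format are not specified here.

```python
import uuid


def new_signal_id() -> str:
    """Return a random unique identifier as a string."""
    # uuid4 is an assumption; any random unique-ID function would fit
    # the description above.
    return str(uuid.uuid4())
```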
doc_id
Definition: Unique ID assigned to the document.
Process: For every document that is processed (news, blog, etc.), we have a unique ID that helps us identify specific articles. One document can have multiple signals associated with it.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
doc_source
Definition: The source of the document.
Process: We look up a parent source of information. If the domain information is not available, the doc_source is extracted from the unique URL of the article. For the local database, the doc_source information is either set to 'custom' or any other category value provided to us.
Data Type: String (Categorical)
Value Range: N/A
Analytics Type: Metadata
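The fallback order described for doc_source might look like the sketch below. The function name, its parameters, and the choice to strip a leading "www." from the host name are assumptions made for illustration only.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_doc_source(
    parent_source: Optional[str],
    doc_url: Optional[str],
    local: bool = False,
    category: Optional[str] = None,
) -> Optional[str]:
    """Pick a doc_source value following the fallback order described above."""
    if local:
        # Local database: 'custom' unless another category value was provided.
        return category or "custom"
    if parent_source:
        # The parent source lookup succeeded.
        return parent_source
    if doc_url:
        # Fall back to the host name taken from the article URL.
        host = urlparse(doc_url).hostname
        if host and host.startswith("www."):
            host = host[4:]  # assumption: drop the "www." prefix
        return host
    return None
```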
doc_type
Definition: Classifies the document based on where the article/document is published and its mode of access.
Process: To determine the doc_type, we first look at the mode through which we found the document. For instance, if a document is found in one of the RSS feeds, the doc_type is set to 'Feed'. If it was accessed through a premium news feed API, the value is set to 'Premium News'; if it was accessed via the SEC's EDGAR database, the value is set to 'SEC Filing'; and so on. Next, we use the doc_source information to classify whether the document is news or a blog. Accern actively maintains a mapping from doc_source to doc_type (news or blog). When a new source is encountered for which there is no historical data or news/blog mapping, the default value is set to 'blog'. The Accern team then reviews these sources, and the mapping files are updated periodically.
Data Type: String (Categorical)
Value Range: [news, blog, dowjones, ache, custom]
Analytics Type: Metadata
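The source-mapping step, with its 'blog' default for unknown sources, can be sketched as below. The `classify_source` function, the mapping contents, and the `review_queue` set are hypothetical and only illustrate the lookup-with-default behavior; the access-mode step is not modeled.

```python
from typing import Dict, Set


def classify_source(
    doc_source: str,
    source_to_type: Dict[str, str],
    review_queue: Set[str],
) -> str:
    """Look up news/blog for a source, defaulting to 'blog' when unmapped."""
    doc_type = source_to_type.get(doc_source)
    if doc_type is None:
        # Unknown source: remember it for later review and use the default.
        review_queue.add(doc_source)
        return "blog"
    return doc_type
```

Sources collected in `review_queue` stand in for the periodic manual review that updates the mapping files.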
doc_content
Definition: Content of the original document. This field is only available in particular scenarios.
Process: The original text of the document as scraped/extracted from the source.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
harvested_at
Definition: The time at which the original document was crawled.
Process: This timestamp is generated when the document is crawled.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
published_at
Definition: The time when the document was published.
Process: If the published date is extractable from the document, it’s used; otherwise, it’s the timestamp of when the document was crawled or provided to Accern.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
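The fallback rule for published_at amounts to the small sketch below. The function and parameter names are hypothetical, and the ISO 8601 strings used in the usage note are an assumption about the timestamp format.

```python
from typing import Optional


def resolve_published_at(
    extracted_published_at: Optional[str],
    received_at: str,
) -> str:
    """Prefer the date extracted from the document, else the crawl/receipt time."""
    if extracted_published_at:
        return extracted_published_at
    return received_at
```

For example, `resolve_published_at(None, "2021-03-01T10:00:00Z")` returns the crawl time because no published date could be extracted.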
provider_id
Definition: ID of the data provider.
Process: N/A
Data Type: Int
Value Range: Positive integers
Analytics Type: Metadata
doc_title
Definition: The title of the original document.
Process: N/A
Data Type: String
Value Range: N/A
Analytics Type: Metadata
doc_url
Definition: Online URL for the original document.
Process: The original URL of the document as extracted from the source.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
provider_doc_id
Definition: Document ID given by a third-party data provider.
Process: ID of the document that was assigned by the original source.
Data Type: String
Value Range: N/A
Analytics Type: Metadata
doc_sentiment
Definition: The average sentiment of all entities and events in a document.
Process: A simple average of all signal_sentiment values in a document, each of which is the overall sentiment of one signal.
Data Type: Double
Value Range: [-100, 100]
Analytics Type: Derived
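The simple average described for doc_sentiment can be written as the sketch below. The function name is hypothetical, and raising an error for a document with no signals is an assumption made for this example.

```python
from typing import Sequence


def doc_sentiment(signal_sentiments: Sequence[float]) -> float:
    """Average the signal_sentiment values of all signals in one document."""
    if not signal_sentiments:
        # Assumption: a document without signals has no doc_sentiment.
        raise ValueError("document has no signals")
    return sum(signal_sentiments) / len(signal_sentiments)
```

Because every signal_sentiment lies in [-100, 100], the average also stays within that range.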
doc_cluster_id
Definition: It is a unique identifier for the cluster to which a given article belongs. By tracking a doc_cluster_id, a user can trace how different articles evolved.
Process: We group similar documents into clusters based on events and entities discussed in the document.
Data Type: String
Value Range: N/A
Analytics Type: Derived
signal_tag
Definition: It is a theme identifier.
Process: It is formed by the combination of the entity_accern_id and event_accern_id.
Data Type: String
Value Range: N/A
Analytics Type: Derived
signal_relevance
Definition: Overall relevance of the signal.
Process: It is calculated as the average of entity and event relevance. The entity and event mentioned are the ones present in the signal itself.
Data Type: Double
Value Range: [0, 100]
Analytics Type: Derived
signal_sentiment
Definition: Overall sentiment of the signal.
Process: It is calculated as the average of entity and event sentiment. The entity and event mentioned are the ones present in the signal itself.
Data Type: Double
Value Range: [-100, 100]
Analytics Type: Derived
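Both signal_relevance and signal_sentiment are described as the average of the corresponding entity and event values. A minimal sketch, with hypothetical function names:

```python
def signal_relevance(entity_relevance: float, event_relevance: float) -> float:
    """Average of the relevance of the signal's entity and event."""
    return (entity_relevance + event_relevance) / 2


def signal_sentiment(entity_sentiment: float, event_sentiment: float) -> float:
    """Average of the sentiment of the signal's entity and event."""
    return (entity_sentiment + event_sentiment) / 2
```

For instance, an entity relevance of 80 and an event relevance of 60 give a signal relevance of 70.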
primary_signal
Definition: A boolean indicator for whether a signal is one of the most relevant signals of the document or not.
Process: Max signal relevance is calculated for each document. The signal(s) with the signal relevance equal to max signal relevance are classified as primary signals.
Data Type: Boolean
Value Range: [True, False]
Analytics Type: Derived
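The primary-signal rule can be sketched as follows. The dictionary-based record layout and the `flag_primary_signals` name are assumptions made for this example; only the field names come from this page.

```python
from typing import Any, Dict, List


def flag_primary_signals(signals: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of one document's signals with primary_signal set."""
    if not signals:
        return []
    # Max signal relevance for the document.
    max_relevance = max(s["signal_relevance"] for s in signals)
    # Every signal that ties for the maximum is a primary signal.
    return [
        {**s, "primary_signal": s["signal_relevance"] == max_relevance}
        for s in signals
    ]
```

Note that ties are possible, so a document can have more than one primary signal.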
entity_accern_id
Definition: Accern ID of the entity.
Process: Entities are extracted by Accern's proprietary entity extraction models. Once an entity is extracted from a theme, we retrieve its Accern ID available in our databases and update the entity_accern_id value.
Data Type: String
Value Range: N/A
Analytics Type: Derived
entity_relevance
Definition: Scores an entity based on the emphasis it receives in the document.
Process: To determine the relevance of an entity, we consider the following two factors: a) the number of times the entity is mentioned in the article, and b) the positions within the text where the entity is mentioned. We then combine these two factors into a single relevance score. Entities that are mentioned frequently and appear earlier in the document receive higher relevance scores than entities that are mentioned relatively fewer times and mostly in the later sections of the document. It is important to note that a document may have multiple highly relevant entities. Conversely, we reject any document that does not contain any relevant entity.
Data Type: Double
Value Range: [0, 100]
Analytics Type: Derived
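To make the two factors concrete, the sketch below combines mention frequency and mention position into a score in [0, 100]. It is purely illustrative and is not Accern's scoring model: the equal weighting, the `max_mentions` cap, and the linear position decay are all assumptions chosen for this example.

```python
from typing import Sequence


def relevance_score(
    mention_positions: Sequence[int],
    doc_length: int,
    max_mentions: int = 10,
) -> float:
    """Illustrative relevance in [0, 100] from mention count and positions.

    mention_positions holds 0-based word offsets smaller than doc_length.
    """
    if not mention_positions or doc_length <= 0:
        return 0.0
    # Factor a): more mentions score higher, capped at max_mentions.
    frequency = min(len(mention_positions), max_mentions) / max_mentions
    # Factor b): mentions near the start of the document score higher.
    earliness = sum(1 - p / doc_length for p in mention_positions) / len(
        mention_positions
    )
    # Assumption: both factors are weighted equally.
    return 100 * (0.5 * frequency + 0.5 * earliness)
```

With these assumptions, an entity mentioned four times near the start of a 100-word document scores higher than one mentioned once near the end.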
entity_sentiment
Definition: This is the sentiment value calculated for each entity based on the text surrounding it.
Process: We calculate the sentiment using Accern’s proprietary sentiment analysis models.
Data Type: Double
Value Range: [-100, 100]
Analytics Type: Derived
entity_source
Definition: Identifies which knowledge graph was used for this entity (primary or custom).
Process: Entities in our databases are classified as ‘primary’ and the ones added by the client are classified as ‘custom.’
Data Type: String
Value Range: [“primary”, “custom”]
Analytics Type: Derived
entity_ticker
Definition: It is the traded ticker symbol of the extracted entity.
Process: Entities are extracted by Accern's proprietary entity extraction models. Once an entity is extracted from a theme, we retrieve its traded symbol available in our databases and update the entity_ticker value. Our ticker symbol database is updated every night with information such as initial public offerings (IPOs), symbol changes, sector/industry updates, entity name updates, etc.
Data Type: String (Categorical)
Value Range: Symbols for global equities, commodities, forex, and cryptocurrencies
Analytics Type: Derived
entity_exch_code
Definition: It is the stock exchange code where the entity is traded.
Process: We refer to the 'Entities' database and update the entity_exch_code field.
Data Type: String (Categorical)
Value Range: All Global Exchanges
Analytics Type: Metadata
entity_name
Definition: It is the name of the company as it is registered on the stock exchange.
Process: Accern has access to the list of all companies traded on each stock exchange. We actively maintain this 'Entities' database with important corporate events that may affect an entity's name, sector, stock ticker, etc.
Data Type: String (Categorical)
Value Range: All Global Equities, Commodities, Cryptocurrencies, and Forex
Analytics Type: Metadata
entity_type
Definition: It is the type of entity, such as public equity, commodity, cryptocurrency, etc.
Process: We refer to the 'Entities' database and update the entity_type field.
Data Type: String (Categorical)
Value Range: [US_EQUITY, INTERNATIONAL_EQUITY, FOREX, COMMODITY, CRYPTOCURRENCY]
Analytics Type: Metadata
entity_indices
Definition: A list of popular indices where the entity is a constituent.
Process: We refer to the 'Entities' database and update the entity_indices field.
Data Type: Array of Strings
Value Range: [US_EQUITY, INTERNATIONAL_EQUITY, FOREX, COMMODITY, CRYPTOCURRENCY]
Analytics Type: Metadata
entity_figi
Definition: FIGI code of the entity (asset class).
Process: We refer to the 'Entities' database and update the entity_figi field.
Data Type: String
Value Range: Please see openfigi.com
Analytics Type: Metadata
entity_country
Definition: It is the parent country of the entity.
Process: We refer to the 'Entities' database and update the entity_country field.
Data Type: String
Value Range: Global
Analytics Type: Metadata
entity_share_class
Definition: Share class FIGI code for an entity (asset class).
Process: We refer to the 'Entities' database and update the entity_share_class field.
Data Type: String
Value Range: Please see openfigi.com
Analytics Type: Metadata
entity_region
Definition: It is the region where the entity is traded.
Process: We refer to the 'Entities' database and update the entity_region field.
Data Type: String (Categorical)
Value Range: All major regions
Analytics Type: Metadata
entity_sector
Definition: It is the sector the entity belongs to.
Process: We refer to the 'Entities' database and update the entity_sector field.
Data Type: String (Categorical)
Value Range: All major sectors
Analytics Type: Metadata
entity_hits
Definition: Hit word(s) of the entity.
Process: A list of words is generated for the entity hits by the Accern proprietary API. A distinct list of hits is then extracted for the entity_hits field.
Data Type: Array of Strings
Value Range: All Global Equities, Commodities, Cryptocurrencies, and Forex
Analytics Type: Derived
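Reducing a raw list of hit words to a distinct list can be sketched as below. The function name is hypothetical, and keeping the hits in first-seen order is an assumption; the ordering of the real field is not specified here.

```python
from typing import List, Sequence


def distinct_hits(hits: Sequence[str]) -> List[str]:
    """Drop repeated hit words while keeping first-seen order."""
    # dict preserves insertion order, so this deduplicates in one pass.
    return list(dict.fromkeys(hits))
```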
entity_text
Definition: Text surrounding the tagged entity.
Process: Accern’s proprietary API recognizes the relevant text surrounding the tagged entity in order to update the entity_text field.
Data Type: Array of Strings
Value Range: ~[1, 17] words
Analytics Type: Derived
entity_attributes
Definition: Additional information associated with the entity.
Process: We refer to the 'Entities' database and update the entity_attributes field.
Data Type: A map of string (key) and type Any (value)
Value Range: N/A
Analytics Type: Metadata
event_accern_id
Definition: Accern ID of the event.
Process: Events are extracted by Accern's proprietary event extraction models. Once an event is extracted from a theme, we retrieve its Accern ID available in our databases and update the event_accern_id value.
Data Type: String
Value Range: N/A
Analytics Type: Derived
event_relevance
Definition: Scores an event based on the emphasis with which it is mentioned in a document.
Process: To determine the relevance of an event, we consider the following two factors: a) the number of times the event is mentioned in the article, and b) the positions within the text where the event is mentioned. We then combine these two factors into a single relevance score. Events that are mentioned frequently and appear earlier in the document receive higher relevance scores than events that are mentioned relatively fewer times and mostly in the later sections of the document. It is important to note that there may be multiple highly relevant events in a document. Conversely, we reject any document that does not contain any relevant event.
Data Type: Double
Value Range: [0, 100]
Analytics Type: Derived
event_sentiment
Definition: This is the sentiment value calculated for each event based on the text surrounding it.
Process: We calculate the sentiment using Accern’s proprietary sentiment analysis models.
Data Type: Double
Value Range: [-100, 100]
Analytics Type: Derived
event_group
Definition: Event groups are broader financial event categories, each of which contains multiple related events.
Process: Accern has developed a financial event tree that contains 25+ financial event groups, 250+ financial events, and over a million financial phrases. A financial event can be part of only one event_group, whereas each event_group can contain multiple financial events. Once a financial event is extracted by the event extraction model, we search for the parent group in our database and update the event_group field.
Data Type: String (Categorical)
Value Range: 25+ Unique Financial Event Groups
Analytics Type: Derived
event_name
Definition: Financial events extracted from the stories.
Process: We actively maintain an 'Events' database that contains important corporate events. Each signal contains a unique financial event for a specific company.
Data Type: String
Value Range: N/A
Analytics Type: Derived
event_hits
Definition: Text (words/phrases) as the event was found in the document.
Process: A list of words is generated for the event hits by the Accern proprietary API. A distinct list of hits is then extracted for the event_hits field.
Data Type: Array of Strings
Value Range: All events from Accern’s “Events” database
Analytics Type: Derived
event_text
Definition: Text surrounding the tagged event.
Process: Accern’s proprietary API recognizes the relevant text surrounding the tagged event in order to update the event_text field.
Data Type: Array of Strings
Value Range: ~[1, 17] words
Analytics Type: Derived
event_attributes
Definition: Additional information associated with the event.
Process: We refer to the 'Events' database and update the event_attributes field.
Data Type: A map of string (key) and type Any (value)
Value Range: N/A
Analytics Type: Metadata