For several years, we have used the broker research reports database (FRC) to conduct keyword searches. Each keyword hit is tied to a specific stock or company, recorded in a specific report on a specific date, and the hits are then aggregated per stock. To mitigate skewness, we apply a logarithmic transformation to the hit counts, followed by normalization to a range of 1 to 100. To account for size effects, we divide the log-transformed hits by the company’s total market capitalization in US dollars and normalize again. The Keyword Composition Score (KCS) is then: KCS = 70% * Normalized log(hits) + 30% * Normalized Market-Adj. log(hits).
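For concreteness, here is a minimal Python sketch of this KCS computation, assuming hits have already been aggregated per stock; the column names ("hits", "market_cap_usd"), the min-max scaling to [1, 100], and the use of pandas are illustrative assumptions rather than our production code.

```python
import numpy as np
import pandas as pd

def normalize_1_100(s: pd.Series) -> pd.Series:
    """Min-max scale a series to the [1, 100] range (assumed interpretation
    of "normalization between 1 and 100")."""
    return 1 + 99 * (s - s.min()) / (s.max() - s.min())

def kcs(df: pd.DataFrame) -> pd.Series:
    """Keyword Composition Score per stock, per the formula above."""
    log_hits = np.log(df["hits"])                  # log transform mitigates skewness
    mkt_adj = log_hits / df["market_cap_usd"]      # size adjustment by market cap (USD)
    return 0.70 * normalize_1_100(log_hits) + 0.30 * normalize_1_100(mkt_adj)
```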
While this approach has proven effective, it naturally invites questions about its rigidity and simplicity: it treats all keyword hits equally, regardless of context. For instance, if two separate paragraphs each contain one occurrence of a keyword, our current method deems them equally relevant. On closer reading, however, one paragraph may be far more pertinent to the topic “generative AI” than the other. Consider the two report excerpts below:
“The ISS ESG Cyber Risk Score is a concise, empirical, and proactive metric that seeks to convey how well a company manages and maintains its network security. It is a quantitative and data-driven rating that provides visibility into the level of cyber readiness and resilience an organization has implemented based on its ongoing actions to identify, manage, and mitigate cyber risk across its external technology networks, powered by a machine learning model trained to identify the potential for a breach event over the next 12 months. It requires no information to be provided by the company. The Firmographic Max presented below the ISS ESG Cyber Risk Score reflects the organization’s maximum achievable Cyber Risk Score for their organization, considering inherent industry and organizational factors including sector classification and employee count. The ISS ESG Cyber Risk Score is presented in this report for information only” (from a report on Synopsys, Inc.);
“For example, media sector can benefit from content generation in texts, pictures or video formats. Usage fees is one of the business models. On the other hand, AI customer service can achieve higher efficiencies in financial sectors; (b) In our view, paid subscription is a potential model for individuals, as it opens up to enterprises on product refinements, and expands to more individuals in the future; (c) addressable market is huge across different scenarios; (3) large language model requires huge upfront investment, and it is hard for small players entering into the market. The infrastructure cost is amortized over a number of years, and requirements on training and analysis are different. Acceleration in search revenue growth in 2023. Baidu expects search ads to experience positive YoY growth in 1Q23, and accelerates to about high-single-digit YoY riding on recovery story for full year 2023.” (from a report on Baidu).
To address this issue, we could leverage large language models such as GPT or BERT. However, these models can be complex to deploy and integrate. We therefore propose a simpler alternative: cosine similarity. This method is straightforward, efficient, and easily implemented with the existing sklearn package and minimal code changes.
Take, for example, the similarity of each paragraph to a query phrase composed of our keyword string: “Generative AI Creative AI Synthetic Data Generation Generative Adversarial Networks GANs Generative Deep Learning Generative Neural Networks Generative Machine Learning Generative Modeling Artificial Intelligence Machine Learning Generative Pre-training Transformer Large Language Model GrokNet NLP Deep Learning Natural Language Processing Large Model”. For the two paragraphs quoted above, we quickly obtained similarity scores of 0.03591502 and 0.04268509, respectively. These scores align with our manual reading of their relevance, with the Baidu excerpt scoring higher.
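As a rough sketch, the scoring can be reproduced along these lines. Note that the TF-IDF representation is an assumption (the exact vectorizer settings are not shown above), so the precise scores will only match with the same configuration and the full excerpt texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The keyword query phrase quoted above (truncated here for brevity).
query = ("Generative AI Creative AI Synthetic Data Generation "
         "Generative Adversarial Networks GANs Generative Deep Learning ...")

# The two report excerpts quoted earlier (truncated here for brevity).
paragraphs = [
    "The ISS ESG Cyber Risk Score is a concise, empirical, and proactive metric ...",
    "For example, media sector can benefit from content generation in texts, pictures or video formats ...",
]

# Fit a shared vocabulary over the query and the paragraphs, then score
# each paragraph against the query.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([query] + paragraphs)
scores = cosine_similarity(matrix[0:1], matrix[1:])  # shape (1, 2)
print(scores)
```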
Moreover, since the text in question already comes from the relevant domain, and broker research rarely employs negative or ambiguous language, a large language model is a nice-to-have rather than a necessity. In practice, the simple cosine similarity computation proves sufficient for our needs.
We recently created a prototype portfolio for Samsung based on keyword hits, which produced a list of top-hit names. Now, let’s add an additional layer of cosine similarity, eliminate the logarithmic step, and maintain the normalization and market-adjusted normalization to determine the final composition score (see the sketch below). We can then compare the revised list against the original top 30.
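The following is a hedged sketch of one plausible reading of the revised score, in which each stock’s aggregated hit count is weighted by the cosine similarity of its underlying text; the weighting scheme, the retained 70/30 split, and the column names ("hits", "cos_sim") are assumptions, and normalize_1_100 is the helper from the earlier sketch.

```python
import pandas as pd

def revised_kcs(df: pd.DataFrame) -> pd.Series:
    """Revised composition score: cosine-weighted hits, no log transform,
    same two normalizations as before (70/30 split assumed unchanged)."""
    weighted = df["hits"] * df["cos_sim"]          # similarity-weighted hits, no log step
    mkt_adj = weighted / df["market_cap_usd"]      # same market-cap size adjustment
    return 0.70 * normalize_1_100(weighted) + 0.30 * normalize_1_100(mkt_adj)
```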

Observation: We found an 87% overlap between the two lists. Notably, companies providing upstream products or services, such as NVIDIA and AMD, ranked higher under the new approach, suggesting a stronger underlying relevance to our target theme. Additionally, the KCS under the new method exhibits greater skewness, reflecting the real-world pattern in which a few top players dominate the field.