Introduction
At the end of 2024, the race of LLMs (Large Language Models) remains one of the most interesting topics around. While companies small and large keep finding new ways to integrate these powerful neural networks into their own systems, in this article we will explore a scenario in which one of the finest (and most human-like) LLM families - Anthropic's Claude - is used to create brief news summaries in English (or any other language) from a list of Serbian news articles.
According to Wikipedia (https://en.wikipedia.org/wiki/Languages_used_on_the_Internet), the Serbian language (the author's mother tongue) currently sits around 30th place, used by roughly 0.1% of websites. Put simply, this is tiny compared not only to English (roughly 50%), but also to Spanish, Russian, German and French (all around 4-5%).
In this article we will imagine that we are building a simple web application aimed at helping our (virtual) friend: a foreign correspondent stationed in a country whose language is (still) an unknown pleasure to them.
The flow of data will be simplified - we will begin with a simple CSV file containing the articles. The articles, potentially coming from different news sources, have a minimal structure: a title, a body of content and a date. They are in Serbian. The principle is simple: the articles are short enough that they will not be split and chunked - each article will get its own embedding vector and be stored in a MongoDB collection.
The embeddings that we will use - embedić - are created specifically for the Serbian language.
As the model's description states:
> Novak Zivanic has made a significant contribution to the field of Natural Language Processing with the release of Embedić, a suite of Serbian text embedding models. These models are specifically designed for Information Retrieval and Retrieval-Augmented Generation (RAG) tasks.
Embedding models are fundamental when working with the meaning of words, sentences and text in general. Mathematically, embeddings are rather simple - they represent high-dimensional data (in our case text, but it could be images, sounds or other meaningful data structures) in a lower-dimensional but denser vector space, while preserving semantic relationships. There are many excellent introductions to embeddings, so I won't delve into them here. Some LLM providers, such as OpenAI, offer vector embeddings, while others, like Anthropic, do not.
Since we want to use particular embeddings - aptly named _embedić_ - we will use the Hugging Face Sentence Transformers library, which converts text into numerical vectors that carry semantic meaning. Sentence Transformers excel at exactly the tasks we need: finding similar sentences, semantic search and text clustering. The library offers a myriad of models, updated or released frequently; they are free to use - there are no API costs involved - and they can be fine-tuned, though they might not match the state-of-the-art performance of OpenAI's or Voyage's offerings.
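To make this concrete, here is a minimal sketch of the kind of task Sentence Transformers handle; the two Serbian sentences are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# Load the Serbian embedding model from the Hugging Face Hub.
model = SentenceTransformer('djovak/embedic-large')

# Two made-up, related Serbian sentences.
embeddings = model.encode([
    "Vlada je usvojila novi budžet.",    # "The government adopted the new budget."
    "Skupština je glasala o budžetu.",   # "The assembly voted on the budget."
])

# Cosine similarity: values closer to 1 indicate closer meaning.
print(util.cos_sim(embeddings[0], embeddings[1]))
```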
Once the articles are "translated" into their universally understandable format - vectorized - we will save them to a MongoDB collection on an Atlas cluster and create a vector index. The MongoDB vector index enables fast and efficient vector search and is documented online (https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/).
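As a sketch, saving the vectorized articles and creating such an index programmatically might look like this. The collection and field names (`articles`, `embedding`) and the index name are assumptions, `numDimensions` must match the model's output size (check `model.get_sentence_embedding_dimension()`), and a recent pymongo (4.7+) is assumed:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
collection = client["news"]["articles"]

# Each document carries its text fields plus the embedding vector:
# collection.insert_many(documents)

# Define a vector index over the 'embedding' field; the dimensions and
# similarity metric must match the model that produced the vectors.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 1024,  # assumption: verify against the model
                "similarity": "cosine",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)
```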
Finally, with the vector index in place and once MongoDB Atlas completes the indexing process, the database is ready to become the backbone of our correspondent system. Enter Claude - one of the top-performing LLM families, provided by Anthropic. While you could easily swap this setup to OpenAI or a model served by Groq, we will use the Anthropic Claude Opus model, which plays particularly nicely with articulate text and summarization.
In order to generate the final text, we will implement a simple version of the RAG methodology, very similar to the one described in the book Building AI-Intensive Python Applications. Simply put, RAG, or Retrieval-Augmented Generation, is an approach in which a number of documents that are similar to the query (that is, close to it in the vector space) are returned from a database and then passed to the model along with a set of instructions, in order to get a combined response that (hopefully) contains all the relevant data from the retrieved documents.
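Concretely, the retrieval step can boil down to a single aggregation pipeline with a `$vectorSearch` stage. The sketch below is a minimal version, assuming the index and field names used earlier (`vector_index`, `embedding`), a `collection` handle, and the loaded Sentence Transformer as `model`:

```python
from typing import Dict, List

def retrieve_articles(query: str, limit: int = 5) -> List[Dict]:
    # Embed the query with the same model used for the documents.
    query_vector = model.encode(query).tolist()

    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",   # the Atlas Search index name
                "path": "embedding",       # the field holding the vectors
                "queryVector": query_vector,
                "numCandidates": 100,      # candidates scanned before ranking
                "limit": limit,
            }
        },
        # Keep only the fields we will feed to the LLM.
        {"$project": {"_id": 0, "title": 1, "content": 1, "date": 1}},
    ]
    return list(collection.aggregate(pipeline))
```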
After a MongoDB pipeline retrieves the relevant news articles - and here we are really only scratching the surface of MongoDB's flexibility and scalability - the pieces are stitched together by Claude through a prompt, as sketched below. These steps could be optimized in many different ways, but that is not the topic of this article. The retrieval process could be improved through reranking (although this matters less here, since each article is treated as a unit and assigned a single embedding), and the retrieval query could group articles into "topics" or "stories" and reference them while maintaining logical and chronological consistency. The increased context size of the latest LLMs allows for various approaches to this kind of project, but we will stick to a simple, illustrative solution.
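The generation step might then look roughly like the following; the prompt wording is invented for illustration, and the exact Opus model ID should be checked against Anthropic's current model list:

```python
import anthropic
from typing import Dict, List

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def summarize(articles: List[Dict], language: str = "English") -> str:
    # Stitch the retrieved articles into a single prompt for Claude.
    context = "\n\n".join(
        f"[{a['date']}] {a['title']}\n{a['content']}" for a in articles
    )
    message = client.messages.create(
        model="claude-3-opus-20240229",  # assumption: verify the current model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Here are several Serbian news articles:\n\n{context}\n\n"
                f"Write a brief news summary in {language} covering the key facts."
            ),
        }],
    )
    return message.content[0].text
```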
This project is organized as a Jupyter notebook and is available [here](https://github.com/freethrow/mongodbnews/blob/main/ForerignCorrespondent.ipynb).
The data and the embeddings
First, let's read the news CSV file with pandas:
```python
import pandas as pd

# Load the Serbian news articles (title, content, date).
articles = pd.read_csv('fa_articles.csv', encoding='utf-8')
articles.head()
```

The articles won't mean much to you unless you are proficient in Serbian - they cover politics, the economy, social matters and the like. They have been filtered, so there is no art, entertainment or sports.
Now, let's install and import Sentence Transformers and our embedding model of choice - embedić. Yes, that is the letter ć, and it is pronounced like in the Italian word cappuccino.
```python
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
from datetime import datetime
from typing import List, Dict
import json

# Load the Serbian embedding model from the Hugging Face Hub.
model = SentenceTransformer('djovak/embedic-large')
```

Now we will create a function for concatenating the title and the content of each article and passing the result to the embedding model.
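A minimal version of that function might look like the following; the column names (`title`, `content`) and the separator are assumptions, and `.tolist()` turns the NumPy vector into a plain Python list that MongoDB can store directly:

```python
def embed_article(title: str, content: str) -> List[float]:
    # Concatenate title and body so the embedding captures both.
    text = f"{title}\n\n{content}"
    return model.encode(text).tolist()

# Embed every article in the dataframe, with a progress bar.
articles['embedding'] = [
    embed_article(row['title'], row['content'])
    for _, row in tqdm(articles.iterrows(), total=len(articles))
]
```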
Although this is a simple little project, which can and should be implemented in a much more concise way, it offers plenty of room for experimenting. Anthropic provides three Claude models - Opus, Sonnet and Haiku - each with different characteristics and well worth trying out. The art of prompting was already mentioned - there is ample room for improvement and experimentation there. Other models could be used, such as OpenAI's or the Llama models. The system, as is, is not meant to run as a server, so trying a locally hosted LLM through Ollama or a similar solution is also worthwhile.
In this type of setup, MongoDB proves to be an ideal database - the generated reports could be saved upon creation, revised or reviewed by a human (at least for fact-checking), and they could be linked to the original articles, creating a cascade of more complex narratives that follow a logical and chronological path and can then again be fed to an LLM. The possibilities are really endless - experimenting with MongoDB and AI unlocks interesting scenarios, and it is fun, too!