We used arXiv’s open-access dataset, taking 15,000 records that cover ML- and AI-oriented papers. We created an embedding for each paper from its abstract, using Cohere’s small embedding model, which produces a 1024-dimensional vector per record. The resulting 15,000 × 1,024 vectors were then uploaded to a collection in Qdrant. At query time, an embedding is generated for the abstract of a chosen article or for a given prompt, and Qdrant searches the collection for the most similar texts and returns their indices. Similarity between vectors is measured with the cosine distance metric.

Besides article recommendations, we provide research paper summarization and translation of a given paragraph from English into 8 languages: Tamil, Nepali, Indonesian, Thai, Spanish, Russian, Turkish, and French. Translation uses the mBART Large-50 one-to-many model for multilingual machine translation. Text summarization is done using Cohere’s API.
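To make the similarity search step concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. It is a brute-force stand-in for what a vector database does at scale, not the project's actual code: the function names `cosine_similarity` and `top_k_similar`, and the tiny 4-dimensional toy vectors, are assumptions made for illustration (the real vectors are 1024-dimensional embeddings of abstracts).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k_similar(query, vectors, k=3):
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda s: s[0], reverse=True)  # highest similarity first
    return [(i, score) for score, i in scored[:k]]


# Toy stand-ins for paper embeddings (hypothetical values).
paper_vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Toy stand-in for the embedding of a query abstract or prompt.
query_vector = [1.0, 0.05, 0.0, 0.0]

results = top_k_similar(query_vector, paper_vectors, k=2)
for index, score in results:
    print(f"paper {index}: similarity {score:.4f}")
```

The returned indices play the same role as the indices returned by the search step described above: they identify which stored papers to recommend.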
Category tags: Education, Summarization, Language and Translation
"Great project for students!"
Olesia Zinchenko
Event Manager & Mentor at lablab.ai