init commit

1 month ago · e8fec0d5be
commit e8fec0d5be
4 changed files with 125 additions and 0 deletions
--- a/Readme.md
+++ b/Readme.md
@@ -0,0 +1,5 @@
 ## Projects
 - [Home index of scientific papers](./Индекс.md)
 - [Smart reader for scientific papers](./Читалка.md)
 - [Citation graph](./Граф\ статей.md)
--- a/статей.md
+++ b/статей.md
@@ -0,0 +1,23 @@
 Existing alternatives:
 **Goal:**
 Build a web tool that visualizes citation links between papers, using the open Crossref database (https://www.crossref.org/).
 **Tasks:**
 - process the Crossref database and load it into a database for subsequent search
 - write a backend in Go for 
 - UI (plenty of room for imagination and experiments)
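The first task can be sketched roughly as follows. This is a minimal illustration, assuming the input records follow the shape of Crossref work metadata: a `DOI` field plus an optional `reference` list whose entries may carry their own `DOI`. The function name `citation_edges` and the sample records are hypothetical; the resulting pairs are what would later be loaded into the search database.

```python
from typing import Iterable, Iterator


def citation_edges(works: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Yield (citing_doi, cited_doi) pairs from Crossref-style work records."""
    for work in works:
        citing = work.get("DOI")
        if not citing:
            continue  # a record without a DOI cannot be a graph node
        for ref in work.get("reference", []):
            cited = ref.get("DOI")
            if cited:
                # DOIs are case-insensitive, so normalize before storing.
                yield citing.lower(), cited.lower()


# Hypothetical sample records, shaped like Crossref work metadata.
sample_works = [
    {"DOI": "10.1000/A", "reference": [{"DOI": "10.1000/b"}, {"unstructured": "no DOI here"}]},
    {"DOI": "10.1000/b"},
]
edges = list(citation_edges(sample_works))
print(edges)
```

References without a DOI are skipped here; resolving them would need a separate matching step.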
 **Similar projects:**
 - https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/
 - https://www.litmaps.com/attributions
 - https://www.connectedpapers.com/about
 - https://openalex.org/
 **What we will use:**
 PostgreSQL + Go (gin, bun) + TS (Angular) 
--- a/Индекс.md
+++ b/Индекс.md
@@ -0,0 +1,53 @@
 # Useful links for building the index
 **Goal:**
 Build a vector index over at least 1 million scientific papers. Later it can be used for retrieval-augmented generation (https://cloud.google.com/use-cases/retrieval-augmented-generation) and as a base for summarization.
 **Tasks:**
 - choose a reliable (?) PDF parser that preserves the information 
 - choose a vector database and load the PDF contents into it
 - write test queries
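A test query against a vector index boils down to nearest-neighbor search over embeddings. The sketch below is a brute-force illustration only: the vectors in `toy_index` are made up, and the helper names `cosine` and `search` are hypothetical. A real setup would take embeddings from a model and delegate the search to one of the vector databases.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query, top_k=2):
    """Return the ids of the top_k vectors most similar to the query."""
    scored = [(cosine(vec, query), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 0.0, 1.0],
}
results = search(toy_index, [1.0, 0.0, 0.0])
print(results)
```

Brute force scans every vector on every query, which is why an index of a million papers calls for a dedicated vector database.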
 # Materials
 ## Databases
 - https://github.com/qdrant/qdrant
 - https://github.com/crate/crate
 - https://github.com/weaviate/weaviate
 - https://github.com/chroma-core/chroma 
 - https://github.com/milvus-io/milvus
 - Elasticsearch
 ## Parsing libraries
 - https://github.com/Filimoa/open-parse/tree/main
 - https://github.com/jsvine/pdfplumber
 - https://github.com/topics/pdf-parser
 - https://github.com/py-pdf/pypdf
 - https://github.com/smalot/pdfparser
 - https://github.com/jstockwin/py-pdf-parser
 - https://github.com/RDFLib/rdflib
 - https://pypi.org/project/camelot-py/
 - https://pypi.org/project/tabula-py/
 - Apache Tika
 **AI parsers**
 An example of zero-shot PDF extraction based on gpt-mini: https://github.com/getomni-ai/zerox?tab=readme-ov-file. It also links to other, paid alternatives:
 	- https://aws.amazon.com/textract/pricing/#:~:text=Amazon%20Textract%20API%20pricing
 	- https://cloud.google.com/document-ai/pricing
 	- https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/
 	- https://unstructured.io/api-key-hosted#:~:text=Cost%20and%20Usage%20%0AGuidelines
 An evaluation of various Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
 ## Summarization
 ### LLM
 A framework for building LLM-based applications: https://github.com/langchain-ai/langchain?tab=readme-ov-file
 ### Text embeddings
 - https://qdrant.github.io/fastembed/
 - https://github.com/qdrant/qdrant
--- a/Читалка.md
+++ b/Читалка.md
@@ -0,0 +1,44 @@
 **Goal:** build a tool for processing scientific papers.
 Related work in this direction: https://arxiv.org/abs/2210.02830
 **Tasks**:
 **What we will use**
 # Additional materials
 ## Text embeddings
 - https://qdrant.github.io/fastembed/
 - https://github.com/qdrant/qdrant
 ## Parsing libraries
 - https://github.com/Filimoa/open-parse/tree/main
 - https://github.com/jsvine/pdfplumber
 - https://github.com/topics/pdf-parser
 - https://github.com/py-pdf/pypdf
 - https://github.com/smalot/pdfparser
 - https://github.com/jstockwin/py-pdf-parser
 - https://github.com/RDFLib/rdflib
 - https://pypi.org/project/camelot-py/
 - https://pypi.org/project/tabula-py/
 - Apache Tika
 **AI parsers**
 An example of zero-shot PDF extraction based on gpt-mini: https://github.com/getomni-ai/zerox?tab=readme-ov-file. It also links to other, paid alternatives:
 	- https://aws.amazon.com/textract/pricing/#:~:text=Amazon%20Textract%20API%20pricing
 	- https://cloud.google.com/document-ai/pricing
 	- https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/
 	- https://unstructured.io/api-key-hosted#:~:text=Cost%20and%20Usage%20%0AGuidelines
 An evaluation of various Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
 **Options for writing the GUI**
 - https://dioxuslabs.com/
 - https://tauri.app
 - an ultra-fast Tauri + Angular setup: https://github.com/maximegris/angular-tauri