projs-2024-autumn/Индекс.md

# Home index

**Goal:**
Build a vector index over at least 1 million scientific papers.
Next steps: retrieval-augmented generation (https://cloud.google.com/use-cases/retrieval-augmented-generation) plus summarization built on top of the index.
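A rough sizing check for the 1 million paper target. This is only a back-of-the-envelope sketch: the embedding dimension (384), the number of chunks per paper (20), and float32 storage are assumptions, not choices made in these notes, and the estimate ignores index overhead and stored payloads.

```python
def raw_vectors_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw size of the stored vectors in GiB (no index overhead, no payloads)."""
    return num_vectors * dim * bytes_per_value / 2**30


PAPERS = 1_000_000
CHUNKS_PER_PAPER = 20  # assumption: depends entirely on the chunking strategy
DIM = 384              # assumption: dimension of a small embedding model

# One vector per paper (e.g. abstract only) vs. one vector per chunk (full text).
print(f"per paper: {raw_vectors_gib(PAPERS, DIM):.2f} GiB")
print(f"per chunk: {raw_vectors_gib(PAPERS * CHUNKS_PER_PAPER, DIM):.2f} GiB")
```

The gap between the two numbers is the main reason to decide early whether the index covers abstracts or full text.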

**Tasks:**
- choose a reliable (?) PDF parser that preserves the information in the papers
- choose a vector database and load the PDF contents into it
- write test queries
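The tasks above can be prototyped end to end before any parser or database is chosen. The sketch below is a toy stand-in, not a proposed implementation: `embed` is a hypothetical hashed bag-of-words function standing in for a real embedding model, `ToyIndex` is a brute-force search standing in for a vector database, and the two "papers" are made-up strings in place of parsed PDF text.

```python
import hashlib
import math
from collections import Counter

DIM = 1024  # hypothetical toy dimension


def embed(text: str) -> list[float]:
    """Toy hashed bag-of-words embedding, L2-normalised (stand-in for a real model)."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, size: int = 50) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


class ToyIndex:
    """Brute-force cosine-similarity index; a vector database would replace this."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, text: str) -> None:
        for piece in chunk(text):
            self.items.append((doc_id, embed(piece)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = embed(query)
        # Vectors are normalised, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, vec)), doc_id)
                  for doc_id, vec in self.items]
        return sorted(scored, reverse=True)[:k]


index = ToyIndex()
index.add("paper-a", "graph neural networks for molecular property prediction")
index.add("paper-b", "transformer language models for abstractive text summarization")

# Test queries: each maps to the document expected at rank 1.
test_queries = {
    "text summarization with transformer models": "paper-b",
    "molecular graph networks": "paper-a",
}
for query, expected in test_queries.items():
    top_doc = index.search(query, k=1)[0][1]
    print(query, "->", top_doc, "(ok)" if top_doc == expected else "(miss)")
```

Keeping the test queries as a query-to-expected-document mapping means the same check can be rerun unchanged once a real parser, embedding model, and database are plugged in.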

# Materials

## Databases

- https://github.com/qdrant/qdrant
- https://github.com/crate/crate
- https://github.com/weaviate/weaviate
- https://github.com/chroma-core/chroma 
- https://github.com/milvus-io/milvus
- Elasticsearch

## Parsing libraries

- https://github.com/Filimoa/open-parse/tree/main
- https://github.com/jsvine/pdfplumber
- https://github.com/topics/pdf-parser
- https://github.com/py-pdf/pypdf
- https://github.com/smalot/pdfparser
- https://github.com/jstockwin/py-pdf-parser
- https://github.com/RDFLib/rdflib
- https://pypi.org/project/camelot-py/
- https://pypi.org/project/tabula-py/
- Apache Tika

**AI parsers**

https://github.com/getomni-ai/zerox?tab=readme-ov-file is an example of zero-shot PDF extraction based on gpt-mini. Its README links to other paid alternatives:
	- https://aws.amazon.com/textract/pricing/#:~:text=Amazon%20Textract%20API%20pricing
	- https://cloud.google.com/document-ai/pricing
	- https://azure.microsoft.com/en-us/pricing/details/ai-document-intelligence/
	- https://unstructured.io/api-key-hosted#:~:text=Cost%20and%20Usage%20%0AGuidelines

An evaluation of various Multimodal Large Language Models: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

## Summarization

### LLM

A framework for building LLM-based applications: https://github.com/langchain-ai/langchain?tab=readme-ov-file

### Text embeddings

- https://qdrant.github.io/fastembed/
- https://github.com/qdrant/qdrant