# Semantic search tool

A simple CLI and web tool for searching your PDF files. The search works by [embedding](https://en.wikipedia.org/wiki/Word_embedding) the given query and looking up the nearest vectors in a vector database; the closest vectors are the best matches.

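The nearest-vector lookup described above can be illustrated with a minimal sketch. It is not taken from `main.py`: the function names, the toy 3-dimensional vectors and the choice of Euclidean distance are all assumptions made for illustration, and the tool's actual metric and storage format may differ.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, entries, k=3):
    """Return the k entries closest to query_vec as (distance, label) pairs."""
    scored = [(euclidean(query_vec, vec), label) for label, vec in entries]
    return sorted(scored)[:k]


# Toy "embeddings" (hypothetical); real ones would come from an embedding
# model such as nomic-embed-text and have many more dimensions.
toy_db = [
    ("page about balanced trees", [0.9, 0.1, 0.0]),
    ("page about sorting", [0.1, 0.9, 0.0]),
    ("page about file systems", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]

for dist, label in nearest(query, toy_db, k=2):
    print(f"{dist:.4f}  {label}")
```

A smaller distance means a closer match, which is why results are listed in ascending order of distance.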
## Dependencies

### Ollama

You need a running Ollama server, either as a local instance or from a Docker image.

#### Local instance

* install [ollama](https://ollama.ai)
* run `ollama serve`
* pull the selected embedding model:
```bash
> ollama pull nomic-embed-text
```

#### Ollama: Podman / Docker image

* Start a container from the Docker/Podman [image](https://hub.docker.com/r/ollama/ollama):
```bash
> podman run -d -v models:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
* Then pull the selected model (in our case `nomic-embed-text`):
```bash
> podman exec -ti ollama ollama pull nomic-embed-text
```

### UV

Install UV from your package manager. You can also run the script as-is without UV, but then you need to install the required dependencies (the `pymupdf` and `ollama` packages) manually.

## Running

### Test

Check that the script can reach the Ollama server and create / delete databases:

```
> uv run main.py test
```

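If the test command fails, it can help to check separately whether anything is listening on the Ollama port. The following is a minimal standalone sketch, not the check that `main.py test` performs. The function name `ollama_reachable` is hypothetical, and the sketch assumes the server is on the default port 11434 used in the container example above and answers a plain `GET /` with status 200.

```python
import http.client
import urllib.request


def ollama_reachable(base_url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_reachable())
```

A `False` result only says that no server answered at that address; it does not tell you whether the embedding model has been pulled.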
### Create database

```
> uv run main.py create
```

### Add files

```
> uv run main.py add-file db.pkl ~/docs/*pdf
```

### Query (CLI)

```
> uv run main.py query db.pkl "balanced tree"
Querying: 'balanced tree' in database: db.pkl

Found 10 results:
============================================================

1. Distance: 15.2735
Document: [Niklaus_Wirth]_Algorithms_+_Data_Structures_=_Programs.pdf
Page: 236, Chunk: 1
Text preview: constructed balanced tree? 2. What is the probability that an insertion requires rebalancing? Mathematical analysis of this complicated algorithm is still an open problem. Empirical tests support t...
----------------------------------------

2. Distance: 15.7531
Document: [Niklaus_Wirth]_Algorithms_+_Data_Structures_=_Programs.pdf
Page: 230, Chunk: 1
Text preview: s balanced if and only if for every node the heights of its two subtrees differ by at most 1. Trees satisfying this condition are often called AVL-trees (after their inventors). We shall simply cal...
----------------------------------------
...
```

### Query (web interface)

Start the server first:

```bash
> uv run main.py host db.pkl
Starting web server...
Database: db.pkl
URL: http://127.0.0.1:5000
Press Ctrl+C to stop
* Serving Flask app 'main'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000
Press CTRL+C to quit
```

Then visit http://localhost:5000/ and search there.

**Important**

If you intend to expose this publicly, use a production WSGI server instead of the one built into Flask. Also, since I intend to use this only internally, not much effort went into securing the server.