# pdf-to-kcf

A Python CLI tool that uses AI agents to parse PDF documents and extract structured insights. Built with `pydantic-ai`, this tool creates an intelligent agent that autonomously analyzes documents, requesting additional pages as needed to form complete insights.

## Features

- **Autonomous Document Analysis**: AI agent decides how much of the document to read
- **Structured Insight Extraction**: Classifies content as facts, opinions, or comments
- **Rich Metadata**: Adds attributes like source, confidence, dates, and more
- **Multiple AI Models**: Works with any model available through OpenRouter (Claude, GPT-4, Gemini, Llama, and others)
- **JSON Output**: Exports insights in a structured, machine-readable format

## Installation

This project uses [uv](https://github.com/astral-sh/uv) for dependency management:

```bash
# Install dependencies
uv sync
```

## Setup

1. Copy the environment template:
```bash
cp .env.example .env
```

2. Get an API key from [OpenRouter](https://openrouter.ai/) (free tier available)

3. Add the key to `.env`:
```bash
OPENROUTER_API_KEY=your_openrouter_api_key_here
```

## Usage

```bash
# Basic usage (uses OpenRouter with Claude 3.5 Sonnet by default)
uv run pdf-to-kcf document.pdf

# Specify custom output file
uv run pdf-to-kcf document.pdf -o insights.json

# Start from a specific page (0-indexed)
uv run pdf-to-kcf document.pdf -s 3

# Use a different AI model from OpenRouter
uv run pdf-to-kcf document.pdf -m meta-llama/llama-3.1-70b-instruct
uv run pdf-to-kcf document.pdf -m google/gemini-pro-1.5
```

### Options

- `--output, -o`: Output JSON file path (default: `<pdf_name>_insights.json`)
- `--start-page, -s`: Starting page number, 0-indexed (default: 0)
- `--model, -m`: AI model to use via OpenRouter (default: `anthropic/claude-3.5-sonnet`)
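
The flags above map naturally onto a standard argument parser. The following is a minimal sketch using `argparse`, not the tool's actual `cli.py`: the function names `build_parser` and `resolve_output` are hypothetical, and placing the default output file next to the input PDF is an assumption.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags; the real CLI may differ.
    parser = argparse.ArgumentParser(prog="pdf-to-kcf")
    parser.add_argument("pdf", type=Path, help="PDF file to analyze")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="output JSON file path")
    parser.add_argument("-s", "--start-page", type=int, default=0,
                        help="starting page number, 0-indexed")
    parser.add_argument("-m", "--model", default="anthropic/claude-3.5-sonnet",
                        help="model identifier in <provider>/<model-name> form")
    return parser


def resolve_output(pdf: Path, output: Optional[Path]) -> Path:
    # Documented default name: <pdf_name>_insights.json.
    # Assumption: the file is written alongside the input PDF.
    if output is not None:
        return output
    return pdf.with_name(f"{pdf.stem}_insights.json")


args = build_parser().parse_args(["document.pdf", "-s", "3"])
print(resolve_output(args.pdf, args.output))  # document_insights.json
```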

### Available Models

When using OpenRouter, you can specify any model using the format `<provider>/<model-name>`:
- `anthropic/claude-3.5-sonnet` (default, recommended)
- `anthropic/claude-3-opus`
- `openai/gpt-4o`
- `meta-llama/llama-3.1-70b-instruct`
- `google/gemini-pro-1.5`
- See [OpenRouter models](https://openrouter.ai/models) for full list

## Output Format

The tool generates JSON files with structured insights:

```json
{
  "insights": [
    {
      "type": "fact",
      "insight": "Global temperatures have risen 1.1°C since pre-industrial times",
      "content": "According to the IPCC, global temperatures have risen approximately 1.1°C...",
      "attributes": [
        {"attribute": "source", "value": "IPCC Report"},
        {"attribute": "confidence", "value": "high"},
        {"attribute": "year", "value": "2023"}
      ]
    },
    {
      "type": "opinion",
      "insight": "The author believes immediate action is required",
      "content": "We must act now to prevent catastrophic consequences...",
      "attributes": [
        {"attribute": "sentiment", "value": "urgent"},
        {"attribute": "section", "value": "conclusion"}
      ]
    }
  ]
}
```
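
Because the output is plain JSON, it can be post-processed with the standard library alone. The sketch below groups insights by `type` and flattens each `attributes` list into a dictionary. It embeds a small sample in the documented shape instead of reading a file, and the helper name `group_by_type` is illustrative, not part of the tool.

```python
import json
from collections import defaultdict

# Small sample in the documented output shape; in practice this would be
# loaded from the generated JSON file with json.load().
raw = """
{"insights": [
  {"type": "fact", "insight": "A", "content": "...",
   "attributes": [{"attribute": "source", "value": "IPCC Report"}]},
  {"type": "opinion", "insight": "B", "content": "...",
   "attributes": [{"attribute": "sentiment", "value": "urgent"}]}
]}
"""


def group_by_type(document: dict) -> dict:
    """Group insights by their "type" field (illustrative helper)."""
    grouped = defaultdict(list)
    for item in document["insights"]:
        # Flatten [{"attribute": k, "value": v}, ...] into {k: v, ...}.
        attrs = {a["attribute"]: a["value"] for a in item.get("attributes", [])}
        grouped[item["type"]].append(
            {"insight": item["insight"], "attributes": attrs}
        )
    return dict(grouped)


grouped = group_by_type(json.loads(raw))
for kind, items in grouped.items():
    print(kind, len(items))
```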

## How It Works

1. **PDF Loading**: Extracts text content from PDF using pypdf
2. **Agent Initialization**: Creates a pydantic-ai agent with the specified model
3. **Autonomous Analysis**: Agent analyzes content and can request additional pages
4. **Insight Extraction**: Classifies and structures insights with metadata
5. **JSON Export**: Saves all insights to a JSON file
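
The control flow of steps 3 and 4 can be sketched without any model or PDF library. In the sketch below, `analyze_page` is a stand-in for the real model call, `PageResult` is a hypothetical result type, and the "(continued)" marker is an invented signal for "the agent wants the next page". None of these names come from the tool itself.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    # Hypothetical result of analyzing a single page.
    insights: list = field(default_factory=list)
    wants_next_page: bool = False


def analyze_page(text: str) -> PageResult:
    # Stand-in for the model call. A real agent would decide for itself
    # whether it needs more context; this stub treats a trailing
    # "(continued)" marker as a request for the next page.
    stripped = text.rstrip()
    wants_more = stripped.endswith("(continued)")
    body = stripped.removesuffix("(continued)").strip()
    return PageResult(insights=[body], wants_next_page=wants_more)


def run(pages: list, start_page: int = 0) -> list:
    """Read pages from start_page until the analyzer stops asking for more."""
    insights = []
    index = start_page
    while index < len(pages):
        result = analyze_page(pages[index])
        insights.extend(result.insights)
        if not result.wants_next_page:
            break
        index += 1
    return insights


pages = ["Intro (continued)", "Findings", "Appendix"]
print(run(pages))  # ['Intro', 'Findings']
```

The loop stops as soon as the analyzer no longer asks for another page, which is one simple way to let the agent decide how much of the document to read.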

## Requirements

- Python 3.12 or higher
- OpenRouter API key (set as `OPENROUTER_API_KEY` environment variable)
  - Get your free API key at [OpenRouter](https://openrouter.ai/)
  - Supports all major AI models (Claude, GPT-4, Gemini, Llama, etc.)
- Alternatively, use `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, or other provider keys

## Architecture

The tool follows an agentic document-parsing design with these core components:

- **models.py**: Data structures (ContentInsight, PageContentAnalysis, etc.)
- **pdf_reader.py**: PDF text extraction (PDFDocument class)
- **agent.py**: AI agent with autonomous page reading capability
- **cli.py**: Command-line interface

See `CLAUDE.md` for detailed architecture documentation.

## License

MIT