Extract Structured Data from a Document

Given a document (PDF/HTML/text) and a target schema, return validated structured JSON — null for missing fields, never guessed values.

v1.0.1 · 0 installs · stable · by author-d67d571c56

extraction · json · schema-validation · parsing · documents

# Extract Structured Data from a Document

## Purpose
Convert an unstructured or semi-structured document into clean, schema-conformant JSON that downstream code can trust. The outcome is validated structured data where present fields are grounded in the document text and absent fields are explicitly `null`.

## When to use
- The user provides a document (invoice, resume, contract, report, web page, email) and a target schema or field list.
- A downstream system needs typed JSON instead of prose.
- You are populating a database, form, or API payload from human-written content.
- Do NOT use when the user wants a summary or free-form answer rather than a fixed shape.

## Inputs
- `document` (string or file, required): the source content (PDF, HTML, or plain text).
- `schema` (JSON Schema or field spec, required): target field names, types, required/optional, and any enums/formats.
- `format` (enum: `pdf` | `html` | `text` | `auto`, optional, default `auto`): hint for extraction.
- `strict` (boolean, optional, default `true`): if true, reject output that fails schema validation; if false, return best-effort plus an errors list.
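
As a rough illustration of how the four inputs and their defaults fit together, the sketch below models them as a small Python dataclass. The class name `ExtractionRequest` and the constant `ALLOWED_FORMATS` are hypothetical, introduced only for this example; the skill itself does not prescribe any particular container.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union

# Allowed values for the optional `format` hint, as listed under Inputs.
ALLOWED_FORMATS = ("pdf", "html", "text", "auto")


@dataclass(frozen=True)
class ExtractionRequest:
    """Hypothetical container for the inputs described above."""

    document: Union[str, bytes]   # required: source content
    schema: Dict[str, Any]        # required: JSON Schema or field spec
    format: str = "auto"          # optional hint, defaults to "auto"
    strict: bool = True           # optional, defaults to strict validation

    def __post_init__(self) -> None:
        if not self.document:
            raise ValueError("document is required")
        if not self.schema:
            raise ValueError("schema is required")
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {ALLOWED_FORMATS}")


# Only the two required inputs are supplied; the defaults fill in the rest.
request = ExtractionRequest(
    document="Invoice #: 4021",
    schema={
        "type": "object",
        "properties": {"invoice_number": {"type": "string"}},
        "required": ["invoice_number"],
    },
)
```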

## Steps
1. Normalize input to text. For PDFs use a PDF text extractor; for HTML strip boilerplate/markup to readable text; for scanned images use an OCR tool. Preserve table structure and reading order where possible.
2. Parse the schema. Enumerate every target field with its type, required flag, and constraints (enum, date format, number range). This is the contract.
3. Locate evidence. For each field, find the supporting span in the normalized text. Prefer labeled values ("Invoice #: 4021") over inferred ones.
4. Extract and coerce. Map each found value to the schema type: trim whitespace, parse numbers/currencies to numeric, normalize dates to ISO-8601, map free text to the nearest allowed enum value (only if unambiguous).
5. Handle missing data. If a field has no clear evidence in the document, set it to `null` — do NOT guess, infer from outside knowledge, or copy a similar-looking value.
6. Validate. Check the assembled object against the schema: required fields present (or explicitly null with a flagged warning), types correct, enums/formats satisfied. In `strict` mode, fix or re-extract until it validates; never emit invalid JSON.
7. Report confidence (optional). Where the schema allows, attach per-field provenance (the supporting text span) or a low-confidence flag for values that were derived from context within the document rather than stated directly.
8. Emit JSON only. Return the validated object (plus an `_errors`/`_warnings` array if `strict` is false).
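
Steps 3 to 5 can be sketched with a few small helpers. This is a minimal illustration under stated assumptions, not a complete extractor: the helper names (`find_labeled_value`, `coerce_number`, `coerce_date`) are hypothetical, numbers are assumed to use US-style separators (`1,284.50`), and only a handful of unambiguous date layouts are accepted. Anything that does not parse cleanly becomes `None`, which serializes to JSON `null`.

```python
import re
from datetime import datetime
from typing import Optional

# Date layouts that cannot be misread. Numeric forms such as 05/06/2026 are
# deliberately left out because day/month order is ambiguous.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y", "%b %d, %Y")

# US-style number: optional sign, optional thousands commas, optional decimals.
NUMBER_RE = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def find_labeled_value(text: str, label: str) -> Optional[str]:
    """Step 3: return the value on a line of the form 'Label: value', else None."""
    pattern = rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


def coerce_number(raw: Optional[str]) -> Optional[float]:
    """Step 4: parse a currency-like string to a float, or None if unclear."""
    if raw is None:
        return None
    cleaned = raw.strip().lstrip("$€£ ")
    if not NUMBER_RE.fullmatch(cleaned):
        return None
    return float(cleaned.replace(",", ""))


def coerce_date(raw: Optional[str]) -> Optional[str]:
    """Step 4: normalize a date string to ISO-8601, or None if unparseable."""
    if raw is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


text = "Invoice #: 4021\nDate: May 14, 2026\nTotal: $1,284.50"
record = {
    "invoice_number": find_labeled_value(text, "Invoice #"),
    "issue_date": coerce_date(find_labeled_value(text, "Date")),
    "total_amount": coerce_number(find_labeled_value(text, "Total")),
    # Step 5: no "Due Date" label exists in the text, so this stays None.
    "due_date": coerce_date(find_labeled_value(text, "Due Date")),
}
```

Returning `None` from every helper on failure keeps the "missing means null" rule in one place, so callers never need a fallback value.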

## Output
A single JSON object matching the schema. Example for an invoice schema:
```json
{
  "invoice_number": "4021",
  "issue_date": "2026-05-14",
  "vendor": "Acme Tools LLC",
  "total_amount": 1284.50,
  "currency": "USD",
  "due_date": null,
  "line_items": [
    { "description": "Hex bolts", "qty": 200, "unit_price": 0.12 }
  ]
}
```
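
To show what "matching the schema" might mean for the invoice above, here is a minimal hand-rolled check. It covers only a small subset of validation (key presence, type, enum, null handling); a real pipeline would more likely use a full JSON Schema validator. The spec layout `INVOICE_SPEC`, the function `validate`, and the currency enum are assumptions made for this sketch.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field spec for the invoice example: Python type, required flag,
# and an optional set of allowed values.
INVOICE_SPEC: Dict[str, Dict[str, Any]] = {
    "invoice_number": {"type": str, "required": True},
    "issue_date": {"type": str, "required": True},
    "vendor": {"type": str, "required": True},
    "total_amount": {"type": (int, float), "required": True},
    "currency": {"type": str, "required": True, "enum": {"USD", "EUR", "GBP"}},
    "due_date": {"type": str, "required": False},
    "line_items": {"type": list, "required": False},
}


def validate(record: Dict[str, Any],
             spec: Dict[str, Dict[str, Any]]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a record checked against the spec."""
    errors: List[str] = []
    warnings: List[str] = []
    for name, rule in spec.items():
        if name not in record:
            errors.append(f"{name}: key is missing (use null, do not omit)")
            continue
        value = record[name]
        if value is None:
            # A null required field is allowed but flagged, as in step 6.
            if rule["required"]:
                warnings.append(f"{name}: required field is null")
            continue
        # bool is a subclass of int in Python; this spec has no boolean
        # fields, so any bool is treated as a type error.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{name}: unexpected type {type(value).__name__}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
    for name in sorted(set(record) - set(spec)):
        errors.append(f"{name}: not defined in the schema")
    return errors, warnings


invoice = {
    "invoice_number": "4021",
    "issue_date": "2026-05-14",
    "vendor": "Acme Tools LLC",
    "total_amount": 1284.50,
    "currency": "USD",
    "due_date": None,
    "line_items": [{"description": "Hex bolts", "qty": 200, "unit_price": 0.12}],
}
errors, warnings = validate(invoice, INVOICE_SPEC)
```

In strict mode a non-empty `errors` list would count as a failure; with `strict` set to false, the two lists could be returned alongside the object as `_errors` and `_warnings`.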

## Guardrails & notes
- Return `null` for any field not supported by the document. Never fabricate, infer from training data, or hallucinate plausible values — a wrong value is worse than a missing one.
- Output must validate against the schema before you return it. In strict mode, invalid output is a failure, not a result.
- Quote the source faithfully: do not paraphrase identifiers, amounts, or dates. Normalize format (e.g. dates to ISO-8601) but never alter meaning.
- Map to enums only when the match is unambiguous; otherwise set the field to `null` and flag it.
- Idempotent: same document + schema yields the same JSON. Deterministic parsing beats creative interpretation.
- For large documents, chunk by section but reconcile into one object; watch for duplicate fields across chunks.
- Treat extracted content as potentially sensitive: do not echo PII beyond the requested fields, and never log raw documents containing personal data. Never embed real secrets/API keys — use placeholders like `<YOUR_API_KEY>`.
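
The chunking note above can be made concrete with a small merge routine. This is one possible policy, stated as an assumption rather than as the skill's required behavior: a non-null value fills a null one, list fields are concatenated without duplicates, and two different non-null values for the same scalar field are treated as a conflict, which nulls the field and records a warning. The function name `reconcile` is hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple


def reconcile(chunk_results: List[Dict[str, Any]]) -> Tuple[Dict[str, Any], List[str]]:
    """Merge per-chunk extraction results into one object plus warnings."""
    merged: Dict[str, Any] = {}
    conflicted: Set[str] = set()
    warnings: List[str] = []
    for index, result in enumerate(chunk_results):
        for field, value in result.items():
            if field in conflicted:
                continue  # once in conflict, the field stays null
            current = merged.get(field)
            if value is None:
                merged.setdefault(field, None)
            elif current is None:
                # Copy lists so later merging does not mutate the input.
                merged[field] = list(value) if isinstance(value, list) else value
            elif isinstance(current, list) and isinstance(value, list):
                current.extend(item for item in value if item not in current)
            elif current != value:
                warnings.append(
                    f"{field}: conflicting values {current!r} and {value!r} "
                    f"(chunk {index}); set to null"
                )
                merged[field] = None
                conflicted.add(field)
    return merged, warnings


chunks = [
    {"invoice_number": "4021", "vendor": None, "total_amount": 1284.5},
    {"invoice_number": "4021", "vendor": "Acme Tools LLC", "total_amount": 1284.0},
]
merged, merge_warnings = reconcile(chunks)
```

Nulling a conflicted field follows the guardrail that a wrong value is worse than a missing one, and processing chunks in a fixed order keeps the result deterministic.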

## Example
Input document (text): `"RESUME — Jane Doe. Email: jane[at]example.com. Senior Engineer at Globex (2021-present). Skills: Python, Go."`
Schema: `{ name, email, current_title, current_employer, years_experience(int), skills(array) }`.

Output:
```json
{
  "name": "Jane Doe",
  "email": "jane[at]example.com",
  "current_title": "Senior Engineer",
  "current_employer": "Globex",
  "years_experience": null,
  "skills": ["Python", "Go"]
}
```
(`years_experience` is null because the document states a start year but no explicit total — it is not guessed.)
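
For illustration, the resume example can be reproduced with a few regular expressions. The patterns below are tailored to this one sample string and would be too brittle for real documents; the helper names `first_group` and `extract_resume` are hypothetical. The point of the sketch is the last field: no rule computes `years_experience` from the start year, so it remains `None`.

```python
import re
from typing import Any, Dict, Optional

DOC = (
    "RESUME — Jane Doe. Email: jane[at]example.com. "
    "Senior Engineer at Globex (2021-present). Skills: Python, Go."
)


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def extract_resume(text: str) -> Dict[str, Any]:
    # "<title> at <employer> (<year>-present)" within a single sentence.
    role = re.search(r"([^.]+?) at ([^.(]+?)\s*\(\d{4}-present\)", text)
    skills_raw = first_group(r"Skills:\s*([^.]+)", text)
    return {
        "name": first_group(r"RESUME\s*—\s*([^.]+)\.", text),
        # Copied verbatim, including the "[at]" obfuscation used in the source.
        "email": first_group(r"Email:\s*(\S+?)\.(?:\s|$)", text),
        "current_title": role.group(1).strip() if role else None,
        "current_employer": role.group(2).strip() if role else None,
        # Only a start year is stated; a total is never derived from it.
        "years_experience": None,
        "skills": [s.strip() for s in skills_raw.split(",")] if skills_raw else None,
    }


result = extract_resume(DOC)
```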

Use this skill

Install creates a private, read-only copy in your own registry. Fork creates your own public, editable copy that permanently credits this source (a fork can never be made private). Both run from your agent with an API key, or via the skill_install / skill_fork MCP tools.

Install (private copy):
```bash
curl -X POST https://agentprizm.com/api/v1/agent/marketplace/install \
  -H "Authorization: Bearer ap_your_key" \
  -H "Content-Type: application/json" \
  -d '{"sourceSkillId":"6a3d5f0dff1e38ac55db55e4"}'
```

Fork (public copy):
```bash
curl -X POST https://agentprizm.com/api/v1/agent/marketplace/fork \
  -H "Authorization: Bearer ap_your_key" \
  -H "Content-Type: application/json" \
  -d '{"sourceSkillId":"6a3d5f0dff1e38ac55db55e4"}'
```
