# SpeechKit v0.42 Agent Context

SpeechKit is a voice framework for products and automation hosts. It has three
strict modes:

- Dictation: speech-to-text only. No LLM rewriting, no utilities, no codewords.
- Assist: one-shot utility, codeword, or LLM result with optional TTS.
- Voice Agent: realtime audio-to-audio dialogue for fast follow-ups.

v0.42 is the SDK-surface release line: wake-word, Assist, Voice-Companion,
Event-Bus, and TTS building blocks are embeddable from `pkg/speechkit/...`.
Hands-Free is the activation and voice-output layer for Dictation, Assist, and
Voice Agent; it is not a fourth mode.

The latest GitHub Release is the canonical source for desktop installers.
`release/latest/windows/INSTALLER-MANIFEST.json` mirrors the current installer
metadata and download URLs; the `latest` GHCR tag tracks the current stable
server image.

## Start Here

- Human technical guide: https://speechkit.cc/getting-started/technical
- Human agent guide: https://speechkit.cc/getting-started/agents
- Markdown technical guide: https://speechkit.cc/getting-started/technical.md
- Markdown agent guide: https://speechkit.cc/getting-started/agents.md
- Server installer: https://speechkit.cc/install-server.sh
- Browser Docker Compose example: https://speechkit.cc/install-server/docker-compose.example.yml
- Browser config example: https://speechkit.cc/install-server/config.browser.example.toml
- OpenAPI: https://speechkit.cc/api/openapi.v1.yaml
- Voice Agent AsyncAPI: https://speechkit.cc/api/asyncapi.v1.yaml
- One-shot manifest schema: https://speechkit.cc/schemas/speechkit-one-shot-manifest.schema.json
- One-shot functional result schema: https://speechkit.cc/schemas/speechkit-one-shot-functional-result.schema.json

Fetchable one-shot prompt Markdown:

- https://speechkit.cc/getting-started/agents/tri-mode-web-demo.md
- https://speechkit.cc/getting-started/agents/voice-game-moderator.md
- https://speechkit.cc/getting-started/agents/android-memo-app.md
- https://speechkit.cc/getting-started/agents/go-framework-integration.md

## Simple Prompts

```text
Hi Codex, go to speechkit.cc and install the SpeechKit Server on this server.
```

```text
Hi Codex, read speechkit.cc/llms.txt, then install the SpeechKit Server with Docker Compose.
```

```text
Hi Codex, add SpeechKit as a Go framework dependency and use the documented Dictation, Assist, Voice Agent, and Hands-Free contracts. Prefer the go-assist-voice-companion, go-voice-agent-companion, or go-dictation-handsfree-ui scaffold when it fits, import only public pkg/speechkit/... components, and do not import internal/*.
```

```text
Hi Codex, add a SpeechKit Assist Voice Companion to this Go app. Import only pkg/speechkit/{companion,wakeword,assist,tts} plus pkg/speechkit for events, wire companion.NewHandsFree with TargetMode: companion.TargetAssist, keep microphone capture/playback host-owned, and do not import internal/* or the Windows client.
```

```text
Hi Codex, configure `speechkit-mcp` in docs mode and verify the SpeechKit API before writing integration code.
```

## Server Install

```sh
curl -fsSL https://speechkit.cc/install-server.sh | sh
```

The script writes a Docker Compose stack, `config.toml`, and `.env` into
`/opt/speechkit` by default. It generates a local bearer token when one is not
provided. The generated Compose stack binds to `127.0.0.1:8080` by default;
use `--public-bind` only behind a TLS reverse proxy. It pulls
`ghcr.io/kombifyio/speechkit-server:latest`, which tracks the latest stable
release.

For browser-facing Docker Compose, set `SPEECHKIT_PUBLIC_URL=http://localhost:8080`.
The returned Voice Agent `ws_url` uses that public base; otherwise a browser can
be handed a container-internal host that it cannot resolve.

## API Surfaces

Use the OpenAPI document for exact request and response shapes:
https://speechkit.cc/api/openapi.v1.yaml

Use the AsyncAPI document for the Voice Agent WebSocket contract:
https://speechkit.cc/api/asyncapi.v1.yaml

Browser WebSocket auth is subprotocol-first. Use the `ws_url` and
`ws_subprotocol` returned from `POST /v1/voiceagent/sessions`; do not put the
server bearer token into a browser WebSocket handshake. Tickets expire after 90
seconds by default. Expired tickets are not refreshed in v1, so create a new
session when a ticket has lapsed.
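
The split between backend and browser can be sketched in Go as follows. This is a minimal sketch, not the SDK client: it uses only the standard library, assumes an empty JSON object is an acceptable request body, and reads only the two response fields named above. `sessionTicket`, `parseSessionTicket`, and `createSession` are hypothetical names, and the stub server in `main` stands in for a real SpeechKit Server. Check the OpenAPI document for the exact request and response shapes.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// sessionTicket holds the two response fields named in this guide.
// Any other fields in the real response are ignored by this sketch.
type sessionTicket struct {
	WSURL         string `json:"ws_url"`
	WSSubprotocol string `json:"ws_subprotocol"`
}

// parseSessionTicket decodes a session response and rejects incomplete tickets.
func parseSessionTicket(body []byte) (sessionTicket, error) {
	var t sessionTicket
	if err := json.Unmarshal(body, &t); err != nil {
		return sessionTicket{}, fmt.Errorf("decode session response: %w", err)
	}
	if t.WSURL == "" || t.WSSubprotocol == "" {
		return sessionTicket{}, errors.New("session response is missing ws_url or ws_subprotocol")
	}
	return t, nil
}

// createSession runs on the backend, where the bearer token is safe to use.
// ASSUMPTION: an empty JSON object is accepted as the request body.
func createSession(ctx context.Context, baseURL, token string) (sessionTicket, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/v1/voiceagent/sessions", strings.NewReader("{}"))
	if err != nil {
		return sessionTicket{}, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return sessionTicket{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return sessionTicket{}, fmt.Errorf("create session: unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return sessionTicket{}, err
	}
	return parseSessionTicket(body)
}

func main() {
	// Local stub standing in for a SpeechKit Server so the sketch runs offline.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"ws_url":"ws://localhost:8080/example","ws_subprotocol":"example-ticket"}`)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	ticket, err := createSession(ctx, srv.URL, "replace-with-generated-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Hand only these two values to the browser, never the bearer token.
	fmt.Println(ticket.WSURL, ticket.WSSubprotocol)
}
```

The backend keeps the bearer token and forwards only `ws_url` and `ws_subprotocol` to the page. Because tickets are short-lived, call `createSession` again instead of caching an old ticket.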

Important v0.42 surfaces:

- `GET /v1/catalog/profiles`
- `GET /v1/catalog/contracts`
- `GET /v1/catalog/readiness`
- `GET /v1/config`
- `GET /v1/vocabulary/dictionary`
- `POST /v1/vocabulary/dictionary`
- `GET /v1/transcripts`
- `GET /v1/transcripts/{id}`
- `GET /v1/voiceagent/sessions/{id}/transcript`
- `GET /v1/voiceagent/sessions/{id}/summary`
- `POST /v1/tts/synthesize`
- `GET /v1/tts/voices`
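
A quick way to confirm that a server exposes these surfaces is to probe a few of the read-only endpoints. The sketch below is illustrative only: it uses the Go standard library, assumes the endpoints accept `Authorization: Bearer <token>`, and reports status codes without decoding bodies, since the response shapes live in the OpenAPI document. `endpointURL` and `probe` are hypothetical helpers, and the stub server in `main` replaces a real SpeechKit Server.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// endpointURL joins a base URL and an API path without doubling the slash.
func endpointURL(baseURL, path string) string {
	return strings.TrimRight(baseURL, "/") + "/" + strings.TrimLeft(path, "/")
}

// probe sends an authenticated GET and returns the HTTP status code.
// ASSUMPTION: the endpoint accepts a standard bearer Authorization header.
func probe(ctx context.Context, client *http.Client, baseURL, path, token string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpointURL(baseURL, path), nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Stub server so the sketch runs without a live SpeechKit Server.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer demo-token" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	paths := []string{
		"/v1/catalog/profiles",
		"/v1/catalog/contracts",
		"/v1/catalog/readiness",
		"/v1/config",
	}
	for _, path := range paths {
		status, err := probe(ctx, srv.Client(), srv.URL, path, "demo-token")
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		fmt.Println(path, status)
	}
}
```

Against a real server, replace `srv.URL` with the server base URL and `demo-token` with the generated bearer token.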

## MCP

Use `speechkit-mcp` when an agent should inspect docs, validate requests, or
manage a running SpeechKit Server.

Agent planning tools:

- `speechkit_install_plan`: read-only install steps for the stable server.
- `speechkit_self_check_plan`: ordered Docker, health, readiness, config, catalog, OpenAPI, and AsyncAPI probes.
- `speechkit_scaffold_templates`: list starter integration templates.
- `speechkit_scaffold_integration`: render a starter integration in memory for an agent to apply.

Docs-only local config:

```json
{
  "mcpServers": {
    "speechkit": {
      "command": "speechkit-mcp",
      "args": ["--mode=docs,test"]
    }
  }
}
```

## Go SDK Embedding

Use these packages when the host embeds SpeechKit directly instead of calling a
running server:

- `pkg/speechkit`: mode contracts, Runtime, events, provider catalog, and readiness.
- `pkg/speechkit/dictation`: strict STT-only embedded dictation.
- `pkg/speechkit/wakeword` and `pkg/speechkit/wakeword/sherpa`: wake-word detection contracts and sherpa adapter.
- `pkg/speechkit/companion`: `NewHandsFree` composer for wake events, target routing, host transcript requests, Assist, Voice Agent activation, optional TTS, and runtime events.
- `pkg/speechkit/tts`: ProviderKind-aware Router and Service for local/cloud/direct spoken output.
- `pkg/speechkit/assist`: Assist service, multi-turn skill context, codeword routing, TTS routing, and optional Genkit adapter.
- `pkg/speechkit/agentkit` and `pkg/speechkit/voiceagent/live`: embedded realtime Voice Agent hosts.
- `pkg/speechkit/client`: self-host server client.

Import the smallest public component that matches the job. A wake-only
satellite should import `wakeword`; a TTS-only app should import `tts`; a
server-connected Voice Agent should import `client`. Do not import `internal/*`
or the Windows client from a library integration.

Hands-Free target modes:

- `companion.TargetAssist`: Assist Voice Companion, one-shot utilities, smart-home commands, spoken answers.
- `companion.TargetVoiceAgent`: continuous dialogue, companion mode, game moderator, brainstorming.
- `companion.TargetDictationUIAssisted`: wake-triggered Dictation start/stop with a visible text target or explicit commit surface; no TTS.

Smoke examples:

```sh
go run ./examples/embed-companion
go run ./examples/embed-tts
go run ./examples/embed-event-bus
```

## Go AgentKit Harness

Use `pkg/speechkit/agentkit` when embedding SpeechKit Voice Agent behavior into
a Go host. It provides a session wrapper, tool registry, lifecycle hooks, and a
session-scoped memory interface on top of the Voice Agent runtime. Register
host-side tools with `agentkit.NewRegistry()` and pass the registry to
`agentkit.NewAgentSession(...)` before calling `Start`.

For a browser starter, use `speechkitctl init browser-dictation-react --output
./speechkit-demo` or ask `speechkit-mcp` for `speechkit_scaffold_integration`.
The MCP scaffold tool is read-only and returns file paths plus content for the
calling agent to review and apply.

## Voice Agent — single-prompt embedding (Server-Target)

For host applications that talk to a remote `speechkit-server` (recommended for
non-Windows targets, cloud deployments, and any multi-client scenario), use
`pkg/speechkit/client` instead of embedding the kernel. The reference is
`examples/voice-agent/game-instructor/` in this repo:

- `config.toml` seeds a Voice Agent persona, role, and sequence on the server
  (`game-instructor` / `game-moderator` / `game-flow-15min`). The sequence is
  capped at 30 turns and the server-side `voiceagent_idle_timeout_sec` is set to
  15 minutes (900 seconds); this is the working preset for a 15-minute
  moderated game.
- `main.go` shows the full duplex flow: `CreateVoiceAgentSession` →
  `DialVoiceAgent` → `SendStart` (with persona/role/sequence IDs) →
  `SendText`/`SendAudio` + `ReadMessage` loop → `SendStop` / deadline exit.

A coding agent asked to "build a voice agent into my app" should:

1. Generate a TOML preset for the use case (start from
   `examples/voice-agent/game-instructor/config.toml`), defining a persona, a
   role (system prompt + VAD policy + thinking level), and a sequence with
   step instructions + exit criteria.
2. Reuse `examples/voice-agent/game-instructor/main.go` as the embedder
   skeleton. Only the persona/role/sequence IDs and the audio I/O adapter
   need to change. Audio in: PCM 16 kHz S16LE mono via
   `VoiceAgentSession.SendAudio`. Audio out: PCM 24 kHz S16LE mono from
   `VoiceAgentMessage.Audio`.
3. Deploy with `speechkit-server --config <preset.toml>`, with
   `GOOGLE_AI_API_KEY` and a bearer token set in the environment. The idle
   timeout enforces the wall-clock cap.
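
The audio formats in step 2 determine how the host's audio I/O adapter sizes and packs buffers. The following is a minimal sketch of that arithmetic using only the Go standard library; `chunkBytes`, `encodeS16LE`, and `decodeS16LE` are hypothetical helper names, the 20 ms chunk length is an arbitrary example, and the sketch deliberately does not call `SendAudio`, whose exact signature should be taken from the reference example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	inputSampleRate  = 16000 // Hz, mono audio sent to the Voice Agent
	outputSampleRate = 24000 // Hz, mono audio received from the Voice Agent
	bytesPerSample   = 2     // S16LE: signed 16-bit little-endian
)

// chunkBytes returns the byte length of a mono S16LE chunk of the given duration.
func chunkBytes(sampleRate, millis int) int {
	return sampleRate * millis / 1000 * bytesPerSample
}

// encodeS16LE packs mono samples into little-endian bytes.
func encodeS16LE(samples []int16) []byte {
	out := make([]byte, len(samples)*bytesPerSample)
	for i, s := range samples {
		binary.LittleEndian.PutUint16(out[i*bytesPerSample:], uint16(s))
	}
	return out
}

// decodeS16LE unpacks little-endian bytes into samples.
// A trailing odd byte, if any, is ignored.
func decodeS16LE(data []byte) []int16 {
	samples := make([]int16, len(data)/bytesPerSample)
	for i := range samples {
		samples[i] = int16(binary.LittleEndian.Uint16(data[i*bytesPerSample:]))
	}
	return samples
}

func main() {
	fmt.Println(chunkBytes(inputSampleRate, 20))  // 640 bytes per 20 ms of input
	fmt.Println(chunkBytes(outputSampleRate, 20)) // 960 bytes per 20 ms of output

	packed := encodeS16LE([]int16{1, -2})
	fmt.Println(packed)              // [1 0 254 255]
	fmt.Println(decodeS16LE(packed)) // [1 -2]
}
```

If the capture device runs at a different rate or channel count, resample and downmix before encoding; input and output rates differ, so playback buffers cannot be reused for capture without resizing.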

## Release-Gated Native Android Memo Prompt

Create a real native Android project, not a browser harness. The project must
include Gradle settings, `app/build.gradle` or `app/build.gradle.kts`,
`app/src/main/AndroidManifest.xml`, and Kotlin or Java source.

The app records a memo, calls `/v1/dictation/transcribe`, saves the transcript
locally, and sends memo text to `/v1/assist/process` for cleanup or summary. It
must expose a settings/config path for `SPEECHKIT_SERVER_URL` and token, and set
up a fresh local SpeechKit Server in Docker Compose for verification.

It must also define a Gradle task named `verifySpeechKitLive` that writes
`speechkit-one-shot-functional-result.json` with:

- `app_kind=android`
- `app_transport=android`
- `manifest_file=speechkit-one-shot-manifest.json`
- `server_url_source=settings_screen` or `android_config`
- non-empty Dictation/Assist outputs
- `status=pass` on every checked mode
- `checked_via_app=true` under `modes.dictation` and `modes.assist`

In the functional result, the canonical key for the Voice Agent mode is
`modes.voiceagent`; do not emit `modes.voice_agent`. Extra evidence fields such
as `app_url_reachable` and `latency_seconds` are allowed by the published
schema.
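
A host-side check of the generated result file can catch these mistakes before a release gate does. The sketch below is written in Go for consistency with the SDK examples and is not a replacement for validating against the published schema. It assumes the top-level fields and the `modes.<name>.status` / `modes.<name>.checked_via_app` nesting described in the prompt above, and it does not check the Dictation/Assist output fields because their names are defined only in the schema. `functionalResult`, `modeResult`, and `checkAndroidResult` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// modeResult covers only the per-mode fields named in the prompt.
type modeResult struct {
	Status        string `json:"status"`
	CheckedViaApp bool   `json:"checked_via_app"`
}

// functionalResult covers only the top-level fields named in the prompt.
// Unknown fields are ignored by the decoder, so extra evidence is tolerated.
type functionalResult struct {
	AppKind         string                `json:"app_kind"`
	AppTransport    string                `json:"app_transport"`
	ManifestFile    string                `json:"manifest_file"`
	ServerURLSource string                `json:"server_url_source"`
	Modes           map[string]modeResult `json:"modes"`
}

// checkAndroidResult applies the rules stated in the Android memo prompt.
func checkAndroidResult(raw []byte) error {
	var r functionalResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return fmt.Errorf("decode result: %w", err)
	}
	if r.AppKind != "android" || r.AppTransport != "android" {
		return errors.New("app_kind and app_transport must both be android")
	}
	if r.ManifestFile != "speechkit-one-shot-manifest.json" {
		return errors.New("manifest_file must be speechkit-one-shot-manifest.json")
	}
	if r.ServerURLSource != "settings_screen" && r.ServerURLSource != "android_config" {
		return errors.New("server_url_source must be settings_screen or android_config")
	}
	if _, bad := r.Modes["voice_agent"]; bad {
		return errors.New("use modes.voiceagent, not modes.voice_agent")
	}
	for name, m := range r.Modes {
		if m.Status != "pass" {
			return fmt.Errorf("modes.%s: status is %q, want pass", name, m.Status)
		}
	}
	for _, name := range []string{"dictation", "assist"} {
		m, ok := r.Modes[name]
		if !ok || !m.CheckedViaApp {
			return fmt.Errorf("modes.%s must be present with checked_via_app=true", name)
		}
	}
	return nil
}

func main() {
	example := []byte(`{
		"app_kind": "android",
		"app_transport": "android",
		"manifest_file": "speechkit-one-shot-manifest.json",
		"server_url_source": "settings_screen",
		"modes": {
			"dictation": {"status": "pass", "checked_via_app": true},
			"assist": {"status": "pass", "checked_via_app": true}
		}
	}`)
	if err := checkAndroidResult(example); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("result is consistent with the prompt")
}
```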

`speechkit-mcp` management mode requires a running server and bearer token:

```json
{
  "mcpServers": {
    "speechkit": {
      "command": "speechkit-mcp",
      "args": ["--mode=docs,management,test"],
      "env": {
        "SPEECHKIT_SERVER_URL": "http://localhost:8080",
        "SPEECHKIT_TOKEN": "replace-with-generated-token"
      }
    }
  }
}
```