Multimodal Interaction Kit

Voice input via the Web Speech API, image analysis, and drag-and-drop file processing for your agent.
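The Web Speech API is browser-only and still prefixed in Chromium, so it is worth feature-detecting before enabling the voice recorder. A minimal sketch (the helper name is ours, not part of the kit); when detection fails, fall back to a server-side provider such as the 'whisper' or 'deepgram' options shown in the integration code below:

```javascript
// Feature-detect SpeechRecognition. Chromium exposes it as
// webkitSpeechRecognition; Node and some browsers have neither,
// in which case a server-side transcription provider is needed.
function speechRecognitionAvailable(globalObj = globalThis) {
  return 'SpeechRecognition' in globalObj || 'webkitSpeechRecognition' in globalObj
}
```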


Configuration

Supported Formats

Voice: WAV, WebM, OGG

Image: PNG, JPEG, WebP, GIF

File: PDF, CSV, DOCX, XLSX, TXT
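Rejecting unsupported files before they reach the processing pipeline gives faster feedback than waiting for a server error. A hypothetical helper (not part of agent-tools-kit) that classifies a dropped file against the lists above; the 'jpg' alias for JPEG is our assumption:

```javascript
// Map each input kind to its supported extensions, per the format table.
const SUPPORTED = {
  voice: ['wav', 'webm', 'ogg'],
  image: ['png', 'jpeg', 'jpg', 'webp', 'gif'],  // 'jpg' alias assumed
  file: ['pdf', 'csv', 'docx', 'xlsx', 'txt'],
}

// Returns 'voice' | 'image' | 'file', or null for unsupported files.
function classifyByExtension(filename) {
  const ext = filename.split('.').pop().toLowerCase()
  for (const [kind, exts] of Object.entries(SUPPORTED)) {
    if (exts.includes(ext)) return kind
  }
  return null  // reject before uploading or processing
}
```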

Integration Code

import { MultimodalInput, VoiceRecorder, ImageAnalyzer, FileProcessor } from 'agent-tools-kit/ui'

const multimodal = new MultimodalInput({
  voice: new VoiceRecorder({
    language: 'en-US',
    autoTranscribe: true,
    provider: 'web-speech-api',  // or 'whisper', 'deepgram'
  }),
  image: new ImageAnalyzer({
    autoAnalyze: true,
    capabilities: ['ocr', 'description', 'diagram-parsing'],
    maxSize: 10_000_000,  // 10MB
  }),
  file: new FileProcessor({
    formats: ['pdf', 'csv', 'docx', 'xlsx', 'txt'],
    extractText: true,
    chunkForContext: true,    // auto-chunk large docs
    maxChunkTokens: 2000,
  }),
})

// Use with your agent
agent.onInput(async (input) => {
  const processed = await multimodal.process(input)
  // processed.type: 'text' | 'voice' | 'image' | 'file'
  // processed.content: transcribed/extracted text
  // processed.metadata: { format, size, language, etc. }
  return agent.respond(processed.content, { context: processed.metadata })
})
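Since processed.type is a discriminated union, handlers can branch on it to format the content differently per modality. A sketch using a plain mock object; the field names follow the comments in the integration code above, not a verified library contract:

```javascript
// Route a processed input by modality. `processed` mirrors the shape
// documented in the integration code: { type, content, metadata }.
function describeInput(processed) {
  switch (processed.type) {
    case 'voice':
      return `Transcript (${processed.metadata.language}): ${processed.content}`
    case 'image':
    case 'file':
      return `Extracted from ${processed.metadata.format}: ${processed.content}`
    default:
      return processed.content  // plain text passes through unchanged
  }
}
```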