Multimodal Interaction Kit
Voice input via the Web Speech API, image analysis, and drag-and-drop file processing for your agent.
Configuration
Supported Formats
Voice: WAV, WebM, OGG
Image: PNG, JPEG, WebP, GIF
File: PDF, CSV, DOCX, XLSX, TXT
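The format table above implies a routing step from a dropped file to the processor that should handle it. A minimal sketch of that dispatch, assuming extension-based detection; `detectModality` and `MODALITY_BY_EXT` are illustrative names, not exports of agent-tools-kit:

```typescript
type Modality = 'voice' | 'image' | 'file'

// Lookup table built from the supported-format lists above.
const MODALITY_BY_EXT: Record<string, Modality> = {
  wav: 'voice', webm: 'voice', ogg: 'voice',
  png: 'image', jpeg: 'image', jpg: 'image', webp: 'image', gif: 'image',
  pdf: 'file', csv: 'file', docx: 'file', xlsx: 'file', txt: 'file',
}

function detectModality(filename: string): Modality | null {
  // Names without an extension cannot be routed.
  if (!filename.includes('.')) return null
  const ext = filename.split('.').pop()?.toLowerCase() ?? ''
  return MODALITY_BY_EXT[ext] ?? null
}
```

Unknown extensions return `null`, so callers can reject unsupported uploads before invoking a processor.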
Integration Code
import { MultimodalInput, VoiceRecorder, ImageAnalyzer, FileProcessor } from 'agent-tools-kit/ui'
const multimodal = new MultimodalInput({
  voice: new VoiceRecorder({
    language: 'en-US',
    autoTranscribe: true,
    provider: 'web-speech-api', // or 'whisper', 'deepgram'
  }),
  image: new ImageAnalyzer({
    autoAnalyze: true,
    capabilities: ['ocr', 'description', 'diagram-parsing'],
    maxSize: 10_000_000, // 10 MB
  }),
  file: new FileProcessor({
    formats: ['pdf', 'csv', 'docx', 'xlsx', 'txt'],
    extractText: true,
    chunkForContext: true, // auto-chunk large docs
    maxChunkTokens: 2000,
  }),
})
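The `chunkForContext` / `maxChunkTokens` options suggest behavior along the lines of the following sketch. The four-characters-per-token estimate and the standalone `chunkForContext` function are assumptions for illustration, not the kit's actual implementation:

```typescript
// Split extracted text into chunks of at most maxChunkTokens, breaking
// on paragraph boundaries where possible (assumption: ~4 chars/token).
function chunkForContext(text: string, maxChunkTokens = 2000): string[] {
  const CHARS_PER_TOKEN = 4 // rough heuristic for English prose
  const maxChars = maxChunkTokens * CHARS_PER_TOKEN
  const chunks: string[] = []
  let current = ''
  for (const para of text.split(/\n\n+/)) {
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current)
      current = ''
    }
    // A single paragraph longer than the whole budget is hard-split.
    let rest = para
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars))
      rest = rest.slice(maxChars)
    }
    current = current ? current + '\n\n' + rest : rest
  }
  if (current) chunks.push(current)
  return chunks
}
```

Each chunk then fits a `maxChunkTokens` context budget when passed to the agent.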
// Use with your agent
agent.onInput(async (input) => {
  const processed = await multimodal.process(input)
  // processed.type: 'text' | 'voice' | 'image' | 'file'
  // processed.content: transcribed/extracted text
  // processed.metadata: { format, size, language, etc. }
  return agent.respond(processed.content, { context: processed.metadata })
})
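The comments in the handler describe the shape of the processed input. A hedged sketch that pins that shape down, plus a small factory for exercising handlers without real transcription or OCR; `ProcessedInput` and `mockProcessed` are illustrative names, not kit exports:

```typescript
// Assumed shape of the object the handler above receives.
interface ProcessedInput {
  type: 'text' | 'voice' | 'image' | 'file'
  content: string // transcribed or extracted text
  metadata: { format: string; size: number; language?: string }
}

// Build a fake processed input for unit-testing agent handlers.
function mockProcessed(
  type: ProcessedInput['type'],
  content: string,
  format: string,
): ProcessedInput {
  return { type, content, metadata: { format, size: content.length } }
}
```

This keeps handler tests independent of the voice, image, and file processors.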