We Built Real-Time Voice AI: Here's How It Works

The Problem We Set Out to Solve

Chatbots are powerful — but they're still chatbots. They require the user to type, wait, read, and type again. For many industries — call centers, healthcare reception, government services, and retail — voice is the primary way people expect to interact. We decided to bridge that gap.

Today, we're pulling back the curtain on how we built Mugib's real-time Voice AI system: a full-stack feature that lets you deploy conversational voice agents powered by your own knowledge base, with quota controls, usage tracking, and multilingual support — all in production.

The Architecture: What Powers a Voice Agent

Our Voice AI stack is built on top of Vapi, a best-in-class real-time voice orchestration platform. But Vapi is just the phone call layer. What makes Mugib's voice agents intelligent is everything behind it:

RAG-powered responses: When a caller asks a question, our engine retrieves the most relevant chunks from your project's knowledge base and feeds them as context to the LLM — just like our chat agents.
Vapi Server Hook: We intercept every call event through our /vapi-server webhook. This is where we enforce quotas, log usage, and route calls to the right agent.
ElevenLabs TTS: Voice output is powered by ElevenLabs voices — natural, expressive, and available in Arabic and English.
Sub-second latency: By keeping the whole request on a single server-to-server path, we hold end-to-end response times under a second, fast enough to feel like a real conversation.
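
To make the retrieval step concrete, here is a minimal sketch of how a caller's question could be matched against knowledge-base chunks and turned into LLM context. It is an illustration under stated assumptions, not our production engine: word-count vectors stand in for a real embedding model, and the names retrieve, build_prompt, and KNOWLEDGE_BASE are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the caller's question for the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nCaller: {question}"


# Hypothetical knowledge base for a clinic reception agent.
KNOWLEDGE_BASE = [
    "Our clinic is open from 9am to 5pm, Sunday through Thursday.",
    "Appointments can be cancelled up to 24 hours in advance.",
    "We accept most major insurance providers.",
]

question = "What hours is the clinic open?"
top = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a voice setting the question would be the transcribed utterance, and the LLM's answer to the assembled prompt would be handed to the speech layer.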

"We didn't want to add a voice feature. We wanted to make voice a first-class citizen of our platform — with the same quota controls, analytics, and reliability as our chat system."

How We Enforce Voice Quotas

One challenge unique to voice is billing. Chat usage is counted in discrete messages, but voice usage is measured in minutes. Every plan now includes a voice_minutes_per_month field, and we track usage in a dedicated voice_usage_logs table at one-second granularity.
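
As a rough sketch of what second-level tracking can look like, the example below logs call durations in seconds and totals them per calendar month. Only the table name voice_usage_logs comes from our schema; the columns, the helper functions, and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names are hypothetical; only the table name is from the article.
conn.execute(
    """
    CREATE TABLE voice_usage_logs (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        call_id TEXT NOT NULL,
        started_at TEXT NOT NULL,          -- ISO 8601 UTC timestamp
        duration_seconds INTEGER NOT NULL  -- stored in whole seconds
    )
    """
)


def log_call(user_id: str, call_id: str, started_at: str, duration_seconds: int) -> None:
    """Record one finished call."""
    conn.execute(
        "INSERT INTO voice_usage_logs "
        "(user_id, call_id, started_at, duration_seconds) VALUES (?, ?, ?, ?)",
        (user_id, call_id, started_at, duration_seconds),
    )


def seconds_used(user_id: str, month: str) -> int:
    """Total seconds for one user in a calendar month given as 'YYYY-MM'."""
    row = conn.execute(
        "SELECT COALESCE(SUM(duration_seconds), 0) FROM voice_usage_logs "
        "WHERE user_id = ? AND substr(started_at, 1, 7) = ?",
        (user_id, month),
    ).fetchone()
    return row[0]


log_call("user-1", "call-a", "2025-03-02T10:15:00Z", 95)
log_call("user-1", "call-b", "2025-03-09T14:40:00Z", 130)
log_call("user-1", "call-c", "2025-02-27T09:00:00Z", 600)  # previous month

march_seconds = seconds_used("user-1", "2025-03")
march_minutes = march_seconds / 60
print(march_seconds, march_minutes)
```

Storing seconds and converting to minutes only at comparison time avoids the rounding drift that comes from rounding each call up to a whole minute.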

When a Vapi call reaches the in-progress status, our server runs a check: how many minutes has this user consumed this month? If they've exceeded their plan limit, we end the call immediately and return a quota-exceeded response. No surprises, no unexpected bills.
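
The check itself can be sketched in a few lines. The event and response shapes below are simplified stand-ins, not Vapi's actual payload format, and Plan, quota_exceeded, and handle_call_event are hypothetical names. This sketch also treats reaching the limit exactly as exhausted, which is a choice of the example rather than a statement of our production rule.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    voice_minutes_per_month: int


def quota_exceeded(used_seconds: int, plan: Plan) -> bool:
    """Compare usage in seconds against the plan limit converted to seconds."""
    return used_seconds >= plan.voice_minutes_per_month * 60


def handle_call_event(event: dict, used_seconds: int, plan: Plan) -> dict:
    """Decide what to do with one call event (simplified, assumed shape)."""
    # Only the in-progress status triggers the quota check.
    if event.get("status") != "in-progress":
        return {"action": "continue"}
    if quota_exceeded(used_seconds, plan):
        return {"action": "end_call", "reason": "quota-exceeded"}
    return {"action": "continue"}


starter = Plan(voice_minutes_per_month=60)

within = handle_call_event({"status": "in-progress"}, used_seconds=1800, plan=starter)
over = handle_call_event({"status": "in-progress"}, used_seconds=3700, plan=starter)
ringing = handle_call_event({"status": "ringing"}, used_seconds=3700, plan=starter)
print(within, over, ringing)
```

Keeping the decision in a pure function makes it easy to unit test separately from the webhook handler that receives the event and looks up the user's monthly usage.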

Building for Arabic-First

Most voice AI solutions treat English as the default and add Arabic later. We built Mugib's voice system Arabic-first. Our voice agents detect the caller's language automatically and respond naturally in Arabic or English, with Arabic phonetics handled by ElevenLabs' dedicated Arabic models. The portal UI is RTL-aware as well.
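
One simple way to picture language detection is a script-based heuristic over the transcribed utterance: count Arabic-script letters against Latin letters and pick the majority. This is only an illustrative sketch, not a description of how our agents or Vapi detect language, and the function names and voice identifiers are made up.

```python
def detect_language(transcript: str) -> str:
    """Return 'ar' if Arabic-script letters outnumber Latin letters, else 'en'."""
    # U+0600 to U+06FF is the main Unicode Arabic block.
    arabic = sum(1 for ch in transcript if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in transcript if ch.isascii() and ch.isalpha())
    return "ar" if arabic > latin else "en"


# Hypothetical voice identifiers, one per supported language.
VOICES = {"ar": "arabic-voice-placeholder", "en": "english-voice-placeholder"}


def pick_voice(transcript: str) -> str:
    """Choose the text-to-speech voice that matches the caller's language."""
    return VOICES[detect_language(transcript)]


print(detect_language("مرحبا، أريد حجز موعد"))
print(detect_language("Hello, I need an appointment"))
```

A heuristic like this only works once speech has been transcribed; detecting language directly from audio is a separate problem that the transcription layer normally handles.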

What You Get Today

✅ Real-time voice conversations with your AI agent
✅ Powered by your knowledge base (RAG)
✅ Arabic & English language support
✅ Per-plan voice minute quotas with enforcement
✅ Monthly usage tracking visible on your dashboard
✅ Create and manage voice agents from the portal

What's Next

We're working on inbound phone number provisioning — so your customers can call a real phone number and reach your Mugib voice agent directly. No widget, no embed, just a phone call.

We're also building voice analytics: call duration heatmaps, most-asked questions by voice, and sentiment tracking per call. Stay tuned.

Try Voice AI Free
