Project 3 – AWS Text-to-Speech Converter

1. Problem Statement

This project addresses the cost and latency of repeatedly synthesizing speech from dynamic text input. An S3-based caching layer in front of Amazon Polly serves repeated requests from storage instead of re-synthesizing them, which lowers latency, avoids duplicate Polly charges, and gives web applications a scalable foundation for on-demand voice generation. The solution focuses on four goals:

Implementing deterministic audio caching using hashed S3 keys
Eliminating duplicate Amazon Polly charges
Reducing audio retrieval latency with efficient S3 pre-signed URLs
Providing real-time feedback and direct downloadable voice output to users
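
The deterministic caching goal can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `cache_key`, the `audio/` prefix, and the choice of SHA-256 are assumptions. The idea is that identical text and voice always map to the same S3 key, so a repeated request can be detected before Polly is called.

```python
import hashlib


def cache_key(text: str, voice_id: str, prefix: str = "audio/") -> str:
    """Return a deterministic S3 key for a (text, voice) pair.

    The name, prefix, and hash algorithm are illustrative assumptions.
    """
    # Hash the voice together with the text so the same text spoken by a
    # different voice maps to a different cached object.
    payload = f"{voice_id}\n{text}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"{prefix}{voice_id}/{digest}.mp3"


if __name__ == "__main__":
    # The same input always yields the same key; a new voice yields a new key.
    print(cache_key("Hello world", "Joanna"))
    print(cache_key("Hello world", "Matthew"))
```

Grouping keys under a per-voice prefix is only a convenience for browsing the bucket; the hash alone is enough to make keys unique per input.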

2. Architecture Overview

This serverless system converts text into lifelike speech using managed AWS services. A frontend on sedky.net lets users enter text and select a voice. The request is routed through API Gateway to an AWS Lambda function, which first checks the S3 cache for existing audio. On a cache hit, it returns a pre-signed URL for direct download. On a miss, it invokes Amazon Polly to synthesize the speech, stores the new audio in S3, and then returns the pre-signed URL. Amazon SNS publishes a notification for each newly generated audio file, and S3 lifecycle rules expire cached files after 7 days.
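
The cache-first flow described above might look roughly like the sketch below. It assumes `s3`, `polly`, and `sns` are boto3 clients created elsewhere (for example with `boto3.client("s3")`) and passed in; the function name, parameter names, notification subject, and the one-hour URL lifetime are hypothetical choices, not taken from the project.

```python
def get_or_create_audio(s3, polly, sns, bucket, key, text, voice_id,
                        topic_arn=None, expires_in=3600):
    """Return (presigned_url, cache_hit) using a cache-first lookup.

    s3, polly, and sns are assumed to be boto3 clients supplied by the caller.
    """
    cache_hit = True
    try:
        # Cheap existence check: succeeds only if the object is already cached.
        s3.head_object(Bucket=bucket, Key=key)
    except Exception as err:
        # botocore's ClientError carries the error code in err.response.
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code not in ("404", "NoSuchKey", "NotFound"):
            raise  # a real failure (permissions, throttling), not a miss
        cache_hit = False
        # Cache miss: synthesize, read the audio stream as bytes, and store it.
        result = polly.synthesize_speech(
            Text=text, VoiceId=voice_id, OutputFormat="mp3"
        )
        audio = result["AudioStream"].read()
        s3.put_object(Bucket=bucket, Key=key, Body=audio,
                      ContentType="audio/mpeg")
        if topic_arn:
            # Notify subscribers only when a new file was generated.
            sns.publish(TopicArn=topic_arn, Subject="New audio generated",
                        Message=key)
    # Hit or miss, hand back a time-limited download link.
    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
    return url, cache_hit
```

Passing the clients in as arguments keeps the function easy to exercise with stand-in objects, and lets the Lambda runtime reuse clients created once outside the handler.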

Overview of the serverless architecture for the Text-to-Speech converter.

3. Live Demo

Try the converter firsthand: enter your text, select a voice, and generate speech directly below, or open the demo in a new tab.

Interactive demo of the AWS Text-to-Speech converter.

4. Step-by-Step Implementation

Lambda execution role with permissions for Polly, S3, and SNS.

Execution role permissions summary view.

Trust policy for Lambda execution role.
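
For readers without access to the screenshots, the two policy documents could resemble the sketch below. The trust policy is the standard one that lets the Lambda service assume a role. The permissions policy is an assumption about what a role limited to Polly, S3, and SNS would need; the function name, bucket name, and topic ARN are placeholders, and `s3:ListBucket` is an addition of this sketch rather than something shown in the source.

```python
import json


def build_policies(bucket: str, topic_arn: str):
    """Return (trust_policy, permissions_policy) as plain dicts.

    Hypothetical helper; resource names are supplied by the caller.
    """
    # Trust policy: allows the Lambda service to assume the execution role.
    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    # Permissions policy: only the actions the cache-first flow uses.
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "polly:SynthesizeSpeech",
             "Resource": "*"},
            {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
             "Resource": f"arn:aws:s3:::{bucket}/*"},
            # Assumption: with s3:ListBucket, a missing key is reported as
            # 404 instead of 403, which makes cache misses distinguishable.
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"arn:aws:s3:::{bucket}"},
            {"Effect": "Allow", "Action": "sns:Publish",
             "Resource": topic_arn},
        ],
    }
    return trust, permissions


if __name__ == "__main__":
    trust, perms = build_policies(
        "example-tts-cache", "arn:aws:sns:us-east-1:123456789012:example-topic"
    )
    print(json.dumps(trust, indent=2))
    print(json.dumps(perms, indent=2))
```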

Lambda function configuration summary.

Lambda function code handling audio synthesis and caching.

Successful Lambda invocation result.

API Gateway HTTP endpoint created for POST requests.

Successful API Gateway invocation returning pre-signed URL.
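
The request and response shapes exchanged with API Gateway could be handled along these lines. This sketch assumes a Lambda proxy integration and a JSON body with `text` and `voice` fields; those field names, the default voice, the helper names, and the `https://sedky.net` origin (the source names only the domain) are all assumptions.

```python
import base64
import json


def parse_request(event: dict):
    """Extract (text, voice) from an API Gateway proxy event.

    Raises ValueError on malformed JSON or missing text.
    """
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    data = json.loads(body)  # JSONDecodeError is a subclass of ValueError
    text = (data.get("text") or "").strip()
    voice = data.get("voice") or "Joanna"  # assumed default voice
    if not text:
        raise ValueError("text is required")
    return text, voice


def make_response(status: int, payload: dict,
                  origin: str = "https://sedky.net") -> dict:
    """Build a proxy-integration response with a CORS header."""
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": origin,
        },
        "body": json.dumps(payload),
    }
```

A handler built on these helpers would catch `ValueError` and return `make_response(400, {"error": ...})`, and return `make_response(200, {"url": ...})` with the pre-signed URL on success.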

S3 bucket created for storing cached MP3 audio files.

5. Business Impact

Prevents redundant Polly usage via deterministic caching
Improves UX with fast audio load times
Scales well for voice automation use cases (IVR, accessibility, podcasting)

6. Real-World Scenarios

This serverless application could support accessibility features in internal tools, multilingual support bots for customer service, or automated QA for localized audio playback during testing workflows.

7. Cost & Security Considerations

Polly usage billed per character, optimized with caching
S3 auto-expiration reduces storage cost
Pre-signed URLs provide time-limited access; no public buckets are used
IAM roles scoped to minimum privileges
No credentials are hardcoded; all configuration is passed via environment variables
API Gateway secured with throttling and CORS headers
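
The S3 auto-expiration mentioned in this list can be expressed as a lifecycle configuration. The sketch below is one way to build and apply a 7-day expiration rule; the rule ID, the `audio/` prefix, and the function names are assumptions, and `s3` is assumed to be a boto3 S3 client passed in by the caller.

```python
def expiration_rule(prefix: str = "audio/", days: int = 7) -> dict:
    """Build a lifecycle configuration that expires cached objects."""
    return {
        "Rules": [{
            "ID": f"expire-cache-after-{days}d",  # hypothetical rule name
            "Filter": {"Prefix": prefix},          # only cached audio objects
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }


def apply_expiration(s3, bucket: str, prefix: str = "audio/",
                     days: int = 7) -> None:
    """Apply the rule to a bucket; s3 is assumed to be a boto3 S3 client."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expiration_rule(prefix, days),
    )
```

Note that this call replaces the bucket's existing lifecycle configuration, so any other rules on the bucket would need to be included in the same request.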

8. AWS Well-Architected Framework Alignment

Pillar	Implementation Notes
Security	IAM scoped to least privilege, S3 private, pre-signed access only
Reliability	Handles Polly or S3 errors gracefully
Performance Efficiency	Cache-first design using pre-signed URLs
Cost Optimization	Avoids duplicate synthesis charges, auto-expiry cache
Operational Excellence	Monitored with SNS, CloudWatch, structured logging
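
As an illustration of the structured logging noted under Operational Excellence, a Lambda function can print one JSON object per line; output written to stdout in Lambda is collected by CloudWatch Logs. The helper name `log_event` and the field names are hypothetical.

```python
import json
import time


def log_event(event: str, **fields) -> str:
    """Print one JSON log line and return it.

    Field names are illustrative; CloudWatch Logs can filter on JSON keys.
    """
    record = {"event": event, "ts": round(time.time(), 3), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


if __name__ == "__main__":
    log_event("cache_hit", key="audio/Joanna/example.mp3")
    log_event("synthesis_failed", error="ThrottlingException")
```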

9. Challenges & Resolutions

403 Forbidden: Resolved with correct Lambda resource policies
S3 access errors: Fixed the IAM permissions and trust relationship on the Lambda execution role
Audio decoding: Polly response stream handled as binary buffer
Duplicate generation: Prevented with hash-based cache key

10. GitHub Repository

📘 View Full GitHub Documentation

PROJECT DOCUMENTATION

AWS Text-to-Speech Converter with Caching and Pre-Signed URLs