Spire Light Projects

Completed Projects

Swedish Speech Data Collection

350 participants

Goal: Train advanced speech recognition for Scandinavian markets. Methodology: Diverse age groups (18-65) recorded in quiet home environments, covering 5 major Swedish dialects. Output: 350 hours of high-fidelity, validated audio, aligned with phonetic transcription for robust model training.

Italian Speech Data Collection

150 participants

Goal: Enhance voice assistant accuracy for Italian speakers. Methodology: Remote collection platform, ensuring phonetic balance and gender distribution. Output: 150 hours of clean conversational speech, segmented and pre-labeled for emotion and intent recognition.

Dutch Speech Data Collection

150 participants

Goal: Improve natural language understanding for European-Dutch AI. Methodology: Scripted and unscripted dialogues from native speakers. Output: 150 hours of emotionally diverse speech, phonetically rich, and transcribed to strict guidelines.

Norwegian Speech Data Collection

150 participants

Goal: Expand voice search capabilities for regional Norwegian dialects. Methodology: Mobile app-based collection, capturing natural speech patterns from various regions. Output: 150 hours of dialect-specific audio, categorized by region and speaker characteristics, aiding localized AI.

Finnish Speech Data Collection

150 participants

Goal: Develop robust voice biometrics for secure authentication systems. Methodology: Controlled environment recordings, focusing on unique phonetic characteristics of Finnish. Output: 150 hours of biometric-grade audio, with speaker ID tagging and noise profiles, enabling secure voice verification.

Danish Speech Data Collection

150 participants

Goal: Optimize text-to-speech synthesis for educational applications. Methodology: Professional studio recordings of diverse reading styles (narrative, instructional). Output: 150 hours of crystal-clear speech, precisely segmented for phoneme-level alignment, enhancing synthetic voice naturalness.

Russian Speech Data Collection

220 participants

Goal: Build comprehensive speech models for large-scale customer service chatbots. Methodology: Crowdsourced collection targeting diverse age groups and social backgrounds. Output: 220 hours of varied Russian speech, including domain-specific phrases, transcribed for intent and entity recognition.

Croatian Speech Data Collection

150 participants

Goal: Enable real-time voice translation services. Methodology: Bilingual participants recorded speaking Croatian and English in conversational settings. Output: 150 hours of parallel Croatian-English speech, time-aligned for translation model development.

Lithuanian Speech Data Collection

100 participants

Goal: Support development of new voice technologies for Baltic markets. Methodology: In-person recordings in various acoustic environments (office, street, home). Output: 100 hours of natural Lithuanian speech, diverse in accent and intonation, used for foundational ASR models.

Multilingual Storytelling Speech Data Collection

240 participants

Goal: Create culturally rich datasets for emotional AI research. Methodology: Participants recorded narrating personal stories in their native language (12 languages total). Output: 240 hours of emotionally expressive multilingual speech, tagged for sentiment, tone, and key narrative elements.

Serbian Speech Data Collection

150 participants

Goal: Improve natural language processing for Balkan languages. Methodology: Collection focused on colloquial expressions and common idioms. Output: 150 hours of contextually rich Serbian speech, accompanied by detailed semantic and syntactic annotations.

Italian Speech Data Collection

120 participants

Goal: Enhance multimodal AI for interactive entertainment. Methodology: Participants engaged in interactive scenarios, capturing speech, gestures, and facial expressions. Output: 120 hours of synchronized audio-visual data, enabling advanced multimodal AI development.

Turkish Speech Data Collection

100 participants

Goal: Develop specialized speech recognition for medical dictation. Methodology: Recordings from medical professionals, covering a wide range of medical terminology and sentence structures. Output: 100 hours of high-accuracy Turkish medical speech, transcribed with domain-specific lexicon for healthcare AI.

Swedish Audio Transcription

650 audio hours

Project: High-accuracy transcription of call center recordings for sentiment analysis. Methodology: Human transcription with 3-pass verification, utilizing custom glossaries for industry-specific terms. Output: 650 hours of timestamped, speaker-diarized Swedish transcripts (99.5% accuracy), delivered in 4 weeks.
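The source does not say how the 99.5% accuracy figure was measured. One common convention, assumed here purely for illustration, is word accuracy: 1 minus the word error rate (WER) between a verified reference transcript and the delivered transcript. The function names below are hypothetical, and the sample strings are placeholders rather than project data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition of accuracy: 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)


# Placeholder example: one wrong word in a 200-word reference.
reference = " ".join(["ord"] * 200)
hypothesis = " ".join(["ord"] * 199 + ["fel"])
print(f"{word_accuracy(reference, hypothesis):.3f}")  # 0.995
```

Under this assumed definition, 99.5% accuracy corresponds to roughly one word-level error per 200 reference words.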

Mandarin Chinese Audio Transcription

600 audio hours

Project: Academic research transcription of complex discussions. Methodology: Expert linguists transcribed challenging audio with multiple speakers, focusing on subtle nuances and interjections. Output: 600 hours of verbatim Mandarin Chinese transcripts, with non-speech event tagging, used for linguistic study.

Italian Audio Transcription

250 audio hours

Project: Transcription of legal proceedings for case review. Methodology: Certified legal transcribers ensured strict adherence to legal formatting and terminology. Output: 250 hours of highly sensitive Italian legal audio transcribed, with speaker identification and custom redaction protocols, crucial for litigation.

German Audio Transcription

250 audio hours

Project: Transcription of focus group interviews for market research. Methodology: Semantic transcription capturing key insights and sentiment from diverse participants. Output: 250 hours of thematic German transcripts, categorized by topic and sentiment, facilitating rapid market insights.

Croatian Audio Transcription

220 audio hours

Project: Transcription of historical oral narratives for archival. Methodology: Specialized team working with varying audio quality and historical dialects. Output: 220 hours of meticulously transcribed Croatian oral histories, with metadata tagging for historical research and preservation.

Romanian Audio Transcription

120 audio hours

Project: E-learning content transcription for online courses. Methodology: Clean verbatim transcription, ensuring clarity and accuracy for educational purposes. Output: 120 hours of Romanian educational audio transcribed, with time-stamping for synchronized video subtitles.

Greek Audio Transcription

265 audio hours

Project: Transcription of journalistic interviews for news analysis. Methodology: Rapid turnaround transcription, allowing quick processing of breaking news interviews. Output: 265 hours of Greek interview transcripts, delivered within 24-hour windows, enabling swift content production.

Latvian Audio Transcription

300 audio hours

Project: Transcription of parliamentary debates for public record. Methodology: High-volume, ongoing transcription with strict naming conventions for speakers and topics. Output: 300 hours of public domain Latvian legislative audio transcribed, ensuring transparency and accessibility.

Finnish Audio Transcription

150 audio hours

Project: Transcription of medical consultations for AI diagnosis support. Methodology: Specialized medical transcribers ensuring HIPAA compliance and accurate medical terminology. Output: 150 hours of secure Finnish medical transcripts, aiding AI in preliminary diagnosis and treatment planning.

Spanish Audio Transcription

270 audio hours

Project: Multidialectal podcast transcription for global audience. Methodology: Transcribers specialized in various Latin American and Castilian Spanish dialects. Output: 270 hours of inclusive Spanish podcast transcripts, enabling broader reach and accessibility.

Serbian Audio Transcription

300 audio hours

Project: Transcription of call center interactions for quality assurance. Methodology: Large-scale, rapid transcription to monitor agent performance and customer satisfaction. Output: 300 hours of sentiment-tagged Serbian call transcripts, identifying key customer pain points and service improvements.

Icelandic Podcast Transcription

170 audio hours

Project: Transcription of niche cultural podcasts for accessibility. Methodology: Linguistically sensitive transcription capturing unique cultural references and idioms. Output: 170 hours of accurate Icelandic podcast transcripts, making content accessible to deaf and hard-of-hearing audiences.

Street View Data Collection — Spain

Field data capture

Project: Comprehensive geospatial data collection for updated mapping services across major Spanish cities. Methodology: Deployment of mobile data capture units traversing over 10,000 km, capturing high-resolution street-level imagery and associated sensor data (GPS, LiDAR). Output: Over 500,000 unique panoramic images, precisely geo-tagged, used for autonomous vehicle navigation and urban planning.

Street View Data Collection — France

Field data capture

Project: Pedestrian-level data collection for tourist navigation applications in historic districts of Paris and Lyon. Methodology: Backpack-mounted mobile mapping systems captured imagery and 3D point clouds in narrow, inaccessible areas. Output: 250,000 detailed pedestrian street views, with 3D model overlays, enhancing immersive virtual tours and augmented reality navigation.

Street View Data Collection — Greece

Field data capture

Project: Coastal and island data capture for environmental monitoring and tourism development. Methodology: Specialized marine vessels equipped with panoramic cameras and sonar technology navigated complex coastlines. Output: 300,000 unique maritime street views, providing visual data for ecological studies and marine navigation aids.

Street View Data Collection — Italy

Field data capture

Project: Comprehensive road network data capture for logistics and infrastructure planning across rural Italian regions. Methodology: Vehicle-mounted panoramic cameras and LiDAR sensors covered 8,000 km of diverse road conditions. Output: 400,000 geo-referenced road segment images, with detailed infrastructure attributes (road signs, lane markings), crucial for logistics optimization and autonomous delivery routes.

Image Annotation for Object Detection

25,000 images

Project: Training data for agricultural robotics to identify crop health and pests. Methodology: Bounding box annotation of 25,000 satellite images, tagging specific crop types, disease indicators, and insect infestations. Output: 75,000 precise bounding box annotations (3 per image), delivered in COCO format, achieving 98% inter-annotator agreement.
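For readers unfamiliar with COCO format, the sketch below assembles a minimal COCO-style detection file with three boxes on one image, matching the stated average of 3 annotations per image. The category names, file name, and coordinates are hypothetical placeholders, not project data; only the field layout (`images`, `annotations`, `categories`, and `bbox` as `[x, y, width, height]`) follows the COCO convention.

```python
import json


def make_coco_dataset(images, boxes_by_file, category_names):
    """Assemble a COCO-style detection dict.

    images: list of (file_name, width, height)
    boxes_by_file: dict mapping file_name -> list of (category, x, y, w, h)
    """
    category_ids = {name: i for i, name in enumerate(category_names, start=1)}
    dataset = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in category_ids.items()],
    }
    next_annotation_id = 1
    for image_id, (file_name, width, height) in enumerate(images, start=1):
        dataset["images"].append(
            {"id": image_id, "file_name": file_name, "width": width, "height": height}
        )
        for category, x, y, w, h in boxes_by_file.get(file_name, []):
            dataset["annotations"].append({
                "id": next_annotation_id,
                "image_id": image_id,
                "category_id": category_ids[category],
                "bbox": [x, y, w, h],  # COCO order: top-left x, top-left y, width, height
                "area": w * h,
                "iscrowd": 0,
            })
            next_annotation_id += 1
    return dataset


# Hypothetical categories and coordinates, for illustration only.
coco = make_coco_dataset(
    images=[("field_0001.png", 1024, 1024)],
    boxes_by_file={
        "field_0001.png": [
            ("crop_type", 10, 20, 200, 150),
            ("disease_indicator", 300, 310, 64, 48),
            ("insect_infestation", 500, 80, 32, 32),
        ]
    },
    category_names=["crop_type", "disease_indicator", "insect_infestation"],
)
print(json.dumps(coco["annotations"][0]))
```

At 3 boxes per image, 25,000 images yield the 75,000 annotations quoted above.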

Video Annotation for Action Recognition

9,500 short videos

Project: Developing AI models for sports analytics to recognize athlete actions. Methodology: Polygon and keypoint annotation across 9,500 short video clips (10-30 seconds each), identifying specific movements like ‘shooting’, ‘passing’, ‘dribbling’. Output: Over 100,000 action segment annotations, with pose estimation keypoints, for advanced athlete performance analysis.

High-Volume Semantic Segmentation & Bounding Box Annotation

250,000 photos annotated in 3 weeks

Project: Rapidly building datasets for autonomous driving perception. Methodology: Hybrid approach using AI-assisted pre-labeling followed by human verification for semantic segmentation of roads, vehicles, pedestrians, and bounding boxes for objects. Output: 250,000 high-resolution images fully annotated (pixel-perfect segmentation + bounding boxes), delivered at an accelerated pace, critical for quick model iteration.

Rapid Keypoint Detection & Attribute Tagging Sprint

100,000 photos annotated in 1 week

Project: Urgent need for human pose estimation and attribute tagging for retail analytics. Methodology: Highly specialized annotator team working in shifts, focusing on speed and accuracy for keypoint detection (e.g., 17 points per person) and attributes (gender, clothing type). Output: 100,000 images with precise keypoint annotations and attribute tags, providing real-time data for customer behavior analysis and store layout optimization.
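The source does not name the keypoint schema used. As a rough illustration, the sketch below assumes the widely used COCO person layout, which also has 17 keypoints, each stored as an `(x, y, visibility)` triple in one flat list. The `attributes` field and its values are hypothetical additions for the tagging step and are not part of the COCO convention.

```python
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def encode_person(points, attributes):
    """Build one person record.

    points: dict mapping keypoint name -> (x, y, visibility), where visibility is
            0 (not labeled), 1 (labeled but occluded), or 2 (labeled and visible).
    attributes: hypothetical free-form tags, e.g. clothing type.
    """
    unknown = set(points) - set(COCO_KEYPOINT_NAMES)
    if unknown:
        raise ValueError(f"unknown keypoint names: {sorted(unknown)}")
    flat = []
    labeled = 0
    for name in COCO_KEYPOINT_NAMES:
        x, y, visibility = points.get(name, (0, 0, 0))  # unlabeled points are zeros
        flat.extend([x, y, visibility])
        if visibility > 0:
            labeled += 1
    return {"keypoints": flat, "num_keypoints": labeled, "attributes": attributes}


# Placeholder coordinates, for illustration only.
person = encode_person(
    points={"nose": (120, 80, 2), "left_shoulder": (100, 140, 1)},
    attributes={"clothing_type": "jacket"},
)
print(len(person["keypoints"]), person["num_keypoints"])  # 51 2
```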

Multilingual Sentiment Analysis Text Data

500,000 reviews across 8 languages

Project: Building a robust dataset for AI-powered customer feedback analysis across diverse linguistic contexts. Methodology: Collection and annotation of 500,000 customer reviews from various sources (social media, product review sites) in 8 target languages. Annotators labeled sentiment (positive, negative, neutral) and identified key aspects. Output: 500,000 text entries with granular sentiment labels, enabling the client to train sophisticated NLP models for global market insights.

Legal Document Entity Extraction (NER)

10,000 legal contracts

Project: Automating contract review processes for a legal tech firm. Methodology: Expert legal annotators performed Named Entity Recognition (NER) on 10,000 complex legal contracts, identifying clauses, parties, dates, obligations, and jurisdictions. Output: 10,000 richly annotated legal documents, delivered in JSON format, significantly reducing manual review time and improving legal compliance auditing.
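The delivered JSON schema is not described in the source. A common way to store NER annotations, assumed here, is a list of character-offset spans over the document text. The field names (`start`, `end`, `label`, `text`), the label set, and the sample sentence are all hypothetical.

```python
import json


def annotate(text, labeled_phrases):
    """Turn (phrase, label) pairs into character-offset entity spans.

    Uses the first occurrence of each phrase; a real annotation tool would
    record the exact span the annotator selected.
    """
    entities = []
    for phrase, label in labeled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append(
            {"start": start, "end": start + len(phrase), "label": label, "text": phrase}
        )
    entities.sort(key=lambda entity: entity["start"])
    return {"text": text, "entities": entities}


# Invented sample sentence, for illustration only.
contract_text = (
    "This Agreement is made on 1 March 2021 between Acme Ltd and Beta GmbH "
    "under the laws of Spain."
)
record = annotate(
    contract_text,
    [
        ("Acme Ltd", "PARTY"),
        ("Beta GmbH", "PARTY"),
        ("1 March 2021", "DATE"),
        ("Spain", "JURISDICTION"),
    ],
)
print(json.dumps(record, indent=2))
```

Storing offsets alongside the surface text lets a consumer verify each span with `text[start:end]` before training on it.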

Medical Text De-identification

5,000 clinical notes

Project: Preparing patient clinical notes for AI research while ensuring privacy compliance (HIPAA). Methodology: Specialized annotators meticulously identified and de-identified Personally Identifiable Information (PII) within 5,000 medical records and clinical notes, including names, dates, locations, and other sensitive data. Output: 5,000 fully de-identified, compliant clinical text documents, enabling safe and ethical use of medical data for machine learning in healthcare.

User-Generated Video Content Moderation

15,000 hours of video

Project: Ensuring platform safety and policy adherence for a major social media client. Methodology: Trained human moderators reviewed 15,000 hours of user-generated video content for violations including hate speech, graphic violence, misinformation, and copyright infringement. Real-time queues maintained rapid detection. Output: Identification and classification of policy-violating content, leading to a safer platform environment and reduced legal exposure for the client.

Brand Safety Image Moderation

1,000,000 images

Project: Protecting brand reputation and advertising integrity for a global ad tech company. Methodology: High-volume human review of 1,000,000 images for sensitive content (e.g., nudity, violence, illegal activities) to prevent ad misplacement and maintain brand safety. AI tools were leveraged for initial filtering, followed by human verification. Output: A sanitized image inventory, ensuring ads are placed only alongside brand-safe content, improving advertising effectiveness and client trust.

Lexicon Development for Chatbots

5 languages, 20,000 terms per language

Project: Enhancing the conversational AI capabilities of a multinational e-commerce platform. Methodology: Creation of domain-specific lexicons (vocabulary lists) for customer service chatbots in 5 new market languages. Linguists identified common customer queries, product terms, and industry jargon, then built comprehensive lists with synonyms and intent classifications. Output: 20,000 unique, validated linguistic terms per language, significantly boosting chatbot understanding and response accuracy.

Machine Translation Post-Editing (MTPE)

500,000 words across 10 language pairs

Project: Achieving human-quality translations at scale for a global content provider. Methodology: Professional linguists specializing in post-editing reviewed and corrected 500,000 words of machine-translated content (marketing materials, user manuals) across 10 language pairs. The work focused on fluency, accuracy, and cultural appropriateness. Output: High-quality, polished translations delivered rapidly, leveraging MT efficiency while ensuring linguistic integrity and brand voice consistency.

3D Point Cloud Annotation for Robotics

10,000 LiDAR scans

Project: Training autonomous mobile robots for warehouse navigation and object manipulation. Methodology: Annotation of 10,000 LiDAR point cloud scans, identifying and segmenting various objects (pallets, shelves, machinery, forklifts) in 3D space. Annotators used cuboid bounding boxes and semantic segmentation for precise object localization. Output: 10,000 richly annotated 3D point cloud datasets, enabling robots to accurately perceive and interact with their environment.

Audio Event Detection & Classification

2,000 hours of environmental audio

Project: Developing smart city acoustic monitoring systems. Methodology: Annotation of 2,000 hours of real-world environmental audio, detecting and classifying specific sound events (e.g., sirens, glass breaking, dog barking, human speech, car horns). Each event occurrence was precisely time-stamped. Output: A large-scale dataset of audio events, enabling AI to monitor urban soundscapes for safety, traffic analysis, and noise pollution assessment.

Handwritten Text Recognition (HTR) Annotation

50,000 historical documents

Project: Digitizing and making searchable large archives of historical handwritten documents. Methodology: Specialized paleography annotators segmented and transcribed handwritten text from 50,000 diverse historical documents (e.g., letters, ledgers, manuscripts). The material presented challenging variations in handwriting style and ink. Output: 50,000 accurately transcribed, searchable handwritten document images, preserving cultural heritage and enabling new historical research.

Satellite Image Feature Extraction

1,000,000 km² land area

Project: Monitoring land use and environmental changes for a climate research institute. Methodology: Annotation of high-resolution satellite imagery covering 1,000,000 km² of diverse land areas (forests, urban, agricultural, water bodies). Annotators identified and segmented various geographical features and infrastructure. Output: Geospatial dataset with detailed land cover classifications and feature boundaries, crucial for climate modeling, urban planning, and environmental impact assessments.

Drone Imagery Semantic Segmentation

20,000 aerial images

Project: Automating construction site progress monitoring and safety compliance. Methodology: Pixel-perfect semantic segmentation of 20,000 drone-captured aerial images of construction sites. Annotators identified materials, equipment, safety zones, and active work areas. Output: 20,000 segmented aerial images, providing precise data for construction project management, resource allocation, and real-time safety analysis, enhancing efficiency and reducing risks.