
XProtect Video Summarization: Webinar recording and Q&A

March 12, 2026

Watch our XProtect Video Summarization webinar to learn how this VLM plugin analyzes video activity, generates concise scene summaries and helps operators quickly understand and search video events. Explore answers to questions submitted during the webinar.

Pricing and licensing

Q: How is XProtect Video Summarization priced? 

A: XProtect Video Summarization uses a pay-per-query pricing model. Each query costs $0.15 USD, regardless of the length of the video being analyzed. 

There is no separate licensing requirement for the plugin. It can be downloaded and used freely, and customers are only billed for the queries they run. The plugin is distributed through the standard Milestone partner channel.  

Q: What counts as a query and how is billing handled? 

A: Each time a user submits a prompt to analyze a video clip, it counts as a single query. 

Queries are billed monthly, with no prepayment required. The cost is based on the number of queries submitted rather than video duration or data volume. 

Q: Does video length affect the cost of a query? 

A: No. Pricing is not tied to video length. A query costs $0.15 USD regardless of clip duration. The only technical limitation is that uploaded video files must be 100 MB or smaller. 
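Since pricing is a flat rate per query, estimating a monthly bill is simple arithmetic. The sketch below illustrates this; the query volume in the example is hypothetical, only the $0.15 rate comes from the pricing above.

```python
def monthly_cost(queries_per_day: int, days: int = 30, price_per_query: float = 0.15) -> float:
    """Estimate a monthly bill under the flat pay-per-query model ($0.15 per query)."""
    return queries_per_day * days * price_per_query

# Example: 40 queries per day over a 30-day month.
print(f"${monthly_cost(40):.2f}")  # prints $180.00
```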

Deployment and infrastructure

Q: Where is video processed when using XProtect Video Summarization? 

A: Currently, video summarization is processed in the cloud using Vision Language Model (VLM) services hosted on Milestone Hafnia cloud instances in the US or EU. 

When a user submits a query, the selected video clip is securely transmitted to the cloud service for analysis, and the generated summary is returned to the Smart Client. 

The system running the gateway service must have internet connectivity in order to transmit video clips to the cloud service for analysis. 

Milestone is actively developing on-prem and air-gapped deployment options for organizations that require fully local processing.  
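Because the gateway must reach the cloud service, a basic outbound connectivity check is a reasonable pre-flight step. This is a generic sketch; the hostname is a placeholder, not the actual Milestone Hafnia endpoint.

```python
import socket

def gateway_can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if the gateway host can open an outbound connection.

    The host is a placeholder; substitute your actual cloud service endpoint.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, timeouts and refused connections
        return False
```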

Q: Does video summarization require additional servers or infrastructure? 

A: No additional servers are required for the current cloud-based deployment. 

Because the Vision Language Model runs in the cloud, existing XProtect system components manage the workflow while the heavy processing takes place remotely. The main infrastructure consideration is sufficient network bandwidth to transfer selected video clips for analysis.  

Q: What server resources are required to run video summarization? 

A: For the current cloud deployment model, no additional CPU or GPU resources are required on the system since the processing occurs in the cloud. If on-prem deployments become available in the future, local server hardware will be required to run the Vision Language Model. 

How the video summarization workflow works

Q: What happens when you run a video summarization query? 

A: When a user submits a prompt for a selected video clip, the Vision Language Model (VLM) analyzes the video and generates a description of what is happening in the scene. 

The generated response can be stored as a bookmark description in XProtect, allowing operators to search for that activity later using plain-text search. 

The VLM requires a prompt and does not generate summaries automatically without user input. 
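As a sketch of that flow, the snippet below models the prompt-to-bookmark pipeline. The function names and the placeholder VLM output are illustrative only; they are not the actual XProtect or MIP SDK API.

```python
# Illustrative sketch: prompt -> VLM summary -> bookmark -> plain-text search.
def summarize_clip(camera: str, prompt: str) -> str:
    # Stand-in for the cloud VLM call; returns a canned description here.
    return f"{camera}: white van stops at the gate, driver exits"

def store_bookmark(bookmarks: list[dict], camera: str, description: str) -> None:
    # Storing the summary as a bookmark description makes it searchable later.
    bookmarks.append({"camera": camera, "description": description})

def search_bookmarks(bookmarks: list[dict], term: str) -> list[dict]:
    # Plain-text search over stored bookmark descriptions.
    return [b for b in bookmarks if term.lower() in b["description"].lower()]

bookmarks: list[dict] = []
store_bookmark(bookmarks, "Gate-01", summarize_clip("Gate-01", "Describe vehicle activity"))
print(search_bookmarks(bookmarks, "van"))
```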

Q: When are video summaries generated and how can they be used for reporting? 

A: Video summarization runs on selected video clips, allowing operators to analyze specific moments or incidents. The process can be triggered manually by an operator or automatically through events, enabling summaries to be generated when relevant activity occurs. The VLM can also run at intervals to create short summaries over time, which can later be combined into larger reports if needed. 
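Combining interval summaries into a larger report could look like the minimal sketch below; the report format and timestamps are illustrative.

```python
def combine_summaries(summaries: list[str]) -> str:
    """Join short interval summaries into one report body (format is illustrative)."""
    return "\n".join(f"- {s}" for s in summaries)

report = combine_summaries(["09:00 quiet lobby", "09:15 delivery at loading dock"])
```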

Integration, automation and triggers

Q: What events can trigger a video summarization query? 

A: Video summarization can be triggered by events within XProtect. This includes motion analytics events, rule system triggers, events generated by traditional computer vision models and alarms from external analytics integrated with the system. 

For example, an event detection model can trigger the Vision Language Model (VLM) to generate a contextual summary of activity captured around that event.  
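A minimal event-to-query dispatcher might look like this. The event names are assumptions for illustration, not actual XProtect event identifiers.

```python
# Hypothetical event identifiers; the real names come from the XProtect rule system.
TRIGGER_EVENTS = {"MotionStarted", "AnalyticsAlarm", "ExternalAlarm"}

def on_event(event_type: str, camera: str, queue: list[tuple[str, str]]) -> bool:
    """Queue a summarization prompt when a triggering event arrives."""
    if event_type in TRIGGER_EVENTS:
        queue.append((camera, "Summarize activity around this event"))
        return True
    return False

queue: list[tuple[str, str]] = []
on_event("MotionStarted", "Lobby-02", queue)  # queued
on_event("Heartbeat", "Lobby-02", queue)      # ignored, not a trigger event
```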

Q: Can video summarization results be integrated with external systems? 

A: Yes. The system can send webhooks containing the generated data, allowing integrations with external alarms, monitoring platforms or other systems and enabling automated workflows based on summarization results. 
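A webhook integration would deliver the generated summary as a JSON body. The field names below are assumptions for illustration, not a documented schema.

```python
import json

def build_webhook_payload(camera: str, summary: str, event_id: str) -> bytes:
    """Build a JSON webhook body; field names are illustrative, not a documented schema."""
    return json.dumps({
        "source": "XProtect Video Summarization",
        "camera": camera,
        "eventId": event_id,
        "summary": summary,
    }).encode("utf-8")

# The payload would then be POSTed to the receiving system's webhook URL.
payload = build_webhook_payload("Gate-01", "white van stops at the gate", "evt-001")
```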

Q: Can the Vision Language Model be accessed through an API? 

A: Yes. The Vision Language Model (VLM) includes a self-serve API, allowing developers to integrate video summarization capabilities into custom workflows and applications. 
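A call to such an API might be shaped like the sketch below. The URL, JSON fields and auth header are assumptions, not documented values; consult the actual API reference before integrating.

```python
# Hypothetical request shape for a self-serve summarization API.
import base64
import json
import urllib.request

def build_summarization_request(api_url: str, token: str, prompt: str,
                                clip: bytes) -> urllib.request.Request:
    """Assemble a POST request carrying a prompt and a base64-encoded clip."""
    body = json.dumps({
        "prompt": prompt,
        "video": base64.b64encode(clip).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

req = build_summarization_request("https://example.invalid/v1/summarize",
                                  "API_TOKEN", "Describe the scene", b"\x00\x01")
```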

AI model and technology

Q: What is a Vision Language Model (VLM)? 

A: VLM stands for Vision Language Model. It is a type of AI model that analyzes visual content and generates natural language descriptions of what is happening in a video scene. 

Unlike traditional analytics that detect specific predefined events, a VLM evaluates the visual context of a scene, identifying objects, activities and relationships within the video in order to generate descriptive summaries. This means it is not limited to simple motion detection and can interpret a broader range of activity captured in the footage. 

Q: Will the VLM produce consistent results, and how are hallucinations addressed? 

A: If the same scene is analyzed multiple times, the results will generally be very similar, though the wording of the generated description may vary slightly. 

Hallucinations are reduced by grounding the model’s responses in actual video frames and detected objects, rather than relying solely on language generation. As with other AI systems, results should be interpreted in the context of the underlying video evidence.  

Q: Can the VLM analyze live video streams? 

A: Yes, the VLM can run on real-time video streams. However, it is typically more practical to trigger summarization based on events, since most video streams contain long periods without meaningful activity. 

AI models and future development

Q: Will other AI models besides NVIDIA be supported in XProtect? 

A: Not at this time. Milestone is currently fine-tuning NVIDIA Cosmos models for use with this capability. 

Q: Where is training data stored for AI models? 

A: Data sourcing follows Milestone’s License-to-Data framework, and storage depends on the specific data agreements associated with the model.  

Q: What other AI initiatives is Milestone exploring? 

A: Milestone continues to explore new ways to apply AI across its video technology portfolio. This includes innovations such as BriefCam AI Search, along with other initiatives designed to help organizations extract more value from video data through intelligent analysis and automation. 

Q: Where does Arcules fit in this AI ecosystem? 

A: Arcules remains Milestone’s cloud-native VSaaS platform, designed to deliver scalable video management and analytics in the cloud. While XProtect Video Summarization is currently implemented within the XProtect ecosystem, Arcules continues to evolve as Milestone’s cloud-first platform for video management and AI-powered services. 

Prompts and customization

Q: Can users create their own prompts? 

A: Yes. The plugin includes a library of predefined prompts, and users can also create custom prompts to analyze video clips in different ways. 

Q: Can administrators push prompts to Smart Clients? 

A: Yes. Prompts are managed through a local gateway service, which allows them to be distributed to Smart Clients. 

Video limits and camera compatibility

Q: Will the plugin work with older IP cameras? 

A: Yes. The plugin works with any camera that can produce a valid video file format, so it can operate with older IP cameras as long as the video can be exported or accessed in a supported format. 

Q: Are there limits on video size or the number of cameras that can be analyzed? 

A: There is no limit on video length, but uploaded video files must be 100 MB or smaller. 

The Vision Language Model operates as a web service, allowing multiple queries to run in parallel. In practice, the primary constraint is network bandwidth rather than the number of cameras in the system. 
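Given the 100 MB ceiling, a pre-upload size check is a sensible guard. The limit below comes from the answer above; the helper itself is just a sketch.

```python
import os

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # 100 MB upload limit for video files

def clip_within_limit(path: str) -> bool:
    """Check an exported clip against the 100 MB limit before submitting a query."""
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES
```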

Industry use cases

Q: How can video summarization support school and campus environments? 

A: Milestone is actively developing models for this domain, though specific school safety examples were not demonstrated during the webinar. 

The current VLM can help summarize incidents and provide contextual reporting around detected events in environments such as school districts or campuses. 

For example, natural language queries could potentially be used to identify specific activities captured in video. This requires metadata to be extracted first; the VLM can generate that metadata and make it searchable.  

Q: Can the Vision Language Model support airport security scenarios? 

A: Airport datasets are a priority in the development roadmap, though current models are primarily optimized for traffic-related use cases. As the technology evolves, similar approaches could support additional operational environments where video activity needs to be interpreted and summarized.  

Q: Can video summarization help investigate incidents such as gunshot detection events? 

A: Yes. For example, a gunshot detection system could trigger the Vision Language Model to summarize activity around the event, such as identifying vehicles leaving the area.

More broadly, video summarization can help investigators quickly review activity surrounding incidents across different environments, supporting forensic analysis and post-event investigation. 

Privacy and anonymization

Q: Can anonymized data be de-anonymized for law enforcement? 

A: No. The anonymization process protects the video, and the original data cannot be restored. 

Q: What happens to video after inference? 

A: Video submitted for analysis is processed for inference and destroyed immediately once inference is complete. 

Language support

Q: Is video summarization available in languages other than English? 

A: Currently, video summarization is available in English only, with additional EU languages planned for future support. 

Have more questions?

We'd love to hear from you! Please book a demo with our team, and we'll tailor it to your needs. 
