TL;DR
ML-Sharp is an open-source tool that converts a single 2D image into a 3D Gaussian Splat. When applied frame-by-frame to video, it can be used to create experimental 4D Gaussian Splatting pipelines. This article explains how that works, how it differs from KIRI Engine’s multi-view 3DGS, and how to prepare video data for this workflow.
Why ML-Sharp Matters
Gaussian Splatting has become one of the most important representations in modern 3D capture, offering high-quality, view-dependent rendering without the overhead of traditional meshes.
ML-Sharp, an open-source project released by Apple, offers a remarkably simple entry point into this world:
Input: a single 2D image
Output: a 3D Gaussian Splat (.ply)
No CUDA, no paid services, no multi-camera capture
SHARP monocular view synthesis results showing how a single 2D image is converted into a 3D Gaussian Splat and rendered from new camera viewpoints, illustrating single-image 3D Gaussian representation.
This makes ML-Sharp especially useful for experimentation, research, and creative exploration, particularly when extended to video.
What ML-Sharp Actually Produces
ML-Sharp does not output a mesh, point cloud, or voxel grid.
Instead, it produces:
A Gaussian Splat (.ply)
Each splat encodes position, scale, orientation, and color
The result behaves like a navigable “3D photo”
The following clip shows how ML-Sharp infers a 3D Gaussian field from a single image and renders a new camera view, compared directly against the original photo, as tested by the KIRI Engine team:
These splats can be viewed and explored immediately in compatible Gaussian Splat viewers such as SuperSplat.
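If you want to peek inside one of these files yourself, the short Python sketch below uses the plyfile package to list what a splat .ply contains. The exact property layout ML-Sharp writes (positions, scales, rotations, colors) is not assumed here; the script simply reports whatever fields are present, and the file path is a placeholder.

```python
# Inspect a Gaussian Splat .ply (illustrative sketch, not an official ML-Sharp tool).
# pip install plyfile
from plyfile import PlyData

splat = PlyData.read("output.ply")   # path to the .ply you exported (placeholder)
vertices = splat["vertex"]            # each record corresponds to one Gaussian

print(f"Number of Gaussians: {len(vertices.data)}")
print("Per-Gaussian properties:")
for prop in vertices.properties:      # e.g. position, scale, rotation, color fields
    print(f"  {prop.name} ({prop.val_dtype})")
```

Listing the properties this way is a quick sanity check before loading the splat into a viewer such as SuperSplat.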
How to Run ML-Sharp
If you would like to try ML-Sharp yourself, you don’t need to piece together instructions from GitHub threads or scattered posts.
Our friends at Radiance Fields have already published a clear, step-by-step walkthrough that covers:
Installing ML-Sharp
Running it locally on Apple Silicon
Processing single images
Viewing the resulting Gaussian Splats
Apple ML-Sharp tutorial showing how a single 2D image is converted into a 3D Gaussian Splat using the Sharp model.
👉 Follow the Radiance Fields ML-Sharp tutorial here
We link to this guide so readers who want to run ML-Sharp locally have a clear, reliable reference, while this article stays focused on how ML-Sharp fits into modern 3D and 4D Gaussian Splatting workflows.
ML-Sharp 3DGS vs KIRI Engine 3DGS
Although both ML-Sharp and KIRI Engine output Gaussian Splats, they are fundamentally different in how those splats are created.
ML-Sharp operates on a single monocular image.
All depth and structure are inferred by AI, based on learned visual cues such as perspective, shading, and occlusion.
KIRI Engine’s 3D Gaussian Splatting is built from real multi-view geometry:
Dozens or hundreds of images
Solved camera poses
Optional LiDAR depth
Photogrammetric and multi-view optimization
The following clip shows a real-world 3D Gaussian Splat created by a KIRI Engine user, demonstrating how multi-view 3DGS can be deployed in physical exhibition spaces.
This means every Gaussian in a KIRI Engine 3DGS is constrained by real physical measurements, not just neural inference.
This comparison highlights why monocular Gaussian splatting is visually impressive but cannot match the geometric stability of true multi-view 3DGS:
Technical comparison between ML-Sharp’s single-image Gaussian inference and KIRI Engine’s multi-view 3D Gaussian Splatting.
ML-Sharp answers:
“What might this scene look like in 3D?”
KIRI Engine answers:
“Where is everything in 3D space?”
The following clip demonstrates KIRI Engine’s end-to-end 3D Gaussian Splatting pipeline, from real-world mobile capture to live 3DGS preview. Unlike monocular AI inference, this 3D Gaussian Splat is reconstructed from real multi-view camera data, giving it correct scale, stable geometry, and consistent parallax.
This distinction becomes even more important when we introduce time.
From 3D to 4D: Using ML-Sharp on Video
ML-Sharp becomes especially powerful when applied to video using a frame-based pipeline:
Split a video into frames
Run ML-Sharp on each frame → infer a 3D Gaussian field
Align and stack those per-frame splats over time
The following clip shows how ML-Sharp converts a single video into a frame-based 4D Gaussian Splat by inferring a 3D Gaussian field for each frame and stacking them over time.
This transforms flat video into a temporal sequence of 3D representations, an experimental form of 4D Gaussian Splatting (3D + time).
Before running ML-Sharp, the critical first step is preparing your video as images.
How to Split a Video into Frames
There are three practical ways to prepare video input for ML-Sharp.
Using Video Editing Software
Tools like Final Cut Pro, Premiere Pro, or DaVinci Resolve allow you to:
Import a video
Export it as an image sequence
Choose a frame rate (e.g., 24 fps, 30 fps)
Exporting a video as an image sequence, which is the first step before running ML-Sharp for frame-based 4D Gaussian Splatting.
This is the most precise and professional option.
Using Online Video-to-Frame Tools
Many web services like ezgif let you:
Upload a video
Download a ZIP of extracted frames
No installation required
Ezgif video to image sequence tool used to extract frames from a video for ML-Sharp and 4D Gaussian Splatting.
This is ideal for quick tests and lightweight experiments.
Using Your Phone (Burst Mode)
You can bypass video entirely:
Shoot a burst of photos
Move slightly between shots
Each photo becomes a “frame”
This creates a pseudo-video sequence that works surprisingly well for frame-based 4DGS experiments.
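If you prefer to script the frame-extraction step rather than use an editor or a web tool, here is a minimal Python sketch using OpenCV. The input path, output folder, and frame step are placeholders to adapt to your own footage.

```python
# Extract frames from a video for ML-Sharp input (illustrative sketch).
# pip install opencv-python
import cv2
from pathlib import Path

video_path = "input.mp4"        # your source video (placeholder)
out_dir = Path("frames")        # where the image sequence will be written
frame_step = 1                  # keep every frame; raise this to thin the sequence

out_dir.mkdir(exist_ok=True)
cap = cv2.VideoCapture(video_path)

index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:                  # end of video
        break
    if index % frame_step == 0:
        cv2.imwrite(str(out_dir / f"frame_{saved:05d}.png"), frame)
        saved += 1
    index += 1

cap.release()
print(f"Wrote {saved} frames to {out_dir}/")
```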
Running ML-Sharp on Video Frames (4DGS Core Pipeline)
Once a video has been split into frames, each frame can be processed independently by ML-Sharp.
At a high level, the 4D Gaussian Splatting workflow consists of two steps:
Run ML-Sharp (open-source) on each frame → infer a 3D Gaussian field
Align and stack those per-frame 3DGS results over time
The following clip demonstrates how frame-by-frame 3D Gaussian Splats are organized, aligned, and stacked into a time-aware 4D Gaussian Splat:
The first step turns a flat image sequence into a set of time-indexed 3D Gaussian snapshots.
The second step brings them into a shared temporal structure, transforming isolated 3D splats into a coherent 4D representation (3D + time).
This is the simplest way to extend single-image Gaussian inference into video-based 4DGS.
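A minimal sketch of this two-step loop is shown below. The run_mlsharp() helper is a placeholder for however you invoke ML-Sharp locally (see the Radiance Fields guide for the actual setup); the sketch simply processes each frame in order and pairs every output splat with its frame index so the per-frame .ply files can be stacked in time order.

```python
# Frame-based 4DGS sketch: run ML-Sharp per frame, then stack the results by time index.
import subprocess
from pathlib import Path

frames_dir = Path("frames")     # image sequence from the previous step
splats_dir = Path("splats")     # one .ply per frame will be written here
splats_dir.mkdir(exist_ok=True)

def run_mlsharp(image_path: Path, output_path: Path) -> None:
    # Placeholder: replace with your actual ML-Sharp command or Python API call.
    subprocess.run(["mlsharp", str(image_path), "-o", str(output_path)], check=True)

# Step 1: infer a 3D Gaussian field for every frame.
timeline = []
for i, frame in enumerate(sorted(frames_dir.glob("*.png"))):
    splat_path = splats_dir / f"splat_{i:05d}.ply"
    run_mlsharp(frame, splat_path)
    timeline.append((i, splat_path))          # time-indexed 3D snapshot

# Step 2: "stack" the per-frame splats into a time-ordered sequence.
# A full 4DGS pipeline would also align the splats into a shared coordinate frame.
for t, splat_path in timeline:
    print(f"t={t}: {splat_path}")
```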
The result is a full 4D Gaussian Splat: a time-aware 3D representation reconstructed purely from a single monocular video.
When to Use ML-Sharp vs When to Use KIRI Engine
Both tools work with Gaussian Splats, but they are designed for very different goals.
Use ML-Sharp for:
Rapid experimentation
Working with legacy photos or videos
Lightweight 4DGS prototyping
Learning how Gaussian Splats behave over time
Use KIRI Engine when you need:
True multi-view 3D accuracy
Stable geometry and occlusion
Production-ready assets for XR, games, VFX, and digital twins
A simple rule of thumb
If you ask: “What might this look like in 3D?” → ML-Sharp
If you ask: “Where is everything in 3D space?” → KIRI Engine
Final Thoughts & A Small Preview
ML-Sharp shows how far monocular AI inference has progressed—but also clearly reveals its limits.
As a free, open-source tool, it is ideal for:
Learning Gaussian Splat fundamentals
Rapid experimentation
Exploring 4D Gaussian Splatting from video
And as a comparison point, it helps explain why true multi-view 3DGS and LiDAR-based capture—like those in KIRI Engine—are still essential for any serious 3D production pipeline.
And here’s the small preview 👀
We’ve been experimenting internally with ML-Sharp on sequential frames, including filtering and segmentation, and the early results look very encouraging.