Talk
Intermediate
First Talk

On-Device ML: Runtimes, Challenges and Implementation

Approved

Background


Everyday devices like mobile phones, smart-home devices and laptops have features that require executing ML models, such as 'Hey Google' (or 'Hey Siri') hotword detection, portrait mode in the camera or text completion in the keyboard. These models run on the device's own hardware to minimize response time and keep the user's data private.

On-device machine learning refers to techniques that enable ML models to run locally on the user's device, i.e. without communicating with a hosted service across the Internet. Models executing on-device respond faster, owing to the absence of network calls, and give users more confidence when sharing their personal data.

With the recent surge in LLM-based tools, there have been numerous developments in both hardware and software to improve the on-device ML ecosystem. Hardware manufacturers are including specialized accelerators such as NPUs and integrated GPUs that are optimized for matrix multiplication. Open-source developers are building powerful runtimes like llama.cpp, vLLM and onnxruntime that support a wide range of models on different architectures and accelerators. A pure C/C++ runtime like llama.cpp is also easy to integrate into managed languages like Kotlin and Swift, thus making its way into Android and iOS/macOS apps.

On-device ML is largely constrained by the limited memory available on most consumer devices. A smaller memory capacity can only fit smaller models, which may produce inferior outputs compared to their larger counterparts. Moreover, writing platform-specific code that performs input preprocessing (like converting text input to tokens) or output post-processing (like upsampling a predicted audio signal) for a model can be challenging. As runtimes need to be leaner for on-device deployment, they trade operator availability for a reduced file size and memory footprint. Converting complex models to these runtime-specific formats thus becomes difficult.
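To make the preprocessing point concrete, here is a much-simplified greedy longest-match tokenizer in C++. Real models use BPE or SentencePiece, which are considerably more involved; the vocabulary here is invented for illustration. The point is that even this trivial version is logic the developer must reimplement and test natively once Python's tokenizer libraries are unavailable.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match tokenizer: at each position, consume the longest
// substring present in the vocabulary; emit unk_id for unknown bytes.
std::vector<int> tokenize(const std::string& text,
                          const std::unordered_map<std::string, int>& vocab,
                          int unk_id = 0) {
    std::vector<int> ids;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t len = text.size() - pos;
        // Shrink the candidate window until a vocabulary entry matches.
        for (; len > 0; --len) {
            auto it = vocab.find(text.substr(pos, len));
            if (it != vocab.end()) { ids.push_back(it->second); break; }
        }
        if (len == 0) { ids.push_back(unk_id); len = 1; } // unknown byte
        pos += len;
    }
    return ids;
}
```

With a toy vocabulary such as `{"hel": 1, "lo": 2, "hello": 3}`, the input `"hello"` yields the single token `3` rather than `1, 2`, because the longest match wins; getting exactly this behavior to agree with the model's training-time tokenizer is where the platform-specific effort goes.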


Plan

For this talk, I would like to discuss what on-device ML is, its perks and caveats, the available tools/runtimes, and the challenges a developer might face when deploying an ML model outside Python's typical 'garden of libraries'.

We start by discussing what on-device ML is, some examples, and its advantages to end-users. Next, we'll discuss how deploying models becomes difficult as soon as we leave the 'Python' world behind, highlighting the real constraints a developer might face when running models locally. Brief discussions of different runtimes (like LiteRT, ExecuTorch, CoreML) and of how native languages (like C/C++/Rust) interoperate with managed languages (like Kotlin/Swift) are also planned.

  • Learn what on-device ML is and its perks for users

  • Understand how deployment of ML models is challenging, both from hardware and software perspectives

  • Glance through different on-device ML runtimes and note their strengths and weaknesses

  • Have a lean understanding of how native codebases (C/C++/Rust) interact with languages like Kotlin/Swift

Knowledge Commons (Open Hardware, Open Science, Open Data etc.)
