We are looking for an LLM/ML Infrastructure Engineer experienced with Rust/C++ and CUDA for a remote position.
Our client is building a decentralized AI infrastructure focused on running and serving ML models directly on user-owned hardware (on-prem / edge environments).
A core component of the product is a proprietary “capsule” runtime for deploying and running ML models. Currently, some components rely on popular open-source solutions (e.g., llama.cpp), but the strategic goal is to replace community-driven components with in-house ML infrastructure to gain complete control over performance, optimization, and long-term evolution.
In parallel, the company is developing:
- its own network for generating high-quality, domain-specific datasets,
- fine-tuned compact models for specialized use cases,
- a research track focused on ranking, aggregation, accuracy improvements, and latency reduction.
The primary target audience is B2B IT companies.
The long-term product vision is to move beyond generic code generation and focus on high-performance, hardware-aware, and efficiency-optimized code generation.
ML Direction
1. Applied ML Track (Primary focus for this role)
- Development of ML inference infrastructure
- Building and evolving proprietary runtime capsules
- Porting and implementing ML algorithms on a custom architecture
- Low-level performance optimization across hardware platforms
2. Research Track
- ML research with published papers
- Improvements in answer quality and inference efficiency
- Experiments with aggregation, ranking, and latency reduction
👉 This position is primarily focused on the applied ML / engineering track.
Role
This is a strongly engineering-oriented ML role focused on inference, performance, and systems-level implementation rather than model experimentation.
📌 Approximately 90% of the work is hands-on coding and optimization.
You will
- Implement ML algorithms from research papers into production-ready code
- Port existing ML inference algorithms to the company’s proprietary architecture
- Develop and optimize the inference engine
- Optimize performance, memory usage, and latency
- Integrate and adapt open-source ML solutions (LLaMA, VLMs, llama.cpp, etc.)
- Contribute to the foundational architecture of the ML platform
Key Responsibilities
Inference Infrastructure Development:
○ Design and implementation of a cross-platform engine for ML model inference
○ Development of low-level components in Rust and C++ with a focus on maximum performance
○ Creation and integration of APIs for interaction with the inference engine
Performance Optimization:
○ Implementation of modern optimization algorithms: Flash Attention, PagedAttention, continuous batching
○ Development and optimization of CUDA kernels for GPU-accelerated computations
○ Profiling and performance tuning across various GPU architectures
○ Optimization of memory usage and model throughput
Model Operations:
○ Implementation of efficient model quantization methods (GPTQ, AWQ, GGUF)
○ Development of a memory management system for working with large language models
○ Integration of support for various model architectures (LLaMA, Mistral, Qwen, and others)
What We Expect From You
- Strong proficiency in Rust or C++
- Hands-on experience with GPU / hardware acceleration, including:
○ CUDA, AMD, or Metal (Apple Silicon)
- Solid understanding of:
○ LLM principles
○ core ML algorithms
○ modern ML approaches used in production systems
- Ability to read ML research papers and implement them in code
- Ability to write clean, efficient, highly optimized code
- Interest in systems-level ML and low-level performance optimization
- High level of autonomy:
○ take existing algorithms from research or open source,
○ understand them deeply,
○ adapt and integrate them into a new architecture
- Fluent English
What The Company Offers
- Remote-first setup (work from anywhere)
- Dubai working hours
- High level of ownership and autonomy
- Flat structure
- Salary in cryptocurrency
- An opportunity to create a great product that will disrupt the AI market