Train your ML models on the GPU by changing just one line of code
Utilize cuML and ATOM to make your machine learning pipelines blazingly fast
Pro GPU System vs Consumer GPU System for Deep Learning
Why you might consider going pro
Implement Multi-GPU Training on a single GPU
An Advanced Guide for TensorFlow
Deploying PyTorch Models with Nvidia Triton Inference Server
A flexible, high-performance model serving solution
Host Hundreds of NLP Models Utilizing SageMaker Multi-Model Endpoints Backed By GPU Instances
Integrate Triton Inference Server With Amazon SageMaker
Matrix Multiplication on GPU
How to achieve state-of-the-art matrix multiplication performance in CUDA.
Apple M2 Max GPU vs Nvidia V100, P100 and T4
Compare Apple Silicon M2 Max GPU performance to Nvidia V100, P100, and T4 for training MLP, CNN, and LSTM models with TensorFlow.
Running a SOTA 7B Parameter Embedding Model on a Single GPU
In this post I will explain how to run a state-of-the-art 7B parameter LLM-based embedding model on just a single 24GB GPU. I will cover some theory and then show how to run it with the HuggingFace Transformers library in Python in just a few lines of code.
Unleashing the Power of Triton: Mastering GPU Kernel Optimization in Python
Accelerating AI/ML Model Training with Custom Operators - Part 2
Massive Energy for Massive GPU Empowering AI
Massive GPUs for AI model training and deployment require significant energy. As AI scales, optimizing energy efficiency will be crucial.
Programming Apple GPUs through Go and Metal Shading Language
Investigating Go, Cgo, Metal Shading Language, Metal Performance Shaders, and benchmarking different approaches to matrix multiplication
Metal Programming in Julia
Leveraging the power of macOS GPUs with the Metal.jl Framework.
Fine-Tuning LLMs on a Single Consumer Graphics Card
Learnings from fine-tuning a large language model on a single consumer GPU
Apple M2 Max GPU vs Nvidia V100 (Part 2): Big Models and Energy Efficiency
Compare Apple Silicon M2 Max GPU performance and energy efficiency to Nvidia V100 for training big CNN models with TensorFlow
Maximizing the Utility of Scarce AI Resources: A Kubernetes Approach
Optimizing the use of limited AI training accelerators
Need for Speed: cuDF Pandas vs. Pandas
A comparative overview
PyTorch Native FP8
Accelerating PyTorch Training Workloads with FP8 - Part 2
Profiling CUDA using Nsight Systems: A Numba Example
Learn about profiling by inspecting concurrent and parallel Numba CUDA code in Nsight Systems
How Bend Works: A Parallel Programming Language That "Feels Like Python but Scales Like CUDA"
A brief introduction to Lambda Calculus, Interaction Combinators, and how they are used to parallelize operations on Bend / HVM.
The Mystery Behind the PyTorch Automatic Mixed Precision Library
How to get a 2X speed-up in model training using three lines of code