Raspberry Pi today launched the AI Kit, a $70 add-on that straps a Hailo-8L on top of a Raspberry Pi 5 using the recently-launched M.2 HAT (the Hailo-8L is an M.2 M-key module, and comes preinstalled).
The Hailo-8L's claim to fame is 3-4 TOPS/W efficiency, which, along with the Pi 5's 3-4W idle power consumption, puts it alongside Nvidia edge devices like the Jetson Orin in TOPS/$ and TOPS/W.
Google's Coral TPU has been a popular machine learning/AI accelerator choice for the Pi for years now, but Google seems to have left the project on life support, after the Coral hardware was scalped for a couple of years about as badly as the Raspberry Pi itself!
Pineboards offers a $50 Coral Edge TPU bundle as well as a $100 Dual Edge TPU bundle, offering 4 and 8 TOPS, respectively. But the Pi AI Kit undercuts those offerings on both price and power efficiency.
The Coral can be had (sometimes) for as little as $25 as a standalone PCIe device, but at 2 TOPS/W, the speed and efficiency of its 6-year-old chip design is a little behind the times. It's still quite useful for projects like a Frigate NVR, but it's far behind even the built-in NPUs on modern chips like the Rockchip RK3588.
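To put the pricing claims above in one place, here's a quick back-of-the-envelope TOPS-per-dollar comparison, using the TOPS and price figures quoted in this post (street prices fluctuate, so treat these as rough):

```python
# TOPS-per-dollar for the accelerators mentioned above, using the
# figures quoted in this post (prices fluctuate; these are rough).
accelerators = {
    "Pi AI Kit (Hailo-8L)": (13, 70),
    "Coral Edge TPU bundle": (4, 50),
    "Coral Dual Edge TPU bundle": (8, 100),
    "Coral Edge TPU (standalone)": (4, 25),
}

for name, (tops, usd) in sorted(accelerators.items(),
                                key=lambda kv: kv[1][0] / kv[1][1],
                                reverse=True):
    print(f"{name:28} {tops / usd:.3f} TOPS/$")
```

The AI Kit comes out ahead of all three Coral options on raw TOPS per dollar, with the $25 standalone Coral the closest runner-up.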
I tested the Pi AI Kit in the video embedded below:
The video has a more detailed look at the performance of the AI Kit, but I'll run through some top-level things here.
Raspberry Pi seems to be marketing the AI Kit as a companion to their extensive line of Pi Cameras (they have a ton now, targeted at a variety of use cases). Their picamera2 library (still in beta) already has some TensorFlow examples for the AI Kit, and rpicam-apps has a number of built-in Hailo AI post-processing stages.
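As a sketch of how you'd kick off one of those post-processing stages, the snippet below shells out to rpicam-hello with a Hailo demo config. The JSON asset name and path are my assumptions about what rpicam-apps installs; check /usr/share/rpi-camera-assets/ on your own Pi for the exact filenames.

```python
# Sketch: launch rpicam-hello with a Hailo post-processing stage.
# The JSON asset name below is an assumption; check what's actually
# installed under /usr/share/rpi-camera-assets/ on your Pi.
import shutil
import subprocess

cmd = [
    "rpicam-hello", "-t", "0",  # 0 = run until interrupted
    "--post-process-file",
    "/usr/share/rpi-camera-assets/hailo_yolov8_inference.json",
]

if shutil.which("rpicam-hello"):
    subprocess.run(cmd, check=True)
else:
    # Not on a Pi (or rpicam-apps isn't installed); just show the command.
    print("rpicam-hello not found; would run:", " ".join(cmd))
```

You can swap in any of the other shipped Hailo JSON configs (pose estimation, segmentation, etc.) for the same effect.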
I tested a number of them, like YOLOv5 object detection:
And YOLOv8 pose estimation:
There were no hiccups, and I ran through a few other examples in the video, if you want to see how they work with Hailo's default models (see their model zoo).
I also ran a bunch of my own pre-recorded footage through it at 480p and 720p, and it had no issues whatsoever. I did try feeding it a 4K H.264 video file, and it didn't like that so much. It worked, but it was a bit sluggish :)
Going for 51 TOPS
Microsoft recently announced the Copilot+ PC standard, which requires at least 40 TOPS of neural compute power. Qualcomm's Snapdragon X has 45 TOPS, Apple's M4 has 38, Intel's Lunar Lake has 48, and AMD's Ryzen AI 300 series has 50.
So naturally, I wanted to go further—on a Raspberry Pi.
This configuration is completely unsupported by any of the vendors involved—I used a Raspberry Pi 5, two Hailo NPUs (the Hailo-8L with 13 TOPS and Hailo-8 with 26 TOPS), a Coral Dual Edge TPU (8 TOPS), and a Coral Edge TPU (4 TOPS), totaling 51 TOPS.
And the Pi could see everything in this unholy mess on my desk... I just couldn't get the chips to fully initialize. It's likely a power issue, as dmesg showed the drivers dying off during load, after PCIe device enumeration.
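If you're trying something similarly unholy, here's a small helper I'd use to triage that kind of failure. It just filters kernel log text for the names the Hailo ("hailo") and Coral ("apex") drivers register, plus general PCIe messages; everything else about it is generic.

```python
# Filter kernel log text for lines from the Hailo ("hailo") and
# Coral ("apex") drivers, plus general PCIe messages. Run this on
# the Pi to spot where the drivers die off after enumeration.
import re
import subprocess

def accelerator_lines(log_text):
    """Return kernel log lines mentioning the NPU/TPU drivers or PCIe."""
    pattern = re.compile(r"hailo|apex|pcie", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]

if __name__ == "__main__":
    try:
        dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    except FileNotFoundError:  # not on Linux
        dmesg = ""
    for line in accelerator_lines(dmesg):
        print(line)
```

(Reading dmesg may require sudo on newer kernels.)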
I didn't have time (due to the tight deadline publishing this post) to go much further, but I suspect I'd have more luck using a single PCIe switch instead of chaining together two of Pineboards' HatBrick! Commander boards.
That, and I could supply external power so I don't tempt fate drawing more than 5W through the Pi 5's PCIe FPC header!
I may make another attempt at utilizing 51 TOPS on a Pi 5 with this board:
It's a 12x PCIe expansion board from Alftel, and I've used it in the past to string together a bunch of NVMe SSDs on the Pi CM4—enough to drive that Pi to its breaking point, in terms of simultaneous active PCIe devices :)
Comments
Where did you get the Hailo-8 with 26 TOPS? I only found one for over $200. If that's the real world price then the 8L in Pi bundle makes much more sense.
Yeah, I know that if you do a product enquiry, it's something like $150 or $200 for the Hailo-8. They sent me the one I have in this post for testing a few months ago (I've been pestering them about it since 2022, ha!).
Whoah, that 12x PCIe board looks like serious business. I wonder what you could run on that. Multiple YOLOv5 streams on live security cameras? An odd LLM?
I would think that running the same model across multiple NPUs would be difficult to achieve, and when running multiple models on multiple cameras split across the NPUs, you will likely run into issues with the single PCIe lane you have; it may not be able to feed them the data from all the cameras.
Could this M.2 Hailo module be used with other, non-RPi boards?
Yes. The Hailo-8 (on which the 8L is based) has been out a couple years now and works across a variety of edge and server devices.
I googled for some sellers and it's ~$200 USD. Does it make more sense to get the $70 AI Kit even though I don't need the HAT+ on non-RPi5 boards, or is there some place to buy a Hailo-8 module at a reasonable price? It looks like an upgraded Coral TPU module to me.
I saw on Tom's Hardware that they used a dual PCIe board to run booting and the Hailo simultaneously. After the EEPROM update on May 17th, is it possible to use NVMe booting and the Hailo-8L simultaneously with the dual PCIe board? What is Jeff's opinion on this?
Have you tried the Hailo with the TuringPi 2 board?
Is that the Hailo-8 or Hailo-8L? I'm having trouble finding the 8L anywhere, and that seems to be the chip that has 13 TOPS.
Just bought a Hailo-8L from the Raspberry Pi shop in Cambridge, UK. :-) £60.
It looks like today Hailo released their 'Dataflow Compiler' / DFC, and here's an example for retraining with your own data. I haven't had a chance to mess with it yet but may for a project soon!
Will it allow you to actually run LLMs on the Pi 5 faster? I know there are ways to get it done, but after multiple failures (most how-tos seem to involve Apple hardware, and performance was horribly slow when it did run), I went back to an AMD A12 system and it works great. Smaller would be better.
Thanks for all your hard work, Jeff.
That Pi AI chip lacks the VRAM to be useful for LLMs; however, check out the distributed-llama project. I currently have Llama 3 8B running with full context on a cluster of four Raspberry Pi 4s (8GB) at an average of about 1.6 tokens per second, after some conservative overclocking.