Jaward Sesay

Jaward

AI & ML interests

I like to train large deep neural nets too 🧠🤖💥 | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy

Jaward's activity

posted an update 4 days ago
nanoGPT with Sigmoid Self-Attention
I couldn't resist, so I had to give it a try :)

Some observations on M2:
SSA was ~5-10% faster in training with similar final loss values, slightly less coherent text generation, marginally higher perplexity, and lower memory usage compared to softmax.
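For context, here's a minimal sketch of the swap (not necessarily the notebook's exact code): the row-wise softmax in scaled dot-product attention is replaced with an element-wise sigmoid, with a -log(seq_len) bias as suggested in the SSA paper to keep outputs well-behaved as the sequence grows.

import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v, causal_mask):
    # standard scaled dot-product attention (what nanoGPT uses by default)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    att = att.masked_fill(causal_mask == 0, float('-inf'))
    return F.softmax(att, dim=-1) @ v

def sigmoid_attention(q, k, v, causal_mask):
    # sigmoid self-attention: element-wise sigmoid instead of the row-wise
    # softmax; the -log(n) bias keeps row sums in a sane range as n grows
    n = q.size(-2)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    att = torch.sigmoid(att - math.log(n))
    att = att.masked_fill(causal_mask == 0, 0.0)  # preserve causality
    return att @ v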

Code: https://github.com/Jaykef/ai-algorithms/blob/main/sigmoid_attn.ipynb
replied to their post 5 days ago

I used to think this way, but as it turns out these models don't just model probability distributions; they actually learn features across those distributions, and using these features during inference requires some "reasoning". Capable models prior to OpenAI o1 (GPT-4, GPT-3, Claude 3) could barely reason through tasks; o1 now utilizes RL to boost reasoning during inference. Scaling at inference time has been a huge challenge, but somehow OpenAI figured it out with RL. Obviously we are at an early stage of this breakthrough; proof of reasoning will become clearer in subsequent versions of o1.

Geoffrey Hinton gave a talk on this topic: https://www.youtube.com/watch?v=N1TEjTeQeg0

posted an update 6 days ago
The breakthrough in OpenAI's release goes way beyond just another family of capable models - it's a monumental leap in LLM reasoning capabilities, one in which the limitations of pre-training become obsolete and the dream of scaling during inference becomes a reality.

Once again reinforcement learning (when rightly done) proves to be the ultimate “tool” that drives reasoning in AI models. OpenAI o1 (aka strawberry 🍓) can think and learn while thinking before giving a response. This is how we humans approach solving difficult problems.

In technical terms, o1 is trained with an RL algorithm to think productively using its chain of thought. In other words, "the longer it thinks, the better it does on reasoning tasks" - similar to how AlphaGo was able to beat the world champion at Go.

Read more: https://openai.com/index/learning-to-reason-with-llms/
posted an update 8 days ago
Free research tip:
Get used to writing the first draft of your paper in markdown using VS Code's Jupyter notebook extension - it lets you do quick sanity checks with code and math - an absolute AAA experience :)
posted an update 16 days ago
The Forward-Forward Algorithm🤖

FFA replaces the forward and backward passes in backpropagation with two forward passes - one with positive (real) data and another with negative data. Each layer has its own objective function: to increase or decrease a "goodness" metric. The positive pass uses real data and adjusts weights to increase "goodness" in every hidden layer; the negative pass does the opposite.
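Here's a minimal sketch of the per-layer objective (my own simplification, not necessarily the notebook's exact code) - goodness is the sum of squared activations, pushed above a threshold for positive data and below it for negative data, with purely local updates:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    # one fully connected layer trained with a local forward-forward objective
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # normalize so only the direction of the previous layer's activity is passed up
        x = x / (x.norm(dim=1, keepdim=True) + 1e-4)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # goodness = sum of squared activations; push it above the threshold
        # for positive (real) data and below it for negative data
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        # detach so no gradients flow between layers - updates stay local
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()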

I must say, reading & implementing a godfather paper feels quite fulfilling :)
Thank you Prof. Geoffrey Hinton.

Code: https://github.com/Jaykef/ai-algorithms/blob/main/mnist_the_forward_forward_algorithm.ipynb
posted an update 21 days ago
Simplified implementation of “Neural Networks are Decision Trees”.

Showing that any neural network with any activation function can be represented as a decision tree. Since decision trees are inherently interpretable, this equivalence helps us understand how the network makes decisions.

In this implementation, we trained a simple neural network for 1k epochs on make_moons, saved the trained weights (state dicts), extracted the decision-tree equivalent from the trained weights, then visualized and evaluated it.
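The core trick, roughly (a simplified sketch, not the notebook's exact code): for a ReLU network, each input's pattern of active/inactive hidden units selects one linear region of input space, and those activation patterns are exactly the root-to-leaf paths of the equivalent tree.

import torch
import torch.nn as nn
from sklearn.datasets import make_moons

# tiny ReLU net for make_moons (training loop omitted for brevity)
net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))

X, _ = make_moons(n_samples=100, noise=0.1)
x = torch.tensor(X, dtype=torch.float32)

# the on/off pattern of the hidden ReLUs selects one linear region of the
# input space, i.e. one root-to-leaf path in the equivalent decision tree
with torch.no_grad():
    pre_act = net[0](x)                  # hidden-layer pre-activations
    patterns = (pre_act > 0).int()       # activation pattern = tree path
    leaves = {tuple(p.tolist()) for p in patterns}
print(f"{len(leaves)} distinct activation patterns (tree leaves) used by the data")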

Code: https://github.com/Jaykef/ai-algorithms/blob/main/nns_are%20decision_trees.ipynb
posted an update 27 days ago
Alan Turing's mind-bender, "Can machines think?", in its glorified form. This 74-year-old paper laid the foundation for how we think about AI and machine intelligence today. The level of detail, clarity and foresight is just phenomenal - he was way ahead of his time 🧠🤖

Original copy: https://archive.org/details/MIND--COMPUTING-MACHINERY-AND-INTELLIGENCE
posted an update 30 days ago
Cooked up a cool and much faster AI voice assistant space that also supports speech translation (with seamless-expressive). Start with the phrase "Please translate" followed by the speech you'd like translated to activate speech-translation mode. It uses open-source LLMs (Llama 3, Mistral, etc.) with Edge TTS for the voice assistant and seamless-expressive for speech translation.

Give it a try: Jaward/optimus
posted an update about 1 month ago
Supercool Weekend Read🤖
Nvidia researchers achieved SOTA LLM compression metrics using pruning and knowledge distillation techniques.

Details on Techniques (Simplified):
They started off with a large pre-trained language model (15B params), then:

1. Estimated the importance of different parts of the model (neurons, attention heads, layers) using activation-based metrics on a small calibration dataset (see the rough sketch after this list).

2. Pruned (removed) less important parts of the model to reduce its size.

3. Retrained the pruned model using knowledge distillation, where the original large model acts as a teacher for the smaller pruned model.

4. Used a lightweight neural architecture search to find the best configuration for the pruned model.

5. Repeated this process iteratively to create even smaller models.
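
To make step 1 concrete, here's a rough illustration of activation-based importance scoring and pruning for a single linear layer (my own toy version, not NVIDIA's code - their metrics also cover attention heads and whole layers):

import torch
import torch.nn as nn

def neuron_importance(linear: nn.Linear, calib_batch: torch.Tensor) -> torch.Tensor:
    # rank the output neurons of a layer by mean activation magnitude on a
    # small calibration batch (a crude stand-in for activation-based importance)
    with torch.no_grad():
        acts = linear(calib_batch)           # (batch, d_out)
        return acts.abs().mean(dim=0)        # one importance score per neuron

def prune_neurons(linear: nn.Linear, keep: int, calib_batch: torch.Tensor) -> nn.Linear:
    # keep only the `keep` most important output neurons of the layer
    scores = neuron_importance(linear, calib_batch)
    idx = scores.topk(keep).indices
    pruned = nn.Linear(linear.in_features, keep, bias=linear.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(linear.weight[idx])
        if linear.bias is not None:
            pruned.bias.copy_(linear.bias[idx])
    return pruned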

Cool, giving it a try this weekend 😎
Code: https://github.com/NVlabs/Minitron
Paper: https://arxiv.org/abs/2407.14679
Demo: nvidia/minitron
posted an update about 1 month ago
Let’s see JEPA in action🤖
Simplified image-based implementation training on a CPU with live preview support - very satisfying to watch:)

I-JEPA is the image-based version of JEPA (Joint-Embedding Predictive Architecture - an alternative to autoregressive LLM architectures) pioneered by Professor Yann LeCun.

At a high level, I-JEPA predicts image segment representations (target) based on representations of other segments within the same image (context). It consists of three key components: a context encoder, a target encoder, and a predictor.
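A toy skeleton of those three components (heavily simplified - plain MLP encoders standing in for the ViTs, just to show the data flow):

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyIJEPA(nn.Module):
    # toy I-JEPA skeleton: predict target-patch embeddings from context-patch
    # embeddings; plain MLPs stand in for the ViT encoders
    def __init__(self, patch_dim=64, emb_dim=32):
        super().__init__()
        self.context_encoder = nn.Sequential(nn.Linear(patch_dim, emb_dim), nn.GELU(),
                                             nn.Linear(emb_dim, emb_dim))
        # target encoder is an EMA copy of the context encoder (no gradients)
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU(),
                                       nn.Linear(emb_dim, emb_dim))

    def loss(self, context_patches, target_patches):
        pred = self.predictor(self.context_encoder(context_patches))  # predicted target reps
        with torch.no_grad():
            tgt = self.target_encoder(target_patches)                 # actual target reps
        return F.smooth_l1_loss(pred, tgt)

    @torch.no_grad()
    def ema_update(self, m=0.996):
        # slowly move the target encoder toward the context encoder
        for pt, pc in zip(self.target_encoder.parameters(),
                          self.context_encoder.parameters()):
            pt.mul_(m).add_(pc, alpha=1 - m)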

Code: https://github.com/Jaykef/ai-algorithms/blob/main/mnist_ijepa.ipynb
posted an update about 1 month ago
PyTorch implementation of the Self-Compression & Differentiable Quantization Algorithm introduced in “Self-Compressing Neural Networks” paper.

The algorithm performs dynamic neural network compression during training - reducing the size of weight and activation tensors and the number of bits required to represent weights.

It’s basically shrinking the neural network size (weights and activations) as it’s being trained without compromising performance - this helps reduce compute and inference cost.
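Here's a rough sketch of the idea for a single linear layer (my simplification of the paper's scheme - a learnable bit depth and exponent per weight group, with a bit-count term added to the loss):

import torch
import torch.nn as nn

class SelfCompressingLinear(nn.Module):
    # rough sketch of differentiable quantization for one linear layer: each
    # output row gets a learnable bit depth b and exponent e, and a bit-count
    # term is added to the task loss so the network shrinks itself as it trains
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.b = nn.Parameter(torch.full((d_out, 1), 8.0))  # learnable bits per row
        self.e = nn.Parameter(torch.zeros(d_out, 1))        # learnable exponent per row

    def quantized_weight(self):
        b = torch.relu(self.b)
        scale = 2.0 ** self.e
        lo, hi = -(2.0 ** (b - 1)), 2.0 ** (b - 1) - 1
        w = torch.minimum(torch.maximum(self.weight / scale, lo), hi)
        w_q = w + (torch.round(w) - w).detach()   # straight-through estimator
        return w_q * scale

    def forward(self, x):
        return x @ self.quantized_weight().t()

    def bit_cost(self):
        # approximate total bits used by this layer; add (scaled) to the loss
        return torch.relu(self.b).sum() * self.weight.size(1)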

Code: https://github.com/Jaykef/ai-algorithms
Paper: https://arxiv.org/pdf/2301.13142
replied to their post about 2 months ago

True, but not long ago mid-air touchable/interactive 3D holography was achieved (https://arxiv.org/pdf/1506.06668) using a femtosecond laser system, only that it was done at a very small scale. I agree the tech is not there yet.


posted an update about 2 months ago
I've always wondered why holography hasn't made much progress since its inception. Imagine what being able to harness and manipulate light with your bare hands in meaningful ways would be like: 3D photorealistic calls, a truly immersive workspace. Given that it's depicted in every futuristic sci-fi movie, one can't help but envision such a future. This paper gives a clear overview of why:

Turns out it's incredibly difficult to compute and render photorealistic 3D data in real time. The author notes that immense computational power is needed for high data-transmission rates and for computing the large number of phase pixels required for realistic 3D holography. The last significant breakthrough in holography was nine years ago, published in this paper - wherein they achieved mid-air touchable/interactive 3D holography using a femtosecond laser system, considered safer than nanosecond lasers. Quite astounding work: arxiv.org/pdf/1506.06668

Realizing this breakthrough at scale is an unavoidably tempting research endeavor - super exciting, especially with recent developments in machine learning and neural network algorithms demonstrating that computer-generated holograms can approach real-time processing.
posted an update about 2 months ago
Super Exciting New Paper By Meta🤖🧠🚀

Discrete Flow Matching:
Introduces a new framework/algorithm for generating text/code without having to predict autoregressively, one "word" at a time, as traditional GPT models do. It generates all parts of the text/code at once.

The algorithm does this by slowly transforming random noise (source) into meaningful text (data). It learns how to transform samples along a path created between source and target using a "probability velocity" that describes how probabilities change over time. During generation, DFM starts with a random sample and iteratively updates it using this learned velocity, gradually transforming it into a sample from the target distribution. This allows for non-autoregressive generation.

They were able to scale models up to 1.7B parameters, achieving impressive scores on HumanEval and MBPP for coding and significantly closing the gap between autoregressive models and discrete flow models.

Though in its infancy, it sure does hold a promising future, as leading research scientists argue that non-autoregressive methods yield better reasoning.
posted an update about 2 months ago
LazyLLM - Unusual Collab (Apple & Meta) Yields Impactful Work

LLM inference typically consists of two stages: prefilling and decoding. In the prefilling stage, the model processes the entire input prompt, computing and caching key-value (KV) pairs for each token, which can be time-consuming for long prompts. This is followed by the decoding stage, where the model generates tokens sequentially, reusing the cached KVs.

LazyLLM introduces a dynamic token pruning technique. Instead of computing KVs for all tokens during prefilling, LazyLLM selectively processes only the most important tokens based on attention scores, deferring less important ones to later steps if needed. It uses progressive token pruning across transformer layers and introduces an Aux Cache to store hidden states of pruned tokens.

This approach significantly reduces the time-to-first-token (TTFT) and overall generation time while maintaining accuracy across various tasks. LazyLLM outperforms baseline techniques like random token dropping and static pruning, and can be easily integrated into existing LLMs without fine-tuning, offering a practical solution for accelerating LLM inference, especially in long context scenarios.
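A toy illustration of the selection step (my own simplification, not the paper's code): rank prompt tokens by the attention the last token pays them at a given layer, keep the top fraction, and defer the rest (whose hidden states would live in the Aux Cache in the real method):

import torch

def lazy_prefill_select(attn_scores: torch.Tensor, keep_ratio: float = 0.5):
    # attn_scores: (num_heads, seq_len) attention paid by the last token to
    # each prompt token at some layer
    importance = attn_scores.mean(dim=0)                 # average over heads
    k = max(1, int(keep_ratio * importance.numel()))
    keep_idx = importance.topk(k).indices.sort().values  # keep original order
    keep_set = set(keep_idx.tolist())
    deferred_idx = torch.tensor([i for i in range(importance.numel())
                                 if i not in keep_set], dtype=torch.long)
    # kept tokens continue to the next layer; deferred tokens' hidden states
    # would go into the Aux Cache and can be revived later if needed
    return keep_idx, deferred_idx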

IN SIMPLE TERMS
When you prompt a large language model (LLM), it usually looks at every single word/subword (token) in your prompt before generating a response. This can be time-consuming, especially for prompts with very long texts. This paper introduces a new technique that solves this problem by being more selective. Instead of looking at every word right away, it only focuses on the most important words first. It decides which words are important based on how much attention the model gives them. If it needs other words later, it can go back and look at them then. This approach is like skimming a text for key information before reading it in detail.

Read More: https://arxiv.org/pdf/2407.14057
replied to their post 2 months ago
replied to their post 2 months ago

Made some improvements:

  • Achieved 100% accuracy on 100 samples of make_moons in under 100 epochs (AAA😊).
  • Better memory allocation in gradient accumulation, making it blazingly faster than Karpathy's 🎉
    Code: https://github.com/Jaykef/micrograd.c

replied to their post 2 months ago
replied to their post 2 months ago
posted an update 2 months ago
Excited to share my "Focus Mode" Playlist, code name "The AI/ML Researcher's Playlist" :)

No lyrics, no beat, just a harmonious sequence of piano melodies that will take you places beyond your reasoning/thinking prowess - trust me, I've been there lol 🎹🎶

Thanks to an amazing pianist and composer on Instagram, @andreavanzo_composer, who played all the songs in this playlist.

It currently has a total of 16 songs; I will keep adding more as I find them.

Full playlist: https://youtu.be/2ccxanKmzZY?si=x6weX2AgY5Zpadfw
posted an update 2 months ago
posted an update 2 months ago
BrainGPT - Fun Weekend Project:)
Getting creative with a sci-fi 3D point-cloud model of the brain - you prompt the model with questions about AI research frameworks that were deeply inspired by parts of the brain, and you get a response with related papers 😂
posted an update 3 months ago
All You Need To Know About Apple Intelligence Architecture And Models!!

One key challenge with running LLMs on device is balancing compute, performance, and model size. Apple Intelligence solves this by using small, specialized chunks (adapters) of the on-device foundation model when needed.

For compute, they engineered a new framework that uses LoRA adapters of rank 16, allowing a mixed 2-bit and 4-bit config averaging 3.5 bits per weight while achieving the same performance as the uncompressed models.
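For reference, this is what a generic rank-16 LoRA adapter on a linear layer looks like (a sketch of the general technique, not Apple's actual adapter code):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # a frozen base linear layer augmented with a low-rank update B @ A
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale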

With the help of an open-source model latency and power analysis tool (Talaria), they were able to optimize the bit-rate selection for each operation. This, along with activation & embedding quantization plus efficient key-value caching, achieves up to 30 tokens/sec on iPhone 15 Pro.

When the model is prompted (e.g. to rewrite an email in the Mail app), the app draws from the App Intents toolbox, which sends the prompt to the adapter specialized for writing; the model responds through the same pipeline with a real-time update of the text to rewrite.

The coolest feature of these models is their ability to adapt and dynamically specialize to the user's everyday activities. For this, they adapt the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feed-forward networks for a suitable set of the decoding layers of the transformer architecture.

For tasks that require more capable models, the architecture utilizes larger server models on a Private Cloud Compute infrastructure that delivers a state-of-the-art secure and verifiable privacy experience.

More on the private cloud compute: https://developer.apple.com/videos/play/wwdc2024/102/
posted an update 3 months ago
Very Insightful Read!!!
A RAG framework entirely inspired by natural intelligence - modeled after the hippocampal indexing theory of human long-term memory (which suggests the hippocampus links and retrieves memory details stored in the cortex).

It outperforms current “cheat” RAG:)
This is how we achieve human-level intelligence, by modeling natural intelligence correctly!

Paper: https://arxiv.org/abs/2405.14831
replied to their post 4 months ago
posted an update 4 months ago
I’ve been working on a crazy theory for my first solo paper and I would appreciate some advice from leading researchers here:)

"Theory of Adaptive Learning"

Of all the deep learning algorithms, at least to my knowledge, none fully covers the adaptive nature of intelligence. I believe it is a fundamental missing component of current AI governing laws.

I define it as a kind of learning wherein one person (say a student) adapts their framework of understanding to better suit what is being taught or said by another person/model (say a teacher). If we could measure the nature of this transfer of learning, I believe it could help improve the planning and reasoning capabilities of AI systems. If we look back at the theory of evolution, adaptation is a fundamental component of human evolution. Today's so-called groundbreaking architectures or models, specifically large language models, tend to have static parameters with constraints that are almost impossible to change or update in real time after training. This fundamentally hinders their ability to reason, plan, and accomplish objective-driven tasks as we humans do. Intelligence is dynamic.

Now, this cannot be done with current autoregressive LLMs, as their parameters are fixed with static constraints. RAG does help update a model's knowledge in real time, but it's basically cheating and doesn't count as intelligence. There's a pressing need for a natively adaptive architecture - the goal of this paper.
posted an update 4 months ago
Proof that an ablation-tested educational dataset significantly enhances model capabilities (independent of model parameters or architecture) 🤩

Yesterday, FineWeb's technical report was published. FYI, FineWeb (by 🤗) is currently the best open-source text dataset, able to scale model performance up to GPT-3 level.

While proprietary datasets used to train models like GPT-4/Claude/LLaMA are crawled internally and never released, FineWeb builds on CommonCrawl (an open repository of crawled web data). They preprocessed the data using their custom-built preprocessing library datatrove (which they also open-sourced), then evaluated data quality on lighteval by training small "ablation models" using nanotron (a library for pretraining transformer models).

Of all versions of FineWeb, FineWeb-Edu outperforms all other subsets. This is thanks to a new filtering technique wherein they used synthetic data to develop classifiers for identifying educational content.

Turned out “Education is All You Need”:)
replied to their post 4 months ago
posted an update 4 months ago
Started a new AI Session: The AI Paper Talk Show 🧠🤖💥

In this episode we went through AnthropicAI's recent interpretability paper "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet", in which they applied sparse dictionary learning to a larger model (Claude 3 Sonnet) - matching patterns of neuron activations (called features) to human-interpretable meanings.

Check full video here: https://youtu.be/uNz-Ww3_LrU?si=HUm2TWV-rSJ3X4UX

Read More:
https://transformer-circuits.pub/2024/scaling-monosemanticity/

You can also find me:
Twitter: https://x.com/jaykef_
Github: https://github.com/Jaykef
replied to their post 4 months ago

Thanks. I will when I'm done with the final version.

posted an update 4 months ago
Successfully defended my thesis yesterday 🚀

Glad that my supervisor gets the innovation behind it - “An Adaptive Virtual Intelligent Tutor that autonomously learns and adjusts to your learning preferences”

The intuitive approach: fine-tune a highly performant pre-trained large language model on a rich, task-specific dataset (in my case a code-instruction dataset with adaptive instructions on how to teach coding and solve coding problems while adhering to the student's learning style).

Then apply Retrieval-Augmented Generation (RAG) during inference to update the knowledge base of the model in real-time with adaptive features learned from conversations with the model over time.

The app supports both real-time voice chat with an intelligent 3D avatar (made with Soul Machines' Digital DNA Studio) powered by the fine-tuned model, and text chat with the locally hosted fine-tuned model.

With enough interactions you get an objective-driven, task-specific, adaptive, personalized tutor that completely gets you (knows your learning pace, your learning style, and your preferences).

This is what I feel is missing in today’s AI systems - Autonomously Adaptive Assistants (AAA) - and oh I’m currently writing a paper on this:)
replied to hakunamatata1997's post 4 months ago

SadTalker's generation speed is fairly good though; you can also dig through the code and see if you can optimize it for faster generation. I suggested the D-ID stream API so you can see how their video streaming works.

replied to hakunamatata1997's post 4 months ago
posted an update 4 months ago
posted an update 4 months ago
Journey With Me Into The Mind of Large Language Models: Interesting Findings in AnthropicAI's Scaling Monosemanticity paper.

One of the many unknowns with LLMs is the why behind the responses they give - it's unclear why certain responses are chosen over others, which shows how little we know of what's happening inside these models.

To get a deeper sense of this, they tried sparse dictionary learning on a larger model (Claude 3 Sonnet) - matching patterns of neuron activations (called features) to human-interpretable meanings.

Now, dictionary learning is a traditional ML technique that identifies recurring patterns of neuron activations across various contexts. Meaning any internal state of the model can be expressed as a combination of a few active features rather than numerous active neurons.

They scaled up a more effective measure of dictionary learning using a Sparse Autoencoder (SAE). The SAE has an encoder that maps inputs to sparse high-dimensional features via linear transformation & ReLU, and a decoder that reconstructs inputs from those features.
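In code, that SAE is roughly this (a sketch based on the description above, not Anthropic's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    # linear encoder + ReLU into a much larger feature space, linear decoder back;
    # trained with reconstruction loss plus an L1 sparsity penalty on the features
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = F.relu(self.encoder(x))   # sparse, high-dimensional features
        return self.decoder(feats), feats

    def loss(self, x, l1_coeff: float = 1e-3):
        recon, feats = self(x)
        return F.mse_loss(recon, x) + l1_coeff * feats.abs().sum(dim=-1).mean()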

Three variants of the SAE (with ~1M, ~4M & ~34M features) were trained; across SAEs, fewer than 300 features were active per token and more than 65% of variance was explained. Dead features: ~2% for the 1M SAE, 35% for 4M, 65% for 34M - implying better training could reduce dead features.

Experiments were conducted where these SAEs were applied to residual stream activations (RSAs) at the model's middle layer (why? 1. RSAs are smaller than MLP layers = lower compute cost; 2. it helps tackle "cross-layer superposition" issues - when features are spread across multiple layers instead of being isolated in specific layers, causing interpretation difficulties). These experiments revealed that scaling laws can help guide the training of these SAEs.

My favorite of course is the Basic Code Features - where the model attributed meaning to different code syntax elements similar to syntax highlighting in text editors.
replied to their post 4 months ago

Yep, all this just shows, at both a low and a high level, how complex language is - even with abstractions we still fall at the mercy of undesired outcomes. So far tiktoken and SentencePiece are viable choices for larger models.

posted an update 4 months ago
After spending some time practicing tokenization, I have come to realize that the difficulties we face in understanding each other are analogous to the challenges LLMs face in processing and interpreting tokens - as in, untrained tokens lead to out-of-distribution qualms.

One could think of how we understand as a process that involves trained tokens (known/learned facts) grappling with prompts/tweets/lessons from someone else. This process is distinct for each person - with unique encoding, decoding, merging and splitting patterns.

This distinction might as well be categorized in GPT levels lol, which raises the question: what level of tokenizer are you? GPT-2, GPT-3, GPT-4 or GPT-4o tokenizer :)

Papers:
- Neural Machine Translation of Rare Words with Subword Units (https://arxiv.org/abs/1508.07909)
- Learning to Compress Prompts with Gist Tokens
(https://arxiv.org/abs/2304.08467)
- Language Models are Few-Shot Learners
(https://arxiv.org/abs/2005.14165)
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
(https://arxiv.org/abs/2405.05417)
- Language Models are Unsupervised Multitask Learners

Code:
https://github.com/karpathy/minbpe
https://github.com/openai/tiktoken
https://github.com/openai/gpt-2
https://github.com/google/sentencepiece
replied to their post 4 months ago

Okay, GPT-4o just helped me beat Karpathy's minbpe train speed by 1.2x in one shot - I can finally agree on the "o" meaning "omni" :)

Improvements

  • efficient merging and get_stats: got rid of redundancy in computing merge and get_stats

posted an update 4 months ago
Build your own GPT-4 Tokenizer! - @karpathy 's minbpe exercise.
Step 1: BasicTokenizer
Got "close" to beating minbpe's train speed :(
Step 2: RegexTokenizer coming soon.

Notes on lessons learned:
- tokenization is the assembly language of LLMs:)
It's not a healthy choice to code it lol.
- encoding can literally drive you mad.
- merging is where sh*t gets real - moment of truth:)
- training requires precision.
- decoding is trivial.
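
For anyone curious, the heart of the BasicTokenizer's training loop boils down to roughly this (a compressed sketch in the spirit of minbpe, not my exact exercise code):

from collections import Counter

def train_bpe(text: str, vocab_size: int = 300):
    ids = list(text.encode("utf-8"))        # start from raw bytes (ids 0..255)
    merges = {}                             # (id, id) pair -> new token id
    for new_id in range(256, vocab_size):
        stats = Counter(zip(ids, ids[1:]))  # count adjacent pairs
        if not stats:
            break
        pair = max(stats, key=stats.get)    # most frequent pair gets merged
        merges[pair] = new_id
        out, i = [], 0                      # replace every occurrence of `pair`
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id); i += 2
            else:
                out.append(ids[i]); i += 1
        ids = out
    return merges, ids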
posted an update 5 months ago
mlx_micrograd - an mlx port of Karpathy's micrograd - a tiny scalar-valued autograd engine with a small PyTorch-like neural network library on top.

https://github.com/Jaykef/mlx_micrograd
Installation
pip install mlx_micrograd

Example usage
Example showing a number of possible supported operations:
from mlx_micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data}') # prints array(24.7041, dtype=float32), the outcome of this forward pass
g.backward()
print(f'{a.grad}') # prints array(138.834, dtype=float32), i.e. the numerical value of dg/da
print(f'{b.grad}') # prints array(645.577, dtype=float32), i.e. the numerical value of dg/db

posted an update 5 months ago
# Thoughts on Neural Scaling Laws
When you take a zoomed-out view of the success of neural networks, you see it all revolves around the scaling laws - empirical observations that performance improves with increased model size, dataset size, and compute.

The specifics of how these laws apply vary for different modalities and architectures. This is notable in the empirical equations used to measure them.

Yet they all heavily rely on three main factors - data, model size, and compute. These factors themselves also have sub-dependencies - data size & quality, model size & architecture, and number of GPUs & compute-kernel code, respectively.
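For reference, the Kaplan et al. paper listed below expresses each factor's contribution as its own power law; the constants and exponents are empirical fits, so this only restates the form:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}

where N is the number of parameters, D the dataset size in tokens, and C_min the compute budget.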

As research into these laws progresses, we begin to see new scaling laws emerge that may apply in much different ways than usual. This is typical of recent local LLMs (Phi-3, Gemma 2B, LLMs in a flash), which show small models trained on small amounts of rich, high-quality data beating larger models.

I look forward to the singularity moment - when these laws take a full spin and meet back where it all began :)

References:
- Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361
- Scaling Laws for Autoregressive Generative Modeling: https://arxiv.org/abs/2010.14701
- LLMs in a flash: https://arxiv.org/abs/2312.11514
- Phi-3 Technical Report: https://arxiv.org/abs/2404.14219
- Gemma 2B: https://arxiv.org/pdf/2403.08295
posted an update 5 months ago
When I read the KAN paper, I see physicists casually making fun of the uncertainties in MLPs or Neural nets as a whole:

- "The philosophy here is close to the mindset of physicists, who often care more about typical cases rather than worst cases" lol this went hard on NNs

- "Finite grid size can approximate the function well with a residue rate independent of the dimension, hence beating curse of dimensionality!" haha.

- "Neural scaling laws are the phenomenon where test loss decreases with more model parameters"

- "Our approach, which assumes the existence of smooth Kolmogorov Arnold representations, decomposes the high-dimensional function into several 1D functions"

Key Differences With MLPs (a toy sketch of the edge-function idea follows this list):
- Activation Functions: Unlike MLPs, which use fixed activation functions at the nodes, KANs utilize learnable activation functions located on the edges between nodes.
- Weight Parameters: In KANs, traditional linear weight matrices are absent. Instead, each weight parameter is replaced by a learnable univariate function, specifically a spline.
- Summation Nodes: Nodes in KANs perform simple summation of incoming signals without applying non-linear transformations.
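
Here's that toy sketch (using a small radial-basis expansion in place of the paper's B-splines, so this illustrates the structure rather than the actual KAN implementation):

import torch
import torch.nn as nn

class KANLayer(nn.Module):
    # every edge (input i -> output j) carries its own learnable univariate
    # function, here a small sum of Gaussian radial basis functions standing in
    # for the paper's B-splines; nodes just sum their incoming edge outputs
    def __init__(self, d_in, d_out, n_basis=8, x_range=(-2.0, 2.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*x_range, n_basis))
        # one coefficient vector per edge: (d_out, d_in, n_basis)
        self.coef = nn.Parameter(torch.randn(d_out, d_in, n_basis) * 0.1)

    def forward(self, x):                   # x: (batch, d_in)
        basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)  # (batch, d_in, n_basis)
        # phi_ij(x_i) = sum_k coef[j, i, k] * basis_k(x_i); node j sums over i
        return torch.einsum("bik,oik->bo", basis, self.coef)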

Advantages Over MLPs:
- Accuracy: KANs achieve higher accuracy with smaller network sizes compared to larger MLPs in tasks like data fitting and solving partial differential equations (PDEs).
- Interpretability: Due to their unique structure, KANs are more interpretable than MLPs.

Technical Innovations:
- Learnable Edges: learnable functions on network edges present a novel approach to network design, providing greater flexibility in modeling complex relationships in data.
- No Linear Weights: the elimination of linear weights reduces the parameter count and potentially simplifies the learning process, focusing optimization on the univariate function representations.

Applications and Practical Use:
- Scientific Collaboration: KANs have been applied in scientific settings as tools to help discover or rediscover math
posted an update 5 months ago
It's exciting to see Apple's commitment to open-source AI research lately. From an awesome new machine learning framework (mlx) to a family of purely open models (OpenELM) and incredibly visionary papers (LLMs in a flash, MM1), not to mention the vibrant OSS community behind mlx - all alpha signs of something huge dropping in this year's #AppleEvent & #WWDC.
replied to their post 5 months ago

Over 400 downloads already🎉

  • small yet very capable, lightweight, runs at light speed with mlx/llama.cpp


posted an update 5 months ago
replied to their post 5 months ago

Yeah, too bad it was unable to run tests since mlx is Apple-silicon-only and Devin's dev environment is Linux. It wrote the port code though; I'll have to test it on my Mac :)

posted an update 5 months ago
Today's most difficult task for Devin:
build a port of our AutoAgents framework in mlx and develop a demo using GGUF weights - it got close to nailing it (with guidance).

It was magical to witness. I had to take the wheel and help fix some subtle bugs. That said, there was still the need for a human software engineer to keep it aligned with the overall goal. Most of my work involved reviewing code, checking shells, and alignment chats.

Full demo coming soon.

AutoAgents: LinkSoul/AutoAgents
replied to their post 5 months ago

Haven't completed it yet; need to do some refactoring. I will share when it's ready.

posted an update 5 months ago
Got access to Devin today and boy has it been rocking it - a 10x engineer on pure software dev tasks, albeit it falls at the mercy of ML/AI tasks. Still a promising feat of daring engineering; wishing all the best to the team @cognition_labs.
replied to their post 5 months ago
replied to their post 5 months ago

The paper mentioned the 4-bit quantized model can occupy ~1.8GB on the iPhone, so it will probably be less than 2GB.

posted an update 5 months ago
All You need To Know About Phi-3 (Technical Report Walkthrough)

Summary of Summaries:
Phi-3-mini
- Architecture specs: decoder-only transformer, model size: 3.8 billion parameters, LongRoPE [128K context length], vocab size [32064], trained on 3.3 trillion tokens at bfloat16.
- Rivals the performance of larger models like Mixtral 8x7B and GPT-3.5, and is capable of running locally on a smartphone.
- Utilizes a high-quality training dataset heavily filtered from web data plus LLM-generated synthetic data.
- Can be quantized to 4 bits, occupying ≈1.8GB of memory.
- Ran natively on an iPhone 14 with the A16 Bionic chip at inference speeds of up to 12 tokens per second.

Phi-3-small
- Architecture specs: also decoder-only, 7B parameters, vocab size [100352], default context length [8K], hidden dimension: 4096, number of heads and layers: follows the 7B class structure.
- Uses tiktoken tokenizer (for enhanced multilingual tokenization)

Phi-3-medium:
- Architecture specs: also decoder-only, hidden dimension: 5120, number of heads: 40, number of layers: 40, tokenization: consistent with the other models, trained on 4.8 trillion tokens.

Training Methodology:
- Focuses on high-quality training data, deviating from standard scaling laws.
- The models undergo two-phase pre-training using a mix of web sources and synthetic data for general knowledge and logical reasoning skills.

Performance:
- Phi-3-mini achieves competitive scores on standard benchmarks like MMLU and MT-Bench, indicating strong reasoning capabilities.
- Higher variants show even better performance, suggesting effective scaling with increased model size.

Limitations:
- Phi-3-mini: limited by its smaller size in tasks requiring extensive factual knowledge; primarily supports English.
- Phi-3-small: limited multilingual support.

Hosting LLMs locally is a big win for OSS - private, secured inferencing on the go😎
posted an update 5 months ago
# On Coding Your First Attention

While you don't necessarily have to code the attention block of a transformer from scratch to understand how it works, it sure is the closest you can get to a first-principles understanding of why and how transformers behave the way they do.

@karpathy covered attention in detail in his nanoGPT video (which I strongly recommend watching). Now I would like to share some thoughts and my experience writing my first attention.

First, let's zoom out quickly and explain what attention is: in transformers, attention is a communication mechanism that allows the model to focus on different parts of the input sequence when making predictions.

It assigns weights to each input token based on its relevance to the current context, enabling the model to weigh information selectively. This mechanism helps transformers capture long-range dependencies and contextual information effectively.

The original "Attention Is All You Need" (AIAN) paper introduced two commonly used forms of attention: Scaled Dot-Product Attention (which, applied to a single sequence, gives self-attention) and Multi-Head Attention, which stacks several such attention heads in parallel.

# The Code

Now, attention, as with most deep learning algorithms, boils down to a math equation, so writing the code is fairly trivial, especially with a deep learning framework like PyTorch. Below is what's called single-head attention.

(image 2)

The code defines single-head attention in PyTorch - it transforms input vectors, computes attention scores and weights, and then calculates the weighted sum of values based on these weights (as per the attention equation)
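
Roughly, the module looks like this (a minimal sketch along the lines of what the image shows, not necessarily its exact code):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.query = nn.Linear(d_model, d_head, bias=False)
        self.key = nn.Linear(d_model, d_head, bias=False)
        self.value = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):                                         # x: (batch, seq_len, d_model)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # attention scores
        weights = F.softmax(scores, dim=-1)                       # attention weights
        return weights @ v                                        # weighted sum of values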

When you have multiple of those stacked in parallel, you get what's called Multi-Head Attention. This gives much simpler code if you inherit from the SingleHeadAttention class:

(image 3)
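
Along the same lines, a minimal multi-head version building on the single-head sketch above (again an illustration, not the exact code in the image):

class MultiHeadAttention(nn.Module):
    # several single-head attentions in parallel, outputs concatenated and
    # projected back to d_model (continues the sketch above)
    def __init__(self, d_model, n_heads):
        super().__init__()
        d_head = d_model // n_heads
        self.heads = nn.ModuleList([SingleHeadAttention(d_model, d_head)
                                    for _ in range(n_heads)])
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))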

Full Article here: https://huggingface.co/blog/Jaward/coding-your-first-attention
replied to their post 5 months ago

Closest is SadTalker: https://github.com/OpenTalker/SadTalker
Its holistic facial dynamics generation is limited to lip sync, head movement, and eye blinking.

I don't think Microsoft will release the VASA code; they will probably commercialize it.

replied to their post 5 months ago
replied to their post 5 months ago

The magic: a training pipeline that can "extract facial dynamics and head movements from real-life talking face videos".