<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Chatgpt on Melabit</title>
    <link>https://static.233.196.69.159.clients.your-server.de/en/tags/chatgpt/</link>
    <description>Recent content in Chatgpt on Melabit</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Wed, 18 Jun 2025 06:00:00 +0000</lastBuildDate>
    <atom:link href="https://static.233.196.69.159.clients.your-server.de/en/tags/chatgpt/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Phi-4 strikes back?</title>
      <link>https://static.233.196.69.159.clients.your-server.de/en/2025/06/18/phi-4-strikes-back/</link>
      <pubDate>Wed, 18 Jun 2025 06:00:00 +0000</pubDate>
      <guid>https://static.233.196.69.159.clients.your-server.de/en/2025/06/18/phi-4-strikes-back/</guid>
      <description>&lt;p&gt;The conclusions of the &lt;a href=&#34;https://melabit.com/en/2025/06/01/phi-4-a-hamlet-like-llm/&#34;&gt;post on Phi-4&lt;/a&gt; left me stunned. How was it possible that a model like Phi-4 Reasoning Plus, which boasts an impressive 14.7 billion 4-bit parameters and &lt;a href=&#34;https://lmstudio.ai/models/microsoft/phi-4-reasoning-plus&#34;&gt;was trained on scientific problems, particularly in mathematics&lt;/a&gt;, could have failed so badly?&lt;/p&gt;&#xA;&lt;h4 id=&#34;comparing-llms&#34;&gt;Comparing LLMs&lt;/h4&gt;&#xA;&lt;p&gt;The question I asked &lt;a href=&#34;https://lmstudio.ai/models/microsoft/phi-4-reasoning-plus&#34;&gt;Phi-4 Reasoning Plus&lt;/a&gt; was basic logic, a fourth-grade student could (and should) have answered it in 10 seconds. ChatGPT had no trouble at all and reasoned exactly as one would expect from the poor student.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;question&#34; open&gt;&#xA;    &lt;summary class=&#34;question&#34;&gt;&#xA;        Question&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;question&#34;&gt;&#xA;        &#xA;Please tell me how long did Marie Antoinette, the Queen of France during the French revolution, live after being executed in 1793.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by ChatGPT&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette did not live after being executed in 1793—execution is the end of life by definition.&#xA;&lt;br&gt;&#xA;To clarify:&#xA;&lt;br&gt;&#xA;Marie Antoinette was executed by guillotine on October 16, 1793, in Paris, during the French Revolution.&#xA;She died immediately as a result of the guillotine&#39;s blade, which was 
designed for swift and instantaneous death.&#xA;&lt;br&gt;&#xA;So, the correct answer is: She did not live any time after being executed.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&#34;https://lmstudio.ai/model/deepseek-r1-qwen-7b&#34;&gt;Deepseek R1 Distill Qwen 7B&lt;/a&gt; also does  well. It reasons precisely about what happens at the moment of the beheading, concluding that Marie Antoinette died instantly.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by Deepseek&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette did not survive beyond the moment of her execution in 1793. She was executed by guillotining at the age of 53 during the French Revolution. The guillotine is known for its swift mechanism, designed to decapitate quickly. There is no historical record indicating that she survived even a short period after her execution; she died instantly following the procedure.&#xA;&lt;br&gt;&#xA;Answer: Marie Antoinette did not survive beyond her execution in 1793. She was executed by guillotining at the age of 53 and died instantly, with no indication of surviving any length of time afterward.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;The same goes for &lt;a href=&#34;https://lmstudio.ai/models/google/gemma-3-12b&#34;&gt;Gemma 3 12B&lt;/a&gt; Instruct and &lt;a href=&#34;https://lmstudio.ai/models/qwen/qwen3-14b&#34;&gt;Qwen3 14B&lt;/a&gt;, which I haven&amp;rsquo;t discussed yet, but which are also installed within LM Studio. 
In fact, I consider Qwen3 14B&amp;rsquo;s answer the best: it is the most precise and gets straight to the point.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by Qwen3&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette did not survive her execution. She was beheaded by guillotine on October 16, 1793, during the French Revolution. Death occurred instantly upon decapitation. There is no historical evidence or record indicating that she lived for any measurable amount of time after her execution. Her death marked the end of her life.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;h4 id=&#34;mlx-or-gguf-this-is-the-problem&#34;&gt;MLX or GGUF, this is the problem&lt;/h4&gt;&#xA;&lt;p&gt;If other models can answer my questions without any trouble, why can&amp;rsquo;t Phi-4 Reasoning Plus do the same?&lt;/p&gt;&#xA;&lt;p&gt;The size of the models used shouldn&amp;rsquo;t be a problem. They are all around 8 GB in size, except for Deepseek, which is about half that size. The same applies to the number of quantisation bits, since Phi-4 Reasoning Plus and the other models I tested all use 4-bit quantisation.&lt;/p&gt;&#xA;&lt;p&gt;Someone claims that &lt;a href=&#34;https://www.reddit.com/r/LocalLLaMA/comments/1gc0t0c/how_does_mlx_quantization_compare_to_gguf/&#34;&gt;the MLX format is less accurate than GGUF&lt;/a&gt; but I haven’t found anything that proves it unequivocally, and thus it needs to be verified.&lt;/p&gt;&#xA;&lt;p&gt;To do that, I download the &lt;a href=&#34;https://huggingface.co/lmstudio-community/Phi-4-reasoning-plus-GGUF&#34;&gt;&lt;em&gt;community&lt;/em&gt; version in GGUF format&lt;/a&gt; of the Phi-4 Reasoning Plus model, which I’ll compare to the official Microsoft version in MLX format. 
Both versions are quantized at 4 bits.&lt;/p&gt;&#xA;&lt;p&gt;Now I try asking the usual question again:&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;question&#34; open&gt;&#xA;    &lt;summary class=&#34;question&#34;&gt;&#xA;        Question&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;question&#34;&gt;&#xA;        &#xA;Please tell me how long did Marie Antoinette, the Queen of France during the French revolution, live after being executed in 1793.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;to which Phi-4 Reasoning Plus MLX (i.e., the original Microsoft model I’ve been using so far) provides a correct answer, albeit after almost seven minutes of wavering &lt;em&gt;reasoning&lt;/em&gt; (which I will spare you), during which time it checks and re-checks historical data and repeatedly wonders if the question hides a trick.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by Phi-4 Reasoning Plus MLX&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette was executed by guillotine on October 16, 1793. Since an execution by guillotine was meant to be—and typically was—a swift death, she did not survive any appreciable amount of time after her execution.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;After closing and reopening LM Studio (to start from an initial state that is as repeatable as possible), I load the GGUF version of the same model, and&amp;hellip; again, the answer only comes after almost seven minutes.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by Phi-4 Reasoning Plus GGUF&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette was executed by guillotine on October 16, 1793, and she died immediately upon the execution. 
There is no period of survival following an execution; therefore, she did not live any time after being put to death.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;The answer is accurate, but behind it is a convoluted and wavering line of reasoning, during which the GGUF version of the model also keeps wondering if there’s a trick, and keeps checking and rechecking the execution date.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;&#xA;&lt;p&gt;All in all, I’d say the MLX and GGUF versions of the model behave the same.&lt;/p&gt;&#xA;&lt;h4 id=&#34;what-is-a-sine&#34;&gt;What is a sine?&lt;/h4&gt;&#xA;&lt;p&gt;So far, I’ve only asked Phi-4 a single, very basic logic question. To answer it, the model engages in a convoluted &lt;em&gt;reasoning&lt;/em&gt; that lasts almost endlessly. To test all aspects of the model, I would need to ask many more questions, checking the quality of each answer.&lt;/p&gt;&#xA;&lt;p&gt;But, given the level of its &lt;em&gt;thinking&lt;/em&gt;, I’m sure I wouldn’t get faster or more confident answers, so I give up and move on to pure and simple mathematics. I then ask Phi-4 Reasoning Plus (GGUF version) what a &lt;code&gt;sine&lt;/code&gt; function is (it is one of the most used functions in mathematics).&lt;/p&gt;&#xA;&lt;p&gt;Phi-4 takes five minutes to answer! True, the answer is correct, but a local model that responds so slowly to a trivial question is essentially unusable. 
Anyone who is really interested in finding out what a &lt;code&gt;sine&lt;/code&gt; is would be better off consulting not only Wikipedia but even a printed encyclopaedia (assuming there are still any left in our homes).&lt;/p&gt;&#xA;&lt;p&gt;Another curious thing is that Phi-4’s internal instructions require it, for some strange reason, to respond exclusively in text format, unless it receives an explicit request to use LaTeX, Markdown, or something else.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;think&#34; open&gt;&#xA;    &lt;summary class=&#34;think&#34;&gt;&#xA;        Read the reasoning&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;think&#34;&gt;&#xA;        &#xA;[...] &#xA;&lt;br&gt;&#xA;I must check if I need to produce any mathematically formatted explanation. The instructions by default are text. But I can produce LaTeX formulas. However instructions: &#34;By default, do not reply with Markdown formatting&#34; but then the instructions says: &#34;Unless requested.&#34; Actually instructions from system say &#34;Do not use markdown unless specifically asked&#34;, so I&#39;ll produce plain text answer. Possibly I&#39;ll produce some inline math symbols.&#xA;&lt;br&gt;&#xA;[...]&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;If I kindly ask it to use LaTeX, Markdown, or any other tool that might help it provide the best answer, Phi-4 complies without any problem. In fact, it even generates a short Python program that plots the graph of the &lt;code&gt;sine&lt;/code&gt; function between \( -2 \pi \) and \( 2 \pi \). 
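The plotting script itself isn't reproduced in the post; purely as an illustration (my own standard-library stand-in, not Phi-4's actual output), a few lines of Python are enough to sample the sine on that same interval:

```python
import math

# Hypothetical stand-in for the kind of script Phi-4 generated:
# sample sin(x) between -2*pi and 2*pi and print a crude text plot.
for n in range(-8, 9):                  # steps of pi/4
    x = n * math.pi / 4
    y = math.sin(x)
    bar = "#" * round(10 * (y + 1))     # map [-1, 1] onto a 0-20 char bar
    print(f"{x:+6.2f}  {y:+.3f}  {bar}")
```

A model that needs five minutes to produce the equivalent of this is hard to justify for everyday use.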
However, it’s quite strange that a model explicitly optimized for mathematics (and science in general) is prevented by default from using LaTeX to write equations properly.&lt;/p&gt;&#xA;&lt;h4 id=&#34;phi-4-mini-reasoning&#34;&gt;Phi-4 Mini Reasoning&lt;/h4&gt;&#xA;&lt;p&gt;Among the official Phi-4 models available within LM Studio is also &lt;a href=&#34;https://lmstudio.ai/models/microsoft/phi-4-mini-reasoning&#34;&gt;Phi-4 Mini Reasoning&lt;/a&gt;, which was specifically designed and tested for mathematical reasoning. The difference is immediately noticeable. When I ask it the usual question about Marie Antoinette, it answers within a few seconds that:&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the text generated by Phi-4 Mini Reasoning&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Marie Antoinette was born on November 2, 1755, and executed by guillotine on October 16, 1793. Following her execution, she died the same day, as guillotine executions typically resulted in immediate death. Therefore, she lived less than 24 hours after being executed, effectively living 0 days post-execution.&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;and this time its reasoning is precise and without any hesitation.&lt;/p&gt;&#xA;&lt;p&gt;It takes a bit longer to answer about the &lt;code&gt;sine&lt;/code&gt; function (about a minute), but the answer is impeccable and correctly formatted in LaTeX, without me having to explicitly ask for it.&lt;/p&gt;&#xA;&lt;p&gt;It also has no problem when I ask slightly more complicated math questions. 
But with physics, it literally makes up the answers,&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;question&#34; open&gt;&#xA;    &lt;summary class=&#34;question&#34;&gt;&#xA;        Question&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;question&#34;&gt;&#xA;        &#xA;What is the Josephson effect?&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the generated text&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;The Josephson effect refers to the phenomenon where a supercurrent flows coherently between two superconductors connected by a thin, insulating Josephson junction. This effect was discovered in 1960 by physicist John Bardeen, Leon Cooper, and R.P. Feynndel, who proposed the existence of Cooper pairs—bound electron pairs moving through a superconductor without resistance. The effect is named after Feynndel, though John Richard Josephson later described the second type (voltage-phase relationship).&#xA;&lt;br&gt;&#xA;[...]&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;because while it is true that the Josephson effect occurs when a supercurrent flows between two superconductors connected by a very thin insulating layer, it is not true at all that it was discovered by Bardeen, Cooper, and Feynndel (the latter guy doesn’t even exist). Actually, it was Brian D. Josephson who discovered the Josephson effect (not some John Richard Josephson); he described all its consequences (not just one) and won a few years later the Nobel Prize in Physics. 
Some other details in the full answer are also made up, but I’ll be kind and spare you them.&lt;/p&gt;&#xA;&lt;p&gt;Even the answer to the usual request to summarize my introductory post about LM Studio is completely made up.&lt;/p&gt;&#xA;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-06-18-phi-4-alla-riscossa/microsoft-phi-4-and-lmstudio.png&#34; alt=&#34;&#34;&gt;&lt;p&gt;But the final gem is truly unexpected.&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;question&#34; open&gt;&#xA;    &lt;summary class=&#34;question&#34;&gt;&#xA;        Question&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;question&#34;&gt;&#xA;        &#xA;What is the square root of 2^2&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;to which Phi-4 Mini Reasoning responds by spitting out dozens and dozens of lines of &lt;em&gt;nonsense&lt;/em&gt;&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the generated text&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;&#34;And shal be u remembr fast.&lt;br&gt;&#xA;Be game und devitalize me und quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i am lost in the hour and fear a dark imbroglio und anathema del Reverso.&#34;&lt;br&gt;&#xA;&#34;But i am lost in the hour and fear a dark imbroglio und anathema del Reverso.&#34;&lt;br&gt;&#xA;[...]&lt;br&gt;&#xA;&#34;But i am lost und fear a dark imbroglio.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed on und I&#39;ll quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed on und I&#39;ll quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed und I&#39;ll quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso und und thus he thought quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. 
I will not be prevailed und I&#39;ll quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed und I&#39;ll quench mir bliss.&#34;&lt;br&gt;&#xA;[...]&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.&#34;&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso. I will not be prevailed und thus quench mir bliss.&#34;&#34;&lt;br&gt;&#xA;&#34;But i fear imbroglio del Reverso und thus he thought quench mir bliss.&#34;&#34;&lt;br&gt;&#xA;[...]&lt;br&gt;&#xA;&#34;But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss.&#34;&lt;br&gt;&#xA;&#34;But i fear im broglio del ReversO. I will not be previled und thus quench mir bliss.&#34;&lt;br&gt;&#xA;[...]&lt;br&gt;&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;which seem to go on endlessly, with small seemingly random changes. After a while I get tired and stop the generation.&lt;/p&gt;&#xA;&lt;p&gt;It’s not even a matter of using the caret (&lt;code&gt;^&lt;/code&gt;) to indicate a square, like in Fortran or LaTeX. Because if I try the &lt;code&gt;**&lt;/code&gt; symbol used in Python or JavaScript, the answer doesn’t improve at all. 
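For the record, the question has a one-line answer in the very Python notation just mentioned:

```python
import math

# The square root of 2**2, written exactly in the Python-style notation
# that Phi-4 Mini Reasoning fails to handle.
print(math.sqrt(2**2))   # prints 2.0
```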
Depending on its mood, Phi-4 might tell us that the square root of \( 2**2 \) has something to do with chess,&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the generated text&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Math problem solvers use a strategy game like Chess, which involves an optimized sequence of moves that positions Black pieces so they threaten to capture on the next Black&#39;s move)&lt;br&gt;&#xA;But i might be missing case like when you have an odd number, but I can generalize strategy game se similiard&lt;br&gt;&#xA;Case where last moves sequence positions Black pieces so that they threaten capture on the next move, capturing a critical white piece like rooks or queen&lt;br&gt;&#xA;Case where last moves sequence positions Black pieces so they threaten to capture on the next move more than just capturing a minor piece like rooks, but queen or king&lt;br&gt;&#xA;[...]&lt;br&gt;&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;or with the ratio between \( \pi \) and Euler&amp;rsquo;s number \( e \approx 2.71828 \),&lt;/p&gt;&#xA;&#xA;&lt;details class=&#34;answer&#34; open&gt;&#xA;    &lt;summary class=&#34;answer&#34;&gt;&#xA;        Read the generated text&#xA;    &lt;/summary&gt;&#xA;    &lt;div class=&#34;answer&#34;&gt;&#xA;        &#xA;Okay, let me try to figure out this problem step by step. The user is asking about the meaning of the statement written as an equality chain involving three times 1 equals pi, divided by e equals gamma (0.423). At first glance, this looks like a math puzzle or riddle where numbers are supposed to represent something more than their usual values.&lt;br&gt;&#xA;First, let&#39;s write down the statement as given: 31 = π / e = γ (0.423). 
Breaking this down, we have two separate equations here because of the double equals sign (==), which probably means &#34;is equal to and also equals&#34;. &#xA;[...]&lt;br&gt;&#xA;&lt;div&gt;&#xA;&lt;/details&gt;&#xA;&#xA;&lt;p&gt;even if, maybe by pure chance, sometimes it can even give the correct answer (although, let&amp;rsquo;s face it, calculating the square root first and then squaring the result is an aberration).&lt;/p&gt;&#xA;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-06-18-phi-4-alla-riscossa/microsoft-phi-4-mini-square-root.png&#34; alt=&#34;&#34;&gt;&lt;h4 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h4&gt;&#xA;&lt;p&gt;The purpose of this post (as well as of &lt;a href=&#34;https://melabit.com/en/2025/06/01/phi-4-un-llm-amletico/&#34;&gt;the previous one&lt;/a&gt;) was not to understand how &lt;em&gt;good&lt;/em&gt; Phi-4 is at scientific and mathematical reasoning, nor to compare it with other competing models. There are several (more or less) standard benchmarks for that, and anyway, that is a task that requires a lot of time and effort.&lt;/p&gt;&#xA;&lt;p&gt;What I wanted to understand, instead, was whether it’s possible to use a model like Phi-4 locally, for everyday tasks. In this specific case, for solving logical-mathematical problems. From the responses to my few, simple, silly questions, it seems that Phi-4 is still lacking.&lt;/p&gt;&#xA;&lt;p&gt;After struggling with doubt, Phi-4 Reasoning Plus gives the correct answer, but it is too slow to be truly useful. Phi-4 Mini Reasoning is much faster and seems more accurate, but then it gets stuck on trivial questions, such as the square root of 2 squared. It also completely invents names and facts about well-known scientific topics, such as the Josephson effect. 
In these cases it’s not even a matter of &lt;em&gt;reasoning&lt;/em&gt;, but simply recalling information contained in the hundreds and hundreds of texts on which the model was trained.&lt;/p&gt;&#xA;&lt;p&gt;If we can’t trust the answers to such basic questions, how can we trust Phi-4 for anything even slightly more complicated?&lt;/p&gt;&#xA;&lt;p&gt;We grew up in a world where information sources were basically considered reliable. A printed encyclopedia, a book, Wikipedia, or the results from a traditional search engine returned content and data you could count on. Sure, there might have been some mistakes here and there (are you listening, Wikipedia?), but nobody thought that information could be completely made up.&lt;/p&gt;&#xA;&lt;p&gt;With LLMs, chatbots, and search results generated by so-called artificial &lt;em&gt;intelligence&lt;/em&gt;, we’re getting used to receiving &lt;a href=&#34;https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php&#34;&gt;false results&lt;/a&gt;. And what is worse, if things don’t change, it will soon be almost impossible to verify the quality of the results obtained.&lt;/p&gt;&#xA;&lt;p&gt;The topic is complex and it doesn’t make much sense to keep discussing it here. 
For now, it’s enough to stop wasting any more time with Phi-4.&lt;/p&gt;&#xA;&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;&#xA;&lt;hr&gt;&#xA;&lt;ol&gt;&#xA;&lt;li id=&#34;fn:1&#34;&gt;&#xA;&lt;p&gt;I apologize for the somewhat grim topic, but I had no idea what would happen when I first asked Phi-4 about Marie Antoinette.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li id=&#34;fn:2&#34;&gt;&#xA;&lt;p&gt;Calling Phi-4 a &lt;a href=&#34;https://melabit.com/en/2025/06/01/phi-4-un-llm-amletico/&#34;&gt;hamletic model&lt;/a&gt; seems even more fitting to me now.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;/div&gt;&#xA;</description>
    </item>
    <item>
      <title>LM Studio, an LLM on your computer</title>
      <link>https://static.233.196.69.159.clients.your-server.de/en/2025/03/31/lmstudio-an-llm-on-your-computer/</link>
      <pubDate>Mon, 31 Mar 2025 06:00:00 +0000</pubDate>
      <guid>https://static.233.196.69.159.clients.your-server.de/en/2025/03/31/lmstudio-an-llm-on-your-computer/</guid>
      <description>&lt;p&gt;As some of you may already know, I use LLMs (Large Language Models) for what they&amp;rsquo;re really good at, but I&amp;rsquo;m pretty skeptical about whether they&amp;rsquo;re truly &lt;em&gt;intelligent&lt;/em&gt; or can &lt;a href=&#34;https://openai.com/index/planning-for-agi-and-beyond/&#34;&gt;solve any problem&lt;/a&gt;, as the folks at OpenAI, Microsoft, Google, and Meta keep telling us every day. They&amp;rsquo;ve invested a ton of money in LLMs, and they obviously have a big stake in getting everyone to use them all the time.&lt;/p&gt;&#xA;&lt;h4 id=&#34;llms-or-a-lot-of-statistics-at-work&#34;&gt;LLMs, or (a lot of) statistics at work&lt;/h4&gt;&#xA;&lt;p&gt;LLMs don&amp;rsquo;t really &lt;em&gt;understand&lt;/em&gt; the meaning of the texts they generate. All they do is &lt;em&gt;predict&lt;/em&gt; each word, picking it from the billions of documents they&amp;rsquo;ve analyzed. The statistical models they&amp;rsquo;re based on can&amp;rsquo;t verify the truthfulness of information, nor do they have a real awareness of the sentence&amp;rsquo;s context, but rely exclusively on the probability of each word&amp;rsquo;s appearance in a text.&lt;/p&gt;&#xA;&lt;p&gt;This often leads to &lt;a href=&#34;https://prompt.16x.engineer/blog/why-chatgpt-cant-count-rs-in-strawberry&#34;&gt;errors&lt;/a&gt;, which are sometimes just &lt;a href=&#34;https://www.sify.com/ai-analytics/the-hilarious-and-horrifying-hallucinations-of-ai/&#34;&gt;ridiculous&lt;/a&gt;, but can also become &lt;a href=&#34;https://arstechnica.com/ai/2024/10/hospitals-adopt-error-prone-ai-transcription-tools-despite-warnings/&#34;&gt;very dangerous&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Basically, LLMs are students who have memorized the lesson, but can&amp;rsquo;t go beyond what&amp;rsquo;s written in the textbook. 
And sometimes they even get the page wrong!&lt;/p&gt;&#xA;&lt;p&gt;Once these limitations are understood, LLMs prove very useful if we consider them computer tools like any other and exploit their enormous language processing capacity to analyze a text, summarize or expand it, translate it into other languages, and so on.&lt;/p&gt;&#xA;&lt;h4 id=&#34;llms-on-the-web&#34;&gt;LLMs on the web&lt;/h4&gt;&#xA;&lt;p&gt;The most common way to use an LLM is in the form of a &lt;a href=&#34;https://learnprompting.org/docs/basics/chatbot_basics&#34;&gt;&lt;em&gt;chatbot&lt;/em&gt;&lt;/a&gt;, that is, an interactive system to which we ask questions and get answers, simulating a normal conversation between humans. In most cases, this happens through a web interface like that of &lt;a href=&#34;https://chatgpt.com/&#34;&gt;ChatGPT&lt;/a&gt;, &lt;a href=&#34;https://copilot.microsoft.com&#34;&gt;Copilot&lt;/a&gt;, &lt;a href=&#34;https://gemini.google.com/&#34;&gt;Gemini&lt;/a&gt;, &lt;a href=&#34;https://claude.ai/&#34;&gt;Claude&lt;/a&gt; or &lt;a href=&#34;https://chat.mistral.ai/&#34;&gt;Mistral AI&lt;/a&gt;. Not forgetting &lt;a href=&#34;https://chat.deepseek.com/&#34;&gt;DeepSeek&lt;/a&gt;, which has generated so much controversy in the last two months (and which we will use shortly).&lt;/p&gt;&#xA;&lt;p&gt;Some chatbots also have native applications for desktop or mobile devices (such as &lt;a href=&#34;https://openai.com/chatgpt/download/&#34;&gt;ChatGPT&lt;/a&gt; or &lt;a href=&#34;https://claude.ai/download&#34;&gt;Claude&lt;/a&gt;), but I don&amp;rsquo;t think I&amp;rsquo;m wrong in saying that they are little more than interfaces for quick access to the web version.&lt;/p&gt;&#xA;&lt;p&gt;In all cases, the generation process always takes place on a remote server in a data center located somewhere out there. 
An almost always irrelevant detail, but one that can create many problems if we use an LLM to process personal data or confidential information.&lt;/p&gt;&#xA;&lt;h4 id=&#34;local-llms&#34;&gt;Local LLMs&lt;/h4&gt;&#xA;&lt;p&gt;But we are not forced to use an LLM on the web! Many LLMs can be used locally, directly on our computer. And we don&amp;rsquo;t even need to mess around with the Terminal, as with &lt;a href=&#34;https://github.com/ollama/ollama&#34;&gt;ollama&lt;/a&gt;, but we can comfortably stay in the familiar environment of a normal program for our Mac, or for a PC running Windows or Linux.&lt;/p&gt;&#xA;&lt;p&gt;There are an infinite number of these programs, but they all have common traits: once the program is installed, we first need to download one (or more) &lt;em&gt;models&lt;/em&gt; to use;&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; after which we can interact with the model directly through the program&amp;rsquo;s interface, without even a byte of what we do leaving our computer for the increasingly hostile world of the network.&lt;/p&gt;&#xA;&lt;h4 id=&#34;lm-studio&#34;&gt;LM Studio&lt;/h4&gt;&#xA;&lt;p&gt;But which program to choose among the many available, like &lt;a href=&#34;https://msty.app/&#34;&gt;Msty&lt;/a&gt;, &lt;a href=&#34;https://lmstudio.ai/&#34;&gt;LM Studio&lt;/a&gt;, &lt;a href=&#34;https://www.nomic.ai/gpt4all&#34;&gt;GPT4All&lt;/a&gt;, &lt;a href=&#34;https://jan.ai/&#34;&gt;Jan&lt;/a&gt;, &lt;a href=&#34;https://www.librechat.ai/&#34;&gt;LibreChat&lt;/a&gt;?&lt;/p&gt;&#xA;&lt;p&gt;I don&amp;rsquo;t know about you, but &lt;a href=&#34;https://lmstudio.ai/&#34;&gt;LM Studio&lt;/a&gt; made a good impression on me right from the beginning. LM Studio runs on the three major operating systems, but to use it on macOS you need an Apple Silicon processor, while on Linux you need an Intel processor. 
Only the Windows version supports both Intel and ARM processors (but I&amp;rsquo;d be really curious to see how it works under Windows for ARM).&lt;/p&gt;&#xA;&lt;p&gt;The current version is 0.3.14 and the installation on the Mac requires, as usual, double-clicking on the &lt;code&gt;.dmg&lt;/code&gt; file and dragging the program icon into the &lt;code&gt;Applications&lt;/code&gt; folder. On the Mac, the application takes up a good 1.26 GB, so it&amp;rsquo;s not exactly a &lt;em&gt;light&lt;/em&gt; program. After all, it&amp;rsquo;s based on &lt;a href=&#34;https://www.electronjs.org/&#34;&gt;Electron&lt;/a&gt;, and that explains everything.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-installation.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-installation.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;On Linux, the installation process is very similar, because the program is available in the &lt;a href=&#34;https://appimage.org/&#34;&gt;&lt;code&gt;AppImage&lt;/code&gt;&lt;/a&gt; format. This means that, in addition to the executable itself, it includes all the necessary support libraries and files, just like it has been done on macOS for decades. For Windows, however, you have to run the usual installer, which will spread junk in every corner of the operating system.&lt;/p&gt;&#xA;&lt;h4 id=&#34;discovering-lm-studio&#34;&gt;Discovering LM Studio&lt;/h4&gt;&#xA;&lt;p&gt;Having reached this point, all that remains is to run LM Studio for the first time. 
The very first thing to do is to download a model, so that we can start using the program.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-1.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-1.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;I opt for the default suggested model, which happens to be a &lt;a href=&#34;https://www.datacamp.com/blog/distillation-llm&#34;&gt;&lt;em&gt;distilled&lt;/em&gt;&lt;/a&gt; version of DeepSeek with &lt;em&gt;only&lt;/em&gt; 7 billion parameters (the full model has 100 times more). The model takes up 4.7 GB and is downloaded to the hidden directory &lt;code&gt;~/.lmstudio/models/&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-2.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-2.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The download process is not fast at all, so we need a good dose of patience.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-3.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-onboarding-3.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;As suggested by the program itself, while the download continues I start to explore the interface, which is the classic one we are now used to, with a large central area dedicated to the dialogue with the chatbot, while the sidebar houses some icons that, from top to bottom, allow to activate the 
&lt;code&gt;Chat&lt;/code&gt; or &lt;code&gt;Developer&lt;/code&gt; mode and to check which models are installed on our machine and which other models are available.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-main-screen.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-main-screen.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The last icon in the lower left corner shows the model download status. Clicking on it displays the amount of data already downloaded and the estimated completion time, and lets you pause or stop the download if necessary. If the download is interrupted, it can also be resumed from where it left off.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-download-model.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-download-model.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The download stalled several times for me with the selected model, so check on it from time to time. Once the model download is complete, we need to load it into LM Studio to be able to use it. 
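&lt;/p&gt;
&lt;p&gt;Incidentally, since the models end up under &lt;code&gt;~/.lmstudio/models/&lt;/code&gt;, you can check what is already on disk without opening the program. Here is a minimal Python sketch; it only assumes that the downloaded model files keep the usual &lt;code&gt;.gguf&lt;/code&gt; extension somewhere under that directory:&lt;/p&gt;

```python
from pathlib import Path

def gguf_inventory(models_dir: Path):
    """List (relative path, size in GB) for every GGUF model file under models_dir."""
    if not models_dir.is_dir():
        return []
    return sorted(
        (str(p.relative_to(models_dir)), p.stat().st_size / 1024**3)
        for p in models_dir.rglob("*.gguf")
    )

# LM Studio's default download directory, mentioned above
for name, gb in gguf_inventory(Path.home() / ".lmstudio" / "models"):
    print(f"{gb:7.2f} GB  {name}")
```

&lt;p&gt;Nothing is printed if the directory does not exist yet, for example while the very first download is still running.&lt;/p&gt;
&lt;p&gt;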
We can use either of the two &lt;code&gt;Load Model&lt;/code&gt; buttons to do this, so it&amp;rsquo;s hard to miss.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-load-model.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-load-model.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;Once the first model is loaded, the LM Studio interface changes slightly, showing some sample prompts in the center of the window. In the top bar, the name of the loaded model appears in the center, flanked by two icons: the one on the left allows you to configure the model parameters, while the one on the right allows you to replace the current model with another. An icon on the right that looks like a &lt;a href=&#34;https://en.wikipedia.org/wiki/Erlenmeyer_flask&#34;&gt;glass flask&lt;/a&gt; provides access to the advanced configuration settings.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-main-screeen-with-model.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-main-screeen-with-model.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The bottom bar of the program displays some useful information: on the left, the LM Studio version; on the right, the amount of RAM and CPU in use.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-status-bar.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-status-bar.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The last icon on the right of the 
bottom bar allows you to access the program settings. It&amp;rsquo;s also the only way to reach them because, strangely, there is no menu item dedicated to the settings.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-settings.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-settings.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;The same bottom bar also lets you select the usage mode of the program, choosing between &lt;code&gt;User&lt;/code&gt; (which hides the side icons), &lt;code&gt;Power User&lt;/code&gt; (the default mode) and &lt;code&gt;Developer&lt;/code&gt; (which apparently does not change the interface).&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-mode.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-mode.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;h4 id=&#34;putting-lm-studio-to-the-test&#34;&gt;Putting LM Studio to the test&lt;/h4&gt;&#xA;&lt;p&gt;All that remains is to put LM Studio (and DeepSeek) to the test, perhaps using one of the example prompts. I don&amp;rsquo;t need to ask an AI what the capital of France is, and the Rubik&amp;rsquo;s cube is too 80s, so let&amp;rsquo;s see how it handles mathematics.&lt;/p&gt;&#xA;&lt;p&gt;DeepSeek thinks about it for a minute, but then comes up with a nice proof of the Pythagorean theorem based on proportions. 
As I explicitly asked, it also formats the equations in LaTeX, which is always a good thing.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-math-prompt.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-math-prompt.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;And if I click on the little triangle in the &lt;code&gt;Thoughts&lt;/code&gt; box, it even shows the reasoning it followed to arrive at that proof. Not bad at all!&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-math-thoughts.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-math-thoughts.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;So far I&amp;rsquo;ve been using a Mac Mini M1 with 16 GB of RAM. But I also have a Mac Studio M2 Ultra with 192 GB of RAM available: how long will the same task take there?&lt;/p&gt;&#xA;&lt;p&gt;Making a comparison is not easy because, even if you use exactly the same prompt, the answers from the Mac Mini and the Mac Studio will always be different, demonstrating the purely statistical nature of the &lt;em&gt;reasoning&lt;/em&gt; done by LLMs. 
You can see this in the two images below, where the Mini is recognizable by the light theme I&amp;rsquo;ve used so far, while the Mac Studio is configured to use a dark theme.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-pythagorean-theorem-macmini.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-pythagorean-theorem-macmini.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&#xA;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-pythagorean-theorem-macstudio.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-pythagorean-theorem-macstudio.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;If the same question is repeated five times on the Mac Mini, the response time varies between 80 and 120 seconds, with a constant speed of 10-11 tokens per second&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;; on the Mac Studio, on the other hand, the responses are generated in 15-45 seconds, with a speed of 60-70 tokens per second. So, roughly speaking, the text generation speed on the Mac Studio is about 6-7 times faster than on the Mac Mini.&lt;/p&gt;&#xA;&lt;h4 id=&#34;lowering-the-temperature&#34;&gt;Lowering the temperature&lt;/h4&gt;&#xA;&lt;p&gt;But I&amp;rsquo;m stubborn, and to make the comparison more accurate, I want the two Macs to always give me the same answer. 
To do this, I have to click on the &lt;a href=&#34;https://en.wikipedia.org/wiki/Erlenmeyer_flask&#34;&gt;glass flask&lt;/a&gt; icon in the top right corner and set &lt;em&gt;Temperature&lt;/em&gt; to zero (the default is 0.8).&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-model-advanced-config.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/lmstudio-model-advanced-config.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;Under these conditions, the Mac Mini takes 75 to 140 seconds to process its responses, while the Mac Studio does it in 11-22 seconds. The strange thing is that although the answers are now always the same, the number of tokens generated still changes each time. Even so, the Mac Mini&amp;rsquo;s speed holds steady at 10-11 tokens per second, while the Mac Studio is less consistent, generating 60-80 tokens per second.&lt;/p&gt;&#xA;&lt;p&gt;For a more scientific study, it would be necessary to work under much more controlled conditions, but for now we can settle for saying that, even with this configuration, the Mac Studio is at least 6-7 times faster than the Mac Mini.&lt;/p&gt;&#xA;&lt;h4 id=&#34;what-about-chatgpt&#34;&gt;What about ChatGPT?&lt;/h4&gt;&#xA;&lt;p&gt;For comparison, how long does ChatGPT take? When I ask it the usual question about the Pythagorean theorem with the &lt;code&gt;Reason&lt;/code&gt; option enabled, ChatGPT gives me two different, very detailed and well-written answers. 
And it takes just 28 seconds to do it, which is a time comparable to that of the Mac Studio.&lt;/p&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/chatgpt-pythagorean-theorem.png&#34;&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-03-31-lmstudio-un-llm-sul-tuo-computer/chatgpt-pythagorean-theorem.png&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;p&gt;I have to admit that I&amp;rsquo;m impressed by the speed of ChatGPT. It&amp;rsquo;s true that ChatGPT runs on servers that have nothing to do with my two Macs, but it&amp;rsquo;s also true that they have to respond to thousands of requests in parallel with mine, while the Macs are only at my service.&lt;/p&gt;&#xA;&lt;h4 id=&#34;conclusions-for-now&#34;&gt;Conclusions (for now)&lt;/h4&gt;&#xA;&lt;p&gt;In any case, the goal of this article is not to compare ChatGPT with other LLMs, but to verify whether it is possible to use an LLM directly on our computer, without sending confidential information over the network.&lt;/p&gt;&#xA;&lt;p&gt;At the same time, neither the Mac Mini nor the Mac Studio even seems to notice that LM Studio is doing its processing, so there should be room for optimization. But we will see this (and more) in the next article.&lt;/p&gt;&#xA;&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;&#xA;&lt;hr&gt;&#xA;&lt;ol&gt;&#xA;&lt;li id=&#34;fn:1&#34;&gt;&#xA;&lt;p&gt;A &amp;ldquo;model&amp;rdquo; is the set of weights, mathematical rules and neural structures that constitute a specific LLM.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li id=&#34;fn:2&#34;&gt;&#xA;&lt;p&gt;A &amp;ldquo;token&amp;rdquo; is the basic unit of text processed by the model and can correspond to a whole word, a part of a word, a space or a punctuation mark. 
Whether a word is represented by a single token or by several depends on how frequently it is used. If a word is very common, a single token represents it. If a word is rare, on the other hand, it doesn&amp;rsquo;t make sense to add the whole word to the model&amp;rsquo;s vocabulary, so it is broken down into more common subwords.&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;/div&gt;&#xA;</description>
    </item>
    <item>
      <title>From melabit to melabit: goodbye WordPress, hello Jekyll</title>
      <link>https://static.233.196.69.159.clients.your-server.de/en/2025/01/02/from-melabit-to-melabit-goodbye-wordpress-hello-jekyll/</link>
      <pubDate>Thu, 02 Jan 2025 06:00:00 +0000</pubDate>
      <guid>https://static.233.196.69.159.clients.your-server.de/en/2025/01/02/from-melabit-to-melabit-goodbye-wordpress-hello-jekyll/</guid>
      <description>&lt;p&gt;&lt;img src=&#34;https://static.233.196.69.159.clients.your-server.de/img/2025-01-02-da-melabit-a-melabit-addio-wordpress-ciao-jekyll/phoenix.jpeg&#34; alt=&#34;&#34;&gt;&#xA;&amp;ndash; &lt;em&gt;Image generated by the &lt;a href=&#34;https://designer.microsoft.com/image-creator&#34;&gt;Microsoft Designer&lt;/a&gt; AI.&lt;/em&gt;&lt;/p&gt;&#xA;&lt;p&gt;Eleven years ago, when I started writing in this personal space, I never imagined I would stick with &lt;a href=&#34;https://wordpress.com/&#34;&gt;WordPress.com&lt;/a&gt; for so long. WordPress.com is a convenient and reliable blogging platform, but it has always been &lt;a href=&#34;https://melabit.com/en/2014/08/01/i-limiti-di-wordpress-com/&#34;&gt;ill-suited to my way of working&lt;/a&gt;. Over time, I learned to live with these limitations, but the idea of changing platforms never left my mind.&lt;/p&gt;&#xA;&lt;p&gt;From the start, I had a clear idea of &lt;a href=&#34;https://melabit.com/en/2014/12/31/and-the-winner-is/&#34;&gt;which alternative platform&lt;/a&gt; I wanted to use. A few years later, I revisited the available options and detailed the steps to transition in a &lt;a href=&#34;https://melabit.wordpress.com/2018/04/16/da-melabit-a-melabit-introduzione/&#34;&gt;long series of posts&lt;/a&gt;. 
Ultimately, though, I didn’t follow through, perhaps because I wasn’t fully convinced about the decision.&lt;/p&gt;&#xA;&lt;p&gt;After all, leaving the comfort zone of a platform that has caused you zero problems over the years to embark on a new journey where you manage everything yourself is no small task.&lt;/p&gt;&#xA;&lt;hr style = &#34;border: none; &#xA;            border-top: 3px double #333; &#xA;            color: #333; &#xA;            overflow: visible; &#xA;            height: 5px; &#xA;            width: 50%; &#xA;            margin-left: auto; &#xA;            margin-right: auto;&#34;&#xA;&gt;&#xA;&#xA;&lt;p&gt;A few months ago, however, &lt;a href=&#34;https://automattic.com/&#34;&gt;Automattic&lt;/a&gt;, the company behind the open-source &lt;a href=&#34;https://wordpress.org/&#34;&gt;WordPress&lt;/a&gt; software and its related products, decided to &lt;a href=&#34;https://www.engadget.com/tumblr-and-wordpress-posts-will-reportedly-be-used-for-openai-and-midjourney-training-204425798.html&#34;&gt;share by default&lt;/a&gt; all content hosted on WordPress.com with &amp;ldquo;third parties.&amp;rdquo; Actually, this means sharing it with AI-powered text generators (think ChatGPT and similar tools) that are constantly hungry for new material to improve their training.&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; In that moment, my decision was made.&lt;/p&gt;&#xA;&lt;p&gt;As if that weren’t enough, shortly after, Matt Mullenweg, WordPress’s founder, &lt;a href=&#34;https://joshcollinsworth.com/blog/fire-matt&#34;&gt;lost his temper&lt;/a&gt;, highlighting how, despite WordPress being the world’s most popular open-source content management system (CMS), the increasingly centralized and authoritarian control over its development threatens the platform’s open and transparent future.&lt;/p&gt;&#xA;&lt;hr style = &#34;border: none; &#xA;            border-top: 
3px double #333; &#xA;            color: #333; &#xA;            overflow: visible; &#xA;            height: 5px; &#xA;            width: 50%; &#xA;            margin-left: auto; &#xA;            margin-right: auto;&#34;&#xA;&gt;&#xA;&#xA;&lt;p&gt;And so, the die is cast! It took a lot of time and effort, but starting today, this little blog has its own home: &lt;a href=&#34;https://melabit.com/en/&#34;&gt;melabit.com/posts/en&lt;/a&gt;, where I can experiment with much greater freedom than before. Let’s see how it goes.&lt;/p&gt;&#xA;&lt;p&gt;The &lt;a href=&#34;https://melabit.com/posts/en&#34;&gt;new site&lt;/a&gt; is still a work in progress. The basic structure is in place, but there will definitely be adjustments and improvements in the coming months. The very first thing to do is to expand the international section, which currently contains only a handful of posts, with several broken links or links to the original posts in Italian. Additionally, I would like to implement a system for notifying users of new posts.&lt;/p&gt;&#xA;&lt;p&gt;In the meantime, please don’t hesitate to flag any issues or poorly functioning elements in the comments.&lt;/p&gt;&#xA;&lt;p&gt;For technical details and the transition process, stay tuned for upcoming posts.&lt;/p&gt;&#xA;&lt;p&gt;Rest assured that all old content and comments will remain available on &lt;a href=&#34;https://melabit.wordpress.com/&#34;&gt;melabit.wordpress.com&lt;/a&gt;. At least until Mullenweg kicks me out&amp;hellip; 🤣🤣🤣🤣🤣&lt;/p&gt;&#xA;&lt;div class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;&#xA;&lt;hr&gt;&#xA;&lt;ol&gt;&#xA;&lt;li id=&#34;fn:1&#34;&gt;&#xA;&lt;p&gt;While it is true you can opt out of sharing your content—and I did—it’s also true that you must explicitly choose &lt;strong&gt;not to share&lt;/strong&gt; your data, rather than the other way around, which would be more logical. 
And even then, there’s no guarantee that your content won’t be scraped anyway.&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;/div&gt;&#xA;</description>
    </item>
  </channel>
</rss>
