Choosing an LLM

AI users face a difficult choice when picking a model. CryptoTalks offers hundreds of models, so choosing the right one can be overwhelming. There isn't a clear winner; the right choice depends on your specific needs. In this article, I'll break down some of the less obvious considerations.

The Current State of AI Models

Both open source and proprietary models have made significant advances in recent years. While proprietary models like GPT-4 and Claude set early benchmarks, open source alternatives like Llama, Mistral, and DeepSeek have rapidly closed the gap in capabilities. For most users, the choice comes down to performance.

Performance

Model benchmarks have seen dramatic improvements over the past year, with many models now achieving near-perfect scores on traditional benchmarks. However, these benchmarks may not tell the whole story. Many standard tests like MMLU, BBH, and GSM8K are becoming saturated, with top models scoring above 90%. This saturation makes it increasingly difficult to meaningfully differentiate between models.

The latest generation of models like GPT-4o, Claude 3.5 Sonnet, and DeepSeek v3 each claim superiority in different areas. However, their real-world performance is remarkably similar, likely because they're trained on largely overlapping datasets, and on each other's outputs.

More importantly, benchmark tasks often don't reflect typical usage patterns. While a model might excel at complex mathematical proofs or advanced coding challenges, most users primarily need help with simpler tasks like writing, basic programming, or general analysis. In these common scenarios, the differences between top models become negligible.

The models also tend to fail in similar ways. They all struggle with:

- Complex multi-step reasoning
- Maintaining consistency in long conversations
- Understanding implicit context
- Admitting uncertainty when appropriate

This pattern of shared strengths and weaknesses suggests that current performance differences between leading models are minimal for most practical applications.

I feel that the best way to find the right model is to try a few out and see which one works best for your specific needs; a short comparison script like the one below makes this easy. For now, prompting is going to matter more than the model itself.
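
As a minimal sketch of such a side-by-side test, the script below sends the same prompt to several models through an OpenAI-compatible API. The base URL, environment variable, and model identifiers are placeholders (not real CryptoTalks values); substitute whatever your provider exposes.

```python
import os
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; replace the base URL and
# API key variable with your provider's actual values.
client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder
    api_key=os.environ["LLM_API_KEY"],      # placeholder variable name
)

# Placeholder model identifiers; use the names your provider lists.
MODELS = ["gpt-4o", "claude-3.5-sonnet", "deepseek-v3"]

PROMPT = "Explain the difference between a mutex and a semaphore in two sentences."

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.2,  # low temperature for more reproducible comparisons
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content.strip())
```

Run it with a handful of prompts drawn from your actual workload; differences that matter to you will show up faster this way than in any benchmark table.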

The Training Data Dilemma

A critical issue facing both types of models is training data transparency. Proprietary models such as GPT-4o and Claude 3.5 Sonnet are not transparent about their training data. It is clear that companies like OpenAI and Anthropic train their models on copyrighted materials; what isn't clear is the legality of this practice.

Users hold the power of their wallet, and the question is whether people really care if models are trained on copyrighted materials. Personally, I don't, and I don't think courts will consider it illegal: I believe training on copyrighted materials falls under the fair use doctrine, though that is yet to be officially determined. AI is going to be used for more and more tasks, and the ability to train on copyrighted materials will be important to avoid hamstringing the technology. What should be illegal is publishing model outputs that themselves infringe copyright, and that already is illegal. My only concern in this area is that it may disrupt the economic incentives for producing human art and other copyrighted work.

More concerningly, some models trained in certain regions show clear political biases in their responses. For instance, models trained in China consistently deny historical events like the Tiananmen Square protests or provide heavily skewed perspectives on geopolitical issues. This demonstrates how training data and fine-tuning can be used to embed specific viewpoints or censorship into AI models, regardless of their open source status.

Many models show signs of being trained on outputs from other AI systems. For example, DeepSeek v3, despite being open source, occasionally identifies itself as ChatGPT and exhibits similar content filtering patterns. This raises questions about the authenticity and originality of model responses.

Content Filtering and Censorship

Proprietary models typically implement strict content filtering policies in the name of safety. While this likely helps prevent misuse, it can also hinder legitimate research and discussion. Models like GPT-4o and Claude often misread the true context of a conversation and refuse a reasonable request.

Open source models generally offer more flexibility. Models such as Dolphin are trained specifically to be less censored. They will still refuse dangerous requests, but they are more likely to complete a well-formed request.

Cost and Accessibility

Today, most LLMs are cheap for personal use, so cost isn't a big factor. There are even plenty of free options available.

The financial model for AI access varies significantly: proprietary models are typically priced per token through an API, while open source models can be run locally for free or accessed through low-cost hosted providers.

For heavy users, this difference can be substantial. A typical conversation might use 1,000-2,000 tokens, meaning costs can quickly accumulate with proprietary models.
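
As a rough back-of-the-envelope sketch of how that adds up (the per-token prices below are illustrative assumptions, not any provider's actual rates):

```python
# Illustrative assumption: $10 per million input tokens, $30 per million
# output tokens -- placeholder prices, not any specific provider's rates.
INPUT_PRICE_PER_TOKEN = 10 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 30 / 1_000_000

# Assume a typical conversation uses ~1,500 tokens, split 2:1 input:output.
input_tokens, output_tokens = 1_000, 500
cost_per_conversation = (input_tokens * INPUT_PRICE_PER_TOKEN
                         + output_tokens * OUTPUT_PRICE_PER_TOKEN)

conversations_per_month = 1_000  # a heavy user
print(f"Per conversation: ${cost_per_conversation:.4f}")  # $0.0250
print(f"Per month: ${cost_per_conversation * conversations_per_month:.2f}")  # $25.00
```

Casual users sending a few messages a day will barely notice the cost; it's the heavy, automated, or long-context workloads where per-token pricing starts to bite.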

Also, new models that use inference-time compute are becoming increasingly available (currently only OpenAI's o1). These models are more expensive because they use more compute to generate each token. Do you need an extra 10% in performance for 10x the cost? You be the judge.

Performance and Reliability

While proprietary models once held a clear lead in capabilities, the gap has narrowed significantly. Recent open source models like Llama, Mistral, and DeepSeek now deliver performance on par with proprietary offerings for most everyday tasks.

Privacy Considerations

The privacy implications of AI model choice are significant.

At CryptoTalks, we don't store any chat history. However, we do have to send your request to the model provider, so we can't guarantee that your data stays private; different providers have different policies on data retention and deletion.