Why Polish AI doesn’t have to play catch-up

While global benchmarks highlight gaps in creative tasks, models like Bielik demonstrate that local, compact AI can outperform the giants in domain-specific applications, offering both security and control

Sebastian Kondracki, creator of Bielik
“Bielik is fantastic for working with Polish-language text. In this regard, it can be compared to a large language model, such as the aforementioned ChatGPT,” says Sebastian Kondracki, the creator of Bielik. Photo: Bielik Press Materials

A critical report and questions about the real quality of Polish AI models have ignited a debate in the tech sector. Sebastian Kondracki, head of the project behind Bielik, defends the work and points out flaws in the methodology used for comparisons. Where does the truth lie, and are we really facing a technological gap?

Is Bielik in Polish significantly worse than ChatGPT and other LLMs?

It’s a bit like comparing two means of transport – for example, an F‑16 and a van. And now you could ask: which one accelerates better? If we want to compare ChatGPT, which has internet access, a reasoning system, and is built at massive scale, with a model that is primarily an engine… well, by analogy, it’s obvious that the F‑16 will always accelerate faster. No matter how hard we try or how much we praise Polish technical ingenuity, the same applies to Bielik.

So why do we need Bielik or PLLuM?

Bielik is fantastic for working with text in Polish. Here, it can be compared to a large language model, such as the aforementioned ChatGPT. Bielik excels at analyzing Polish text and translating it into other languages. But it has one fundamental feature: Bielik is a compact model designed for business use. It is meant to handle sensitive data in Polish. In that respect, it is at least on par with large language models, despite having fewer parameters.

Who's who

Sebastian Kondracki

Programmer, co‑creator, and head of the Bielik AI project. Founder of the Spichlerz Foundation. Head of innovation at Devinity and CEO of SerwisPrawa.pl. In 2026, he was appointed by Donald Tusk to the Council of the Future. He has been a long‑standing member of industry organizations and advisory groups within public administration.


Polish AI under criticism

The company Oxido, which just published a report on the topic, claims otherwise.

There are many benchmarks. Large European organizations such as EuroEval have them. The National Information Processing Institute (OPI PIB) does as well. And so do we – at SpeakLeash. When it comes to generating highly literary forms, for example, Bielik does perform worse than the large language models. But on specialized tasks, we often outperform the giants. There’s nothing revolutionary in that.

Good to know

Oxido criticizes Polish GenAI models

In a report published on March 17 in Rzeczpospolita daily, Oxido examined 12 language models, including ChatGPT 3.1 Pro, Llama 4, GPT‑5.2, as well as Polish models Bielik 3.0 and PLLuM 8x7B. According to the findings, the Polish models performed significantly worse in Polish than their Western counterparts. Bielik scored 6.38 points, while PLLuM scored 5.95. For comparison, Gemini achieved 8.13 points and GPT‑5.2 7.66.

The models were tested on 20 tasks across 10 categories, including writing an email, giving advice, and quoting Pan Tadeusz.


Explainer

Pan Tadeusz

Written in 1834 by Adam Mickiewicz while in Parisian exile, Pan Tadeusz is Poland's national epic – a long poem set in the Lithuanian-Polish countryside of 1811–12, following a young nobleman named Tadeusz who returns home to find love, a family feud, and the distant thunder of Napoleon's armies promising Polish liberation.

But the plot is almost secondary: Mickiewicz, writing from a homeland partitioned out of existence, was really constructing an act of memory – lovingly reconstructing the forests, feasts, hunts, and quarrels of a vanished world before it could be forgotten. The result became something far more than literature: across two centuries of occupation and upheaval, Poles memorized it, carried it to war, and recited its opening lines (“Lithuania! My fatherland!”) as a kind of prayer – a reminder that a nation can survive without a state, as long as it remembers who it is.

Still, the report highlights large differences in handling Polish

The report contains a host of methodological errors. First, it compares different types of models on the same tasks. That’s like testing a car’s skid performance and comparing a vehicle on summer tires with one on winter tires, and then saying, “This car is worse.” No – it’s not worse; it’s simply on summer tires. In reality, the study did not test the quality of the model or Polish artificial intelligence – it tested the quality of the configuration.

The strengths and weaknesses of Bielik

What does that mean?

First, foreign models had internet access. So if we ask for the opening lines of Pan Tadeusz, foreign models aren’t even relying on “internal” knowledge – they simply pull it from the web and provide what we’re asking for. Models like PLLuM or Bielik, however, have no internet access and aren’t databases, so they don’t have Pan Tadeusz encoded. Does that mean Bielik is worse in Polish because it doesn’t “know” Pan Tadeusz? No. Give it internet access like the other models and then run the test.

Can you do that?

Of course. In many business implementations, Bielik works this way. I can assure you it will deliver a highly accurate answer. There are, of course, model tests involving book citations – but these are more about the data the models were trained on, and whether unlicensed material was ever included.

Bielik and PLLuM vs. western AI models

What, in your view, is the key difference between Bielik or PLLuM and models like ChatGPT or Gemini?

Both Bielik and PLLuM are more B2B than B2C. They work best when installed on your own server and configured with parameters tailored to the task. If a user needs something creative, they set the parameters differently than if they require text analysis. In my view, this is the fundamental problem with the cited study. Small, compact models require proper configuration. Large systems, on the other hand, have parameters adjusted by other models or based on the user’s interaction history.
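The task-dependent configuration described above can be sketched as a set of decoding presets. This is a minimal illustration, not Bielik's documented defaults: the parameter names are the knobs commonly exposed by local inference servers, and the values are illustrative assumptions.

```python
# Illustrative sampling presets for a locally deployed model. Creative
# generation favors more randomness; text analysis favors near-greedy,
# repeatable decoding. Values here are assumptions, not Bielik defaults.
SAMPLING_PRESETS = {
    # Creative writing: higher temperature, broader nucleus sampling.
    "creative": {"temperature": 0.9, "top_p": 0.95, "repeat_penalty": 1.1},
    # Text analysis / extraction: near-deterministic for consistency.
    "analysis": {"temperature": 0.1, "top_p": 0.9, "repeat_penalty": 1.0},
}

def sampling_for(task: str) -> dict:
    """Return decoding parameters for a task type, defaulting to analysis."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["analysis"])
```

In a hosted system such as ChatGPT, choices like these are made behind the scenes; with a self-hosted compact model, the deployer sets them per use case, which is exactly why a one-size-fits-all benchmark configuration can misrepresent the model.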

Additionally, open-weight models like ours can be fine-tuned and further trained. We do this ourselves – for example, we recently added the Silesian language. We are continuously expanding Bielik’s capabilities. For us, the key benchmarks are those that demonstrate performance in specialized domains, such as legal or medical applications. The study you mention, however, had a clear agenda: to declare that Polish AI is “bad.”

Are there areas where you can confidently say that Bielik outperforms Western GenAI models, with objective evidence?

We have to remember that the category of “Western models” is enormous. It includes, for instance, open-weight models like Meta LLAMA at 400B parameters – roughly 40 times larger than Bielik. It also includes closed models, such as OpenAI GPT‑5.1, where we don’t know the exact number of parameters. In practice, these may be several orders of magnitude larger.

So… yes. We are better than Western models in a similar weight class. And not only in Polish. For example, OpenAI released open-weight models at 120B and 20B, and in the Polish language and cultural competency tests conducted by OPI PIB, we perform significantly better. Another example is specialized tasks, such as legal applications. Bielik, fine-tuned by the company Gaius-Lex, outperforms GPT‑5 in Polish legal tasks.

Development costs and the Polish context

There’s another angle to consider. We are comparing Bielik or PLLuM to models whose training cost enormous sums.

The first GPT‑3.5 Turbo model had 176 billion parameters – already 16 times larger than Bielik. So there’s no chance we’ll always outperform a closed model, even in Polish. In fact, sometimes it doesn’t even make sense to run Bielik from a business perspective – if, for example, the task is simply summarizing ten documents. It’s easier to use tools from OpenAI or Anthropic than to purchase an AI server, deploy Bielik, and train or fine-tune it for specific needs.

But if we require more serious applications, even if Bielik can’t quote Pan Tadeusz, it will outperform American models in those use cases.

Expert's perspective

Polish models: A success – and an opportunity

The key point of reference in today’s debate on the development of artificial intelligence models is the real cost of building them. Training the largest, leading models now costs in the region of USD 500m (approx. PLN 2.0bn / EUR 460m), while constructing the broader ecosystem around them can reach an additional USD 2bn (approx. PLN 8.0bn / EUR 1.8bn).

Against this backdrop, initiatives developed with little to no public funding – relying instead on community effort, institutional support, and access to computing power provided by technology partners – are particularly noteworthy. The fact that such models are able to achieve relatively strong performance should be regarded as a success. Even if they fall short of the best commercial solutions, that gap becomes understandable in light of the scale of investment involved.

The key issue, however, is sovereignty – both technological and economic. In sensitive areas such as security and data processing, it is essential to have solutions over which we retain full control, and which can operate locally, without reliance on external cloud providers. There is also an economic dimension. As AI is increasingly used to boost productivity, much of the value generated by these technologies currently flows to global corporations – mainly outside Europe.

For this reason, developing domestic capabilities in AI is strategically justified. This does not mean, however, that Poland should compete alone in building the largest general-purpose models. A European approach appears more rational: developing joint, competitive solutions at the EU level, while simultaneously building local capabilities and smaller models as a foundation for future growth.

Examples show that bottom-up initiatives can achieve better results than those implemented with state support. The key, therefore, is not so much increasing spending as organizing it more effectively.

Business and Bielik

Do you have evidence in the form of concrete business examples?

Plenty. For example, during Poland’s presidency of the European Union, the “Proste Pismo” (“Plain writing”) application was built on Bielik and streamlined communication across the board. It is used by the Ministry of Finance, the Social Insurance Institution (ZUS), Pekao, Credit Agricole, as well as numerous technology and energy companies.

Do you and your team feel affected by media descriptions such as “Polish bots are dimwits”?

If our model were truly “dimwitted,” would we have been presented at Tuesday’s NVIDIA conference as one of the leading solutions contributing to European technological sovereignty? And would NVIDIA have published a so-called training report with our team as the training partner?

But addressing the criticism directly – yesterday, during the presentation of the Bielik AI project, also at the NVIDIA conference, we showed a quote from a YouTube influencer who said: “There’s a Polish ChatGPT called Bielik – and it’s stupid.” That slogan gave us such visibility that people started testing us. That created enormous value, because we received feedback and data to improve the model. The second version was much better and made its way into business use. Again, we received extensive feedback and improved it further. Now we have a third version.

Such statements actually generate interest in Bielik. There is also another mechanism at play, known as the “jagged technological frontier.”

Different uses of Artificial Intelligence

What do you mean?

Let me explain with an example. If we go back 30 years, using a word processor offered tremendous possibilities. It was quickly adopted because it was simple, and a specific task could be performed very easily. In AI, however, the boundary is highly uneven.

Let’s say I want to generate an essay in the style of a particular writer. Bielik will do this well when drawing on more widely known authors. ChatGPT will do it well for the vast majority of writers worldwide. But give either of them a simple task – “Now write me the lyrics of a disco polo song” – and it suddenly turns out that all models, including ChatGPT and Bielik, fall short. That’s because language models struggle with rhyme: they are trained mainly on other types of text and break words into units that do not align with syllables, so the result is a very poor disco polo song.

Explainer

Disco polo

Disco polo is a genre of upbeat, synth-driven Polish dance music born in the early 1990s, spreading not through radio or record labels but via cassette tapes sold at roadside markets and played at weddings and village celebrations across the Polish countryside. Stars like Shazza, Boys, and Bayer Full became household names entirely outside the mainstream media ecosystem, which is partly why urban elites treated the genre with such contempt. It survived its purgatory years on the wedding circuit before staging a genuine mainstream comeback in the 2010s, with artists like Weekend and Sławomir dragging it back onto radio and streaming platforms – initially enjoyed ironically, then just enjoyed – until disco polo settled into its current status as beloved national kitsch.

A paradox?

It can handle an essay in the style of Hemingway, yet fails at a simple, sing-song rhyme that a fifth- or sixth-grade student could produce. This boundary is highly uneven. One might assume that a task of similar – or even lower – difficulty should be easy for AI. But it isn’t.

I mention this because criticism allows us to test what Bielik can and cannot do. Defining that boundary is extremely difficult.

PLLuM responds to the controversy

Expert's perspective

PLLuM and the realities of the AI landscape

The proposed test is, in reality, a comparison of the incomparable – above all, one that lacks any scientific basis and ignores key benchmarking factors such as model size and architecture. It also fails to account for critical aspects such as domain-specific applications or digital sovereignty. Drawing conclusions from such a comparison therefore makes little sense; it misleads, “hallucinates,” or even distorts the true state of affairs.

This is merely a comparison – because, as I mentioned, there is no objective test or study here. It includes selected, non-comparable models chosen without any clearly defined criteria, which already undermines its validity. Small open models, large commercial models, and even models with internet access are all placed in the same basket – some of them paid, commercial solutions. The author claims this approach makes the analysis more objective (sic), because it supposedly involves simply selecting “a model from a website.”

At the same time, however, it juxtaposes solutions operating under business licenses and equipped with source citation capabilities. It is unreasonable to expect a small model – particularly in terms of parameter count – to quote text with the same effectiveness. It is self-evident that a model capable of directly accessing online sources operates differently. In sum, such a “test” has no scientific character; it cannot be considered a study that objectively evaluates anything. It lacks basic elements such as reproducibility, the possibility of independent verification, or peer review – including, for instance, the disclosure of the exact prompts used.

Most importantly, the comparison entirely overlooks the actual needs that Polish models are designed to address, as well as the specific nature of these projects. In the case of PLLuM, the primary objective is local deployment on local infrastructure, including in public administration and the government’s mObywatel platform. These are specialist, domain-specific models, designed to meet clearly defined needs and to be further fine-tuned for particular use cases. Comparing a “chat model” with such a system says little about its real-world performance in the context for which it was created.

A glance at the PLLuM models available on Hugging Face is enough: they differ in parameters, training stage, intended use, licensing, and degree of fine-tuning. Alongside base models, there are instruction-tuned models and those further trained on preference datasets (so-called aligned models). They serve different purposes and should not be lumped together.

This is precisely the context in which sovereign models are developed: not to compete with the largest – often commercial – general-purpose systems in every possible category, but to perform reliably in specific domain deployments, execute defined tasks, ensure security, and operate within a clearly specified context of use.

Where Bielik is headed

Will the next version of Bielik be able to quote Pan Tadeusz?

We are continuously developing Bielik. Our goal is for it to become a European model. Since version 3.0, we have already incorporated 30 natural languages. We are now planning to release Bielik 3.1, where the architecture will remain the same, but the number of supported languages will exceed 40 – including Chinese and Arabic. We also want to add a model capable of analyzing images. Analyzing, not generating.

As for Pan Tadeusz, Bielik can already provide excellent analysis of the epic’s individual books (it is divided into twelve). Perhaps the next version will have a context window large enough to analyze the whole work at once. However, quoting is not the role of the model. In such cases, it is far more efficient to connect Bielik to a national library – then it can quote any book.

A more technical direction?

Yes. We would like to build a full business suite of models – from a reasoning model to a VLM (vision-language model for image analysis) – that would allow companies to build their own AI systems end-to-end, powered by Bielik. But for somewhat different purposes than general-access GenAI models.

We have also gained valuable experience through our collaboration with NVIDIA. We have managed to significantly reduce the computational requirements for running Bielik: a laptop with a GPU is now sufficient, rather than a full server, with only a slight loss in quality. We presented this in the United States on Tuesday, March 17. NVIDIA is now promoting our solution.
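A back-of-the-envelope calculation shows why shrinking numeric precision (quantization) is enough to move an 11-billion-parameter model from a server to a laptop GPU. The parameter count matches Bielik's class; the byte widths are standard quantization levels, but the arithmetic below is an illustration, not NVIDIA's or SpeakLeash's actual optimization recipe.

```python
# Rough weight-memory footprint of a model at different precisions.
# This ignores activation memory and KV cache for simplicity.
def model_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GB."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# An 11B-parameter model (Bielik-class) at common precisions:
fp16 = model_size_gb(11, 2.0)  # 16-bit floats: 22 GB -> server-grade GPU
int8 = model_size_gb(11, 1.0)  # 8-bit quantization: 11 GB
q4   = model_size_gb(11, 0.5)  # 4-bit quantization: 5.5 GB -> laptop GPU
```

Halving or quartering the bytes per weight trades a small amount of accuracy for a large drop in memory, which is consistent with the “slight loss in quality” mentioned above.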

When can we expect these tools to be widely released?

As for the 3.1 language model, we are already testing it. I believe the launch is a matter of weeks rather than months. As for the remaining models, we are only just beginning training. Typically, this takes up to six months. I hope that by mid-year, these models will be available as a complete package. In the meantime, we may release smaller solutions earlier.

You recently announced a partnership with Google on a series of roadshow training sessions for developers. Why did you decide to collaborate with them? I have already heard quite a few negative voices suggesting that this Polish initiative could be indirectly “taken over” by an American tech giant – if only to encourage developers to adopt its solutions.

I see technological sovereignty somewhat differently.

Will Polish AI be taken over?

What do you mean?

I do not see technological sovereignty as “disconnecting from everyone and going it alone.” Polish business – whether we like it or not – is deeply embedded in Google, Azure, and other platforms. For me, technological sovereignty means operating on your own terms. If I need to process sensitive data, I can run Bielik on my own infrastructure. If I want to generate amusing images, it is cheaper to use Microsoft or Google.

We are also technology-agnostic. We encourage running Bielik on your own infrastructure, in the cloud, or in Polish AI factories.

Aren’t you concerned that Google will use your training programs and relationships to promote its own tools?

These “cloud squads,” as we call them, will be launched under different names with other partners that provide GPU capacity locally – including Polish AI factories. If we are to begin any serious discussion about sovereignty, we must first remain technologically agnostic. If we do not collaborate with the platforms businesses already use, we will not be able to support those businesses.

How not to lose control

American big tech companies tend to embed themselves in local initiatives that gain momentum on the back of economic patriotism – and then, in various ways, take them over. Is Bielik at risk of that?

No – and the Bielik Summit demonstrated this. We had numerous examples of successful deployments, from Polish organizations, AI companies, and smaller tech businesses alike. But there were also AMD, NVIDIA, HPE, Dell, Beyond.pl, Google, Microsoft, and many other Polish and international firms. Our goal is to reach a level where Bielik can be deployed anywhere.

If I were to worry about anything, it would be big tech firms poaching the people who train these models. That’s why we try to engage in ambitious projects that meet the expectations of our talent. After all, Poles do not move to big tech solely because of higher salaries, but also because of the opportunity to work on ambitious projects.

This is also where I take issue with unreflective criticism. It undermines motivation. Highly skilled Polish specialists want to work on ambitious projects pro publico bono, and then someone comes along and says, “No, this is poorly done – maybe you should do something else.” That is very damaging. We need to pursue ambitious projects and, in that way, defend ourselves against big tech – so that our specialists do not leave.

Are you satisfied with Bielik?

This month alone, the models were downloaded more than 400,000 times. Mistral 14B – a much larger model – has fewer downloads than we do. All Bielik models combined have been downloaded two million times, without any paid promotion.

Is a model weak if people are downloading it – despite the fact that doing so requires a certain level of expertise? Does it lack a future if it is being showcased by NVIDIA during its CEO’s keynote?

I agree that there is still a great deal to be done. And we will continue doing it.

How to develop local initiatives

Expert's perspective

Emotion vs. the realities of Artificial Intelligence

Let’s cool the emotions. Comparing models like Bielik to market giants with hundreds of billions of parameters is a mistake. It is like comparing a handy pistol to a heavy cannon – both can fire, but they serve entirely different purposes.

The debate also overlooks privacy. Polish banks use Bielik precisely because it guarantees security. An open-weight model allows full control over information, which never leaves the institution’s servers.

The same applies to PLLuM. From the outset, it was not meant to be a general-purpose model designed to win tests of literary knowledge, but rather a core engine for interactive features in the mObywatel application and for supporting future deployments in public administration.

The ability to generate a long essay or win a literature quiz is not a measure of business usefulness. In a corporate environment, what matters is precision in specific tasks and control over information. Smaller, specialized engines are far better suited to this than cloud-based giants.

Instead of complaining about rankings, we should recognize what the Bielik team has achieved. Without large budgets, relying on free computing power from Cyfronet, they have created an efficient tool.

A meeting last year in Kluczbork with Wojciech Zaremba from OpenAI reinforced my view of what is at stake. I asked him about the future of such open initiatives. His answer was clear: building your own models creates capabilities in the domestic market. If you know how to build an engine from scratch, you can easily operate, modify, and deploy any other.

The best proof of recognition for Polish technological expertise is not theoretical tests, but industry acknowledgment. At this very moment, Paweł Kiszczak is on stage in San Jose at an NVIDIA conference, proudly presenting Bielik to the world.

Let’s stop apologizing for doing things differently. Instead, I challenge decision-makers and businesses: start boldly deploying our secure models where they have an advantage, and begin building technological independence on the foundations we already have.

Let’s not measure others by our own yardstick. They are simply built on a different scale.

Key Takeaways

  1. Sebastian Kondracki emphasizes that comparing models such as Bielik with global systems like ChatGPT is often misleading, as they serve different functions and operate under different conditions. Large models benefit from scale, internet access, and advanced reasoning capabilities, which translate into stronger performance in general or creative tasks. Bielik, by contrast, is a compact model designed primarily for business applications – especially for working with the Polish language and sensitive data. In these areas, according to Kondracki, it achieves results comparable to larger models, despite having far fewer parameters.
  2. Critical comparison reports may contain methodological flaws that affect their conclusions. The key issue is the juxtaposition of models with different configurations – for example, those with internet access versus those relying solely on embedded knowledge. As a result, such tests measure not the actual capabilities of the models, but rather how they are used. The interviewee also notes that in specialized tasks – such as legal analysis or Polish language and cultural competence – Bielik can outperform larger models, as evidenced by selected benchmarks and real-world deployments.
  3. Bielik is being developed as a tool for building AI systems in business environments, where control over data and the ability to tailor the model to specific needs are critical. Kondracki points out that in many cases, using large external models may be more cost-effective. However, for more complex deployments, local solutions such as Bielik offer greater flexibility and security. At the same time, the model continues to evolve, expanding into new languages, adding image analysis capabilities, and optimizing hardware requirements – all aimed at increasing accessibility and broadening its range of applications.