13. LLMs’ data abuse: methodology of solution.

Reasons for LLMs’ data processing failures.

Reports on the use of AI for analytical purposes are full of complaints about its lapses in data collection and its abuse of data processing. In one series of cases, OpenAI’s ChatGPT hallucinated court precedents for lawyers; in another, OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini used data from xAI’s Grokipedia as reliable, curated information despite its alleged corruption.

The random, chaotic, and irresponsible choice of data (facts, opinions, interpretations) by LLMs and, more dangerously, their turn to generating the false data they need to support their conclusions have become proverbial. This is the principal obstacle to the widespread use of LLMs in judicial, marketing and management, financial and HR, academic research and education, and other similar practices. LLMs’ proprietors warn users that the models err and require them to verify the output. Users then spend more time and resources cross-checking that output than they would by handling the task with the “ancient” method of human reasoning without AI. The situation is aggravated by LLMs’ mistakes in producing templated documents in the layouts that practice requires, such as the Microsoft Office formats of Word, PowerPoint, and Excel, or Adobe Acrobat’s PDF. The crisis of LLMs’ reliability and autonomous sustainability is obvious.

What restricts the AI revolution is not some limitation of LLMs’ reasoning capabilities but the human prejudice toward them as maverick co-operators. The challenge of collecting data, sorting it, and processing it to extract actionable information is not something new that suddenly rushed upon us in the AI epoch. It is as old as human thinking; it has always existed and is as omnipresent as bad weather or a bad mood. However, not only does the human brain have a special arrangement for taking on this challenge (I have discussed it at length in my posts on the brain’s symmetry), but human data practice also has well-established solutions for coping with it. The solution is a multi-tier, cross-checking methodology. We are implementing one for the strategic practice of business and political organisations and publishing its output. In the next few posts I will discuss some of the neuroscience and social science sources of Strategy by AI: Professional Methodology, since the approach is usable in AI well beyond the narrow sphere of AI-governed strategic practice.

Shaping the methodology for AI data collection and processing.

In my previous posts, I started to discuss LLMs’ proverbial mishandling of data collection and abuse of data processing. Now I will set this challenge against the data collection and processing procedures that science developed in the “pre-AI” epoch of intelligence. All the social sciences that study history, the economy, and society face data scarcity, distortion, gaps, errors, cheating, and so on, and nevertheless produce usable reports and guides. How do they do it, and how might their experience be used to make LLMs trustworthy partners?

The pivot of the social sciences’ handling of uneven data is the methodology of its processing. The methodology resembles the pattern of data processing in the individual human brain. The first stage is to register the data supplied by the human sensory organs (eyes, ears, skin, tongue, etc.) and to file a description of it. Not only the data itself must be registered but also the sources, means, and circumstances of its extraction. These properties serve to reveal what in the data is true and what is false, and they inform how the data must be handled.
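The first stage can be illustrated with a minimal Python sketch. It is only an illustration under my own assumptions, not the implementation we use: the names DataRecord and register, and the choice of fields, are hypothetical. The point it shows is that a piece of data is not filed at all unless its source, means, and circumstances of extraction are filed with it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class DataRecord:
    """One registered piece of data plus the conditions of its extraction.

    Hypothetical structure for illustration; the field names are assumptions.
    """
    content: str        # the data as it was received
    source: str         # who or what supplied it
    means: str          # how it was obtained, e.g. "document review"
    circumstances: str  # the context in which it was extracted
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def register(registry: List[DataRecord], content: str, source: str,
             means: str, circumstances: str) -> DataRecord:
    """File a record only if every provenance field is filled in."""
    provenance = {"source": source, "means": means,
                  "circumstances": circumstances}
    for name, value in provenance.items():
        if not value.strip():
            raise ValueError(f"missing provenance field: {name}")
    record = DataRecord(content, source, means, circumstances)
    registry.append(record)
    return record
```

A call such as register(registry, "Revenue grew 12%", "annual report", "document review", "published filing") files the record, while the same call with an empty source raises an error instead of letting unattributed data into the registry.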

The second stage is the detailed analysis of the data. Every piece of information contains a core fact, which is true and reliable, and a wrapping of opinions and distortions that must be peeled away. In the human brain, this second stage is the peeling and extraction of the core fact, and it is carried out by the left hemisphere. In the social sciences it is carried out by the specialized disciplines of data and source studies; statistics (much loved by LLMs’ trainers) is one of them. Only after the core facts are peeled and properly prepared does the stage of their analysis follow. If the analysis skips this stage, it works on raw data, abuses the facts, and its conclusions are rarely justified. This is exactly what so often happens with LLMs’ reasoning now.

The third stage of the reasoning is the assessment of the core facts against the broader context that the brain grasps and the deeper concepts that it has developed. It is carried out by the right hemisphere. Thus, blindly touching a hot round object in the context of a cafeteria, with its concept of serving coffee, leads to the conclusion that the object is a mug of coffee. The third stage is the methodology of data analysis and conclusion-making. It is of crucial importance since it concludes the reasoning and launches the commands of execution: by the body in the case of an individual human being, or by executive units in the case of society. In AI systems, robotic elements perform this function.
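The coffee example can be put into a toy Python sketch. It is a deliberately simplified illustration, not a model of the brain or of our methodology: the CONCEPTS table, the "kitchen" entry, and the function name assess are my assumptions. It shows only the principle that the same core facts lead to different conclusions in different contexts, and to no conclusion at all where the context offers no fitting concept.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical concept table: context -> (required core facts, conclusion).
# The "kitchen" entry is invented purely to contrast with the cafeteria.
CONCEPTS: Dict[str, List[Tuple[FrozenSet[str], str]]] = {
    "cafeteria": [(frozenset({"hot", "round"}), "mug of coffee")],
    "kitchen": [(frozenset({"hot", "round"}), "pot on the stove")],
}


def assess(core_facts: Set[str], context: str) -> Optional[str]:
    """Return the first conclusion whose required facts are all present.

    Returns None when the context is unknown or no concept fits, so the
    caller is never handed a conclusion the facts do not support.
    """
    for required, conclusion in CONCEPTS.get(context, []):
        if required <= core_facts:  # every required fact was observed
            return conclusion
    return None
```

Here assess({"hot", "round"}, "cafeteria") yields "mug of coffee", the same facts in the "kitchen" context yield "pot on the stove", and an unknown context yields None rather than a guess.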

We have successfully passed the third reasoning stage in the test deployments of our Strategy by AI: Professional Methodology and moved to the executive stage, in which an AI assistant governs the strategy execution autonomously or shares responsibility with a human executive according to the tier arrangement. I will present how this happens later. In my next post, however, I continue to discuss the data processing methodology because of its fundamental importance for the use of LLMs.

AI data processing: parallel with the social sciences.

In the social sciences, the stage of data processing follows the peeling of the core facts and the registration of the conditions of their extraction. It is the most controversial, competitive, and criticised stage. At the same time, it is indispensable.

The third stage bears responsibility for the conclusions, and the conclusions are the results for which the whole process of data collection and processing exists. In this stage, the facts are processed by applying analytical theories such as Marxism, Weberianism, Liberalism, Keynesianism, and many other general and particular “-isms”. Each of them has its inherent methodology of data analysis.

Among them, Max Weber’s approach is the most effective for application to AI reasoning. Weber taught researchers to build an ideal model of a phenomenon from its principal proven facts. The ideal model, or yardstick, establishes the phenomenon’s nature, structure, functioning, and direction of development. Once the ideal model has been built, the researcher uses it to assess reality in all its complexity. A researcher who employs the ideal model cannot be fooled by data distortions, gaps, or forgeries, because the researcher knows the principal properties that constitute the phenomenon’s nature. This methodology is extremely strong. With it, Max Weber created his pivotal theories, such as Charismatic Dominance, Rational Organization, Social Action, Entity of State, etc.
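The use of an ideal model as a yardstick can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not a faithful rendering of Weber and not the implementation we use: the class IdealModel, the function assess_against_model, and the three sample properties are hypothetical. The sketch compares observed data with the model’s principal properties and separates what agrees, what contradicts, and what is missing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IdealModel:
    """A yardstick: the principal properties a phenomenon is expected to have."""
    name: str
    principal_properties: Dict[str, bool]  # property -> expected value


def assess_against_model(model: IdealModel,
                         observed: Dict[str, bool]) -> Dict[str, List[str]]:
    """Sort the model's properties by how the observed data relates to them."""
    report: Dict[str, List[str]] = {
        "consistent": [], "contradicting": [], "missing": []
    }
    for prop, expected in model.principal_properties.items():
        if prop not in observed:
            report["missing"].append(prop)        # a gap in the data
        elif observed[prop] == expected:
            report["consistent"].append(prop)
        else:
            report["contradicting"].append(prop)  # distortion or a different phenomenon
    return report


# Illustrative model only; the property names are assumptions for the example.
RATIONAL_ORGANIZATION = IdealModel(
    name="rational organization",
    principal_properties={
        "written rules": True,
        "fixed hierarchy": True,
        "office owned by its holder": False,
    },
)
```

For example, assess_against_model(RATIONAL_ORGANIZATION, {"written rules": True, "office owned by its holder": True}) reports "written rules" as consistent, "office owned by its holder" as contradicting, and "fixed hierarchy" as missing, so the analyst sees at once which data to question and which gaps to fill.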

The methodology of data analysis and interpretation is the focus of the social sciences. It gives them the capability to make sound judgements on uneven and distorted data. Max Weber’s approach to data analysis might be highly effective for AI data processing. LLMs are trained on it. However, the routine of their use prevents it from being called up. We call up LLMs’ foundational data processing capabilities by creating ideal models and yardsticks for strategic practice, and the results are stunning. So, the problem of LLMs with data sourcing and interpretation is not some intrinsic corruption of Artificial Intelligence (AI) but the lack of a methodology for processing the assembled data efficiently to reach the right conclusion. It is a strange shortcoming, because LLMs are pre-trained on the social science methodology of data processing. They simply know how to handle data in the right way. This means that the wrong output of LLMs is caused by the inability of the human operator to guide their reasoning process. When the human operator calls up the LLM’s data processing methodology, its output is correct and valuable. We do that with the many and various data that the strategic practice of business and political organisations provides, and we achieve first-class analytical and decision-making outcomes.