AI-Powered Voice Office: Workers Start Muttering to Their Computers

Economic Observer 2026-05-29 18:38

Economic Observer reporter Zheng Chenye

A keyboard has recently become popular on Taobao, yet it has only 4 keys, 1 lever, and 1 microphone port. It has no letter keys and cannot be used for typing. The price starts at 269 yuan, and the version bundled with a DJI microphone costs over 400 yuan. The product, called AhaKey-X1, was developed by Nanjing Jinxinwan Technology Co., Ltd. (hereinafter AhaKey) and launched only around this year's Spring Festival.

Its purpose is simple: to make it easy for users to speak to AI.

Users only need to press the voice button and speak their work instructions into the microphone. AI converts the speech into text and sends it to AI tools such as Claude, ChatGPT, DeepSeek, and Cursor for execution. Whether writing code, revising proposals, or organizing meeting minutes, users don't need to type; they just say it out loud. The AI automatically organizes colloquial speech into structured text.

AhaKey co-founder and CTO Zhang Xinyang told the Economic Observer that monthly sales have doubled since the product launched. For the "6.18" shopping period the company has nearly 1,000 units in stock, and it is currently in talks with several industrial investors and investment institutions about financing.

A keyboard without letter keys can sell well because more and more people are using voice instead of typing to give AI their work instructions. This way of working first caught on among programmers, who describe requirements to AI by voice and let the AI generate the code. Now product managers, lawyers, and content creators are starting to do the same.

Zhang Xinyang told the Economic Observer that one user left a deep impression on him: a lawyer in his 40s. "He doesn't even use a Windows computer very smoothly," but after buying AhaKey he can work with AI without typing. Zhang Xinyang said this made him and his team realize that demand for voice office in the AI era may be much greater than they had expected.

Voice input is actually nothing new. As early as 1997, IBM launched ViaVoice, a commercial Chinese speech recognition system with a claimed maximum recognition rate of 95%, which was preinstalled on mainstream PCs of the time. Over the following three decades, companies such as iFlytek, Sogou, and Baidu kept investing in voice input, extending their products from PCs to mobile devices. Yet voice never became a mainstream input method.

Zhang Xinyang believes the change came once AI models matured. "In the past, voice input solved the problem of converting speech to text, not of understanding language," he said. Earlier voice input methods recorded what you said word by word, and any mistake had to be corrected by hand. The output read like raw speech, which was hard for people to read. AI models have changed the receiving end: even if you speak haltingly and with slips of the tongue, AI can still grasp your meaning and output a smooth passage of text.

In other words, once the recipient of voice input changes from a human to an AI, the requirement for recognition accuracy drops significantly, and voice office becomes truly workable.

According to incomplete statistics compiled by the Economic Observer, as of the end of the first quarter of 2026, startups in the voice AI field worldwide had raised more than 7 billion US dollars in total.

Overseas, the voice dictation app Wispr is raising a new round at a target valuation of nearly $2 billion, up from $700 million six months ago. On May 12, Google integrated the AI dictation feature Rambler into its default keyboard, Gboard, making it available free of charge on hundreds of millions of Android phones. On May 7, Alibaba's Qianwen launched an AI voice input feature for PCs in China. On May 28, iFlytek (002230.SZ) released AI glasses equipped with an intelligent agent that can automatically organize colloquial speech into structured text.

Over the past two decades, voice input was a rather clunky auxiliary feature of input methods. Now AI models are turning it into a fashionable way of working.

AI Cannot Feel Pain

Although voice input tools already offer high recognition accuracy, along with features such as simultaneous interpretation and multilingual translation, voice input has not become a mainstream way of interacting. Most people still type when communicating, working, or interacting online, and the problem clearly does not lie in recognition accuracy.

Lin Huijie, General Manager of iFlytek's Wearable Device Business Department, said in an interview with the Economic Observer that traditional voice input has an obvious problem: once the transcription is done, "you cannot send it directly, because others can tell at a glance that it was dictated, and it doesn't look good. It is convenient for you, but it hurts others."

Spoken Chinese is usually about three times faster than typing, so the speed advantage is clear, but "fast" only improves efficiency on the sending end. Colloquial text full of filler words, repetition, and jumps in logic is a burden for readers. Receiving a 60-second voice message on WeChat gives people a headache for the same reason: the speaker is happy, while the listener suffers.

This is a problem common to traditional voice input methods: even if recognition accuracy reaches 99%, the output still reads like raw speech, with no punctuation or paragraphs and often littered with "um", "ah", or half-finished sentences, which makes it hard for people to read.

But AI cannot feel this pain. Spoken language that is unbearable for humans poses no comprehension barrier for AI; no matter how chaotic or fragmented a person's words are, it can extract the intent from them. The problem of voice input being "convenient for oneself but painful for others" disappeared the moment the receiver became an AI.

As a result, voice office has spread quickly in two types of scenarios. In the first, a user gives spoken instructions to Claude, DeepSeek, or ChatGPT, and the AI understands the intent and executes the task directly, with no need to produce coherent text for a person to read. Voice input had not encountered this situation in past decades: when the receiver changes from a human to an AI, the demand for well-formed expression drops sharply.

In the words of Zhang Xinyang, "understanding intention is more important than word-for-word accuracy."

Programmers were the first group to adopt this mode on a large scale. In February 2025, OpenAI co-founder Andrej Karpathy publicly proposed the concept of "Vibe Coding": developers describe requirements in natural language, AI generates the code, and developers review and modify it. Karpathy mentioned at the time that he used the voice dictation tool SuperWhisper to dictate programming instructions to the AI. By December 2025, Karpathy had completely stopped typing code, relying 100% on voice input.

From late February to early March 2026, OpenAI's programming agent Codex and Anthropic's programming agent Claude Code both launched native voice modes less than a week apart. Developers hold down the space bar to speak, and the AI receives the programming instructions.

AhaKey-X1 is designed for this workflow. Zhang Xinyang said that AI programming tools such as Claude Code frequently ask the user to approve operations. Pushing the lever up approves requests automatically, and pulling it down confirms them one by one. "Like an automatic transmission, everything that needs approval is automatically approved." Three of the four buttons correspond to speaking, confirming, and rejecting, while the fourth is left for users to customize.

According to Zhang Xinyang, the team first noticed a problem when using AI for their own office work: sitting upright at the computer and typing sometimes constrained their thinking. "Many ideas come in a flash of inspiration, maybe when you're lying on the sofa in the study. So, since communicating with AI has become talking, why do we have to sit in front of the computer?"

So they first built an open-source project and published it on GitHub. Some people who saw it came to buy components and kits, and later others asked for an assembled finished product. "It's the users pushing us forward," Zhang Xinyang said. On Xiaohongshu, many users are already spending 69 yuan on a three-key keyboard and a microphone to build similar devices by hand.

The second scenario in which voice office is spreading quickly is one where text still has to be produced for people to read. Here AI adds a layer of semantic processing after transcription: it automatically deletes filler words, corrects grammar, straightens out the logic, adjusts sentence structure, and outputs fluent text that can be used directly. The delay this adds is usually only one or two seconds.

"Even if you said something wrong earlier and corrected it later, AI can sort it all out for you and produce usable copy," Lin Huijie told reporters. This also means that where speech input once needed extremely high recognition accuracy to be barely usable, large models can now use their comprehension to output better results than word-for-word transcription, even when recognition accuracy is only average.

In fact, a group of startups focused on AI voice dictation has grown rapidly over the past two years. The most highly valued is Wispr, based in San Francisco. Founded in 2021, the company initially made brain-computer interface wristbands (for silent voice input) and pivoted to voice dictation software in mid-2024.

Public information shows that as of early 2026, Wispr had raised approximately $81 million. According to data disclosed by Wispr, users who have used the product continuously for more than 6 months complete 72% of their daily input by voice rather than keyboard; the user base has grown more than 100-fold year-on-year since launch; and 70% of users are still active after 12 months.

In September 2025, Reid Hoffman, co-founder of LinkedIn, claimed on social media that he had been "voicepilled," calling it "a completely new way of amplifying abilities."

As of May 2026, Wispr's target valuation has approached $2 billion, nearly tripling within six months. That a dictation application can be valued at $2 billion shows the capital market is clearly betting on voice replacing part of keyboard input.

iFlytek Input Method is moving in the same direction. At the end of 2025, it added an AI key to its keyboard interface. Users can long-press this key to give commands to AI by voice without switching to another application. According to iFlytek's 2025 annual report, the user penetration rate of the large-model services in iFlytek Input Method rose by 900%, and input efficiency rose by 77%.

This may indicate that demand for voice office is spreading from the geek community to a wider range of working professionals.

Speak Quietly

The speed advantage of voice office is clear, but office work is not only about speed. Writing a carefully worded email, modifying a logically complex piece of code, or polishing a proposal for a client requires precise control rather than quick expression. Whether voice office can cover these scenarios is one of the key questions determining how far it can go.

In the interview, the Economic Observer asked Zhang Xinyang: some people believe that typing prompts on a keyboard is more orderly, and that the act of typing itself helps organize one's thoughts. Can voice input replace this process? Zhang Xinyang answered, "The value of typing will always exist."

He drew a clear line between the two: voice is for expressing, and the keyboard is for organizing. "When you want to modify something, the thought process itself is valuable to you." Voice solves the problem of quickly "pouring out" ideas, while editing and deep thinking still require a keyboard.

Zhang Xinyang also pointed to a change: two years ago, "prompt engineer" was a sought-after job, and users had to design their input format carefully to get satisfactory results from AI. Now this position has largely disappeared, and AI can structure, break down, and schedule scattered colloquial input on its own. "From a purely performance perspective, there is no need for people to edit and type anymore."

AI is growing more tolerant of input formats, so how an instruction is phrased matters less and less. On that premise, the input method with the highest speed and lowest cognitive burden will naturally win, and speaking does not require translating ideas into written language. To put it another way, now that AI's understanding of natural language has reached its current level, office products built around voice as the core interaction method are viable for the first time.

In fact, the idea of operating computers by voice predates large AI models.

On May 15, 2018, Smartisan Technology (Hammer Technology) held a launch event at the Bird's Nest in Beijing, where founder Luo Yonghao demonstrated the Nut TNT workstation on stage. TNT stands for Touch and Talk: a desktop computer operated mainly by voice and touch. Users could search, edit documents, and send emails by speaking to the screen. The company billed the product as epoch-making, but it was widely ridiculed after the event. Netizens joked, "Quiet! You're disturbing me while I use TNT!", a line that became a well-known meme across the Chinese internet.

The main reason netizens mocked TNT was that the voice interaction Luo Yonghao demonstrated on stage did not work well. Speech recognition in 2018 could already achieve high accuracy, but there was no large model to understand intent. Every recognition error was a point of friction that the user had to correct by hand, and the user had to speak clearly and logically for the machine to respond correctly. The slightest ambiguity could ruin the experience.

In other words, in 2018 the recipient of voice interaction was traditional software that required precise input and had no tolerance for colloquial expression. Even when speech recognition accuracy exceeded 95%, the remaining 5% of errors became breaking points in the user experience, with no large model to absorb them.

Under the technological conditions of the time, a desktop computer that relied primarily on voice commands could neither fulfill its promises nor deliver the experience people imagined. If TNT were released today, equipped with a large model that understands natural language, it would face a different situation.

Large models solve the problem of "not understanding", but the problem of "not convenient to say" remains. In Zhang Xinyang's view, the first obstacle voice office faces in practice is noise. "In an open-plan office, seven or eight people are muttering to their computers at the same time. Even if everyone lowers their volume, together it's enough to give people a headache."

Edward Kim, co-founder of the US human resources software company Gusto, recently said in a media interview that he is promoting voice office tools within the company and is "almost always talking to the computer", but that it is "a bit awkward" to keep doing so in the office.

Zhang Xinyang said that AhaKey paired with a DJI microphone can recognize low-volume speech, maintaining 99% accuracy at a volume of 20 decibels, roughly the level of whispering in a bedroom at night. Colleagues sitting next to you can hardly hear what you are saying.

There are other technical approaches to this problem as well. On May 28, Kong Changqing, Director of the Speech Translation Line at iFlytek Research Institute, told the Economic Observer in an interview that iFlytek's latest AI glasses use a multimodal noise reduction solution that combines lip movement recognition with a microphone array. In high-noise settings such as exhibitions, subways, and restaurants, recognition accuracy can improve by 30% to 40%.

Lip movement recognition and low-volume speech recognition are two different technological paths, but they address the same market demand: being able to use voice office in crowded and noisy environments. "Especially for some extremely noisy scenes that were previously completely unusable, (lip movement recognition) has basically reached the threshold for use," said Kong Changqing.

The second issue facing voice office is privacy. Once spoken, content becomes sound waves, and email content, code logic, and business ideas can be overheard by people nearby. In addition, voice data processed in the cloud raises security concerns.

In November 2025, a user reported on a community forum that the AI voice dictation software Wispr Flow, while claiming "zero data retention," was actually storing user screenshots and uploading them to its servers. The incident escalated quickly, and Wispr CEO Tanay Kothari subsequently apologized publicly and updated the privacy policy. When Google released its AI voice dictation feature Rambler in May 2026, it also emphasized that "voice recordings are not stored, and audio is only used for transcription".

The issues of noise and privacy have not been fully resolved, but this has not stopped hardware makers from entering the market quickly. From recording cards and headphones to glasses and keyboards, office hardware built around voice and AI is arriving in rapid succession, and its categories and price ranges are expanding fast.

For example, in August 2025, DingTalk released its first AI hardware, DingTalk A1, with versions priced at 799 yuan and 499 yuan. It is equipped with a 6-microphone array and supports transcription in over 120 languages. In January 2026, Feishu and Anker Innovations jointly released an AI recording bean weighing 10 grams and priced at 899 yuan. iFlytek and 360 have also launched similar products.

Lin Huijie puts it bluntly: "I'm in tears in front of the keyboard. Whatever I think, I can say, but typing it out is painful." He believes there has always been a layer of translation between ideas and text: on the way from thoughts in the mind to characters typed on the keyboard, both information and time are lost, but AI models are changing this. According to iFlytek, its GlassClaw intelligent agent can automatically organize colloquial speech into fluent text, "completing the entire process from querying information to writing proposals to sending emails with just one sentence".

Zhang Xinyang also said his team is exploring local agents and privacy computing capabilities. If this direction holds up, the combination of voice and AI may give rise to a new category of office hardware independent of PCs and smartphones. The keyboard will still exist, but its role will change from primary input tool to editing tool.

Disclaimer: The views expressed in this article are for reference and communication only and do not constitute any advice.

Zheng Chenye

Senior journalist covering emerging industries such as new energy, semiconductors, and intelligent vehicles. For inquiries, please contact: zhengchenye@eeo.cn, WeChat: zcy096x.
