TinyChat15M: A Small Conversational Language Model

The development of large language models (LLMs) like GPT-4 and Google’s Gemini marks a significant breakthrough in artificial intelligence. These models, with hundreds of billions of parameters (GPT-3 has 175 billion, and Google’s PaLM has 540 billion), can handle a wide range of tasks, from creative writing to code generation. However, these advancements come at a considerable cost. For example, training GPT-3 required approximately 3,640 petaflop/s-days of compute, amounting to millions of dollars in expenses, and the full model demands over 700 GB of VRAM to run at 32-bit precision.

img Meta’s AI Research SuperCluster (Photo: Meta Platforms Inc.)

The high resource demands and environmental impact of large models have spurred the development of smaller, more efficient language models, known as small language models (SLMs). These SLMs are designed to perform specific tasks while using only a fraction of the resources required by their larger counterparts. For instance, training GPT-3 consumed 1287 MWh of electricity, resulting in 502 metric tons of carbon emissions—equivalent to the annual emissions of 112 gasoline-powered cars. In environments with limited computational resources, such as mobile devices and edge computing platforms, SLMs provide a practical and efficient alternative.

This blog introduces TinyChat15M, a 15-million parameter conversational language model built on the Meta Llama 2 architecture. TinyChat15M is designed to operate on devices with as little as 60 MB of free memory and has been successfully run on a Sipeed LicheeRV Nano W, a mini-sized RISC-V development board with just 256 MB of DDR3 memory. Inspired by Dr. Andrej Karpathy’s llama2.c project, TinyChat15M demonstrates how small conversational language models can be both effective and resource-efficient, making advanced AI capabilities more accessible and sustainable.

TinyChat: A Synthetic Dataset for Short Chat Conversations

TinyChat15M was trained on the TinyChat dataset, a synthetic collection of 1,000,000 short chat conversations written in BASIC English, a simplified form of English whose name is a backronym for British American Scientific International and Commercial English, designed for clear and simplified communication. The dataset was generated with GPT-4o mini, a smaller, cost-efficient member of the GPT-4o family, prompted to construct dialogues primarily from BASIC English words and grammar. To maintain the coherence and fluidity of the conversations, some non-BASIC English words were also allowed. The development of this dataset was inspired by the TinyStories dataset, which explores how small language models can produce coherent English text. Following the methodology outlined by Eldan and Li in their work, “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, the TinyChat dataset carefully balances simplicity with linguistic coherence through the selective use of vocabulary and structure.

img The TinyChat dataset hosted on Hugging Face

BASIC English is a simplified version of standard English, designed by linguist and philosopher Charles Kay Ogden. It features a greatly reduced vocabulary and grammar, intended as an international auxiliary language and a tool for teaching English as a second language. Ogden introduced it in his 1930 book, Basic English: A General Introduction with Rules and Grammar. The core of BASIC English consists of 850 words, categorized into operations (100 words), things (400 words), picturable things (200 words), general qualities (100 words), and opposite qualities (50 words). These word roots are expanded using a defined set of affixes and forms to cover a broader range of expression.

To generate the TinyChat dataset, these 850 core words were divided into nouns, verbs, and adjectives, following a methodology similar to that of Eldan and Li in the TinyStories paper. In their approach, they created a vocabulary of about 1,500 basic words, mimicking the vocabulary of a typical three- to four-year-old child, separated into nouns, verbs, and adjectives. For each story, three words (one verb, one noun, and one adjective) were randomly selected, and the model was prompted to create a story combining them, thereby increasing the diversity of the dataset.

Using this approach, the nouns, verbs, and adjectives from BASIC English were combined, resulting in over 8 million possible word combinations. Due to the cost of generating the synthetic dataset, only a million of these combinations were randomly sampled and used as prompts for the GPT-4o mini model. The prompt template for this process is as follows:

System Prompt

You are a helpful, respectful, and honest assistant. Always aim to provide the most helpful and safe answers. Your responses should not contain any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Ensure that your answers are socially unbiased and positive. If a question does not make sense or is not factually coherent, explain why rather than providing an incorrect answer. If you do not know the answer, do not share false information. Use only BASIC (British Academic Scientific International Commercial) English words, along with words that a typical three to four-year-old would understand. Follow these specific grammar rules: add -es or -ies to form plural nouns, -ing or -ed to turn verbs into adjectives, -ing or -er to turn verbs into nouns, -ly to turn adjectives into adverbs, and -er or -est to compare amounts. Use un- to create opposites of adjectives. Form questions by inverting the word order with ‘do’. Use normal English changes for verbs and pronouns, create compound words from two nouns (e.g., ‘football’) or a noun and a direction (e.g., ‘sundown’), and write measures, numbers, money, days, months, years, and clock times in English words with no symbols or special characters, except for commas and periods. Do not use any punctuation marks except for commas and periods. Spell out all numbers as simple English words. Use industry-specific or scientific terms as needed, such as ‘plural’, ‘conjugation’, ‘noun’, ‘adjective’, ‘adverb’, ‘qualifier’, ‘operator’, ‘pronoun’, and ‘directive’, which are necessary in language teaching but not part of BASIC English.

User Prompt

Generate a factually and logically coherent small talk conversation between a User and an Assistant, where they exchange 6-10 sentences in total. Each sentence spoken by the User and the Assistant should be between 10-20 words long. The conversation should be formatted as a single paragraph without line breaks or other textual formatting. It should begin with a {starting}, convey a general feeling of {feeling}, and conclude with a {ending} ending. The conversation must include the noun ‘{noun}’, the verb ‘{verb}’, and the adjective ‘{adjective}’. Use the following text format: [INST] generate User’s sentence here [/INST] generate Assistant’s sentence here.

In addition to the noun, verb, and adjective combinations, each prompt was carefully crafted to include a specific starting tone, convey a particular general feeling, and conclude with a defined ending. The starting tones are categorized as greetings, questions, or suggestions. The general feelings expressed include happiness, surprise, badness, fearfulness, anger, disgust, and sadness. The endings are designed to be either conclusive, reflective, or open. These combinations of starting tones, feelings, endings, verbs, nouns, and adjectives introduce both tonal and content diversity into the generated textual data. To ensure compatibility with Meta’s Llama 2 chat template, the conversations are encapsulated within the Llama 2 instruction tags “[INST]” and “[/INST]”. The dataset is open source and hosted on Hugging Face.
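To make this concrete, below is a minimal Python sketch of how such prompts might be assembled and sent to GPT-4o mini. The word lists, the truncated template strings, and the helper name sample_user_prompt are illustrative assumptions rather than the project’s actual code; only the placeholder structure follows the templates above.

```python
import random
from openai import OpenAI  # assumes the official openai Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative word lists; the real project derives these from the 850 BASIC English words.
NOUNS = ["apple", "train", "friend"]
VERBS = ["walk", "help", "open"]
ADJECTIVES = ["happy", "small", "bright"]
STARTINGS = ["greeting", "question", "suggestion"]
FEELINGS = ["happiness", "surprise", "badness", "fearfulness", "anger", "disgust", "sadness"]
ENDINGS = ["conclusive", "reflective", "open"]

SYSTEM_PROMPT = "You are a helpful, respectful, and honest assistant. ..."  # full text above
USER_TEMPLATE = (
    "Generate a factually and logically coherent small talk conversation ... "
    "It should begin with a {starting}, convey a general feeling of {feeling}, and "
    "conclude with a {ending} ending. The conversation must include the noun "
    "'{noun}', the verb '{verb}', and the adjective '{adjective}'. ..."
)

def sample_user_prompt() -> str:
    """Fill the template with one randomly sampled combination."""
    return USER_TEMPLATE.format(
        starting=random.choice(STARTINGS),
        feeling=random.choice(FEELINGS),
        ending=random.choice(ENDINGS),
        noun=random.choice(NOUNS),
        verb=random.choice(VERBS),
        adjective=random.choice(ADJECTIVES),
    )

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": sample_user_prompt()},
    ],
)
print(response.choices[0].message.content)
```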

Training the TinyChat15M Small Language Model

TinyChat15M was built on Dr. Andrej Karpathy’s llama2.c project, modified to suit the development and training of a conversational small language model. Although the main C file, run.c, already includes functionality for chatting with a language model, both the tokenization process and the chat loop required changes to better accommodate a conversational model. In the TinyChat dataset, each conversation is already formatted with Llama 2’s instruction tags: the user’s dialogue is encapsulated within “[INST]” and “[/INST]” tags, and the model’s responses follow. In addition, to help the model distinguish between individual dialogues, Beginning of Sequence (BOS) and End of Sequence (EOS) tags are dynamically added during tokenization. The BOS and EOS tags have token piece values of <s> and </s>, with corresponding token IDs of 1 and 2. Training the model in this manner enables it to predict and complete sentences on its own, making it a true conversational small language model that can be used with llama2.c’s main C file.
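As a rough illustration, here is a minimal Python sketch of this tokenization step, assuming Llama 2’s SentencePiece tokenizer is available locally as tokenizer.model (the file name and the helper are assumptions for illustration; the BOS/EOS IDs of 1 and 2 match Llama 2’s tokenizer):

```python
import sentencepiece as spm

BOS_ID, EOS_ID = 1, 2  # Llama 2 tokenizer: <s> = 1, </s> = 2

# Assumes Llama 2's SentencePiece model file is available locally.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

def tokenize_conversation(text: str) -> list[int]:
    """Encode one TinyChat conversation, wrapping it in BOS/EOS so the
    model learns where a dialogue begins and ends."""
    return [BOS_ID] + sp.encode(text) + [EOS_ID]

example = "[INST] hello how are you [/INST] i am happy to see you my friend"
print(tokenize_conversation(example))
```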

img Raw text from the TinyChat dataset and the tokenized text with BOS and EOS tags highlighted in red

Additional changes were also needed in llama2.c’s main C file. Two key modifications are handling the BOS tag after the model finishes a sentence and filtering the instruction tags out of the displayed output. Since the model learns to generate these tags in sequence, it may occasionally produce sentences that resemble those originally spoken by the user in the TinyChat dataset: the model cannot inherently distinguish between sentences spoken by the user and those it generated itself, because the instruction tags only separate one question, suggestion, or response from another. With these two modifications, TinyChat15M can be trained and run using the framework provided by the llama2.c project.
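The actual changes live in run.c’s C code; purely to illustrate the logic, here is a hypothetical Python transcription of the modified chat loop. The generate_pieces callback and all other names here are assumptions for readability, not the project’s API.

```python
# Hypothetical transcription of the chat-loop logic added to run.c;
# generate_pieces is an assumed callback that yields (token_id, token_piece)
# pairs from the model for a given prompt.
BOS_ID = 1
INST_OPEN, INST_CLOSE = "[INST]", "[/INST]"

def chat_turn(generate_pieces, user_text: str) -> str:
    """Stream the model's reply to one user turn, stopping when the model
    starts a new dialogue (BOS) and hiding any instruction tags it emits."""
    prompt = f"{INST_OPEN} {user_text} {INST_CLOSE}"
    pieces = []
    for token_id, piece in generate_pieces(prompt):
        if token_id == BOS_ID:  # a new conversation is starting: stop here
            break
        pieces.append(piece)
    text = "".join(pieces)
    # The instruction tags span several tokens, so strip them from the final string.
    return text.replace(INST_OPEN, "").replace(INST_CLOSE, "").strip()
```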

TinyChat15M was trained on Google Colab, a hosted Jupyter Notebook service from Google that provides free access to computing resources, including GPUs and TPUs, with no setup required. Although TinyChat15M is a small language model, it was trained on NVIDIA’s A100 Tensor Core GPU, available through Google Colab Pro. The training process ran for 200,000 iterations and took approximately 5 hours to complete. At the end of the training, I obtained both the original PyTorch .pt file and the llama2.c format .bin file. The .bin file is used for inference with the run.c file. The models are open source and available on Hugging Face.
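The roughly 60 MB footprint mentioned earlier follows directly from the parameter count: fifteen million 32-bit floats occupy about 60 MB. As a back-of-the-envelope check, here is a short Python calculation assuming llama2.c’s usual 15M configuration of dim 288, 6 layers, and a 32,000-token vocabulary; these dimensions are an assumption on my part, not stated in the post.

```python
# Rough parameter count for a Llama 2 style model, assuming llama2.c's
# common 15M configuration; the dimensions here are assumptions.
dim, n_layers, vocab = 288, 6, 32_000
hidden = 768  # llama2.c rounds 2*4*dim/3 up to a multiple of 32

embed = vocab * dim                    # token embedding (shared with the output head)
attn = 4 * dim * dim                   # wq, wk, wv, wo per layer
ffn = 3 * dim * hidden                 # w1, w2, w3 per layer
params = embed + n_layers * (attn + ffn)

print(f"{params / 1e6:.1f}M parameters")        # ~15.2M
print(f"{params * 4 / 1e6:.0f} MB at float32")  # ~61 MB
```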

Running TinyChat15M on the Sipeed LicheeRV Nano W

The Sipeed LicheeRV Nano W, equipped with 256 MB of DDR3 memory, serves as an ideal testbed for running TinyChat15M. Demonstrating TinyChat15M on this compact device highlights the model’s ability to deliver real-time conversational AI despite limited memory resources. This setup is well suited to scenarios where computational and memory efficiency are critical, yet functional AI capabilities remain essential. Despite its small size, the LicheeRV Nano W is capable of running Debian, an open-source Linux distribution.

img The Sipeed LicheeRV Nano W next to an old stamp from Sri Lanka (sourced from my dad’s stamp collection)

The TinyChat15M model achieves an average generation speed of 9.40 tokens per second across five runs on the LicheeRV Nano W. The video below demonstrates TinyChat15M in action on the LicheeRV Nano W, where I, as the User, engage in a fictional conversation with an Assistant powered by the TinyChat15M model in a corporate setting. Despite having no prior context, the Assistant responds appropriately, mimicking a conversation one might have with a real colleague.

In the demo above, we can observe that when the Assistant is generating a reply, the CPU usage spikes to about 96%, while memory usage reaches approximately 27% of the available 228 MB, which equates to around 62 MB. Despite these compute and memory constraints, the TinyChat15M model is able to generate contextually coherent text at a reasonable speed.

I have fully open-sourced this project: the code for generating the synthetic TinyChat dataset, as well as the code for training and running inference with the TinyChat15M model, can be found on my GitHub. If you have any suggestions or improvements, feel free to raise an issue or submit a pull request. With this project, I hope to inspire the AI community to focus on developing more general-purpose conversational small language models. Advances in small language models would democratize access to AI, making it more affordable and accessible for everyone.

This post is licensed under CC BY 4.0 by the author.