Jina AI has released Reader-LM-0.5B and Reader-LM-1.5B, two small language models trained to convert raw HTML into clean markdown. Jina AI previously introduced Jina Reader, an API that converts URLs into markdown suitable for large language models, but that approach ran into problems with content filtering and complex HTML structures. The Reader-LM models were developed to overcome these limitations and handle HTML-to-markdown conversion more efficiently.

Despite their small size, the models are reported to outperform larger models at HTML-to-markdown conversion, and they do so without requiring expensive infrastructure. They are designed for long-context inputs and for selective copying: deciding which parts of the HTML to carry over into the markdown and which to discard. A context length of up to 256K tokens lets them process the lengthy, noisy HTML found on the web, and their multilingual capabilities make them usable in global applications.

Training involved preparing high-quality pairs of raw HTML and corresponding markdown, with optimizations so that the models handle the conversion task without unnecessary computational overhead. In extensive evaluations, both models generated cleaner, more accurate markdown than several large language models.

Reader-LM models have practical applications in both individual and enterprise settings, where efficient data processing and multilingual support make them suitable for a range of industries and regions.
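To make the conversion task concrete, here is a minimal rule-based sketch of selective copying from HTML to markdown, written with Python's standard `html.parser` module. It is not Reader-LM and not Jina AI's method: the class name `SelectiveCopier`, the function `html_to_markdown`, and the hard-coded list of tags to skip are all hypothetical choices made for illustration. Hand-written rules like these tend to break down on complex real-world pages, which is the gap a trained model is meant to close.

```python
from html.parser import HTMLParser


class SelectiveCopier(HTMLParser):
    """Toy converter (illustrative only): keeps visible text, drops page noise."""

    # Assumed heuristics, not taken from Reader-LM: elements treated as noise.
    SKIP = {"script", "style", "noscript", "nav", "footer"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished markdown blocks
        self.current = []    # text pieces of the block being built
        self.prefix = ""     # markdown prefix for the current block
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.href = None     # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in self.BLOCKS:
            self._flush()
            self.prefix = self.HEADINGS.get(tag, "- " if tag == "li" else "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.current.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in self.BLOCKS:
            self._flush()
        elif tag == "a":
            self.current.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Copy text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.current.append(data)

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to markdown blocks separated by blank lines."""
    parser = SelectiveCopier()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that was not inside a block
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    page = (
        "<html><head><style>h1{color:red}</style></head><body>"
        "<nav>Home</nav><h1>Reader-LM</h1>"
        '<p>Read the <a href="https://example.com/docs">docs</a>.</p>'
        "<script>var x = 1;</script></body></html>"
    )
    print(html_to_markdown(page))
```

The sketch handles only headings, paragraphs, list items, and links, and it separates every block with a blank line. Tables, nested lists, images, and inline formatting are left out on purpose to keep the example short.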
The release of Reader-LM-0.5B and Reader-LM-1.5B marks a significant advance in small language model technology, giving developers and enterprises a powerful tool for optimizing their data workflows.