  1. LLaVA

    LLaVA Model. We introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose …

  2. LLaVA-Plus - GitHub Pages

    🌋 LLaVA-Plus Model. We have developed LLaVA-Plus, a general-purpose multimodal assistant that extends LLaVA by incorporating a large and diverse set of external tools that can be …

  3. LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

    Jan 30, 2024 · It is natural to further tune the model on video data for a performance boost. Our analysis reveals that a mixed training regimen of video and image data is essential for …

  4. LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large …

    May 25, 2024 · This task enables the model to interact with a 3D environment to solve problems or answer questions by navigating and manipulating its surroundings, which are essential for …

  5. LLaVA-Interactive

    Jun 10, 2024 · LLaVA-Interactive is an all-in-one demo that connects three LV models in one interactive session for image chat, segmentation, and generation/editing, which can complete …

  6. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

    Jan 30, 2024 · Compared with LLaVA-1.5, LLaVA-NeXT increases the input image resolution to 4x more pixels and improves the visual instruction tuning data mixture for better reasoning, OCR, and …

  7. LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities …

    May 10, 2024 · On January 30, 2024, we unveiled LLaVA-NeXT, a state-of-the-art Large Multimodal Model (LMM) developed using a cost-effective training method leveraging open …

  8. LLaVA-OneVision: Easy Visual Task Transfer

    Aug 5, 2024 · Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three …

  9. LLaVA-NeXT: What Else Influences Visual Instruction Tuning …

    May 25, 2024 · Scaling the model size of the LLM is more effective than scaling the image encoder at yielding improved performance. Gains from the latter are more related to its visual input …

  10. LLaVA-Grounding

    We present an end-to-end model, which connects a Large Multimodal Model (LMM) with a grounding model to facilitate grounded visual chat. Our model supports both object and pixel …