Web-based entity resolution, particularly in online marketplaces and e-commerce ecosystems, is critical for accurately identifying and matching offers that refer to the same product across the web. Traditional approaches to entity resolution have relied primarily on textual information, but the increasing availability of other data modalities, such as product images, has motivated multimodal approaches. We propose an intermediate fusion architecture for multimodal product matching that combines textual and visual information.
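
To make the idea of intermediate fusion concrete, the following is a minimal sketch and not the architecture described in this work. It assumes each offer is already represented by a text feature vector and an image feature vector. Each modality is projected into a hidden representation, the two are concatenated, and a joint layer produces a single fused embedding that can be compared across offers. The class name `IntermediateFusion`, the layer sizes, the random weight initialization, and the cosine comparison are all illustrative assumptions.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Hypothetical untrained layer with small random weights and zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class IntermediateFusion:
    """Illustrative only: fuses modality-specific hidden representations."""

    def __init__(self, text_dim, image_dim, hidden_dim, fused_dim, seed=0):
        rng = random.Random(seed)
        self.text_proj = init_layer(text_dim, hidden_dim, rng)
        self.image_proj = init_layer(image_dim, hidden_dim, rng)
        # The joint layer sees both hidden representations side by side.
        self.joint = init_layer(2 * hidden_dim, fused_dim, rng)

    def embed(self, text_feat, image_feat):
        text_hidden = relu(linear(text_feat, *self.text_proj))
        image_hidden = relu(linear(image_feat, *self.image_proj))
        # Fusion happens here, after per-modality encoding but before
        # the final embedding (list "+" concatenates the two vectors).
        return linear(text_hidden + image_hidden, *self.joint)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


model = IntermediateFusion(text_dim=4, image_dim=3, hidden_dim=5, fused_dim=2)
offer_a = model.embed([0.2, 0.7, 0.1, 0.9], [0.5, 0.3, 0.8])
offer_b = model.embed([0.1, 0.6, 0.2, 0.8], [0.4, 0.3, 0.9])
match_score = cosine(offer_a, offer_b)
```

In this sketch the weights are untrained, so `match_score` carries no meaning; in practice the projections and the joint layer would be learned from labeled matching and non-matching offer pairs. The point is the position of the fusion step: early fusion would concatenate raw features before any encoding, while late fusion would score each modality separately and combine only the scores.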