Moving seamlessly between spoken number words and Arabic digits is common in everyday life. In this study, we systematically investigated the correspondence between auditory number words and visual Arabic digits in adults. Auditory number words and visual Arabic digits were presented concurrently or sequentially, and participants had to indicate whether they described the same quantity. We manipulated the stimulus onset asynchrony (SOA) between the two stimuli (Experiment 1: −500 ms to +500 ms; Experiment 2: −200 ms to +200 ms). In both experiments, we found a significant cross-modal distance effect. This effect was strongest for simultaneous stimulus presentation and decreased with increasing SOA. Numerical distance emerged as the most consistent significant predictor overall, in particular for simultaneous presentation. However, physical similarity between the stimuli was often a significant predictor of response times in addition to numerical distance, and at longer SOAs it was the only significant predictor. This pattern shows that SOA modulates the extent to which participants access quantity representations. Our results thus support the idea that a semantic quantity representation of auditory and visual numerical symbols is activated when participants perform a concurrent matching task, whereas at longer SOAs participants are more likely to rely on physical similarity between the stimuli. We also investigated whether individual differences in the efficiency of cross-modal processing were related to differences in mathematical performance. Our results are inconclusive as to whether the efficiency of cross-format numerical correspondence is related to mathematical competence in adults.