Speech Communication, vol. 175, 2025 (SCI-Expanded)
This study examines the interplay between third-party AI application programming interfaces (APIs) and adversarial machine learning. The investigation centers on vulnerabilities in AI models that rely on multiple black-box APIs, with particular emphasis on their susceptibility to attacks in speech and text recognition. We explore three attack strategies (targeted, indiscriminate, and adaptive targeting), each designed to exploit distinct facets of multi-modal inputs. The results highlight the trade-off between attack success, average target class confidence, and the number of word swaps and queries required. Targeted attacks achieve an average success rate of 76%, while adaptive targeting achieves a higher rate of 88%. Indiscriminate attacks attain a success rate of 73%, showing that they remain potent even without strategic tailoring. We also evaluate the efficiency of each strategy in terms of resource utilization. Adaptive targeting is the most efficient approach, requiring an average of 2 word swaps and 140 queries per attack instance, whereas indiscriminate targeting requires an average of 2 word swaps and 150 queries per instance.