Posts tagged: automation

Robot Vision (and action)!

Hello Gemini Robotics! Google surprised me a couple of weeks ago with some cool tech. I’m watching carefully for the developments that get AI into hardware, and Vision-Language-Action (VLA) has a lot of promise for creating more general and adaptable robots. It’s a good name: intelligence that integrates vision, language, and action, allowing robots to understand natural language commands and execute complex tasks in dynamic environments. This represents a move away from narrowly focused, task-specific systems toward robots that can generalize skills across different scenarios.

Google’s PaLM-E debuted in 2023 as one of the first attempts to extend large language models into the robotics domain, using a multimodal setup (text and vision) to instruct simple robot behaviors. While groundbreaking for its time, PaLM-E’s capabilities were limited in terms of task complexity and real-world robustness. Fast-forward to 2025, and Gemini Robotics takes these ideas much further by harnessing the significantly more powerful Gemini 2.0 foundation model. In doing so, Gemini’s Vision-Language-Action (VLA) architecture not only understands language and visual inputs, but also translates them into real-time physical actions. According to DeepMind’s official blog post—“Gemini Robotics brings AI into the physical world”—this new system enables robots of virtually any shape or size to perform a wide range of tasks in dynamic environments, effectively bridging the gap between large-scale multimodal reasoning and real-world embodied intelligence in a way that PaLM-E never could.

However, while its potential is transformative, there are still challenges to overcome in trust and safety, dealing with hardware constraints, and fine-tuning physical actions that might not perfectly align with human intuition.

Introduction

Google DeepMind’s Gemini Robotics is pretty cool. It’s an AI model designed to bring advanced intelligence into the physical world of robots. Announced on March 12, 2025, it represents the company’s most advanced vision-language-action (VLA) system, meaning it can see and interpret visual inputs, understand and generate language, and directly output physical actions. Built on DeepMind’s powerful Gemini 2.0 large language model, the Gemini Robotics system is tailored for robotics applications, enabling machines to comprehend new situations and perform complex tasks in response to natural language commands. This model is a major step toward “truly general purpose robots,” integrating multiple AI capabilities so robots can adapt, interact, and manipulate their environment more like humans do.

Origins and Development of Gemini Robotics

The development of Gemini Robotics is rooted in Google and DeepMind’s combined advances in AI and robotics over the past few years. Google had long viewed robotics as a “helpful testing ground” for AI breakthroughs in the physical realm. Early efforts laid the groundwork: for example, Google researchers pioneered systems like PaLM-SayCan, which combined language models with robotic planning, and PaLM-E, a multimodal embodied language model that hinted at the potential of integrating language understanding with robotic control. In mid-2023, Google DeepMind introduced Robotics Transformer 2 (RT-2), a VLA model that learned from both web data and real robot demonstrations and could translate knowledge into generalized robotic instructions. RT-2 showed improved generalization and rudimentary reasoning – it could interpret new commands and leverage high-level concepts (e.g. identifying an object that could serve as an improvised tool) beyond its direct training. However, RT-2 was limited in that it could only repurpose physical movements it had already practiced; it struggled with fine-grained manipulations it hadn’t explicitly learned.

Gemini Robotics builds directly on these foundations, combining Google’s latest large language model advances with DeepMind’s robotics research. It is built on the Gemini 2.0 foundation model, which is a multimodal AI capable of processing text, images, and more. By adding physical actions as a new output modality to this AI, the team created an agent that can not only perceive and talk, but also act in the real world. The project has been led by Google DeepMind’s robotics division (headed by Carolina Parada) and involved integrating multiple research breakthroughs. According to DeepMind, Gemini Robotics and a companion model called Gemini Robotics-ER were concurrently developed and launched as “the foundation for a new generation of helpful robots”. The models were primarily trained using data from DeepMind’s bi-arm robot platform ALOHA 2 (a dual-armed robotic system introduced in 2024). Early testing was focused on this platform, but the system was also adapted to other hardware – demonstrating control of standard robot arms (Franka Emika robots common in labs) and even a humanoid form (the Apollo robot from Apptronik). This cross-pollination of Google’s large-scale AI (Gemini 2.0) with DeepMind’s embodied AI research culminated in Gemini Robotics, which was formally unveiled in March 2025. Its development reflects a convergence of trends: ever-larger and more general AI models, and the push to make robots more adaptive and intelligent in uncontrolled environments.

Google DeepMind’s announcement comes amid a “race to make robots useful” in the real world, with several companies striving for similar breakthroughs. Notably, just a month prior, robotics startup Figure AI ended a collaboration with OpenAI after claiming an internal AI breakthrough for robots – underscoring that multiple players are working toward general robotic intelligence (awesome). Within this context, Gemini Robotics emerged as one of the most ambitious efforts, leveraging Google’s unique assets (massive language models, vast web data, and prior robotics investments) to create a generalist “brain” for robots.

The Vision-Language-Action (VLA) Model Explained

At the heart of Gemini Robotics is its vision-language-action (VLA) architecture, which seamlessly integrates perception, linguistic reasoning, and motor control. In practical terms, the model can take visual inputs (such as images or camera feeds of a scene), along with natural language inputs (spoken or written instructions), and produce action outputs that drive a robot’s motors or high-level control system. This tri-modal integration allows a robot equipped with Gemini to perceive its surroundings, understand what needs to be done, and then execute the required steps – all in one fluid loop.

Under the hood, Gemini Robotics leverages the powerful representation and reasoning capabilities of the Gemini 2.0 LLM to interpret both the visual scene and the instruction. The visual input is processed by advanced vision models (DeepMind has indicated that PaLI-X and related vision transformers were adapted as part of earlier systems like RT-2), enabling the AI to recognize objects and understand spatial relationships in the camera view. This is paired with Gemini’s language understanding, which can comprehend instructions phrased in everyday, conversational language – even in different wording or multiple languages. By fusing these inputs, the model forms an internal plan or representation of what action to take. The crucial innovation is that the output is not text, but action commands: Gemini Robotics generates a sequence of motor control tokens or low-level instructions that directly control the robot’s joints and grippers. These action outputs are encoded similarly to language tokens – for example, a series of numeric tokens might represent moving a robotic arm to certain coordinates or applying a specific force. This design treats actions as another “language” for the AI to speak.
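
To make the “actions as tokens” idea concrete, here is a toy sketch (my own illustration, not DeepMind’s actual tokenizer) of how continuous motor commands could be discretized into token IDs and decoded back for execution; the bin count and joint limits are made-up values.

# Purely illustrative sketch (not Google's actual scheme): discretizing a
# continuous arm command into integer "action tokens" the way a VLA model
# might emit them alongside text tokens.
import numpy as np

N_BINS = 256                                # assumed vocabulary size per action dimension
LOW  = np.array([-1.0, -1.0, -1.0, 0.0])    # hypothetical limits for dx, dy, dz, gripper
HIGH = np.array([ 1.0,  1.0,  1.0, 1.0])

def encode_action(action: np.ndarray) -> list[int]:
    """Map a continuous command (dx, dy, dz, gripper) to discrete token IDs."""
    scaled = (action - LOW) / (HIGH - LOW)                  # normalize to [0, 1]
    return list(np.clip(scaled * (N_BINS - 1), 0, N_BINS - 1).astype(int))

def decode_action(tokens: list[int]) -> np.ndarray:
    """Invert the mapping so a low-level controller can execute the tokens."""
    return LOW + (np.array(tokens) / (N_BINS - 1)) * (HIGH - LOW)

print(encode_action(np.array([0.25, -0.5, 0.1, 1.0])))      # -> [159, 63, 140, 255]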

An intuitive example of how the VLA model works is the task: “Pick up the banana and put it in the basket.” Given this spoken command, a Gemini-powered robot will use its camera to scan the scene, recognize the banana and the basket (drawing on its visual training), understand the instruction and the desired outcome, and then generate the motor actions to reach out, grasp the banana, and drop it into the basket. All of this can occur without hard-coding specific moves for “banana” or “basket” – the model’s general vision-language knowledge enables it to figure it out. Another striking example demonstrated by Google DeepMind was: “Fold an origami fox.” Without having been explicitly trained on that exact task, the system was able to combine its knowledge of origami (learned from text or images during pre-training) with its ability to control robot hands, successfully folding a paper into the shape of a fox. This showcased how Gemini’s world knowledge can translate into physical skill, even for tasks the robot has never performed before. The VLA architecture effectively allows the robot to generalize concepts to new actions – it “understands” what folding a fox entails and can execute the fine motor steps to do so, all guided by a single natural-language instruction.

Technical Functionality and Capabilities

Gemini Robotics is engineered around three key capabilities needed for versatile robots: generality, interactivity, and dexterity. In each of these dimensions, it represents a leap over previous systems, inching closer to the vision of a general-purpose helper robot.

  • Generality and World Knowledge: Because it is built atop the Gemini 2.0 model (which was trained on vast internet data and encodes extensive commonsense and factual knowledge), Gemini Robotics can handle a wide variety of tasks and adapt to new situations on the fly. It leverages “Gemini’s world understanding to generalize to novel situations,” allowing it to tackle tasks it never saw during training. For example, researchers report that the model-controlled robot could perform a “slam dunk” with a toy basketball into a mini hoop when asked – despite never having seen anything related to basketball in its robot training data. The robot’s foundation model knew what a basketball and hoop are and what “slam dunk” means conceptually, and it could connect those concepts to actual motions (picking up the ball and dropping it through the hoop) to satisfy the command. This kind of cross-domain generalization – applying knowledge from one context to a new, embodied scenario – is a hallmark of Gemini Robotics. In quantitative terms, Google DeepMind noted that Gemini Robotics more than doubled performance on a comprehensive generalization benchmark compared to prior state-of-the-art VLA models. In other words, when tested on tasks, objects, or instructions outside its training distribution, it succeeded more than twice as often as earlier models. This suggests a significant improvement in robustness to novel situations.
  • Interactivity and Adaptability: Gemini Robotics is designed to work collaboratively with humans and respond dynamically to changes in the environment. Thanks to its language proficiency, the robot can understand nuanced instructions and even follow along in a back-and-forth dialogue if needed. The model can take instructions given in different ways. Crucially, it also demonstrates real-time adaptability: it “continuously monitors its surroundings” and can adjust its plan if something changes mid-task. An example shown by researchers involved a robot told to place a bunch of grapes into a specific container. Midway through, a person shuffled the containers around on the table. Gemini’s controller tracked the target container as it moved and still completed the task, successfully following the correct container in a game of three-card monte on the fly. This shows a level of situational awareness and reactivity that previous static plans would fail at. If an object slips from the robot’s grasp or is moved by someone, the model can quickly replan its actions and continue appropriately. In practical terms, this makes the robot much more resilient in unstructured, dynamic settings like homes or workplaces, where surprises happen regularly. Users can also update instructions on the fly – for instance, interrupt a task with a new command – and Gemini will smoothly transition, a quality of “steerability” that is important for human-robot collaboration.
  • Dexterity and Physical Skills: A standout feature of Gemini Robotics is its level of fine motor control. Many everyday tasks humans do – buttoning a shirt, folding paper, handling fragile objects – are notoriously hard for robots because they require precise force and coordination. Gemini Robotics has demonstrated advanced dexterity by tackling multi-step manipulation tasks that were previously considered out of reach for robots without extensive task-specific programming. In tests, its dual-armed system could fold an origami crane, pack a lunchbox with assorted items, carefully pick a single card from a deck without bending others, and even seal snacks in a Ziploc bag. These examples illustrate delicate handling and coordination of two arms/hands – akin to human bimanual tasks. Observers noted that this capability appears to “solve one of robotics’ biggest challenges: getting robots to turn their ‘knowledge’ into careful, precise movements in the real world”. It’s important to note, however, that some of the most intricate feats were achieved in the context of tasks the model was trained on with high-quality data, meaning the robot had practice or human-provided demonstrations for those specific skills. As IEEE Spectrum reported, the impressive origami performance doesn’t yet mean the robot will generalize all dexterous skills – rather, it shows the upper bound of what the system can do when given focused training on a fine motor task. Still, compared to prior systems that could barely grasp a single object reliably, Gemini’s leap to complex manipulation is a significant advancement.
  • Multi-Embodiment and Adaptability to Hardware: Another technical strength of Gemini Robotics is its ability to work across different robot form factors. Rather than being tied to one specific robot model or configuration, the Gemini AI was trained in a way that makes it adaptable to “multiple embodiments”. DeepMind demonstrated that a single Gemini Robotics model could control both a fixed bi-arm platform and a humanoid robot with very different kinematics. The team trained primarily on their ALOHA 2 twin-arm system, but also showed the model could be specialized to operate the Apptronik “Apollo” humanoid – which has arms, hands, and a torso – to perform similar tasks in a human-like form. This hints at a future where one core AI could power many kinds of robots, from warehouse picker arms to home assistant humanoids, with minimal additional training for each. Google has in fact partnered with Apptronik and other robotics companies as trusted testers to apply Gemini in different settings. The generalist nature of the model means it encodes abstract skills that can transfer to new bodies. Of course, some calibration is needed for each hardware – the system might need to learn the dynamics and constraints of a new robot – but it doesn’t need to learn the task from scratch. This is a big efficiency gain: historically, each new robot or use-case required building an AI model almost from zero or collecting a trove of demonstrations, whereas Gemini provides a strong starting point.
  • Embodied Reasoning (Gemini Robotics-ER): Alongside the main Gemini Robotics model, Google DeepMind introduced Gemini Robotics-ER (with “ER” standing for Embodied Reasoning). This variant emphasizes enhanced spatial understanding and reasoning about the physical world, and it is intended as a tool for roboticists to integrate with their own control systems. Gemini Robotics-ER can be thought of as a vision-language model with an intuitive grasp of physics and geometry – it can look at a scene and infer things like object locations, relationships, or potential affordances. For instance, if shown a coffee mug, the model can identify an appropriate grasping point (the handle) and even plan a safe approach trajectory for picking it up. It marries the perception and language skills of Gemini with a sort of built-in physics engine and even coding abilities: the model can output not only actions but also generated code or pseudo-code to control a robot, which developers can hook into low-level controllers. In tests, Gemini-ER could perform an end-to-end robotics pipeline (perception → state estimation → spatial reasoning → planning → control commands) out-of-the-box, achieving 2-3× higher success rates on certain tasks compared to using Gemini 2.0 alone. A striking feature is its use of in-context learning for physical tasks – if the initial attempt isn’t sufficient, the model can take a few human-provided demonstrations as examples and then improve or adapt its solution accordingly. Essentially, Robotics-ER acts as an intelligent perception and planning module that can be plugged into existing robots, complementing the primary Gemini Robotics policy which directly outputs actions. By separating out this embodied reasoning component, DeepMind allows developers to leverage Gemini’s spatial intelligence even if they prefer to use their own motion controllers or safety constraints at the execution layer. This two-model approach (Gemini Robotics and Robotics-ER) gives flexibility in deployment – one can use the full end-to-end model for maximum autonomy, or use the ER model alongside manual programming to guide a robot with high-level smarts.
  • Safety Mechanisms: Given the high level of autonomy Gemini Robotics enables, safety has been a paramount concern in its functionality. The team implemented a layered approach to safety, combining traditional low-level safety (e.g. collision avoidance, not exerting excessive force) with higher-level “semantic safety” checks. The model evaluates the content of instructions and the consequences of actions before executing them, especially in the Robotics-ER version which was explicitly trained to judge whether an action is safe in a given scenario. For example, if asked to mix chemicals or put an object where it doesn’t belong, the AI should recognize potential hazards. DeepMind’s head of robotic safety, Vikas Sindhwani, explained that the Gemini models are trained to understand common-sense rules about the physical world – for instance, knowing that placing a plush toy on a hot stove or combining certain household chemicals would be unsafe. To benchmark this, the team introduced a new “Asimov” safety dataset and benchmark (named after Isaac Asimov, who famously penned robot ethics rules) to test AI models on their grasp of such common-sense constraints. The Gemini models reportedly performed well, answering over 80% of the safety questions correctly. This indicates the model usually recognizes overtly dangerous instructions or outcomes. Additionally, Google DeepMind has implemented a set of governance rules (“robot constitution”) for the robot’s behavior, inspired by Asimov’s Three Laws of Robotics, which were expanded into the ASIMOV dataset to train and evaluate the robot’s adherence to ethical guidelines. For instance, rules might include not harming humans, not damaging property, and prioritizing user intent while obeying safety constraints. These guardrails, combined with ongoing human oversight (especially during testing phases), aim to ensure the powerful capabilities of Gemini Robotics are used responsibly and do not lead to unintended harm.

Is Gemini Robotics the Future of Robotics?

So are we done? Time for our robot overlords? The introduction of Gemini Robotics has sparked debate about whether large VLA models like this represent the future of robotics – a future where robots are generalists rather than specialists, and where intelligence is largely driven by massive pretrained AI models. Google DeepMind certainly positions Gemini as a game-changer, stating that it “lays the foundation for a new generation of helpful robots” and touting its success on previously impossible tasks. Many experts see this as a natural evolution: as AI systems have achieved human-level (and beyond) performance in language and vision tasks, it was inevitable that those capabilities would be extended to physical machines. Generative AI models are getting closer to taking action in the real world, notes IEEE Spectrum, and the big AI companies are now introducing AI agents that move from just chatting or analyzing images to actually manipulating objects and environments. In this sense, Gemini Robotics does appear to be a glimpse of robotics’ likely trajectory – one where embodied AI is informed by vast prior knowledge and can be instructed as easily as talking to a smart speaker. The ability to simply tell a robot what to do in natural language (e.g. “please tidy up the kitchen” or “assemble this Ikea chair”) and have it figure out the steps is a long-standing dream in robotics, and Gemini’s demos suggest we are closer than ever.

Proponents argue that such generalist models are necessary to break robotics out of its current limits. Up to now, most robots excel only in narrow domains (an industrial arm can repeat one assembly task all day, a Roomba can vacuum floors but do nothing else, etc.). To be broadly useful in human environments, robots must deal with endless variability – new objects, unforeseen situations, and evolving human instructions. Hard-coding every contingency or training a custom model for every task is impractical. Therefore, an AI that can leverage world knowledge and reason on the fly is needed. Google’s team emphasizes that all three qualities – generalization, adaptability, dexterity – are “necessary… to create a new generation of helpful robots” that can actually operate in our homes and workplaces. In other words, without an AI like Gemini, robots would remain too rigid or labor-intensive to program for the messy real world. By giving robots a “common sense” understanding of the world and the ability to learn tasks from description, Gemini Robotics indeed could be a cornerstone for future robotics platforms.

That said, whether Gemini’s approach is the future or just one branch is still a matter of discussion. Some robotics researchers caution against over-relying on massive pretrained models. One limitation observed even in Gemini is that its human-like biases may not always yield optimal physical decisions. For example, Gemini-ER identified the handle of a coffee mug as the best grasp point – because in human data, people hold mugs by handles – but for a robot hand, especially one immune to heat, grabbing the sturdy body of the mug might actually be more stable than the thin handle. This shows that the AI’s “common sense” is derived from human norms, which isn’t always ideal for a machine’s capabilities. Overcoming such discrepancies may require integrating pure learning from trial and error or additional physical reasoning beyond internet-based knowledge. Additionally, the most dexterous feats Gemini achieved were ones it specifically trained for with high-quality demonstrations. This indicates that while the model is general, it still benefits from practice on complex tasks – true one-shot generalization to any arbitrary manipulation is not fully solved. In essence, Gemini Robotics might be necessary but not sufficient for the future of robotics: it provides the intelligence and adaptability, but it must be paired with continued advances in robot hardware and in task-specific learning.

Another consideration is the computational complexity. Gemini 2.0 (the base model) is extremely large, and running such models in real-time on a robot can be challenging. Currently, much processing might happen on cloud servers or off-board computers. The future of robotics may depend on making these models more efficient (something Google is also addressing with smaller variants like “Gemma” models for on-device AI). The necessity of a model like Gemini might also depend on the application: for a general home assistant or a multipurpose factory robot, a broad intelligence is likely crucial. But for simpler, well-defined tasks (say, a robot that just sorts packages), a smaller specialized AI might suffice more economically. Therefore, we might see a hybrid future – in high-complexity domains, Gemini-like brains lead the way, whereas simpler robots use more lightweight solutions. Nonetheless, the techniques and learnings from Gemini (like treating actions as language tokens, and using language models for planning) are influencing the entire field. Competing approaches will likely incorporate similar ideas of multi-modal learning and large-scale pre-training because of the clear gains in capability that Gemini and its predecessors have demonstrated. In summary, Gemini Robotics does appear to be a harbinger of robotics’ future, providing a path to move beyond single-purpose machines. Its advent underscores a paradigm shift where generalist AI systems control robots – a notion that until recently belonged to science fiction – and it illustrates both the promise and the remaining challenges on the road to truly adaptive, helpful robots.

Comparisons to Alternative Approaches in Robotic AI

Gemini Robotics’s approach can be contrasted with several other strategies in robotic AI, each with its own philosophy and trade-offs. Below, we compare it to some notable alternatives:

  • Modular Robotics (Traditional Pipeline): The classic approach to robot intelligence breaks the problem into separate modules: perception (computer vision to detect objects, SLAM for mapping), decision-making (often rule-based or using a planner), and low-level control (path planning and motor control). In such systems, there isn’t a single monolithic AI “brain” – instead, engineers program each component and often use specific AI models for each (like a convolutional network for object recognition, a motion planner for navigation, etc.). The advantage is reliability and predictability; each module can be tuned and verified. However, the drawback is rigidity – the robot can only do what its modules were explicitly designed for. Adding new tasks requires significant re-engineering. In contrast, Gemini Robotics is more of an end-to-end learned system. It attempts to handle perception, reasoning, and action generation in one model (or at least with tightly integrated learned components). This holistic approach allows for more flexibility – e.g. interpreting an unfamiliar instruction and devising a behavior – which a modular system might not handle if not pre-programmed. Of course, Gemini still relies on some low-level controllers (for fine motion, safety constraints), but the boundary between modules is blurred compared to the traditional pipeline. Essentially, Gemini trades some transparency for vastly greater adaptability. As robots move into unstructured environments, this trade is often seen as worthwhile, because no team of engineers can anticipate every scenario the way a large trained model can generalize.
  • Reinforcement Learning and Imitation Learning: Another approach is to teach robots via trial and error (RL) or by imitating human demonstrations, training a policy neural network for specific tasks. OpenAI famously used reinforcement learning plus human teleoperation demos to train a robot hand to solve a Rubik’s Cube, and Boston Dynamics uses a mix of planning and learned policies for locomotion. These systems can achieve remarkable feats, but they are typically narrow in scope – the policy is an expert at one task or a set of related tasks. Gato, a 2022 DeepMind model, was an attempt at a multi-task agent (it learned from dozens of different tasks, from playing Atari games to controlling a robot arm), but even Gato had a fixed set of skills defined by its training data. Gemini Robotics differs by leveraging pretrained knowledge: rather than learning from scratch via millions of trials, it starts with a rich base of semantic and visual understanding from web data. This gives it a form of “common sense” and language ability out-of-the-box that pure RL policies lack. In essence, Gemini moves the needle from “learning by doing” to “learning by reading/seeing” (plus some doing). That means a Gemini-based robot might execute a new task correctly on the first try because it can reason it out, whereas an RL-based robot would likely flail until it accumulates enough reward feedback. The flip side is that RL can sometimes discover unconventional but effective strategies (since it isn’t constrained by human priors), and it can optimize for physical realities (dynamics, wear and tear) if given direct experience. A future approach might combine both: use models like Gemini for high-level reasoning and understanding, but allow some reinforcement fine-tuning on the robot itself to refine the motions for efficiency or reliability. In fact, fine-tuning Gemini on robotic data (like how RT-2 was fine-tuned on real robot experiences) is exactly what Google did, and further RL training could be a next step beyond the initial demo capabilities.
  • Other Foundation Model Approaches: Google is not alone in applying foundation models to robotics. Competitors and research labs are pursuing similar visions. For example, OpenAI has been exploring ways to connect its GPT-4 model to robotics (though no official product has been announced, internal experiments and partnerships have been noted). The startup Figure AI, which is building humanoid robots, initially partnered with OpenAI to leverage GPT-style intelligence, but as Reuters reported, Figure decided to develop its own model after making an internal breakthrough. This suggests that a number of players are working on language-model-driven robotics, each with their own twist. Another example is Google’s PaLM-E (an embodied language model), which was a precursor integrating a vision encoder with a language model to output robot actions. Meta (Facebook) has also done research on combining vision, language, and action – though they’ve been quieter in announcing a unified model, they have components like visual question answering and world model simulators that could feed into robotics. There is also a distinction in whether the model is end-to-end or generates code for the robot. Some approaches have the AI output code (in Python or a robot API) which is then executed by a lower system – this can be safer in some cases, as the code can be reviewed or sandboxed. Gemini-ER’s ability to produce code for robot control falls into this category. Companies like Adept AI take a similar approach for software automation (outputting code from instructions); applied to robotics, one could imagine an AI that writes a script for the robot (e.g. using a library like ROS – Robot Operating System – to move joints) rather than directly controlling every torque. The drawback is that if the situation deviates, the code might not handle it unless the AI is looped in to rewrite it. Gemini’s direct policy is more fluid, but code-generation approaches offer interpretability.

In comparison to these alternatives, Gemini Robotics distinguishes itself by its scale of general knowledge and its demonstrated range of capabilities without task-specific training. A Tech Review article headline concisely states it “uses Google’s top language model to make robots more useful,” implying that plugging in this very large pre-trained brain yields a robot that can do far more than ones programmed with only task-specific data. Moreover, Gemini was shown to operate “with minimal training” for new tasks – many instructions it can execute correctly the first time, thanks to its prior knowledge. This sets it apart from typical RL or modular systems which often need extensive data or engineering for each new behavior. However, it’s worth noting that alternative approaches are converging. For instance, Robotics Transformer (RT-2), while smaller in scope, was already a VLA model and Gemini is essentially the next evolution. So rather than completely different paradigms, we are seeing a merging: classical robotics brings the safety and control expertise, while new AI brings flexibility. The future likely involves integrating these approaches – indeed Google’s strategy with Gemini-ER is to let developers use their own controllers alongside the AI, marrying learned reasoning with trusted control algorithms.

From an industry perspective, the availability of models like Gemini Robotics can level the playing field for robotics companies. Instead of needing a large AI team to build a robot’s brain from scratch, startups can leverage a frontier model (via API or licensing) and focus on hardware and specific applications. Google asserts that such models can help “cash-strapped startups reduce development costs and increase speed to market” by handling the heavy AI lifting. Similar services may be offered by cloud providers (e.g. Nvidia’s ISAAC system, etc.). In contrast, some companies might prefer to develop their own AI to avoid dependency – as seen by Figure’s move away from OpenAI. Competitors like Tesla are also working on general-purpose humanoid robots (Tesla’s Optimus) but with a philosophy leaning on vision-based neural networks distilled from human demonstrations (as per Tesla’s AI Day presentations) – an approach less language-oriented than Gemini’s, at least publicly. It remains to be seen which approach yields better real-world performance and safety.

In summary, Gemini Robotics exemplifies the large-scale, knowledge-rich, end-to-end learning approach to robotic AI. Alternative approaches exist on a spectrum from hand-engineered to learning-based, and many are gradually incorporating more AI-driven generality. The consensus in the field is that to achieve human-level versatility in robots, some form of massive, generalist AI is required – whether it’s exactly Gemini or something akin to it. As of now, Gemini Robotics stands at the cutting edge, but it will no doubt inspire both open-source and commercial efforts that use similar techniques to push robotics forward.

Gemini Robotics represents a bold leap forward in the quest to endow robots with general intelligence and adaptability. By fusing vision, language, and action, it allows machines to see the world, reason about it with human-like understanding, and act to carry out complex tasks – all guided by natural language. The origins of this model lie in years of incremental progress (from multi-modal models to robotic transformers) that have now converged into an embodied AI agent unlike any before. Technically, Gemini’s VLA model demonstrates unprecedented capabilities: it can generalize knowledge to new situations, interact fluidly with humans and dynamic environments, and perform fine motor skills once thought too difficult for robots without extensive programming. These achievements inch us closer to the long-envisioned general-purpose robot helper.

Yet this is not an endpoint but a starting point for the next era of robotics. Gemini Robotics does appear to chart the future: a future where intelligent robots can be taught with words and examples, not just hard-coded instructions, and where they can work alongside people in everyday environments. Its approach is likely to influence most robotic AI strategies going forward, whether through direct usage or through the adoption of similar large-model techniques by others. At the same time, the necessity of such powerful AI in robots also brings into focus the responsibilities that come with it. Ensuring safety, aligning actions with human values, and considering the societal impact are as integral to this future as the algorithms themselves.

In comparing Gemini to other approaches, we see that while alternative methods exist (from classical robotics to reinforcement learning), the field is coalescing around the idea that scale and generality in AI are key to cracking the hardest robotics challenges. Gemini’s current edge in performance (reportedly doubling previous systems on generalization benchmarks) validates this direction. But it also highlights that cooperation between human engineers and AI will remain crucial – whether for fine-tuning dexterous skills, implementing safety constraints, or adapting the model to new robot hardware. The “brain” may be general, but the implementation details and guarantees still require human ingenuity and caution.



Automate Amazon

I want my amazon data

Tracking and categorizing financial transactions can be tedious, especially when it comes to Amazon orders. With Amazon’s new data policies, it’s harder than ever to retrieve detailed purchase information in a usable format. Amazon is so broad that a credit card charge could be for a digital product, a physical product, or a Whole Foods purchase. The purchases themselves are complicated, with charges connected to shipments and gift card transactions involved. You can request all your data, but it comes in a terrible format: basically a big flattened table that doesn’t correlate order_id to invoiced transactions.

I have a well-oiled set of code that downloads all my transactions from every bank, credit card, etc., uses machine learning (naive Bayes) to categorize transactions, and uploads them to a Google Sheet where I can balance everything, check categories, and add additional details. My code then loads 25 years of transactions (every penny I’ve ever spent) into Postgres (both locally and cloud-based), which allows me to use R and Tableau to do a bunch of analysis. It’s a hobby to sit down and build a Google Slides deck filled with data on where our money is going and how our investments are doing.
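
For a flavor of the categorization step, here is a minimal naive Bayes sketch using scikit-learn; the training descriptions and categories below are placeholders, not my real data.

# Minimal sketch of a naive Bayes transaction categorizer (scikit-learn).
# The training rows and category labels are placeholders, not my real dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_descriptions = [
    "AMAZON MKTPL*2A3 Seattle WA",
    "WHOLEFDS ARL 10245 Arlington VA",
    "SHELL OIL 5744 Fairfax VA",
    "NETFLIX.COM Los Gatos CA",
]
train_categories = ["Shopping", "Groceries", "Auto", "Entertainment"]

# Character n-grams handle the messy, abbreviated merchant strings well.
model = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                      MultinomialNB())
model.fit(train_descriptions, train_categories)

print(model.predict(["AMAZON MKTPL*9Z1 Seattle WA"]))   # likely ['Shopping']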

Our orders are going up, and it’s time for automation.

This system has evolved over time and works really well. Here, I want to share how I match Amazon transactions to my bank charges so I can list the products I’m buying.

Step 1: Get your Amazon order data into a database

This is tricky — Google “Amazon Privacy Central” (the link has changed a couple of times in the last year or so) and you can request a ZIP file of your data. There are two parts to this: request your data, then wait for an email with a link to download it later. It’s surprising that it could take days for what is surely a fully automated process, but it generally takes hours.

Eventually, you get your Orders.zip which has:

├── Digital-Ordering.1
│   ├── Digital Items.csv
...
├── Retail.CustomerReturns.1.1
│   └── Retail.CustomerReturns.1.1.csv
├── Retail.OrderHistory.1
│   └── Retail.OrderHistory.1.csv
├── Retail.OrderHistory.2
│   └── Retail.OrderHistory.2.csv

The file we want is Retail.OrderHistory.1.csv. You can get that into a database with this code:
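
A minimal sketch of that loader, using pandas and SQLAlchemy against a local Postgres database named finance (the connection string and table name are placeholders; the CSV’s own column names are kept, just tidied):

# Sketch: load Amazon's Retail.OrderHistory.1.csv into Postgres.
# The database name, credentials, and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/finance")

df = pd.read_csv("Retail.OrderHistory.1/Retail.OrderHistory.1.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # tidy names

# Replace the table each run; switch to if_exists="append" for incremental loads.
df.to_sql("amazon_order_history", engine, if_exists="replace", index=False)
print(f"Loaded {len(df)} rows into amazon_order_history")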

Step 2: Get your Amazon invoice data into the database (via scraping)

That took a lot of work to get right, and the code above works well for about 80% of my transactions, but the rest required matching actual invoice amounts to an order_id. To make that work, you have to scrape your orders page, click on each order, and download the details. I’ve written a lot of code like that before, but it’s a pain to get right (Chrome’s developer tools are a game-changer for that). Fortunately, I found this code that does exactly that: https://github.com/dcwangmit01/amazon-invoice-downloader

The Amazon Invoice Downloader is a Python script that automates the process of downloading invoices for Amazon purchases using the Playwright library. It logs into an Amazon account with provided credentials, navigates to the “Returns & Orders” section, and retrieves invoices within a specified date range or year. The invoices are saved as PDF files in a local directory, with filenames formatted to include the order date, total amount, and order ID. The script mimics human behavior to avoid detection and skips downloading invoices that already exist.

You get all the invoices, but the most helpful output is the resulting CSV:

cat downloads/2024/transactions-2024.csv
Date,Amount,Card Type,Last 4 Digits,Order ID,Order Type
2024-01-03,7.57,Visa,1111,114-2554918-8507414,amazon
2024-01-03,5.60,Visa,1111,114-7295770-5362641,amazon

You can use this code to get this into the database:
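
Again, a minimal pandas/SQLAlchemy sketch (connection string and table name are placeholders; the column names come from the sample above):

# Sketch: load the downloader's transactions CSV into Postgres.
# Connection details and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/finance")

df = pd.read_csv("downloads/2024/transactions-2024.csv",
                 parse_dates=["Date"],
                 dtype={"Last 4 Digits": str})
df = df.rename(columns={
    "Date": "date", "Amount": "amount", "Card Type": "card_type",
    "Last 4 Digits": "last4", "Order ID": "order_id", "Order Type": "order_type",
})

df.to_sql("amazon_invoices", engine, if_exists="append", index=False)
print(f"Loaded {len(df)} invoice rows")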

And finally, a script compares the database to my Google Sheet and adds the matched order details to uncategorized transactions.
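
My Google Sheet layout is specific to me, so the sketch below shows just the matching logic: find the invoice whose amount equals an uncategorized bank charge within a few days of the charge date. Table and column names follow the loader sketches above; the Sheet I/O is omitted.

# Sketch of the matching step: pair an uncategorized bank charge with an Amazon
# invoice by amount within a +/- 3 day window. Google Sheet I/O is omitted.
from datetime import date, timedelta

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/finance")
invoices = pd.read_sql("SELECT date, amount, order_id FROM amazon_invoices", engine)
invoices["date"] = pd.to_datetime(invoices["date"])

def match_charge(charge_amount: float, charge_date: date, window_days: int = 3):
    """Return the order_id of the invoice matching a bank charge, or None."""
    lo = pd.Timestamp(charge_date - timedelta(days=window_days))
    hi = pd.Timestamp(charge_date + timedelta(days=window_days))
    hits = invoices[(invoices["amount"].round(2) == round(charge_amount, 2)) &
                    (invoices["date"].between(lo, hi))]
    return hits["order_id"].iloc[0] if len(hits) else None

# Example: the 2024-01-03 charge for $7.57 from the sample CSV above.
print(match_charge(7.57, date(2024, 1, 4)))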

This was a bit tricky, but it all works well now. Hopefully this saves you some time. Please reach out if you have any questions, and happy analyzing.


Gate Automation

Our gate is the most important motor in our home. It’s critical for security, and if it’s open, the dog escapes. With all our kids’ cars going in and out, the gate is constantly opening and closing. It matters.

The problem is that we have to open it, and we don’t always have our phones around. We use Alexa for home automation and iPhones.

We have the Nice Apollo 1500 with an Apollo 635/636 control board. This is a simple system with only three states: opening, closing, and paused. The gate toggles through these states by connecting ground (GND) to an input (INP) on the control board, which a logic analyzer would observe as a voltage drop from the operating level (e.g., 5V or 12V) to 0V.

To automate this, I purchased the Shelly Plus 1 UL, a Wi-Fi and Bluetooth-enabled smart relay switch. It includes dry contact inputs, perfect for systems requiring momentary contact for activation. It’s also compatible with major smart home platforms, including Amazon Alexa, Google Home, and SmartThings, allowing for voice control and complex automation routines. You can get most of the details here.

There are many ways to wire these switches. I’m setting this up for a resistive load with a 12 VDC stabilized power supply to ensure a reliable, controlled voltage drop each time the Shelly activates the gate. With a resistive load, the current flow is steady and predictable, which works perfectly with the gate’s control board input that’s looking for a simple drop to zero to trigger the gate actions. Inductive loads, on the other hand, generate back EMF (electromotive force) when switching, which can cause spikes or inconsistencies in voltage. By using a stabilized 12 VDC supply with a resistive load, I avoid these fluctuations, ensuring the gate responds cleanly and consistently without risk of interference or relay wear from inductive kickback. This setup gives me the precise control I need.

Shelly Plus 1 UL          Apollo 635/636 Control Board

[O] --------------------- [INP]
[I] --------------------- [GND]
[L] --------------------- [GND]
[12V] ------------------- [12V]

Settings


The Shelly Plus 1 UL is set up with I and L grounded, and the 12V input terminal provides power to the device. When the relay activates, it briefly connects O to ground, pulling down the voltage on the INP input of the Apollo control board from its usual 5V or 12V to 0V, which simulates a button press to open, close, or pause the gate.

To get this working right, you have to configure the input/output settings in the Shelly app after adding the device. Detached switch mode is key here, as it allows each button press to register independently without toggling the relay by default. Setting Relay Power On Default to Off keeps the relay open after any power cycle, avoiding unintended gate actions.

With Detached Switch Mode, Default Power Off and a 0.5-second Auto-Off timer, every “button press” from the app causes a voltage drop for 0.5 seconds. Adding the device to Alexa means I can now just say, “Alexa, activate gate,” which will act like a garage door button press.
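
Beyond Alexa, the same momentary press can be scripted over the local network. Here is a small sketch using the Shelly Gen2 HTTP RPC endpoint (the IP address is a placeholder); because of the 0.5-second Auto-Off timer configured above, a single “on” call behaves like a button press.

# Sketch: trigger the gate from Python via the Shelly Plus 1's local RPC API.
# The configured 0.5 s Auto-Off timer releases the relay, so one "on" call
# acts like a momentary button press. The IP address is a placeholder.
import requests

SHELLY_IP = "192.168.1.50"   # placeholder: the Shelly Plus 1 UL on my LAN

def press_gate_button() -> None:
    resp = requests.get(f"http://{SHELLY_IP}/rpc/Switch.Set",
                        params={"id": 0, "on": "true"}, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    press_gate_button()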


Quick(ish) Price Check on a Car

So, is it a good price?

With my oldest daughter heading off to college soon, we’ve realized that our family car doesn’t need to be as large as it used to be. We’ve had a great relationship with our local CarMax over the years, and we appreciate their no-haggle pricing model. My wife had her eyes set on a particular model: a 2019 Volvo XC90 T6 Momentum. The specific car she found was listed at $35,998, with 47,000 miles on the odometer.

But is the price good or bad? As a hacker/data scientist, I knew I could get the data to make an informed decision, and doing the analysis at home is a great way to learn and use new technologies. The bottom line is that the predicted price came out to $40,636, which puts the CarMax asking price about 11.4% below the prediction. If I compare to the specific trim, the predicted price is $38,666. So the price is probably fair. Now how did I come up with that number?

Calculations

Armed with Python and an array of web scraping tools, I embarked on a mission to collect data that would help me determine a fair value for our new car. I wrote a series of scripts to extract relevant information, such as price, age, and mileage, from various websites. This required a significant amount of Python work to convert the HTML data into a format that could be analyzed effectively.

Once I had amassed a good enough dataset (close to 200 cars), I began comparing different statistical techniques to find the most accurate pricing model. In this blog post, I’ll detail my journey through the world of linear regression and compare it to more modern data science methods, revealing which technique ultimately led us to the fairest car price.

First, I did some basic web searching. According to Edmunds, prices for a 2019 Volvo XC90 T6 Momentum with similar mileage range between $33,995 and $43,998, and my $35,998 falls within that range.

As for how the Momentum compares to other Volvo options and similar cars, there are a few things to consider. The Momentum is one of four trim levels available for the 2019 XC90. It comes with a number of standard features, including leather upholstery, a panoramic sunroof, and a 9-inch touchscreen infotainment system. Other trim levels offer additional features and options.

The 2019 Volvo XC90 comes in four trim levels: Momentum, R-Design, Inscription, and Excellence. The R-Design offers a sportier look and feel, while the Inscription adds more luxury features. The Excellence is the most luxurious and expensive option, with seating for four instead of seven. The Momentum is the most basic.

In terms of similar cars, some options to consider might include the Audi Q7 or the BMW X5. Both of these SUVs are similarly sized and priced to the XC90.

To get there, I had to do some web scraping and data cleaning, and build a basic linear regression model, as well as try other modern data science methods. To begin my data collection journey, I decided (in 2 seconds) to focus on three primary sources: Google’s search summary, Carvana, and Edmunds.

My first step was to search for the Volvo XC90 on each of these websites. I then used Chrome’s developer tools to inspect each webpage’s HTML structure and identify the <div> elements containing the desired data. By clicking through the pages, I was able to copy the relevant HTML and put it in a text file, enclosed within <html> and <body> tags. This format made it easier for me to work with the BeautifulSoup Python library, which allowed me to extract the data I needed and convert it into CSV files.

Since the data from each source varied, I had to run several regular expressions on many fields to further refine the information I collected. This process ensured that the data was clean and consistent, making it suitable for my upcoming analysis.

Finally, I combined all the data from the three sources into a single CSV file. This master dataset provided a solid foundation for my pricing analysis and allowed me to compare various data science techniques in order to determine the most accurate and fair price for the 2019 Volvo XC90 T6 Momentum.

In the following sections, I’ll delve deeper into the data analysis process and discuss the different statistical methods I employed to make our car-buying decision.

First, data from Carvana looked like this:

<div class="tk-pane full-width">
    <div class="inventory-type carvana-certified" data-qa="inventory-type">Carvana Certified
    </div>
    <div class="make-model" data-qa="make-model">
        <div class="year-make">2020 Volvo XC90</div>
    </div>
    <div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> • <span>36,614
            miles</span></div>
</div>
<div class="tk-pane middle-frame-pane">
    <div class="flex flex-col h-full justify-end" data-qa="pricing">
        <div data-qa="price" class="flex items-end font-bold mb-4 text-2xl">$44,990</div>
    </div>
</div>

In this code snippet, I used the BeautifulSoup library to extract relevant data from the saved HTML file, which contained information on Volvo XC90 listings. The script below searches for specific <div> elements containing the year, make, trim, mileage, and price details. It then cleans up the data by removing unnecessary whitespace and commas before storing it in a dictionary. Finally, the script compiles all the dictionaries into a list and exports the data to a CSV file for further analysis.

I could then repeat this process with Google to get a variety of local sources.

One challenge with the Google results was that I had a lot of data in the images (they were base64 encoded), so I wrote a bash script to clean up the tags using sed (pro tip: learn awk and sed).

When working with the Google search results, I had to take a slightly different approach compared to the strategies used for Carvana and Edmunds. Google results did not have a consistent HTML structure that could be easily parsed to extract the desired information. Instead, I focused on identifying patterns within the text format itself to retrieve the necessary details. By leveraging regular expressions, I was able to pinpoint and extract the specific pieces of information, such as the year, make, trim, mileage, and price, directly from the text. My scrape code is below.

Scraping Edmunds required both approaches: using the text formatting and the HTML structure.

Altogether, I got 174 records of used Volvo XC90s; I could easily get 10x this since the scripts exist and I could mine Craigslist and other places. With the data I have, I can use R to explore it:

# Load the readxl package
library(readxl)
library(scales)
library(scatterplot3d)

# Read the data from data.xlsx into a data frame
df <- read_excel("data.xlsx")

df$Price<-as.numeric(df$Price)/1000

# Select the columns you want to use
df <- df[, c("Title", "Desc", "Mileage", "Price", "Year", "Source")]

# Plot Year vs. Price with labeled axes and formatted y-axis
plot(df$Year, df$Price, xlab = "Year", ylab = "Price ($ '000)",
     yaxt = "n")  # Don't plot y-axis yet

# Add horizontal grid lines
grid()

# Format y-axis as currency
axis(side = 2, at = pretty(df$Price), labels = dollar(pretty(df$Price)))

abline(lm(Price ~ Year, data = df), col = "red")
Armed with this data, we can fit a linear regression model.

This code snippet employs the scatterplot3d() function to show a 3D scatter plot that displays the relationship between three variables in the dataset. Additionally, the lm() function is utilized to fit a linear regression model, which helps to identify trends and patterns within the data. To enhance the plot and provide a clearer representation of the fitted model, the plane3d() function is used to add a plane that represents the linear regression model within the 3D scatter plot.

model <- lm(Price ~ Year + Mileage, data = df)

# Plot the data and model
s3d <- scatterplot3d(df$Year, df$Mileage, df$Price,
                     xlab = "Year", ylab = "Mileage", zlab = "Price",
                     color = "blue")
s3d$plane3d(model, draw_polygon = TRUE)

So, we can now predict the price of a 2019 Volvo XC90 T6 Momentum with 47K miles, which comes out to $40,636; the CarMax asking price of $35,998 is about 11.4% below that prediction.

# Create a new data frame with the values for the independent variables
new_data <- data.frame(Year = 2019, Mileage = 45000)

# Use the model to predict the price of a 2019 car with 45000 miles
predicted_price <- predict(model, new_data)

# Print the predicted price
print(predicted_price)

Other Methods

OK, so now let’s use “data science”. Besides linear regression, there are several other techniques that I can use to take into account the multiple variables (year, mileage, price) in the dataset. Here are some popular techniques:

Decision Trees: A decision tree is a tree-like model that uses a flowchart-like structure to make decisions based on the input features. It is a popular method for both classification and regression problems, and it can handle both categorical and numerical data.

Random Forest: Random forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both regression and classification problems and can handle missing data and noisy data.

Support Vector Machines (SVM): SVM is a powerful machine learning algorithm that can be used for both classification and regression problems. It works by finding the best hyperplane that separates the data into different classes or groups based on the input features.

Neural Networks: Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are powerful models that can handle both numerical and categorical data and can be used for both regression and classification problems.

Gradient Boosting: Gradient boosting is a technique that combines multiple weak models to create a stronger one. It works by iteratively adding weak models to a strong model, with each model focusing on the errors made by the previous model.

All of these techniques can take multiple variables into account, and each has its strengths and weaknesses. The choice of which technique to use will depend on the specific nature of your problem and your data. It is often a good idea to try several techniques and compare their performance to see which one works best for your data.

I’m going to use random forest and a decision tree model.

Random Forest

# Load the randomForest package
library(randomForest)

# "Title", "Desc", "Mileage", "Price", "Year", "Source"

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Fit a random forest model
model <- randomForest(Price ~ Year + Mileage, data = train_data, ntree = 500)

# Predict the prices for the test data
predictions <- predict(model, test_data)

# Calculate the mean squared error of the predictions
mse <- mean((test_data$Price - predictions)^2)

# Print the mean squared error
cat("Mean Squared Error:", mse)

The output from the random forest model indicates a mean squared error (MSE) of 17.14768 and a variance explained of 88.61%. A lower MSE value indicates a better fit of the model to the data, while a higher variance explained value indicates that the model can explain a larger portion of the variation in the target variable.

Overall, an MSE of 17.14768 is reasonably low and suggests that the model has a good fit to the training data. A variance explained of 88.61% suggests that the model is able to explain a large portion of the variation in the target variable, which is also a good sign.

However, the random forest method shows a predicted cost of $37,276.54.

I also tried cross-validation techniques to get a better understanding of the model’s overall performance (MSE 33.890). Switching to a decision tree model pushed the MSE up to 50.91. Linear regression works just fine.

Adding the Trim

However, I was worried that I was comparing the Momentum to the higher trim options. So to get the trim, I tried the following prompt in GPT-4 to translate each description into one of the four trims.

don't tell me the steps, just do it and show me the results.
given this list add, a column (via csv) that categorizes each one into only five categories Momentum, R-Design, Inscription, Excellence, or Unknown

That worked perfectly and we can see that we have mostly Momentums.

Trim          Count   Percent
Excellence        0     0.00%
Inscription      68    39.53%
Momentum         87    50.58%
R-Design          8     4.65%
Unknown           9     5.23%

Frequency and Count of Cars

And this probably invalidates my analysis as Inscriptions (in blue) do have clearly higher prices:

Plot of Price By Year

We can see the average prices (in thousands). In 2019, Inscriptions cost less than Momentums? That is probably a small-n problem, since we only have 7 Inscriptions and 16 Momentums in our data set for 2019.

Year   R-Design   Inscription   Momentum
2014     $19.99            NA         NA
2016     $30.59        $32.59     $28.60
2017     $32.79        $32.97     $31.22
2018     $37.99        $40.69     $33.23
2019         NA        $36.79     $39.09
2020         NA        $47.94     $43.16

Average Prices by Trim (in thousand dollars)

So, if we restrict the data set to just Momentums, what would the predicted price of the 2019 Momentum be? Adding a filter and re-running the regression code above gives $38,666, which means we are still looking at a good, reasonable price.

Quick Excursion

One last thing I’m interested in: does mileage or age matter more? Let’s build a new model.

# Create Age variable
df$Age <- 2023 - df$Year

# Fit a linear regression model
model <- lm(Price ~ Mileage + Age, data = df)

# Print the coefficients
summary(model)$coef
              Estimate    Std. Error   t value     Pr(>|t|)
(Intercept)   61.34913    0.69084       88.80372   2.28e-144
Mileage       -0.00022    2.44e-05      -8.83869   1.18e-15
Age           -2.75459    0.27132      -10.15250   3.15e-19
Impact of Different Variables

Based on the regression results, we can see that both Age and Mileage have a significant effect on Price, as their p-values are very small (<0.05). Age also has a larger t-statistic in absolute value (10.15) than Mileage (8.84), suggesting that Age may have a slightly greater effect on Price than Mileage. The estimates show that for every one-year increase in Age, the Price decreases by approximately 2.75 thousand dollars, while for every one-mile increase in Mileage, the Price decreases by approximately 0.0002 thousand dollars (or 20 cents). That is actually pretty interesting.

This isn’t that far off. According to the US government, a car depreciates by an average of $0.17 per mile driven. This is based on a five-year ownership period, during which time a car is expected to be driven approximately 12,000 miles per year, for a total of 60,000 miles.

In terms of depreciation per year, it can vary depending on factors such as make and model of the car, age, and condition. However, a general rule of thumb is that a car can lose anywhere from 15% to 25% of its value in the first year, and then between 5% and 15% per year after that. So on average, a car might depreciate by about 10% per year.

Code

The code was originally inline in the blog post, but I have moved it all to the end.

Carvana Scrape Code

Cleaner Code

Google Scrape Code

Edmunds Scrape Code


Kids Lego table: Case study in Automation for Design


Motivation

I had to upgrade the Lego table I made when my kids were much smaller. It needed to be higher and include storage options. Since I’m short on time, I used several existing automation tools to both teach my daughter the power of programming and explore our decision space. The goals were to stay low-cost and make the table as functional as possible in the shortest time possible.

Lauren and I had fun drawing the new design in SketchUp. I then went to the Arlington TechShop and built the frame easily enough from a set of 2x4s. In order to be low-cost and quick, we decided to use the IKEA TROFAST storage bins. We were inspired by lots of designs online, such as this one:

lego-table-example

However, the table I designed was much bigger and built with simple right angles and a nice dado angle bracket to hold the legs on.

table_with_bracket

The hard part was figuring out the right arrangement of bins underneath the table. Since my background is in optimization, I thought about setting up a two-dimensional knapsack problem, but decided to do brute-force enumeration since the state space was really small. I built two scripts: one in Python to enumerate the state space and sort the results, and one in JavaScript, or ExtendScript, to automate Adobe Illustrator and give me a good way to visually consider the options. (ExtendScript just looks like an old, ES3-era version of JavaScript to me.)

So what are the options?

There are two TROFAST bins I found online. One costs \$3 and the other \$2. Sweet. You can see their dimensions below.

options

They both are the same height, so we just need to determine how to make the row work. We could arrange each TROFAST bin on the short or long dimension so we have 4 different options for the two bins:

         Small Side (cm)   Long Side (cm)
Orange   20                30
Green    30                42

First, Lauren made a set of scale drawings of the designs she liked, which allowed us to think about options. Her top-left drawing ended up being our final design.

lauren designs

I liked her designs, but it got me thinking: what would all the feasible designs look like? We decided to tackle that question since she is learning JavaScript.

Automation

If we ignore the depth and height, we have only three lengths to work with, $[20, 30, 42]$, plus a null option of length $0$. Projects like this always have me wondering how best to combine automation with intuition. I’m skeptical of technology and aware that it can be a distraction and inhibit intuition. It would have been fun to cut out the options at scale or just to make sketches, and we ended up doing those as well. But because I’m a recreational programmer, it was fairly straightforward to enumerate and explore the feasible options, and it was fun to show my daughter some programming concepts. With these lengths, the maximum number of bins that can fit across the $112.4\ \text{cm}$ width is

$$ \left\lfloor
\frac{112.4}{20}
\right\rfloor = 5 $$

So there are $4^5$, or $1{,}024$, total options in the Cartesian product. That is a tiny state space for a brute-force enumeration, and fortunately Python's itertools.product gives us all the possible options in one command:

itertools.product([0,20,30,42], repeat=5)

and we can restrict the results to feasible combinations, and even to solutions that don’t waste more than 15 cm (see the sketch below). To glue Python and Illustrator together, I use JSON to store the data, which I then open in Illustrator ExtendScript to print out the feasible results.
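Here is a stripped-down sketch of that enumeration step (the layouts.json file name is arbitrary, and collapsing layouts that differ only in order is an optional extra):

# Enumerate every way to fill five slots, keep the feasible ones, save as JSON.
import itertools
import json

WIDTHS = [0, 20, 30, 42]   # null option plus each bin orientation
MAX_LEN = 112.4            # interior width of the table in cm
MAX_WASTE = 15             # skip layouts that leave more than 15 cm unused

feasible = set()
for combo in itertools.product(WIDTHS, repeat=5):
    used = sum(combo)
    if used <= MAX_LEN and MAX_LEN - used <= MAX_WASTE:
        # sort so that re-orderings of the same bins collapse to one layout
        feasible.add(tuple(sorted(c for c in combo if c > 0)))

layouts = sorted(feasible, key=lambda c: MAX_LEN - sum(c))
with open("layouts.json", "w") as f:
    json.dump([list(c) for c in layouts], f, indent=2)
print(len(layouts), "feasible layouts")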

results

Later, I added some colors for clarity and picked the two options I liked:

options

These both minimized the style of bins, were symmetric and used the space well. I took these designs forward into the final design. Now to build it.

final_design

Real Math

But wait, brute-force enumeration? Sorry, yes. I didn’t have much time when we did this, but there are much better ways to do it. Here are two approaches:

Generating Functions

If your options are 20, 30, and 40, then what you do is compute the coefficients of the infinite series

$$(1 + x^{20} + x^{40} + x^{60} + …)(1 + x^{30} + x^{60} + x^{90} + …)(1 + x^{40} + x^{80} + x^{120} + …)$$

I always find it amazing that polynomials happen to have the right structure for the kind of enumeration we want to do: the powers of x keep track of our length requirement, and the coefficients count the number of ways to get a given length. When we multiply out the product above we get

$$1 + x^{20} + x^{30} + 2 x^{40} + x^{50} + 3 x^{60} + 2 x^{70} + 4 x^{80} + 3 x^{90} + 5 x^{100} + …$$

This polynomial lays out the answers we want “on a clothesline”. E.g., the last term tells us there are 5 configurations with length exactly 100. If we add up the coefficients above (or just plug in “x = 1”) we have 23 configurations with length less than 110.

If you also want to know what the configurations are, then you can put in labels: say $v$, $t$, and $f$ for twenty, thirty, and forty, respectively. A compact way to write $1 + x^{20} + x^{40} + x^{60} + \dots$ is $1/(1 - x^{20})$. The labelled version is $1/(1 - v x^{20})$. Okay, so now we compute

$$1/((1 - v x^{20})(1 - t x^{30})(1 - f x^{40}))$$

truncating after the $x^{100}$ term. In Mathematica the command to do this is

Normal@Series[1/((1 - v x^20) (1 - t x^30) (1 - f x^40)), {x, 0, 100}]

with the result

$$1 + v x^{20} + t x^{30} + (f + v^2) x^{40} + t v x^{50} + (t^2 + f v + v^3) x^{60} + (f t + t v^2) x^{70} + (f^2 + t^2 v + f v^2 + v^4) x^{80} + (t^3 + f t v + t v^3) x^{90} + (f t^2 + f^2 v + t^2 v^2 + f v^3 + v^5) x^{100}$$

Not pretty, but when we look at the coefficient of $x^{100}$, for example, we see that the 5 configurations are ftt, ffv, ttvv, fvvv, and vvvvv.
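If you don’t have Mathematica handy, the unlabelled counts are easy to reproduce with a short Python sketch that just multiplies the truncated series:

# Multiply the three truncated series and read off the coefficients.
MAX_POW = 100

def series(step, max_pow=MAX_POW):
    # Coefficients of 1 + x^step + x^(2*step) + ... truncated at max_pow
    coeffs = [0] * (max_pow + 1)
    for p in range(0, max_pow + 1, step):
        coeffs[p] = 1
    return coeffs

def multiply(a, b, max_pow=MAX_POW):
    out = [0] * (max_pow + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai and bj and i + j <= max_pow:
                out[i + j] += ai * bj
    return out

product = multiply(multiply(series(20), series(30)), series(40))
print(product[100])   # 5 configurations of length exactly 100
print(sum(product))   # 23 configurations overall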

Time to build it

Now it is time to figure out how to build this. I figured out I had to use $1/2$-inch plywood. Since I do woodworking in metric, the actual thickness works out to 0.472 in, or 1.19888 cm.

 $31.95 / each Sande Plywood (Common: 1/2 in. x 4 ft. x 8 ft.; Actual: 0.472 in. x 48 in. x 96 in.)


So the dimensions of interest are the side (outer wall) thickness $s$, the interior divider thickness $i$, and the shelf-slot width $k$. Each slot is $k = 20 - 0.5 \times 2\ \text{cm} = 19\ \text{cm}$ wide. All together, we know:

$$ w = 2\,s + 5\,k + 4\,i $$

where the board thickness is $t$, which must be less than both $s$ and $i$,

which gives us:

Variable   Width (cm)
s          1.20
i          3.75
k          19.00
w          112.40
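As a quick check, plugging these numbers into the width equation gives

$$ w = 2(1.20) + 5(19.00) + 4(3.75) = 2.4 + 95 + 15 = 112.4\ \text{cm}, $$

which matches the interior width used in the enumeration above.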

Code

The code I used is below:

References


Some tax-time automation

I often struggle to find the right balance between automation and manual work. It is tax time, and since Chase bank only gives you 90 days of statements, I find myself every year going back through our statements to find any business expenses and do our overall financial review for the year. In the past I’ve played around with MS Money, Quicken, and Mint, and kept my own spreadsheets. Now, I just download the statements at the end of the year, use Acrobat to combine them, and use Ruby to massage the combined PDF into a spreadsheet.1

To do my analysis I need everything in CSV format. After getting one combined PDF, I looked at the structure of the document, which looks like this:

Earn points [truncated] and 1% back per $1 spent on all other Visa Card purchases.

Date of Transaction Merchant Name or Transaction Description $ Amount
PAYMENTS AND OTHER CREDITS
01/23 -865.63
AUTOMATIC PAYMENT - THANK YOU

PURCHASES
12/27  AMAZON MKTPLACE PMTS AMZN.COM/BILL WA  15.98
12/29  NEW JERSEY E-ZPASS 888-288-6865 NJ  25.00
12/30  AMAZON MKTPLACE PMTS AMZN.COM/BILL WA  54.01

0000001 FIS33339 C 2 000 Y 9 26 15/01/26 Page 1 of 2

I realize that I want all lines that have a number like MM/DD followed by some spaces and a bunch of text, followed by a decimal number and some spaces. In regular expression syntax, that looks like:

/^(\d{2}\/\d{2})\s+(.*)\s+(\d+\.\d+)\s+$/

which is literally just a way of describing to the computer where my data are.

With that expression, a short script can pull all of my expenses out as CSV:
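My actual script is in Ruby; here is a minimal sketch of the same idea in Python, assuming the combined statement has already been dumped to plain text (for example, a statements.txt produced by pdftotext):

# Pull date / description / amount lines out of the statement text as CSV.
import csv
import re
import sys

LINE = re.compile(r"^(\d{2}/\d{2})\s+(.*?)\s+(\d+\.\d+)\s*$")

writer = csv.writer(sys.stdout)
writer.writerow(["date", "description", "amount"])
with open("statements.txt") as f:
    for line in f:
        m = LINE.match(line)
        if m:
            writer.writerow(m.groups())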

Boom. Hope this helps some of you who might otherwise be doing a lot of typing. Also, if you want to combine PDFs on the command line, you can use PDFtk thus:

pdftk file1.pdf file2.pdf cat output -

  1. The manual download takes about 10 minutes. When I get some time, I’m up for the challenge of automating this with my own screen scraper and web automation using some awesome combination of Ruby and Capybara. I also use PDFtk to combine PDF files.

Family Photo Management

This post is still a work in progress.

Our family pictures were scattered over several computers, each with a unique photo-management application. In an effort to get a good backup in place, I moved all my pictures to one computer, where I accidentally deleted them. (Long story.) I was able to recover them all, but I ended up with numerous duplicates and huge amounts of other image junk. To make matters more complicated, I accidentally moved all the files into one directory with super-long file names that represented their original paths. (Another long story.) Yes, I should have built backups. Lesson learned. In any case, while scripting super-powers can sometimes get you into trouble, the only way to get out of trouble is with some good scripts.

We have decided to use Lightroom on Windows as our photo-management application. Our Windows machines have a huge amount of storage that we can expand quickly with cheap hard drives. However, one problem I have to solve is eliminating a huge number of duplicate images at different sizes, along with getting rid of junk images.

Removing duplicates

I wrote some Matlab code to find duplicates and bin very dark images. It scans the directory for all images, reduces their size, and computes an image histogram, which it then wraps into 16 bins that are summed and normalized. I then compute a 2-D correlation coefficient for each possible pair:

$$
r = \frac{
\sum_m \sum_n \left(A_{mn} - \bar A \right) \left(B_{mn} - \bar B \right)
}{
\sqrt{
\left( \sum_m \sum_n \left(A_{mn} - \bar A \right)^2 \right)
\left( \sum_m \sum_n \left(B_{mn} - \bar B \right)^2 \right)
}
}
$$
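The Matlab code isn’t posted yet, but here is a rough sketch of the same approach in Python with Pillow and NumPy (the photos directory, the 64×64 resize, and the 0.99 threshold are placeholder choices to tune):

# Shrink each image, build a normalized 16-bin histogram, and correlate pairs.
from itertools import combinations
from pathlib import Path

import numpy as np
from PIL import Image

def signature(path, bins=16):
    img = Image.open(path).convert("L").resize((64, 64))
    hist, _ = np.histogram(np.asarray(img), bins=bins, range=(0, 256))
    return hist / hist.sum()

paths = sorted(Path("photos").glob("*.jpg"))
sigs = {p: signature(p) for p in paths}

for a, b in combinations(paths, 2):
    r = np.corrcoef(sigs[a], sigs[b])[0, 1]
    if r > 0.99:
        print("possible duplicate:", a.name, b.name, round(r, 3))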

The result is a set of comparisons such as this one.

comparison

And a histogram of the correlation coefficients shows a high degree of correlation in general.

histogram

My goal is to use this to keep the biggest photo and put the others in a directory of the same name. More to come, after I get some help.


Excel Sorting and Grouping

I had two tables downloaded from Amazon:

Items

Order Date  Order ID    Title   Category
1/26/14 102-4214073-2201835     Everyday Paleo Family Cookbook
1/13/14 115-8766132-0234619     Awesome Book A
1/13/14 115-8766132-0234619     Awesome Book B

and

Orders

Order Date  Order ID    Subtotal
1/6/14  102-6956821-1091413 $43.20
1/13/14 115-8766130-0234619 $19.42
1/16/14 109-8688911-2954602 $25.86

I’m building our Q1 2014 taxes and needed rows in the following format:

1/13/14 115-8766132-0234619 $22.43 Awesome Book A, Awesome Book B

In order to do this without using SQL, I did the following. If column B corresponds to the Order ID and column C to the item Title, then I put the following formula in cell N3:

=+IF(B3=B2, N2 & " | " & C3, C3)

and in cell O3 (in a column that might be named "last before change?"):

=+IF(B3=B4,"", TRUE)

Then I could easily sort out the unwanted rows. Done. Still, I would like to automate this better; one idea is sketched below. Any thoughts appreciated.
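One way to automate it would be a short Python/pandas script instead of the helper columns. A sketch, assuming the two Amazon reports are saved as items.csv and orders.csv with the headers shown above:

# Roll each order's titles up into one row and attach the order subtotal.
import pandas as pd

items = pd.read_csv("items.csv")    # Order Date, Order ID, Title, Category
orders = pd.read_csv("orders.csv")  # Order Date, Order ID, Subtotal

titles = (
    items.groupby("Order ID")["Title"]
    .apply(", ".join)
    .reset_index(name="Titles")
)
summary = orders.merge(titles, on="Order ID", how="left")
summary.to_csv("orders_with_titles.csv", index=False)
print(summary.head())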
